Thomas Wolf
thomwolf [at] gmail [dot] com

I lead the Science Team at Huggingface Inc., a Brooklyn-based startup working on Natural Language Generation and Natural Language Understanding.

I’ve been programming since I was 10, writing video games and interactive software in Assembly and C/C++, but my first career was actually in Physics rather than Computer Science.

After graduating from Ecole Polytechnique (Paris, France), I worked on laser-plasma interactions at the BELLA Center of the Lawrence Berkeley National Laboratory (Berkeley, CA). I was accepted for a PhD at MIT (Cambridge, MA) but ended up doing my PhD in Statistical/Quantum Physics at Sorbonne University and ESPCI (Paris, France), working on superconducting materials for the French DARPA (DGA) and Thales. After my PhD, I needed a change from the long time scales of experiments in physics and ended up changing direction entirely. I joined an IP law firm, Cabinet Plasseraud (Paris, France), earned a law degree from Pantheon Sorbonne University and worked as a European Patent Attorney for 5 years, helping a portfolio of startups and large companies build and defend their Intellectual Property assets.

In 2015, I was consulting for a number of Deep-Learning/AI/ML startups, which introduced me to the mathematics behind the new ML/AI revolution. I realised that most of these methods, equations and tools were just re-branded statistical-physics approaches, which fueled my interest in Machine Learning and Deep Learning. I started my online education in AI/ML, reading books and following online courses. About a year later, one of my friends asked me if I wanted to join his startup to build a science team, and there I was, doing science again and having a lot of fun!

Email  /  Medium  /  Twitter  /  Github  /  LinkedIn


I'm interested in Natural Language Processing, Deep Learning and Computational Linguistics. Much of my research focuses on Natural Language Generation, with Natural Language Understanding as a tool for better generation.

You can find some details in this interview I gave to Gengo.AI's Daniel Smith, in which I discussed the work we do at Huggingface, current trends in AI/NLP and my unusual background.

Invited Talks
  • On March 1st, 2019, I'll give a talk at the ILPS lab of the University of Amsterdam as part of the ILPS Monthly talks.
  • On January 30, 2019, I gave a talk at the Deep Learning Meetup in Paris [slides]
  • On January 22, 2019, I gave a talk at the NYU Center for Data Science on Transfer Learning Approaches to Natural Language Generation [see my UvA slides]
  • On January 18, 2019, I gave a talk at the University of Amsterdam as part of the SEA Meetups on a Transfer Learning Approach to Open-Domain Neural Network Dialog Agents [slides]
  • On January 11, 2019, I gave a talk at Utrecht University as part of the Data Science & Complexity Centre (DSCC) Central Topic Seminars on recent developments in Neural Network Based Dialogue Agents focusing on the use of Transfer Learning for dialog generation [slides]
  • In December 2018, I gave a talk during the NeurIPS 2018 Competition Track as part of the Winners' talks & spotlights, discussing our solution to the Conversational Intelligence Challenge 2 (ConvAI2) [slides] [paper]
  • In September 2018, I gave a talk at The first annual WeCNLP Summit 2018 on a novel architecture and training scheme for chit-chat dialog systems [slides]
  • In September 2018, I gave a talk at Paris NLP on Neural networks based dialog agents: going beyond the seq2seq model [slides]
  • In October 2017, I gave a talk at France is AI 2017 on NeuralCoref, a neural coreference system for conversational agents [slides]

I like to explain clearly what I have learned, and this has led to a few blog posts that seem to have been interesting to others as well (they totaled about a quarter of a million views by the end of 2018). I will try to keep writing posts like these when I find the time. I used to teach during my PhD and I miss it; blogging is my substitute.

💥 Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups

I spent most of 2018 training models that could barely fit 1-4 samples per GPU. But SGD usually needs more than a few samples per batch for decent results, so I wrote a post gathering the practical tips I use, from simple tricks to multi-GPU code and distributed setups.
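The core trick behind training with larger effective batches is gradient accumulation: sum gradients over several micro-batches, then apply a single optimizer step. The post's examples use PyTorch; the sketch below is a hypothetical pure-Python toy (scalar model, hand-computed gradients, made-up data) meant only to show that the accumulated update matches the full-batch update.

```python
# Toy illustration of gradient accumulation: instead of one update on a
# batch of 8 samples, accumulate gradients over 4 micro-batches of 2
# samples each and apply a single update. The scalar "model" w minimizes
# the mean squared distance to the samples.

def grad(w, batch):
    """Mean gradient of (w - x)^2 over a batch."""
    return sum(2 * (w - x) for x in batch) / len(batch)

def step_full_batch(w, data, lr=0.1):
    # reference: one SGD step on the full batch
    return w - lr * grad(w, data)

def step_accumulated(w, data, micro_size=2, lr=0.1):
    acc = 0.0
    n_micro = len(data) // micro_size
    for i in range(n_micro):
        micro = data[i * micro_size:(i + 1) * micro_size]
        # accumulate (don't apply) the micro-batch gradient, scaled so the
        # sum equals the full-batch mean gradient
        acc += grad(w, micro) / n_micro
    return w - lr * acc  # single optimizer step after accumulation

data = [0.5, 1.5, 2.0, 3.0, 0.0, 1.0, 2.5, 3.5]  # made-up samples
print(step_full_batch(10.0, data))
print(step_accumulated(10.0, data))  # identical to the full-batch update
```

In PyTorch the same idea amounts to calling `backward()` on several micro-batch losses (which sums gradients in `.grad`) before a single `optimizer.step()` and `zero_grad()`.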

⛵ Learning Meaning in Natural Language Processing — The Semantics Mega-Thread

A summary, overview and map of a huge discussion on learning meaning in NLP that happened on Twitter in August 2018, with more than 100 comments and great input from Matt Gardner, Yoav Goldberg, Sam Bowman, Emily M. Bender, Graham Neubig, Jeremy Howard, Tal Linzen, Jacob Andreas, Ryan D. Cotterell ...

🚀 100 Times Faster Natural Language Processing in Python

How you can make your Python NLP module 50-100 times faster by using spaCy's internals and a bit of Cython magic! Comes with a Jupyter notebook with examples processing over 80 million words per second.

📚The Current Best of Universal Word Embeddings and Sentence Embeddings

A post summarizing the developments in Universal Word/Sentence Embeddings that happened over 2017 and early 2018, as well as future trends. With ELMo, InferSent, Google's Universal Sentence Encoder, learning by multi-tasking... Written with Victor Sanh.

🐣 From zero to research — An introduction to Meta-learning

To introduce the work we presented at ICLR 2018, I drafted a visual and intuitive introduction to Meta-Learning. In this post, I start by explaining what meta-learning is in a very visual and intuitive way. Then, we code a meta-learning model in PyTorch and I share some of the lessons learned on this project.
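For a taste of what "learning to learn" means, here is a hypothetical toy sketch (not the post's PyTorch model): the meta-learner is reduced to a single learned learning rate, updated by gradient descent on the learner's post-update loss. All functions and constants below are made up for illustration.

```python
# Minimal learning-to-learn sketch: the "meta-learner" is just a learned
# scalar learning rate, updated by gradient descent on the learner's loss
# after each inner update.

def loss(w):      # learner's toy objective: (w - 3)^2
    return (w - 3.0) ** 2

def dloss(w):     # its gradient: 2 (w - 3)
    return 2.0 * (w - 3.0)

w, lr, meta_lr = 0.0, 0.01, 0.001
for step in range(200):
    g = dloss(w)
    w_next = w - lr * g                    # inner (learner) update
    # meta-gradient: d loss(w_next) / d lr = dloss(w_next) * (-g)
    meta_g = -g * dloss(w_next)
    lr = max(1e-4, lr - meta_lr * meta_g)  # outer (meta-learner) update
    w = w_next

print(w, lr)  # w approaches 3; lr has been adapted along the way
```

The real model in the post replaces the scalar learning rate with a neural network that consumes the learner's gradients and emits weight updates, but the nested optimization loop is the same.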

✨ How to train a neural coreference model — NeuralCoref 2

A post describing the internals of NeuralCoref. NeuralCoref is designed to strike a good balance between accuracy and speed/simplicity, using a rule-based mention detection module, a constrained number of features and a simple feed-forward neural network. The post describes how the coreference resolution system works and how to train it.

Understanding emotions — from Keras to PyTorch

A post accompanying our open-sourcing of torchMoji, a PyTorch adaptation of MIT's DeepMoji model. In this post, I detail several points that arose during the reimplementation of a Keras model in PyTorch: how to make a custom PyTorch LSTM with custom activation functions, how the PackedSequence object works and is built, how to convert an attention layer from Keras to PyTorch, how to load your data in PyTorch (Datasets and smart batching), and how to reproduce Keras weight initialization in PyTorch.

State-of-the-art neural coreference resolution for chatbots

A post accompanying our open-sourcing of NeuralCoref. It comprises an introduction to the field of coreference resolution and describes how a coreference resolution system works in practice.

Open-sourced Projects

I also like to open-source my code when I think it can be interesting to others, which has led to very interesting collaborations in the past. These projects total a few thousand GitHub stars, and I am always happy to see people open great PRs on them to share their developments and ideas.


Magic Sand

Magic Sand is software for operating an augmented reality sandbox like the Augmented Reality Sandbox developed by UC Davis. The project comprises a C++ codebase built on openFrameworks and a tutorial for building the hardware (see also the associated Reddit thread).

✨NeuralCoref: Coreference Resolution in spaCy with Neural Networks.

NeuralCoref is a pipeline extension for spaCy 2.0 that annotates and resolves coreference clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy's NLP pipeline and easily extensible to new training datasets.

NeuralCoref is written in Cython and Python and comes with pre-trained statistical models for English. It can also be trained on other languages.

😇 TorchMoji

TorchMoji is a PyTorch implementation of the DeepMoji model developed by Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan and Sune Lehmann. The model was trained on 1.2 billion tweets with emojis to learn how language is used to express emotions. Through transfer learning, the model can obtain state-of-the-art performance on many emotion-related text modeling tasks. See the paper for more details.

PyTorch implementation of OpenAI's Finetuned Transformer Language Model.

A PyTorch implementation of the TensorFlow code provided with OpenAI's paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. This implementation comprises a script to load the weights pre-trained by the authors with the TensorFlow implementation into the PyTorch model.

PyTorch Pretrained Bert.

This repository contains an op-for-op PyTorch reimplementation of Google's TensorFlow repository for the BERT model that was released together with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. The implementation comes with Google's pre-trained models, examples, notebooks and a command-line interface to load any pre-trained TensorFlow checkpoint for BERT.


Meta-Learning a Dynamical Language Model
Thomas Wolf, Julien Chaumond, Clement Delangue, 2018
Workshop track - ICLR 2018

We consider the task of word-level language modeling and study the possibility of combining hidden-states-based short-term representations with medium-term representations encoded in the dynamical weights of a language model. Our work extends recent experiments on language models with dynamically evolving weights by casting the language modeling problem into an online learning-to-learn framework in which a meta-learner is trained by gradient descent to continuously update the weights of a language model.

Continuous Learning in a Hierarchical Multiscale Neural Network
Thomas Wolf, Julien Chaumond, Clement Delangue, 2018
ACL 2018

We reformulate the problem of encoding a multi-scale representation of a sequence in a language model by casting it in a continuous learning framework. We propose a hierarchical multi-scale language model in which short time-scale dependencies are encoded in the hidden state of a lower-level recurrent neural network, while longer time-scale dependencies are encoded in the dynamics of the lower-level network by having a meta-learner update the weights of the lower-level network in an online meta-learning fashion. We use elastic weight consolidation to prevent catastrophic forgetting in our continuous learning framework.

A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
Victor Sanh, Thomas Wolf, Sebastian Ruder
AAAI 2019

Much effort has been devoted to evaluating whether multi-task learning can be leveraged to learn rich representations that can be used in various Natural Language Processing (NLP) downstream applications. However, there is still a lack of understanding of the settings in which multi-task learning has a significant effect. In this work, we introduce a hierarchical model trained in a multi-task learning setup on a set of carefully selected semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias, supervising a set of low-level tasks at the bottom layers of the model and more complex tasks at its top layers. The model achieves state-of-the-art results on a number of tasks, namely Named Entity Recognition, Entity Mention Detection and Relation Extraction, without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induces a set of shared semantic representations at the lower layers of the model. We show that as we move from the bottom to the top layers of the model, the hidden states of the layers tend to represent more complex semantic information.

TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents
Thomas Wolf, Victor Sanh, Julien Chaumond and Clement Delangue
NeurIPS CAI workshop 2018

We introduce a new approach to data-driven dialogue systems (e.g. chatbots) called TransferTransfo, which is a combination of a Transfer-learning-based training scheme and a high-capacity generative Transformer model. Fine-tuning is performed using a multi-task objective which combines several unsupervised prediction tasks. The resulting fine-tuned model shows strong improvements over the current state-of-the-art end-to-end conversational models like memory-augmented seq2seq and information-retrieval models. On the privately held PERSONA-CHAT dataset of the Conversational Intelligence Challenge 2, this approach obtains a new state-of-the-art, pushing the perplexity, Hits@1 and F1 metrics to 16.28 (45% absolute improvement), 80.7 (46% absolute improvement) and 19.5 (20% absolute improvement) respectively.

Created from Jonathan T. Barron's template