Thomas Wolf
thomaswolfcontact [at] gmail [dot] com

I lead the Science Team at Huggingface Inc., a Brooklyn-based startup working on Natural Language Generation and Natural Language Understanding.

I’ve been programming since I was 10, writing video games and interactive software in Assembly and C/C++, but my first career was actually in physics rather than computer science.

After graduating from Ecole Polytechnique (Paris, France), I worked on laser-plasma interactions at the BELLA Center of the Lawrence Berkeley National Laboratory (Berkeley, CA). I was accepted for a PhD at MIT (Cambridge, MA) but ended up doing my PhD in statistical/quantum physics at Sorbonne University and ESPCI (Paris, France), working on superconducting materials for the French DARPA (DGA) and Thales. After my PhD, I needed a change from the long time scales of experiments in physics and ended up changing direction entirely: I joined an IP law firm, Cabinet Plasseraud (Paris, France), earned a law degree from Pantheon Sorbonne University and worked as a European Patent Attorney for five years, helping a portfolio of startups and large companies build and defend their intellectual property assets.

In 2015, I was consulting for a number of deep-learning/AI/ML startups, which introduced me to the maths behind the new ML/AI revolution. I realised that most of these methods, equations and tools were just re-branded statistical-physics approaches, which fueled my interest in machine learning and deep learning. I started my online education in AI/ML, reading books and following online courses. About a year later, one of my friends asked me if I wanted to join his startup to build a science team, and there I was, doing science again and having a lot of fun!

Email  /  Medium  /  Twitter  /  Github  /  LinkedIn


I'm interested in Natural Language Processing, Deep Learning and Computational Linguistics. Much of my research is about Natural Language Generation (mostly) and Natural Language Understanding (as a tool for better generation).

You can find some details in this interview I gave to PyImageSearch or this earlier interview I gave to LionBridge.AI where I discuss the work we do at Huggingface, current trends in AI/NLP and my unusual background.

Invited Talks and News
  • On March 4, 2020, I gave a talk at INRIA ALMAnaCH in Paris, France [slides].
  • On February 4, 2020, I gave a (remote) talk at the Sydney NLP meetup in Sydney, Australia [slides].
  • On February 2-4, 2020, I co-taught the NLPL Winter School symposium with Yoav Goldberg in Skeikampen, Norway [slides 1st session], [slides 2nd session], [slides 3rd session].
  • On January 17, 2020, I gave a talk at Transformers at Work symposium in Amsterdam, Netherlands [slides].
  • On October 25, 2019, I gave a talk at France is AI in Paris, France [slides].
  • On September 19, 2019, I gave a talk in the AI Assistant Summit track at Re-Work Deep Learning in London, UK [slides].
  • On September 7, 2019, I gave a talk on Transfer learning in NLP at the Data Science fwdays'19 in Kyiv, Ukraine [slides].
  • On June 6, 2019, I co-organized a workshop on Methods for Optimizing and Evaluating Neural Language Generation (NeuralGen), together with Antoine Bosselut, Marjan Ghazvininejad, Srinivasan Iyer, Urvashi Khandelwal, Hannah Rashkin and Asli Celikyilmaz, co-located with NAACL 2019 [website].
  • On June 2nd, 2019, I gave a tutorial on Transfer Learning in Natural Language Processing, together with Sebastian Ruder, Swabha Swayamdipta and Matthew Peters at NAACL 2019 [slides].
  • On March 1st, 2019, I gave a talk at the ILPS lab of the University of Amsterdam on Hierarchical Multi-tasking for learning embeddings from semantic tasks as part of the ILPS Monthly talks [slides].
  • On January 30, 2019, I gave a talk at the Deep Learning Meetup in Paris [slides].
  • On January 22, 2019, I gave a talk at the NYU Center for Data Science on Transfer Learning Approaches to Natural Language Generation [see my UvA slides].
  • On January 18, 2019, I gave a talk at the University of Amsterdam as part of the SEA Meetups on a Transfer Learning Approach to Open-Domain Neural Network Dialog Agents [slides].
  • On January 11, 2019, I gave a talk at Utrecht University as part of the Data Science & Complexity Centre (DSCC) Central Topic Seminars on recent developments in Neural Network Based Dialogue Agents, focusing on the use of Transfer Learning for dialog generation [slides].
  • In December 2018, I gave a talk during the NeurIPS 2018 Competition Track as part of the Winners talks & spotlights, discussing our solution to the Conversational Intelligence Challenge 2 (ConvAI2) [slides] [paper].
  • In September 2018, I gave a talk at the first annual WeCNLP Summit 2018 on a novel architecture and training scheme for chit-chat dialog systems [slides].
  • In September 2018, I gave a talk at Paris NLP on Neural networks based dialog agents: going beyond the seq2seq model [slides].
  • In October 2017, I gave a talk at France is AI 2017 on NeuralCoref, a neural coreference system for conversational agents [slides].

I like to explain clearly what I have learned, and this has led to a few blog posts that seem to have been interesting to others as well (they totaled over a quarter of a million views by the end of 2018). I will try to keep writing posts like these when I find the time. I taught during my PhD and I do miss teaching; blogging is my substitute.


💥 Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups

I've spent most of 2018 training models that could barely fit 1-4 samples per GPU, but SGD usually needs more than a few samples per batch for decent results. I wrote a post gathering the practical tips I use, from simple tricks to multi-GPU code and distributed setups.
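The core trick behind that post, gradient accumulation, can be sketched in plain Python (a toy one-parameter least-squares model for illustration, not the post's actual PyTorch code): summing micro-batch gradients, each scaled by its share of the full batch, before taking a single update step reproduces the full-batch step exactly.

```python
# Gradient accumulation sketch: split a big batch into micro-batches
# that fit in memory, accumulate their (scaled) gradients, then apply
# one optimizer step. Toy model: y_hat = w * x with mean squared error.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x w.r.t. w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def step_full_batch(w, xs, ys, lr):
    """One SGD step using the whole batch at once."""
    return w - lr * grad_mse(w, xs, ys)

def step_accumulated(w, xs, ys, lr, micro_size):
    """Same update, but gradients are accumulated over micro-batches."""
    total = len(xs)
    acc = 0.0
    for i in range(0, total, micro_size):
        mx, my = xs[i:i + micro_size], ys[i:i + micro_size]
        # Scale each micro-batch gradient by its share of the full batch
        acc += grad_mse(w, mx, my) * (len(mx) / total)
    return w - lr * acc

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_full = step_full_batch(0.0, xs, ys, lr=0.1)
w_accum = step_accumulated(0.0, xs, ys, lr=0.1, micro_size=2)
assert abs(w_full - w_accum) < 1e-12  # identical updates
```

In PyTorch the same idea amounts to calling `backward()` on each micro-batch loss (which accumulates into `.grad`) and only calling `optimizer.step()` and `optimizer.zero_grad()` once per effective batch.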

⛵ Learning Meaning in Natural Language Processing — The Semantics Mega-Thread

A summary, overview and map of a huge discussion on learning meaning in NLP that happened on Twitter in August 2018, with more than 100 comments and great input from Matt Gardner, Yoav Goldberg, Sam Bowman, Emily M. Bender, Graham Neubig, Jeremy Howard, Tal Linzen, Jacob Andreas, Ryan D. Cotterell and others.

🚀 100 Times Faster Natural Language Processing in Python

How you can make your Python NLP module 50-100 times faster by using spaCy's internals and a bit of Cython magic! Comes with a Jupyter notebook with examples processing over 80 million words per second.

📚The Current Best of Universal Word Embeddings and Sentence Embeddings

A post summarizing recent developments in universal word/sentence embeddings that happened over 2017 and early 2018, and future trends. With ELMo, InferSent, Google's Universal Sentence Encoder, learning by multi-tasking... Written with Victor Sanh.

🐣 From zero to research — An introduction to Meta-learning

To introduce the work we presented at ICLR 2018, I drafted a visual and intuitive introduction to meta-learning. In this post, I start by explaining what meta-learning is, then we code a meta-learning model in PyTorch and I share some of the lessons learned on this project.

✨How to train a neural coreference model — NeuralCoref 2

A post describing the internals of NeuralCoref. Neuralcoref is designed to strike a good balance between accuracy and speed/simplicity, using a rule-based mention detection module, a constrained number of features and a simple feed-forward neural network. This post describes how the coreference resolution system works and how to train it.

Understanding emotions — from Keras to pyTorch

A post accompanying our open-sourcing of torchMoji, a PyTorch adaptation of MIT's DeepMoji model. In this post, I detail several points that arose during the reimplementation of a Keras model in PyTorch: how to build a custom PyTorch LSTM with custom activation functions, how the PackedSequence object works and is built, how to convert an attention layer from Keras to PyTorch, how to load your data in PyTorch with DataSets and smart batching, and how to reproduce Keras weight initialization in PyTorch.

State-of-the-art neural coreference resolution for chatbots

A post accompanying our open-sourcing of NeuralCoref. It comprises an introduction to the field of coreference resolution and describes how a coreference resolution system works in practice.

Open-sourced Projects

I also like to open-source my code when I think it can be interesting to others, and this has led to very interesting collaborations in the past. These projects have gathered a few thousand GitHub stars, and I am always happy to see people come with great PRs to share their developments and ideas.


Magic Sandbox

Magic Sand is software for operating an augmented-reality sandbox like the Augmented Reality Sandbox developed by UC Davis. This project comprises the C++ codebase, built on openFrameworks, and a tutorial for building the hardware (see also the associated reddit thread).

✨NeuralCoref: Coreference Resolution in spaCy with Neural Networks.

NeuralCoref is a pipeline extension for spaCy 2.0 that annotates and resolves coreference clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy's NLP pipeline and easily extensible to new training datasets.

NeuralCoref is written in Cython and Python and comes with pre-trained statistical models for English. It can be trained on other languages.

😇 TorchMoji

TorchMoji is a PyTorch implementation of the DeepMoji model developed by Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan and Sune Lehmann. The model was trained on 1.2 billion tweets with emojis to understand how language is used to express emotions. Through transfer learning, the model can obtain state-of-the-art performance on many emotion-related text modeling tasks. See the paper for more details.

PyTorch implementation of OpenAI's Finetuned Transformer Language Model.

A PyTorch implementation of the TensorFlow code provided with OpenAI's paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. This implementation includes a script to load the weights pre-trained by the authors with the TensorFlow implementation into the PyTorch model.

PyTorch Transformers.

PyTorch-Transformers (formerly known as pytorch-pretrained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).


Transfer Learning in Natural Language Processing
Sebastian Ruder, Matthew E Peters, Swabha Swayamdipta, Thomas Wolf (all authors contributed equally)
NAACL 2019 (Tutorial)

The classic supervised machine learning paradigm is based on learning in isolation, a single predictive model for a task using a single dataset. This approach requires a large number of training examples and performs best for well-defined and narrow tasks. Transfer learning refers to a set of methods that extend this approach by leveraging data from additional domains or tasks to train a model with better generalization properties. Over the last two years, the field of Natural Language Processing (NLP) has witnessed the emergence of several transfer learning methods and architectures which significantly improved upon the state-of-the-art on a wide range of NLP tasks. These improvements together with the wide availability and ease of integration of these methods are reminiscent of the factors that led to the success of pretrained word embeddings and ImageNet pretraining in computer vision, and indicate that these methods will likely become a common tool in the NLP landscape as well as an important research direction. We will present an overview of modern transfer learning methods in NLP, how models are pre-trained, what information the representations they learn capture, and review examples and case studies on how these models can be integrated and adapted in downstream NLP tasks.

Large-Scale Transfer Learning for Natural Language Generation
Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, Thomas Wolf (all authors contributed equally)
ACL 2019

Large-scale pretrained language models define state of the art in natural language processing, achieving outstanding performance on a variety of tasks. We study how these architectures can be applied and adapted for natural language generation, comparing a number of architectural and training schemes. We focus in particular on open-domain dialog as a typical high entropy generation task, presenting and comparing different architectures for adapting pretrained models with state of the art results.

Some additional experiments extending the tech report “Assessing BERT’s Syntactic Abilities” by Yoav Goldberg
Thomas Wolf
Tech report

This document reports a few additional experiments extending Yoav Goldberg's tech report "Assessing BERT's Syntactic Abilities": we evaluate the OpenAI Generative Pre-trained Transformer of Radford et al. [2018], a Transformer model with an architecture highly similar to BERT's (see discussion below) but pre-trained with a language modeling objective on the Toronto Book Corpus [Zhu et al., 2015] only, and we evaluate BERT when it is supplied with a prefix only.

TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents
Thomas Wolf, Victor Sanh, Julien Chaumond and Clement Delangue
NeurIPS CAI workshop 2018

We introduce a new approach to data-driven dialogue systems (e.g. chatbots) called TransferTransfo, which is a combination of a Transfer learning based training scheme and a high-capacity generative Transformer model. Fine-tuning is performed by using a multi-task objective which combines several unsupervised prediction tasks. The resulting fine-tuned model shows strong improvements over the current state-of-the-art end-to-end conversational models like memory-augmented seq2seq and information-retrieval models. On the privately held PERSONA-CHAT dataset of the Conversational Intelligence Challenge 2, this approach obtains a new state of the art, pushing the perplexity, Hits@1 and F1 metrics to 16.28 (45% absolute improvement), 80.7 (46% absolute improvement) and 19.5 (20% absolute improvement) respectively.

A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
Victor Sanh, Thomas Wolf, Sebastian Ruder
AAAI 2019

Much effort has been devoted to evaluating whether multi-task learning can be leveraged to learn rich representations that can be used in various Natural Language Processing (NLP) downstream applications. However, there is still a lack of understanding of the settings in which multi-task learning has a significant effect. In this work, we introduce a hierarchical model trained in a multi-task learning setup on a set of carefully selected semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low-level tasks at the bottom layers of the model and more complex tasks at the top layers of the model. This model achieves state-of-the-art results on a number of tasks, namely Named Entity Recognition, Entity Mention Detection and Relation Extraction, without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induces a set of shared semantic representations at lower layers of the model. We show that as we move from the bottom to the top layers of the model, the hidden states of the layers tend to represent more complex semantic information.

Meta-Learning a Dynamical Language Model
Thomas Wolf, Julien Chaumond, Clement Delangue, 2018
Workshop track - ICLR 2018

We consider the task of word-level language modeling and study the possibility of combining hidden-states-based short-term representations with medium-term representations encoded in the dynamical weights of a language model. Our work extends recent experiments on language models with dynamically evolving weights by casting the language modeling problem into an online learning-to-learn framework in which a meta-learner is trained by gradient descent to continuously update a language model's weights.

Continuous Learning in a Hierarchical Multiscale Neural Network
Thomas Wolf, Julien Chaumond, Clement Delangue, 2018
ACL 2018

We reformulate the problem of encoding a multi-scale representation of a sequence in a language model by casting it in a continuous learning framework. We propose a hierarchical multi-scale language model in which short time-scale dependencies are encoded in the hidden state of a lower-level recurrent neural network while longer time-scale dependencies are encoded in the dynamics of the lower-level network by having a meta-learner update the weights of the lower-level neural network in an online meta-learning fashion. We use elastic weights consolidation as a higher-level mechanism to prevent catastrophic forgetting in our continuous learning framework.

Created from Jonathan T. Barron's template