Thomas Wolf
thomaswolfcontact [at] gmail [dot] com

I lead the Science Team at Huggingface Inc., a Brooklyn-based startup working on Natural Language Generation and Natural Language Understanding.

I’ve been programming since I was 10, writing video games and interactive software in Assembly and C/C++, but my first career was actually in physics rather than computer science.

After graduating from Ecole Polytechnique (Paris, France), I worked on laser-plasma interactions at the BELLA Center of the Lawrence Berkeley National Laboratory (Berkeley, CA). I was accepted for a PhD at MIT (Cambridge, MA) but ended up doing my PhD in statistical/quantum physics at Sorbonne University and ESPCI (Paris, France), working on superconducting materials for the French DARPA (DGA) and Thales. After my PhD, I needed a change from the long time scales of experiments in physics and ended up changing direction entirely: I joined an IP law firm, Cabinet Plasseraud (Paris, France), earned a law degree from Pantheon Sorbonne University and worked as a European Patent Attorney for five years, helping a portfolio of startups and large companies build and defend their intellectual property assets.

In 2015, I was consulting for a number of deep-learning/AI/ML startups, which introduced me to the maths behind the new ML/AI revolution. I realised that most of these methods, equations and tools were just re-branded statistical-physics approaches, which fueled my interest in machine learning and deep learning. I started my online education in AI/ML, reading books and following online courses. About a year later, one of my friends asked me if I wanted to join his startup to build a science team, and there I was, doing science again and having a lot of fun!

Email  /  Medium  /  Twitter  /  Github  /  LinkedIn


I'm interested in Natural Language Processing, Deep Learning and Computational Linguistics. Much of my research is about Natural Language Generation (mostly) and Natural Language Understanding (as a tool for better generation).

You can find some details in this interview I gave to PyImageSearch or this earlier interview I gave to LionBridge.AI where I discuss the work we do at Huggingface, current trends in AI/NLP and my unusual background.

Invited Talks and News
  • On March 4, 2020, I gave a talk at INRIA ALMAnaCH in Paris, France [slides].
  • On February 4, 2020, I gave a (remote) talk at the Sydney NLP meetup in Sydney, Australia [slides].
  • On February 2-4, 2020, I co-taught the NLPL Winter School symposium with Yoav Goldberg in Skeikampen, Norway [slides 1st session], [slides 2nd session], [slides 3rd session].
  • On January 17, 2020, I gave a talk at Transformers at Work symposium in Amsterdam, Netherlands [slides].
  • On October 25, 2019, I gave a talk at France is AI in Paris, France [slides].
  • On September 19, 2019, I gave a talk in the AI Assistant Summit track at Re-Work Deep Learning in London, UK [slides].
  • On September 7, 2019, I gave a talk on Transfer learning in NLP at the Data Science fwdays'19 in Kyiv, Ukraine [slides].
  • On June 6, 2019, I co-organized a workshop on Methods for Optimizing and Evaluating Neural Language Generation (NeuralGen), together with Antoine Bosselut, Marjan Ghazvininejad, Srinivasan Iyer, Urvashi Khandelwal, Hannah Rashkin and Asli Celikyilmaz, co-located with NAACL 2019 [website].
  • On June 2nd, 2019, I gave a tutorial on Transfer Learning in Natural Language Processing, together with Sebastian Ruder, Swabha Swayamdipta and Matthew Peters at NAACL 2019 [slides].
  • On March 1st, 2019, I gave a talk at the ILPS lab of the University of Amsterdam on Hierarchical Multi-tasking for learning embeddings from semantic tasks as part of the ILPS Monthly talks [slides].
  • On January 30, 2019, I gave a talk at the Deep Learning Meetup in Paris [slides].
  • On January 22, 2019, I gave a talk at the NYU Center for Data Science on Transfer Learning Approaches to Natural Language Generation [see my UvA slides].
  • On January 18, 2019, I gave a talk at the University of Amsterdam as part of the SEA Meetups on a Transfer Learning Approach to Open-Domain Neural Network Dialog Agents [slides].
  • On January 11, 2019, I gave a talk at Utrecht University as part of the Data Science & Complexity Centre (DSCC) Central Topic Seminars on recent developments in Neural Network Based Dialogue Agents, focusing on the use of Transfer Learning for dialog generation [slides].
  • In December 2018, I gave a talk during the NeurIPS 2018 Competition Track as part of the Winners talks & spotlights, discussing our solution to the Conversational Intelligence Challenge 2 (ConvAI2) [slides] [paper].
  • In September 2018, I gave a talk at the first annual WeCNLP Summit 2018 on a novel architecture and training scheme for chit-chat dialog systems [slides].
  • In September 2018, I gave a talk at Paris NLP on Neural networks based dialog agents: going beyond the seq2seq model [slides].
  • In October 2017, I gave a talk at France is AI 2017 on NeuralCoref, a neural coreference system for conversational agents [slides].

I like to explain clearly what I have learned, and this has led to a few blog posts that seem to have been interesting to others as well (they totaled over a quarter of a million views by the end of 2018). I will try to keep writing posts like these when I find the time. I taught during my PhD and I do miss teaching; blogging is my substitute.


💥 Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups

I've spent most of 2018 training models that could barely fit 1-4 samples per GPU, but SGD usually needs more than a few samples per batch for decent results. I wrote a post gathering the practical tips I use, from simple tricks to multi-GPU code and distributed setups.
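The core trick behind that post, gradient accumulation, can be sketched in plain Python (a toy one-parameter least-squares model for illustration, not the post's actual PyTorch code): summing micro-batch gradients, each scaled by its share of the full batch, before taking a single update step reproduces the full-batch step exactly.

```python
# Gradient accumulation sketch: split a big batch into micro-batches
# that fit in memory, accumulate their (scaled) gradients, then apply
# one optimizer step. Toy model: y_hat = w * x with mean squared error.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x w.r.t. w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def step_full_batch(w, xs, ys, lr):
    """One SGD step using the whole batch at once."""
    return w - lr * grad_mse(w, xs, ys)

def step_accumulated(w, xs, ys, lr, micro_size):
    """Same update, but gradients are accumulated over micro-batches."""
    total = len(xs)
    acc = 0.0
    for i in range(0, total, micro_size):
        mx, my = xs[i:i + micro_size], ys[i:i + micro_size]
        # Scale each micro-batch gradient by its share of the full batch
        acc += grad_mse(w, mx, my) * (len(mx) / total)
    return w - lr * acc

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_full = step_full_batch(0.0, xs, ys, lr=0.1)
w_accum = step_accumulated(0.0, xs, ys, lr=0.1, micro_size=2)
assert abs(w_full - w_accum) < 1e-12  # identical updates
```

In PyTorch the same idea amounts to calling `backward()` on each micro-batch loss (which accumulates into `.grad`) and only calling `optimizer.step()` and `optimizer.zero_grad()` once per effective batch.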

⛵ Learning Meaning in Natural Language Processing — The Semantics Mega-Thread

A summary, overview and map of a huge discussion on learning meaning in NLP that happened on Twitter in August 2018, with more than 100 comments and great input from Matt Gardner, Yoav Goldberg, Sam Bowman, Emily M. Bender, Graham Neubig, Jeremy Howard, Tal Linzen, Jacob Andreas, Ryan D. Cotterell and others.

🚀 100 Times Faster Natural Language Processing in Python

How you can make your Python NLP module 50-100 times faster by using spaCy's internals and a bit of Cython magic! Comes with a Jupyter notebook with examples processing over 80 million words per second.

📚The Current Best of Universal Word Embeddings and Sentence Embeddings

A post summarizing recent developments in universal word/sentence embeddings that happened over 2017 and early 2018, and future trends. With ELMo, InferSent, Google's Universal Sentence Encoder, learning by multi-tasking... Written with Victor Sanh.

🐣 From zero to research — An introduction to Meta-learning

To introduce the work we presented at ICLR 2018, I drafted a visual and intuitive introduction to meta-learning. In this post, I start by explaining what meta-learning is, then we code a meta-learning model in PyTorch and I share some of the lessons learned on this project.

✨How to train a neural coreference model — NeuralCoref 2

A post describing the internals of NeuralCoref. Neuralcoref is designed to strike a good balance between accuracy and speed/simplicity, using a rule-based mention detection module, a constrained number of features and a simple feed-forward neural network. This post describes how the coreference resolution system works and how to train it.

Understanding emotions — from Keras to pyTorch

A post accompanying our open-sourcing of torchMoji, a PyTorch adaptation of MIT's DeepMoji model. In this post, I detail several points that arose during the reimplementation of a Keras model in PyTorch: how to build a custom PyTorch LSTM with custom activation functions, how the PackedSequence object works and is built, how to convert an attention layer from Keras to PyTorch, how to load your data in PyTorch with DataSets and smart batching, and how to reproduce Keras weight initialization in PyTorch.

State-of-the-art neural coreference resolution for chatbots

A post accompanying our open-sourcing of NeuralCoref. It comprises an introduction to the field of coreference resolution and describes how a coreference resolution system works in practice.

Open-sourced Projects

I also like to open-source my code when I think it can be interesting to others, and this has led to very interesting collaborations in the past. These projects have gathered a few thousand GitHub stars, and I am always happy to see people come with great PRs to share their developments and ideas.


Magic Sandbox

Magic Sand is software for operating an augmented-reality sandbox like the Augmented Reality Sandbox developed by UC Davis. This project comprises the C++ codebase, built on openFrameworks, and a tutorial for building the hardware (see also the associated reddit thread).

✨NeuralCoref: Coreference Resolution in spaCy with Neural Networks.

NeuralCoref is a pipeline extension for spaCy 2.0 that annotates and resolves coreference clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy's NLP pipeline and easily extensible to new training datasets.

NeuralCoref is written in Cython and Python and comes with pre-trained statistical models for English. It can be trained on other languages.

😇 TorchMoji

TorchMoji is a PyTorch implementation of the DeepMoji model developed by Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan and Sune Lehmann. The model was trained on 1.2 billion tweets with emojis to understand how language is used to express emotions. Through transfer learning, the model can obtain state-of-the-art performance on many emotion-related text modeling tasks. See the paper for more details.

PyTorch implementation of OpenAI's Finetuned Transformer Language Model.

A PyTorch implementation of the TensorFlow code provided with OpenAI's paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. This implementation includes a script to load the weights pre-trained by the authors with the TensorFlow implementation into the PyTorch model.

PyTorch Transformers.

PyTorch-Transformers (formerly known as pytorch-pretrained-bert) is a library of state-of-the-art pre-trained models for Natural Language Processing (NLP).


Transfer Learning in Natural Language Processing
Sebastian Ruder, Matthew E Peters, Swabha Swayamdipta, Thomas Wolf (all authors contributed equally)
NAACL 2019 (Tutorial)

The classic supervised machine learning paradigm is based on learning in isolation, a single predictive model for a task using a single dataset. This approach requires a large number of training examples and performs best for well-defined and narrow tasks. Transfer learning refers to a set of methods that extend this approach by leveraging data from additional domains or tasks to train a model with better generalization properties. Over the last two years, the field of Natural Language Processing (NLP) has witnessed the emergence of several transfer learning methods and architectures which significantly improved upon the state-of-the-art on a wide range of NLP tasks. These improvements together with the wide availability and ease of integration of these methods are reminiscent of the factors that led to the success of pretrained word embeddings and ImageNet pretraining in computer vision, and indicate that these methods will likely become a common tool in the NLP landscape as well as an important research direction. We will present an overview of modern transfer learning methods in NLP, how models are pre-trained, what information the representations they learn capture, and review examples and case studies on how these models can be integrated and adapted in downstream NLP tasks.

Large-Scale Transfer Learning for Natural Language Generation
Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, Thomas Wolf (all authors contributed equally)
ACL 2019

Large-scale pretrained language models define state of the art in natural language processing, achieving outstanding performance on a variety of tasks. We study how these architectures can be applied and adapted for natural language generation, comparing a number of architectural and training schemes. We focus in particular on open-domain dialog as a typical high entropy generation task, presenting and comparing different architectures for adapting pretrained models with state of the art results.

Some additional experiments extending the tech report “Assessing BERT’s Syntactic Abilities” by Yoav Goldberg
Thomas Wolf
Tech report

This document reports a few additional experiments extending Yoav Goldberg's tech report "Assessing BERT's Syntactic Abilities": we evaluate the OpenAI Generative Pre-trained Transformer of Radford et al. [2018], a Transformer model with an architecture highly similar to BERT's (see discussion below) but pre-trained with a language modeling objective on the Toronto Book Corpus [Zhu et al., 2015] only, and we evaluate BERT when it is supplied with a prefix only.

TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents
Thomas Wolf, Victor Sanh, Julien Chaumond and Clement Delangue
NeurIPS CAI workshop 2018

We introduce a new approach to data-driven dialogue systems (e.g. chatbots) called TransferTransfo, which is a combination of a Transfer learning based training scheme and a high-capacity generative Transformer model. Fine-tuning is performed by using a multi-task objective which combines several unsupervised prediction tasks. The resulting fine-tuned model shows strong improvements over the current state-of-the-art end-to-end conversational models like memory-augmented seq2seq and information-retrieval models. On the privately held PERSONA-CHAT dataset of the Conversational Intelligence Challenge 2, this approach obtains a new state of the art, pushing the perplexity, Hits@1 and F1 metrics to 16.28 (45% absolute improvement), 80.7 (46% absolute improvement) and 19.5 (20% absolute improvement) respectively.

A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
Victor Sanh, Thomas Wolf, Sebastian Ruder
AAAI 2019

Much effort has been devoted to evaluating whether multi-task learning can be leveraged to learn rich representations that can be used in various Natural Language Processing (NLP) downstream applications. However, there is still a lack of understanding of the settings in which multi-task learning has a significant effect. In this work, we introduce a hierarchical model trained in a multi-task learning setup on a set of carefully selected semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low-level tasks at the bottom layers of the model and more complex tasks at the top layers of the model. This model achieves state-of-the-art results on a number of tasks, namely Named Entity Recognition, Entity Mention Detection and Relation Extraction, without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induces a set of shared semantic representations at lower layers of the model. We show that as we move from the bottom to the top layers of the model, the hidden states of the layers tend to represent more complex semantic information.

Meta-Learning a Dynamical Language Model
Thomas Wolf, Julien Chaumond, Clement Delangue, 2018
Workshop track - ICLR 2018

We consider the task of word-level language modeling and study the possibility of combining hidden-states-based short-term representations with medium-term representations encoded in the dynamical weights of a language model. Our work extends recent experiments on language models with dynamically evolving weights by casting the language modeling problem into an online learning-to-learn framework in which a meta-learner is trained by gradient descent to continuously update a language model's weights.

Continuous Learning in a Hierarchical Multiscale Neural Network
Thomas Wolf, Julien Chaumond, Clement Delangue, 2018
ACL 2018

We reformulate the problem of encoding a multi-scale representation of a sequence in a language model by casting it in a continuous learning framework. We propose a hierarchical multi-scale language model in which short time-scale dependencies are encoded in the hidden state of a lower-level recurrent neural network while longer time-scale dependencies are encoded in the dynamics of the lower-level network by having a meta-learner update the weights of the lower-level neural network in an online meta-learning fashion. We use elastic weights consolidation as a higher-level mechanism to prevent catastrophic forgetting in our continuous learning framework.

Created from Jonathan T. Barron's template