Thomas Wolf

Thomas Wolf is co-founder and Chief Science Officer (CSO) of Hugging Face, where he has been at the inception of the open-source, educational and moonshot efforts.

Thomas enjoys creating open-source software that makes complex research, models and datasets widely accessible (for instance by creating the Hugging Face Transformers and Datasets libraries). When he's not building OSS libraries, he can be found pushing for open science in AI/ML research, trying to narrow the gap between academia and industrial labs through projects like the BigScience Workshop on Large Language Models (LLM), which led to the BLOOM experiments, model and dataset. His current research interests are directed toward the future of AI and future moonshots. He also enjoys writing and producing educational content on AI, ML and NLP, including writing the reference book "Natural Language Processing with Transformers", published by O'Reilly with amazing co-authors, and "The Ultra-scale Playbook", which teaches large-scale AI training, as well as writing (not often enough) on his blog and recording (also not often enough) educational videos like The Future of Natural Language Processing.

I’ve been programming since forever, writing video games and software in Assembly and C/C++, but my first career was actually in Physics rather than Computer Science.

After graduating from Ecole Polytechnique (Paris, France), I worked on laser-plasma interactions at the BELLA Center of the Lawrence Berkeley National Laboratory (Berkeley, CA). I was accepted for a Ph.D. at MIT (Cambridge, MA) in the USA but ended up doing my Ph.D. in Statistical/Quantum physics at Sorbonne University and ESPCI (Paris, France), working on superconducting materials for the DGA (French DARPA) and Thales. After my Ph.D., I needed a change from the long time scale of experiments in physics and ended up totally changing direction. I joined an IP Law firm, Cabinet Plasseraud (Paris, France), got a law degree from Pantheon Sorbonne University and worked as a Patent Attorney for 5 years, helping a portfolio of startups and big companies build and defend their Intellectual Property assets.

In 2015, I was consulting for many Deep-Learning/AI/ML startups and they introduced me to the maths behind the new ML/AI revolution. I realised that most of these methods, equations and tools were just re-branded statistical physics approaches, which fueled my interest in Machine Learning and Deep Learning. I started my online education in AI/ML by reading books and following online courses. About a year later, one of my friends asked me if I wanted to start something crazy ambitious with Hugging Face, and there I was, doing science and coding again and having a lot of fun!

I like to explain what I have learned and this has led to a few blog posts that were quite interesting to others as well, I guess (they totalled over a quarter million views by the end of 2018). I will try to continue writing things like that when I find the time. I used to be a teacher during my Ph.D. and I do miss teaching. Blogging is my substitute. A couple of notable posts:

🎠 Low Tech AI Some trends resulting in a possible future of low-tech AI.

🎙️ Speech AI models: an introduction Speech interfaces are a strange beast. On the one hand, they are such a natural fit for human communication, and on the other hand, they have so little adoption in the real world. I decided to write an introduction blog post and a small repository to get you started with speech AI models.

🔭 The Einstein AI model I shared a controversial take the other day at an event and I decided to write it down in a longer format: I’m afraid AI won't give us a compressed 21st century.

🐳 Some notes on "DeepSeek and export control" Finally took time to go over Dario's essay on DeepSeek and export control and wrote some notes. I mostly disagree and I think it missed the point.

💥 Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups I've spent most of 2018 training models that could barely fit 1-4 samples/GPU. But SGD usually needs more than a few samples/batch for decent results. I wrote a post gathering practical tips I use, from simple tricks to multi-GPU code & distributed setups.

⛵ Learning Meaning in Natural Language Processing — The Semantics Mega-Thread A summary, overview and map of a huge discussion on learning meaning in NLP that happened on Twitter in August 2018 with more than 100 comments and great inputs from Matt Gardner, Yoav Goldberg, Sam Bowman, Emily M. Bender, Graham Neubig, Jeremy Howard, Tal Linzen, Jacob Andreas, Ryan D. Cotterell ...

🚀 100 Times Faster Natural Language Processing in Python How you can make your Python NLP module 50-100 times faster by using spaCy's internals and a bit of Cython magic! Comes with a Jupyter notebook with examples processing over 80 million words per sec.

...

My full publication list can be found on my Google Scholar page. A couple of notable ones are:

Datasets: A Community Library for Natural Language Processing [EMNLP 2021, Best Demonstration Paper] - ACL Anthology - Arxiv
Transformers: State-of-the-art natural language processing [EMNLP 2020, Best Demonstration Paper] - ACL Anthology - Arxiv
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter [EM^2 workshop NeurIPS 2019] - Arxiv
TransferTransfo: A transfer learning approach for neural network based conversational agents [CAI Workshop NeurIPS 2018] - Arxiv
...

Most of my open-source work can be found on the Hugging Face GitHub repository. A couple of notable libraries I created are:

Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX - repository - documentation
Datasets: the largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools - repository - documentation
Magic Sand: software for operating an augmented reality sandbox - repository - tutorial
...