How Much Faster is Hugging Face's Fast Tokenizer?

Bhadresh Savani
2 min read · Feb 17, 2021

Hugging Face is the most popular open-source library in NLP. It lets you build an end-to-end NLP application covering text processing, model training, and evaluation, and it also provides support functions for easy conversion so a model can be hosted with different serving techniques such as TFServing, TorchServe, TRTServing, and ONNX conversion.

With the 4.0 release of the Transformers library, Hugging Face made the Rust-based fast tokenizer the default. In this blog post they claim that, with smart caching, it can make the tokenization process up to 10x faster than the old Python-based tokenizer.
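For reference, both tokenizers can be loaded through the same API; this is a minimal sketch (the distilbert-base-uncased checkpoint here is just an example, not necessarily the one from the blog post):

```python
from transformers import AutoTokenizer

# Since Transformers 4.0 the Rust-based "fast" tokenizer is the default
fast_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Explicitly request the old Python-based ("slow") tokenizer
slow_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=False)

print(type(fast_tokenizer).__name__)  # DistilBertTokenizerFast
print(type(slow_tokenizer).__name__)  # DistilBertTokenizer
```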

How much faster is the fast tokenizer when we use it instead of the Python-based tokenizer?

I took the contexts from the SQuAD2 training set (130,319 contexts) and tokenized them with both the fast tokenizer and the Python tokenizer.
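A minimal sketch of such a benchmark, assuming the datasets library and the distilbert-base-uncased checkpoint (the exact setup in the original Colab notebook may differ):

```python
import time

from datasets import load_dataset
from transformers import AutoTokenizer

# Contexts from the SQuAD2 training set
squad = load_dataset("squad_v2", split="train")
contexts = squad["context"]

def benchmark(tokenizer, texts):
    """Tokenize all texts once and report elapsed seconds and throughput."""
    start = time.time()
    tokenizer(texts, truncation=True)
    elapsed = time.time() - start
    return elapsed, len(texts) / elapsed

fast_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
slow_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=False)

for name, tok in [("fast", fast_tokenizer), ("slow", slow_tokenizer)]:
    seconds, throughput = benchmark(tok, contexts)
    print(f"{name}: {seconds:.1f} s, {throughput:.1f} examples/s")
```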

For the DistilBERT model:

Results: time (minutes and seconds) and throughput (examples/second)

It shows that, even without smart caching, the fast tokenizer is 4.33x faster.

I have updated my application to use the fast tokenizer, and it works well; the performance improvement is quite noticeable.

If you want to check which models support the fast tokenizer, check out Hugging Face's big table of supported models.

The entire code for this experiment is available in this GitHub repository as a Colab notebook.
