Tokenizer optimization #917
Replies: 3 comments
-
Hi there, my tip here would be not to hand-edit the vocab. A better way is to change the inputs, i.e., normalize the words in the BPE training data and then apply the same normalization to any text you tokenize later. You could define a custom set of rules, for example a "normalize_input" function that you call on the text before it goes into the tokenizer. Let me tag some readers who posted in this forum and have developed custom tokenizers in the past:
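A minimal sketch of what such a function could look like (the name `normalize_input` and the exact rules are just placeholders; the right rule set depends on your language and use case):

```python
import re

def normalize_input(text: str) -> str:
    """Hypothetical rule set applied to text before it goes into the tokenizer."""
    text = text.lower()                        # collapse case variants, e.g. TURKMENISTAN -> turkmenistan
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

# Call it on the BPE training data and again on any text you tokenize later:
print(normalize_input("  TURKMENISTAN   Türkmenistan "))  # -> "turkmenistan türkmenistan"
```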
-
@tllmmaster Hi, here is my opinion. As @rasbt said, don't edit the vocabulary manually. Instead, use normalization and pre-tokenization. This chapter (https://huggingface.co/learn/llm-course/en/chapter6/4) from the Hugging Face LLM course explains it well.

If you choose this approach, you need to create a function that applies a few transformations to the text, such as:

- Converting the text to lowercase.
- Removing extra spaces.
- Removing diacritics.
- And so on.

Then, use this function on the dataset you use to train the tokenizer, and later on the data you send to the LLM. The main issue with normalization and pre-tokenization is that your model cannot use characters that do not appear in the vocabulary. For example, in Arabic we use diacritics to show how words should be read. Here is a sentence with diacritics:

وُلِدْتُ فِي مَكْنِاسْ، فِي المَغْرِبْ.

If you remove the diacritics, it becomes:

ولدت في مكناس، في المغرب.

If you normalize Arabic text by removing diacritics, the model will not be able to use them. This might be fine if your use case doesn't require them. But if you need diacritics, you should not remove them during normalization. In my case, I removed them because we rarely use diacritics in everyday Arabic writing. We already know how to read the text without them, and they are also hard to type on the keyboard 😅.

To summarize: don't edit the vocabulary manually. Instead, write a function that normalizes your text (a minimal sketch follows below).
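For example, a sketch of such a function (illustrative only; the function name is made up, and whether you strip diacritics depends on your use case, as discussed above):

```python
import re
import unicodedata

def normalize_text(text: str, strip_diacritics: bool = True) -> str:
    """Normalize text before training the tokenizer and before sending text to the LLM."""
    text = text.lower()                        # 1. convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()   # 2. remove extra spaces
    if strip_diacritics:
        # 3. decompose characters and drop combining marks (Unicode category 'Mn'),
        #    which removes Arabic harakat as well as Latin accents
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return text

print(normalize_text("وُلِدْتُ فِي مَكْنِاسْ، فِي المَغْرِبْ."))
# -> ولدت في مكناس، في المغرب.
```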
-
Thanks for the advice!
-
I am training a BPE tokenizer (32k vocab) on a 1 GB dataset for an agglutinative language (Turkmen). I noticed significant redundancy in the vocabulary (e.g., separate tokens for 'Turkmenistan', 'türkmenistan', and 'TURKMENISTAN'), which wastes valuable vocabulary slots. I was tempted to manually prune or merge these entries in the JSON file post-training, but I know this might break the BPE merge rules. What is the best practice here?
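A minimal sketch of the approach recommended in the replies above, with the normalization built into the tokenizer itself so training data and inference input are treated identically. This assumes the Hugging Face `tokenizers` library (the question doesn't name one), and `turkmen_corpus.txt` is a placeholder file name:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# BPE tokenizer whose normalizer lowercases input, so 'Turkmenistan',
# 'türkmenistan' and 'TURKMENISTAN' collapse to one form before merges are learned.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),        # one canonical Unicode form
    normalizers.Lowercase(),  # collapse case variants
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["turkmen_corpus.txt"], trainer=trainer)  # placeholder file name

# The same normalizer runs automatically at encode time:
print(tokenizer.encode("TURKMENISTAN").tokens)
```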