Tokenizer optimization #917
Replies: 3 comments
-
Hi there, my tip here would be not to hand-edit the vocab. A better way is to change the inputs, i.e., normalize the words in the BPE training data and then apply the same normalization to any text you tokenize later. You could define a custom set of rules, for example a "normalize_input" function that you call on the text before it goes into the tokenizer. Let me tag some readers who posted in this forum and have developed custom tokenizers in the past:
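A minimal sketch of what such a function could look like (the name `normalize_input` and the exact rules are just placeholders; the right rule set depends on your language and use case):

```python
import re

def normalize_input(text: str) -> str:
    """Hypothetical rule set applied to text before it goes into the tokenizer."""
    text = text.lower()                        # collapse case variants, e.g. TURKMENISTAN -> turkmenistan
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

# Call it on the BPE training data and again on any text you tokenize later:
print(normalize_input("  TURKMENISTAN   Türkmenistan "))  # -> "turkmenistan türkmenistan"
```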
-
@tllmmaster Hi, here is my opinion. As @rasbt said, don't edit the vocabulary manually. Instead, use normalization and pre-tokenization. This chapter (https://huggingface.co/learn/llm-course/en/chapter6/4) from the Hugging Face LLM course explains it well.

If you choose this approach, you need to create a function that applies a few transformations to the text, such as:

- Converting the text to lowercase.
- Removing extra spaces.
- Removing diacritics.
- And so on.

Then, use this function on the dataset you use to train the tokenizer, and later on the data you send to the LLM. The main issue with normalization and pre-tokenization is that your model cannot use characters that do not appear in the vocabulary. For example, in Arabic we use diacritics to show how words should be read. Here is a sentence with diacritics:

وُلِدْتُ فِي مَكْنِاسْ، فِي المَغْرِبْ.

If you remove the diacritics, it becomes:

ولدت في مكناس، في المغرب.

If you normalize Arabic text by removing diacritics, the model will not be able to use them. This might be fine if your use case doesn't require them. But if you need diacritics, you should not remove them during normalization. In my case, I removed them because we rarely use diacritics in everyday Arabic writing. We already know how to read the text without them, and they are also hard to type on the keyboard 😅.

To summarize: don't edit the vocabulary manually. Instead, write a function that normalizes your text (a minimal sketch follows below).
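For example, a sketch of such a function (illustrative only; the function name is made up, and whether you strip diacritics depends on your use case, as discussed above):

```python
import re
import unicodedata

def normalize_text(text: str, strip_diacritics: bool = True) -> str:
    """Normalize text before training the tokenizer and before sending text to the LLM."""
    text = text.lower()                        # 1. convert to lowercase
    text = re.sub(r"\s+", " ", text).strip()   # 2. remove extra spaces
    if strip_diacritics:
        # 3. decompose characters and drop combining marks (Unicode category 'Mn'),
        #    which removes Arabic harakat as well as Latin accents
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return text

print(normalize_text("وُلِدْتُ فِي مَكْنِاسْ، فِي المَغْرِبْ."))
# -> ولدت في مكناس، في المغرب.
```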
-
Thanks for the advice!
-
I am training a BPE tokenizer (32k vocab) on a 1 GB dataset for an agglutinative language (Turkmen). I noticed significant redundancy in the vocabulary (e.g., separate tokens for 'Turkmenistan', 'türkmenistan', and 'TURKMENISTAN'), which wastes valuable vocabulary slots. I was tempted to manually prune or merge these entries in the JSON file post-training, but I know this might break the BPE merge rules. What is the best practice here?
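A minimal sketch of the approach recommended in the replies above, with the normalization built into the tokenizer itself so training data and inference input are treated identically. This assumes the Hugging Face `tokenizers` library (the question doesn't name one), and `turkmen_corpus.txt` is a placeholder file name:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# BPE tokenizer whose normalizer lowercases input, so 'Turkmenistan',
# 'türkmenistan' and 'TURKMENISTAN' collapse to one form before merges are learned.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFC(),        # one canonical Unicode form
    normalizers.Lowercase(),  # collapse case variants
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["turkmen_corpus.txt"], trainer=trainer)  # placeholder file name

# The same normalizer runs automatically at encode time:
print(tokenizer.encode("TURKMENISTAN").tokens)
```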