Description
I found an issue in the MLM logic used for pre-training. The intention is to replace 80% of the selected tokens with the special mask token, replace 10% with a random amino acid, and leave the remaining 10% unchanged.
The code currently looks as follows (line 59 at commit 327e134):
```python
# 10% of the time, we replace masked input tokens with random vector.
randomized = (
    torch.bernoulli(torch.full(selected.shape, 0.1)).bool()
    & selected
    & ~replaced
)
```

It should be `torch.bernoulli(torch.full(selected.shape, 0.5)).bool()`. This line comes after 80% of the tokens have already been masked, so we have to randomly replace half of what remains.
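For reference, here is a minimal self-contained sketch of the intended 80/10/10 split (the variable names `selected`, `replaced`, and `randomized` follow the snippet above; the tensor shape and the 15% selection rate are illustrative assumptions, not values from the repo):

```python
import torch

# Illustrative: mark ~15% of positions as selected for MLM (assumed rate).
selected = torch.bernoulli(torch.full((8, 512), 0.15)).bool()

# 80% of the selected tokens get the special mask token.
replaced = torch.bernoulli(torch.full(selected.shape, 0.8)).bool() & selected

# Half of the remaining 20% (i.e. 10% of all selected tokens) get a random
# amino acid, hence p = 0.5 rather than 0.1.
randomized = (
    torch.bernoulli(torch.full(selected.shape, 0.5)).bool()
    & selected
    & ~replaced
)

# The last ~10% of selected tokens are left unchanged.
unmodified = selected & ~replaced & ~randomized
```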
The code is easy to fix, but it is hard to tell what the consequences were for CodonTransformer's performance. With p = 0.1, only about 2% of the selected tokens end up randomized (0.1 × the 20% that remain after masking) and about 18% stay unmodified, so pre-training effectively used two categories of tokens, masked and unmodified, with randomized tokens strongly under-represented. We would need to figure out whether the model performs better or worse with the intended behavior.
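A quick standalone check (not code from the repo) makes the imbalance concrete by comparing the resulting token proportions under p = 0.1 and p = 0.5:

```python
import torch

torch.manual_seed(0)
n = 1_000_000
selected = torch.ones(n, dtype=torch.bool)  # treat every position as selected
replaced = torch.bernoulli(torch.full((n,), 0.8)).bool() & selected

for p in (0.1, 0.5):
    randomized = (
        torch.bernoulli(torch.full((n,), p)).bool() & selected & ~replaced
    )
    unmodified = selected & ~replaced & ~randomized
    print(
        f"p={p}: masked {replaced.float().mean().item():.1%}, "
        f"randomized {randomized.float().mean().item():.1%}, "
        f"unmodified {unmodified.float().mean().item():.1%}"
    )

# Expected (approximately):
# p=0.1: masked 80.0%, randomized 2.0%, unmodified 18.0%
# p=0.5: masked 80.0%, randomized 10.0%, unmodified 10.0%
```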