Description
I found an issue in the MLM logic used for pre-training. The intention is to replace 80% of the selected tokens with the special mask token, replace 10% with a random amino acid, and leave the remaining 10% unchanged.
The code currently looks as follows (line 59 at commit 327e134):
```python
# 10% of the time, we replace masked input tokens with random vector.
randomized = (
    torch.bernoulli(torch.full(selected.shape, 0.1)).bool()
    & selected
    & ~replaced
)
```

It should be `torch.bernoulli(torch.full(selected.shape, 0.5)).bool()`. This line comes after 80% of the tokens have already been masked, so we have to randomly replace half of what remains.
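For reference, here is a minimal self-contained sketch of the intended 80/10/10 split (the variable names `selected`, `replaced`, and `randomized` follow the snippet above; the tensor shape and the 15% selection rate are illustrative assumptions, not values from the repo):

```python
import torch

# Illustrative: mark ~15% of positions as selected for MLM (assumed rate).
selected = torch.bernoulli(torch.full((8, 512), 0.15)).bool()

# 80% of the selected tokens get the special mask token.
replaced = torch.bernoulli(torch.full(selected.shape, 0.8)).bool() & selected

# Half of the remaining 20% (i.e. 10% of all selected tokens) get a random
# amino acid, hence p = 0.5 rather than 0.1.
randomized = (
    torch.bernoulli(torch.full(selected.shape, 0.5)).bool()
    & selected
    & ~replaced
)

# The last ~10% of selected tokens are left unchanged.
unmodified = selected & ~replaced & ~randomized
```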
The code is easy to fix, but it is hard to tell what the consequences were for CodonTransformer's performance. With p = 0.1, only about 2% of the selected tokens end up randomized (0.1 × the 20% that remain after masking) and about 18% stay unmodified, so pre-training effectively used two categories of tokens, masked and unmodified, with randomized tokens strongly under-represented. We would need to figure out whether the model performs better or worse with the intended behavior.
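A quick standalone check (not code from the repo) makes the imbalance concrete by comparing the resulting token proportions under p = 0.1 and p = 0.5:

```python
import torch

torch.manual_seed(0)
n = 1_000_000
selected = torch.ones(n, dtype=torch.bool)  # treat every position as selected
replaced = torch.bernoulli(torch.full((n,), 0.8)).bool() & selected

for p in (0.1, 0.5):
    randomized = (
        torch.bernoulli(torch.full((n,), p)).bool() & selected & ~replaced
    )
    unmodified = selected & ~replaced & ~randomized
    print(
        f"p={p}: masked {replaced.float().mean().item():.1%}, "
        f"randomized {randomized.float().mean().item():.1%}, "
        f"unmodified {unmodified.float().mean().item():.1%}"
    )

# Expected (approximately):
# p=0.1: masked 80.0%, randomized 2.0%, unmodified 18.0%
# p=0.5: masked 80.0%, randomized 10.0%, unmodified 10.0%
```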