
Error in MLM logic #28

@gui11aume

Description


I found an issue in the MLM logic used for pre-training. The intention is to mask 80% of the selected tokens (with the special mask token), randomly replace 10% of them with a random amino acid, and leave the remaining 10% as they are.

The code currently looks as follows:

    # 10% of the time, we replace masked input tokens with a random vector.
    randomized = (
        torch.bernoulli(torch.full(selected.shape, 0.1)).bool()
        & selected
        & ~replaced
    )

It should be torch.bernoulli(torch.full(selected.shape, 0.5)).bool(). This line runs after 80% of the selected tokens have already been masked, so to hit the intended 10% we have to randomly replace half of the 20% that remain. With a probability of 0.1, only about 2% of the selected tokens end up randomized and about 18% are left unchanged.
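
For concreteness, here is a minimal sketch of the intended 80/10/10 split, reusing the names selected and replaced from the snippet above. The function wrapper and the assumption that selected is a boolean mask of the tokens chosen for MLM are mine, not the repository's actual code.

    import torch

    # Sketch of the intended split, assuming `selected` is a boolean mask
    # of the tokens chosen for MLM (hypothetical helper, not repo code).
    def split_selected(selected: torch.Tensor):
        # 80% of the selected tokens get the special mask token.
        replaced = (
            torch.bernoulli(torch.full(selected.shape, 0.8)).bool() & selected
        )
        # Half of the remaining 20% (i.e. 10% of the selected tokens) are
        # replaced with a random amino acid; hence p = 0.5, not 0.1.
        randomized = (
            torch.bernoulli(torch.full(selected.shape, 0.5)).bool()
            & selected
            & ~replaced
        )
        # The final ~10% are left unchanged.
        unchanged = selected & ~replaced & ~randomized
        return replaced, randomized, unchanged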

The code is easy to fix, but it is hard to guess what the consequences were for the performance of CodonTransformer. The bug amounts to pre-training with essentially two categories of selected tokens, masked or unmodified, with the randomized ones strongly under-represented (roughly 2% instead of 10%). We would need to figure out whether the model performs better or worse with the intended behavior.
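
To quantify the under-representation, a quick standalone simulation (illustrative only, not from the repository) compares the two choices of the Bernoulli probability p:

    import torch

    torch.manual_seed(0)
    selected = torch.ones(1_000_000, dtype=torch.bool)

    for p in (0.1, 0.5):
        # Same structure as the snippet in this issue, with p varied.
        replaced = torch.bernoulli(torch.full(selected.shape, 0.8)).bool() & selected
        randomized = (
            torch.bernoulli(torch.full(selected.shape, p)).bool()
            & selected
            & ~replaced
        )
        unchanged = selected & ~replaced & ~randomized
        print(p,
              replaced.float().mean().item(),
              randomized.float().mean().item(),
              unchanged.float().mean().item())

    # p = 0.1 -> roughly 80% / 2% / 18% (randomized strongly under-represented)
    # p = 0.5 -> roughly 80% / 10% / 10% (the intended split)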
