
Finetune tokenizer padding is not set to custom max length #18

@ghost

Description

Hey. In `finetune.py`, the tokenizer is configured to pad each batch to the length of its longest sequence, rather than to the custom `MAX_LEN` intended for the entire dataset. Thanks.

PROBLEM:
```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding=True,
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```

FIX:

```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding="max_length",  # pad every sequence to MAX_LEN, not just to the longest in the batch
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```
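For reference, the difference between the two padding modes can be checked with any Hugging Face fast tokenizer. The sketch below is illustrative only: `bert-base-uncased`, the example strings, and `MAX_LEN = 16` are assumptions for demonstration, not values taken from `finetune.py`.

```python
# Minimal sketch (not from the repository): compares padding=True with
# padding="max_length". Tokenizer name, inputs, and MAX_LEN are illustrative.
from transformers import AutoTokenizer

MAX_LEN = 16
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = ["ATG GCT AAA", "ATG GCT"]

# padding=True: pads only up to the longest sequence in this batch.
dynamic = tokenizer(batch, padding=True, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt")

# padding="max_length": pads every sequence out to MAX_LEN.
fixed = tokenizer(batch, padding="max_length", truncation=True,
                  max_length=MAX_LEN, return_tensors="pt")

print(dynamic["input_ids"].shape)  # (2, length of longest sequence in the batch)
print(fixed["input_ids"].shape)    # (2, MAX_LEN) -> torch.Size([2, 16])
```

With `padding=True`, tensor widths vary from batch to batch; with `padding="max_length"`, every batch has the same fixed width `MAX_LEN`.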
