Hey. In `finetune.py`, the tokenizer is configured to pad each batch to the length of its longest sequence, rather than to the maximum length across the entire dataset. Thanks.
PROBLEM:
```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding=True,
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```
FIX:
```python
def __call__(self, examples):
    tokenized = self.tokenizer(
        [ex["codons"] for ex in examples],
        return_attention_mask=True,
        return_token_type_ids=True,
        truncation=True,
        padding='max_length',  # fixed: pad to the dataset max length (MAX_LEN), not the batch max
        max_length=MAX_LEN,
        return_tensors="pt",
    )
```
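For anyone who wants to verify the difference, here is a minimal standalone sketch. It uses a generic Hugging Face tokenizer (`bert-base-uncased`) and placeholder sequences as stand-ins for the repo's codon tokenizer and data, but the `padding` argument behaves the same way: `padding=True` pads only to the longest sequence in the batch, while `padding='max_length'` always pads to `MAX_LEN`.

```python
# Sketch only: tokenizer name, sequences, and MAX_LEN are placeholders,
# not taken from this repository.
from transformers import AutoTokenizer

MAX_LEN = 16  # stand-in for the MAX_LEN constant in finetune.py

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = ["ATG GCT AAA", "ATG GCT AAA GGG TTT CCC"]

# padding=True: pads only to the longest sequence in this batch.
dynamic = tokenizer(batch, padding=True, truncation=True,
                    max_length=MAX_LEN, return_tensors="pt")
print(dynamic["input_ids"].shape)  # e.g. torch.Size([2, 8]) -- batch max

# padding='max_length': always pads to MAX_LEN, regardless of the batch.
fixed = tokenizer(batch, padding="max_length", truncation=True,
                  max_length=MAX_LEN, return_tensors="pt")
print(fixed["input_ids"].shape)    # torch.Size([2, 16]) -- fixed length
```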