-
Notifications
You must be signed in to change notification settings - Fork 42
Bug fixes #453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
fast_llm/data/sample/token.py
Outdated
| # Torch doesn't support type promotion between signed and unsigned types, so we convert here to avoid issues. | ||
| # Convert begin and end to int to avoid numpy dtype overflow when adding to begin_ | ||
| return TokenSample(self._tokens[begin_ + begin : begin_ + end].to(torch.int64), [end - begin]) | ||
| return TokenSample(self._tokens[begin_ + int(begin) : begin_ + int(end)].to(torch.int64), [end - begin]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should already be integers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without test_checkpoint_and_eval[apriel2_text_all_hybrid] fails with OverflowError: Python integer 44345 out of bounds for int16
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That points to another issue upstream, given the method signature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def get_unsigned_integer_type(max_size: int) -> DataType:
# TODO: Use uint types (recently added for torch, not enough methods supported yet)
if max_size < 2**8:
return DataType.uint8
elif max_size < 2**15:
return DataType.int16
elif max_size < 2**31:
return DataType.int32
else:
return DataType.int64
Here only the uint8 is unsigned, others are signed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably need signed ints everywhere as inferred by the fact that we have this max token_start_index_in_document = max(token_start - token_count, 0) operation here, which underflows in tests with unsigned uint8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsigned integer should be fine, the problem is we should convert to python int before doing this kind of operation. In this line token_start is a python int and token_count is a tensor, so the operation isn't really safe in signed ints either.
…nteger overflow" This reverts commit 2ba1401.
…instead of uint8.
…se int8 instead of uint8." This reverts commit a466078.
✨ Description
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing
Performance Impact
📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.