
Conversation


@oleksost oleksost commented Jan 19, 2026

✨ Description

  • Tests no longer import from the apriel2 modelling file, since it has CUDA imports.
  • Fix an int16 addition overflow: when a Python int is added to a numpy int16, the addition is performed in int16, which overflows.
  • Several tests from the external module were not marked with @requires_cuda; they now are.
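A minimal numpy sketch of the overflow described in the second bullet (variable names and values are illustrative, not taken from the code under test):

```python
import numpy as np

begin_ = np.int16(100)  # illustrative offset stored as an int16 scalar
begin = 44345           # a Python int, as in the failing test

# Under NumPy >= 2.0 (NEP 50), the Python int operand is cast to the other
# operand's dtype, so an out-of-range value raises immediately:
#   OverflowError: Python integer 44345 out of bounds for int16
# (older NumPy versions instead promote or silently wrap).
try:
    _ = begin_ + begin
except OverflowError as exc:
    print(f"overflow: {exc}")

# Converting to a Python int first keeps arbitrary-precision arithmetic.
print(int(begin_) + begin)  # 44445
```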

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Stop importing from the apriel2 modelling file in tests (it pulls in CUDA imports).
  2. Cast numpy integer offsets to Python int before arithmetic to avoid int16 overflow.
  3. Mark the external-module tests that need a GPU with @requires_cuda.

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

```diff
 # Torch doesn't support type promotion between signed and unsigned types, so we convert here to avoid issues.
+# Convert begin and end to int to avoid numpy dtype overflow when adding to begin_
-return TokenSample(self._tokens[begin_ + begin : begin_ + end].to(torch.int64), [end - begin])
+return TokenSample(self._tokens[begin_ + int(begin) : begin_ + int(end)].to(torch.int64), [end - begin])
```
Collaborator:

These should already be integers?

Contributor Author:

Without the cast, test_checkpoint_and_eval[apriel2_text_all_hybrid] fails with `OverflowError: Python integer 44345 out of bounds for int16`.

@jlamypoirier (Collaborator) commented Jan 19, 2026:

That points to another issue upstream, given the method signature.

Contributor Author:

```python
def get_unsigned_integer_type(max_size: int) -> DataType:
    # TODO: Use uint types (recently added for torch, not enough methods supported yet)
    if max_size < 2**8:
        return DataType.uint8
    elif max_size < 2**15:
        return DataType.int16
    elif max_size < 2**31:
        return DataType.int32
    else:
        return DataType.int64
```

Here only uint8 is actually unsigned; the others are signed, despite the function's name.
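To make the point concrete, here is a standalone sketch of the same branch structure, with plain strings standing in for the project's DataType enum:

```python
def storage_dtype(max_size: int) -> str:
    # Mirrors get_unsigned_integer_type above; note that only the
    # first branch actually returns an unsigned type.
    if max_size < 2**8:
        return "uint8"
    elif max_size < 2**15:
        return "int16"
    elif max_size < 2**31:
        return "int32"
    else:
        return "int64"

print(storage_dtype(200))    # uint8
print(storage_dtype(30000))  # int16 -- signed, despite the function's name
print(storage_dtype(2**20))  # int32
```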

@oleksost (Contributor, Author) commented Jan 19, 2026:

We probably need signed ints everywhere, as suggested by the `token_start_index_in_document = max(token_start - token_count, 0)` operation here, which underflows in tests with unsigned uint8.

Collaborator:

The unsigned integer should be fine; the problem is that we should convert to a Python int before doing this kind of operation. In this line, token_start is a Python int and token_count is a tensor, so the operation isn't really safe with signed ints either.

@oleksost oleksost requested a review from jlamypoirier January 19, 2026 22:36