Fix CI ValueError on NumPy 2.x by avoiding VLENUTF8 string dtype in tests by adilraza99 · Pull Request #853 · malariagen/malariagen-data-python

adilraza99 · 2026-02-01T19:32:49Z

This PR fixes the CI failures triggered by recent NumPy 2.x changes that were causing test initialization to crash with a ValueError.

While investigating #847, I found that several test fixtures were creating Zarr datasets using dtype=str. With NumPy ≥ 2.0 this path goes through the VLENUTF8 codec in numcodecs, which performs boolean checks that are no longer allowed and result in the following error:

ValueError: The truth value of an array with more than one element is ambiguous
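For anyone unfamiliar with this error class, here is a minimal sketch that reproduces the same message in plain NumPy. It is not the numcodecs code path itself, only an illustration of what an implicit boolean check on a multi-element array does. The names `values` and `message` are made up for the example.

```python
import numpy as np

# Hypothetical data: two Python strings held in an object array.
values = np.array(["a", "b"], dtype=object)

try:
    # The comparison is elementwise and yields a two-element boolean array.
    # Using that array in an `if` forces a single truth value, which NumPy
    # refuses to guess.
    if values == "a":
        pass
except ValueError as exc:
    message = str(exc)
else:
    message = ""

print(message)
```

An explicit `.any()` or `.all()` on the comparison result is the usual way to state which reduction is intended.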

What changed

In the affected test fixtures, dtype=str has been replaced with dtype="U" (NumPy fixed-length Unicode dtype).

This keeps the stored data as strings while avoiding the VLENUTF8 code path entirely.
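As a rough illustration of the difference on the NumPy side, the sketch below builds the same strings once as a fixed-width Unicode array and once as an object array. It deliberately does not call Zarr, and the sample `labels` are invented; how Zarr maps each dtype to a codec is not shown here.

```python
import numpy as np

# Hypothetical sample values, not taken from the real fixtures.
labels = ["AA0001", "AA02", "B3"]

# dtype="U" requests fixed-width Unicode; NumPy sizes the width to the
# longest string in the input (6 characters here, 4 bytes each).
fixed = np.array(labels, dtype="U")

# An object array stores Python str objects of arbitrary length, which is
# the kind of data a variable-length string codec has to encode.
variable = np.array(labels, dtype=object)

print(fixed.dtype.kind, fixed.dtype.itemsize)
print(variable.dtype.kind)
```

One thing to keep in mind with fixed-width arrays is that the width is set from the data at creation time, so assigning a longer string later would be truncated.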

Why this approach

Does not require pinning NumPy versions
Keeps behavior unchanged for NumPy 1.x
Fully compatible with NumPy 2.x
Limits the change strictly to test infrastructure
Avoids touching production code

Impact

Production code: unchanged
Dependencies: unchanged
Scope: test fixtures only
CI: unblocked across the full test matrix

Verification

Reproduced the failure locally on NumPy 2.x
Isolated the exact failing fixture initialization
Confirmed no other dtype=str usage exists outside tests
Validated that the fix works on both NumPy 1.26.x and NumPy 2.x
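A check like "no other dtype=str usage exists outside tests" can be approximated with a small scan. The following is only a sketch using the standard library, not the procedure actually used for this PR; `find_dtype_str`, the regular expression, and the demo file names are all hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Matches `dtype=str` (spaces around "=" allowed) but not `dtype=string`
# or `dtype=str_`, because of the trailing word boundary.
DTYPE_STR = re.compile(r"\bdtype\s*=\s*str\b")


def find_dtype_str(root):
    """Return (relative path, line number, stripped line) for each match."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if DTYPE_STR.search(line):
                rel = path.relative_to(root).as_posix()
                hits.append((rel, lineno, line.strip()))
    return hits


# Demo on a throwaway directory with made-up file contents.
with tempfile.TemporaryDirectory() as tmp:
    tests_dir = Path(tmp) / "tests"
    tests_dir.mkdir()
    (tests_dir / "conftest.py").write_text(
        "z = make_dataset('x', dtype=str)\n", encoding="utf-8"
    )
    (Path(tmp) / "pkg.py").write_text(
        "a = arr.astype(dtype='U')\n", encoding="utf-8"
    )
    demo_hits = find_dtype_str(tmp)

print(demo_hits)
```

Filtering the returned paths by whether they start with `tests/` would separate fixture code from everything else.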

This should unblock the current CI failures and allow dependent PRs to proceed without introducing version constraints or behavioral changes.

Fixes #847

adilraza99 · 2026-02-01T20:10:22Z

Hi @jonbrenas @ahernank,

I've opened a PR to address the CI failure in #847 by avoiding the VLENUTF8 string path in test fixtures.

The fix is test-only, NumPy-version agnostic (works for 1.26.x and 2.x), and keeps production code untouched.

Happy to make any adjustments if needed. Thanks!

jonbrenas · 2026-02-01T21:57:54Z

Thank you, @adilraza99.

It looks like the tests still fail in CI, but the error seems to be different, which I count as progress.

adilraza99 · 2026-02-01T21:58:35Z

Hi @jonbrenas,

Thanks for reviewing this!

I looked into the remaining CI failures in more detail. The current failures fall into two separate categories:

Coverage job failures:
These appear to be caused by a compatibility issue between NumPy 2.x and Pandas (StringArray.astype(order=...) raises a TypeError). This comes from upstream behavior changes and is unrelated to the VLENUTF8 test fixture path addressed in this PR.
Original VLENUTF8-related CI failure (ValueError errors during CI checks #847):
This PR specifically targets this root cause: the boolean ambiguity triggered by the VLENUTF8 string dtype path during Zarr dataset initialization on NumPy 2.x.

To keep the change safe and reviewable, this PR is intentionally scoped to:

Be test-only (no production code changes)
Avoid the problematic VLENUTF8 code path in fixtures
Remain compatible with both NumPy 1.26.x and 2.x
Preserve existing test behavior and dataset semantics

If it helps with validation, it might be worth temporarily allowing or bypassing the coverage job to confirm that the core test suite passes cleanly with this fix in place.

I avoided bundling the Pandas/StringArray issue into this PR to keep the original regression fix minimal and isolated.

Happy to open a follow-up PR focused specifically on the coverage/Pandas compatibility issue if you'd like to handle that separately.

Appreciate your feedback.

jonbrenas

LGTM.

New errors that we need to investigate.

jonbrenas · 2026-02-01T22:16:21Z

Thank you, @adilraza99.

My bad, I forgot to check the Pandas error and I didn't recognize that it was the one showing up. I tried to rerun the tests with numpy==1.26.4, though, and it looks like a different error (due to a (0, n, m) slice, apparently) is causing tests to fail.
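The thread does not say what exactly fails on the (0, n, m) slice, so the following is only a sketch of what such a shape means in NumPy and one common way it can raise. The sizes `n` and `m` and the array `block` are invented for illustration and are not taken from the test suite.

```python
import numpy as np

# Hypothetical sizes for the trailing dimensions.
n, m = 3, 2
block = np.zeros((5, n, m))

# Slicing an empty range along the first axis keeps the trailing
# dimensions but leaves no elements at all.
empty = block[0:0]
print(empty.shape, empty.size)

try:
    # Reductions without an identity value cannot handle zero elements.
    empty.max()
except ValueError as exc:
    reduction_error = str(exc)
else:
    reduction_error = ""

print(reduction_error)
```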

adilraza99 · 2026-02-04T18:45:11Z

Marking this PR as draft for now.

I’m currently focusing on stabilizing CI and the baseline environment in PR #855.
Once that is merged and the CI matrix is stable, I’ll rebase/update this PR accordingly.

No changes are being abandoned - just pausing to avoid conflicting CI noise.

Thanks!

tests: avoid NumPy 2.x boolean ambiguity by using fixed-length Unicode dtype

c4e9bf4

style: apply pre-commit formatting fixes

1d6c890

jonbrenas previously approved these changes Feb 1, 2026

jonbrenas self-requested a review February 1, 2026 22:15

adilraza99 marked this pull request as draft February 4, 2026 18:45

adilraza99 wants to merge 2 commits into malariagen:master from adilraza99:fix-ci-valueerror-847

2 participants