Conversation

@Cydral commented Jan 7, 2026

Summary

This PR addresses critical CUDA synchronization issues and enhances the test_layer utility function.

CUDA Kernel Fixes

Several CUDA kernels used __syncthreads() for cross-block synchronization. This is incorrect: __syncthreads() is a barrier only for the threads within a single block, never across blocks. When grid_stride_range_y distributes a kernel's work over multiple blocks, these barriers silently fail to order the kernel's passes, so a later pass can read intermediate values that other blocks have not finished writing.
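
A minimal sketch of the failure mode (illustrative only, not the actual dlib kernels; grid_stride_range and grid_stride_range_y are dlib's device-side grid-stride loop helpers):

```cpp
// Illustrative anti-pattern: one fused kernel that writes per-row norms
// in pass 1 and consumes them in pass 2.  (norms assumed zeroed beforehand)
__global__ void _fused_normalize(float* out, const float* in,
                                 float* norms, size_t nr, size_t nc)
{
    // Pass 1: accumulate the squared sum of each row this block visits.
    for (auto r : grid_stride_range_y(0, nr))
        for (auto c : grid_stride_range(0, nc))
            atomicAdd(&norms[r], in[r*nc+c]*in[r*nc+c]);

    // BUG: __syncthreads() is a barrier for the threads of *this* block
    // only.  When grid_stride_range_y spreads the rows over several blocks,
    // nothing orders pass 1 of the other blocks before pass 2 here, so
    // norms[r] may be read while still partially accumulated.
    __syncthreads();

    // Pass 2: normalize each row by its (possibly incomplete) norm.
    for (auto r : grid_stride_range_y(0, nr))
        for (auto c : grid_stride_range(0, nc))
            out[r*nc+c] = in[r*nc+c] * rsqrtf(norms[r] + 1e-5f);
}
```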

The affected functions, each decomposed into separate kernels:

  • inverse_norms()
  • dot_prods()
  • multiply_conv()
  • layer_normalize()
  • rms_normalize()
  • compute_act_halt_probabilities()

The fix replaces the intra-kernel __syncthreads() barriers with sequential launch_kernel() calls. Kernels issued to the same CUDA stream execute in launch order, so consecutive launches provide implicit synchronization between the passes.
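
A sketch of the decomposition pattern, assuming dlib's launch_kernel/max_jobs helpers (kernel names and bodies here are illustrative, not the actual patched code):

```cpp
// Illustrative decomposition -- kernel names are hypothetical.
__global__ void _cuda_compute_norms(float* norms, const float* in,
                                    size_t nr, size_t nc)
{
    // Pass 1 as its own kernel: accumulate each row's squared sum.
    // (norms assumed zeroed beforehand)
    for (auto r : grid_stride_range_y(0, nr))
        for (auto c : grid_stride_range(0, nc))
            atomicAdd(&norms[r], in[r*nc+c]*in[r*nc+c]);
}

__global__ void _cuda_apply_norms(float* out, const float* in,
                                  const float* norms, size_t nr, size_t nc)
{
    // Pass 2 as its own kernel: every norms[r] is complete by now.
    for (auto r : grid_stride_range_y(0, nr))
        for (auto c : grid_stride_range(0, nc))
            out[r*nc+c] = in[r*nc+c] * rsqrtf(norms[r] + 1e-5f);
}

void normalize_rows(float* out, const float* in, float* norms,
                    size_t nr, size_t nc)
{
    // Kernels issued to the same CUDA stream run in launch order, so the
    // second launch cannot start before the first finishes: this is the
    // implicit synchronization the fused __syncthreads() could not provide.
    launch_kernel(_cuda_compute_norms, max_jobs(nc, nr), norms, in, nr, nc);
    launch_kernel(_cuda_apply_norms, max_jobs(nc, nr), out, in, norms, nr, nc);
}
```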

test_layer Enhancement

Modified test_layer to accept optional parameters that constrain the input tensor fed to the layer under test, enabling proper gradient verification for layers that only accept inputs of particular dimensions.
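
For illustration, a hypothetical call in the style of dlib's dnn tests (the layer name, parameter names, and parameter order here are assumptions; the authoritative signature is in this PR's diff):

```cpp
// Hypothetical usage sketch.  Instead of test_layer inventing an arbitrary
// random input tensor, the caller can pin the dimensions the layer needs.
act_ l;  // hypothetical layer with strict input-shape requirements
auto res = test_layer(l, /*num_samples=*/2, /*k=*/4, /*nr=*/8, /*nc=*/8);
DLIB_TEST_MSG(res, res);  // dlib's usual assertion pattern for gradient checks
```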

Related Discussion

Follow-up to #3128

@Cydral commented Jan 8, 2026

Hi @davisking, once PR #3132 is resolved, I will be able to share a full update for the ACT processing layer.

