@Cydral Cydral commented Jan 7, 2026

Summary

This PR addresses critical CUDA synchronization issues and enhances the test_layer utility function.

CUDA Kernel Fixes

Several CUDA kernels were using __syncthreads() for cross-block synchronization, which is incorrect: __syncthreads() only synchronizes threads within the same block, not across different blocks. When grid_stride_range_y distributes work across multiple blocks, the barrier silently fails to order writes and reads between blocks, producing data races.
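To illustrate the failure mode (a minimal sketch; the kernel name, layout, and indexing below are illustrative, not dlib's actual code):

```cuda
// BROKEN PATTERN: one kernel computes per-row norms, then tries to
// reuse them after a barrier.  __syncthreads() only synchronizes
// threads WITHIN a block, so when rows are spread across multiple
// blocks (as grid_stride_range_y does), block B may read norms[r]
// before block A has finished writing it.
__global__ void normalize_broken(float* data, float* norms, int rows, int cols)
{
    // phase 1: accumulate squared sums into norms[]
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            atomicAdd(&norms[r], data[r * cols + c] * data[r * cols + c]);

    __syncthreads();  // BUG: this is a per-block barrier only

    // phase 2: may observe incomplete norms[] written by other blocks
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            data[r * cols + c] /= sqrtf(norms[r]);
}
```

With a single block the code happens to work, which is why the bug can go unnoticed until the grid grows.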

Affected functions, each decomposed into separate kernels:

  • inverse_norms()
  • dot_prods()
  • multiply_conv()
  • layer_normalize()
  • rms_normalize()
  • compute_act_halt_probabilities()

The fix replaces the intra-kernel __syncthreads() with sequential launch_kernel() calls. Kernels launched on the same CUDA stream execute in order, so each launch implicitly synchronizes with the completion of the previous one across all blocks.
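The decomposition pattern can be sketched as follows (again illustrative: the helper names and launch configuration are assumptions, and dlib's launch_kernel wrapper is replaced here by raw <<<>>> launches for brevity):

```cuda
// FIXED PATTERN: split the two phases into separate kernels.
__global__ void compute_norms(const float* data, float* norms, int rows, int cols)
{
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            atomicAdd(&norms[r], data[r * cols + c] * data[r * cols + c]);
}

__global__ void apply_norms(float* data, const float* norms, int rows, int cols)
{
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            data[r * cols + c] /= sqrtf(norms[r]);
}

void normalize(float* data, float* norms, int rows, int cols)
{
    dim3 blocks(1, 32), threads(128, 1);
    compute_norms<<<blocks, threads>>>(data, norms, rows, cols);
    // Same stream: apply_norms cannot begin until every block of
    // compute_norms has completed, providing the cross-block barrier
    // the single-kernel version lacked.
    apply_norms<<<blocks, threads>>>(data, norms, rows, cols);
}
```

The extra launch overhead is negligible compared to the cost of a silent data race, and no device-wide synchronization primitive (e.g. cooperative groups) is required.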

test_layer Enhancement

Modified test_layer to accept optional parameters for testing layers that constrain the shape of their input tensor, enabling proper gradient verification for layers that only operate on particular input dimensions.
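Usage would look roughly like the following (a hedged sketch: the existing zero-argument form matches dlib's test suite, but the names and order of the new optional dimension parameters are hypothetical, as the PR diff is not shown here):

```cpp
#include <dlib/dnn.h>
using namespace dlib;

void check_layer_gradients()
{
    layer_normalize_ l;

    // Existing form: test_layer picks a random default-sized input.
    auto res = test_layer(l);
    DLIB_TEST_MSG(res, res);

    // Hypothetical extended form: pin the input tensor shape so a
    // layer with dimension constraints receives a valid input.
    // auto res2 = test_layer(l, /*num_samples=*/4, /*k=*/1, /*nr=*/8, /*nc=*/8);
}
```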

Related Discussion

Follow-up to #3128

Cydral and others added 30 commits April 28, 2025 22:10
@Cydral Cydral closed this Jan 7, 2026
@Cydral Cydral deleted the fixes branch January 7, 2026 12:37
