Skip to content

Conversation

@tdavidcl
Copy link
Member

No description provided.

@github-actions
Copy link

Thanks @tdavidcl for opening this PR!

You can do multiple things directly here:
1 - Comment pre-commit.ci run to run pre-commit checks.
2 - Comment pre-commit.ci autofix to apply fixes.
3 - Add label autofix.ci to fix authorship & pre-commit for every commit made.
4 - Add label light-ci to only trigger a reduced & faster version of the CI (need the full one before merge).
5 - Add label trigger-ci to create an empty commit to trigger the CI.

Once the workflow completes a message will appear displaying informations related to the run.

Also the PR gets automatically reviewed by gemini, you can:
1 - Comment /gemini review to trigger a review
2 - Comment /gemini summary for a summary
3 - Tag it using @gemini-code-assist either in the PR or in review comments on files

@gemini-code-assist
Copy link

Summary of Changes

Hello @tdavidcl, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new benchmark suite designed to rigorously evaluate the performance characteristics of various kernel defragmentation and fusion techniques. By comparing traditional individual kernel dispatches against several fused kernel implementations that leverage different SYCL memory management paradigms (DeviceBuffer, sycl::buffer, USM), the PR aims to identify optimal strategies for improving efficiency when dealing with numerous small computational tasks on heterogeneous devices. The tests are conducted across different queue types and fragmentation levels to provide a comprehensive performance analysis.

Highlights

  • New Benchmark Test: Introduced a new benchmark test file kernel_defragment_test.cpp to evaluate different strategies for kernel defragmentation and fusion.
  • Multiple Kernel Fusion Strategies: Implemented and benchmarked four distinct saxpy kernel execution methods: a baseline of individual kernel launches, and three fused approaches utilizing sham::DeviceBuffer, sycl::buffer, and SYCL Unified Shared Memory (USM) for parameter passing.
  • Comprehensive Performance Evaluation: The benchmark suite tests these strategies under both in-order and out-of-order SYCL queues, systematically varying the number of kernel fragments to assess performance across different fragmentation levels.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@tdavidcl
Copy link
Member Author

Current state:

- [1/1] : [001] kernel_defragmentation
Testing base case (out of order): 
[AdaptiveCpp Warning] kernel_cache: This application run has resulted in new binaries being JIT-compiled. This indicates that the runtime optimization process has not yet reached peak performance. You may want to run the application again until this warning no longer appears to achieve optimal performance.
 - n_bufs:    1 - size: 262144 - time: 1.0485e-05 seconds - correctness: correct 
 - n_bufs:    2 - size: 131072 - time: 1.3948e-05 seconds - correctness: correct 
 - n_bufs:    4 - size:  65536 - time: 2.2092e-05 seconds - correctness: correct 
 - n_bufs:    8 - size:  32768 - time: 3.9971e-05 seconds - correctness: correct 
 - n_bufs:   16 - size:  16384 - time: 7.3849e-05 seconds - correctness: correct 
 - n_bufs:   32 - size:   8192 - time: 1.4791e-04 seconds - correctness: correct 
 - n_bufs:   64 - size:   4096 - time: 2.9305e-04 seconds - correctness: correct 
 - n_bufs:  128 - size:   2048 - time: 5.9342e-04 seconds - correctness: correct 
 - n_bufs:  256 - size:   1024 - time: 1.1336e-03 seconds - correctness: correct 
 - n_bufs:  512 - size:    512 - time: 2.2787e-03 seconds - correctness: correct 
 - n_bufs: 1024 - size:    256 - time: 4.5492e-03 seconds - correctness: correct 
 - n_bufs: 2048 - size:    128 - time: 9.3470e-03 seconds - correctness: correct 
Testing base case (in order): 
 - n_bufs:    1 - size: 262144 - time: 9.7980e-06 seconds - correctness: correct 
 - n_bufs:    2 - size: 131072 - time: 1.2894e-05 seconds - correctness: correct 
 - n_bufs:    4 - size:  65536 - time: 1.9969e-05 seconds - correctness: correct 
 - n_bufs:    8 - size:  32768 - time: 3.5099e-05 seconds - correctness: correct 
 - n_bufs:   16 - size:  16384 - time: 6.5540e-05 seconds - correctness: correct 
 - n_bufs:   32 - size:   8192 - time: 1.3219e-04 seconds - correctness: correct 
 - n_bufs:   64 - size:   4096 - time: 2.5635e-04 seconds - correctness: correct 
 - n_bufs:  128 - size:   2048 - time: 5.1583e-04 seconds - correctness: correct 
 - n_bufs:  256 - size:   1024 - time: 1.0196e-03 seconds - correctness: correct 
 - n_bufs:  512 - size:    512 - time: 2.0886e-03 seconds - correctness: correct 
 - n_bufs: 1024 - size:    256 - time: 4.2120e-03 seconds - correctness: correct 
 - n_bufs: 2048 - size:    128 - time: 8.4548e-03 seconds - correctness: correct 
Testing base case (fuse lock out of order): 
 - n_bufs:    1 - size: 262144 - time: 7.2035e-05 seconds - correctness: correct 
 - n_bufs:    2 - size: 131072 - time: 6.4773e-05 seconds - correctness: correct 
 - n_bufs:    4 - size:  65536 - time: 5.7774e-05 seconds - correctness: correct 
 - n_bufs:    8 - size:  32768 - time: 6.1620e-05 seconds - correctness: correct 
 - n_bufs:   16 - size:  16384 - time: 6.1653e-05 seconds - correctness: correct 
 - n_bufs:   32 - size:   8192 - time: 6.6945e-05 seconds - correctness: correct 
 - n_bufs:   64 - size:   4096 - time: 8.6689e-05 seconds - correctness: correct 
 - n_bufs:  128 - size:   2048 - time: 1.1578e-04 seconds - correctness: correct 
 - n_bufs:  256 - size:   1024 - time: 1.8471e-04 seconds - correctness: correct 
 - n_bufs:  512 - size:    512 - time: 3.3370e-04 seconds - correctness: correct 
 - n_bufs: 1024 - size:    256 - time: 6.2578e-04 seconds - correctness: correct 
 - n_bufs: 2048 - size:    128 - time: 1.1530e-03 seconds - correctness: incorrect 
Testing base case (fuse lock in order): 
 - n_bufs:    1 - size: 262144 - time: 7.0873e-05 seconds - correctness: correct 
 - n_bufs:    2 - size: 131072 - time: 6.1959e-05 seconds - correctness: correct 
 - n_bufs:    4 - size:  65536 - time: 5.6592e-05 seconds - correctness: correct 
 - n_bufs:    8 - size:  32768 - time: 5.7397e-05 seconds - correctness: correct 
 - n_bufs:   16 - size:  16384 - time: 6.0496e-05 seconds - correctness: correct 
 - n_bufs:   32 - size:   8192 - time: 6.4414e-05 seconds - correctness: correct 
 - n_bufs:   64 - size:   4096 - time: 7.8303e-05 seconds - correctness: correct 
 - n_bufs:  128 - size:   2048 - time: 1.1086e-04 seconds - correctness: correct 
 - n_bufs:  256 - size:   1024 - time: 1.7962e-04 seconds - correctness: correct 
 - n_bufs:  512 - size:    512 - time: 3.1875e-04 seconds - correctness: correct 
 - n_bufs: 1024 - size:    256 - time: 6.1356e-04 seconds - correctness: correct 
 - n_bufs: 2048 - size:    128 - time: 1.1040e-03 seconds - correctness: incorrect 
Testing base case (fuse sycl buffer out of order): 
[AdaptiveCpp Warning] This application uses SYCL buffers; the SYCL buffer-accessor model is well-known to introduce unnecessary overheads. Please consider migrating to the SYCL2020 USM model, in particular device USM (sycl::malloc_device) combined with in-order queues for more performance. See the AdaptiveCpp performance guide for more information: 
https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/performance.md
 - n_bufs:    1 - size: 262144 - time: 1.6504e-05 seconds - correctness: correct 
 - n_bufs:    2 - size: 131072 - time: 1.6161e-05 seconds - correctness: correct 
 - n_bufs:    4 - size:  65536 - time: 1.6035e-05 seconds - correctness: correct 
 - n_bufs:    8 - size:  32768 - time: 1.7100e-05 seconds - correctness: correct 
 - n_bufs:   16 - size:  16384 - time: 1.9567e-05 seconds - correctness: correct 
 - n_bufs:   32 - size:   8192 - time: 2.3361e-05 seconds - correctness: correct 
 - n_bufs:   64 - size:   4096 - time: 3.9491e-05 seconds - correctness: correct 
 - n_bufs:  128 - size:   2048 - time: 1.0402e-04 seconds - correctness: correct 
 - n_bufs:  256 - size:   1024 - time: 1.6322e-04 seconds - correctness: correct 
 - n_bufs:  512 - size:    512 - time: 3.0442e-04 seconds - correctness: correct 
 - n_bufs: 1024 - size:    256 - time: 5.8007e-04 seconds - correctness: correct 
 - n_bufs: 2048 - size:    128 - time: 1.0711e-03 seconds - correctness: incorrect 
Testing base case (fuse sycl buffer in order): 
 - n_bufs:    1 - size: 262144 - time: 1.5526e-05 seconds - correctness: correct 
 - n_bufs:    2 - size: 131072 - time: 1.5506e-05 seconds - correctness: correct 
 - n_bufs:    4 - size:  65536 - time: 1.6514e-05 seconds - correctness: correct 
 - n_bufs:    8 - size:  32768 - time: 1.7019e-05 seconds - correctness: correct 
 - n_bufs:   16 - size:  16384 - time: 1.8753e-05 seconds - correctness: correct 
 - n_bufs:   32 - size:   8192 - time: 2.3560e-05 seconds - correctness: correct 
 - n_bufs:   64 - size:   4096 - time: 6.7909e-05 seconds - correctness: correct 
 - n_bufs:  128 - size:   2048 - time: 6.9739e-05 seconds - correctness: correct 
 - n_bufs:  256 - size:   1024 - time: 1.6146e-04 seconds - correctness: correct 
 - n_bufs:  512 - size:    512 - time: 3.0892e-04 seconds - correctness: correct 
 - n_bufs: 1024 - size:    256 - time: 5.8113e-04 seconds - correctness: correct 
 - n_bufs: 2048 - size:    128 - time: 1.0837e-03 seconds - correctness: incorrect 
Testing base case (fuse usm out of order): 
 - n_bufs:    1 - size: 262144 - time: 6.6889e-05 seconds - correctness: correct 
 - n_bufs:    2 - size: 131072 - time: 5.9232e-05 seconds - correctness: correct 
 - n_bufs:    4 - size:  65536 - time: 5.3225e-05 seconds - correctness: correct 
 - n_bufs:    8 - size:  32768 - time: 5.5211e-05 seconds - correctness: correct 
 - n_bufs:   16 - size:  16384 - time: 6.0262e-05 seconds - correctness: correct 
 - n_bufs:   32 - size:   8192 - time: 6.7698e-05 seconds - correctness: correct 
 - n_bufs:   64 - size:   4096 - time: 8.7782e-05 seconds - correctness: correct 
 - n_bufs:  128 - size:   2048 - time: 1.2819e-04 seconds - correctness: correct 
 - n_bufs:  256 - size:   1024 - time: 2.0597e-04 seconds - correctness: correct 
 - n_bufs:  512 - size:    512 - time: 3.7622e-04 seconds - correctness: correct 
 - n_bufs: 1024 - size:    256 - time: 7.2086e-04 seconds - correctness: correct 
 - n_bufs: 2048 - size:    128 - time: 1.4376e-03 seconds - correctness: incorrect 
Testing base case (fuse usm in order): 
 - n_bufs:    1 - size: 262144 - time: 6.4164e-05 seconds - correctness: correct 
 - n_bufs:    2 - size: 131072 - time: 5.6040e-05 seconds - correctness: correct 
 - n_bufs:    4 - size:  65536 - time: 5.0519e-05 seconds - correctness: correct 
 - n_bufs:    8 - size:  32768 - time: 5.3853e-05 seconds - correctness: correct 
 - n_bufs:   16 - size:  16384 - time: 5.7600e-05 seconds - correctness: correct 
 - n_bufs:   32 - size:   8192 - time: 6.5692e-05 seconds - correctness: correct 
 - n_bufs:   64 - size:   4096 - time: 8.8383e-05 seconds - correctness: correct 
 - n_bufs:  128 - size:   2048 - time: 1.2610e-04 seconds - correctness: correct 
 - n_bufs:  256 - size:   1024 - time: 2.1073e-04 seconds - correctness: correct 
 - n_bufs:  512 - size:    512 - time: 3.8097e-04 seconds - correctness: correct 
 - n_bufs: 1024 - size:    256 - time: 7.3678e-04 seconds - correctness: correct 
 - n_bufs: 2048 - size:    128 - time: 1.4987e-03 seconds - correctness: incorrect 
Test result gathered : 0 bytes 
       Rank   0 [36/96] : correct               ( Fail )

The test failure correspond to overflows on the group count if the buffer size is smaller than the number of groups.
Also we should store the exscan in the params pack and do a dichotomic search.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a benchmark to evaluate different kernel fusion strategies for SAXPY-like operations. The strategies include a baseline with many small kernels, and three fused versions using sham::DeviceBuffer, sycl::buffer, and USM for kernel parameters. The benchmark is comprehensive, testing against in-order and out-of-order queues with varying data fragmentation.

My review has identified a critical out-of-bounds memory access bug in all three fused kernel implementations that needs to be addressed. I've also included several suggestions to improve performance, reduce code duplication, and fix other minor issues in the benchmark code. All original comments have been retained as they do not contradict the provided rules. Please see the detailed comments for specifics.


u32 tile_id = current_group_index * group_size + id.get_local_id(0);

current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a potential out-of-bounds memory access here. The kernel is launched with a global size of ngroups * group_size, which can be larger than the actual buffer size if the buffer size is not a multiple of group_size. This can cause tile_id to exceed current_params.size, leading to out-of-bounds reads and writes. You should add a bounds check to prevent this.

            if (tile_id < current_params.size) current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];


u32 tile_id = current_group_index * group_size + id.get_local_id(0);

current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Potential out-of-bounds memory access. The kernel's global size can be larger than the buffer size, causing tile_id to be out of bounds. A bounds check on tile_id against current_params.size is necessary to prevent this.

            if (tile_id < current_params.size) current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];


u32 tile_id = current_group_index * group_size + id.get_local_id(0);

current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Potential out-of-bounds memory access. The kernel's global size can be larger than the buffer size, causing tile_id to be out of bounds. A bounds check on tile_id against current_params.size is necessary to prevent this.

            if (tile_id < current_params.size) current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];

Comment on lines +360 to +363
for (size_t i = 0; i < x.size(); i++) {
x[i].complete_event_state(sycl::event{});
y[i].complete_event_state(sycl::event{});
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using a default-constructed sycl::event{} to complete the event state is incorrect here, as it discards the dependency information from the kernel execution. Although the e.wait() call makes this seem safe, it's brittle. You should pass the actual event e to complete_event_state to correctly track dependencies, consistent with the other _fuse_ implementations. This is crucial for correctness, especially if the explicit e.wait() is removed.

    for (size_t i = 0; i < x.size(); i++) {
        x[i].complete_event_state(e);
        y[i].complete_event_state(e);
    }

}
}

void saxpy_many_kernels_base_fuse_basic(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The functions saxpy_many_kernels_base_fuse_basic, saxpy_many_kernels_base_fuse_sycl_buffer, and saxpy_many_kernels_base_fuse_usm share a significant amount of duplicated code, especially the kernel implementation and setup logic. This makes maintenance difficult, as bug fixes or improvements must be applied in three places.

Consider refactoring the common logic into a helper function or a template. For instance, you could use a template parameterized by a type that provides the params buffer, abstracting away the differences in buffer creation and access.

Comment on lines +226 to +230
using atomic_ref_u32 = sycl::atomic_ref<
u32,
sycl::memory_order_relaxed,
sycl::memory_scope_device,
sycl::access::address_space::global_space>;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The using atomic_ref_u32 is defined but never used within the kernel. It can be removed.

Comment on lines +246 to +258
for (size_t i = 0; i < param_count; i++) {
current_params = params_acc[i];

u32 group_count_param = shambase::group_count(current_params.size, group_size);

if (current_group_id < group_count_param) {
local_params[0] = current_params;
local_group_index[0] = current_group_id;
break;
}

current_group_id -= group_count_param;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This loop performs a linear scan over the params_acc accessor to find the correct parameters for the workgroup. This can be inefficient for a large number of buffers, as it involves repeated global memory accesses for each workgroup. A more performant approach would be to compute prefix sums of group counts on the host and use a faster search (like binary search) in the kernel to locate the correct buffer parameters.

Comment on lines +313 to +317
using atomic_ref_u32 = sycl::atomic_ref<
u32,
sycl::memory_order_relaxed,
sycl::memory_scope_device,
sycl::access::address_space::global_space>;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The using atomic_ref_u32 is defined but never used within the kernel. It can be removed.

Comment on lines +331 to +343
for (size_t i = 0; i < param_count; i++) {
current_params = params_acc[i];

u32 group_count_param = shambase::group_count(current_params.size, group_size);

if (current_group_id < group_count_param) {
local_params[0] = current_params;
local_group_index[0] = current_group_id;
break;
}

current_group_id -= group_count_param;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This loop performs a linear scan over the params_acc pointer to find the correct parameters for the workgroup. This can be inefficient for a large number of buffers, as it involves repeated global memory accesses for each workgroup. A more performant approach would be to compute prefix sums of group counts on the host and use a faster search (like binary search) in the kernel to locate the correct buffer parameters.

});
});

e.wait();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The call to e.wait() is redundant here. The subsequent call to sycl::free(params_acc, q.q) is a blocking call on the host that waits for all commands previously submitted to q.q to complete. This includes the kernel that generates event e. Removing this explicit wait avoids unnecessarily stalling the host thread.

@github-actions
Copy link

Workflow report

workflow report corresponding to commit 4cb70d2
Commiter email is timothee.davidcleris@proton.me

Light CI is enabled. This will only run the basic tests and not the full tests.
Merging a PR require the job "on PR / all" to pass which is disabled in this case.

Pre-commit check report

Pre-commit check: ✅

trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check for merge conflicts................................................Passed
check that executables have shebangs.....................................Passed
check that scripts with shebangs are executable..........................Passed
check for added large files..............................................Passed
check for case conflicts.................................................Passed
check for broken symlinks................................................Passed
check yaml...............................................................Passed
detect private key.......................................................Passed
No-tabs checker..........................................................Passed
Tabs remover.............................................................Passed
Validate GitHub Workflows................................................Passed
clang-format.............................................................Passed
black....................................................................Passed
ruff check...............................................................Passed
Check doxygen headers....................................................Passed
Check license headers....................................................Passed
Check #pragma once.......................................................Passed
Check SYCL #include......................................................Passed
No ssh in git submodules remote..........................................Passed

Test pipeline can run.

Doxygen diff with main

Removed warnings : 0
New warnings : 0
Warnings count : 7545 → 7545 (0.0%)

Detailed changes :

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant