-
Notifications
You must be signed in to change notification settings - Fork 18
[Backends] benchmark: try strategies for kernel defragmentation/fusion #1471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thanks @tdavidcl for opening this PR! You can do multiple things directly here: Once the workflow completes a message will appear displaying informations related to the run. Also the PR gets automatically reviewed by gemini, you can: |
Summary of ChangesHello @tdavidcl, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new benchmark suite designed to rigorously evaluate the performance characteristics of various kernel defragmentation and fusion techniques. By comparing traditional individual kernel dispatches against several fused kernel implementations that leverage different SYCL memory management paradigms (DeviceBuffer, sycl::buffer, USM), the PR aims to identify optimal strategies for improving efficiency when dealing with numerous small computational tasks on heterogeneous devices. The tests are conducted across different queue types and fragmentation levels to provide a comprehensive performance analysis. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
Current state: The test failure correspond to overflows on the group count if the buffer size is smaller than the number of groups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a benchmark to evaluate different kernel fusion strategies for SAXPY-like operations. The strategies include a baseline with many small kernels, and three fused versions using sham::DeviceBuffer, sycl::buffer, and USM for kernel parameters. The benchmark is comprehensive, testing against in-order and out-of-order queues with varying data fragmentation.
My review has identified a critical out-of-bounds memory access bug in all three fused kernel implementations that needs to be addressed. I've also included several suggestions to improve performance, reduce code duplication, and fix other minor issues in the benchmark code. All original comments have been retained as they do not contradict the provided rules. Please see the detailed comments for specifics.
|
|
||
| u32 tile_id = current_group_index * group_size + id.get_local_id(0); | ||
|
|
||
| current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a potential out-of-bounds memory access here. The kernel is launched with a global size of ngroups * group_size, which can be larger than the actual buffer size if the buffer size is not a multiple of group_size. This can cause tile_id to exceed current_params.size, leading to out-of-bounds reads and writes. You should add a bounds check to prevent this.
if (tile_id < current_params.size) current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];|
|
||
| u32 tile_id = current_group_index * group_size + id.get_local_id(0); | ||
|
|
||
| current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential out-of-bounds memory access. The kernel's global size can be larger than the buffer size, causing tile_id to be out of bounds. A bounds check on tile_id against current_params.size is necessary to prevent this.
if (tile_id < current_params.size) current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];|
|
||
| u32 tile_id = current_group_index * group_size + id.get_local_id(0); | ||
|
|
||
| current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential out-of-bounds memory access. The kernel's global size can be larger than the buffer size, causing tile_id to be out of bounds. A bounds check on tile_id against current_params.size is necessary to prevent this.
if (tile_id < current_params.size) current_params.y[tile_id] = a * current_params.x[tile_id] + current_params.y[tile_id];| for (size_t i = 0; i < x.size(); i++) { | ||
| x[i].complete_event_state(sycl::event{}); | ||
| y[i].complete_event_state(sycl::event{}); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a default-constructed sycl::event{} to complete the event state is incorrect here, as it discards the dependency information from the kernel execution. Although the e.wait() call makes this seem safe, it's brittle. You should pass the actual event e to complete_event_state to correctly track dependencies, consistent with the other _fuse_ implementations. This is crucial for correctness, especially if the explicit e.wait() is removed.
for (size_t i = 0; i < x.size(); i++) {
x[i].complete_event_state(e);
y[i].complete_event_state(e);
}| } | ||
| } | ||
|
|
||
| void saxpy_many_kernels_base_fuse_basic( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functions saxpy_many_kernels_base_fuse_basic, saxpy_many_kernels_base_fuse_sycl_buffer, and saxpy_many_kernels_base_fuse_usm share a significant amount of duplicated code, especially the kernel implementation and setup logic. This makes maintenance difficult, as bug fixes or improvements must be applied in three places.
Consider refactoring the common logic into a helper function or a template. For instance, you could use a template parameterized by a type that provides the params buffer, abstracting away the differences in buffer creation and access.
| using atomic_ref_u32 = sycl::atomic_ref< | ||
| u32, | ||
| sycl::memory_order_relaxed, | ||
| sycl::memory_scope_device, | ||
| sycl::access::address_space::global_space>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for (size_t i = 0; i < param_count; i++) { | ||
| current_params = params_acc[i]; | ||
|
|
||
| u32 group_count_param = shambase::group_count(current_params.size, group_size); | ||
|
|
||
| if (current_group_id < group_count_param) { | ||
| local_params[0] = current_params; | ||
| local_group_index[0] = current_group_id; | ||
| break; | ||
| } | ||
|
|
||
| current_group_id -= group_count_param; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop performs a linear scan over the params_acc accessor to find the correct parameters for the workgroup. This can be inefficient for a large number of buffers, as it involves repeated global memory accesses for each workgroup. A more performant approach would be to compute prefix sums of group counts on the host and use a faster search (like binary search) in the kernel to locate the correct buffer parameters.
| using atomic_ref_u32 = sycl::atomic_ref< | ||
| u32, | ||
| sycl::memory_order_relaxed, | ||
| sycl::memory_scope_device, | ||
| sycl::access::address_space::global_space>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for (size_t i = 0; i < param_count; i++) { | ||
| current_params = params_acc[i]; | ||
|
|
||
| u32 group_count_param = shambase::group_count(current_params.size, group_size); | ||
|
|
||
| if (current_group_id < group_count_param) { | ||
| local_params[0] = current_params; | ||
| local_group_index[0] = current_group_id; | ||
| break; | ||
| } | ||
|
|
||
| current_group_id -= group_count_param; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop performs a linear scan over the params_acc pointer to find the correct parameters for the workgroup. This can be inefficient for a large number of buffers, as it involves repeated global memory accesses for each workgroup. A more performant approach would be to compute prefix sums of group counts on the host and use a faster search (like binary search) in the kernel to locate the correct buffer parameters.
| }); | ||
| }); | ||
|
|
||
| e.wait(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to e.wait() is redundant here. The subsequent call to sycl::free(params_acc, q.q) is a blocking call on the host that waits for all commands previously submitted to q.q to complete. This includes the kernel that generates event e. Removing this explicit wait avoids unnecessarily stalling the host thread.
Workflow reportworkflow report corresponding to commit 4cb70d2 Light CI is enabled. This will only run the basic tests and not the full tests. Pre-commit check reportPre-commit check: ✅ Test pipeline can run. Doxygen diff with
|
No description provided.