Perf: Add fused CUDA kernels to accelerate sparse PDHG updates #33
Conversation
New Strategy of Fused Kernel

Tuning Strategies

The solver supports three tuning modes:
1. Heuristic (old version)
2. Benchmark (current default)
3. Fixed: cuSPARSE or fused
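The benchmark mode can be implemented by timing a few iterations of each code path and keeping the faster one. Below is a minimal host-side sketch of that idea; `SpmvPath`, `select_spmv_path`, and `run_one_update` are hypothetical names for illustration, not the actual API of this PR.

```cuda
#include <cuda_runtime.h>

enum class SpmvPath { CuSparse, Fused };

// Hypothetical: performs one PDHG primal+dual update using the chosen path.
void run_one_update(SpmvPath path);

// Time a few iterations of each path and keep the faster one. Because only a
// handful of iterations are sampled, the selection overhead stays small, and
// the fused path is only chosen when it does not slow the solver down.
SpmvPath select_spmv_path(int warmup_iters, int timed_iters) {
    auto time_path = [&](SpmvPath path) {
        for (int i = 0; i < warmup_iters; ++i) run_one_update(path);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        for (int i = 0; i < timed_iters; ++i) run_one_update(path);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    };
    return time_path(SpmvPath::Fused) < time_path(SpmvPath::CuSparse)
               ? SpmvPath::Fused
               : SpmvPath::CuSparse;
}
```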
Example Log

[Benchmark Results table omitted]
@ZedongPeng I think selection with benchmarking could be more stable (at the least, it ensures the fused kernel does not have a negative impact), and the overhead it introduces can be small, since it only requires a few iterations.
Description
This PR introduces a significant performance optimization for PDLPx, particularly for problems with highly sparse constraint matrices.
When a row of the constraint matrix (or its transpose) is highly sparse (i.e., has very few non-zeros), launching a full cuSPARSE SpMV kernel for the primal or dual update can be inefficient due to kernel launch overhead and low computational density.
This change introduces two new, "fused" CUDA kernels:

- `fused_compute_next_pdhg_primal_solution_kernel`
- `fused_compute_next_pdhg_dual_solution_kernel`

These kernels perform the sparse matrix-vector multiplication (SpMV) with a simple for-loop (which is more efficient for highly sparse rows) and fuse it with the subsequent PDHG update logic (e.g., projection onto bounds, reflection). This approach avoids the overhead of separate kernel launches and improves data locality.
Implementation Details

- `fused_compute_next_pdhg_primal_solution_kernel`: computes the transposed SpMV (`A^T @ dual_solution`) and fuses it with the primal variable update, projection (against `var_lb`, `var_ub`), and reflection.
- `fused_compute_next_pdhg_dual_solution_kernel`: computes the SpMV (`A @ primal_solution`) and fuses it with the dual variable update, projection (against `const_lb`, `const_ub`), and reflection.
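To make the fusion concrete, here is a minimal sketch of what the primal kernel could look like, assuming `A^T` is stored in CSR format and the standard PDHG step `x_next = proj_[var_lb, var_ub](x - tau * (c - A^T y))`. All names and the exact update form are illustrative assumptions, not the PR's actual code.

```cuda
// Hedged sketch: one thread per primal variable; each thread walks its
// (short) CSR row of A^T with a plain for-loop instead of a cuSPARSE SpMV,
// then applies the PDHG step, bound projection, and reflection in place.
__global__ void fused_primal_update_sketch(
    int n,                               // number of primal variables
    const int*    __restrict__ row_ptr,  // CSR row pointers of A^T
    const int*    __restrict__ col_ind,  // CSR column indices of A^T
    const double* __restrict__ vals,     // CSR values of A^T
    const double* __restrict__ dual,     // current dual solution y
    const double* __restrict__ primal,   // current primal solution x
    const double* __restrict__ obj,      // objective coefficients c
    const double* __restrict__ var_lb,
    const double* __restrict__ var_ub,
    double tau,                          // primal step size
    double*       __restrict__ next_primal,
    double*       __restrict__ reflected)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Row-wise SpMV: (A^T y)_i via a simple loop over the row's non-zeros.
    double aty = 0.0;
    for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
        aty += vals[k] * dual[col_ind[k]];

    // Fused PDHG gradient step and projection onto [var_lb, var_ub].
    double x = primal[i] - tau * (obj[i] - aty);
    x = fmin(fmax(x, var_lb[i]), var_ub[i]);

    next_primal[i] = x;
    reflected[i]   = 2.0 * x - primal[i];  // reflection used by the dual update
}
```

The dual kernel would mirror this structure with `A`, the dual step size, and projection against `const_lb`, `const_ub`.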
Performance Improvements

This fusion results in substantial performance gains, as demonstrated on Hans' Benchmark and the MIPLIB dataset.
Hans' Benchmark Examples
MIPLIB Dataset Summary
The results across the MIPLIB dataset are excellent. Both methods were run for the same number of iterations.
According to the auto-selection, 169 instances use the fused update.