Lhongpei commented Nov 17, 2025

Description

This PR introduces a significant performance optimization for PDLPx, particularly for problems with highly sparse constraint matrices.

When a row of the constraint matrix (or its transpose) is highly sparse (i.e., has very few non-zeros), launching a full CUSPARSE SpMV kernel for the primal or dual update can be inefficient due to kernel launch overhead and low computational density.

This change introduces two new, "fused" CUDA kernels:

  • fused_compute_next_pdhg_primal_solution_kernel
  • fused_compute_next_pdhg_dual_solution_kernel

These kernels perform the sparse matrix-vector multiplication (SpMV) using a simple for-loop (which is more efficient for highly sparse rows) and fuse it with the subsequent PDHG update logic (e.g., projection onto bounds, reflection). This approach avoids the overhead of separate kernel launches and improves data locality.
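To make the fusion concrete, here is a minimal sketch of the fused primal update pattern, assuming A^T is stored in CSR form and `tau` is the primal step size. It is illustrative only; the actual `fused_compute_next_pdhg_primal_solution_kernel` in this PR may differ in signature, step-size handling, and the exact update formula.

```cuda
// Sketch of a fused primal update: one thread per variable computes its row of
// A^T * y with a plain for-loop, then applies the PDHG step, projection onto
// [var_lb, var_ub], and reflection in the same kernel.
__global__ void fused_primal_update_sketch(
    int num_vars,
    const int*    __restrict__ at_row_ptr,       // CSR row pointers of A^T
    const int*    __restrict__ at_col_ind,       // CSR column indices of A^T
    const double* __restrict__ at_vals,          // CSR values of A^T
    const double* __restrict__ dual_solution,    // y^k
    const double* __restrict__ objective,        // c
    const double* __restrict__ var_lb,
    const double* __restrict__ var_ub,
    const double* __restrict__ primal_solution,  // x^k
    double*       __restrict__ next_primal,      // x^{k+1}
    double*       __restrict__ reflected_primal, // 2*x^{k+1} - x^k
    double tau)                                  // primal step size
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= num_vars) return;

  // For-loop SpMV row: cheap when this row of A^T has only a few non-zeros.
  double aty = 0.0;
  for (int k = at_row_ptr[j]; k < at_row_ptr[j + 1]; ++k) {
    aty += at_vals[k] * dual_solution[at_col_ind[k]];
  }

  // Fused PDHG primal step, projection, and reflection.
  double x     = primal_solution[j];
  double x_new = x - tau * (objective[j] - aty);
  x_new        = fmin(fmax(x_new, var_lb[j]), var_ub[j]);
  next_primal[j]      = x_new;
  reflected_primal[j] = 2.0 * x_new - x;
}
```

The fused dual kernel follows the same pattern, with A in place of A^T, const_lb/const_ub as the projection bounds, and the dual step size.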

Implementation Details

  • Fused Primal Kernel: Computes the dual product (A^T @ dual_solution) and fuses it with the primal variable update, projection (against var_lb, var_ub), and reflection.
  • Fused Dual Kernel: Computes the primal product (A @ primal_solution) and fuses it with the dual variable update, projection (against const_lb, const_ub), and reflection.
  • Auto-Algorithm Selection: The fused kernel path is automatically selected for a primal or dual update when every row (or column) has fewer than 100 non-zeros and the matrix density is below 0.01; both thresholds can be tuned further. For denser matrices, the existing CUSPARSE-based update is retained (see the host-side sketch below).
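To make the selection rule concrete, here is a host-side sketch of the heuristic. The thresholds (fewer than 100 non-zeros per row/column, density below 0.01) come from the description above; the struct, enum, and function names are illustrative and not the PR's actual API.

```cuda
// Illustrative sketch only: names and layout are assumptions, not the PR's API.
struct SparsityStats {
  int max_nnz_per_row;   // max non-zeros over the rows (or columns) this update touches
  long long total_nnz;   // total non-zeros in the matrix
  long long rows;
  long long cols;
};

enum class UpdateKernel { Fused, Cusparse };

UpdateKernel pick_update_kernel(const SparsityStats& s) {
  const double density = static_cast<double>(s.total_nnz) /
                         (static_cast<double>(s.rows) * static_cast<double>(s.cols));
  // Use the fused for-loop kernel only when every row is short and the matrix
  // is globally sparse; otherwise fall back to the cuSPARSE SpMV path.
  if (s.max_nnz_per_row < 100 && density < 0.01) {
    return UpdateKernel::Fused;
  }
  return UpdateKernel::Cusparse;
}
```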

Performance Improvements

This fusion results in substantial performance gains, as demonstrated on Hans' Benchmark and the MIPLIB dataset.

Hans' Benchmark Examples

Model  | Iterations | Previous (CUSPARSE) | Fused Kernel | Speedup
-------|------------|---------------------|--------------|--------
cont11 | 799200     | 31.23s              | 7.68s        | 4.07x
thk48  | 18000      | 18.79s              | 13.21s       | 1.42x

MIPLIB Dataset Summary

The results across the MIPLIB dataset show a consistent improvement. Both methods were run for the same number of iterations. The auto-selection chose the fused update on 169 instances; the table below summarizes those instances.

Metric                    | CUSPARSE-Based Update | Fused Update
--------------------------|-----------------------|-------------
GEOMEAN                   | 0.369009528           | 0.200248621
SGM10                     | 2.34091312            | 1.69636033
Better Count              | 3 / 169               | 166 / 169
Mean Relative Improvement | -                     | 32.47%

Lhongpei commented Dec 4, 2025

New Fused Kernel Selection Strategy

Tuning Strategies

The solver supports three tuning modes, controlled via pdhg_parameters:

1. Heuristic (Old Version)

PDHG_TUNING_HEURISTIC statically decides whether to use the fused kernel or cuSPARSE based on the non-zero (NNZ) count and density of the constraint matrix columns/rows.

  • Pros: Instant decision, zero overhead.
  • Cons: May not be accurate for complex sparsity patterns.

2. Benchmark (Current Default)

PDHG_TUNING_BENCHMARK performs an online benchmark during the solver initialization phase (a CUDA-events timing sketch follows this list of modes):

  1. Measure: Runs both the Fused and cuSPARSE kernels for a small batch of iterations (e.g., 5 warmup + 10 measure).
  2. Compare: Records the precise execution time using CUDA Events.
  3. Select: Independently selects the fastest kernel for the Primal step and the Dual step.

3. Fixed (cuSPARSE or Fused)

PDHG_CUSPARSE_FIX and PDHG_FUSED_FIX pin the update path to the cuSPARSE-based kernel or the fused kernel, respectively.
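As referenced in the Benchmark mode above, the per-path timing can be done with CUDA events. Below is a sketch under the assumption that each path is exposed as a callable that launches one primal (or dual) update on the default stream; the helper name and the warmup/measure defaults mirror the description above and are not the solver's actual API.

```cuda
#include <cuda_runtime.h>

// Sketch: time an update path by launching it a few times and averaging with
// CUDA events. The callable interface is an assumption for illustration.
template <typename LaunchFn>
float time_update_ms(LaunchFn launch, int warmup = 5, int measured = 10) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (int i = 0; i < warmup; ++i) launch();    // warmup launches (not timed)

  cudaEventRecord(start);
  for (int i = 0; i < measured; ++i) launch();  // timed launches
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms / measured;                         // average time per iteration
}

// Usage sketch: pick the faster path independently for primal and dual steps.
// bool primal_use_fused =
//     time_update_ms(launch_fused_primal) < time_update_ms(launch_cusparse_primal);
// bool dual_use_fused =
//     time_update_ms(launch_fused_dual) < time_update_ms(launch_cusparse_dual);
```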

Example Log

---------------------------------------------------------------------------------------
                                    cuPDLPx v0.0.0-core-only                                    
                        A GPU-Accelerated First-Order LP Solver                        
               (c) Haihao Lu, Massachusetts Institute of Technology, 2025              
---------------------------------------------------------------------------------------
problem:
  variables     : 40398
  constraints   : 160792
  nnz(A)        : 399990
settings:
  iter_limit         : 2147483647
  time_limit         : 3600.00 sec
  eps_opt            : 1.0e-04
  eps_feas           : 1.0e-04
  eps_infeas_detect  : 1.0e-10
[Auto-Tuning] Strategy: BENCHMARK (Running tests...)
  Primal: Selected cuSPARSE (0.144 ms) < FUSED (41.473 ms)
  Dual  : Selected FUSED (0.051 ms) < cuSPARSE (0.169 ms)
---------------------------------------------------------------------------------------
   runtime     |     objective      |   absolute residuals    |   relative residuals    
  iter   time  |  pr obj    du obj  |  pr res  du res   gap   |  pr res  du res   gap   
---------------------------------------------------------------------------------------
     0 1.8e-04 |  0.0e+00   0.0e+00 | 3.1e+00 1.0e+00 0.0e+00 | 4.2e-03 5.0e-01 0.0e+00 
    10 5.9e-04 |  1.4e-02   1.7e-02 | 3.4e-01 5.9e-02 2.6e-03 | 4.5e-04 3.0e-02 2.5e-03 
    20 9.8e-04 |  1.4e-02   1.7e-02 | 4.2e-01 6.8e-02 3.2e-03 | 5.6e-04 3.4e-02 3.1e-03 

Benchmark Results

===========================================================================
            BENCHMARK ANALYSIS REPORT
===========================================================================
Total Instances: 376
---------------------------------------------------------------------------
Time Metric               | Fused (s)    | Non-Fused (s)  | Speedup   
---------------------------------------------------------------------------
SGM (shift=0.1)           | 0.6647       | 0.8569         | x1.29
SGM (shift=1.0)           | 1.4480       | 1.7304         | x1.20
SGM (shift=10.0)          | 3.9993       | 4.4541         | x1.11
---------------------------------------------------------------------------
Avg Iterations            | 611629.4     | 610687.9       | --0.15%
Optimal Solved            | 376          | 376            | +0
---------------------------------------------------------------------------
Fused Speedup Distribution (Non-Fused Time / Fused Time):
  Faster by >=   1% (x1.01+) :  274 / 376 (72.9%)
  Faster by >=   5% (x1.05+) :  238 / 376 (63.3%)
  Faster by >=  10% (x1.10+) :  219 / 376 (58.2%)
  Faster by >=  50% (x1.50+) :   38 / 376 (10.1%)
  Faster by >= 100% (x2.00+) :   36 / 376 (9.6%)
===========================================================================

Lhongpei commented Dec 4, 2025

@ZedongPeng I think selection via benchmarking should be more stable (at the very least, it ensures the fused kernel does not have a negative impact), and the overhead it introduces is small since it only requires a few iterations.
