Lhongpei commented Nov 17, 2025

Description

This PR introduces a significant performance optimization for PDLPx, particularly for problems with highly sparse constraint matrices.

When a row of the constraint matrix (or its transpose) is highly sparse (i.e., has very few non-zeros), launching a full CUSPARSE SpMV kernel for the primal or dual update can be inefficient due to kernel launch overhead and low computational density.

This change introduces two new, "fused" CUDA kernels:

  • fused_compute_next_pdhg_primal_solution_kernel
  • fused_compute_next_pdhg_dual_solution_kernel

These kernels perform the sparse matrix-vector multiplication (SpMV) using a simple for-loop (which is more efficient for highly sparse rows) and fuse it with the subsequent PDHG update logic (e.g., projection onto bounds, reflection). This approach avoids the overhead of separate kernel launches and improves data locality.
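To make the fusion concrete, here is a minimal sketch of the fused primal update pattern, assuming A^T is stored in CSR form and `tau` is the primal step size. It is illustrative only; the actual `fused_compute_next_pdhg_primal_solution_kernel` in this PR may differ in signature, step-size handling, and the exact update formula.

```cuda
// Sketch of a fused primal update: one thread per variable computes its row of
// A^T * y with a plain for-loop, then applies the PDHG step, projection onto
// [var_lb, var_ub], and reflection in the same kernel.
__global__ void fused_primal_update_sketch(
    int num_vars,
    const int*    __restrict__ at_row_ptr,       // CSR row pointers of A^T
    const int*    __restrict__ at_col_ind,       // CSR column indices of A^T
    const double* __restrict__ at_vals,          // CSR values of A^T
    const double* __restrict__ dual_solution,    // y^k
    const double* __restrict__ objective,        // c
    const double* __restrict__ var_lb,
    const double* __restrict__ var_ub,
    const double* __restrict__ primal_solution,  // x^k
    double*       __restrict__ next_primal,      // x^{k+1}
    double*       __restrict__ reflected_primal, // 2*x^{k+1} - x^k
    double tau)                                  // primal step size
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (j >= num_vars) return;

  // For-loop SpMV row: cheap when this row of A^T has only a few non-zeros.
  double aty = 0.0;
  for (int k = at_row_ptr[j]; k < at_row_ptr[j + 1]; ++k) {
    aty += at_vals[k] * dual_solution[at_col_ind[k]];
  }

  // Fused PDHG primal step, projection, and reflection.
  double x     = primal_solution[j];
  double x_new = x - tau * (objective[j] - aty);
  x_new        = fmin(fmax(x_new, var_lb[j]), var_ub[j]);
  next_primal[j]      = x_new;
  reflected_primal[j] = 2.0 * x_new - x;
}
```

The fused dual kernel follows the same pattern, with A in place of A^T, const_lb/const_ub as the projection bounds, and the dual step size.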

Implementation Details

  • Fused Primal Kernel: Computes the dual product (A^T @ dual_solution) and fuses it with the primal variable update, projection (against var_lb, var_ub), and reflection.
  • Fused Dual Kernel: Computes the primal product (A @ primal_solution) and fuses it with the dual variable update, projection (against const_lb, const_ub), and reflection.
  • Auto-Algorithm Selection: The fused kernel path is automatically selected for a primal or dual update when every row (or column) has fewer than 100 non-zeros and the matrix density is below 0.01; both thresholds can be tuned further. For denser matrices, the existing CUSPARSE-based update is retained (see the host-side sketch below).
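To make the selection rule concrete, here is a host-side sketch of the heuristic. The thresholds (fewer than 100 non-zeros per row/column, density below 0.01) come from the description above; the struct, enum, and function names are illustrative and not the PR's actual API.

```cuda
// Illustrative sketch only: names and layout are assumptions, not the PR's API.
struct SparsityStats {
  int max_nnz_per_row;   // max non-zeros over the rows (or columns) this update touches
  long long total_nnz;   // total non-zeros in the matrix
  long long rows;
  long long cols;
};

enum class UpdateKernel { Fused, Cusparse };

UpdateKernel pick_update_kernel(const SparsityStats& s) {
  const double density = static_cast<double>(s.total_nnz) /
                         (static_cast<double>(s.rows) * static_cast<double>(s.cols));
  // Use the fused for-loop kernel only when every row is short and the matrix
  // is globally sparse; otherwise fall back to the cuSPARSE SpMV path.
  if (s.max_nnz_per_row < 100 && density < 0.01) {
    return UpdateKernel::Fused;
  }
  return UpdateKernel::Cusparse;
}
```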

Performance Improvements

This fusion results in substantial performance gains, as demonstrated on Hans' Benchmark and the MIPLIB dataset.

Hans' Benchmark Examples

Model  | Iterations | Previous (CUSPARSE) | Fused Kernel | Speedup
-------|------------|---------------------|--------------|--------
cont11 | 799200     | 31.23s              | 7.68s        | 4.07x
thk48  | 18000      | 18.79s              | 13.21s       | 1.42x

MIPLIB Dataset Summary

The results across the MIPLIB dataset show a consistent improvement. Both methods were run for the same number of iterations. The auto-selection chose the fused update on 169 instances; the table below summarizes those instances.

Metric                    | CUSPARSE-Based Update | Fused Update
--------------------------|-----------------------|-------------
GEOMEAN                   | 0.369009528           | 0.200248621
SGM10                     | 2.34091312            | 1.69636033
Better Count              | 3 / 169               | 166 / 169
Mean Relative Improvement | -                     | 32.47%

Lhongpei commented Dec 4, 2025

New Fused Kernel Selection Strategy

Tuning Strategies

The solver supports three tuning modes, controlled via pdhg_parameters:

1. Heuristic (Old Version)

PDHG_TUNING_HEURISTIC statically decides whether to use the fused kernel or cuSPARSE based on the non-zero (NNZ) count and density of the constraint matrix columns/rows.

  • Pros: Instant decision, zero overhead.
  • Cons: May not be accurate for complex sparsity patterns.

2. Benchmark (Current Default)

PDHG_TUNING_BENCHMARK performs an online benchmark during the solver initialization phase (a CUDA-events timing sketch follows this list of modes):

  1. Measure: Runs both the Fused and cuSPARSE kernels for a small batch of iterations (e.g., 5 warmup + 10 measure).
  2. Compare: Records the precise execution time using CUDA Events.
  3. Select: Independently selects the fastest kernel for the Primal step and the Dual step.

3. Fixed (cuSPARSE or Fused)

PDHG_CUSPARSE_FIX and PDHG_FUSED_FIX pin the update path to the cuSPARSE-based kernel or the fused kernel, respectively.
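As referenced in the Benchmark mode above, the per-path timing can be done with CUDA events. Below is a sketch under the assumption that each path is exposed as a callable that launches one primal (or dual) update on the default stream; the helper name and the warmup/measure defaults mirror the description above and are not the solver's actual API.

```cuda
#include <cuda_runtime.h>

// Sketch: time an update path by launching it a few times and averaging with
// CUDA events. The callable interface is an assumption for illustration.
template <typename LaunchFn>
float time_update_ms(LaunchFn launch, int warmup = 5, int measured = 10) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (int i = 0; i < warmup; ++i) launch();    // warmup launches (not timed)

  cudaEventRecord(start);
  for (int i = 0; i < measured; ++i) launch();  // timed launches
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms / measured;                         // average time per iteration
}

// Usage sketch: pick the faster path independently for primal and dual steps.
// bool primal_use_fused =
//     time_update_ms(launch_fused_primal) < time_update_ms(launch_cusparse_primal);
// bool dual_use_fused =
//     time_update_ms(launch_fused_dual) < time_update_ms(launch_cusparse_dual);
```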

Example Log

---------------------------------------------------------------------------------------
                                    cuPDLPx v0.0.0-core-only                                    
                        A GPU-Accelerated First-Order LP Solver                        
               (c) Haihao Lu, Massachusetts Institute of Technology, 2025              
---------------------------------------------------------------------------------------
problem:
  variables     : 40398
  constraints   : 160792
  nnz(A)        : 399990
settings:
  iter_limit         : 2147483647
  time_limit         : 3600.00 sec
  eps_opt            : 1.0e-04
  eps_feas           : 1.0e-04
  eps_infeas_detect  : 1.0e-10
[Auto-Tuning] Strategy: BENCHMARK (Running tests...)
  Primal: Selected cuSPARSE (0.144 ms) < FUSED (41.473 ms)
  Dual  : Selected FUSED (0.051 ms) < cuSPARSE (0.169 ms)
---------------------------------------------------------------------------------------
   runtime     |     objective      |   absolute residuals    |   relative residuals    
  iter   time  |  pr obj    du obj  |  pr res  du res   gap   |  pr res  du res   gap   
---------------------------------------------------------------------------------------
     0 1.8e-04 |  0.0e+00   0.0e+00 | 3.1e+00 1.0e+00 0.0e+00 | 4.2e-03 5.0e-01 0.0e+00 
    10 5.9e-04 |  1.4e-02   1.7e-02 | 3.4e-01 5.9e-02 2.6e-03 | 4.5e-04 3.0e-02 2.5e-03 
    20 9.8e-04 |  1.4e-02   1.7e-02 | 4.2e-01 6.8e-02 3.2e-03 | 5.6e-04 3.4e-02 3.1e-03 

Benchmark Results

===========================================================================
            BENCHMARK ANALYSIS REPORT
===========================================================================
Total Instances: 376
---------------------------------------------------------------------------
Time Metric               | Fused (s)    | Non-Fused (s)  | Speedup   
---------------------------------------------------------------------------
SGM (shift=0.1)           | 0.6647       | 0.8569         | x1.29
SGM (shift=1.0)           | 1.4480       | 1.7304         | x1.20
SGM (shift=10.0)          | 3.9993       | 4.4541         | x1.11
---------------------------------------------------------------------------
Avg Iterations            | 611629.4     | 610687.9       | --0.15%
Optimal Solved            | 376          | 376            | +0
---------------------------------------------------------------------------
Fused Speedup Distribution (Non-Fused Time / Fused Time):
  Faster by >=   1% (x1.01+) :  274 / 376 (72.9%)
  Faster by >=   5% (x1.05+) :  238 / 376 (63.3%)
  Faster by >=  10% (x1.10+) :  219 / 376 (58.2%)
  Faster by >=  50% (x1.50+) :   38 / 376 (10.1%)
  Faster by >= 100% (x2.00+) :   36 / 376 (9.6%)
===========================================================================

Lhongpei commented Dec 4, 2025

@ZedongPeng I think selection via benchmarking should be more stable (at the very least, it ensures the fused kernel does not have a negative impact), and the overhead it introduces is small since it only requires a few iterations.
