Conversation

@codeflash-ai codeflash-ai bot commented Jan 26, 2026

📄 63% (0.63x) speedup for complex_activation in code_to_optimize/complex_activation.py

⏱️ Runtime : 2.82 milliseconds → 1.73 milliseconds (best of 228 runs)

📝 Explanation and details

The optimized code achieves a 63% speedup (from 2.82 ms to 1.73 ms) by adding a single @torch.compile() decorator to the function. This optimization addresses the primary performance bottleneck: kernel launch overhead from sequential PyTorch operations.

What changed:

  • Added @torch.compile() decorator to enable PyTorch's JIT compilation

Why this creates a speedup:

The original function performs 6 sequential element-wise operations on GPU tensors:

  1. torch.sin(x)
  2. Multiplication with torch.cos(x)
  3. Addition with torch.exp(-x.abs())
  4. Division by (1 + x.pow(2))
  5. Multiplication of torch.tanh(x) * torch.sigmoid(x)
  6. Subtraction of 0.5 * x.pow(3)
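
Pieced together from the six steps above, the original function presumably looks roughly like the sketch below, with the decorator being the only change in this PR (the exact grouping of terms in code_to_optimize/complex_activation.py is an assumption):

```python
import torch

@torch.compile()  # the single change introduced by this PR
def complex_activation(x: torch.Tensor) -> torch.Tensor:
    # Steps 1-4: (sin * cos + exp(-|x|)) / (1 + x^2)
    base = (torch.sin(x) * torch.cos(x) + torch.exp(-x.abs())) / (1 + x.pow(2))
    # Steps 5-6: add tanh * sigmoid, subtract the cubic term
    return base + torch.tanh(x) * torch.sigmoid(x) - 0.5 * x.pow(3)
```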

Without compilation, each operation launches a separate GPU kernel. The line profiler shows the most expensive lines are torch.exp(-x.abs()) (40.2% of time) and the division operation (34.8%), which involve multiple kernel launches for computing intermediate values.
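
To see the per-operation kernels directly, one could profile an un-decorated (eager) copy of the function with torch.profiler on a CUDA machine (an illustrative sketch; the report above used a line profiler, not this code):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def complex_activation_eager(x):
    # Un-decorated copy of the same math (grouping assumed as above)
    base = (torch.sin(x) * torch.cos(x) + torch.exp(-x.abs())) / (1 + x.pow(2))
    return base + torch.tanh(x) * torch.sigmoid(x) - 0.5 * x.pow(3)

x = torch.randn(1024, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    complex_activation_eager(x)

# Each element-wise op (sin, cos, exp, div, ...) shows up as its own CUDA kernel.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```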

How torch.compile() optimizes this:

PyTorch's compiler performs operator fusion: it analyzes the computation graph and merges these 6+ separate operations into a single optimized kernel. This eliminates:

  • Repeated GPU kernel launches (the dominant overhead for small operations)
  • Intermediate tensor allocations and extra round trips through GPU global memory
  • Host-device synchronization points between operations

The fused kernel computes the entire activation function in one GPU execution pass, reading input once and writing output once, instead of 6+ round trips through memory.
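
A rough way to reproduce the before/after comparison on your own hardware (a sketch; the tensor size and iteration count are arbitrary, and complex_activation_eager is the un-decorated helper from the profiling sketch above):

```python
import torch

def bench(fn, x, iters=100):
    for _ in range(3):        # warm-up; also triggers compilation for the compiled variant
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # average milliseconds per call

x = torch.randn(1_000_000, device="cuda")
print("eager:   ", bench(complex_activation_eager, x), "ms")
print("compiled:", bench(torch.compile(complex_activation_eager), x), "ms")
```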

Impact:
This optimization is particularly effective for:

  • Functions with many sequential element-wise operations (like custom activation functions)
  • Medium to large tensor sizes where kernel launch overhead dominates
  • Code paths called frequently during training or inference loops

The speedup would be most beneficial if this activation is used in neural network layers that execute thousands of times per training epoch or inference batch.
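
For instance, the compiled activation could be dropped into a layer like the hypothetical one below (a usage sketch, not part of this PR; MLPBlock and its dimensions are made up):

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Hypothetical layer applying the compiled activation after a linear projection."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # complex_activation is the @torch.compile()-decorated function from this PR;
        # it is compiled once on first call and reused on every subsequent forward pass.
        return complex_activation(self.proj(x))
```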

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 11 Passed |
| 🌀 Generated Regression Tests | 🔘 None Found |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_complex_activation.py::TestComplexActivation.test_deterministic | 465μs | 261μs | 77.8% ✅ |
| test_complex_activation.py::TestComplexActivation.test_gradient_flow | 295μs | 233μs | 26.6% ✅ |
| test_complex_activation.py::TestComplexActivation.test_negative_input | 231μs | 149μs | 55.2% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_bounded | 267μs | 157μs | 69.9% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_device | 267μs | 156μs | 70.7% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_dtype | 267μs | 157μs | 70.2% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_is_finite | 265μs | 156μs | 69.2% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_shape | 269μs | 162μs | 66.2% ✅ |
| test_complex_activation.py::TestComplexActivation.test_positive_input | 232μs | 147μs | 57.5% ✅ |
| test_complex_activation.py::TestComplexActivation.test_zero_input | 252μs | 149μs | 69.3% ✅ |

To edit these changes, run `git checkout codeflash/optimize-complex_activation-mkvjytje` and push.


@codeflash-ai codeflash-ai bot requested a review from aseembits93 January 26, 2026 19:20
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash and 🎯 Quality: High labels Jan 26, 2026