codeflash-ai bot commented Jan 26, 2026

📄 77% (0.77x) speedup for complex_activation in code_to_optimize/complex_activation.py

⏱️ Runtime: 2.77 milliseconds → 1.57 milliseconds (best of 134 runs)

📝 Explanation and details

The optimized code achieves a 77% speedup (from 2.77ms to 1.57ms) by applying torch.compile to the complex_activation function. This decorator enables kernel fusion, which is critical for this workload.

Why this optimization works:

The original implementation performs 6 sequential element-wise operations on tensors (reconstructed in the sketch after this list):

  1. torch.sin(x)
  2. Multiply by torch.cos(x)
  3. Add torch.exp(-x.abs())
  4. Divide by (1 + x.pow(2))
  5. Multiply torch.tanh(x) * torch.sigmoid(x)
  6. Subtract 0.5 * x.pow(3)
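
A minimal sketch reconstructing the function from the operations above; the exact grouping in code_to_optimize/complex_activation.py may differ:

```python
import torch

# Hedged reconstruction of the original function from the six operations
# listed above; the real code_to_optimize/complex_activation.py may group
# the terms differently.
def complex_activation(x: torch.Tensor) -> torch.Tensor:
    result = torch.sin(x) * torch.cos(x)                 # ops 1-2
    result = result + torch.exp(-x.abs())                # op 3
    result = result / (1 + x.pow(2))                     # op 4
    result = result * torch.tanh(x) * torch.sigmoid(x)   # op 5
    result = result - 0.5 * x.pow(3)                     # op 6
    return result
```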

Without compilation, each operation launches a separate CUDA kernel (or CPU loop), incurring:

  • Kernel launch overhead (~1-10 microseconds per launch on GPU)
  • Memory round-trips (write intermediate results to global memory, then read them back)
  • Limited optimization across operation boundaries

The line profiler shows these operations dominate the runtime, with torch.exp(-x.abs()) and the division accounting for 40.4% and 35.2% of the time, respectively.

What torch.compile does:

By decorating the function with @torch.compile, PyTorch's compiler:

  1. Traces the computation graph through all operations
  2. Fuses multiple ops into a single kernel, eliminating intermediate memory writes
  3. Generates optimized code that executes all operations in one pass over the data
  4. Reduces Python overhead by compiling the entire function

The optimized line profiler shows the function now executes through a single compile_wrapper call, with the actual computation (return fn(*args, **kwargs)) taking the bulk of the time as one fused operation rather than six separate ones.
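
As a sketch of how the compiled path can be set up (the PR decorates the function with @torch.compile; the call form below is equivalent and assumes the reconstruction above):

```python
import torch

# Equivalent to decorating the function with @torch.compile (PyTorch >= 2.0).
# The first call traces the graph and generates one fused kernel; later calls
# reuse it.
compiled_activation = torch.compile(complex_activation)

x = torch.randn(4096)
y = compiled_activation(x)  # first call: trace + codegen; subsequent calls: fused kernel only
```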

Fallback safety:

The code includes a compatibility check: if torch.compile is unavailable (PyTorch < 2.0), it falls back to an identity decorator that preserves the original behavior. This ensures backward compatibility without breaking existing deployments.
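
A minimal sketch of that kind of guard; the helper name maybe_compile is illustrative, not necessarily the PR's actual identifier:

```python
import torch

# If torch.compile is missing (PyTorch < 2.0), substitute an identity
# decorator so @maybe_compile leaves the decorated function completely
# unchanged.
if hasattr(torch, "compile"):
    maybe_compile = torch.compile
else:
    def maybe_compile(fn):
        return fn

# Usage: place @maybe_compile above complex_activation instead of @torch.compile.
```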

Impact:

This optimization is particularly effective for:

  • Functions with many small sequential operations (like this activation function)
  • GPU workloads where kernel launch overhead is significant
  • Scenarios where the function is called repeatedly (e.g., in neural network forward passes), as the compilation cost is amortized after the first call; see the timing sketch after this list
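
An illustrative way to observe that amortization (hypothetical timing harness; assumes the complex_activation sketch above is in scope, and absolute numbers will vary by hardware):

```python
import time
import torch

x = torch.randn(1024, 1024)
fn = torch.compile(complex_activation)  # complex_activation: see the sketch above

t0 = time.perf_counter()
fn(x)                                   # first call pays the one-time compile cost
t1 = time.perf_counter()
for _ in range(100):
    fn(x)                               # warm calls run only the fused kernel
t2 = time.perf_counter()

# (t2 - t1) * 10 == (t2 - t1) / 100 calls * 1000 ms
print(f"first call: {(t1 - t0) * 1e3:.1f} ms, warm call avg: {(t2 - t1) * 10:.3f} ms")
```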

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 11 Passed |
| 🌀 Generated Regression Tests | 🔘 None Found |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
⚙️ Existing Unit Tests (detailed timings)

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_complex_activation.py::TestComplexActivation.test_deterministic | 458μs | 237μs | 92.8% ✅ |
| test_complex_activation.py::TestComplexActivation.test_gradient_flow | 294μs | 216μs | 36.4% ✅ |
| test_complex_activation.py::TestComplexActivation.test_negative_input | 228μs | 133μs | 71.1% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_bounded | 261μs | 142μs | 84.3% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_device | 263μs | 140μs | 87.6% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_dtype | 263μs | 141μs | 85.5% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_is_finite | 260μs | 141μs | 84.3% ✅ |
| test_complex_activation.py::TestComplexActivation.test_output_shape | 265μs | 147μs | 80.2% ✅ |
| test_complex_activation.py::TestComplexActivation.test_positive_input | 229μs | 132μs | 73.2% ✅ |
| test_complex_activation.py::TestComplexActivation.test_zero_input | 249μs | 133μs | 86.7% ✅ |

To edit these changes, run `git checkout codeflash/optimize-complex_activation-mkvhfuuu` and push.


codeflash-ai bot requested a review from aseembits93 on Jan 26, 2026, 18:10
codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels on Jan 26, 2026