⚡️ Speed up function complex_activation by 63%
#1174
Closed
+4 −1
📄 63% (0.63x) speedup for complex_activation in code_to_optimize/complex_activation.py
⏱️ Runtime: 2.82 milliseconds → 1.73 milliseconds (best of 228 runs)
📝 Explanation and details
The optimized code achieves a 63% speedup (from 2.82 ms to 1.73 ms) by adding a single @torch.compile() decorator to the function. This optimization addresses the primary performance bottleneck: kernel launch overhead from sequential PyTorch operations.

What changed:
- Added a @torch.compile() decorator to enable PyTorch's JIT compilation.

Why this creates a speedup:
The original function performs 6 sequential element-wise operations on GPU tensors:
- torch.sin(x)
- torch.cos(x)
- torch.exp(-x.abs())
- (1 + x.pow(2))
- torch.tanh(x) * torch.sigmoid(x)
- 0.5 * x.pow(3)

Without compilation, each operation launches a separate GPU kernel. The line profiler shows the most expensive lines are torch.exp(-x.abs()) (40.2% of time) and the division operation (34.8%), which involve multiple kernel launches for computing intermediate values.
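For illustration, a minimal sketch of the decorated function is shown below. The PR only lists the individual operations, not how they are combined, so the exact expression (including pairing the exp term with the division by 1 + x.pow(2)) is an assumption; the @torch.compile() decorator on the first line is the actual change the PR describes.

```python
import torch

@torch.compile()  # the single change described in this PR: JIT-compile and fuse the ops below
def complex_activation(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical reconstruction: the operations come from the PR description,
    # but their exact combination is assumed for illustration.
    return (
        torch.sin(x) * torch.cos(x)              # trigonometric term
        + torch.exp(-x.abs()) / (1 + x.pow(2))   # the exp and division flagged as hottest by the profiler
        + torch.tanh(x) * torch.sigmoid(x)       # smooth gating term
        + 0.5 * x.pow(3)                         # cubic term
    )
```

With the decorator in place, the first call pays a one-time compilation cost and subsequent calls reuse the cached compiled kernel.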
How torch.compile() optimizes this:
PyTorch's compiler performs operator fusion: it analyzes the computation graph and merges these 6+ separate operations into a single optimized kernel. This eliminates the per-operation kernel launches and the intermediate tensors that would otherwise be written to and read back from memory. The fused kernel computes the entire activation function in one GPU execution pass, reading the input once and writing the output once, instead of making 6+ round trips through memory.
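To see the fusion for yourself, one option (a sketch, not part of the PR) is to enable the compiler's output-code logging and call the compiled function once; the generated kernel should contain the whole element-wise chain in a single body.

```python
import torch

# Log the code that torch.compile generates; equivalent to running the script
# with the environment variable TORCH_LOGS="output_code".
torch._logging.set_logs(output_code=True)

@torch.compile()
def fused(x):
    # Illustrative element-wise chain: eager PyTorch launches one kernel per op,
    # while the compiled version emits a single fused kernel for the expression.
    return torch.sin(x) * torch.cos(x) + torch.tanh(x) * torch.sigmoid(x)

x = torch.randn(4096, device="cuda" if torch.cuda.is_available() else "cpu")
fused(x)  # first call triggers compilation and logs the generated kernel
```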
Impact:
This optimization is particularly effective for workloads like this one, where many small element-wise kernels dominate the runtime. The speedup would be most beneficial if this activation is used in neural network layers that execute thousands of times per training epoch or inference batch.
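To reproduce the comparison on your own hardware, a rough benchmarking sketch follows. The function body is the same assumed reconstruction as above, and the warm-up/iteration counts are arbitrary; the PR's reported numbers come from codeflash's own "best of 228 runs" harness, not from this snippet.

```python
import time
import torch

def eager_activation(x):
    # Stand-in for the original, uncompiled complex_activation (composition assumed).
    return (torch.sin(x) * torch.cos(x)
            + torch.exp(-x.abs()) / (1 + x.pow(2))
            + torch.tanh(x) * torch.sigmoid(x)
            + 0.5 * x.pow(3))

compiled_activation = torch.compile(eager_activation)

def bench(fn, x, iters=100):
    # Warm up (also triggers compilation for the compiled variant).
    for _ in range(10):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x = torch.randn(1 << 20, device="cuda" if torch.cuda.is_available() else "cpu")
print(f"eager:    {bench(eager_activation, x) * 1e3:.3f} ms")
print(f"compiled: {bench(compiled_activation, x) * 1e3:.3f} ms")
```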
✅ Correctness verification report:
⚙️ Existing Unit Tests

test_complex_activation.py::TestComplexActivation.test_deterministic
test_complex_activation.py::TestComplexActivation.test_gradient_flow
test_complex_activation.py::TestComplexActivation.test_negative_input
test_complex_activation.py::TestComplexActivation.test_output_bounded
test_complex_activation.py::TestComplexActivation.test_output_device
test_complex_activation.py::TestComplexActivation.test_output_dtype
test_complex_activation.py::TestComplexActivation.test_output_is_finite
test_complex_activation.py::TestComplexActivation.test_output_shape
test_complex_activation.py::TestComplexActivation.test_positive_input
test_complex_activation.py::TestComplexActivation.test_zero_input

To edit these changes, run git checkout codeflash/optimize-complex_activation-mkvjytje and push.