@@ -1,21 +1,18 @@
---
title: Accelerate the exponential function
title: Optimize exponential functions with FEXPA

draft: true
cascade:
draft: true

minutes_to_complete: 15

who_is_this_for: This is an introductory topic for developers interested in implementing the exponential function and optimizing it. The Scalable Vector Extension (SVE), introduced with the Armv8-A architecture, includes a dedicated instruction, FEXPA. Although initially not supported in SME, the FEXPA instruction has been made available in Scalable Matrix Extension (SME) version 2.2.
who_is_this_for: This is an introductory topic for developers interested in accelerating exponential function computations using Arm's Scalable Vector Extension (SVE). The FEXPA instruction provides hardware acceleration for exponential calculations on Arm Neoverse processors.

learning_objectives:
- Implement the exponential function using SVE intrinsics
- Optimize the function with FEXPA

prerequisites:
- Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine from a cloud service provider](/learning-paths/servers-and-cloud-computing/csp/).
- Some familiarity with SIMD programming and SVE intrinsics.
- Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine from a cloud service provider](/learning-paths/servers-and-cloud-computing/csp/)
- Some familiarity with SIMD programming and SVE intrinsics

author:
- Arnaud Grasset
@@ -1,18 +1,19 @@
---
title: Conclusion
title: Review benefits and next steps
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Conclusion
The SVE FEXPA instruction can speed-up the computation of the exponential functions by implementing table lookup and bit manipulation. The exponential function is the core of the Softmax function that, with the shift toward Generative AI, has become a critical component of modern neural network architectures.
## Summary

An implementation of the exponential function based on FEXPA can achieve a specified target precision using a polynomial of lower degree than that required by alternative implementations. Moreover, SME support for FEXPA lets you embed the exponential approximation directly into the matrix computation path and that translates into:
The SVE FEXPA instruction speeds up the computation of exponential functions by performing table lookup and bit manipulation in hardware. The exponential function is the core of the Softmax function, which, with the shift toward Generative AI, has become a critical component of modern neural network architectures.
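
For reference, Softmax maps a vector of scores to probabilities by exponentiating every element and normalizing, so one exponential is evaluated per element:

$$softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$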

An implementation of the exponential function based on FEXPA can achieve a specified target precision using a polynomial of lower degree than alternative implementations. SME support for FEXPA lets you embed the exponential approximation directly into the matrix computation path, which translates into:
- Fewer instructions (no back-and-forth to scalar/SVE code)
- Potentially higher aggregate throughput (more exponentials per cycle)
- Lower power and bandwidth (data stays in the SME engine)
- Cleaner fusion with GEMM/GEMV workloads

All of which makes all exponential heavy workloads significantly faster on ARM CPUs.
These improvements make exponential-heavy workloads significantly faster on Arm CPUs.
23 changes: 16 additions & 7 deletions content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md
@@ -1,5 +1,5 @@
---
title: FEXPA
title: Optimize with FEXPA instruction
weight: 4

### FIXED, DO NOT MODIFY
@@ -8,9 +8,9 @@ layout: learningpathall

## The FEXPA instruction

Arm introduced in SVE an instruction called FEXPA: the Floating Point Exponential Accelerator.
Arm introduced an instruction in SVE called FEXPA: the Floating Point Exponential Accelerator.

Let’s segment the IEEE 754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective length of _Idxb_, _Expb_ and _Remb_ bits.
The fraction part of the IEEE 754 floating-point representation can be segmented into several sub-fields (Index, Exp, and Remaining bits) with respective lengths of _Idxb_, _Expb_, and _Remb_ bits.

| IEEE 754 precision | Idxb | Expb | Remb |
|-------------------------|------|------|------|
@@ -46,7 +46,7 @@ With a table of size 2^L, the evaluation interval for the approximation polynomi

## Exponential implementation with FEXPA

FEXPA can be used to rapidly perform the table lookup. With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy as the degree-4 polynomial implementation from the previous section.
Use FEXPA to rapidly perform the table lookup. With this instruction, a degree-2 polynomial is sufficient to obtain the same accuracy as the degree-4 polynomial implementation from the previous section.

### Add the FEXPA implementation

@@ -93,7 +93,7 @@ void exp_sve_fexpa(float *x, float *y, size_t n) {
```

{{% notice Arm Optimized Routines %}}
This implementation can be found in [ARM Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h).
This implementation can be found in [Arm Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h).
{{% /notice %}}
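
To make the key step concrete, here is a minimal sketch of the lookup in isolation (the function name is illustrative, not part of the routine above; build with SVE enabled, for example `-march=armv8-a+sve`):

```c
#include <arm_sve.h>

// svexpa_f32 compiles to a single FEXPA instruction: each lane's low bits
// select a table entry for the fractional power of two, and the bits above
// land in the exponent field of the result.
svfloat32_t scale_lookup(svuint32_t bits) {
    return svexpa_f32(bits);
}
```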


@@ -146,11 +146,20 @@ SVE+FEXPA (degree-2) 0.000414 5.95x

The benchmark shows the performance progression:

1. **SVE with degree-4 polynomial**: Provides up to 4x speedup through vectorization
2. **SVE with FEXPA and degree-2 polynomial**: Achieves an additional 1-2x improvement
- SVE with degree-4 polynomial provides up to 4x speedup through vectorization
- SVE with FEXPA and degree-2 polynomial achieves an additional 1-2x improvement

The FEXPA instruction delivers this improvement by:
- Replacing manual bit manipulation with a single hardware instruction (`svexpa()`)
- Enabling a simpler polynomial (degree-2 instead of degree-4) while maintaining accuracy

Both SVE implementations maintain comparable accuracy (errors in the 10^-9 to 10^-10 range), demonstrating that specialized hardware instructions can significantly improve performance without sacrificing precision.

## What you've accomplished and what's next

In this section, you:
- Implemented exponential function optimization using the FEXPA instruction
- Reduced polynomial degree from four to two while maintaining accuracy
- Achieved up to 6x speedup over the baseline implementation

Next, you'll review the key benefits and applications of FEXPA optimization.
@@ -1,5 +1,5 @@
---
title: First implementation
title: Implement exponential with SVE intrinsics
weight: 3

### FIXED, DO NOT MODIFY
@@ -8,11 +8,11 @@ layout: learningpathall

## Implement the exponential function

Based on the theory covered in the previous section, you can implement the exponential function using SVE intrinsics with polynomial approximation. This Learning Path was tested using a AWS Graviton4 instance type `r8g.medium`.
Based on the theory covered in the previous section, implement the exponential function using SVE intrinsics with polynomial approximation. This Learning Path was tested using an AWS Graviton4 instance type `r8g.medium`.

## Set up your environment

To run the example, you will need `gcc`.
To run the example, you need `gcc`.

```bash
sudo apt update
@@ -230,4 +230,11 @@ The benchmark demonstrates the performance benefit of using SVE intrinsics for v

The accuracy check confirms that the polynomial approximation maintains high precision, with errors typically in the range of 10^-9 to 10^-10 for single-precision floating-point values.

Continue to the next section to dive into the FEXPA intrinsic implementation, providing further performance uplifts.
## What you've accomplished and what's next

In this section, you:
- Implemented a vectorized exponential function using SVE intrinsics
- Applied range reduction and polynomial approximation techniques
- Achieved up to 4x speedup over the scalar baseline

Next, you'll optimize further using the FEXPA instruction for additional performance gains.
29 changes: 18 additions & 11 deletions content/learning-paths/servers-and-cloud-computing/fexpa/theory.md
Original file line number Diff line number Diff line change
@@ -1,16 +1,16 @@
---
title: Theory
title: Learn exponential function optimization techniques
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## The exponential function
The exponential function is a fundamental mathematical function used across a wide range of algorithms for signal processing, High-Performance Computing and Machine Learning. Optimizing its computation has been the subject of extensive research for decades. The precision of the computation depends both on the selected approximation method and on the inherent rounding errors associated with finite-precision arithmetic, and it is directly traded off against performance when implementing the exponential function.
The exponential function is a fundamental mathematical function used across a wide range of algorithms for signal processing, High-Performance Computing and Machine Learning. Researchers have extensively studied optimizing its computation for decades. The precision of the computation depends both on the selected approximation method and on the inherent rounding errors associated with finite-precision arithmetic, and it is directly traded off against performance when implementing the exponential function.

## Range reduction
Polynomial approximations are among the most widely used methods for software implementations of the exponential function. The accuracy of a Taylor series approximation for exponential function can be improved with the polynomials degree but will always deteriorate as the evaluation point moves further from the expansion point. By applying range reduction techniques, the approximation of the exponential function can however be restricted to a very narrow interval where the function is well-conditioned. This approach consists in reformulating the exponential function in the following way:
Polynomial approximations are among the most widely used methods for software implementations of the exponential function. The accuracy of a Taylor series approximation for the exponential function can be improved by increasing the polynomial's degree, but it deteriorates as the evaluation point moves further from the expansion point. By applying range reduction techniques, you can restrict the approximation of the exponential function to a very narrow interval where the function is well-conditioned. This approach reformulates the exponential function in the following way:

$$e^x=e^{k \times ln2+r}=2^k \times e^r$$

@@ -22,16 +22,16 @@ Since k is an integer, the evaluation of 2^k can be efficiently performed using

$$e^x \approx 2^k \times p(r)$$

It is important to note that the polynomial p(r) is evaluated exclusively over the interval [-ln2/2, +ln2/2]. So, the computational complexity can be optimized by selecting the polynomial degree based on the required precision of p(r) within this narrow range. Rather than relying on a Taylor polynomial, a minimax polynomial approximation can be used to minimize the maximum approximation error over the considered interval.
The polynomial p(r) is evaluated exclusively over the interval [-ln2/2, +ln2/2]. So, the computational complexity can be optimized by selecting the polynomial degree based on the required precision of p(r) within this narrow range. Rather than relying on a Taylor polynomial, a minimax polynomial approximation can be used to minimize the maximum approximation error over the considered interval.
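
For illustration, here is a scalar sketch of evaluating such a low-degree polynomial with Horner's scheme; the truncated Taylor coefficients below are placeholders for the minimax coefficients a production routine would fit over the same interval:

```c
// p(r) ~ e^r on [-ln2/2, +ln2/2]: 1 + r + r^2/2, as nested multiply-adds
static inline float poly_exp(float r) {
    return 1.0f + r * (1.0f + r * 0.5f);
}
```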

## Decomposition of the input
The decomposition of an input value as x = k × ln2 + r can be done in 2 steps:
Decompose an input value as x = k × ln2 + r in two steps:
- Compute k as: k = round(x/ln2), where round(.) is the round-to-nearest function
- Compute r as: r = x - k × ln2

Rounding of k is performed by adding an adequately chosen large number to a floating-point value and subtracting it just afterward (the original value is rounded due to the finite precision of floating-point representation). Although explicit rounding instructions are available in both SVE and SME, this method remains advantageous as the addition of the constant can be fused with the multiplication by the reciprocal of ln2. This approach assumes however that the floating-point rounding mode is set to round-to-nearest, which is the default mode in Armv9-A. By integrating the bias into the constant, 2^k can also be directly computed by shifting the intermediate value.
Rounding of k is performed by adding an adequately chosen large number to a floating-point value and subtracting it just afterward (the original value is rounded because of the finite precision of floating-point representation). Although explicit rounding instructions are available in both SVE and SME, this method remains advantageous because the addition of the constant can be fused with the multiplication by the reciprocal of ln2. This approach assumes that the floating-point rounding mode is set to round-to-nearest, which is the default mode in Armv9-A. By integrating the bias into the constant, 2^k can also be directly computed by shifting the intermediate value.
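
A minimal scalar sketch of this decomposition (the helper name is illustrative; the constants are the standard single-precision values):

```c
#include <math.h>

// Decompose x = k*ln2 + r. Assumes round-to-nearest (the Armv9-A default)
// and |x/ln2| < 2^22, so adding SHIFT rounds away exactly the fractional bits.
static inline void reduce(float x, float *k, float *r) {
    const float SHIFT   = 0x1.8p23f;       /* 1.5 * 2^23 */
    const float INV_LN2 = 0x1.715476p+0f;  /* ~1/ln2 */
    const float LN2     = 0x1.62e43p-1f;   /* ~ln2 */

    float z = fmaf(x, INV_LN2, SHIFT);     /* the add fuses with the multiply */
    *k = z - SHIFT;                        /* k = round(x/ln2) */
    *r = fmaf(*k, -LN2, x);                /* r = x - k*ln2 */
}
```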

Rounding error during the second step will introduce a global error as we will have:
A rounding error during the second step introduces a global error, since the decomposition then holds only approximately:

$$ x \approx k \times ln2 + r $$

@@ -44,20 +44,27 @@ $$ (-1)^s \times 2^{(exponent - bias)} \times (1.fraction)_2 $$

where s is the sign bit and 1.fraction represents the significand.

The value 2^k can be encoded by setting both the sign and fraction bits to zero and assigning the exponent field the value k + bias. If k is an 8-bits integer, 2^k can be efficiently computed by adding the bias value and positioning the result into the exponent bits of a 32-bit floating-point number using a logical shift.
The value 2^k can be encoded by setting both the sign and fraction bits to zero and assigning the exponent field the value k + bias. If k is an 8-bit integer, 2^k can be efficiently computed by adding the bias value and positioning the result into the exponent bits of a 32-bit floating-point number using a logical shift.
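
A sketch of this encoding in C (the helper name is illustrative):

```c
#include <stdint.h>
#include <string.h>

// Encode 2^k by writing k + bias into the exponent field of a binary32.
// Valid for integer k in the normal range [-126, 127].
static inline float pow2_int(int32_t k) {
    uint32_t bits = (uint32_t)(k + 127) << 23;  /* bias = 127 */
    float result;
    memcpy(&result, &bits, sizeof result);      /* reinterpret as float */
    return result;
}
```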

Taking this approach a step further, a fast approximation of exponential function can be achieved using bits manipulation techniques alone. Specifically, adding a bias to an integer k and shifting the result into the exponent field can be accomplished by computing an integer i as follows:
Taking this approach a step further, you can achieve a fast approximation of the exponential function using bit manipulation techniques alone. Specifically, adding a bias to an integer k and shifting the result into the exponent field can be accomplished by computing an integer i as follows:

$$i=2^{23} \times (k+bias) = 2^{23} \times k+2^{23} \times bias$$

This formulation assumes a 23-bit significand, but the method can be generalized to other floating-point precisions.

Now, consider the case where k is a real number. The fractional part of k will propagate into the significand bits of the resulting 2^k approximation. However, this side effect is not detrimental, it effectively acts as a form of linear interpolation, thereby improving the overall accuracy of the approximation. To approximate the exponential function, the following identity can be used:
Now, consider the case where k is a real number. The fractional part of k propagates into the significand bits of the resulting 2^k approximation. However, this side effect isn't detrimental; it effectively acts as a form of linear interpolation, thereby improving the overall accuracy of the approximation. To approximate the exponential function, use the following identity:

$$e^x = 2^{x/ln2}$$

As previously discussed, this value can be approximated by computing a 32-bit integer:

$$i = 2^{23} \times x/ln2 + 2^{23} \times bias = a \times x + b $$
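
A scalar sketch of this whole-function bit trick (the helper name is illustrative; on its own it is a rough approximation, which is why the FEXPA-based routine pairs the same idea with a polynomial correction):

```c
#include <stdint.h>
#include <string.h>

// i = a*x + b, reinterpreted as a float. The fractional part of x/ln2 spills
// into the significand and acts as linear interpolation between powers of two.
static inline float exp_bits_approx(float x) {
    const float a = 0x1.715476p+23f;     /* 2^23 / ln2 */
    const float b = 127.0f * 0x1.0p23f;  /* 2^23 * bias, bias = 127 */
    int32_t i = (int32_t)(a * x + b);
    float result;
    memcpy(&result, &i, sizeof result);
    return result;
}
```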

Continue to the next section to make a C-based implementation of the exponential function.
## What you've accomplished and what's next

In this section, you learned the mathematical foundations for optimizing exponential functions:
- Range reduction techniques that narrow the evaluation interval
- How to decompose inputs using the k × ln2 + r reformulation
- Bit manipulation techniques for computing scaling factors

Next, you'll implement these concepts in C using SVE intrinsics.