diff --git a/content/learning-paths/mobile-graphics-and-gaming/get-started-with-unity-on-android/images/add-disk.png b/content/learning-paths/mobile-graphics-and-gaming/get-started-with-unity-on-android/images/add-disk.png deleted file mode 100644 index 205d050a90..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/get-started-with-unity-on-android/images/add-disk.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-check-gpu-bound.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-check-gpu-bound.png deleted file mode 100644 index 337e3df01f..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-check-gpu-bound.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-dataset-comparison.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-dataset-comparison.png deleted file mode 100644 index bde816bb4b..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-dataset-comparison.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-filter-text-collision.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-filter-text-collision.png deleted file mode 100644 index 0b08bb4dd6..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-filter-text-collision.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-panels-after-pull-data.png 
b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-panels-after-pull-data.png deleted file mode 100644 index d07ab8b8e1..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-panels-after-pull-data.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-pulled-datasets.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-pulled-datasets.png deleted file mode 100644 index 48ab4d2a89..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-pulled-datasets.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-select-similar-frames.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-select-similar-frames.png deleted file mode 100644 index 93c5eba2e2..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/analyzer-select-similar-frames.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/bad-load.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/bad-load.png deleted file mode 100644 index e41ebfc14a..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/bad-load.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/game-view.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/game-view.png deleted file mode 100644 index 8dcb5d9a18..0000000000 Binary files 
a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/game-view.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/import-window-step-1.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/import-window-step-1.png deleted file mode 100644 index c70222b3c8..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/import-window-step-1.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/pa.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/pa.png deleted file mode 100644 index 834ce70965..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/pa.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/pm.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/pm.png deleted file mode 100644 index 2a9ee8dc36..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/pm.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame-call-stacks-enabled.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame-call-stacks-enabled.png deleted file mode 100644 index 65023bc31b..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame-call-stacks-enabled.png and /dev/null differ diff --git 
a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame-hierarchy.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame-hierarchy.png deleted file mode 100644 index 78e012aaae..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame-hierarchy.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame.png deleted file mode 100644 index d636aa9642..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-analyse-selected-frame.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-samsung-s8-plain.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-samsung-s8-plain.png deleted file mode 100644 index 63fec9a477..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/profiler-samsung-s8-plain.png and /dev/null differ diff --git a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/sample-project-default-scene-view.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/sample-project-default-scene-view.png deleted file mode 100644 index 1d2e00c7a9..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/sample-project-default-scene-view.png and /dev/null differ diff --git 
a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/urp.png b/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/urp.png deleted file mode 100644 index 5056cca552..0000000000 Binary files a/content/learning-paths/mobile-graphics-and-gaming/profiling-unity-apps-on-android/images/urp.png and /dev/null differ diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md b/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md index 8a4bb96348..2ab064a69f 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/_index.md @@ -1,21 +1,18 @@ --- -title: Accelerate the exponential function +title: Optimize exponential functions with FEXPA -draft: true -cascade: - draft: true minutes_to_complete: 15 -who_is_this_for: This is an introductory topic for developers interested in implementing the exponential function and optimizing it. The Scalable Vector Extension (SVE), introduced with the Armv8-A architecture, includes a dedicated instruction, FEXPA. Although initially not supported in SME, the FEXPA instruction has been made available in Scalable Matrix Extension (SME) version 2.2. +who_is_this_for: This is an introductory topic for developers interested in accelerating exponential function computations using Arm's Scalable Vector Extension (SVE). The FEXPA instruction provides hardware acceleration for exponential calculations on Arm Neoverse processors. learning_objectives: - Implement the exponential function using SVE intrinsics - Optimize the function with FEXPA prerequisites: - - Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine from a cloud service provider](/learning-paths/servers-and-cloud-computing/csp/). - - Some familiarity with SIMD programming and SVE intrinsics. 
+ - Access to an [AWS Graviton4, Google Axion, or Azure Cobalt 100 virtual machine from a cloud service provider](/learning-paths/servers-and-cloud-computing/csp/) + - Some familiarity with SIMD programming and SVE intrinsics author: - Arnaud Grasset diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md b/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md index 9f62a198aa..25f5f3533e 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/conclusion.md @@ -1,18 +1,19 @@ --- -title: Conclusion +title: Review benefits and next steps weight: 5 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## Conclusion -The SVE FEXPA instruction can speed-up the computation of the exponential functions by implementing table lookup and bit manipulation. The exponential function is the core of the Softmax function that, with the shift toward Generative AI, has become a critical component of modern neural network architectures. +## Summary -An implementation of the exponential function based on FEXPA can achieve a specified target precision using a polynomial of lower degree than that required by alternative implementations. Moreover, SME support for FEXPA lets you embed the exponential approximation directly into the matrix computation path and that translates into: +The SVE FEXPA instruction speeds up the computation of exponential functions by implementing table lookup and bit manipulation. The exponential function is the core of the Softmax function, which, with the shift toward Generative AI, has become a critical component of modern neural network architectures. + +An implementation of the exponential function based on FEXPA can achieve a specified target precision using a polynomial of lower degree than that required by alternative implementations.
SME support for FEXPA lets you embed the exponential approximation directly into the matrix computation path, which translates into: - Fewer instructions (no back-and-forth to scalar/SVE code) - Potentially higher aggregate throughput (more exponentials per cycle) - Lower power & bandwidth (data being kept in the SME engine) - Cleaner fusion with GEMM/GEMV workloads -All of which makes all exponential heavy workloads significantly faster on ARM CPUs. +These improvements make exponential-heavy workloads significantly faster on Arm CPUs. diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md index 52d7531a47..753d94cddb 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/fexpa.md @@ -1,5 +1,5 @@ --- -title: FEXPA +title: Optimize with FEXPA instruction weight: 4 ### FIXED, DO NOT MODIFY @@ -8,9 +8,9 @@ layout: learningpathall ## The FEXPA instruction -Arm introduced in SVE an instruction called FEXPA: the Floating Point Exponential Accelerator. +Arm introduced an instruction in SVE called FEXPA: the Floating Point Exponential Accelerator. -Let’s segment the IEEE 754 floating-point representation fraction part into several sub-fields (Index, Exp and Remaining bits) with respective length of _Idxb_, _Expb_ and _Remb_ bits. +The fraction part of the IEEE 754 floating-point representation can be segmented into several sub-fields (Index, Exp, and Remaining bits) with respective lengths of _Idxb_, _Expb_, and _Remb_ bits. | IEEE 754 precision | Idxb | Expb | Remb | |-------------------------|------|------|------| @@ -46,7 +46,7 @@ With a table of size 2^L, the evaluation interval for the approximation polynomi ## Exponential implementation with FEXPA -FEXPA can be used to rapidly perform the table lookup. 
With this instruction a degree-2 polynomial is sufficient to obtain the same accuracy as the degree-4 polynomial implementation from the previous section. +Use FEXPA to rapidly perform the table lookup. With this instruction, a degree-2 polynomial is sufficient to obtain the same accuracy as the degree-4 polynomial implementation from the previous section. ### Add the FEXPA implementation @@ -93,7 +93,7 @@ void exp_sve_fexpa(float *x, float *y, size_t n) { ``` {{% notice Arm Optimized Routines %}} -This implementation can be found in [ARM Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h). +This implementation can be found in [Arm Optimized Routines](https://github.com/ARM-software/optimized-routines/blob/ba35b32/math/aarch64/sve/sv_expf_inline.h). {{% /notice %}} @@ -146,11 +146,20 @@ SVE+FEXPA (degree-2) 0.000414 5.95x The benchmark shows the performance progression: -1. **SVE with degree-4 polynomial**: Provides up to 4x speedup through vectorization -2. **SVE with FEXPA and degree-2 polynomial**: Achieves an additional 1-2x improvement +- SVE with degree-4 polynomial provides up to 4x speedup through vectorization +- SVE with FEXPA and degree-2 polynomial achieves an additional 1-2x improvement The FEXPA instruction delivers this improvement by: - Replacing manual bit manipulation with a single hardware instruction (`svexpa()`) - Enabling a simpler polynomial (degree-2 instead of degree-4) while maintaining accuracy Both SVE implementations maintain comparable accuracy (errors in the 10^-9 to 10^-10 range), demonstrating that specialized hardware instructions can significantly improve performance without sacrificing precision. 
+ +## What you've accomplished and what's next + +In this section, you: +- Implemented exponential function optimization using the FEXPA instruction +- Reduced polynomial degree from four to two while maintaining accuracy +- Achieved up to 6x speedup over the baseline implementation + +Next, you'll review the key benefits and applications of FEXPA optimization. diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md index 725c98c78b..39be90567d 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/implementation.md @@ -1,5 +1,5 @@ --- -title: First implementation +title: Implement exponential with SVE intrinsics weight: 3 ### FIXED, DO NOT MODIFY @@ -8,11 +8,11 @@ layout: learningpathall ## Implement the exponential function -Based on the theory covered in the previous section, you can implement the exponential function using SVE intrinsics with polynomial approximation. This Learning Path was tested using a AWS Graviton4 instance type `r8g.medium`. +Based on the theory covered in the previous section, implement the exponential function using SVE intrinsics with polynomial approximation. This Learning Path was tested using an AWS Graviton4 instance type `r8g.medium`. ## Set up your environment -To run the example, you will need `gcc`. +To run the example, you need `gcc`. ```bash sudo apt update @@ -230,4 +230,11 @@ The benchmark demonstrates the performance benefit of using SVE intrinsics for v The accuracy check confirms that the polynomial approximation maintains high precision, with errors typically in the range of 10^-9 to 10^-10 for single-precision floating-point values. -Continue to the next section to dive into the FEXPA intrinsic implementation, providing further performance uplifts. 
\ No newline at end of file +## What you've accomplished and what's next + +In this section, you: +- Implemented a vectorized exponential function using SVE intrinsics +- Applied range reduction and polynomial approximation techniques +- Achieved up to 4x speedup over the scalar baseline + +Next, you'll optimize further using the FEXPA instruction for additional performance gains. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md b/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md index 269aab1bd3..ff13251e62 100644 --- a/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md +++ b/content/learning-paths/servers-and-cloud-computing/fexpa/theory.md @@ -1,5 +1,5 @@ --- -title: Theory +title: Learn exponential function optimization techniques weight: 2 ### FIXED, DO NOT MODIFY @@ -7,10 +7,10 @@ layout: learningpathall --- ## The exponential function -The exponential function is a fundamental mathematical function used across a wide range of algorithms for signal processing, High-Performance Computing and Machine Learning. Optimizing its computation has been the subject of extensive research for decades. The precision of the computation depends both on the selected approximation method and on the inherent rounding errors associated with finite-precision arithmetic, and it is directly traded off against performance when implementing the exponential function. +The exponential function is a fundamental mathematical function used across a wide range of algorithms for signal processing, High-Performance Computing and Machine Learning. Researchers have extensively studied optimizing its computation for decades. The precision of the computation depends both on the selected approximation method and on the inherent rounding errors associated with finite-precision arithmetic, and it is directly traded off against performance when implementing the exponential function. 
## Range reduction -Polynomial approximations are among the most widely used methods for software implementations of the exponential function. The accuracy of a Taylor series approximation for exponential function can be improved with the polynomial’s degree but will always deteriorate as the evaluation point moves further from the expansion point. By applying range reduction techniques, the approximation of the exponential function can however be restricted to a very narrow interval where the function is well-conditioned. This approach consists in reformulating the exponential function in the following way: +Polynomial approximations are among the most widely used methods for software implementations of the exponential function. The accuracy of a Taylor series approximation for the exponential function can be improved by increasing the polynomial's degree but deteriorates as the evaluation point moves further from the expansion point. By applying range reduction techniques, you can restrict the approximation of the exponential function to a very narrow interval where the function is well-conditioned. This approach reformulates the exponential function in the following way: $$e^x=e^{k×ln2+r}=2^k \times e^r$$ @@ -22,16 +22,16 @@ Since k is an integer, the evaluation of 2^k can be efficiently performed using $$e^x \approx 2^k \times p(r)$$ -It is important to note that the polynomial p(r) is evaluated exclusively over the interval [-ln2/2, +ln2/2]. So, the computational complexity can be optimized by selecting the polynomial degree based on the required precision of p(r) within this narrow range. Rather than relying on a Taylor polynomial, a minimax polynomial approximation can be used to minimize the maximum approximation error over the considered interval. +The polynomial p(r) is evaluated exclusively over the interval [-ln2/2, +ln2/2]. 
So, the computational complexity can be optimized by selecting the polynomial degree based on the required precision of p(r) within this narrow range. Rather than relying on a Taylor polynomial, a minimax polynomial approximation can be used to minimize the maximum approximation error over the considered interval. ## Decomposition of the input -The decomposition of an input value as x = k × ln2 + r can be done in 2 steps: +Decompose an input value as x = k × ln2 + r in two steps: - Compute k as: k = round(x⁄ln2), where round(.) is the round-to-nearest function - Compute r as: r = x - k × ln2 -Rounding of k is performed by adding an adequately chosen large number to a floating-point value and subtracting it just afterward (the original value is rounded due to the finite precision of floating-point representation). Although explicit rounding instructions are available in both SVE and SME, this method remains advantageous as the addition of the constant can be fused with the multiplication by the reciprocal of ln2. This approach assumes however that the floating-point rounding mode is set to round-to-nearest, which is the default mode in Armv9-A. By integrating the bias into the constant, 2^k can also be directly computed by shifting the intermediate value. +Rounding of k is performed by adding an adequately chosen large number to a floating-point value and subtracting it just afterward (the original value is rounded because of the finite precision of floating-point representation). Although explicit rounding instructions are available in both SVE and SME, this method remains advantageous because the addition of the constant can be fused with the multiplication by the reciprocal of ln2. This approach assumes that the floating-point rounding mode is set to round-to-nearest, which is the default mode in Armv9-A. By integrating the bias into the constant, 2^k can also be directly computed by shifting the intermediate value. 
-Rounding error during the second step will introduce a global error as we will have: +A rounding error during the second step introduces a global error, since the decomposition then only holds approximately: $$ x \approx k \times ln2 + r $$ @@ -44,15 +44,15 @@ $$ (-1)^s \times 2^{(exponent - bias)} \times (1.fraction)_2 $$ where s is the sign bit and 1.fraction represents the significand. -The value 2^k can be encoded by setting both the sign and fraction bits to zero and assigning the exponent field the value k + bias. If k is an 8-bits integer, 2^k can be efficiently computed by adding the bias value and positioning the result into the exponent bits of a 32-bit floating-point number using a logical shift. +The value 2^k can be encoded by setting both the sign and fraction bits to zero and assigning the exponent field the value k + bias. If k is an 8-bit integer, 2^k can be efficiently computed by adding the bias value and positioning the result into the exponent bits of a 32-bit floating-point number using a logical shift. -Taking this approach a step further, a fast approximation of exponential function can be achieved using bits manipulation techniques alone. Specifically, adding a bias to an integer k and shifting the result into the exponent field can be accomplished by computing an integer i as follows: +Taking this approach a step further, you can achieve a fast approximation of the exponential function using bit manipulation techniques alone. Specifically, adding a bias to an integer k and shifting the result into the exponent field can be accomplished by computing an integer i as follows: $$i=2^{23} \times (k+bias) = 2^{23} \times k+2^{23} \times bias$$ This formulation assumes a 23-bit significand, but the method can be generalized to other floating-point precisions. -Now, consider the case where k is a real number. The fractional part of k will propagate into the significand bits of the resulting 2^k approximation. 
However, this side effect is not detrimental, it effectively acts as a form of linear interpolation, thereby improving the overall accuracy of the approximation. To approximate the exponential function, the following identity can be used: +Now, consider the case where k is a real number. The fractional part of k propagates into the significand bits of the resulting 2^k approximation. However, this side effect isn't detrimental; it effectively acts as a form of linear interpolation, thereby improving the overall accuracy of the approximation. To approximate the exponential function, use the following identity: $$e^x = 2^{x⁄ln2}$$ @@ -60,4 +60,11 @@ As previously discussed, this value can be approximated by computing a 32-bit in $$i = 2^{23} \times x⁄ln2 + 2^{23} \times bias = a \times x + b $$ -Continue to the next section to make a C-based implementation of the exponential function. \ No newline at end of file +## What you've accomplished and what's next + +In this section, you learned the mathematical foundations for optimizing exponential functions: +- Range reduction techniques that narrow the evaluation interval +- How to decompose inputs using k × ln2 + r reformulation +- Bit manipulation techniques for computing scaling factors + +Next, you'll implement these concepts in C using SVE intrinsics. \ No newline at end of file