From 65e2f723ec705f35da2bb32ac7dff64a98e63979 Mon Sep 17 00:00:00 2001 From: Madeline Underwood Date: Sat, 10 Jan 2026 23:41:03 +0000 Subject: [PATCH 1/2] Refine content for LLM CPU inference performance tuning - Update title for clarity in _index.md - Improve descriptions and fix typos in learning objectives - Add section on threading trade-offs in background.md - Enhance clarity in setup instructions in build.md - Expand explanation of thread count impact in tune.md --- .../_index.md | 10 +++----- .../background.md | 20 ++++++++++------ .../build.md | 23 +++++++++++++++---- .../tune.md | 22 ++++++++++-------- 4 files changed, 47 insertions(+), 28 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md index 37b2d4c3d6..ae8a05eee8 100644 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/_index.md @@ -1,17 +1,13 @@ --- -title: Fine tune LLM CPU inference performance with multithreading - -draft: true -cascade: - draft: true +title: Tune LLM CPU inference performance with multithreading minutes_to_complete: 30 -who_is_this_for: This is an introductory topic ML engineers optimizing LLM inference performance on Arm CPUs. +who_is_this_for: This is an introductory topic for ML engineers optimizing LLM inference performance on Arm CPUs. learning_objectives: - Understand how PyTorch uses multiple threads for CPU inference - - Measure performance impact of thread count on LLM inference + - Measure the performance impact of thread count on LLM inference - Tune thread count to optimize inference for specific models and systems prerequisites: diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md index 6b64e23672..ccd2ecf7ef 100644 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md +++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md @@ -6,6 +6,8 @@ weight: 3 layout: learningpathall --- +## Understanding threading trade-offs in CPU inference + A well-known challenge in parallel programming is choosing the right number of threads for a given amount of work. When multiple threads are created to perform a task, the actual computation must be large enough to justify the overhead of coordinating those threads. If a computation is split across many threads, the costs of creating the threads and synchronizing their results through shared memory can easily outweigh any performance gains from parallel execution. The same principle applies to generative AI workloads running on CPU. @@ -16,15 +18,15 @@ PyTorch attempts to automatically choose an appropriate number of threads. Howev ## Multithreading with PyTorch on CPU -When running inference, PyTorch uses an Application Thread Pool. PyTorch supports two types of parallelism: inter-op parallelism spawns threads to run separate operations in a graph in parallel (for example, one thread for a matmul and another thread for a softmax), while intra-op parallelism spawns multiple threads to work on the same operation. +When running inference, PyTorch uses an Application Thread Pool. 
PyTorch supports two types of parallelism: inter-op parallelism spawns threads to run separate operations in a graph in parallel (for example, one thread for a matrix multiplication and another thread for a softmax), while intra-op parallelism spawns multiple threads to work on the same operation. -The diagram below is taken from the [PyTorch documentation](https://docs.pytorch.org/docs/stable/index.html). +The diagram below shows PyTorch's threading model from the [PyTorch documentation](https://docs.pytorch.org/docs/stable/index.html). ![Diagram showing PyTorch's threading model with application thread pool, inter-op thread pool, and intra-op thread pool for CPU inference#center](./pytorch-threading.jpg "PyTorch threading model") The `torch.set_num_threads()` [API](https://docs.pytorch.org/docs/stable/generated/torch.set_num_threads.html) sets the maximum number of threads to spawn in the Application Thread Pool. -As of PyTorch 2.8.0, the default number of threads equals the number of CPU cores (see [PyTorch CPU Threading Documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html) for more detail). PyTorch determines the ideal number of threads based on the workload size, as shown in this code snippet from [ParallelOpenMP.h](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h): +As of PyTorch 2.8.0, the default number of threads equals the number of CPU cores (see the [PyTorch CPU Threading Documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html) for more detail). PyTorch determines the ideal number of threads based on the workload size, as shown in this code snippet from [ParallelOpenMP.h](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h): ```cpp int64_t num_threads = omp_get_num_threads(); @@ -112,9 +114,9 @@ Environment variables: ATen parallel backend: OpenMP ``` -The number of threads is set to the core count of 96, and the execution time is 2.24 ms. +PyTorch uses all 96 cores, and the execution time is 2.24 ms. -Reduce the number of OpenMP threads using the `OMP_NUM_THREADS` environment variable: +Now reduce the number of OpenMP threads using the `OMP_NUM_THREADS` environment variable: ```bash OMP_NUM_THREADS=16 python pytorch_omp_example.py @@ -142,6 +144,10 @@ Environment variables: ATen parallel backend: OpenMP ``` -The time varies with the number of threads and type of processor in your system. +The execution time varies with the number of threads and the processor type in your system. + +## What you've accomplished and what's next + +You've learned how PyTorch manages threads for CPU inference and seen how thread count affects performance in a simple example. The optimal thread count depends on both the workload size and system architecture. -In the next section, you'll apply these concepts to a much larger workload using a large language model (LLM). \ No newline at end of file +Next, you'll apply these concepts to a more realistic workload by tuning thread settings for large language model inference. 
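Before you do, if you want to experiment a little more with the simple example from this section, the sketch below shows one way to sweep intra-op thread counts from inside Python rather than exporting `OMP_NUM_THREADS` for each run. It is a minimal illustration, not the `pytorch_omp_example.py` script used earlier, and the matrix size and iteration count are arbitrary choices.

```python
import time
import torch

# Print the same ATen/OpenMP configuration shown in the output above
print(torch.__config__.parallel_info())

def time_matmul(num_threads, size=4096, iters=10):
    """Time a square matrix multiplication with an explicit intra-op thread count."""
    torch.set_num_threads(num_threads)  # caps the Application Thread Pool for intra-op work
    a = torch.rand(size, size)
    b = torch.rand(size, size)
    torch.mm(a, b)  # warm-up run so one-time costs aren't timed
    start = time.perf_counter()
    for _ in range(iters):
        torch.mm(a, b)
    elapsed_ms = (time.perf_counter() - start) / iters * 1000
    print(f"{num_threads:3d} threads: {elapsed_ms:.2f} ms per matmul")

# Sweep a few thread counts; the best value depends on your core count and workload size
for n in (4, 8, 16, 32):
    time_matmul(n)
```

Unlike `torch.set_num_interop_threads()`, which must be set before any parallel work starts, `torch.set_num_threads()` can be called repeatedly in the same process, which makes this kind of quick sweep convenient.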
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md index 3d7ed7bf54..a05f6ad0ed 100644 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md +++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/build.md @@ -6,8 +6,12 @@ weight: 2 layout: learningpathall --- +## Before you begin + +Before you can tune PyTorch threading for LLM inference on Arm CPUs, you need to set up your development environment with Docker, PyTorch, and access to the Gemma-3 models from Hugging Face. This section walks you through creating your Hugging Face account, configuring an Arm server, and running the PyTorch container with all necessary dependencies. + {{% notice Note %}} -This Learning Path uses Arm's downstream canary release of PyTorch, which includes ready-to-use examples and scripts. While this release offers access to the latest downstream features, it's intended for experimentation rather than production use. +This Learning Path uses Arm's downstream canary release of PyTorch, which includes ready-to-use examples and scripts. This release provides access to the latest features but is intended for experimentation rather than production use. {{% /notice %}} ## Create a Hugging Face account @@ -79,9 +83,9 @@ aarch64_pytorch ~> ## Log in to Hugging Face -Create a new Read token on Hugging Face by navigating to [Create new Access Token](https://huggingface.co/settings/tokens/new?tokenType=read). +Create a new Read token on Hugging Face by navigating to [Create new Access Token](https://huggingface.co/settings/tokens/new?tokenType=read). -![Screenshot of Hugging Face token creation page showing the 'Create new token' dialog with token type set to 'Read'#center](./hf-access-token.jpg "Hugging Face token creation") +![Screenshot of Hugging Face token creation interface showing a dialog box with fields for token name and type, with the 'Read' option selected and a 'Create token' button visible alt-txt#center](./hf-access-token.jpg "Hugging Face token creation interface") Provide a token name, create the token, and copy the generated value. From within the Docker container, run the following command and paste the token when prompted: @@ -89,6 +93,15 @@ Provide a token name, create the token, and copy the generated value. From withi huggingface-cli login ``` -Messages indicating the token is valid and login is successful are printed. +Messages indicating the token is valid and login is successful are printed. + +Be aware that the login doesn't persist after the Docker container exits. You'll need to log in again if you restart the container. + +## What you've accomplished and what's next + +You've set up your environment with: +- A Hugging Face account with access to the Gemma-3 models +- An Arm server or cloud instance with Docker installed +- The PyTorch-aarch64 container running and authenticated -Be aware that the login doesn't persist after the Docker container exits. You'll need to log in again if you restart the container. \ No newline at end of file +You're now ready to run LLM inference experiments and measure how thread count affects performance. 
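One optional convenience before you move on: because the interactive login must be repeated each time the container restarts, you can script it instead. The sketch below is one way to do this with the `huggingface_hub` Python package (the library behind `huggingface-cli`); it assumes you have exported your Read token as an `HF_TOKEN` environment variable, which is a choice made here for illustration rather than a requirement of this Learning Path.

```python
import os
from huggingface_hub import login, whoami

# Read the token from an environment variable instead of hard-coding a secret
token = os.environ["HF_TOKEN"]  # export HF_TOKEN=<your Read token> before starting Python

# Non-interactive equivalent of `huggingface-cli login`
login(token=token)

# Confirm the token is valid and the account is visible
print(f"Logged in as: {whoami()['name']}")
```

Treat the token as a secret: keep it out of scripts you commit, and revoke it from your Hugging Face settings when you no longer need it.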
\ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md index 03427aec39..55ece1a16d 100644 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md +++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/tune.md @@ -6,6 +6,10 @@ weight: 4 layout: learningpathall --- +## Run inference experiments with different thread counts + +Now that you understand how PyTorch threading works and have your environment configured, you're ready to tune thread settings for actual LLM inference workloads. This section shows you how to measure inference performance across different thread counts using Google's Gemma-3 models on Arm CPUs. You'll run experiments with both the 270M and 1B parameter variants to understand how model size affects optimal thread configuration. + This section runs inference on Google's [Gemma-3](https://huggingface.co/google/gemma-3-1b-it) model and measures how inference performance varies with thread count for both the 270 million parameter and 1 billion parameter models. The `transformers_llm_text_gen.py` script applies groupwise, layout-aware INT4 quantization by default. Create a file named `comparison-1b.sh` with the following script: @@ -101,17 +105,17 @@ Decode Tokens per second: 45.23 The graph below shows how prefill tokens per second change with the number of OpenMP threads for the 270M and 1B variants of Gemma-3: -![Graph showing prefill tokens per second versus number of OpenMP threads for Gemma-3 270M and 1B models. Both models peak at 16-32 threads, with the 270M model showing steeper decline after the peak#center](./prefill_throughput.png "Prefill throughput comparison") +![Line graph comparing prefill throughput performance of Gemma-3 270M and 1B models across different thread counts from 2 to 96. The y-axis shows tokens per second (0-3000), and the x-axis shows number of OpenMP threads. Both lines peak at 16-32 threads, with the 270M model achieving higher throughput but showing a steeper decline after peak performance alt-txt#center](./prefill_throughput.png "Prefill throughput versus thread count for Gemma-3 models") -As expected, the smaller 270M model runs faster. Both models reach their optimal token generation rate at around 16–32 threads, though the 270M model exhibits a sharper performance drop-off beyond this range compared with the 1B variant. +As expected, the smaller 270M model runs faster. Both models reach their optimal token generation rate at around 16 to 32 threads, though the 270M model exhibits a sharper performance drop-off beyond this range compared with the 1B variant. ## Use PyTorch compilation mode -The examples so far have used PyTorch's eager execution mode. You can also test performance with PyTorch's compile mode. +The examples so far have used PyTorch's eager execution mode. PyTorch's compile mode can provide additional performance improvements. -Install a C++ compiler and dependencies: +Before testing compile mode, install a C++ compiler and dependencies: ```bash sudo apt update && sudo apt install g++ python3.10-dev build-essential -y @@ -151,13 +155,13 @@ Decode Tokens per second: 107.37 Reducing the thread count from 96 (default) to 16 provides a significant reduction in end-to-end generation time. 
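If you want to run a quick sweep of your own without the shell scripts above, the sketch below times generation at a few thread counts directly from Python. It is a simplified stand-in for the container's `transformers_llm_text_gen.py`, not the script itself: it skips the INT4 quantization that script applies and reports a single end-to-end tokens-per-second figure rather than separate prefill and decode rates, so the absolute numbers won't match the figures reported above.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-3-1b-it"  # swap in the 270M variant to compare model sizes
PROMPT = "Explain why thread count affects CPU inference performance."

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

inputs = tokenizer(PROMPT, return_tensors="pt")

for num_threads in (8, 16, 32, 64):
    torch.set_num_threads(num_threads)
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    elapsed = time.perf_counter() - start
    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"{num_threads:3d} threads: {new_tokens / elapsed:.2f} tokens/s end to end")
```

You can also wrap the model with `torch.compile()` before the loop to repeat the compile-mode comparison, at the cost of a one-time compilation delay on the first generation.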
-## What you've learned +## What you've accomplished and what's next -You've explored how the number of OpenMP threads impacts LLM inference performance on Arm CPUs. You've learned that: +You've explored how the number of OpenMP threads impacts LLM inference performance on Arm CPUs and learned that: - Default thread settings on many-core systems don't always provide optimal performance -- Smaller models typically benefit from fewer threads due to lower synchronization overhead +- Smaller models typically benefit from fewer threads because of lower synchronization overhead - The optimal thread count depends on both model size and system architecture -- PyTorch's compile mode can provide additional performance improvements when combined with thread tuning +- PyTorch's compile mode provides additional performance improvements when combined with thread tuning -In practice, use a heuristic or trial-and-error approach to determine the optimal thread count for your specific model and system configuration. +For your specific workloads, experiment with different thread counts to find the optimal setting. Start with powers of 2 (8, 16, 32) and measure the actual throughput and latency for your use case. The performance characteristics you observed in this Learning Path apply to other LLM inference workloads on Arm CPUs. From a171557f57718ce74547dcc3a1f8408d563de8b1 Mon Sep 17 00:00:00 2001 From: Madeline Underwood Date: Sat, 10 Jan 2026 23:41:11 +0000 Subject: [PATCH 2/2] Update alt text for PyTorch threading model diagram in background.md --- .../tune-pytorch-cpu-perf-with-threads/background.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md index ccd2ecf7ef..b6d401598d 100644 --- a/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md +++ b/content/learning-paths/servers-and-cloud-computing/tune-pytorch-cpu-perf-with-threads/background.md @@ -22,7 +22,7 @@ When running inference, PyTorch uses an Application Thread Pool. PyTorch support The diagram below shows PyTorch's threading model from the [PyTorch documentation](https://docs.pytorch.org/docs/stable/index.html). -![Diagram showing PyTorch's threading model with application thread pool, inter-op thread pool, and intra-op thread pool for CPU inference#center](./pytorch-threading.jpg "PyTorch threading model") +![Diagram showing PyTorch's threading model with application thread pool, inter-op thread pool, and intra-op thread pool for CPU inference alt-txt#center](./pytorch-threading.jpg "PyTorch threading model") The `torch.set_num_threads()` [API](https://docs.pytorch.org/docs/stable/generated/torch.set_num_threads.html) sets the maximum number of threads to spawn in the Application Thread Pool.