---
title: Tune LLM CPU inference performance with multithreading

minutes_to_complete: 30

who_is_this_for: This is an introductory topic for ML engineers optimizing LLM inference performance on Arm CPUs.

learning_objectives:
- Understand how PyTorch uses multiple threads for CPU inference
- Measure the performance impact of thread count on LLM inference
- Tune thread count to optimize inference for specific models and systems

prerequisites:

weight: 3
layout: learningpathall
---

## Understanding threading trade-offs in CPU inference

A well-known challenge in parallel programming is choosing the right number of threads for a given amount of work. When multiple threads are created to perform a task, the actual computation must be large enough to justify the overhead of coordinating those threads.

If a computation is split across many threads, the costs of creating the threads and synchronizing their results through shared memory can easily outweigh any performance gains from parallel execution. The same principle applies to generative AI workloads running on CPU.
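To see this trade-off in isolation before moving to an LLM workload, you can time the same matrix multiplication at a few intra-op thread counts. The snippet below is a minimal standalone sketch (the matrix sizes, thread counts, and repeat count are arbitrary choices, not values used elsewhere in this Learning Path); on a many-core system the small matrix often runs fastest well below the default thread count:

```python
import time
import torch


def avg_time_ms(size: int, threads: int, repeats: int = 20) -> float:
    """Average wall-clock time in ms for a square matmul at a given thread count."""
    torch.set_num_threads(threads)
    x = torch.rand(size, size)
    y = torch.rand(size, size)
    torch.mm(x, y)  # warm-up run
    start = time.perf_counter()
    for _ in range(repeats):
        torch.mm(x, y)
    return (time.perf_counter() - start) / repeats * 1000


default_threads = torch.get_num_threads()  # defaults to the core count
for threads in (1, 4, 16, default_threads):
    small = avg_time_ms(256, threads)
    large = avg_time_ms(4096, threads)
    print(f"{threads:3d} threads: 256x256 {small:6.3f} ms, 4096x4096 {large:8.2f} ms")
```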

## Multithreading with PyTorch on CPU

When running inference, PyTorch uses an Application Thread Pool. PyTorch supports two types of parallelism: inter-op parallelism spawns threads to run separate operations in a graph in parallel (for example, one thread for a matrix multiplication and another thread for a softmax), while intra-op parallelism spawns multiple threads to work on the same operation.

The diagram below shows PyTorch's threading model from the [PyTorch documentation](https://docs.pytorch.org/docs/stable/index.html).

![Diagram showing PyTorch's threading model with application thread pool, inter-op thread pool, and intra-op thread pool for CPU inference#center](./pytorch-threading.jpg "PyTorch threading model")

The `torch.set_num_threads()` [API](https://docs.pytorch.org/docs/stable/generated/torch.set_num_threads.html) sets the maximum number of threads to spawn in the Application Thread Pool.
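As a quick illustration of the two pools, the snippet below configures both from Python. The values 4 and 16 are example settings only, and the inter-op pool can only be resized before any parallel work has started:

```python
import torch

# Inter-op pool: threads that run independent operators concurrently.
# Set it first; it cannot be changed after parallel work has started.
torch.set_num_interop_threads(4)

# Intra-op pool: threads that cooperate on a single operator,
# such as one large matrix multiplication.
torch.set_num_threads(16)

print("inter-op threads:", torch.get_num_interop_threads())
print("intra-op threads:", torch.get_num_threads())
```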

As of PyTorch 2.8.0, the default number of threads equals the number of CPU cores (see the [PyTorch CPU Threading Documentation](https://docs.pytorch.org/docs/2.8/notes/cpu_threading_torchscript_inference.html) for more detail). PyTorch determines the ideal number of threads based on the workload size, as shown in this code snippet from [ParallelOpenMP.h](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h):

```cpp
int64_t num_threads = omp_get_num_threads();
```

```output
ATen parallel backend: OpenMP
```

PyTorch uses all 96 cores, and the execution time is 2.24 ms.
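The `ATen parallel backend` line in the output above is part of PyTorch's parallel configuration report. To confirm the active settings in your own scripts, you can print the same report with `torch.__config__.parallel_info()`:

```python
import torch

# Print ATen's parallel configuration: intra-op thread count, OpenMP/MKL
# build settings, relevant environment variables, and the parallel backend.
print(torch.__config__.parallel_info())
```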

Now reduce the number of OpenMP threads using the `OMP_NUM_THREADS` environment variable:

```bash
OMP_NUM_THREADS=16 python pytorch_omp_example.py
```

```output
ATen parallel backend: OpenMP
```

The execution time varies with the number of threads and the processor type in your system.
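To compare several settings without retyping the command, you can use a small driver that re-runs the example once per thread count. This is a convenience sketch rather than part of the Learning Path; it assumes `pytorch_omp_example.py` is in the current directory, and you should adjust the thread list to match your core count:

```python
import os
import subprocess

# Re-run the example once per thread count; each run prints its own timing
# and parallel-configuration report.
for threads in (2, 4, 8, 16, 32, 64, 96):
    print(f"--- OMP_NUM_THREADS={threads} ---")
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    subprocess.run(["python", "pytorch_omp_example.py"], env=env, check=True)
```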

## What you've accomplished and what's next

You've learned how PyTorch manages threads for CPU inference and seen how thread count affects performance in a simple example. The optimal thread count depends on both the workload size and system architecture.

Next, you'll apply these concepts to a more realistic workload by tuning thread settings for large language model inference.

weight: 2
layout: learningpathall
---

## Before you begin

Before you can tune PyTorch threading for LLM inference on Arm CPUs, you need to set up your development environment with Docker, PyTorch, and access to the Gemma-3 models from Hugging Face. This section walks you through creating your Hugging Face account, configuring an Arm server, and running the PyTorch container with all necessary dependencies.

{{% notice Note %}}
This Learning Path uses Arm's downstream canary release of PyTorch, which includes ready-to-use examples and scripts. This release provides access to the latest features but is intended for experimentation rather than production use.
{{% /notice %}}

## Create a Hugging Face account

## Log in to Hugging Face

Create a new Read token on Hugging Face by navigating to [Create new Access Token](https://huggingface.co/settings/tokens/new?tokenType=read).

![Screenshot of Hugging Face token creation interface showing a dialog box with fields for token name and type, with the 'Read' option selected and a 'Create token' button visible#center](./hf-access-token.jpg "Hugging Face token creation interface")

Provide a token name, create the token, and copy the generated value. From within the Docker container, run the following command and paste the token when prompted:

```bash
huggingface-cli login
```

You'll see messages confirming that the token is valid and that you're logged in.

Be aware that the login doesn't persist after the Docker container exits. You'll need to log in again if you restart the container.
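If you restart the container frequently, one workaround is to pass the token in as an environment variable (for example, `docker run -e HF_TOKEN=...`) and log in non-interactively. This is an optional convenience rather than a required step; `HF_TOKEN` is the variable name the `huggingface_hub` library recognizes:

```python
import os

from huggingface_hub import login

# Authenticate this process with the Hugging Face Hub using a token supplied
# through the environment, so the secret is never hard-coded in a script.
login(token=os.environ["HF_TOKEN"])
```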

## What you've accomplished and what's next

You've set up your environment with:
- A Hugging Face account with access to the Gemma-3 models
- An Arm server or cloud instance with Docker installed
- The PyTorch-aarch64 container running and authenticated

You're now ready to run LLM inference experiments and measure how thread count affects performance.

weight: 4
layout: learningpathall
---

## Run inference experiments with different thread counts

Now that you understand how PyTorch threading works and have your environment configured, you're ready to tune thread settings for actual LLM inference workloads. This section shows you how to measure inference performance across different thread counts using Google's Gemma-3 models on Arm CPUs. You'll run experiments with both the 270M and 1B parameter variants to understand how model size affects optimal thread configuration.

The experiments use the `transformers_llm_text_gen.py` script to run inference on Google's [Gemma-3](https://huggingface.co/google/gemma-3-1b-it) models; the script applies groupwise, layout-aware INT4 quantization by default.
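If you want to see the shape of the measurement before running the packaged script, the sketch below is a minimal eager-mode version using the Hugging Face `transformers` API: it sets the intra-op thread count, generates a short completion with the 1B instruction-tuned model, and reports end-to-end tokens per second. It does not apply the INT4 quantization that `transformers_llm_text_gen.py` uses, and the prompt, dtype, and token count are arbitrary choices:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(16)  # the value you are experimenting with

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

prompt = "Explain why thread count affects CPU inference performance."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s at {torch.get_num_threads()} threads")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because this sketch times a single end-to-end generation, it reports combined prefill and decode throughput; the packaged script reports the two phases separately.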

Create a file named `comparison-1b.sh` with the following script:

The graph below shows how prefill tokens per second change with the number of OpenMP threads for the 270M and 1B variants of Gemma-3:

![Line graph comparing prefill throughput performance of Gemma-3 270M and 1B models across different thread counts from 2 to 96. The y-axis shows tokens per second (0-3000), and the x-axis shows number of OpenMP threads. Both lines peak at 16-32 threads, with the 270M model achieving higher throughput but showing a steeper decline after peak performance#center](./prefill_throughput.png "Prefill throughput versus thread count for Gemma-3 models")

As expected, the smaller 270M model runs faster. Both models reach their optimal token generation rate at around 16 to 32 threads, though the 270M model exhibits a sharper performance drop-off beyond this range compared with the 1B variant.

## Use PyTorch compilation mode

The examples so far have used PyTorch's eager execution mode. PyTorch's compile mode can provide additional performance improvements.

Before testing compile mode, install a C++ compiler and dependencies:

```bash
sudo apt update && sudo apt install g++ python3.10-dev build-essential -y
```
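As a rough sketch of what enabling compilation involves (this is not the `transformers_llm_text_gen.py` flow; the model ID, prompt, and token counts are placeholder choices), you wrap the model's forward pass with `torch.compile()` and generate as before. The default Inductor backend generates and compiles C++ code at runtime, which is why the toolchain above is required:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_num_threads(16)

model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

# Compile the forward pass; generate() then runs the compiled version.
# The first call is slower while Inductor compiles, so warm up before timing.
model.forward = torch.compile(model.forward)

inputs = tokenizer("Explain torch.compile in one sentence.", return_tensors="pt")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)  # warm-up / compile
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```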

Reducing the thread count from the default of 96 to 16 significantly shortens the end-to-end generation time.

## What you've accomplished and what's next

You've explored how the number of OpenMP threads impacts LLM inference performance on Arm CPUs and learned that:

- Default thread settings on many-core systems don't always provide optimal performance
- Smaller models typically benefit from fewer threads because of lower synchronization overhead
- The optimal thread count depends on both model size and system architecture
- PyTorch's compile mode provides additional performance improvements when combined with thread tuning

For your specific workloads, experiment with different thread counts to find the optimal setting. Start with powers of 2 (8, 16, 32) and measure the actual throughput and latency for your use case. The performance characteristics you observed in this Learning Path apply to other LLM inference workloads on Arm CPUs.