@@ -7,7 +7,7 @@ cascade:

minutes_to_complete: 30

who_is_this_for: Developers, performance engineers and system administrators looking to fine-tune the performance of their workload on many-core Arm-based systems.
who_is_this_for: Developers, performance engineers, and system administrators looking to fine-tune the performance of their workloads on many-core Arm-based systems.

learning_objectives:
- Create CPU sets and implement them directly in source code
@@ -1,20 +1,35 @@
---
title: Background Information
title: Thread pinning fundamentals
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Introduction
## CPU affinity

CPU affinity is the practice of binding a process or thread to a specific CPU core or set of cores. This tells the operating system scheduler where that work is allowed to run. By default, the Linux scheduler dynamically migrates threads across cores to balance load and maximize overall throughput. Pinning overrides this behavior by constraining execution to a chosen set of cores.
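You can check the affinity mask the scheduler has assigned to any process. For example, print the mask of your current shell (`$$` expands to its PID):

```bash
# An unpinned process typically reports the full core list, for example 0-15
taskset -cp $$
```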

CPU affinity is the practice of binding a process or thread to a specific CPU core or set of cores, telling the operating system scheduler where that work is allowed to run. By default the Linux scheduler dynamically migrates threads across cores to balance load and maximize overall throughput. Pinning overrides this behavior by constraining execution to a chosen set of cores.
## Pinning

Pinning is most often used as a fine-tuning technique for workloads that aim to consume as many CPU cycles as possible while running alongside other demanding applications. Scientific computing pipelines and real time analytics frequently fall into this category. Typical applications that pin processes to specific cores are often sensitive to latency variation rather than just average throughput or have intricate memory access patterns. Pinning can reduce this noise and provide more consistent execution behavior or better memory access patterns under load.
Pinning is most often used as a fine-tuning technique for workloads that aim to consume as many CPU cycles as possible while running alongside other demanding applications. Scientific computing pipelines and real-time analytics frequently fall into this category.

Another important motivation is memory locality. On modern systems with Non Uniform Memory Access architectures (NUMA), different cores have memory access times and characteristics depending on where the data is fetched from. For example, in a server with 2 CPU sockets, that from a programmers view appears as a single processor, would have different memory access times depending on the core. By pinning threads to cores that are close to the memory they use and allocating memory accordingly, an application can reduce memory access latency and improve bandwidth.
Applications that pin processes to specific cores are often sensitive to latency variation rather than just average throughput. They may also have intricate memory access patterns. Pinning can reduce execution noise and provide more consistent behavior or better memory access patterns under load.

Developers can set affinity directly in source code using system calls. Many parallel frameworks expose higher level controls such as OpenMP affinity settings that manage thread placement automatically. Alternatively, at runtime system administrators can pin existing processes using utilities like `taskset` or launch applications with `NUMACTL` to control both CPU and memory placement without modifying code.
## Memory locality

Pinning is a tradeoff. It can improve determinism and locality but it can also reduce flexibility and hurt performance if the chosen layout is suboptimal or if system load changes. Over constraining the scheduler may lead to idle cores while pinned threads contend unnecessarily. As a general rule it is best to rely on the operating system scheduler as a first pass and introduce pinning only if you are looking to fine-tune performance.
Memory locality provides another important motivation for pinning. On modern systems with Non-Uniform Memory Access (NUMA) architectures, different cores have varying memory access times and characteristics. The performance depends on where the data is fetched from.

For example, a server with two CPU sockets appears to software as a single pool of cores, but memory access times differ depending on which core accesses which region of memory. By pinning threads to cores close to the memory they use, and allocating that memory accordingly, you can reduce memory access latency and improve bandwidth.
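You can inspect the NUMA layout of a machine with `numactl`; the exact output varies by system:

```bash
# List each node's CPUs and memory, plus the inter-node distance matrix
numactl --hardware
```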

## Setting affinity

You can set affinity directly in source code using system calls. Many parallel frameworks expose higher-level controls, such as OpenMP affinity settings, that manage thread placement automatically.
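As a minimal sketch of the system-call route (Linux-specific, and not taken from this Learning Path's examples), the following program restricts itself to core 2 using `sched_setaffinity`:

```cpp
#include <sched.h>    // sched_setaffinity and the CPU_* macros (Linux-only)
#include <unistd.h>   // getpid
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);  // allow execution on core 2 only

    // pid 0 means "the calling process"; returns 0 on success
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    std::printf("pid %d is now pinned to core 2\n", (int)getpid());
    return 0;
}
```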

Alternatively, at runtime, system administrators can pin existing processes using utilities like `taskset` or launch applications with `numactl` to control both CPU and memory placement without modifying code.
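As a quick illustration of these runtime controls, the commands below use a hypothetical binary `./my_app` and placeholder core numbers:

```bash
# OpenMP: pin one thread per core, packed onto neighboring cores
OMP_PLACES=cores OMP_PROC_BIND=close ./my_app

# taskset: launch the program restricted to cores 0-3
taskset -c 0-3 ./my_app

# taskset: re-pin an already-running process by PID
taskset -cp 0,1 <pid>

# numactl: keep both execution and memory allocation on NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my_app
```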

## Conclusion

Pinning is a tradeoff. It can improve determinism and locality, but it can also reduce flexibility and hurt performance if the chosen layout isn't optimal or if system load changes. Over-constraining the scheduler may lead to idle cores while pinned threads contend unnecessarily.

As a general rule, rely on the operating system scheduler initially and introduce pinning only when you're looking to fine-tune performance after measuring baseline behavior.
@@ -1,5 +1,5 @@
---
title: Setup
title: Create a CPU-intensive program
weight: 3

### FIXED, DO NOT MODIFY
@@ -8,60 +8,58 @@ layout: learningpathall

## Setup

In this example we will be using an AWS Graviton 3 `m7g.4xlarge` instance running Ubuntu 22.04 LTS, based on the Arm Neoverse V1 architecture. If you are unfamiliar with creating a cloud instance, please refer to our [getting started learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/).
This Learning Path works on any Arm Linux system with four or more CPU cores.

This learning path is expected to work on any linux-based Arm instance with 4 or more CPU cores. The `m7g.4xlarge` instance has a uniform processor architecture so there is neglible different in memory or CPU core performance across the cores. On Linux, this can easily be checked with the following command.
For example, you can use an AWS Graviton 3 `m7g.4xlarge` instance running Ubuntu 24.04 LTS, based on the Arm Neoverse V1 architecture.

If you're unfamiliar with creating a cloud instance, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/).

The `m7g.4xlarge` instance has a uniform processor architecture, so there's no difference in memory or CPU core performance across the cores.

On Linux, you can check this with the following command:

```bash
lscpu | grep -i numa
```

For our `m7g.4xlarge` all 16 cores are in the same NUMA (non-uniform memory architecture) node.
For the `m7g.4xlarge`, all 16 cores are in the same NUMA node:

```out
```output
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
```

First we will demonstrate how we can pin threads easily using the `taskset` utility available in Linux. This is used to set or retrieve the CPU affinity of a running process or set the affinity of a process about to be launched. This does not require any modifications to the source code.
You'll first learn how to pin threads using the `taskset` utility available in Linux.

This utility sets or retrieves the CPU affinity of a running process or sets the affinity of a process about to be launched. This approach doesn't require any modifications to the source code.

## Install Prerequisites
## Install prerequisites

Run the following commands:
Run the following commands to install the required packages:

```bash
sudo apt update && sudo apt install g++ cmake python3.12-venv -y
sudo apt update && sudo apt install g++ cmake python3-venv python-is-python3 -y
```

Install Google's Microbenchmarking support library.
Install Google's microbenchmarking support library:

```bash
# Check out the library.
git clone https://github.com/google/benchmark.git
# Go to the library root directory
cd benchmark
# Make a build directory to place the build output.
cmake -E make_directory "build"
# Generate build system files with cmake, and download any dependencies.
cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ../
# Build the library.
sudo cmake --build "build" --config Release --target install -j $(nproc)
```
If you have issues building and installing, please refer to the [official installation guide](https://github.com/google/benchmark).

Finally, you will need to install the Linux perf utility for measuring performance. We recommend using our [install guide](https://learn.arm.com/install-guides/perf/). As you may need to build from source.
If you have issues building and installing, visit the [Benchmark repository](https://github.com/google/benchmark).

## Example 1
Finally, install the Linux perf utility for measuring performance. See the [Linux Perf install guide](/install-guides/perf/); you might need to build it from source.

To demonstrate a use case of CPU affinity, we will create a program that heavily utilizes all the available CPU cores. Create a file named `use_all_cores.cpp` and paste in the source code below. In this example, we are repeatedly calculating the [Leibniz equation](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) to compute the value of Pi. This is a computationally inefficient algorithm to calculate the value of Pi and we are splitting the work across many threads.
## Create a CPU-intensive example program

```bash
cd ~
touch use_all_cores.cpp && chmod 755 use_all_cores.cpp
```
To demonstrate CPU affinity, you'll create a program that heavily utilizes all available CPU cores. This example repeatedly calculates the [Leibniz equation](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) to compute the value of Pi. This is a computationally inefficient algorithm to calculate Pi, and you'll split the work across many threads.
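For reference, the series being computed is π/4 = 1 − 1/3 + 1/5 − 1/7 + ⋯. Because each term depends only on its own index, every thread can sum a disjoint range of terms independently, and the per-thread partial sums are added together at the end. This is what makes the work easy to split across threads.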

Use an editor to create a file named `use_all_cores.cpp` with the code below:

```cpp
#include <vector>
@@ -81,7 +79,6 @@ double multiplethreaded_leibniz(int terms, bool use_all_cores){
}
std::vector<double> partial_results(NUM_THREADS);


auto calculation = [&](int thread_id){
// Lambda function that does the calculation of the Leibniz equation
double denominator = 0.0;
Expand All @@ -100,7 +97,6 @@ double multiplethreaded_leibniz(int terms, bool use_all_cores){
}
};


std::vector<thread> threads;
for (int i = 0; i < NUM_THREADS; i++){
threads.push_back(std::thread(calculation, i));
@@ -139,18 +135,37 @@ int main(){
}
```

Compile the program with the following command.
Compile the program with the following command:

```bash
g++ -O2 --std=c++11 use_all_cores.cpp -o prog
```

In a separate terminal we can use the `top` utility to quickly view the utilization of each core. For example, run the following command and press the number `1`. Then we can run the program by entering `./prog`.
## Observe CPU utilization

Now that you've compiled the program, you can observe how it utilizes CPU cores. In a separate terminal, use the `top` utility to view the utilization of each core:

```bash
top -d 0.1
```

Press the number `1` to view per-core utilization.

Then run the program in the other terminal:

```bash
top -d 0.1 # then press 1 to view per core utilization
./prog
```

![CPU-utilization](./CPU-util.jpg)
![Screenshot of the top command showing CPU utilization with all 16 cores periodically reaching 100% usage, displayed in a dark terminal window with percentage bars for each CPU core](cpu_util.jpg "CPU utilization showing all cores being used")

You should observe all cores on your system being periodically utilized up to 100% and then dropping to idle until the program exits.
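As an alternative to `top`, you can use `mpstat` from the `sysstat` package (install it with `sudo apt install sysstat` if it isn't present) for a scrolling per-core view:

```bash
# Report utilization for every CPU once per second
mpstat -P ALL 1
```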

## What you've accomplished and what's next

In this section, you've:
- Set up an AWS Graviton 3 instance and installed the required tools
- Created a multi-threaded program that heavily utilizes all available CPU cores
- Observed how the program distributes work across cores using the `top` utility

As the screenshot above shows, you should observe all cores on your system being periodically utilized up to 100% and then down to idle until the program exits. In the next section we will look at how to bind this program to specific CPU cores when running alongside a single-threaded Python script.
In the next section, you'll learn how to bind this program to specific CPU cores when running alongside a single-threaded Python script.
@@ -1,16 +1,22 @@
---
title: CPU Affinity
title: Set CPU affinity in source code
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Pinning Threads at Source-Code Level
## Pin threads at the source code level

Another way to set CPU affinity is at the source code level, this allows developers to be more expressive as to which thread goes where at specific points during the runtime. For example, in a hot path that repeatedly updates shared state with a read-modify-write style, a pinned thread could avoids excessive cache invalidations due to other threads modifying data.
Another way to set CPU affinity is at the source code level. This allows you to be more expressive about which thread goes where at specific points during runtime.

To demonstrate this we have an example program below. Copy and paste the code below into a new file named `default_os_scheduling.cpp`.
For example, in a hot path that repeatedly updates shared state with a read-modify-write pattern, a pinned thread can avoid excessive cache invalidations caused by other threads modifying data.
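As a minimal sketch of the mechanism (using a hypothetical `pin_to_core` helper; Linux/glibc only), pinning a thread from inside the code looks like this:

```cpp
#include <pthread.h>
#include <iostream>
#include <thread>

// Hypothetical helper: pin the calling thread to a single core using the
// GNU extension pthread_setaffinity_np (declared in <pthread.h>).
void pin_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    int rc = pthread_setaffinity_np(pthread_self(),
                                    sizeof(cpu_set_t), &cpuset);
    if (rc != 0) {
        std::cerr << "pthread_setaffinity_np failed: " << rc << "\n";
    }
}

int main() {
    std::thread worker([] {
        pin_to_core(2);  // this thread may now run only on core 2
        // ...a hot read-modify-write loop would go here...
    });
    worker.join();
    return 0;
}
```

Compile a sketch like this with `g++ sketch.cpp -lpthread`.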

## Create a baseline program without thread pinning

To demonstrate this, you'll create two example programs. The first uses the default OS scheduling without thread pinning.

Copy and paste the code below into a new file named `default_os_scheduling.cpp`:

```cpp
#include <benchmark/benchmark.h>
@@ -70,9 +76,11 @@ BENCHMARK(default_os_scheduling)->UseRealTime()->Unit(benchmark::kMillisecond);
BENCHMARK_MAIN();
```

`default_os_scheduling.cpp` has 2 atomic variables that are aligned on different cache lines to avoid thrashing. We spawn 4 threads, with 2 threads performing a read-modify-wite operation on the first atomic variable, and the final 2 threads performing the same operation on the second atomic variable.
This program has two atomic variables that are aligned on different cache lines to avoid thrashing. You spawn four threads: two threads perform a read-modify-write operation on the first atomic variable, and two threads perform the same operation on the second atomic variable.
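Since the full listing is collapsed in this diff, here is a standalone sketch of that layout, using assumed names `a` and `b` for the two atomics:

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// alignas(64) places each atomic on its own 64-byte cache line, so the
// pair of threads updating `a` never contends on the same line as the
// pair updating `b`.
alignas(64) std::atomic<uint64_t> a{0};
alignas(64) std::atomic<uint64_t> b{0};

int main() {
    const int iterations = 1000000;
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        // Threads 0 and 1 update `a`; threads 2 and 3 update `b`
        std::atomic<uint64_t>* target = (i < 2) ? &a : &b;
        threads.emplace_back([target, iterations] {
            for (int n = 0; n < iterations; ++n) {
                target->fetch_add(1, std::memory_order_relaxed);  // read-modify-write
            }
        });
    }
    for (auto& t : threads) t.join();
    std::cout << "a=" << a << " b=" << b << "\n";
}
```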

Now, copy the code block below into a file named `thread_affinity.cpp`.
## Create a program with explicit thread pinning

Now, copy the code below into a new file named `thread_affinity.cpp`:

```cpp
#include <benchmark/benchmark.h>
@@ -150,25 +158,32 @@ BENCHMARK(thread_affinity)->UseRealTime()->Unit(benchmark::kMillisecond);
BENCHMARK_MAIN();
```

`Thread_affinity.cpp` uses the `pthread_set_affinity_np` function from the `pthread.h` header file to pin the 2 threads operating on atomic variable, `a`, to a specific CPU set and the other threads operating on atomic variable, `b`, to a different CPU.
This program uses the `pthread_setaffinity_np` function from the `pthread.h` header file to pin threads. The two threads operating on atomic variable `a` are pinned to a specific CPU set, and the other threads operating on atomic variable `b` are pinned to a different CPU.

## Compile and benchmark the programs

Compile both programs with the following command.
Compile both programs with the following commands:

```bash
g++ default_os_scheduling.cpp -O3 -march=native -lbenchmark -lpthread -o default-os-scheduling
g++ thread_affinity.cpp -O3 -march=native -lbenchmark -lpthread -o thread-affinity
```

We will use the `perf` tool to print statistic for the program.
Use Perf to print statistics for both programs:

```bash
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./default-os-scheduling
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./thread-affinity
```

Inspecting the output below we see that the `L1-dcache-load-misses` which occur when the the CPU core does not have a up-to-date version of the data in the L1 Data cache and must perform an expensive operation to fetch data from a different location, reduces from ~7.84% to ~0.6% as a result of the thread pinning. This results in a huge reduction in function execution time, dropping from 10.7ms to 3.53ms.
## Analyze the performance results

Inspecting the output below, you can see that the `L1-dcache-load-misses` metric (which occurs when the CPU core doesn't have an up-to-date version of the data in the L1 data cache and must perform an expensive operation to fetch data from a different location) reduces from approximately 7.84% to approximately 0.6% as a result of thread pinning.

This results in a significant reduction in function execution time, dropping from 10.7 ms to 3.53 ms:

```output
Running ./default-os-scheduling
Run on (16 X 2100 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x16)
@@ -217,8 +232,16 @@ thread_affinity/real_time 3.53 ms 0.343 ms 198
0.169065000 seconds sys
```

### Conclusion
The results demonstrate that thread pinning can significantly improve performance when threads operate on separate data structures. By keeping threads on specific cores, you reduce cache coherency traffic and improve data locality.

## What you've accomplished and what's next

In this section, you've:
- Created two programs to compare default OS scheduling against explicit thread pinning
- Used the `pthread_setaffinity_np` API to control CPU affinity at the source code level
- Measured cache performance using Perf to quantify the impact of thread pinning
- Observed a performance improvement and a reduction in cache misses through strategic thread placement

In this tutorial, we introduced thread pinning (CPU affinity) through a pair of worked examples. By comparing default OS thread scheduling against explicitly pinned threads, we showed how controlling where threads run can reduce cache disruption in contention-heavy paths and improve runtime stability and performance.
You've seen how controlling where threads run can reduce cache disruption in contention-heavy paths and improve runtime stability and performance. You've also learned about the trade-offs: pinning can boost locality and predictability, but it can hurt performance of other running processes, especially if the workload characteristics change or if you over-constrain the scheduler.

We also highlighted the tradeoffs, pinning can boost locality and predictability, but it can hurt performance of other running processes, espec. Finally, we showed how to implement affinity quickly using common system utilities for inspection and measurement, and how to be more expressive directly in code using APIs like `pthread_setaffinity_np` from `pthread.h`.
Thread pinning is most effective when you have well-understood workload patterns and clear separation between data structures accessed by different threads. Use it as a fine-tuning technique after establishing baseline performance with default OS scheduling.