diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/_index.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/_index.md
index 112aa01d24..f944c3da5b 100644
--- a/content/learning-paths/servers-and-cloud-computing/pinning-threads/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/_index.md
@@ -7,7 +7,7 @@ cascade:
minutes_to_complete: 30

-who_is_this_for: Developers, performance engineers and system administrators looking to fine-tune the performance of their workload on many-core Arm-based systems.
+who_is_this_for: Developers, performance engineers, and system administrators who want to fine-tune the performance of their workloads on many-core Arm-based systems.

learning_objectives:
-    - Create CPU Sets and implement directly into sourcecode
+    - Create CPU sets and apply them directly in source code
diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/background_info.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/background_info.md
index d108a2a3a6..f74fa74d7f 100644
--- a/content/learning-paths/servers-and-cloud-computing/pinning-threads/background_info.md
+++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/background_info.md
@@ -1,20 +1,35 @@
---
-title: Background Information
+title: Thread pinning fundamentals
weight: 2

### FIXED, DO NOT MODIFY
layout: learningpathall
---

-## Introduction
+## CPU affinity
+CPU affinity is the practice of binding a process or thread to a specific CPU core or set of cores. This tells the operating system scheduler where that work is allowed to run. By default, the Linux scheduler dynamically migrates threads across cores to balance load and maximize overall throughput. Pinning overrides this behavior by constraining execution to a chosen set of cores.

-CPU affinity is the practice of binding a process or thread to a specific CPU core or set of cores, telling the operating system scheduler where that work is allowed to run. By default the Linux scheduler dynamically migrates threads across cores to balance load and maximize overall throughput. Pinning overrides this behavior by constraining execution to a chosen set of cores.
+## Pinning

-Pinning is most often used as a fine-tuning technique for workloads that aim to consume as many CPU cycles as possible while running alongside other demanding applications. Scientific computing pipelines and real time analytics frequently fall into this category. Typical applications that pin processes to specific cores are often sensitive to latency variation rather than just average throughput or have intricate memory access patterns. Pinning can reduce this noise and provide more consistent execution behavior or better memory access patterns under load.
+Pinning is most often used as a fine-tuning technique for workloads that aim to consume as many CPU cycles as possible while running alongside other demanding applications. Scientific computing pipelines and real-time analytics frequently fall into this category.

-Another important motivation is memory locality. On modern systems with Non Uniform Memory Access architectures (NUMA), different cores have memory access times and characteristics depending on where the data is fetched from. For example, in a server with 2 CPU sockets, that from a programmers view appears as a single processor, would have different memory access times depending on the core.
By pinning threads to cores that are close to the memory they use and allocating memory accordingly, an application can reduce memory access latency and improve bandwidth.
+Applications that pin processes to specific cores are often sensitive to latency variation rather than just average throughput. They may also have intricate memory access patterns. Pinning can reduce execution noise and provide more consistent behavior or better memory access patterns under load.

-Developers can set affinity directly in source code using system calls. Many parallel frameworks expose higher level controls such as OpenMP affinity settings that manage thread placement automatically. Alternatively, at runtime system administrators can pin existing processes using utilities like `taskset` or launch applications with `NUMACTL` to control both CPU and memory placement without modifying code.
+## Memory locality

-Pinning is a tradeoff. It can improve determinism and locality but it can also reduce flexibility and hurt performance if the chosen layout is suboptimal or if system load changes. Over constraining the scheduler may lead to idle cores while pinned threads contend unnecessarily. As a general rule it is best to rely on the operating system scheduler as a first pass and introduce pinning only if you are looking to fine-tune performance.
+Memory locality provides another important motivation for pinning. On modern systems with Non-Uniform Memory Access (NUMA) architectures, different cores have varying memory access times and characteristics. The performance depends on where the data is fetched from.
+
+For example, a server with two CPU sockets presents a single shared-memory system to software, but memory access times differ depending on which core accesses the data. By pinning threads to cores that are close to the memory they use and allocating memory accordingly, you can reduce memory access latency and improve bandwidth.
+
+## Setting affinity
+
+You can set affinity directly in source code using system calls. Many parallel frameworks expose higher-level controls, such as OpenMP affinity settings, that manage thread placement automatically.
+
+Alternatively, at runtime, system administrators can pin existing processes using utilities like `taskset` or launch applications with `numactl` to control both CPU and memory placement without modifying code.
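+To make the system-call route concrete, the sketch below pins the calling process to core 0 using `sched_setaffinity`. This is a minimal illustration, not one of this Learning Path's own listings:
+
+```cpp
+#include <sched.h>
+#include <cstdio>
+
+int main() {
+    cpu_set_t mask;
+    CPU_ZERO(&mask);
+    CPU_SET(0, &mask);  // allow execution on core 0 only
+
+    // A pid of 0 means "the calling process".
+    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
+        std::perror("sched_setaffinity");
+        return 1;
+    }
+    // From here on, the scheduler keeps this process on core 0.
+    return 0;
+}
+```
+
+The equivalent no-code approach is `taskset --cpu-list 0 ./my_app`, or `numactl --cpunodebind=0 --membind=0 ./my_app` to control memory placement as well (the program name is a placeholder).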
+
+## Conclusion
+
+Pinning is a tradeoff. It can improve determinism and locality, but it can also reduce flexibility and hurt performance if the chosen layout isn't optimal or if system load changes. Over-constraining the scheduler may lead to idle cores while pinned threads contend unnecessarily.
+
+As a general rule, rely on the operating system scheduler initially and introduce pinning only when you're looking to fine-tune performance after measuring baseline behavior.
diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/setup.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/setup.md
index aa5e17f9a2..d3eb2507ef 100644
--- a/content/learning-paths/servers-and-cloud-computing/pinning-threads/setup.md
+++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/setup.md
@@ -1,5 +1,5 @@
---
-title: Setup
+title: Create a CPU-intensive program
weight: 3

### FIXED, DO NOT MODIFY
@@ -8,60 +8,58 @@ layout: learningpathall

## Setup

-In this example we will be using an AWS Graviton 3 `m7g.4xlarge` instance running Ubuntu 22.04 LTS, based on the Arm Neoverse V1 architecture. If you are unfamiliar with creating a cloud instance, please refer to our [getting started learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/).
+This Learning Path works on any Arm Linux system with four or more CPU cores.

-This learning path is expected to work on any linux-based Arm instance with 4 or more CPU cores. The `m7g.4xlarge` instance has a uniform processor architecture so there is neglible different in memory or CPU core performance across the cores. On Linux, this can easily be checked with the following command.
+For example, you can use an AWS Graviton3 `m7g.4xlarge` instance running Ubuntu 24.04 LTS, based on Arm Neoverse V1 cores.
+
+If you're unfamiliar with creating a cloud instance, refer to [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/).
+
+The `m7g.4xlarge` instance has a uniform processor architecture, so there's no difference in memory or CPU core performance across the cores.
+
+On Linux, you can check this with the following command:

```bash
lscpu | grep -i numa
```

-For our `m7g.4xlarge` all 16 cores are in the same NUMA (non-uniform memory architecture) node.
+For the `m7g.4xlarge`, all 16 cores are in the same NUMA node:

-```out
+```output
NUMA node(s):          1
NUMA node0 CPU(s):     0-15
```

-First we will demonstrate how we can pin threads easily using the `taskset` utility available in Linux. This is used to set or retrieve the CPU affinity of a running process or set the affinity of a process about to be launched. This does not require any modifications to the source code.
+You'll first learn how to pin threads using the `taskset` utility available in Linux.
+This utility sets or retrieves the CPU affinity of a running process or sets the affinity of a process about to be launched. This approach doesn't require any modifications to the source code.

-## Install Prerequisites
+## Install prerequisites

-Run the following commands:
+Run the following commands to install the required packages:

```bash
-sudo apt update && sudo apt install g++ cmake python3.12-venv -y
+sudo apt update && sudo apt install g++ cmake python3-venv python-is-python3 -y
```

-Install Google's Microbenchmarking support library.
+Install Google's microbenchmarking support library:

```bash
-# Check out the library.
git clone https://github.com/google/benchmark.git
-# Go to the library root directory
cd benchmark
-# Make a build directory to place the build output.
cmake -E make_directory "build"
-# Generate build system files with cmake, and download any dependencies.
cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ../
-# or, starting with CMake 3.13, use a simpler form:
-# Build the library.
sudo cmake --build "build" --config Release --target install -j $(nproc)
```

-If you have issues building and installing, please refer to the [official installation guide](https://github.com/google/benchmark).
-Finally, you will need to install the Linux perf utility for measuring performance. We recommend using our [install guide](https://learn.arm.com/install-guides/perf/). As you may need to build from source.
+If you have issues building and installing, visit the [Benchmark repository](https://github.com/google/benchmark).

-## Example 1
+Finally, install the Linux perf utility for measuring performance. See the [Linux Perf install guide](/install-guides/perf/); you may need to build perf from source.
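+To confirm that the Benchmark library installed correctly, you can optionally build and run a trivial benchmark. This is a minimal sketch; the file name `benchmark_check.cpp` is illustrative and not part of the Learning Path:
+
+```cpp
+#include <benchmark/benchmark.h>
+
+// Trivial benchmark used only to confirm the library links and runs.
+static void BM_Noop(benchmark::State& state) {
+  for (auto _ : state) {
+    benchmark::DoNotOptimize(0);  // keep the loop from being optimized away
+  }
+}
+BENCHMARK(BM_Noop);
+
+BENCHMARK_MAIN();
+```
+
+Compile and run it with `g++ benchmark_check.cpp -O2 -lbenchmark -lpthread -o benchmark_check && ./benchmark_check`; a small timing table indicates a working installation.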
-To demonstrate a use case of CPU affinity, we will create a program that heavily utilizes all the available CPU cores. Create a file named `use_all_cores.cpp` and paste in the source code below. In this example, we are repeatedly calculating the [Leibniz equation](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) to compute the value of Pi. This is a computationally inefficient algorithm to calculate the value of Pi and we are splitting the work across many threads.
+## Create a CPU-intensive example program

-```bash
-cd ~
-touch use_all_cores.cpp && chmod 755 use_all_cores.cpp
-```

+To demonstrate CPU affinity, you'll create a program that heavily utilizes all available CPU cores. This example repeatedly calculates the [Leibniz formula](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) to compute the value of Pi. This is a computationally inefficient algorithm to calculate Pi, and you'll split the work across many threads.
+Use an editor to create a file named `use_all_cores.cpp` with the code below (the diff shows only fragments of the listing; a condensed sketch of the computation appears at the end of this section):

```cpp
#include
@@ -81,7 +79,6 @@ double multiplethreaded_leibniz(int terms, bool use_all_cores){
    }

    std::vector<double> partial_results(NUM_THREADS);
-
    auto calculation = [&](int thread_id){
        // Lambda function that does the calculation of the Leibniz equation
        double denominator = 0.0;
@@ -100,7 +97,6 @@ double multiplethreaded_leibniz(int terms, bool use_all_cores){
        }
    };
-
    std::vector<std::thread> threads;
    for (int i = 0; i < NUM_THREADS; i++){
        threads.push_back(std::thread(calculation, i));
@@ -139,18 +135,37 @@ int main(){
}
```

-Compile the program with the following command.
+Compile the program with the following command:

```bash
g++ -O2 --std=c++11 use_all_cores.cpp -o prog
```

-In a separate terminal we can use the `top` utility to quickly view the utilization of each core. For example, run the following command and press the number `1`. Then we can run the program by entering `./prog`.
+## Observe CPU utilization
+
+Now that you've compiled the program, you can observe how it utilizes CPU cores. In a separate terminal, use the `top` utility to view the utilization of each core:
+
+```bash
+top -d 0.1
+```
+
+Press the number `1` to view per-core utilization.
+
+Then run the program in the other terminal:

```bash
-top -d 0.1 # then press 1 to view per core utilization
+./prog
```

-![CPU-utilization](./CPU-util.jpg)
+![Screenshot of the top command showing CPU utilization with all 16 cores periodically reaching 100% usage, displayed in a dark terminal window with percentage bars for each CPU core](cpu_util.jpg "CPU utilization showing all cores being used")
+
+You should observe all cores on your system being periodically utilized up to 100% and then dropping to idle until the program exits.
+
+## What you've accomplished and what's next
+
+In this section, you've:
+- Set up an Arm Linux system and installed the required tools
+- Created a multi-threaded program that heavily utilizes all available CPU cores
+- Observed how the program distributes work across cores using the `top` utility

-As the screenshot above shows, you should observe all cores on your system being periodically utilized up to 100% and then down to idle until the program exits. In the next section we will look at how to bind this program to specific CPU cores when running alongside a single-threaded Python script.
+In the next section, you'll learn how to bind this program to specific CPU cores when running alongside a single-threaded Python script.
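+As noted above, the diff elides most of `use_all_cores.cpp`. For orientation, here is a condensed sketch of the same idea; it is not the exact Learning Path listing, and the term count is arbitrary:
+
+```cpp
+#include <cstdio>
+#include <thread>
+#include <vector>
+
+// Split the Leibniz series 1 - 1/3 + 1/5 - ... = pi/4 across threads.
+int main() {
+    const int num_threads = std::thread::hardware_concurrency();
+    const long terms = 400000000;
+    std::vector<double> partial(num_threads, 0.0);
+    std::vector<std::thread> workers;
+
+    for (int t = 0; t < num_threads; t++) {
+        workers.emplace_back([&partial, t, terms, num_threads] {
+            double sum = 0.0;
+            // Each thread sums an interleaved slice of the series.
+            for (long k = t; k < terms; k += num_threads) {
+                double term = 1.0 / (2.0 * k + 1.0);
+                sum += (k % 2 == 0) ? term : -term;
+            }
+            partial[t] = sum;
+        });
+    }
+    for (auto& w : workers) w.join();
+
+    double pi = 0.0;
+    for (double p : partial) pi += 4.0 * p;
+    std::printf("pi ~= %.5f\n", pi);
+    return 0;
+}
+```
+
+Every thread is CPU-bound until its slice completes, which is what makes this kind of program a useful stressor for the pinning experiments that follow.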
diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/thread_affinity.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/thread_affinity.md
index 781eb808ae..08ff97b988 100644
--- a/content/learning-paths/servers-and-cloud-computing/pinning-threads/thread_affinity.md
+++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/thread_affinity.md
@@ -1,16 +1,22 @@
---
-title: CPU Affinity
+title: Set CPU affinity in source code
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

-## Pinning Threads at Source-Code Level
+## Pin threads at the source code level

-Another way to set CPU affinity is at the source code level, this allows developers to be more expressive as to which thread goes where at specific points during the runtime. For example, in a hot path that repeatedly updates shared state with a read-modify-write style, a pinned thread could avoids excessive cache invalidations due to other threads modifying data.
+Another way to set CPU affinity is at the source code level. This allows you to be more expressive about which thread goes where at specific points during runtime.

-To demonstrate this we have an example program below. Copy and paste the code below into a new file named `default_os_scheduling.cpp`.
+For example, in a hot path that repeatedly updates shared state with a read-modify-write pattern, a pinned thread can avoid excessive cache invalidations caused by other threads modifying data.
+
+## Create a baseline program without thread pinning
+
+To demonstrate this, you'll create two example programs. The first uses the default OS scheduling without thread pinning.
+
+Copy and paste the code below into a new file named `default_os_scheduling.cpp`:

```cpp
#include
@@ -70,9 +76,11 @@
BENCHMARK(default_os_scheduling)->UseRealTime()->Unit(benchmark::kMillisecond);

BENCHMARK_MAIN();
```

-`default_os_scheduling.cpp` has 2 atomic variables that are aligned on different cache lines to avoid thrashing. We spawn 4 threads, with 2 threads performing a read-modify-wite operation on the first atomic variable, and the final 2 threads performing the same operation on the second atomic variable.
+This program has two atomic variables that are aligned on different cache lines to avoid thrashing. You spawn four threads: two threads perform a read-modify-write operation on the first atomic variable, and two threads perform the same operation on the second atomic variable.

-Now, copy the code block below into a file named `thread_affinity.cpp`.
+## Create a program with explicit thread pinning
+
+Now, copy the code below into a new file named `thread_affinity.cpp`:

```cpp
#include
@@ -150,25 +158,32 @@
BENCHMARK(thread_affinity)->UseRealTime()->Unit(benchmark::kMillisecond);

BENCHMARK_MAIN();
```

-`Thread_affinity.cpp` uses the `pthread_set_affinity_np` function from the `pthread.h` header file to pin the 2 threads operating on atomic variable, `a`, to a specific CPU set and the other threads operating on atomic variable, `b`, to a different CPU.
+This program uses the `pthread_setaffinity_np` function from the `pthread.h` header file to pin threads. The two threads operating on atomic variable `a` are pinned to a specific CPU set, and the other threads operating on atomic variable `b` are pinned to a different CPU.
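+In isolation, the pinning call works as in the sketch below: build a `cpu_set_t` mask and apply it to a thread. This is a minimal illustration (the core number is arbitrary), not the benchmark listing itself:
+
+```cpp
+#include <pthread.h>
+#include <sched.h>
+#include <cstdio>
+#include <thread>
+
+// Pin the calling thread to a single core.
+static void pin_to_core(int core_id) {
+    cpu_set_t cpuset;
+    CPU_ZERO(&cpuset);
+    CPU_SET(core_id, &cpuset);
+    int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
+    if (rc != 0) {
+        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
+    }
+}
+
+int main() {
+    std::thread worker([] {
+        pin_to_core(1);  // this thread may now run only on core 1
+        // ... hot-path work goes here ...
+    });
+    worker.join();
+    return 0;
+}
+```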
+
+## Compile and benchmark the programs
-Compile both programs with the following command.
+Compile both programs with the following commands:

```bash
g++ default_os_scheduling.cpp -O3 -march=native -lbenchmark -lpthread -o default-os-scheduling
g++ thread_affinity.cpp -O3 -march=native -lbenchmark -lpthread -o thread-affinity
```

-We will use the `perf` tool to print statistic for the program.
+Use Perf to print statistics for both programs:

```bash
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./default-os-scheduling
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./thread-affinity
```

-Inspecting the output below we see that the `L1-dcache-load-misses` which occur when the the CPU core does not have a up-to-date version of the data in the L1 Data cache and must perform an expensive operation to fetch data from a different location, reduces from ~7.84% to ~0.6% as a result of the thread pinning. This results in a huge reduction in function execution time, dropping from 10.7ms to 3.53ms.
+## Analyze the performance results

-```outputRunning ./default-os-scheduling
+Inspecting the output below, you can see that the `L1-dcache-load-misses` metric (which occurs when the CPU core doesn't have an up-to-date version of the data in the L1 data cache and must perform an expensive operation to fetch data from a different location) reduces from approximately 7.84% to approximately 0.6% as a result of thread pinning.
+
+This results in a significant reduction in function execution time, dropping from 10.7 ms to 3.53 ms:
+
+```output
+Running ./default-os-scheduling
Run on (16 X 2100 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB (x16)
@@ -217,8 +232,16 @@
thread_affinity/real_time        3.53 ms        0.343 ms          198

       0.169065000 seconds sys
```

-### Conclusion
+The results demonstrate that thread pinning can significantly improve performance when threads operate on separate data structures. By keeping threads on specific cores, you reduce cache coherency traffic and improve data locality.
+
+## What you've accomplished and what's next
+
+In this section, you've:
+- Created two programs to compare default OS scheduling against explicit thread pinning
+- Used the `pthread_setaffinity_np` API to control CPU affinity at the source code level
+- Measured cache performance using Perf to quantify the impact of thread pinning
+- Observed a performance improvement and a reduction in cache misses through strategic thread placement

-In this tutorial, we introduced thread pinning (CPU affinity) through a pair of worked examples. By comparing default OS thread scheduling against explicitly pinned threads, we showed how controlling where threads run can reduce cache disruption in contention-heavy paths and improve runtime stability and performance.
+You've seen how controlling where threads run can reduce cache disruption in contention-heavy paths and improve runtime stability and performance. You've also learned about the trade-offs: pinning can boost locality and predictability, but it can hurt performance of other running processes, especially if the workload characteristics change or if you over-constrain the scheduler.

-We also highlighted the tradeoffs, pinning can boost locality and predictability, but it can hurt performance of other running processes, espec. Finally, we showed how to implement affinity quickly using common system utilities for inspection and measurement, and how to be more expressive directly in code using APIs like `pthread_setaffinity_np` from `pthread.h`.
\ No newline at end of file
+Thread pinning is most effective when you have well-understood workload patterns and clear separation between data structures accessed by different threads. Use it as a fine-tuning technique after establishing baseline performance with default OS scheduling.
\ No newline at end of file
diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/using_taskset.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/using_taskset.md
index 15a79297d3..9cae386474 100644
--- a/content/learning-paths/servers-and-cloud-computing/pinning-threads/using_taskset.md
+++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/using_taskset.md
@@ -1,43 +1,38 @@
---
-title: Using Taskset
+title: Pin threads to cores with taskset
weight: 4

### FIXED, DO NOT MODIFY
layout: learningpathall
---

-## Python Script
+## Create a single-threaded Python benchmark

-Now that we have a basic program that utilizes all the available CPU cores, we will interleave this with a single-threaded program sensitive to variations in execution. This could be to simulate, for example, a log ingesting process or a single-threaded consumer that needs to keep a steady pace.
+Now that you have a program that utilizes all available CPU cores, you'll create a single-threaded program that's sensitive to execution variations. This simulates scenarios like a log-ingesting process or a single-threaded consumer that needs to maintain a steady pace.

-Check that you have Python installed.
+Check that you have Python installed:

```bash
-python3 --version
+python --version
```

-You should see the version of Python. If not, please install Python using the [online instructions](https://www.python.org/downloads/).
+You should see the version of Python:

```output
Python 3.12.3
```

-Next, create a virtual environment. This allows you to install packages without interfering with system packages.
+If Python isn't installed, use your Linux package manager to install it or refer to the [Python downloads page](https://www.python.org/downloads/).
+
+Next, create a virtual environment to install packages without interfering with system packages:

```bash
-python3 -m venv venv
+python -m venv venv
source venv/bin/activate
pip install matplotlib
```

-Create a file named `single_threaded_python_script.py` and update the permissions with the commands below.
-
-```bash
-touch single_threaded_python_script.py
-chmod 755 single_threaded_python_script.py
-```
-
-Paste in the follow Python script into `single_threaded_python_script.py`.
+Use an editor to create a file named `single_threaded_python_script.py` with the following code. This script repeatedly measures the execution time of a computational function and writes the results to `data.txt`. It then generates time-series graphs to illustrate the effects of thread pinning:

```python
#!/usr/bin/env python3
@@ -116,22 +111,29 @@ if __name__ == "__main__":
    main()
```

-The Python script above repeatedly measures the time to execute an arbitrary function, `bar` and writes it to a file `data.txt`. It then generates a time-series graph of the time to illustrate and compare the effects of pinning threads under different scenarios.
+Make the script executable:
+```bash
+chmod +x single_threaded_python_script.py
+```
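+Before moving on, note that you can confirm the affinity mask that `taskset` applies. From the shell, `taskset -cp <PID>` prints a running process's allowed CPU list (the PID is a placeholder); programmatically, the same information is available through `sched_getaffinity`, as in this minimal sketch:
+
+```cpp
+#include <sched.h>
+#include <cstdio>
+
+// Print which cores the calling process is currently allowed to run on.
+int main() {
+    cpu_set_t mask;
+    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {  // 0 = this process
+        std::perror("sched_getaffinity");
+        return 1;
+    }
+    std::printf("allowed cores:");
+    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+        if (CPU_ISSET(cpu, &mask)) {
+            std::printf(" %d", cpu);
+        }
+    }
+    std::printf("\n");
+    return 0;
+}
+```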
-### Using Taskset to Pin Threads
+## Compare thread pinning strategies

-We will explore 3 different scenarios. One where we let the operating system allocate to any of 4 cores, another scenario where we pin the single-threaded process to an individual core but our program `prog` is free to run on any core, and a final scenario where the single-threaded script has exclusive access to a single core. We will observe the tradeoff in execution time for both programs running simulatenously.
+You'll explore three different scenarios to understand the trade-offs of thread pinning:

-Create 3 bash scripts with the following command.
+1. Free: The operating system allocates both programs to any of four cores
+2. Shared-pinned: The Python script is pinned to core 0, but `prog` can run on any core
+3. Exclusive: The Python script has exclusive access to core 0, and `prog` runs on cores 1-3

-```bash
-touch free-script.sh exclusive.sh shared-pinned.sh
-chmod 755 free-script.sh exclusive.sh shared-pinned.sh
-```
-Paste in the script below to the corresponding files.
+
+### Create test scripts
+
+Create three bash scripts to automate the testing.
+
+#### Free script
+
+The first script allows both programs to run on any of the first four cores.

-#### Free
+Use an editor to create a file named `free-script.sh` with the following code:

```bash
#!/bin/bash
@@ -145,7 +147,11 @@
taskset --cpu-list 0-3 ./prog
wait
```

-#### Shared-Pinned
+#### Shared script
+
+The next script pins the Python script to core 0, while `prog` can use any of the first four cores.
+
+Use an editor to create a file named `shared-pinned.sh` with the following code:

```bash
#!/bin/bash
@@ -159,7 +165,11 @@
taskset --cpu-list 0-3 ./prog
wait
```

-#### Exclusive Access
+#### Exclusive script
+
+The last script gives the Python script exclusive access to core 0, and `prog` uses cores 1-3.
+
+Use an editor to create a file named `exclusive.sh` with the following code:

```bash
#!/bin/bash
@@ -173,15 +183,22 @@
taskset --cpu-list 1-3 ./prog
wait
```

+### Run the tests
+
+Execute all three scenarios:
+
```bash
+chmod +x free-script.sh shared-pinned.sh exclusive.sh
./free-script.sh
./shared-pinned.sh
./exclusive.sh
```

-The terminal output is the execution time under the 3 corresponding scenarios. Additionally, the Python script will generate 3 files, `Free.jpg`, `Exclusive.jpg` and `Shared.jpg`.
+## Analyze the results
+
+The terminal output shows the execution time for `prog` under the three scenarios. The Python script also generates three files: `Free.jpg`, `Exclusive.jpg`, and `Shared.jpg`.

-As the terminal output below shows, the `free.sh` script, where the Linux scheduler performs assigns threads to cores without restriction, calculated `prog` the quickest at 5.8s. The slowest calculation is where the Python script has exclusive access to cpu 0. This is to be expected as we have constrained `prog` to fewer cores.
+As the terminal output below shows, the `free-script.sh` scenario (where the Linux scheduler assigns threads to cores without restriction) completes `prog` the fastest at 5.8 seconds. The slowest execution occurs when the Python script has exclusive access to CPU 0, which is expected because you've constrained `prog` to fewer cores:

```output
Answer = 3.14159 5 iterations took 5838 milliseconds
@@ -189,16 +206,42 @@
Answer = 3.14159 5 iterations took 5946 milliseconds
Answer = 3.14159 5 iterations took 5971 milliseconds
```
-However, this is a tradeoff between the performance of the Python script. Looking at `free.jpg`, we have periodic zones of high latency (3.5ms) that likely coincide when there is contention between the `prog` and the Python script.
+However, this represents a trade-off with the Python script's performance.
+
+### Free scenario results
+
+Looking at `Free.jpg`, you can see periodic zones of high latency (3.5 ms) that likely occur when there's contention between `prog` and the Python script:
+
+![Time-series graph showing execution time varying between 0.5ms and 3.5ms with periodic spikes, indicating contention between processes when both are free to run on any core](free.jpg "Free scenario: both programs can run on any of four cores")
+
+### Shared-pinned scenario results
+
+When pinning the Python script to core 0 while `prog` remains free to use any cores, you observe similar behavior:
+
+![Time-series graph showing execution time with similar periodic spikes as the free scenario, indicating continued contention despite pinning the Python script](pinned_shared.jpg "Shared-pinned scenario: Python script pinned to core 0, prog free to run on any core")
+
+### Exclusive scenario results
+
+When the Python script has exclusive access to core 0, you observe more consistent execution time around 0.49 ms because the script doesn't contend with any other demanding processes:
+
+![Time-series graph showing consistent execution time around 0.49ms with minimal variation, demonstrating stable performance when the Python script has exclusive core access](exclusive.jpg "Exclusive scenario: Python script has exclusive access to core 0, prog runs on cores 1-3")
+
+## Understanding the trade-offs

-![free](./free.jpg)
+The results demonstrate key trade-offs in thread pinning:

-When, pinning the Python script to a core 0 with `prog` free to use any cores we also observe this behaviour.
+- Free allocation: Fastest overall throughput but inconsistent latency for time-sensitive tasks
+- Shared pinning: Provides some isolation but doesn't eliminate contention
+- Exclusive pinning: Most consistent latency for the pinned process but reduces available cores for other work

-![shared](./pinned_shared.jpg)
+Multiple factors influence this behavior, including the Linux scheduler algorithm, associated parameters, and process priority. These topics are beyond the scope of this Learning Path. If you'd like to learn more, see the [nice(2) man page](https://man7.org/linux/man-pages/man2/nice.2.html) for information about process priority settings.

-Finally, when the Python script has exclusive access to core 0, we observe more consistent time around 0.49ms as the script is not contending with any other demanding processes.
+## What you've accomplished and what's next

-![exclusive](./exclusive.jpg)
+In this section, you've:
+- Created a single-threaded Python benchmark that measures execution time variations
+- Used `taskset` to pin processes to specific CPU cores
+- Compared three thread pinning strategies: free, shared-pinned, and exclusive
+- Analyzed the trade-offs between throughput and latency consistency

-There are multiple additional factors the influence why we this exact profile, including the Linux scheduler algorithm and their associated parameters as well as the priority of the process. We will not go into said factors as it is out of scope for this learning path. If you'd like to learn more, please look into the Linux scheduler and priority setting via the [nice](https://man7.org/linux/man-pages/man2/nice.2.html) utility.
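+To see the priority mechanism in action, the sketch below lowers a process's own priority with the `nice(2)` system call before starting CPU-bound work. It is a minimal illustration, separate from the Learning Path's examples:
+
+```cpp
+#include <unistd.h>
+#include <cstdio>
+
+int main() {
+    // Raise our nice value by 10: a higher nice value means lower priority.
+    int new_nice = nice(10);
+    std::printf("new nice value: %d\n", new_nice);
+
+    // CPU-bound work placed here now yields more readily to
+    // higher-priority processes sharing the same cores.
+    return 0;
+}
+```
+
+Combined with pinning, priorities give you a second lever: `taskset` controls where a process may run, while `nice` influences how often it runs there.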
\ No newline at end of file
+In the next section, you'll learn how to set CPU affinity directly in source code using APIs such as `pthread_setaffinity_np`.
\ No newline at end of file