diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/_index.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/_index.md new file mode 100644 index 0000000000..6a824b159d --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/_index.md @@ -0,0 +1,42 @@ +--- +title: Getting Started with CPU Affinity + +minutes_to_complete: 30 + +who_is_this_for: Developers, performance engineers, and system administrators looking to fine-tune the performance of their workloads on many-core Arm-based systems. + +learning_objectives: + - Create CPU sets and apply them directly in source code + - Understand the performance tradeoff when pinning threads with CPU affinity masks + +prerequisites: + - Intermediate understanding of multi-threaded, object-oriented programming in C++ and Python + - Foundational understanding of build systems and computer architecture + +author: Kieran Hejmadi + +### Tags +skilllevels: Introductory +subjects: Performance and Architecture +armips: + - Neoverse +tools_software_languages: + - C++ + - Python +operatingsystems: + - Linux + +further_reading: + - resource: + title: Taskset Manual + link: https://man7.org/linux/man-pages/man1/taskset.1.html + type: documentation + + + +### FIXED, DO NOT MODIFY +# ================================================================================ +weight: 1 # _index.md always has weight of 1 to order correctly +layout: "learningpathall" # All files under learning paths have this same wrapper +learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content. 
+--- diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/_next-steps.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/_next-steps.md new file mode 100644 index 0000000000..727b395ddd --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/_next-steps.md @@ -0,0 +1,8 @@ +--- +# ================================================================================ +# FIXED, DO NOT MODIFY THIS FILE +# ================================================================================ +weight: 21 # The weight controls the order of the pages. _index.md always has weight 1. +title: "Next Steps" # Always the same, html page title. +layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing. +--- diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/background_info.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/background_info.md new file mode 100644 index 0000000000..d108a2a3a6 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/background_info.md @@ -0,0 +1,20 @@ +--- +title: Background Information +weight: 2 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Introduction + + +CPU affinity is the practice of binding a process or thread to a specific CPU core or set of cores, telling the operating system scheduler where that work is allowed to run. By default, the Linux scheduler dynamically migrates threads across cores to balance load and maximize overall throughput. Pinning overrides this behavior by constraining execution to a chosen set of cores. + +Pinning is most often used as a fine-tuning technique for workloads that aim to consume as many CPU cycles as possible while running alongside other demanding applications. Scientific computing pipelines and real-time analytics frequently fall into this category. 
Typical applications that pin processes to specific cores are often sensitive to latency variation rather than just average throughput, or have intricate memory access patterns. Pinning can reduce this noise and provide more consistent execution behavior or better memory access patterns under load. + +Another important motivation is memory locality. On modern systems with Non-Uniform Memory Access (NUMA) architectures, different cores have different memory access times and characteristics depending on where the data is fetched from. For example, a server with 2 CPU sockets may appear to the programmer as a single processor, yet each core's access time to a given memory region depends on which socket that memory is attached to. By pinning threads to cores that are close to the memory they use, and allocating memory accordingly, an application can reduce memory access latency and improve bandwidth. + +Developers can set affinity directly in source code using system calls. Many parallel frameworks expose higher-level controls, such as OpenMP affinity settings, that manage thread placement automatically. Alternatively, at runtime, system administrators can pin existing processes using utilities like `taskset`, or launch applications with `numactl` to control both CPU and memory placement without modifying code. + +Pinning is a tradeoff. It can improve determinism and locality, but it can also reduce flexibility and hurt performance if the chosen layout is suboptimal or if system load changes. Over-constraining the scheduler may lead to idle cores while pinned threads contend unnecessarily. As a general rule, it is best to rely on the operating system scheduler as a first pass and introduce pinning only if you are looking to fine-tune performance.
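As a small, Linux-only illustration of the source-code route (separate from the worked examples later in this learning path), Python's standard library exposes the same affinity system calls directly; the choice of pinning to a single core below is arbitrary:

```python
import os

# pid 0 means "the calling process"
allowed = os.sched_getaffinity(0)
print(f"allowed cores: {allowed}")

# Restrict the calling process to a single core: the lowest-numbered
# core we are currently allowed on (an arbitrary choice).
target = min(allowed)
os.sched_setaffinity(0, {target})
print(f"now pinned to: {os.sched_getaffinity(0)}")
```

These are thin wrappers over the same `sched_getaffinity`/`sched_setaffinity` system calls that `taskset` uses, so the effect is identical to launching the process under `taskset`.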
diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/cpu_util.jpg b/content/learning-paths/servers-and-cloud-computing/pinning-threads/cpu_util.jpg new file mode 100644 index 0000000000..42027aed24 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/pinning-threads/cpu_util.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/exclusive.jpg b/content/learning-paths/servers-and-cloud-computing/pinning-threads/exclusive.jpg new file mode 100644 index 0000000000..b5c2f188d5 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/pinning-threads/exclusive.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/free.jpg b/content/learning-paths/servers-and-cloud-computing/pinning-threads/free.jpg new file mode 100644 index 0000000000..33d2e5671e Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/pinning-threads/free.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/pinned_shared.jpg b/content/learning-paths/servers-and-cloud-computing/pinning-threads/pinned_shared.jpg new file mode 100644 index 0000000000..41b7eb82c3 Binary files /dev/null and b/content/learning-paths/servers-and-cloud-computing/pinning-threads/pinned_shared.jpg differ diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/setup.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/setup.md new file mode 100644 index 0000000000..aa5e17f9a2 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/setup.md @@ -0,0 +1,156 @@ +--- +title: Setup +weight: 3 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Setup + +In this example we will be using an AWS Graviton 3 `m7g.4xlarge` instance running Ubuntu 22.04 LTS, based on the Arm Neoverse V1 architecture. 
If you are unfamiliar with creating a cloud instance, please refer to our [getting started learning path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/). + +This learning path is expected to work on any Linux-based Arm instance with 4 or more CPU cores. The `m7g.4xlarge` instance has a uniform processor architecture, so there is negligible difference in memory or CPU core performance across the cores. On Linux, this can easily be checked with the following command. + +```bash +lscpu | grep -i numa +``` + +For our `m7g.4xlarge`, all 16 cores are in the same NUMA (Non-Uniform Memory Access) node. + +```out +NUMA node(s): 1 +NUMA node0 CPU(s): 0-15 +``` + +First, we will demonstrate how to pin threads easily using the `taskset` utility available in Linux. This is used to set or retrieve the CPU affinity of a running process, or to set the affinity of a process about to be launched. This does not require any modifications to the source code. + + +## Install Prerequisites + +Run the following commands: + +```bash +sudo apt update && sudo apt install g++ cmake python3-venv -y +``` + +Install Google's microbenchmark support library. + +```bash +# Check out the library. +git clone https://github.com/google/benchmark.git +# Go to the library root directory +cd benchmark +# Make a build directory to place the build output. +cmake -E make_directory "build" +# Generate build system files with cmake, and download any dependencies. +cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ../ +# Build and install the library. +sudo cmake --build "build" --config Release --target install -j $(nproc) +``` +If you have issues building and installing, please refer to the [official installation guide](https://github.com/google/benchmark). + +Finally, you will need to install the Linux perf utility for measuring performance. 
We recommend following our [install guide](https://learn.arm.com/install-guides/perf/), as you may need to build perf from source. + +## Example 1 + +To demonstrate a use case of CPU affinity, we will create a program that heavily utilizes all the available CPU cores. Create a file named `use_all_cores.cpp` and paste in the source code below. In this example, we repeatedly evaluate the [Leibniz formula](https://en.wikipedia.org/wiki/Leibniz_formula_for_%CF%80) to compute the value of Pi. This is a computationally inefficient algorithm for calculating Pi, and we split the work across many threads. + +```bash +cd ~ +touch use_all_cores.cpp && chmod 755 use_all_cores.cpp +``` + + +```cpp +#include <chrono> +#include <iostream> +#include <thread> +#include <vector> + +using namespace std; + + +double multiplethreaded_leibniz(int terms, bool use_all_cores){ + +    int NUM_THREADS = 2; // use 2 cores by default +    if (use_all_cores){ +        NUM_THREADS = std::thread::hardware_concurrency(); // e.g., 16 on a 16-core machine +    } +    std::vector<double> partial_results(NUM_THREADS); + + +    auto calculation = [&](int thread_id){ +        // Lambda function that computes this thread's partial sum of the Leibniz series +        double denominator = 0.0; + +        for (int i = thread_id; i < terms; i += NUM_THREADS){ +            if (i % 32768 == 0){ +                this_thread::sleep_for(std::chrono::nanoseconds(20)); +            } +            denominator = (2*i) + 1; +            if (i%2==0){ +                partial_results[thread_id] += (1/denominator); +            } else{ +                partial_results[thread_id] -= (1/denominator); +            } +        } +    }; + + +    std::vector<std::thread> threads; +    for (int i = 0; i < NUM_THREADS; i++){ +        threads.push_back(std::thread(calculation, i)); +    } + +    for (auto& thread: threads){ +        thread.join(); +    } + +    // Accumulate and return final result +    double final_result = 0.0; +    for (auto& partial_result: partial_results){ +        final_result += partial_result; +    } +    final_result = final_result * 4; + +    return final_result; +} + +int main(){ + +    double result = 0.0; + +    auto start 
= std::chrono::steady_clock::now(); +    for (int i = 0; i < 5; i++){ +        result = multiplethreaded_leibniz((1<<29),true); +        std::cout << "iteration\t" << i << std::endl; +    } +    auto end = std::chrono::steady_clock::now(); + +    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end-start); +    std::this_thread::sleep_for(chrono::seconds(5)); // Wait until the Python script has finished before printing the answer +    std::cout << "Answer = " << result << "\t5 iterations took " << duration.count() << " milliseconds" << std::endl; + +    return 0; +} +``` + +Compile the program with the following command. + +```bash +g++ -O2 -std=c++11 use_all_cores.cpp -o prog +``` + +In a separate terminal, we can use the `top` utility to quickly view the utilization of each core. For example, run the following command and press the number `1`. Then run the program by entering `./prog`. + +```bash +top -d 0.1 # then press 1 to view per-core utilization +``` + +![CPU-utilization](./cpu_util.jpg) + +As the screenshot above shows, you should observe all cores on your system being periodically utilized up to 100% and then dropping back to idle until the program exits. In the next section we will look at how to bind this program to specific CPU cores when running alongside a single-threaded Python script. diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/thread_affinity.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/thread_affinity.md new file mode 100644 index 0000000000..781eb808ae --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/thread_affinity.md @@ -0,0 +1,224 @@ +--- +title: CPU Affinity +weight: 5 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Pinning Threads at Source-Code Level + +Another way to set CPU affinity is at the source-code level. This allows developers to be more expressive about which thread runs where at specific points during runtime. 
For example, in a hot path that repeatedly updates shared state in a read-modify-write style, a pinned thread can avoid excessive cache invalidations caused by other threads modifying the same data. + +To demonstrate this, we have an example program below. Copy and paste the code into a new file named `default_os_scheduling.cpp`. + +```cpp +#include <atomic> +#include <thread> +#include <vector> +#include <benchmark/benchmark.h> + +using namespace std; + +// Places each atomic float on a separate 64-byte cache line +struct AlignedAtomic { +  alignas(64) std::atomic<float> val{0.0f}; +}; + +void os_scheduler() { + +  const int NUM_THREADS = 4; + +  AlignedAtomic a; +  AlignedAtomic b; + +  // Lambda work function: repeatedly read-modify-write the atomic +  auto task = [](AlignedAtomic &atomic){ +    for(int i = 0; i < (1 << 18); i++){ +      atomic.val = atomic.val + 1.0f; +    } +  }; + +  std::vector<std::thread> threads; +  threads.reserve(NUM_THREADS); + +  // Launch NUM_THREADS threads +  for (int i = 0; i < NUM_THREADS; i++){ +    if (i%2 == 0){ +      threads.emplace_back(task, ref(a)); +    } +    else{ +      threads.emplace_back(task, ref(b)); + +    } +  } + +  // wait for all threads to join before exiting +  for (auto& thread : threads){ +    thread.join(); +  } +} + +// Google Benchmark framework +static void default_os_scheduling(benchmark::State& s) { +  while (s.KeepRunning()) { +    os_scheduler(); +  } +} +BENCHMARK(default_os_scheduling)->UseRealTime()->Unit(benchmark::kMillisecond); + +BENCHMARK_MAIN(); +``` + +`default_os_scheduling.cpp` has 2 atomic variables that are aligned on different cache lines to avoid false sharing. We spawn 4 threads, with 2 threads performing a read-modify-write operation on the first atomic variable, and the final 2 threads performing the same operation on the second atomic variable. + +Now, copy the code block below into a file named `thread_affinity.cpp`. 
+ +```cpp +#include <atomic> +#include <cassert> +#include <pthread.h> +#include <thread> +#include <vector> +#include <benchmark/benchmark.h> + + +using namespace std; + +// Places each atomic float on a separate 64-byte cache line +struct AlignedAtomic { +  alignas(64) std::atomic<float> val{0.0f}; +}; + +void thread_affinity_work() { + +  const int NUM_THREADS = 4; + +  AlignedAtomic a; +  AlignedAtomic b; + +  // Lambda work function: repeatedly read-modify-write the atomic +  auto task = [](AlignedAtomic &atomic){ +    for(int i = 0; i < (1 << 18); i++){ +      atomic.val = atomic.val + 1.0f; +    } +  }; + +  std::vector<std::thread> threads; +  threads.reserve(NUM_THREADS); + +  // Create cpu sets +  cpu_set_t cpu_set_0; +  cpu_set_t cpu_set_1; + +  // Zero them out +  CPU_ZERO(&cpu_set_0); +  CPU_ZERO(&cpu_set_1); + +  // And set the CPU cores we want to pin the threads to +  CPU_SET(0, &cpu_set_0); +  CPU_SET(1, &cpu_set_1); + +  // Launch threads, pinning the threads that share a variable to the same CPU core. +  for (int i = 0; i < NUM_THREADS; i++){ +    if (i%2 == 0){ +      threads.emplace_back(task, ref(a)); +      assert(pthread_setaffinity_np(threads[i].native_handle(), sizeof(cpu_set_t), &cpu_set_0) == 0); + +    } +    else{ +      threads.emplace_back(task, ref(b)); +      assert(pthread_setaffinity_np(threads[i].native_handle(), sizeof(cpu_set_t), &cpu_set_1) == 0); +    } +  } + + +  // wait for all threads to join before exiting +  for (auto& thread : threads){ +    thread.join(); +  } +} + +// Thread affinity benchmark +static void thread_affinity(benchmark::State& s) { +  for(auto _ : s) { +    thread_affinity_work(); +  } +} +BENCHMARK(thread_affinity)->UseRealTime()->Unit(benchmark::kMillisecond); + +BENCHMARK_MAIN(); +``` + +`thread_affinity.cpp` uses the `pthread_setaffinity_np` function from the `pthread.h` header file to pin the 2 threads operating on atomic variable `a` to one CPU core, and the 2 threads operating on atomic variable `b` to a different core. + +Compile both programs with the following command. 
+ +```bash +g++ default_os_scheduling.cpp -O3 -march=native -lbenchmark -lpthread -o default-os-scheduling +g++ thread_affinity.cpp -O3 -march=native -lbenchmark -lpthread -o thread-affinity +``` + +We will use the `perf` tool to print cache statistics for each program. + +```bash +perf stat -e L1-dcache-loads,L1-dcache-load-misses ./default-os-scheduling +perf stat -e L1-dcache-loads,L1-dcache-load-misses ./thread-affinity +``` + +Inspecting the output below, we see that the `L1-dcache-load-misses`, which occur when the CPU core does not have an up-to-date version of the data in its L1 data cache and must perform a more expensive fetch from elsewhere in the memory hierarchy, drop from ~7.84% to ~0.45% as a result of the thread pinning. This produces a large reduction in execution time, dropping from 10.7 ms to 3.53 ms. + +```output +Running ./default-os-scheduling +Run on (16 X 2100 MHz CPU s) +CPU Caches: + L1 Data 64 KiB (x16) + L1 Instruction 64 KiB (x16) + L2 Unified 1024 KiB (x16) + L3 Unified 32768 KiB (x1) +Load Average: 0.37, 0.40, 0.20 +-------------------------------------------------------------------------- +Benchmark Time CPU Iterations +-------------------------------------------------------------------------- +default_os_scheduling/real_time 10.7 ms 0.118 ms 64 + + Performance counter stats for './default-os-scheduling': + + 391719695 L1-dcache-loads + 30726569 L1-dcache-load-misses # 7.84% of all L1-dcache accesses + + 0.808460086 seconds time elapsed + + 3.059934000 seconds user + 0.030958000 seconds sys + + +2026-01-14T09:46:00+00:00 +Running ./thread-affinity +Run on (16 X 2100 MHz CPU s) +CPU Caches: + L1 Data 64 KiB (x16) + L1 Instruction 64 KiB (x16) + L2 Unified 1024 KiB (x16) + L3 Unified 32768 KiB (x1) +Load Average: 0.66, 0.46, 0.22 +-------------------------------------------------------------------- +Benchmark Time CPU Iterations +-------------------------------------------------------------------- +thread_affinity/real_time 3.53 ms 
0.343 ms 198 + + Performance counter stats for './thread-affinity': + + 699781841 L1-dcache-loads + 3154506 L1-dcache-load-misses # 0.45% of all L1-dcache accesses + + 1.094879115 seconds time elapsed + + 2.044792000 seconds user + 0.169065000 seconds sys +``` + +### Conclusion + +In this tutorial, we introduced thread pinning (CPU affinity) through a pair of worked examples. By comparing default OS thread scheduling against explicitly pinned threads, we showed how controlling where threads run can reduce cache disruption in contention-heavy paths and improve runtime stability and performance. + +We also highlighted the tradeoffs: pinning can boost locality and predictability, but it can hurt the performance of other running processes, especially if the chosen layout over-constrains the scheduler. Finally, we showed how to implement affinity quickly using common system utilities for inspection and measurement, and how to be more expressive directly in code using APIs like `pthread_setaffinity_np` from `pthread.h`. \ No newline at end of file diff --git a/content/learning-paths/servers-and-cloud-computing/pinning-threads/using_taskset.md b/content/learning-paths/servers-and-cloud-computing/pinning-threads/using_taskset.md new file mode 100644 index 0000000000..15a79297d3 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/pinning-threads/using_taskset.md @@ -0,0 +1,204 @@ +--- +title: Using Taskset +weight: 4 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +## Python Script + +Now that we have a basic program that utilizes all the available CPU cores, we will interleave it with a single-threaded program that is sensitive to variations in execution time. This could simulate, for example, a log-ingesting process or a single-threaded consumer that needs to keep a steady pace. + +Check that you have Python installed. + +```bash +python3 --version +``` + +You should see the version of Python. If not, please install Python using the [online instructions](https://www.python.org/downloads/). 
+ +```output +Python 3.12.3 +``` + +Next, create a virtual environment. This allows you to install packages without interfering with system packages. + +```bash +python3 -m venv venv +source venv/bin/activate +pip install matplotlib +``` + +Create a file named `single_threaded_python_script.py` and update the permissions with the commands below. + +```bash +touch single_threaded_python_script.py +chmod 755 single_threaded_python_script.py +``` + +Paste the following Python script into `single_threaded_python_script.py`. + +```python +#!/usr/bin/env python3 +import time +import matplotlib.pyplot as plt +import sys + +def timer(func): +    def foo(*args,**kwargs): +        with open("data.txt", "a") as f: +            start = time.perf_counter() +            ans = func(*args,**kwargs) +            end = time.perf_counter() +            duration = end - start +            # print(f"Function {func.__name__} took {(duration*1000):4f} milliseconds") +            f.write((str(duration*1000)) + ", ") +        return ans +    return foo + +@timer +def bar(x:int)->float: +    """Arbitrary function that is time-sensitive""" +    res = 0.0 +    for i in range(0,x*100): +        res += (float(i) / 9.0) + (42.0 + float(i)) + +    return res + +def plot_csv_values_from_txt(path: str, *, title: str | None = None, show_markers: bool = False) -> None: +    """ +    Reads a .txt file containing comma-separated numeric values (with optional whitespace/newlines) +    and plots them as a simple chart. 
+    """ +    with open(path, "r", encoding="utf-8") as f: +        text = f.read() + +    # Split on commas, trim whitespace, ignore empty tokens (handles trailing comma) +    tokens = [t.strip() for t in text.replace("\n", " ").split(",")] +    values = [float(t) for t in tokens if t] + +    plt.figure() +    x = range(len(values)) +    if show_markers: +        plt.plot(x, values, marker="o", linestyle="-") +    else: +        plt.plot(x, values) + +    plt.xlabel("Sample Number") +    plt.ylabel("Time / milliseconds") +    if title: +        plt.title(title) +    plt.tight_layout() +    plt.grid() +    plt.show() +    if (sys.argv[1] == "exclusive"): +        plt.savefig("Exclusive.jpg") +    elif (sys.argv[1] == "shared"): +        plt.savefig("Shared.jpg") +    elif (sys.argv[1] == "free"): +        plt.savefig("Free.jpg") + +def main(): + +    for i in range(0,10000): +        bar(50) +    if (sys.argv[1] == "exclusive"): +        plot_csv_values_from_txt(path="data.txt",title="Exclusively Pinned") +    elif (sys.argv[1] == "shared"): +        plot_csv_values_from_txt(path="data.txt", title="Shared") +    elif (sys.argv[1] == "free"): +        plot_csv_values_from_txt(path="data.txt", title="Free") +    return 0 + +if __name__ == "__main__": +    main() +``` + +The Python script above repeatedly measures the time taken to execute an arbitrary function, `bar`, and writes each measurement to a file, `data.txt`. It then generates a time-series graph of these timings to illustrate and compare the effects of pinning threads under different scenarios. + + +### Using Taskset to Pin Threads + +We will explore 3 different scenarios: one where we let the operating system allocate work to any of 4 cores, one where we pin the single-threaded process to an individual core while our program `prog` is free to run on any core, and a final one where the single-threaded script has exclusive access to a single core. We will observe the tradeoff in execution time for both programs running simultaneously. + +Create 3 bash scripts with the following command. 
+ +```bash +touch free-script.sh exclusive.sh shared-pinned.sh +chmod 755 free-script.sh exclusive.sh shared-pinned.sh +``` +Paste each script below into the corresponding file. + +#### Free + +```bash +#!/bin/bash + +set -euo pipefail + +rm -f ./data.txt +taskset --cpu-list 0-3 ./single_threaded_python_script.py free & # time-critical python script +taskset --cpu-list 0-3 ./prog + +wait +``` + +#### Shared-Pinned + +```bash +#!/bin/bash + +set -euo pipefail + +rm -f ./data.txt +taskset --cpu-list 0 ./single_threaded_python_script.py shared & # time-critical python script +taskset --cpu-list 0-3 ./prog + +wait +``` + +#### Exclusive Access + +```bash +#!/bin/bash + +set -euo pipefail + +rm -f ./data.txt +taskset --cpu-list 0 ./single_threaded_python_script.py exclusive & # time-critical python script +taskset --cpu-list 1-3 ./prog + +wait +``` + +Run the 3 scripts one at a time: + +```bash +./free-script.sh +./shared-pinned.sh +./exclusive.sh +``` + +The terminal output is the execution time under the 3 corresponding scenarios. Additionally, the Python script will generate 3 files: `Free.jpg`, `Exclusive.jpg` and `Shared.jpg`. + +As the terminal output below shows, the `free-script.sh` scenario, where the Linux scheduler assigns threads to cores without restriction, completed the `prog` calculation the quickest, at 5.8 s. The slowest calculation is where the Python script has exclusive access to CPU 0. This is to be expected, as we have constrained `prog` to fewer cores. + +```output +Answer = 3.14159 5 iterations took 5838 milliseconds +Answer = 3.14159 5 iterations took 5946 milliseconds +Answer = 3.14159 5 iterations took 5971 milliseconds +``` + +However, this comes at the cost of the Python script's performance. Looking at `Free.jpg`, we see periodic zones of high latency (~3.5 ms) that likely coincide with contention between `prog` and the Python script. + +![free](./free.jpg) + +When pinning the Python script to core 0, with `prog` free to use any core, we also observe this behaviour. 
+ +![shared](./pinned_shared.jpg) + +Finally, when the Python script has exclusive access to core 0, we observe a more consistent time of around 0.49 ms, as the script is not contending with any other demanding processes. + +![exclusive](./exclusive.jpg) + +There are multiple additional factors that influence this exact profile, including the Linux scheduling algorithm and its associated parameters, as well as the priority of each process. We will not go into these factors, as they are out of scope for this learning path. If you'd like to learn more, please look into the Linux scheduler and priority setting via the [nice](https://man7.org/linux/man-pages/man2/nice.2.html) utility.
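As a starting point for that exploration, process priority can also be adjusted programmatically. For example (a minimal sketch; the increment of 5 is an arbitrary choice), Python exposes the same mechanism as the `nice` utility through `os.nice`:

```python
import os

# Read the current niceness without changing it (an increment of 0)
current = os.nice(0)
print(f"current niceness: {current}")

# Raise niceness by 5, lowering this process's scheduling priority.
# Unprivileged processes may raise their niceness but not lower it.
print(f"new niceness: {os.nice(5)}")
```

A higher niceness tells the Linux scheduler to give the process a smaller share of CPU time when cores are contended, which complements pinning as a way to protect latency-sensitive work.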