When I install NumPy with pip install numpy and run an np.dot() workload, it uses only 4 threads (4 cores), even though my Windows on ARM64 device has 12 cores.
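For anyone who wants to reproduce the observation, here is a minimal sketch that reports how many threads the BLAS bundled with NumPy exposes next to the logical core count. It assumes the third-party threadpoolctl package is installed alongside NumPy; the helper names blas_thread_report and is_capped are made up for this illustration.

```python
import os


def blas_thread_report():
    """Return (logical_cores, [(internal_api, num_threads), ...]) for loaded BLAS libraries.

    The pool list is empty when NumPy or threadpoolctl is not installed.
    """
    cores = os.cpu_count() or 1
    try:
        import numpy  # noqa: F401  (importing NumPy loads its bundled BLAS)
        from threadpoolctl import threadpool_info
    except ImportError:
        return cores, []
    pools = [
        (pool["internal_api"], pool["num_threads"])
        for pool in threadpool_info()
        if pool["user_api"] == "blas"
    ]
    return cores, pools


def is_capped(logical_cores, pool_threads):
    """True when the BLAS thread pool is smaller than the machine's core count."""
    return pool_threads < logical_cores


if __name__ == "__main__":
    cores, pools = blas_thread_report()
    print(f"logical cores: {cores}")
    for api, threads in pools:
        print(f"{api}: {threads} threads, capped={is_capped(cores, threads)}")
```

On the device described here, I would expect this to show a pool of 4 threads against 12 logical cores.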
I suspect the cause is that this script does not set NUM_THREADS when building OpenBLAS for ARM64, so the build falls back to the number of cores on the build machine as the value of NUM_THREADS.
To avoid depending on the build machine's core count, could we pass a flag to the OpenBLAS build, as we already do for x64, so that the number of threads can be configured at runtime?
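To illustrate why the build-time value matters, here is a rough sketch of the behaviour I am assuming: a runtime request for more threads is clamped to the maximum compiled into OpenBLAS. The function effective_threads is only a model of that assumption, not OpenBLAS code, and probe_runtime_limit is a hypothetical helper that relies on the third-party threadpoolctl package being installed.

```python
def effective_threads(requested, compiled_max):
    """Model of the assumed clamp: a runtime request cannot exceed the build-time maximum."""
    return min(requested, compiled_max)


def probe_runtime_limit(requested):
    """Ask the loaded BLAS for `requested` threads and return the count it reports.

    Returns None when NumPy or threadpoolctl is missing, or no BLAS pool is found.
    """
    try:
        import numpy  # noqa: F401  (importing NumPy loads its bundled BLAS)
        from threadpoolctl import threadpool_info, threadpool_limits
    except ImportError:
        return None
    # Apply the limit only inside this block, then read back what was honoured.
    with threadpool_limits(limits=requested, user_api="blas"):
        counts = [
            pool["num_threads"]
            for pool in threadpool_info()
            if pool["user_api"] == "blas"
        ]
    return counts[0] if counts else None


if __name__ == "__main__":
    # With a wheel whose compiled maximum is 4, the model predicts 4, not 12.
    print("model:", effective_threads(12, 4))
    print("reported by the loaded BLAS:", probe_runtime_limit(12))
```

If the reported value stays at 4 when 12 is requested, that would support the idea that the limit comes from the build rather than from a runtime setting.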