
Commit cded2f8

jstac and claude committed
Improve formatting and clarity across parallel computing lectures
- Standardize header capitalization in need_for_speed.md
- Update code cell types to ipython3 in numba.md for consistency
- Remove redundant parallelization warning section in numba.md
- Enhance explanatory text and code clarity in numpy_vs_numba_vs_jax.md
- Fix formatting and add missing validation checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 0678fc3 commit cded2f8

File tree

3 files changed (+40, -40 lines)


lectures/need_for_speed.md

Lines changed: 5 additions & 5 deletions
@@ -153,12 +153,12 @@ On the other hand, the standard implementation of Python (called CPython) cannot
 match the speed of compiled languages such as C or Fortran.
 
 
-### Where are the Bottlenecks?
+### Where are the bottlenecks?
 
 Why is this the case?
 
 
-#### Dynamic Typing
+#### Dynamic typing
 
 ```{index} single: Dynamic Typing
 ```
@@ -200,7 +200,7 @@ If we repeatedly execute this expression in a tight loop, the nontrivial
 overhead becomes a large overhead.
 
 
-#### Static Types
+#### Static types
 
 ```{index} single: Static Types
 ```
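For context on the two hunks above: the surrounding lecture text argues that an expression like `a + b` must be re-interpreted on every evaluation, because operand types are only known at runtime. A minimal illustration of that point (not part of this commit):

```python
# The same expression dispatches on the runtime types of its operands,
# so the interpreter must re-check types on every evaluation.
a, b = 10, 10
print(a + b)          # 20: integer addition
a, b = "foo", "bar"
print(a + b)          # 'foobar': string concatenation
a, b = [10], [20]
print(a + b)          # [10, 20]: list concatenation
```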
@@ -250,7 +250,7 @@ Such an array is stored in a single contiguous block of memory
 
 * In modern computers, memory addresses are allocated to each byte (one byte = 8 bits).
 * For example, a 64 bit integer is stored in 8 bytes of memory.
-* An array of $n$ such integers occupies $8n$ **consecutive** memory slots.
+* An array of $n$ such integers occupies $8n$ *consecutive* memory slots.
 
 Moreover, the compiler is made aware of the data type by the programmer.
 
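The *consecutive* storage claim edited above is easy to verify with NumPy; a quick check (illustrative, not part of this commit):

```python
import numpy as np

n = 1_000
a = np.zeros(n, dtype=np.int64)   # n 64-bit integers
print(a.nbytes)                   # 8000: 8n bytes in total
print(a.flags['C_CONTIGUOUS'])    # True: one contiguous block of memory
```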

@@ -336,7 +336,7 @@ for this purpose and supplied to users as part of a package.
 
 The core benefits are
 
-1. type-checking is paid per array, rather than per element, and
+1. type-checking is paid *per array*, rather than per element, and
 1. arrays containing elements with the same data type are efficient in terms of
    memory access.
 

lectures/numba.md

Lines changed: 7 additions & 20 deletions
@@ -475,7 +475,7 @@ distribution.
 
 Here's the code:
 
-```{code-cell} ipython
+```{code-cell} ipython3
 from numpy.random import randn
 from numba import njit
@@ -496,7 +496,7 @@ def h(w, r=0.1, s=0.3, v1=0.1, v2=1.0):
 
 Let's have a look at how wealth evolves under this rule.
 
-```{code-cell} ipython
+```{code-cell} ipython3
 fig, ax = plt.subplots()
 
 T = 100
@@ -540,7 +540,7 @@ Then we'll calculate median wealth at the end period.
 
 Here's the code:
 
-```{code-cell} ipython
+```{code-cell} ipython3
 @njit
 def compute_long_run_median(w0=1, T=1000, num_reps=50_000):
@@ -556,7 +556,7 @@ def compute_long_run_median(w0=1, T=1000, num_reps=50_000):
 
 Let's see how fast this runs:
 
-```{code-cell} ipython
+```{code-cell} ipython3
 with qe.Timer():
     compute_long_run_median()
 ```
@@ -565,7 +565,7 @@ To speed this up, we're going to parallelize it via multithreading.
 
 To do so, we add the `parallel=True` flag and change `range` to `prange`:
 
-```{code-cell} ipython
+```{code-cell} ipython3
 from numba import prange
 
 @njit(parallel=True)
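For reference, the flag-plus-`prange` pattern this hunk introduces has the following overall shape; the function below is an illustrative sketch, not the lecture's `compute_long_run_median_parallel`:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_mc_sketch(num_reps=50_000, T=1_000):
    out = np.empty(num_reps)
    for i in prange(num_reps):     # replications are independent: parallel
        x = 0.0
        for t in range(T):         # each step depends on the last: sequential
            x = 0.9 * x + np.random.randn()
        out[i] = x
    return np.median(out)
```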
@@ -583,26 +583,13 @@ def compute_long_run_median_parallel(w0=1, T=1000, num_reps=50_000):
 
 Let's look at the timing:
 
-```{code-cell} ipython
+```{code-cell} ipython3
 with qe.Timer():
     compute_long_run_median_parallel()
 ```
 
 The speed-up is significant.
 
-### A Warning
-
-Parallelization works well in the outer loop of the last example because the individual tasks inside the loop are independent of each other.
-
-If this independence fails then parallelization is often problematic.
-
-For example, each step inside the inner loop depends on the last step, so
-independence fails, and this is why we use ordinary `range` instead of `prange`.
-
-When you see us using `prange` in later lectures, it is because the
-independence of tasks holds true.
-
-Conversely, when you see us using ordinary `range` in a jitted function, it is either because the speed gain from parallelization is small or because independence fails.
 
 ## Exercises
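For reference, the independence condition described in the removed warning can be illustrated as follows (hypothetical example, not lecture code): the recursion below must keep ordinary `range`, because iteration `t` reads the result of iteration `t-1`.

```python
from numba import njit

@njit
def compound_sketch(x0=1.0, g=1.01, T=1_000):
    x = x0
    for t in range(T):   # NOT independent: do not replace with prange
        x = g * x        # depends on the previous iteration's x
    return x
```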

@@ -807,7 +794,7 @@ For the size of the Monte Carlo simulation, use something substantial, such as
 
 Here is one solution:
 
-```{code-cell} python3
+```{code-cell} ipython3
 from random import uniform
 
 @njit(parallel=True)
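The solution hunk above is truncated. A sketch of the standard parallel Monte Carlo pattern it gestures at, assuming the exercise targets a pi estimate (the function name and body are our assumptions, not the lecture's verbatim solution):

```python
from random import uniform
from numba import njit, prange

@njit(parallel=True)
def estimate_pi_sketch(n=10_000_000):
    hits = 0
    for i in prange(n):                   # draws are independent: parallel
        u, v = uniform(0, 1), uniform(0, 1)
        if u**2 + v**2 < 1:               # point lands in the quarter circle
            hits += 1                     # Numba treats this as a reduction
    return 4 * hits / n
```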

lectures/numpy_vs_numba_vs_jax.md

Lines changed: 28 additions & 15 deletions
@@ -321,11 +321,12 @@ x_mesh.nbytes + y_mesh.nbytes
 
 This extra memory usage can be a big problem in actual research calculations.
 
-Fortunately, JAX admits a different approach using [jax.vmap](https://docs.jax.dev/en/latest/_autosummary/jax.vmap.html)
+Fortunately, JAX admits a different approach
+using [jax.vmap](https://docs.jax.dev/en/latest/_autosummary/jax.vmap.html).
 
 #### Version 1
 
-Here's one way we can do this
+Here's one way we can apply `vmap`.
 
 ```{code-cell} ipython3
 # Set up f to compute f(x, y) at every x for any given y
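For reference, the nested `vmap` construction that the comment above alludes to looks roughly like this; `g` and `g_grid` are stand-ins, since the lecture's actual function is not visible in the hunk:

```python
import jax
import jax.numpy as jnp

g = lambda x, y: jnp.cos(x**2 + y**2)   # stand-in for the lecture's f

# Map over x for fixed y, then map that over y: this evaluates g on the
# full product grid without materializing meshgrid arrays.
g_x = jax.vmap(g, in_axes=(0, None))
g_grid = jax.vmap(g_x, in_axes=(None, 0))

grid = jnp.linspace(0.0, 1.0, 100)
z = g_grid(grid, grid)
print(z.shape)                          # (100, 100)
```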
@@ -340,8 +341,8 @@ Let's see the timing:
 
 ```{code-cell} ipython3
 with qe.Timer(precision=8):
-    z_vmap_1 = f_vec(grid)
-    z_vmap_1.block_until_ready()
+    z_vmap = f_vec(grid)
+    z_vmap.block_until_ready()
 ```
 
 Let's check we got the right result:
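A note on the `block_until_ready()` calls in these timing cells: JAX dispatches work asynchronously, so without blocking, the timer can stop before the device has finished. Minimal illustration (not from the lecture):

```python
import jax.numpy as jnp

x = jnp.ones((1_000, 1_000))
y = x @ x                  # returns immediately; computation may still run
y.block_until_ready()      # wait for the result before the timer stops
```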
@@ -393,6 +394,13 @@ with qe.Timer(precision=8):
     z_vmap = f_vec(x, y).block_until_ready()
 ```
 
+Let's check we got the right result:
+
+
+```{code-cell} ipython3
+jnp.allclose(z_mesh, z_vmap)
+```
+
 
 
 ### Summary
@@ -461,6 +469,8 @@ Numba's compilation is typically quite fast, and the resulting code performance
 
 Now let's create a JAX version using `lax.scan`:
 
+(We'll hold `n` static because it determines the array size, so JAX needs to specialize the compiled code on its value.)
+
 ```{code-cell} ipython3
 from jax import lax
 from functools import partial
from functools import partial
@@ -475,6 +485,8 @@ def qm_jax(x0, n, α=4.0):
475485
return jnp.concatenate([jnp.array([x0]), x])
476486
```
477487

488+
This code is not easy to read but, in essence, `lax.scan` repeatedly calls `qm_jax` and accumulates the returns `x_new` into an array.
489+
478490
Let's time it with the same parameters:
479491

480492
```{code-cell} ipython3
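The body of `qm_jax` is truncated in the hunk above. A self-contained sketch of the `lax.scan` pattern it uses, where the update rule is assumed to be the standard quadratic map `x_{t+1} = α * x_t * (1 - x_t)` (inferred from `α=4.0` in the signature, not copied from the lecture):

```python
import jax
import jax.numpy as jnp
from jax import lax
from functools import partial

@partial(jax.jit, static_argnums=(1,))    # hold n static: it fixes array size
def qm_sketch(x0, n, α=4.0):
    def update(x, t):
        x_new = α * x * (1 - x)           # assumed quadratic-map step
        return x_new, x_new               # (new carry, value to accumulate)
    _, x = lax.scan(update, x0, None, length=n)
    return jnp.concatenate([jnp.array([x0]), x])

print(qm_sketch(0.1, 5))
```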
@@ -489,24 +501,25 @@ with qe.Timer(precision=8):
     x_jax = qm_jax(0.1, n).block_until_ready()
 ```
 
-JAX is also very efficient for this sequential operation.
+JAX is also efficient for this sequential operation.
 
-Both JAX and Numba deliver strong performance after compilation.
-
-While the raw speed is similar for this type of operation, there are notable differences in code complexity and ease of understanding, which we discuss in the next section.
+Both JAX and Numba deliver strong performance after compilation, with Numba
+typically (but not always) offering slightly better speeds on purely sequential
+operations.
 
 ### Summary
 
-While both Numba and JAX deliver excellent performance for sequential operations, there are significant differences in code readability and ease of use.
+While both Numba and JAX deliver strong performance for sequential operations,
+there are significant differences in code readability and ease of use.
 
-The Numba version is straightforward and natural to read: we simply allocate an array and fill it element by element using a standard Python loop.
+The Numba version is straightforward and natural to read: we simply allocate an
+array and fill it element by element using a standard Python loop.
 
 This is exactly how most programmers think about the algorithm.
 
-The JAX version, on the other hand, requires using `lax.scan`, which is less intuitive and has a steeper learning curve.
-
-Additionally, JAX's immutable arrays mean we cannot simply update array elements in place.
+The JAX version, on the other hand, requires using `lax.scan`, which is significantly less intuitive.
 
-Instead, we must use functional programming patterns with `lax.scan`, where we define an `update` function that returns both the new state and the value to accumulate.
+Additionally, JAX's immutable arrays mean we cannot simply update array elements in place, making it hard to directly replicate the algorithm used by Numba.
 
-For this type of sequential operation, Numba is the clear winner in terms of code clarity and ease of implementation, while maintaining competitive performance.
+For this type of sequential operation, Numba is the clear winner in terms of
+code clarity and ease of implementation, as well as high performance.
