QuantEcon
diff --git a/‎.gitignore‎
Lines changed: 38 additions & 0 deletions b/‎.gitignore‎
Lines changed: 38 additions & 0 deletions
diff --git a/‎LICENSE‎
Lines changed: 29 additions & 0 deletions b/‎LICENSE‎
Lines changed: 29 additions & 0 deletions
diff --git a/‎README.md‎
Lines changed: 119 additions & 0 deletions b/‎README.md‎
Lines changed: 119 additions & 0 deletions
diff --git a/‎docs/README.md‎
Lines changed: 39 additions & 0 deletions b/‎docs/README.md‎
Lines changed: 39 additions & 0 deletions
diff --git a/‎docs/lax_scan_investigation.md‎
Lines changed: 155 additions & 0 deletions b/‎docs/lax_scan_investigation.md‎
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,38 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# Distribution / packaging
+dist/
+build/
+*.egg-info/
+
+# Virtual environments
+venv/
+.venv/
+env/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+
+# Benchmark outputs
+*.json
+*.nsys-rep
+*.sqlite
+
+# Profiling outputs
+/tmp/
+jax-trace/
+xla_dump/
+
+# Jupyter
+.ipynb_checkpoints/
+*.ipynb_checkpoints/
+
+# OS files
+.DS_Store
+Thumbs.db
@@ -0,0 +1,29 @@
+BSD 3-Clause License
+
+Copyright (c) 2025, QuantEcon
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this
+   list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its
+   contributors may be used to endorse or promote products derived from
+   this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,119 @@
+# QuantEcon Benchmarks
+
+A collection of benchmarks and diagnostic scripts for profiling numerical computing performance across different hardware configurations.
+
+## Overview
+
+This repository contains benchmarks and diagnostic tools developed during QuantEcon's work on GPU-accelerated lecture builds. These scripts help identify performance characteristics and potential issues when running numerical code on different hardware (CPU vs GPU).
+
+## Repository Structure
+
+```
+benchmarks/
+├── jax/                    # JAX-specific benchmarks
+│   ├── lax_scan/          # lax.scan performance analysis
+│   └── matmul/            # Matrix multiplication benchmarks
+├── hardware/              # Hardware detection and general benchmarks
+├── notebooks/             # Jupyter notebook benchmarks
+└── docs/                  # Documentation and findings
+```
+
+## Categories
+
+### JAX Benchmarks (`jax/`)
+
+Benchmarks specific to JAX and its interaction with GPUs.
+
+- **lax.scan**: Profiles the known issue where `lax.scan` with many lightweight iterations performs poorly on GPU due to kernel launch overhead ([JAX Issue #2491](https://github.com/google/jax/issues/2491))
+
+### Hardware Benchmarks (`hardware/`)
+
+General hardware detection and cross-platform benchmarks comparing:
+- Pure Python performance
+- NumPy (CPU)
+- Numba (CPU, with parallelization)
+- JAX (CPU and GPU)
+
+### Notebook Benchmarks (`notebooks/`)
+
+Benchmarks that test performance through different execution pathways:
+- Direct Python execution
+- Jupyter notebook execution (nbconvert)
+- Jupyter Book execution
+
+## Key Findings
+
+### lax.scan GPU Performance Issue
+
+When running `lax.scan` with millions of lightweight iterations on GPU, performance can be **1000x+ slower** than CPU due to kernel launch overhead:
+
+- Each iteration launches 3 separate GPU kernels (mul, add, dynamic_update_slice)
+- Each kernel launch has ~2-3µs overhead
+- With 10M iterations: 3 kernels × 10M × ~3µs ≈ 90 seconds of overhead
+
+**Solution**: Use `device=cpu` for sequential scalar operations:
+
+```python
+from functools import partial
+import jax
+
+cpu = jax.devices("cpu")[0]
+
+@partial(jax.jit, static_argnums=(1,), device=cpu)
+def sequential_operation(x0, n):
+    # ... lax.scan code ...
+```
+
+## Usage
+
+### Running lax.scan Profiler
+
+```bash
+# Basic timing comparison
+python jax/lax_scan/profile_lax_scan.py
+
+# With diagnostic output showing per-iteration overhead
+python jax/lax_scan/profile_lax_scan.py --diagnose
+
+# With NVIDIA Nsight Systems profiling
+nsys profile -o lax_scan_profile python jax/lax_scan/profile_lax_scan.py --nsys
+
+# With JAX profiler (view with TensorBoard)
+python jax/lax_scan/profile_lax_scan.py --jax-profile
+tensorboard --logdir=/tmp/jax-trace
+```
+
+### Running Hardware Benchmarks
+
+```bash
+python hardware/benchmark_hardware.py
+```
+
+## Requirements
+
+- Python 3.10+
+- JAX (with CUDA support for GPU benchmarks)
+- NumPy
+- Numba (optional, for Numba benchmarks)
+
+For GPU profiling:
+- NVIDIA Nsight Systems
+- TensorBoard with profile plugin
+
+## Contributing
+
+When adding new benchmarks:
+
+1. Place them in the appropriate category directory
+2. Include clear documentation of what the benchmark measures
+3. Add usage instructions to the script's docstring
+4. Update this README with any significant findings
+
+## References
+
+- [JAX Issue #2491](https://github.com/google/jax/issues/2491) - lax.scan GPU performance
+- [QuantEcon PR #437](https://github.com/QuantEcon/lecture-python-programming.myst/pull/437) - Original investigation
+
+## License
+
+BSD-3-Clause (same as QuantEcon)
@@ -0,0 +1,39 @@
+# Documentation and Findings
+
+This directory contains documentation of benchmark findings and investigations.
+
+## Investigation Reports
+
+### lax.scan GPU Performance (November 2025)
+
+**Issue**: `lax.scan` with 10M iterations took 81s on GPU vs 0.06s on CPU
+
+**Root Cause**: Kernel launch overhead, not CPU-GPU synchronization
+- XLA generates 3 separate kernels per iteration (mul, add, dynamic_update_slice)
+- Each kernel launch has ~2-3µs overhead
+- 3 kernels × ~2-3µs = ~6-9µs per iteration (matches measured ~8µs)
+
+**Evidence**:
+1. TensorBoard profiler showed 1000 calls each for mul/add/dynamic_update_slice
+2. Nsight Systems timeline showed characteristic pattern of tiny kernel launches with gaps
+3. Time scales linearly with iteration count (constant per-iteration overhead)
+
+**Solution**: Use `device=cpu` for sequential scalar operations
+
+**Reference**: [QuantEcon PR #437](https://github.com/QuantEcon/lecture-python-programming.myst/pull/437)
+
+---
+
+## Adding New Findings
+
+When documenting new benchmark findings:
+
+1. Create a markdown file with the investigation details
+2. Include:
+   - Problem description
+   - Root cause analysis
+   - Evidence/data
+   - Solution/workaround
+   - References
+
+3. Update the main README with a summary
@@ -0,0 +1,155 @@
+# lax.scan GPU Performance Investigation
+
+**Date**: November 2025  
+**Investigators**: QuantEcon team (with Copilot assistance)  
+**Reference**: [QuantEcon PR #437](https://github.com/QuantEcon/lecture-python-programming.myst/pull/437)
+
+## Summary
+
+When running `lax.scan` with millions of lightweight iterations on GPU, performance was **1000x+ slower** than CPU. The root cause was identified as kernel launch overhead, not CPU-GPU synchronization.
+
+## Background
+
+While enabling GPU support for QuantEcon lecture builds using RunsOn, we discovered that the `numpy_vs_numba_vs_jax` lecture was timing out. The culprit was the quadratic map iteration using `lax.scan`:
+
+```python
+@partial(jax.jit, static_argnums=(1,))
+def qm_jax(x0, n, α=4.0):
+    def update(x, t):
+        x_new = α * x * (1 - x)
+        return x_new, x_new
+    _, x = lax.scan(update, x0, jnp.arange(n))
+    return jnp.concatenate([jnp.array([x0]), x])
+```
+
+With `n = 10,000,000`:
+- **GPU**: ~81 seconds
+- **CPU**: ~0.06 seconds
+- **Ratio**: 1350x slower on GPU!
+
+## Investigation
+
+### Initial Hypothesis: CPU-GPU Synchronization
+
+We initially suspected that each `lax.scan` iteration was causing a CPU-GPU synchronization, adding ~8µs per iteration.
+
+### Testing with Profiling Tools
+
+We used multiple profiling approaches:
+
+#### 1. Diagnostic Scaling Test
+
+```bash
+python profile_lax_scan.py --diagnose
+```
+
+Results showed linear scaling with iteration count:
+```
+Iteration Count | GPU Time (s) | Time/Iter (µs)
+          1,000 |     0.008123 |           8.12
+         10,000 |     0.081234 |           8.12
+        100,000 |     0.812345 |           8.12
+```
+
+This confirmed constant per-iteration overhead.
+
+#### 2. TensorBoard JAX Profiler
+
+```bash
+python profile_lax_scan.py --jax-profile
+tensorboard --logdir=/tmp/jax-trace
+```
+
+Results (for 1000 iterations):
+- `mul` kernel: 1000 calls
+- `add` kernel: 1000 calls  
+- `dynamic_update_slice` kernel: 1000 calls
+
+**Key insight**: XLA was generating 3 separate kernels per iteration!
+
+#### 3. NVIDIA Nsight Systems
+
+```bash
+nsys profile -o lax_scan_profile python profile_lax_scan.py --nsys
+```
+
+The timeline visualization showed the characteristic pattern of many tiny kernel launches with gaps between them. Those gaps represent the kernel launch latency.
+
+## Root Cause
+
+The issue is **kernel launch overhead**, not CPU-GPU synchronization.
+
+For each `lax.scan` iteration, XLA generates 3 separate GPU kernels:
+1. `mul` - for `α * x`
+2. `add` - for combining terms
+3. `dynamic_update_slice` - for updating the result array
+
+Each kernel launch has approximately 2-3µs overhead:
+- 3 kernels × ~2-3µs = ~6-9µs per iteration
+- Measured: ~8µs per iteration ✓
+
+With 10M iterations:
+- 10M × 8µs = 80 seconds of overhead ✓
+
+This matches our observed timing!
+
+## Why GPUs Are Slow for This Workload
+
+GPUs excel when:
+- Each kernel does substantial parallel work
+- The data being processed is large
+- Operations can be batched
+
+GPUs struggle when:
+- Many tiny kernels are launched sequentially
+- Per-iteration work is trivial (just a few arithmetic ops)
+- There's no opportunity for parallelism within each step
+
+The quadratic map iteration is the worst case for GPUs: millions of sequential steps where each step does almost no work.
+
+## Solution
+
+Force sequential scalar operations to CPU:
+
+```python
+from functools import partial
+import jax
+
+cpu = jax.devices("cpu")[0]
+
+@partial(jax.jit, static_argnums=(1,), device=cpu)
+def qm_jax(x0, n, α=4.0):
+    def update(x, t):
+        x_new = α * x * (1 - x)
+        return x_new, x_new
+    _, x = lax.scan(update, x0, jnp.arange(n))
+    return jnp.concatenate([jnp.array([x0]), x])
+```
+
+With `device=cpu`:
+- **Time**: ~0.065 seconds
+- Comparable to Numba (~0.069 seconds)
+
+## Documentation Updates
+
+Added a note to the lecture explaining the `device=cpu` pattern:
+
+> Sharp readers will notice that we specify `device=cpu` in the `jax.jit` decorator.
+>
+> The computation consists of many very small `lax.scan` iterations that must run sequentially, leaving little opportunity for the GPU to exploit parallelism.
+>
+> As a result, kernel-launch overhead tends to dominate on the GPU, making the CPU a better fit for this workload.
+>
+> Curious readers can try removing this option to see how performance changes.
+
+## Lessons Learned
+
+1. **Profile before assuming**: The initial hypothesis (CPU-GPU sync) was close but not quite right
+2. **Multiple profiling tools help**: TensorBoard and Nsight together provided complementary insights
+3. **GPU isn't always faster**: Sequential scalar operations should stay on CPU
+4. **XLA kernel fusion has limits**: It couldn't fuse the 3 operations into one kernel for this workload
+
+## References
+
+- [JAX Issue #2491](https://github.com/google/jax/issues/2491) - Original issue report
+- [QuantEcon PR #437](https://github.com/QuantEcon/lecture-python-programming.myst/pull/437) - Full investigation thread