From 09aaabd0dc40f17a03879f5d9191e0b639dd6878 Mon Sep 17 00:00:00 2001 From: Hyesoon Kim Date: Mon, 29 Sep 2025 16:33:56 -0400 Subject: [PATCH 01/11] Update README.md Add workshop schedule --- README.md | 53 ++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 48 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 317ad10..4bc5708 100644 --- a/README.md +++ b/README.md @@ -18,17 +18,60 @@ Chihyo (Mark) Ahn (Georgia Institute of Technology) Shinnung Jeong (Georgia Institute of Technology) -## Tentative Tutorial Schedule +## Tentative Tutorial and Workshop Schedule | Time | Contents | Presenter | slides | |-------------|---------------------------------------------|-------------------|--------| | 8:00-8:20 | Intro and GPU background | Hyesoon Kim | | | 8:20-9:20 | Vortex Microarchitecture and Software Stack | Blaise Tine | | | 9:20-9:40 | CuPBoP: Running OpenCL and CUDA on Vortex | Chihyo (Mark) Ahn | || -| 9:40-10:00 | Q&A Session | | | -| 10:00-10:20 | Coffee Break | | | -| 10:20-11:00 | Vortex Software | | | -| 11:00-12:00 | Vortex Workshop | | | +| 9:40-10:00 | Vortex Tutorial Assignment | | | +| 10:00-10:30 | Q&A and Coffee Break | | | +| 10:30-11:40 | Vortex Workshop +| 11:40-12:00 | Review of Tutorial Assignments | | | + + +# Vortex Workshop Info + +--- + +## Portable Vortex HDL for FPGA and ASIC Technologies +**Presenters:** Jamie Kelly (enVention, LLC) and Scott O’Malia (enVention, LLC) + +### Abstract +In this work, we analyze the open-source Vortex GPGPU HDL source code for portability between FPGA and ASIC target technologies. Beyond coding HDL source for legal RTL synthesis, several architecture aspects should be considered to ease technology retargeting without significant HDL source changes. Clock and reset trees and fanout control can be planned at the HDL level. Required sync and async reset types can vary with target technology, warranting a generic, global method to automatically handle each case. 
Special handling of clock and reset domain crossings may be required. Well-planned design hierarchy can aid floorplanning for back-end tools. Technology-specific leaf cells, such as static RAMs and arithmetic multipliers, should be wrapped using a common interface and parameter set. RAM wrappers can contain special reset control state machines to directly initialize RAM contents for many ASIC technologies that do not support this function. HDL logic pipelining and technology timing closure rely heavily on the use of flip-flop cells for delay. FPGA and ASIC flip-flop area costs are quite different, especially when complex scan-style cells are needed for ASIC manufacturing testing. The ratio of combinatorial look-up tables to flip-flops is examined. The Vortex GPGPU HDL source is analyzed for each of these cited aspects, and the results and suggested improvements are presented in this paper. + +### Bios +**Jamie Kelly** +Jamie Kelly (MS EE ‘97, MS Physics ‘07) has worked in hardware, software, FPGA, and ASIC development for more than 25 years. He has expertise in telecommunications/networking, packet switching/queuing, Linux kernel/device drivers, and end-to-end FPGA/ASIC design. Jamie currently serves as the Director of Hardware Engineering at enVention, LLC in Huntsville, Alabama, USA. + +**Scott O’Malia** +Scott O'Malia (BS MET ’09, BS EE ’13) is an Electrical Engineer at enVention, LLC with over 10 years of experience in FPGA verification, embedded systems, and safety-critical hardware/software design. His expertise includes HDL development and verification, applying DO-178/DO-254 rigor for flight-critical applications, and advancing vendor-independent FPGA verification solutions for long-term sustainment. 
+ +--- + +## A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation +**Presenters:** Nikhil Rout (Vellore Institute of Technology) and Blaise Tine (UCLA) + +### Abstract +There has been increasing interest in developing and accelerating mixed-precision Matrix-Multiply-Accumulate operations in GPGPUs for Deep Learning workloads. However, existing open-source RTL implementations of inner dot product units rely on discrete arithmetic units, leading to suboptimal throughput and poor resource utilization. To address these challenges, we propose a scalable mixed-precision dot product unit that integrates floating-point and integer arithmetic pipelines within a singular fused architecture, implemented as part of the open-source RISC-V based Vortex GPGPU’s Tensor Core Unit extension. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/UINT4 formats and higher-precision accumulation in FP32/INT32, with an extensible framework for adding and evaluating other custom representations in the future. Experimental results demonstrate 4-cycle operation latency at 362.2 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA, delivering an ideal filled pipeline throughput of 11.948 GFlops in a 4-thread configuration. + +--- + +## How Vortex Made Virgo Possible +**Presenter:** Hansung Kim (UC Berkeley) + +### Abstract +*Abstract text to be provided.* + +### Bio +**Hansung Kim** +Hansung Kim is a Ph.D. candidate at the University of California, Berkeley. His research focuses on computer architecture, with an emphasis on open-source GPU systems, heterogeneous computing, and hardware/software co-design. He has been actively involved in the Vortex project and its extensions, and his work highlights how Vortex enabled the development of the Virgo platform. 
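The mixed-precision fused dot product unit described above forms products at low precision and accumulates at higher precision. The numerical behavior (not the RTL design itself) can be sketched as a functional software model; the names below are illustrative, and BF16 is modeled here simply by truncating FP32 values:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Truncate an IEEE-754 float to BF16 precision by keeping the upper 16 bits.
// (Rounding is ignored here for simplicity; hardware would round-to-nearest.)
static float to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;  // drop the low 16 mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Dot product with BF16 multiplication and FP32 accumulation:
// each product is formed at low precision, the running sum stays in FP32.
float bf16_dot_fp32(const std::vector<float>& a, const std::vector<float>& b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        acc += to_bf16(a[i]) * to_bf16(b[i]);
    return acc;
}
```

A fused hardware unit performs these multiplies and the wide accumulation in a single pipeline rather than in discrete arithmetic units; the sketch only captures the arithmetic semantics.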
+ +--- + + + ## Tutorial Assignments From 6a99604f2a602a3ef08402985f86d8c4f74c7e3b Mon Sep 17 00:00:00 2001 From: Hyesoon Kim Date: Mon, 29 Sep 2025 16:34:38 -0400 Subject: [PATCH 02/11] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4bc5708..d0a3d54 100644 --- a/README.md +++ b/README.md @@ -66,7 +66,7 @@ There has been increasing interest in developing and accelerating mixed-precisio ### Bio **Hansung Kim** -Hansung Kim is a Ph.D. candidate at the University of California, Berkeley. His research focuses on computer architecture, with an emphasis on open-source GPU systems, heterogeneous computing, and hardware/software co-design. He has been actively involved in the Vortex project and its extensions, and his work highlights how Vortex enabled the development of the Virgo platform. +Hansung Kim is a Ph.D. candidate at the University of California, Berkeley. --- From 2f93bd20f580d31f2ae9032a9e5abf66206d1932 Mon Sep 17 00:00:00 2001 From: Hyesoon Kim Date: Fri, 10 Oct 2025 09:36:28 -0400 Subject: [PATCH 03/11] Update README.md --- README.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d0a3d54..e122263 100644 --- a/README.md +++ b/README.md @@ -58,7 +58,7 @@ There has been increasing interest in developing and accelerating mixed-precisio --- -## How Vortex Made Virgo Possible +## Virgo and Muon: Enabling Scalable Matrix Units and a New ASIC-Focused SIMT Core with Vortex **Presenter:** Hansung Kim (UC Berkeley) ### Abstract @@ -66,7 +66,10 @@ There has been increasing interest in developing and accelerating mixed-precisio ### Bio **Hansung Kim** -Hansung Kim is a Ph.D. candidate at the University of California, Berkeley. +Hansung Kim is a 6th-year Ph.D. student in EECS at UC Berkeley, advised by Prof. Sophia Shao. 
His work focuses on GPU microarchitecture and +hardware/software co-design, with strong technical expertise in RTL implementation, GPU kernel development, and SoC integration. He is currently +on the job market for industry positions and welcomes opportunities to connect. + --- From 0c22974e6f3a3a63f3a2feb500d8a3920c10479e Mon Sep 17 00:00:00 2001 From: Hyesoon Kim Date: Fri, 10 Oct 2025 10:21:29 -0400 Subject: [PATCH 04/11] Update README.md --- README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index e122263..829d190 100644 --- a/README.md +++ b/README.md @@ -22,9 +22,10 @@ Shinnung Jeong (Georgia Institute of Technology) | Time | Contents | Presenter | slides | |-------------|---------------------------------------------|-------------------|--------| -| 8:00-8:20 | Intro and GPU background | Hyesoon Kim | | -| 8:20-9:20 | Vortex Microarchitecture and Software Stack | Blaise Tine | | -| 9:20-9:40 | CuPBoP: Running OpenCL and CUDA on Vortex | Chihyo (Mark) Ahn | || +| 8:00-8:10 | Intro and GPU background | Hyesoon Kim | | +| 8:10-9:10 | Vortex Microarchitecture and Software Stack | Blaise Tine | | +| 9:10-9:25 | Vortex Compiler and running OpenCL | Shinnung Jeong | | +| 9:25-9:40 | CuPBoP: Running CUDA on Vortex | Chihyo (Mark) Ahn | || | 9:40-10:00 | Vortex Tutorial Assignment | | | | 10:00-10:30 | Q&A and Coffee Break | | | | 10:30-11:40 | Vortex Workshop From 044406f05f223dc96ab52c47119daed563296644 Mon Sep 17 00:00:00 2001 From: Hyesoon Kim Date: Fri, 10 Oct 2025 10:23:05 -0400 Subject: [PATCH 05/11] Update README.md --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index 829d190..e06154d 100644 --- a/README.md +++ b/README.md @@ -103,7 +103,10 @@ If you would like to set up Vortex on your own system, [instructions can be foun ## Relevant Repos * [Vortex](https://github.com/vortexgpgpu/vortex) +* * [Vortex 
Toolchain](https://github.com/vortexgpgpu/vortex-toolchain-prebuilt)
+* [CuPBoP on Vortex](https://github.com/cupbop/CuPBoP_Vortex)
+

 ## Mailing list
 For tutorial info please join https://docs.google.com/forms/d/1r8E-Yo5NwA45Hi3-kEwte4AxK0mBsYDwgjM6Bul4so0/edit

From 1ba49c5075638250eba44e23d97ffd33b484678e Mon Sep 17 00:00:00 2001 From: Hyesoon Kim Date: Fri, 10 Oct 2025 16:47:15 -0400 Subject: [PATCH 06/11] Update README.md --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index e06154d..3a2a435 100644 --- a/README.md +++ b/README.md @@ -54,6 +54,9 @@ Scott O'Malia (BS MET ’09, BS EE ’13) is an Electrical Engineer at enVention ## A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation **Presenters:** Nikhil Rout (Vellore Institute of Technology) and Blaise Tine (UCLA) +Nikhil Rout is a 4th-year undergraduate student in ECE at the Vellore Institute of Technology, Chennai. He has been a research intern with the Vortex GPGPU group since summer 2025, advised by Prof. Blaise Tine. His research interests lie in GPGPUs and DNN accelerators at the microarchitecture abstraction level. + + ### Abstract There has been increasing interest in developing and accelerating mixed-precision Matrix-Multiply-Accumulate operations in GPGPUs for Deep Learning workloads. However, existing open-source RTL implementations of inner dot product units rely on discrete arithmetic units, leading to suboptimal throughput and poor resource utilization. To address these challenges, we propose a scalable mixed-precision dot product unit that integrates floating-point and integer arithmetic pipelines within a singular fused architecture, implemented as part of the open-source RISC-V based Vortex GPGPU’s Tensor Core Unit extension. 
Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/UINT4 formats and higher-precision accumulation in FP32/INT32, with an extensible framework for adding and evaluating other custom representations in the future. Experimental results demonstrate 4-cycle operation latency at 362.2 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA, delivering an ideal filled pipeline throughput of 11.948 GFlops in a 4-thread configuration.

From 6522a0953415700f2b30569a703125bb3b79f963 Mon Sep 17 00:00:00 2001 From: Hansung Kim Date: Mon, 13 Oct 2025 21:18:37 -0700 Subject: [PATCH 07/11] Update talk info --- README.md | 29 ++++++++++++++++++++++++----- 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 3a2a435..28b303e 100644 --- a/README.md +++ b/README.md @@ -62,17 +62,36 @@ There has been increasing interest in developing and accelerating mixed-precisio --- -## Virgo and Muon: Enabling Scalable Matrix Units and a New ASIC-Focused SIMT Core with Vortex +## Virgo and Radiance: Enabling Scalable Matrix Units and an SoC-based GPU Platform with Vortex **Presenter:** Hansung Kim (UC Berkeley) ### Abstract -*Abstract text to be provided.* +Modern GPUs integrate specialized matrix units like Tensor Cores to accelerate +deep learning. However, their tight coupling with SIMT cores limits tensor +operation size due to register file and bandwidth constraints, hindering both +scalability and energy efficiency. + +To address this limitation, we present Virgo, a GPU microarchitecture that +integrates matrix units at the SIMT cluster level. By physically disaggregating +the matrix units from SIMT cores, Virgo supports larger tiles, lowers +instruction overhead, and improves data reuse and energy efficiency. Leveraging +the Vortex HW/SW stack, Virgo demonstrates full-system design and evaluation +for fused kernels such as FlashAttention. 
+ +Building on top of Virgo and Vortex, we introduce our recent work on Radiance, +an ASIC SoC–based GPU platform within Chipyard. Radiance features the new +Chisel-based Muon SIMT core which improves PPA via a redesigned issue pipeline, +dynamic warp occupancy support, and an extended ISA that expands register +capacity while reducing stack accesses. We discuss tentative plans for +a silicon tape-out. ### Bio **Hansung Kim** -Hansung Kim is a 6th-year Ph.D. student in EECS at UC Berkeley, advised by Prof. Sophia Shao. His work focuses on GPU microarchitecture and -hardware/software co-design, with strong technical expertise in RTL implementation, GPU kernel development, and SoC integration. He is currently -on the job market for industry positions and welcomes opportunities to connect. +Hansung Kim is a Ph.D. candidate at UC Berkeley, advised by Prof. Sophia +Shao. His research focuses on GPU microarchitecture and hardware/software +co-design, with technical expertise in RTL implementation, GPU kernel +development and SoC integration. He is currently on the job market for +industry positions and welcomes opportunities to connect. --- From d4ec10f34a30f68bfcf658e7f02d06af1d2f1880 Mon Sep 17 00:00:00 2001 From: Rahul Raj D N Date: Tue, 14 Oct 2025 14:05:36 -0400 Subject: [PATCH 08/11] Added apptainer instructions for Vortex tutorials --- README.md | 5 ++ apptainer/INSTALL.md | 114 +++++++++++++++++++++++++++++++++++++++++++ apptainer/README.md | 77 +++++++++++++++++++++++++++++ 3 files changed, 196 insertions(+) create mode 100644 apptainer/INSTALL.md create mode 100644 apptainer/README.md diff --git a/README.md b/README.md index 28b303e..0f10171 100644 --- a/README.md +++ b/README.md @@ -116,6 +116,11 @@ Provided are seven hands-on tutorial assignments covering various aspects of Vor ### Remote Access A terminal interface hosted by the [CRNCH Rogues Gallery](https://crnch-rg.cc.gatech.edu/) is provided. [Instructions can be found here](./REMOTE_ACCESS.md). 
+ +### Apptainer +See the [Apptainer instructions](./apptainer/README.md) for how to set up the apptainer and run simulation for Vortex. + + ### Docker (Experimental) See the [Docker instructions](./docker/README.md) for how to set up a Docker image for Vortex. diff --git a/apptainer/INSTALL.md b/apptainer/INSTALL.md new file mode 100644 index 0000000..2827b4d --- /dev/null +++ b/apptainer/INSTALL.md @@ -0,0 +1,114 @@ +Apptainer (formerly Singularity) is a container system optimized for HPC and secure scientific environments, so installation varies by OS family. + + +## 🐧 1. Ubuntu / Debian +#### ✅ Option A — Install via .deb package + +``` +sudo apt update +sudo apt install -y build-essential libseccomp-dev pkg-config squashfs-tools cryptsetup wget + +# Download the latest stable release +wget https://github.com/apptainer/apptainer/releases/download/v1.2.2/apptainer_1.2.2_amd64.deb + +# Install +sudo apt install ./apptainer_1.2.2_amd64.deb + +# Verify +apptainer --version +``` + +#### ✅ Option B — Build from source (if .deb not available) +``` +sudo apt update +sudo apt install -y build-essential uuid-dev libseccomp-dev pkg-config squashfs-tools cryptsetup wget git golang-go + +cd /tmp +wget https://github.com/apptainer/apptainer/releases/download/v1.2.2/apptainer-1.2.2.tar.gz +tar -xzf apptainer-1.2.2.tar.gz +cd apptainer-1.2.2 +./mconfig +make -C builddir +sudo make -C builddir install + +# Verify +apptainer --version +``` + + +## 🧱 2. RHEL / AlmaLinux / Rocky / CentOS +#### ✅ Option A — Install via EPEL (Recommended) +``` +sudo dnf install -y epel-release +sudo dnf config-manager --set-enabled crb +sudo dnf install -y apptainer +``` + +Works for RHEL 8/9, AlmaLinux, Rocky Linux, CentOS Stream, etc. 
+ +#### ✅ Option B — Build from source +``` +sudo dnf groupinstall -y "Development Tools" +sudo dnf install -y golang libseccomp-devel squashfs-tools cryptsetup wget git pkg-config make + +cd /tmp +wget https://github.com/apptainer/apptainer/releases/download/v1.2.2/apptainer-1.2.2.tar.gz +tar -xzf apptainer-1.2.2.tar.gz +cd apptainer-1.2.2 +./mconfig +make -C builddir +sudo make -C builddir install + +# Verify +apptainer --version +``` + + +## 🍎 3. macOS + +Apptainer doesn’t run natively on macOS — it’s a Linux-only system (needs Linux kernel namespaces). +But you can run it using Linux virtual environments: + +#### ✅ Option A — Using Homebrew + Apptainer inside a Linux VM + +Install Homebrew and a lightweight Linux VM (like multipass): + +``` +brew install --cask multipass +multipass launch --name ubuntu --cpus 4 --mem 4G --disk 20G +multipass shell ubuntu +``` + +Inside the VM, follow the Ubuntu install steps above. + +#### ✅ Option B — Using Docker + Apptainer inside container +``` +docker run -it --privileged ghcr.io/apptainer/apptainer:latest bash + +# Verify +apptainer --version +``` + + +## 🪟 4. Windows 10/11 + +Apptainer requires Linux namespaces → it cannot run directly on native Windows. + +#### ✅ Option A — Use WSL2 (Windows Subsystem for Linux) + +Enable WSL2 and install Ubuntu: +``` +wsl --install -d Ubuntu +``` +Inside WSL Ubuntu terminal: Follow the Ubuntu install steps above (either using Ubuntu Debian package / Build from source). + + +#### ✅ Option B — Use a full Linux VM (VirtualBox, VMware, or WSL2 Ubuntu) + +If you need GPU or privileged access, use a full Linux VM with Apptainer installed inside. 
+ + + + +### Reference: +https://apptainer.org/docs/admin/main/installation.html \ No newline at end of file diff --git a/apptainer/README.md b/apptainer/README.md new file mode 100644 index 0000000..234b3a9 --- /dev/null +++ b/apptainer/README.md @@ -0,0 +1,77 @@ +# Apptainer Build Process + +Prerequisite: Install `apptainer` package on your machine by following [INSTALL.md](./INSTALL.md) + + +# Clone Vortex repo + +Create tools directory for mounting vortex-toolchains onto the apptainer +``` +$ mkdir -p tools +``` + +``` +$ git clone --depth=1 --recursive https://github.com/vortexgpgpu/vortex.git +``` + +Go to `apptainer` directory and build the vortex apptainer + +``` +$ ls + tools vortex + +$ cd vortex/miscs/apptainer + +$ apptainer build --no-https vortex.sif vortex.def + +``` + +To start the apptainer, +``` +apptainer shell --fakeroot --cleanenv --writable-tmpfs --bind ../../../vortex:/home/vortex --bind ../../../tools:/home/tools vortex.sif +``` + + +# Vortex Simulation inside Apptainer + +Go to the bind of vortex repo, +``` +Apptainer> cd /home/vortex +Apptainer> ./ci/install_dependencies.sh +Apptainer> mkdir build +Apptainer> cd build +Apptainer> ../configure --xlen=32 --tooldir=$HOME/tools + + +Skip the below 3 steps, if toolchains are already present in the $HOME/tools; (These steps are compulsory while getting the setup ready for the first time) +Apptainer> sed -i 's/\btar /tar --no-same-owner /g' ci/toolchain_install.sh +Apptainer> ./ci/toolchain_install.sh --all +Apptainer> sed -i 's/\btar --no-same-owner /tar /g' ci/toolchain_install.sh + +Apptainer> ls $HOME/tools/ +libc32 libc64 libcrt32 libcrt64 llvm-vortex pocl riscv32-gnu-toolchain riscv64-gnu-toolchain sv2v verilator yosys + +Apptainer> source ./ci/toolchain_env.sh +Apptainer> verilator --version +``` + + +### Running SIMX, RTLSIM and XRTSIM +``` +Compile the Vortex codebase +Apptainer> make -s + +Run the programs by specifying the appropriate driver as shown below: + +SIMX +Apptainer> 
./ci/blackbox.sh --cores=2 --app=demo --driver=simx + +RTLSIM +Apptainer> ./ci/blackbox.sh --cores=2 --app=demo --driver=rtlsim + +XRTSIM +Apptainer> ./ci/blackbox.sh --cores=2 --app=demo --driver=xrt + + +Apptainer> make -C runtime/ clean +``` From e3bb527a6d04d2143dba34c4d0f9ab3ab83d2bef Mon Sep 17 00:00:00 2001 From: Vincent Pham <113321638+powlectro@users.noreply.github.com> Date: Thu, 16 Oct 2025 16:45:25 -0400 Subject: [PATCH 09/11] Update assignment3.md --- Exercises/assignment3.md | 651 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 614 insertions(+), 37 deletions(-) diff --git a/Exercises/assignment3.md b/Exercises/assignment3.md index 1d13d53..d8b0a73 100644 --- a/Exercises/assignment3.md +++ b/Exercises/assignment3.md @@ -1,64 +1,641 @@ # Assignment #3: GPU Software Prefetching (SimX) -This assignment is divided into two parts. -The first part involves extending the tag in the cache to include a prefetch bit. -The second involves adding three performance counters to measure the following metrics: +This assignment will be divided into two parts. The first part involves adding a new prefetch instruction as well as a corresponding flag bit to identify if it has been prefetched. The second involves adding three performance counters to measure the following: + 1. Number of unique prefetch requests to main memory -2. Number of unused prefetched blocks +2. Number of unused prefetched blocks 3. Number of late prefetches -All of these counters should be implemented in `VX_cache_bank.sv`. +All of these counters should be implemented in `cache_sim.h` + +## Part 1: Adding Prefetch Instruction to SimX + +To begin, we will add the prefetch instruction in a new group of instructions. 
Then we will develop a test directory and script to verify correctness and functionality.

### Step 1: Adding Prefetch Intrinsic

First, add the prefetch intrinsic to `/kernel/include/vx_intrinsics.h` (right after `vx_barrier()`)

```c
// Software Prefetch
inline void vx_prefetch(const void* addr) {
    __asm__ volatile (".insn r %0, 0, 5, x0, %1, x0" :: "i"(RISCV_CUSTOM0), "r"(addr) : "memory");
}
```

This places the prefetch instruction in a new instruction group; the instruction itself uses the R-type instruction format.

### Step 2: Implement into SimX

#### 2a: Editing `types.h`

Before we can decode the instruction, we need to add a new `PREFETCH` value into `LsuType` in the file `/sim/simx/types.h`

```cpp
enum class LsuType {
  LOAD,
  STORE,
  FENCE,
  PREFETCH // ADD
};
```

We also need a `PREFETCH` case in the `LsuType` stream operator

```cpp
inline std::ostream &operator<<(std::ostream &os, const LsuType& type) {
  switch (type) {
  case LsuType::LOAD: os << "LOAD"; break;
  case LsuType::STORE: os << "STORE"; break;
  case LsuType::FENCE: os << "FENCE"; break;
  case LsuType::PREFETCH: os << "PREFETCH"; break; // ADD
  default:
    assert(false);
  }
  return os;
}
```

#### 2b: Editing `decode.cpp`

We want to update `case Opcode::EXT1:` (in the `/sim/simx/decode.cpp` file) to add the new prefetch instruction group (right after the `case 2` instruction group)

```cpp
case 5: { // SOFTWARE PREFETCH
  auto instr = std::allocate_shared<Instr>(instr_pool_, uuid, FUType::LSU);
  switch (funct3) {
  case 0: // PREFETCH
    instr->setOpType(LsuType::PREFETCH); // Make sure it is set to PREFETCH
    instr->setArgs(IntrLsuArgs{0, 0, 0});
    instr->setSrcReg(0, rs1, RegType::Integer);
    break;
  default:
    std::abort();
  }
  ibuffer.push_back(instr);
} break;
```

In the `op_string()` function, we need to add a `PREFETCH` case (under the `FENCE` case)

```cpp
case LsuType::PREFETCH: return {"PREFETCH", ""}; // ADD
```

#### 2c: Editing 
execute.cpp -## Part 1: Extending the cache tag +In order for the instruction to perform a prefetch, we need to add a case for `PREFETCH` in the `execute()` function (in the `/sim/simx/execute.cpp` file) -You will need to extend the metadata tag in the bank to incorporate an additional prefetch bit. Keep in mind that the metadata tag is not the same as the line tag. +```cpp +case LsuType::PREFETCH: { + auto trace_data = std::make_shared(num_threads); + trace->data = trace_data; + + for (uint32_t t = thread_start; t < num_threads; ++t) { + if (!warp.tmask.test(t)) + continue; + uint64_t prefetch_addr = rs1_data[t].u; + + // Record the prefetch address in trace + trace_data->mem_addrs.at(t) = {prefetch_addr, 4}; // 4 bytes or cache line size + + // Issue dummy read to populate cache + uint32_t dummy; + this->dcache_read(&dummy, prefetch_addr, sizeof(uint32_t)); + + DP(2, "PREFETCH: addr=0x" << std::hex << prefetch_addr << std::dec << " (thread " << t << ")"); + } +} break; +``` -### Hints +In this implementation, we issue a dummy read in order to populate a cache. This will trigger SimX to place data (from an address) into cache, essentially prefetching the data. The instruction will not modify or perform anything outside of that. -- The last two bits of `core_req_tag` are truncated before reaching `VX_cache_bank.sv`. Keep this in mind while adding the prefetch bit to the tag in the `VX_lsu_unit.sv`. -- To verify that your implementation is correct, add the prefetch bit to the debug header in `VX_cache_bank.sv`. 
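To see why a dummy read is enough to implement prefetching, consider this toy stand-alone cache model (illustrative only; the names and structure are hypothetical, not SimX code). A prefetch is a read whose data is discarded: it fills the line, so a later demand access to the same line hits.

```cpp
#include <cstdint>
#include <unordered_set>

// Toy fully-associative cache with unbounded capacity (illustration only).
struct ToyCache {
    static constexpr uint64_t kLineBits = 6;  // 64-byte cache lines
    std::unordered_set<uint64_t> lines;       // tags of resident lines
    uint64_t misses = 0;

    // Demand access: returns true on a hit, fills the line on a miss.
    bool access(uint64_t addr) {
        uint64_t tag = addr >> kLineBits;
        if (lines.count(tag))
            return true;
        ++misses;
        lines.insert(tag);
        return false;
    }

    // Prefetch: just a read whose result is discarded ("dummy read").
    void prefetch(uint64_t addr) {
        (void)access(addr);
    }
};
```

In this model, `prefetch(a)` followed by `access(a)` records one miss (charged to the prefetch) and one hit, which is the same warming effect the dummy `dcache_read` call has on the simulated data cache.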
+### Step 3: Creating Test Application ---- +#### 3a: Creating Test Directory -## Part 2: +In `/tests/regression/`, we want to duplicate the `fence` folder and rename it to `prefetch` -### 2a: Counter for the number of unique prefetch requests to memory +```bash +# Create prefetch test from fence test +cp -r tests/regression/fence tests/regression/prefetch +cd tests/regression/prefetch -The kernel in the `prefetch` app generates multiple prefetch requests to the same address. A unique prefetch request is the first request generated for that address that misses in the cache and goes to main memory. Any subsequent prefetch requests to the same address result in a cache hit. +# Modify the Makefile +sed -i 's/PROJECT=fence/PROJECT=prefetch/g' Makefile +``` -### Hints -- Use the `mreq_push` signal in `VX_cache_bank.sv`. +You should now have the `/tests/regression/prefetch` directory, this will be our testing directory for our new instruction ---- +**Note:** When cloning, make sure you go into `Makefile` and adjust the project name to `prefetch` -### 2b: Counter for the number of unused prefetched blocks +#### 3b: Modify Test Script -- In part 1 of this assignment, you added a prefetch bit to the `core_req_tag` to indicate whether an ***instruction was a software prefetch***. Now, you need to add this bit to the tag store in VX_tag_access.sv to indicate whether a ***block has been brought in by a prefetch request***. -- You need to add a new data structure in stage 1 of the cache pipeline (the same stage as the data access) to store information about whether a cache block has been used or not. Look at `VX_Cache_tags.sv` for an idea of how this can be done. This information is universal and is applicable for every cache block. -- The first point comes into picture since you want to know whether a ***prefetched block*** has been used or not. 
-- An important point to note is that we know whether a block has been used/unused only during a ***fill operation*** since that is when the block is evicted from the cache. +We will need to modify `kernel.cpp` (in the testing directory) and add a call to `vx_prefetch()` ---- +```cpp +#include +#include // ADD +#include "common.h" -### 2c: Counter for the number of late prefetches +void kernel_body(kernel_arg_t* __UNIFORM__ arg) { + uint32_t count = arg->task_size; + int32_t* src0_ptr = (int32_t*)arg->src0_addr; + int32_t* src1_ptr = (int32_t*)arg->src1_addr; + int32_t* dst_ptr = (int32_t*)arg->dst_addr; -- A late prefetch is when there is a prefetch request for a particular address pending in the MSHR, and a there is a demand request for the same address. -- You want to know whether an instruction in the MSHR is a prefetch instruction, you will need to add a data structure in the MSHR to hold the prefetch bit. + uint32_t offset = blockIdx.x * count; + for (uint32_t i = 0; i < count; ++i) { + // ADD + vx_prefetch(&src0_ptr[offset + i]); + vx_prefetch(&src1_ptr[offset + i]); + + dst_ptr[offset+i] = src0_ptr[offset+i] + src1_ptr[offset+i]; + } + + vx_fence(); +} + +int main() { + kernel_arg_t* arg = (kernel_arg_t*)csr_read(VX_CSR_MSCRATCH); + return vx_spawn_threads(1, &arg->num_tasks, nullptr, (vx_kernel_func_cb)kernel_body, arg); +} +``` -### Hints -- Look at how `addr_table` is implemented to get an idea of how to add a prefetch table. -- Look at how `addr_matches` is implemented to get an idea of how to implement the late prefetch counter. +#### 3c: Building and Testing ---- +To check and see that the new instruction is working, run the following commands in your `/build/` directory -## Verifying Your Results: +**Note:** Check to see if you ran `source ./ci/toolchain_env.sh` before building! 
-You can verify your results by running: +```bash +# Make the build +make -s -``` bash -./ci/blackbox.sh --driver=rtlsim --cores=1 --app=prefetch --perf=1 +# Run debug to check to see if prefetch output is printed +./ci/blackbox.sh --driver=simx --cores=1 --app=prefetch --debug=2 ``` -\# of unused prefetched blocks = 2 \ -\# of late prefetches = 1 + +All output will be in `run.log` in the `/build/` directory, check to see if `DEBUG PREFETCH: …` is present + +## Part 2: Implementing Performance Counters + +Now that you have prefetch instructions working in SimX, we want to implement the three performance counters to measure prefetch effectiveness + +### Step 1: Adding Prefetch Flags + +#### 1a: Editing `types.h` + +We want to add the `is_prefetch` flag to `LsuReq` (in the `/sim/simx/types.h` directory) + +```cpp +struct LsuReq { + BitVector<> mask; + std::vector addrs; + bool write; + uint32_t tag; + uint32_t cid; + uint64_t uuid; + bool is_prefetch; // ADD + + LsuReq(uint32_t size) + : mask(size) + , addrs(size, 0) + , write(false) + , tag(0) + , cid(0) + , uuid(0) + , is_prefetch(false) // ADD + {} + + friend std::ostream &operator<<(std::ostream &os, const LsuReq& req) { + os << "rw=" << req.write << ", mask=" << req.mask << ", addr={"; + bool first_addr = true; + for (size_t i = 0; i < req.mask.size(); ++i) { + if (!first_addr) os << ", "; + first_addr = false; + if (req.mask.test(i)) { + os << "0x" << std::hex << req.addrs.at(i) << std::dec; + } else { + os << "-"; + } + } + os << "}, tag=0x" << std::hex << req.tag << std::dec << ", cid=" << req.cid; + if (req.is_prefetch) os << ", prefetch=1"; // ADD + os << " (#" << req.uuid << ")"; + return os; + } +}; +``` + +Similarly, we will add the same flag to `MemReq` + +```cpp +struct MemReq { + uint64_t addr; + bool write; + AddrType type; + uint32_t tag; + uint32_t cid; + uint64_t uuid; + bool is_prefetch; // ADD + + MemReq(uint64_t _addr = 0, + bool _write = false, + AddrType _type = AddrType::Global, + uint64_t 
_tag = 0,
         uint32_t _cid = 0,
         uint64_t _uuid = 0,
         bool _is_prefetch = false // ADD
  ) : addr(_addr)
    , write(_write)
    , type(_type)
    , tag(_tag)
    , cid(_cid)
    , uuid(_uuid)
    , is_prefetch(_is_prefetch) // ADD
  {}

  friend std::ostream &operator<<(std::ostream &os, const MemReq& req) {
    os << "rw=" << req.write << ", ";
    os << "addr=0x" << std::hex << req.addr << std::dec << ", type=" << req.type;
    os << ", tag=0x" << std::hex << req.tag << std::dec << ", cid=" << req.cid;
    if (req.is_prefetch) os << ", prefetch=1"; // ADD
    os << " (#" << req.uuid << ")";
    return os;
  }
};
```

#### 1b: Editing `func_unit.cpp`

We need a way to mark prefetch requests in `LsuUnit::tick()` (in the `/sim/simx/func_unit.cpp` file), so we set our newly added `is_prefetch` flag there

```cpp
void LsuUnit::tick() {
  // ...

  for (uint32_t iw = 0; iw < ISSUE_WIDTH; ++iw) {
    // ...

    bool is_fence = false;
    bool is_write = false;
    bool is_prefetch = false; // ADD

    auto trace = input.front();
    if (std::get_if<LsuType>(&trace->op_type)) {
      auto lsu_type = std::get<LsuType>(trace->op_type);
      is_fence = (lsu_type == LsuType::FENCE);
      is_write = (lsu_type == LsuType::STORE);
      is_prefetch = (lsu_type == LsuType::PREFETCH); // ADD
    }
    // ...

    if (remain_addrs_ != 0) {
      // setup memory request
      LsuReq lsu_req(NUM_LSU_LANES);
      lsu_req.write = is_write;
      lsu_req.is_prefetch = is_prefetch; // ADD

      // ...
    }
  }
}
```

#### 1c: Editing `cache_sim.cpp`

To mimic an additional bit on the tag, we also add flag bits to the `line_t` structure (in `/sim/simx/cache_sim.cpp`): one to record that the line was prefetched, and one to record whether it was used. 
These two flags will assist the counter with tracking + +```cpp +struct line_t { + uint64_t tag; + uint32_t lru_ctr; + bool valid; + bool dirty; + bool was_prefetched; // ADD + bool was_used; // ADD + + void reset() { + valid = false; + dirty = false; + was_prefetched = false; // ADD + was_used = false; // ADD + } +}; +``` + +Afterwards, we also need to update `bank_req_t` with the prefetch flag + +```cpp +struct bank_req_t { + + // ... + + bool is_prefetch; // ADD + + bank_req_t() { + this->reset(); + } + + void reset() { + type = ReqType::None; + is_prefetch = false; // ADD + } + + friend std::ostream &operator<<(std::ostream &os, const bank_req_t& req) { + os << "set=" << req.set_id << ", rw=" << req.write; + os << ", type=" << req.type; + os << ", addr_tag=0x" << std::hex << req.addr_tag; + os << ", req_tag=" << req.req_tag; + os << ", cid=" << std::dec << req.cid; + if (req.is_prefetch) os << ", prefetch=1"; // ADD + os << " (#" << req.uuid << ")"; + return os; + } +}; +``` + +Now that we have the flags set in `cache_sim.cpp`, we want to implement logic into the `processInputs()` function + +```cpp +void processInputs() { + // ... + + // second: schedule memory fill + if (!this->mem_rsp_port.empty()) { + auto& mem_rsp = mem_rsp_port.front(); + DT(3, this->name() << "-fill-rsp: " << mem_rsp); + // update MSHR + auto& entry = mshr_.replay(mem_rsp.tag); + auto& set = sets_.at(entry.bank_req.set_id); + auto& line = set.lines.at(entry.line_id); + line.valid = true; + line.tag = entry.bank_req.addr_tag; + line.was_prefetched = entry.bank_req.is_prefetch; // ADD + + // ... + } + + // third: schedule core request + if (!this->core_req_port.empty()) { + auto& core_req = core_req_port.front(); + + // ... 
+ + bank_req.type = bank_req_t::Core; + bank_req.cid = core_req.cid; + bank_req.uuid = core_req.uuid; + bank_req.set_id = params_.addr_set_id(core_req.addr); + bank_req.addr_tag = params_.addr_tag(core_req.addr); + bank_req.req_tag = core_req.tag; + bank_req.write = core_req.write; + bank_req.is_prefetch = core_req.is_prefetch; // ADD + + // ... + } +} +``` + +#### 1d: Editing `types.cpp` + +Now we want to propagate `is_prefetch` through `LsuMemAdapter` (in the `/sim/simx/types.cpp` file) so that the counter can see the flag + +```cpp +// process incoming requests + if (!ReqIn.empty()) { + auto& in_req = ReqIn.front(); + assert(in_req.mask.size() == input_size); + for (uint32_t i = 0; i < input_size; ++i) { + if (in_req.mask.test(i)) { + // build memory request + MemReq out_req; + out_req.write = in_req.write; + out_req.addr = in_req.addrs.at(i); + out_req.is_prefetch = in_req.is_prefetch; // ADD + out_req.type = get_addr_type(in_req.addrs.at(i)); + out_req.tag = in_req.tag; + out_req.cid = in_req.cid; + out_req.uuid = in_req.uuid; + // send memory request + ReqOut.at(i).push(out_req, delay_); + DT(4, this->name() << "-req" << i << ": " << out_req); + } + } + ReqIn.pop(); + } +``` + +### Step 2: Adding Counters + +We want to add all three prefetch counters into the `PerfStats` structure (in the `/sim/simx/cache_sim.h` file) + +```cpp +struct PerfStats { + uint64_t reads; + uint64_t writes; + uint64_t read_misses; + uint64_t write_misses; + uint64_t evictions; + uint64_t bank_stalls; + uint64_t mshr_stalls; + uint64_t mem_latency; + + uint64_t prefetch_requests; // ADD + uint64_t prefetch_unused; // ADD + uint64_t prefetch_late; // ADD + + PerfStats() + : reads(0) + , writes(0) + , read_misses(0) + , write_misses(0) + , evictions(0) + , bank_stalls(0) + , mshr_stalls(0) + , mem_latency(0) + , prefetch_requests(0) // ADD + , prefetch_unused(0) // ADD + , prefetch_late(0) // ADD + {} + + PerfStats& operator+=(const PerfStats& rhs) { + this->reads += rhs.reads; + 
this->writes += rhs.writes; + this->read_misses += rhs.read_misses; + this->write_misses += rhs.write_misses; + this->evictions += rhs.evictions; + this->bank_stalls += rhs.bank_stalls; + this->mshr_stalls += rhs.mshr_stalls; + this->mem_latency += rhs.mem_latency; + this->prefetch_requests += rhs.prefetch_requests; // ADD + this->prefetch_unused += rhs.prefetch_unused; // ADD + this->prefetch_late += rhs.prefetch_late; // ADD + return *this; + } +}; +``` + +To implement functionality, we add counter logic in the `processRequests()` function (in the `/sim/simx/cache_sim.cpp` file) + +```cpp +void processRequests() { + if (pipe_req_->empty()) + return; + auto bank_req = pipe_req_->front(); + + switch (bank_req.type) { + + // ... + + case bank_req_t::Core: { + + // ... + + if (hit_line_id != -1) { + + // ... + + } else { + // MISS + if (bank_req.write && !bank_req.is_prefetch) { + ++perf_stats_.write_misses; + } else if (!bank_req.is_prefetch) { + ++perf_stats_.read_misses; + } + + // Counter 1: Count unique prefetch requests that miss + if (bank_req.is_prefetch) { + ++perf_stats_.prefetch_requests; + } + + // Check if there's already a pending MSHR for this address + auto mshr_pending = mshr_.lookup(bank_req); + + // Counter 3: Late prefetch (demand arrives while prefetch in MSHR) + if (!bank_req.is_prefetch && mshr_pending) { + ++perf_stats_.prefetch_late; + } + + if (free_line_id == -1 && config_.write_back) { + // write back dirty line + auto& repl_line = set.lines.at(repl_line_id); + + // Counter 2: Unused prefetch (evicting prefetched but unused line) + if (repl_line.was_prefetched && !repl_line.was_used) { + ++perf_stats_.prefetch_unused; + } + + if (repl_line.dirty) { + MemReq mem_req; + mem_req.addr = params_.mem_addr(bank_id_, bank_req.set_id, repl_line.tag); + mem_req.write = true; + mem_req.cid = bank_req.cid; + this->mem_req_port.push(mem_req); + DT(3, this->name() << "-writeback: " << mem_req); + ++perf_stats_.evictions; + } + } + + // ... 
+ } + } break; + + // ... +} +``` + +### Step 3: Printing Results + +#### 3a: Editing `VX_types.vh` + +In order to print the results, we first need to add three new CSR definitions into `VX_types.vh` (in the `/hw/rtl/` directory) + +**Note:** Despite this assignment focusing on SimX (C++), we are editing a `*.vh` file. This file creates a `VX_types.h` file that's in the `/build/hw/` directory (after making the build) + +```verilog +`define VX_CSR_MPM_PREFETCH_REQ 12'hB20 // unique prefetch requests +`define VX_CSR_MPM_PREFETCH_REQ_H 12'hBA0 +`define VX_CSR_MPM_PREFETCH_UNUSED 12'hB21 // unused prefetches +`define VX_CSR_MPM_PREFETCH_UNUSED_H 12'hBA1 +`define VX_CSR_MPM_PREFETCH_LATE 12'hB22 // late prefetches +`define VX_CSR_MPM_PREFETCH_LATE_H 12'hBA2 +``` + +#### 3b: Editing `utils.cpp` + +To have `PERF: …` line at the end of the test, we need to add output logic within the `dcache_enable` if statement (in the `/runtime/stub/utils.cpp` file) with our newly added counters + +```cpp +// ... + +// PERF: Prefetch counters +uint64_t prefetch_requests; +CHECK_ERR(vx_mpm_query(hdevice, VX_CSR_MPM_PREFETCH_REQ, core_id, &prefetch_requests), { +return err; +}); +uint64_t prefetch_unused; +CHECK_ERR(vx_mpm_query(hdevice, VX_CSR_MPM_PREFETCH_UNUSED, core_id, &prefetch_unused), { +return err; +}); +uint64_t prefetch_late; +CHECK_ERR(vx_mpm_query(hdevice, VX_CSR_MPM_PREFETCH_LATE, core_id, &prefetch_late), { +return err; +}); +fprintf(stream, "PERF: core%d: dcache prefetch requests=%ld\n", core_id, prefetch_requests); +fprintf(stream, "PERF: core%d: dcache prefetch unused=%ld\n", core_id, prefetch_unused); +fprintf(stream, "PERF: core%d: dcache prefetch late=%ld\n", core_id, prefetch_late); + +// ... +``` + +#### 3c: Editing `emulator.cpp` + +Because our addresses are extended outside of the CSR address range, we need to expand it from 32 to 64 bits in the `user-defined MPM CSRs` section (in the `/sim/simx/emulator.cpp` directory) + +```cpp +// ... 
+ +if ((addr >= VX_CSR_MPM_BASE && addr < (VX_CSR_MPM_BASE + 64)) // CHANGE + || (addr >= VX_CSR_MPM_BASE_H && addr < (VX_CSR_MPM_BASE_H + 64))) + +// ... +``` + +#### 3d: Editing `vortex.cpp` + +Similarly, we need to edit the `mpm_query()` function to support an extended address range (in the `/runtime/simx/vortex.cpp` file) + +```cpp +// ... + +int mpm_query(uint32_t addr, uint32_t core_id, uint64_t *value) { + uint32_t offset = addr - VX_CSR_MPM_BASE; + if (offset > 63) // CHANGE 1 + return -1; + if (mpm_cache_.count(core_id) == 0) { + uint64_t mpm_mem_addr = IO_MPM_ADDR + core_id * 64 * sizeof(uint64_t); // CHANGE 2 + CHECK_ERR(this->download(mpm_cache_[core_id].data(), mpm_mem_addr, 64 * sizeof(uint64_t)), { // CHANGE 3 + return err; + }); + } + *value = mpm_cache_.at(core_id).at(offset); + return 0; +} + +// ... +``` + +## Verification and Testing: + +To test your changes, you can run the following to build and verify prefetch functionality + +```bash +# Make the build +make -s + +# Test with SimX +./ci/blackbox.sh --driver=simx --cores=1 --app=prefetch --perf=2 +``` + +The expected result is a test passed message and an output of all 3 metric counters, feel free to change `kernel.cpp` with different instruction/data sizes to observe prefetch efficiency From dbeb8c39c1bfcb55111e3ded928d83e18fcd6e7f Mon Sep 17 00:00:00 2001 From: Vincent Pham <113321638+powlectro@users.noreply.github.com> Date: Thu, 16 Oct 2025 17:12:09 -0400 Subject: [PATCH 10/11] Update assignment3.md --- Exercises/assignment3.md | 306 ++++++++++++++++++++++++++++----------- 1 file changed, 223 insertions(+), 83 deletions(-) diff --git a/Exercises/assignment3.md b/Exercises/assignment3.md index d8b0a73..de3973e 100644 --- a/Exercises/assignment3.md +++ b/Exercises/assignment3.md @@ -3,7 +3,7 @@ This assignment will be divided into two parts. The first part involves adding a new prefetch instruction as well as a corresponding flag bit to identify if it has been prefetched. 
The second involves adding three performance counters to measure the following: 1. Number of unique prefetch requests to main memory -2. Number of unused prefetched blocks +2. Number of unused prefetched blocks 3. Number of late prefetches All of these counters should be implemented in `cache_sim.h` @@ -82,7 +82,7 @@ In the `op_string()` function, we need to add a `PREFETCH` case (under the `FENC case LsuType::PREFETCH: return {"PREFETCH", ""}; // ADD ``` -#### 2c: Editing execute.cpp +#### 2c: Editing `execute.cpp` In order for the instruction to perform a prefetch, we need to add a case for `PREFETCH` in the `execute()` function (in the `/sim/simx/execute.cpp` file) @@ -145,15 +145,21 @@ void kernel_body(kernel_arg_t* __UNIFORM__ arg) { int32_t* dst_ptr = (int32_t*)arg->dst_addr; uint32_t offset = blockIdx.x * count; - for (uint32_t i = 0; i < count; ++i) { - // ADD + + const uint32_t elements_per_line = 16; // ADD: 64 bytes cache size / 4 bytes per int_32 + + for (uint32_t i = 0; i < count; ++i) { + // ADD: Only prefetch at cache line boundaries + if (i % elements_per_line == 0) { vx_prefetch(&src0_ptr[offset + i]); - vx_prefetch(&src1_ptr[offset + i]); - - dst_ptr[offset+i] = src0_ptr[offset+i] + src1_ptr[offset+i]; - } + vx_prefetch(&src1_ptr[offset + i]); + } + + dst_ptr[offset+i] = src0_ptr[offset+i] + src1_ptr[offset+i]; + } vx_fence(); + } int main() { @@ -360,9 +366,12 @@ Now that we have the flags set in `cache_sim.cpp`, we want to implement logic in ```cpp void processInputs() { - // ... - - // second: schedule memory fill + // proces inputs in prioroty order + do { + + // ... + + // second: schedule memory fill if (!this->mem_rsp_port.empty()) { auto& mem_rsp = mem_rsp_port.front(); DT(3, this->name() << "-fill-rsp: " << mem_rsp); @@ -373,27 +382,43 @@ void processInputs() { line.valid = true; line.tag = entry.bank_req.addr_tag; line.was_prefetched = entry.bank_req.is_prefetch; // ADD - - // ... 
+ line.was_used = false; // ADD + mshr_.dequeue(&bank_req); + --pending_mshr_size_; + pipe_req_->push(bank_req); + mem_rsp_port.pop(); + --pending_fill_reqs_; + break; } - - // third: schedule core request - if (!this->core_req_port.empty()) { - auto& core_req = core_req_port.front(); - - // ... - - bank_req.type = bank_req_t::Core; - bank_req.cid = core_req.cid; - bank_req.uuid = core_req.uuid; - bank_req.set_id = params_.addr_set_id(core_req.addr); - bank_req.addr_tag = params_.addr_tag(core_req.addr); - bank_req.req_tag = core_req.tag; - bank_req.write = core_req.write; - bank_req.is_prefetch = core_req.is_prefetch; // ADD - - // ... - } + + // third: schedule core request + if (!this->core_req_port.empty()) { + auto& core_req = core_req_port.front(); + // check MSHR capacity + if ((!core_req.write || config_.write_back) + && (pending_mshr_size_ >= mshr_.capacity())) { + ++perf_stats_.mshr_stalls; + break; + } + ++pending_mshr_size_; + DT(3, this->name() << "-core-req: " << core_req); + bank_req.type = bank_req_t::Core; + bank_req.cid = core_req.cid; + bank_req.uuid = core_req.uuid; + bank_req.set_id = params_.addr_set_id(core_req.addr); + bank_req.addr_tag = params_.addr_tag(core_req.addr); + bank_req.req_tag = core_req.tag; + bank_req.write = core_req.write; + bank_req.is_prefetch = core_req.is_prefetch; // ADD + pipe_req_->push(bank_req); + if (core_req.write) + ++perf_stats_.writes; + else + ++perf_stats_.reads; + core_req_port.pop(); + break; + } + } while (false); } ``` @@ -402,28 +427,75 @@ void processInputs() { Now we want to propagate `is_prefetch` through `LsuMemAdapter` (in the `/sim/simx/types.cpp` file) so that the counter can see the flag ```cpp +// ... 
+ // process incoming requests - if (!ReqIn.empty()) { - auto& in_req = ReqIn.front(); - assert(in_req.mask.size() == input_size); - for (uint32_t i = 0; i < input_size; ++i) { - if (in_req.mask.test(i)) { - // build memory request - MemReq out_req; - out_req.write = in_req.write; - out_req.addr = in_req.addrs.at(i); - out_req.is_prefetch = in_req.is_prefetch; // ADD - out_req.type = get_addr_type(in_req.addrs.at(i)); - out_req.tag = in_req.tag; - out_req.cid = in_req.cid; - out_req.uuid = in_req.uuid; - // send memory request - ReqOut.at(i).push(out_req, delay_); - DT(4, this->name() << "-req" << i << ": " << out_req); - } +if (!ReqIn.empty()) { + auto& in_req = ReqIn.front(); + assert(in_req.mask.size() == input_size); + for (uint32_t i = 0; i < input_size; ++i) { + if (in_req.mask.test(i)) { + // build memory request + MemReq out_req; + out_req.write = in_req.write; + out_req.addr = in_req.addrs.at(i); + out_req.is_prefetch = in_req.is_prefetch; // ADD + out_req.type = get_addr_type(in_req.addrs.at(i)); + out_req.tag = in_req.tag; + out_req.cid = in_req.cid; + out_req.uuid = in_req.uuid; + // send memory request + ReqOut.at(i).push(out_req, delay_); + DT(4, this->name() << "-req" << i << ": " << out_req); } - ReqIn.pop(); } + ReqIn.pop(); +} + +// ... +``` + +Similarly, we also want to do the same for the `LocalMemSwitch::tick()` function + +```cpp +// ... + +// process incoming requests +if (!ReqIn.empty()) { + auto& in_req = ReqIn.front(); + + LsuReq out_dc_req(in_req.mask.size()); + out_dc_req.write = in_req.write; + out_dc_req.tag = in_req.tag; + out_dc_req.cid = in_req.cid; + out_dc_req.uuid = in_req.uuid; + + out_dc_req.is_prefetch = in_req.is_prefetch; // ADD + + LsuReq out_lmem_req(out_dc_req); + +// ... +``` + +#### 1d: Editing `mem_coalescer.cpp` + +In the `MemCoalescer::tick()` function, we also need to propagate through the memory coalescer to ensure the flag is set throughout our structures + +```cpp +// ... 
+ +// build memory request +LsuReq out_req{output_size_}; +out_req.mask = out_mask; +out_req.tag = tag; +out_req.write = in_req.write; +out_req.addrs = out_addrs; +out_req.cid = in_req.cid; +out_req.uuid = in_req.uuid; + +out_req.is_prefetch = in_req.is_prefetch; // ADD + +// ... ``` ### Step 2: Adding Counters @@ -480,39 +552,65 @@ To implement functionality, we add counter logic in the `processRequests()` func ```cpp void processRequests() { - if (pipe_req_->empty()) - return; - auto bank_req = pipe_req_->front(); - - switch (bank_req.type) { - - // ... + + //... case bank_req_t::Core: { - - // ... - + int32_t free_line_id = -1; + int32_t repl_line_id = 0; + auto& set = sets_.at(bank_req.set_id); + + // tag lookup + int hit_line_id = set.tag_lookup(bank_req.addr_tag, &free_line_id, &repl_line_id); + if (hit_line_id != -1) { - - // ... - + // Hit handling + auto& hit_line = set.lines.at(hit_line_id); + + // ADD: Mark as used if it was prefetched + if (hit_line.was_prefetched && bank_req.is_prefetch) { + hit_line.was_used = true; + } + + if (bank_req.write) { + // handle write hit + if (!config_.write_back) { + MemReq mem_req; + mem_req.addr = params_.mem_addr(bank_id_, bank_req.set_id, bank_req.addr_tag); + mem_req.write = true; + mem_req.cid = bank_req.cid; + mem_req.uuid = bank_req.uuid; + this->mem_req_port.push(mem_req); + DT(3, this->name() << "-writethrough: " << mem_req); + } else { + hit_line.dirty = true; + } + } + + // CHANGE: send core response (not for prefetch) + if (!bank_req.is_prefetch && (!bank_req.write || config_.write_reponse)) { + MemRsp core_rsp{bank_req.req_tag, bank_req.cid, bank_req.uuid}; + this->core_rsp_port.push(core_rsp); + DT(3, this->name() << "-core-rsp: " << core_rsp); + } + --pending_mshr_size_; } else { - // MISS + // Miss handling if (bank_req.write && !bank_req.is_prefetch) { ++perf_stats_.write_misses; } else if (!bank_req.is_prefetch) { ++perf_stats_.read_misses; } - - // Counter 1: Count unique prefetch requests that miss + 
+ // ADD: Counter 1 - Count unique prefetch requests that miss if (bank_req.is_prefetch) { ++perf_stats_.prefetch_requests; } - // Check if there's already a pending MSHR for this address + // ADD: Check if there's already a pending MSHR for this address auto mshr_pending = mshr_.lookup(bank_req); - - // Counter 3: Late prefetch (demand arrives while prefetch in MSHR) + + // ADD: Counter 3 - Late prefetch (demand arrives while prefetch in MSHR) if (!bank_req.is_prefetch && mshr_pending) { ++perf_stats_.prefetch_late; } @@ -520,12 +618,12 @@ void processRequests() { if (free_line_id == -1 && config_.write_back) { // write back dirty line auto& repl_line = set.lines.at(repl_line_id); - - // Counter 2: Unused prefetch (evicting prefetched but unused line) + + // ADD: Counter 2 - Unused prefetch (evicting prefetched but unused line) if (repl_line.was_prefetched && !repl_line.was_used) { ++perf_stats_.prefetch_unused; } - + if (repl_line.dirty) { MemReq mem_req; mem_req.addr = params_.mem_addr(bank_id_, bank_req.set_id, repl_line.tag); @@ -537,7 +635,47 @@ void processRequests() { } } - // ... + if (bank_req.write && !config_.write_back) { + // forward write request to memory + + { + MemReq mem_req; + mem_req.addr = params_.mem_addr(bank_id_, bank_req.set_id, bank_req.addr_tag); + mem_req.write = true; + mem_req.cid = bank_req.cid; + mem_req.uuid = bank_req.uuid; + this->mem_req_port.push(mem_req); + DT(3, this->name() << "-writethrough: " << mem_req); + } + // CHANGE: send core response + if (config_.write_reponse && !bank_req.is_prefetch) { + MemRsp core_rsp{bank_req.req_tag, bank_req.cid, bank_req.uuid}; + this->core_rsp_port.push(core_rsp); + DT(3, this->name() << "-core-rsp: " << core_rsp); + } + --pending_mshr_size_; + } else { + // MSHR lookup + auto mshr_pending = mshr_.lookup(bank_req); + + // allocate MSHR + auto mshr_id = mshr_.enqueue(bank_req, (free_line_id != -1) ? 
free_line_id : repl_line_id); + DT(3, this->name() << "-mshr-enqueue: " << bank_req); + + // send fill request + if (!mshr_pending) { + MemReq mem_req; + mem_req.addr = params_.mem_addr(bank_id_, bank_req.set_id, bank_req.addr_tag); + mem_req.write = false; + mem_req.tag = mshr_id; + mem_req.cid = bank_req.cid; + mem_req.uuid = bank_req.uuid; + mem_req.is_prefetch = bank_req.is_prefetch; // ADD + this->mem_req_port.push(mem_req); + DT(3, this->name() << "-fill-req: " << mem_req); + ++pending_fill_reqs_; + } + } } } break; @@ -554,14 +692,16 @@ In order to print the results, we first need to add three new CSR definitions in **Note:** Despite this assignment focusing on SimX (C++), we are editing a `*.vh` file. This file creates a `VX_types.h` file that's in the `/build/hw/` directory (after making the build) ```verilog -`define VX_CSR_MPM_PREFETCH_REQ 12'hB20 // unique prefetch requests -`define VX_CSR_MPM_PREFETCH_REQ_H 12'hBA0 -`define VX_CSR_MPM_PREFETCH_UNUSED 12'hB21 // unused prefetches -`define VX_CSR_MPM_PREFETCH_UNUSED_H 12'hBA1 -`define VX_CSR_MPM_PREFETCH_LATE 12'hB22 // late prefetches -`define VX_CSR_MPM_PREFETCH_LATE_H 12'hBA2 +`define VX_CSR_MPM_PREFETCH_REQ 12'hB15 // unique prefetch requests +`define VX_CSR_MPM_PREFETCH_REQ_H 12'hB95 +`define VX_CSR_MPM_PREFETCH_UNUSED 12'hB16 // unused prefetches +`define VX_CSR_MPM_PREFETCH_UNUSED_H 12'hB96 +`define VX_CSR_MPM_PREFETCH_LATE 12'hB17 // late prefetches +`define VX_CSR_MPM_PREFETCH_LATE_H 12'hB97 ``` +**IMPORTANT:** Because class 2 counters are full, we cannot add these counters within that class, adding these counters into that class will result in errors! 
+ #### 3b: Editing `utils.cpp` To have `PERF: …` line at the end of the test, we need to add output logic within the `dcache_enable` if statement (in the `/runtime/stub/utils.cpp` file) with our newly added counters @@ -582,9 +722,9 @@ uint64_t prefetch_late; CHECK_ERR(vx_mpm_query(hdevice, VX_CSR_MPM_PREFETCH_LATE, core_id, &prefetch_late), { return err; }); -fprintf(stream, "PERF: core%d: dcache prefetch requests=%ld\n", core_id, prefetch_requests); -fprintf(stream, "PERF: core%d: dcache prefetch unused=%ld\n", core_id, prefetch_unused); -fprintf(stream, "PERF: core%d: dcache prefetch late=%ld\n", core_id, prefetch_late); +fprintf(stream, "PERF: core%d: dcache prefetch requests=%lu\n", core_id, prefetch_requests); +fprintf(stream, "PERF: core%d: dcache prefetch unused=%lu\n", core_id, prefetch_unused); +fprintf(stream, "PERF: core%d: dcache prefetch late=%lu\n", core_id, prefetch_late); // ... ``` @@ -635,7 +775,7 @@ To test your changes, you can run the following to build and verify prefetch fun make -s # Test with SimX -./ci/blackbox.sh --driver=simx --cores=1 --app=prefetch --perf=2 +./ci/blackbox.sh --driver=simx --cores=1 --app=prefetch --perf=1 ``` The expected result is a test passed message and an output of all 3 metric counters, feel free to change `kernel.cpp` with different instruction/data sizes to observe prefetch efficiency From 11e1f578dd90ba090a45e530d8bbfb8049cfc33c Mon Sep 17 00:00:00 2001 From: Vincent Pham <113321638+powlectro@users.noreply.github.com> Date: Thu, 16 Oct 2025 20:34:15 -0400 Subject: [PATCH 11/11] Update assignment3.md --- Exercises/assignment3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Exercises/assignment3.md b/Exercises/assignment3.md index de3973e..b70c044 100644 --- a/Exercises/assignment3.md +++ b/Exercises/assignment3.md @@ -568,7 +568,7 @@ void processRequests() { auto& hit_line = set.lines.at(hit_line_id); // ADD: Mark as used if it was prefetched - if (hit_line.was_prefetched && 
bank_req.is_prefetch) { + if (hit_line.was_prefetched && !bank_req.is_prefetch) { hit_line.was_used = true; }