|
4 | 4 | "cell_type": "markdown", |
5 | 5 | "metadata": {}, |
6 | 6 | "source": [ |
7 | | - "# SC24 Tutorial: Efficient Distributed GPU Programming for Exascale\n", |
| 7 | + "# SC25 Tutorial: Efficient Distributed GPU Programming for Exascale\n", |
8 | 8 | "\n", |
9 | | - "- Sunday, November 17, 2024 8:30 AM to 5:30 PM\n", |
10 | | - "- Location: B211, Atlanta Convention Center, Georgia, USA\n", |
11 | | - "- Program Link:\n", |
12 | | - " https://sc24.conference-program.com/presentation/?id=tut123&sess=sess412\n", |
13 | | - "\n", |
14 | | - "## Hands-On 3: Multi-GPU Parallelization with CUDA-aware MPI\n", |
| 9 | + "- Sunday, November 16, 2025 8:30 AM to 5:00 PM\n", |
| 10 | + "- Location: Room 127, St. Louis Convention Center, St. Louis, USA\n", |
| 11 | + "- Program Link:\n", |
| 12 | + " https://sc25.conference-program.com/presentation/?id=tut113&sess=sess252\n", |
| 13 | + "\n## Hands-On 3: Multi-GPU Parallelization with CUDA-aware MPI\n", |
15 | 14 | "\n", |
16 | 15 | "### Task: Parallelize Jacobi Solver for Multiple GPUs using CUDA-aware MPI\n", |
17 | 16 | "\n", |
|
27 | 26 | "and `POP` macros). Once you are familiar with the code, please work on\n", |
28 | 27 | "the `TODOs` in `jacobi.cu`:\n", |
29 | 28 | "\n", |
30 | | - "- Get the available GPU devices and use it and the local rank to set\n", |
31 | | - " the active GPU for each process\n", |
32 | | - "- Compute the top and bottom neigbhors. We are using\n", |
33 | | - " reflecting/periodic boundaries on top and bottom, so rank0’s Top\n", |
34 | | - " neighbor is (size-1) and rank(size-1) bottom neighbor is rank 0\n", |
35 | | - "- Use MPI_Sendrecv to exchange data between the neighbors\n", |
36 | | - " - use CUDA-aware MPI, so the send - and the receive buffers are\n", |
37 | | - " located in GPU-memory\n", |
38 | | - " - The first newly calculated row (‘iy_start’) is sent to the top\n", |
39 | | - " neigbor and the bottom boundary row (`iy_end`) is received from\n", |
40 | | - " the bottom process.\n", |
41 | | - " - The last calculated row (`iy_end-1`) is send to the bottom\n", |
42 | | - " process and the top boundary (`0`) is received from the top\n", |
43 | | - " - Don’t forget to synchronize the computation on the GPU before\n", |
44 | | - " starting the data transfer\n", |
45 | | - " - use the self-defined MPI_REAL_TYPE. This allows an easy switch\n", |
46 | | - " between single- and double precision\n", |
| 29 | + "- Get the number of available GPU devices and use it together with the\n", |
| 30 | + "  local rank to set the active GPU for each process\n", |
| 31 | + "- Compute the top and bottom neighbors. We are using reflecting/periodic\n", |
| 32 | + "  boundaries on top and bottom, so rank 0's top neighbor is rank (size-1)\n", |
| 33 | + "  and rank (size-1)'s bottom neighbor is rank 0\n", |
| 34 | + "- Use MPI_Sendrecv to exchange data between the neighbors\n", |
| 35 | + "  - use CUDA-aware MPI, so both the send and the receive buffers are\n", |
| 36 | + "    located in GPU memory\n", |
| 37 | + "  - The first newly calculated row (`iy_start`) is sent to the top\n", |
| 38 | + "    neighbor and the bottom boundary row (`iy_end`) is received from the\n", |
| 39 | + "    bottom process.\n", |
| 40 | + "  - The last calculated row (`iy_end-1`) is sent to the bottom process\n", |
| 41 | + "    and the top boundary row (`0`) is received from the top process\n", |
| 42 | + " - Don’t forget to synchronize the computation on the GPU before\n", |
| 43 | + " starting the data transfer\n", |
| 44 | + "  - use the self-defined `MPI_REAL_TYPE`; this allows an easy switch\n", |
| 45 | + "    between single and double precision (see the sketch below)\n", |
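As a rough orientation, here is a minimal sketch of how these TODOs could look, not the tutorial's reference solution. It assumes the usual skeleton names (`a_new` as a device pointer, `nx`, `iy_start`, `iy_end`, `rank`, `size`, `local_rank`), which may differ in your copy of `jacobi.cu`; the `real`/`MPI_REAL_TYPE` pair is redefined here only so the fragment stands on its own.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* The skeleton already defines these; repeated here as an assumption so the
 * sketch is self-contained. */
typedef double real;
#define MPI_REAL_TYPE MPI_DOUBLE

/* Map each process to a GPU via its per-node (local) rank. */
static void select_device(int local_rank)
{
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);
}

/* Periodic neighbors and halo exchange with CUDA-aware MPI:
 * `a_new` points to GPU memory and is passed directly to MPI. */
static void exchange_halos(real* a_new, int nx, int iy_start, int iy_end,
                           int rank, int size)
{
    const int top    = rank > 0 ? rank - 1 : size - 1;  /* rank 0 wraps around  */
    const int bottom = rank < size - 1 ? rank + 1 : 0;  /* last rank wraps back */

    /* Make sure the halo rows are fully computed before sending them. */
    cudaDeviceSynchronize();

    /* First computed row (iy_start) goes up; bottom halo row (iy_end) comes
     * from the bottom neighbor. */
    MPI_Sendrecv(a_new + iy_start * nx, nx, MPI_REAL_TYPE, top, 0,
                 a_new + iy_end * nx,   nx, MPI_REAL_TYPE, bottom, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Last computed row (iy_end-1) goes down; top halo row (0) comes from
     * the top neighbor. */
    MPI_Sendrecv(a_new + (iy_end - 1) * nx, nx, MPI_REAL_TYPE, bottom, 0,
                 a_new,                      nx, MPI_REAL_TYPE, top, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

Using a single `MPI_Sendrecv` per direction (rather than separate blocking `MPI_Send`/`MPI_Recv` calls) avoids the deadlock that can otherwise occur when all ranks send to their periodic neighbors at the same time.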
47 | 46 | "\n", |
48 | 47 | "Compile with\n", |
49 | 48 | "\n", |
|
61 | 60 | "\n", |
62 | 61 | "### Description\n", |
63 | 62 | "\n", |
64 | | - "- The work distribution of the first task is not ideal, because it can\n", |
65 | | - " lead to the process with the last rank having to calculate\n", |
66 | | - " significantly more than all the others. Therefore, the load\n", |
67 | | - " distribution is to be optimized in this task.\n", |
68 | | - "- Compute the `chunk_size` that each rank gets either (ny - 2) / size\n", |
69 | | - " or (ny - 2) / size + 1 rows.\n", |
70 | | - "- Compute how many processes get (ny - 2) / size resp (ny - 2) /\n", |
71 | | - " size + 1 rows\n", |
72 | | - "- Adapt the computation of (`iy_start_global`)" |
| 63 | + "- The work distribution of the first task is not ideal: the process with\n", |
| 64 | + "  the last rank can end up having to calculate significantly more rows\n", |
| 65 | + "  than all the others. Therefore, the load distribution is to be\n", |
| 66 | + "  optimized in this task.\n", |
| 67 | + "- Compute `chunk_size` so that each rank gets either (ny - 2) / size or\n", |
| 68 | + "  (ny - 2) / size + 1 rows.\n", |
| 69 | + "- Compute how many processes get (ny - 2) / size rows and how many get\n", |
| 70 | + "  (ny - 2) / size + 1 rows\n", |
| 71 | + "- Adapt the computation of `iy_start_global` (see the sketch below)" |
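A possible way to compute the balanced split, again only as a sketch and not the reference solution: `ny`, `rank`, and `size` are taken from the skeleton, while the helper names `chunk_size_low`, `chunk_size_high`, and `num_ranks_low` are made up for illustration.

```c
/* Balanced distribution of the (ny - 2) interior rows over `size` ranks:
 * the first num_ranks_low ranks get chunk_size_low rows, the remaining
 * ranks get one row more. */
static void decompose_rows(int ny, int rank, int size,
                           int* chunk_size, int* iy_start_global)
{
    const int chunk_size_low  = (ny - 2) / size;
    const int chunk_size_high = chunk_size_low + 1;
    /* Number of ranks that receive the smaller chunk. */
    const int num_ranks_low = size * chunk_size_low + size - (ny - 2);

    *chunk_size = (rank < num_ranks_low) ? chunk_size_low : chunk_size_high;

    /* Global index of this rank's first interior row. */
    if (rank < num_ranks_low)
        *iy_start_global = rank * chunk_size_low + 1;
    else
        *iy_start_global = num_ranks_low * chunk_size_low
                         + (rank - num_ranks_low) * chunk_size_high + 1;
}
```

With this choice, `num_ranks_low * chunk_size_low + (size - num_ranks_low) * chunk_size_high` adds up to exactly `ny - 2`, so every interior row is assigned to exactly one rank and the per-rank load differs by at most one row.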
73 | 72 | ], |
74 | | - "id": "e42b5ab3-f626-4da5-b0c9-52a444cefde8" |
| 73 | + "id": "8b73eab2-1e9f-42a8-b366-29ff21d469ea" |
75 | 74 | } |
76 | 75 | ], |
77 | 76 | "nbformat": 4, |
|