.. _index:
.. include:: ./../../ext_links.txt

-Kernel Programming Basics
-=========================
-
-`Data Parallel Extensions for Python*`_ introduce the concept of an *offload kernel*: a part
-of a Python program that is submitted for execution to a device queue.
-
-.. image:: ./../../../asset/images/kernel-queue-device.png
-    :scale: 50%
-    :align: center
-    :alt: Offload Kernel
-
-There are multiple ways to write offload kernels. CUDA*, OpenCL*, and SYCL* offer a similar
-programming model known as *data parallel kernel programming*. In this model you express the
-work in terms of *work items*: you split the data into small pieces, and each piece becomes a
-unit of work, or a *work item*. The total number of work items is called the *global size*.
-You can also group work items into bigger chunks called *work groups*. The number of work
-items in a work group is called the *local size*.
-
-.. image:: ./../../../asset/images/kernel_prog_model.png
-    :scale: 50%
-    :align: center
-    :alt: Offload Kernel
-
-In this example there are 48 *work items* (8 in dimension 0, and 6 in dimension 1), that is,
-the *global size* is 48. Work items are grouped into *work groups* with a *local size* of 8
-(4 in dimension 0, and 2 in dimension 1), giving a total of 48/8 = 6 work groups.
-
-In the *data parallel kernel programming* model you write a function that processes a given
-work item. Such a function is called the *data parallel kernel*.
-
-**Data Parallel Extension for Numba** offers a way to write data parallel kernels directly
-in Python using ``numba_dpex.kernel``. It bears similarities with ``numba.cuda`` and
-``numba.roc``, but unlike those vendor-specific programming models ``numba_dpex`` is built on
-top of `SYCL*`_, which is hardware agnostic, meaning that with the ``numba_dpex.kernel``
-programming model you can write portable code that targets hardware from different vendors.
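-
-As an illustration, a minimal sketch of such a kernel is shown below. It assumes the legacy
-``numba_dpex`` API (``get_global_id`` and square-bracket launch syntax); the exact launch
-form varies across ``numba-dpex`` releases, and the ``dpnp`` array setup is only one way to
-allocate device data.
-
-.. code-block:: python
-
-    import dpnp
-    import numba_dpex as dpex
-
-    @dpex.kernel
-    def vecadd(a, b, c):
-        # Each work item processes the one element selected by its global id.
-        i = dpex.get_global_id(0)
-        c[i] = a[i] + b[i]
-
-    a = dpnp.ones(1024)
-    b = dpnp.ones(1024)
-    c = dpnp.zeros(1024)
-    # Launch with a global size of 1024 work items; an explicit local size
-    # (work-group size) can also be specified via an nd-range launch.
-    vecadd[dpex.Range(1024)](a, b, c)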
-
-.. note::
-    The current version of ``numba-dpex`` supports Intel SYCL devices only.
-
-.. toctree::
-    :caption: This document will cover the following chapters:
-    :maxdepth: 2
-
-    writing_kernels
-    synchronization
-    device-functions
-    atomic-operations
-    memory_allocation_address_space
-    reduction
-    ufunc
-    supported-python-features
+Kernel Programming
+==================
+
+The tutorial covers the most important features of the KAPI kernel programming
+API and introduces the concepts needed to express data-parallel kernels in
+numba-dpex.
+
+
+Preliminary concepts
+--------------------
+
+Data parallelism
++++++++++++++++++
+
+Single Program Multiple Data
++++++++++++++++++++++++++++++
+
+Range vs. Nd-Range Kernels
+++++++++++++++++++++++++++
+
+Work items and Work groups
+++++++++++++++++++++++++++
+
+Basic concepts
+--------------
+
+
+Writing a *range* kernel
++++++++++++++++++++++++++
+
+A *range* kernel is the simplest form of parallelism that can be expressed in
+KAPI: it represents a data-parallel execution of the same function by a set of
+work items. In KAPI, an instance of the
+:py:class:`numba_dpex.kernel_api.Range` class represents the set of work items
+and each work item in the ``Range`` is represented by an instance of the
+:py:class:`numba_dpex.kernel_api.Item` class. As such, these two classes are
+essential to writing a range kernel in KAPI.
+
+.. literalinclude:: ./../../../../numba_dpex/examples/kernel/vector_sum.py
+    :language: python
+    :lines: 8-9, 11-15
+    :caption: **EXAMPLE:** A KAPI range kernel
+    :name: ex_kernel_declaration_vector_sum
+
+:ref:`ex_kernel_declaration_vector_sum` shows an example of a range kernel.
+Every range kernel requires its first argument to be an ``Item`` instance and
+needs to be launched via :py:func:`numba_dpex.experimental.launcher.call_kernel`
+by passing in an instance of a ``Range`` object.
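+
+A minimal sketch of defining and launching such a kernel is shown below. The
+``dpnp`` array setup and the ``experimental`` import paths are assumptions for
+illustration and are not taken from the included example.
+
+.. code-block:: python
+
+    import dpnp
+    from numba_dpex import experimental as dpex_exp
+    from numba_dpex import kernel_api as kapi
+
+    @dpex_exp.kernel
+    def vecadd(item: kapi.Item, a, b, c):
+        # Each work item adds the single element selected by its index.
+        i = item.get_id(0)
+        c[i] = a[i] + b[i]
+
+    a = dpnp.ones(1024)
+    b = dpnp.ones(1024)
+    c = dpnp.zeros(1024)
+    # One work item per array element.
+    dpex_exp.call_kernel(vecadd, kapi.Range(1024), a, b, c)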
+
+Do note that a ``Range`` object only controls the creation of work items; the
+distribution of work and data over a ``Range`` still needs to be defined by the
+user-written function. In the example, each work item accesses a single element
+of each of the three arrays and performs a single addition operation. It is
+possible to write the kernel differently so that each work item accesses
+multiple data elements or conditionally performs different amounts of work. The
+data access patterns in a work item can have performance implications, and
+programmers should refer to more specialized material such as the
+`oneAPI GPU optimization guide`_ to learn more.
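+
+For instance, a hypothetical variant of the kernel sketched above could make
+each work item handle two consecutive elements and be launched over half as
+many work items (the names reuse the earlier sketch and are illustrative only):
+
+.. code-block:: python
+
+    @dpex_exp.kernel
+    def vecadd_two_per_item(item: kapi.Item, a, b, c):
+        # Each work item covers two consecutive elements, so the kernel
+        # is launched with half the number of work items.
+        i = 2 * item.get_id(0)
+        c[i] = a[i] + b[i]
+        c[i + 1] = a[i + 1] + b[i + 1]
+
+    dpex_exp.call_kernel(vecadd_two_per_item, kapi.Range(512), a, b, c)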
+
+A range kernel is meant to express a basic `parallel-for` computation that is
+ideally suited for embarrassingly parallel workloads such as elementwise
+computations over ndarrays. The API for expressing a range kernel does not
+support advanced features such as synchronization of work items or fine-grained
+control over memory allocation on a device.
+
+Writing an *nd-range* kernel
+++++++++++++++++++++++++++++
+
+The ``device_func`` decorator
++++++++++++++++++++++++++++++
+
+Supported mathematical operations
++++++++++++++++++++++++++++++++++
+
+Supported Python operators
+++++++++++++++++++++++++++
+
+Supported kernel arguments
+++++++++++++++++++++++++++
+
+Launching a kernel
+++++++++++++++++++
+
+Advanced topics
+---------------
+
+Local memory allocation
++++++++++++++++++++++++
+
+Private memory allocation
++++++++++++++++++++++++++
+
+Group barrier synchronization
++++++++++++++++++++++++++++++
+
+Atomic operations
++++++++++++++++++
+
+Async kernel execution
+++++++++++++++++++++++
+
+Specializing a kernel or a device_func
++++++++++++++++++++++++++++++++++++++++