Application Guide: Numba¶
Numba is a just-in-time (JIT) compiler for Python that works best on code using NumPy arrays, NumPy functions, and loops. Numba is used through a collection of decorators that, when applied to a function, instruct Numba to compile it. Numba can also be used to write GPU-accelerated code; examples of both are shown below.
numba_example.py
import numpy as np
from numba import njit, prange
# njit decorator with parallel execution enabled
@njit(parallel=True)
def add_kernel(a, b, c):
    # prange splits the loop iterations across threads when parallel=True
    for i in prange(a.size):
        c[i] = a[i] + b[i]
# Array size
N = 1024
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)
c = np.zeros(N, dtype=np.float32)
# Run JIT-compiled function
add_kernel(a, b, c)
# Verification
print("Success:", np.allclose(c, a + b))
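For reference, the decorated function compiles the same elementwise addition that plain NumPy performs; a minimal un-jitted sketch (no Numba required) of what add_kernel computes:

```python
import numpy as np

N = 1024
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)
c = np.zeros(N, dtype=np.float32)

# The same elementwise addition as the jitted kernel,
# written as an ordinary Python loop
for i in range(a.size):
    c[i] = a[i] + b[i]

print("Matches vectorised NumPy:", np.allclose(c, a + b))
```

Without the @njit decorator this loop runs in the interpreter, which is far slower for large arrays; the decorated version compiles it to native code and, with parallel=True, distributes iterations across CPU threads.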
Numba's built-in CUDA target
The CUDA target is now deprecated in Numba itself and will not work with the latest NVIDIA drivers; it has been moved to a separate package, numba-cuda.
This module is not currently available on BlueBEAR and must be installed by the user. The code below works with numba-cuda, and also with numba itself on older drivers.
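Before running the CUDA example it is worth checking that Numba can see a usable GPU. A minimal sketch, assuming numba (or numba-cuda) is importable; the ImportError fallback is ours so the check also runs where Numba is not installed:

```python
# Check whether Numba can detect a CUDA-capable GPU
try:
    from numba import cuda
    has_gpu = cuda.is_available()
except ImportError:
    has_gpu = False

print("CUDA available:", has_gpu)
```

If this prints False on a GPU node, the driver and package versions are the first things to check.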
numba_cuda_example.py
import numpy as np
from numba import cuda
@cuda.jit
def add_kernel(a, b, c):
    # Absolute thread index within the 1D grid
    i = cuda.grid(1)
    if i < a.size:
        c[i] = a[i] + b[i]
# Array size
N = 1024
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)
c = np.zeros(N, dtype=np.float32)
# Data transfer to GPU
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.device_array_like(c)
# Configure the thread layout
threads_per_block = 128
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block
# Launch Kernel
add_kernel[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
cuda.synchronize()
# Copying results back to CPU memory
c = d_c.copy_to_host()
# Verification
print("Success:", np.allclose(c, a + b))
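The blocks_per_grid calculation above is a ceiling division: it requests just enough blocks of threads_per_block threads to cover all N elements, which is why the kernel guards with i < a.size. A small standalone sketch of the arithmetic (blocks_needed is our illustrative name):

```python
def blocks_needed(n, threads_per_block):
    # Ceiling division: the smallest block count covering n elements
    return (n + threads_per_block - 1) // threads_per_block

print(blocks_needed(1024, 128))  # 8 blocks, exactly full
print(blocks_needed(1000, 128))  # 8 blocks, last one partly idle
print(blocks_needed(1025, 128))  # 9 blocks for the one extra element
```

Because the grid usually overshoots the array size, every CUDA kernel indexed this way needs the bounds check; the surplus threads in the final block simply do nothing.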