Application Guide: Numba¶
Numba is a just-in-time (JIT) compiler for Python that works best on code using NumPy arrays, NumPy functions, and loops. Numba is used through a collection of decorators that, when applied to a function, instruct Numba to compile it. Numba can also be used to write GPU-accelerated code; examples of both are shown below.
numba_example.py
import numpy as np
from numba import njit, prange
# njit decorator with parallel execution enabled
@njit(parallel=True)
def add_kernel(a, b, c):
    # prange splits the loop iterations across threads when parallel=True
    for i in prange(a.size):
        c[i] = a[i] + b[i]
# Array size
N = 1024
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)
c = np.zeros(N, dtype=np.float32)
# Run JIT-compiled function
add_kernel(a, b, c)
# Verification
print("Success:", np.allclose(c, a + b))
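For reference, the decorated function compiles the same elementwise addition that plain NumPy performs; a minimal un-jitted sketch (no Numba required) of what add_kernel computes:

```python
import numpy as np

N = 1024
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)
c = np.zeros(N, dtype=np.float32)

# The same elementwise addition as the jitted kernel,
# written as an ordinary Python loop
for i in range(a.size):
    c[i] = a[i] + b[i]

print("Matches vectorised NumPy:", np.allclose(c, a + b))
```

Without the @njit decorator this loop runs in the interpreter, which is far slower for large arrays; the decorated version compiles it to native code and, with parallel=True, distributes iterations across CPU threads.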
Numba's built-in CUDA target
The CUDA target is now deprecated in Numba itself and will not work with the latest NVIDIA drivers; it has been moved to a separate package, numba-cuda.
This module is not currently available on BlueBEAR and must be installed by the user. The code below works with numba-cuda, and also with numba itself on older drivers.
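Before running the CUDA example it is worth checking that Numba can see a usable GPU. A minimal sketch, assuming numba (or numba-cuda) is importable; the ImportError fallback is ours so the check also runs where Numba is not installed:

```python
# Check whether Numba can detect a CUDA-capable GPU
try:
    from numba import cuda
    has_gpu = cuda.is_available()
except ImportError:
    has_gpu = False

print("CUDA available:", has_gpu)
```

If this prints False on a GPU node, the driver and package versions are the first things to check.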
numba_cuda_example.py
import numpy as np
from numba import cuda
@cuda.jit
def add_kernel(a, b, c):
    # Absolute thread index within the 1D grid
    i = cuda.grid(1)
    if i < a.size:
        c[i] = a[i] + b[i]
# Array size
N = 1024
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)
c = np.zeros(N, dtype=np.float32)
# Data transfer to GPU
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.device_array_like(c)
# Configure the thread layout
threads_per_block = 128
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block
# Launch Kernel
add_kernel[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
cuda.synchronize()
# Copying results back to CPU memory
c = d_c.copy_to_host()
# Verification
print("Success:", np.allclose(c, a + b))
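The blocks_per_grid calculation above is a ceiling division: it requests just enough blocks of threads_per_block threads to cover all N elements, which is why the kernel guards with i < a.size. A small standalone sketch of the arithmetic (blocks_needed is our illustrative name):

```python
def blocks_needed(n, threads_per_block):
    # Ceiling division: the smallest block count covering n elements
    return (n + threads_per_block - 1) // threads_per_block

print(blocks_needed(1024, 128))  # 8 blocks, exactly full
print(blocks_needed(1000, 128))  # 8 blocks, last one partly idle
print(blocks_needed(1025, 128))  # 9 blocks for the one extra element
```

Because the grid usually overshoots the array size, every CUDA kernel indexed this way needs the bounds check; the surplus threads in the final block simply do nothing.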