## Benchmark: Phase correlation with Numba/cuFFT versus NumExpr3/Intel MKL-FFT

Continuum open-sourced their Python CUDA bindings this summer, which were previously part of their paid Anaconda Accelerate. Notability it provides bindings for linear algebra (BLAS), Fourier transforms, sparse matrices, and random number generation plus sorting algorithms. This is yet another nice contribution to the open-source scientific programming community by Continuum that I’m happy to get to play with. Here I’d like to compare `numba` and its companion `pyculib` for doing some computation on a graphics processing unit compared to a conventional CPU approach. Actually that’s not quite true, I’m comparing CUDA’s FFT to the Intel FFT found in Intel’s Python distribution, another thing Continuum has a hand in.

I always like to test with real-world applications that are within my field of interest, because I usually learn something new about the problems I work with. Here I’m going to compute the phase correlation for a stack of 89, 2048 x 2048 images against a template (in this case, the first image in the stack). The basic goal here is to correct for image drift amongst all the frames in a movie. Modern superphones do this with their cameras more-or-less all the time. When you take a picture, it actually records a movie, and then it uses correlations to remove the jitter from your unstable hands and shows you the summed result.

## Configuration

Here is some information about my test computer:

Intel iCore-7 7820X Skylake-X CPU

• Physical cores: 8
• Virtual cores: 16
• Clock Frequency: 3.6 GHz

This processor has access to the AVX512 instruction set, an important factor from the point of view of trying to maximize parallel performance, regardless of the number of cores. For example with the Skylake-X architecture here, each core has not one but two multiply units, and each multiple unit can do 16 single-precision floating-point operations per cycle, so in total you get 32 `float-32` flops/cycle (for multiply and fused-multiply add). Thus the parallelism from the number of cores (8) is dwarfed by the parallelism from the SIMD vectorized instructions. If you are truly concerned about performance for CPU computing you need to use a program that is both multi-threaded (or multi-processed) and SIMD aware. And to compare I have a Nvidia graphics card:

Nvidia 1080 GTX

• CUDA 9.0
• VRAM: 8.0 GB
• CUDA single-precision cores: 2560
• Clock Frequency: 1683 MHz

It’s fairly normal to use such commercial gaming cards in cluster computing. In Switzerland I had use of a cluster that had 6 Titan-XP cards per node. Titans are only marginally faster than 1080s in practical terms but they have 50 % more VRAM, which is more important in many tasks. The only real reason to buy Tesla cards IMHO is if you need double-precision, but they usually aren’t cost effective compared to CPUs.

Roughly speaking these two computing units are in a similar price range. With the SIMD instructions, we should get about 256 flops/cycle from the CPU. The CPU runs at about twice the clock rate of the GPU, so we expect (very roughly) the GPU to be 5.0x faster than the CPU.

Here I’m going to use the Intel Python in a virtual `conda` environment, for the basic reason that I’m using Windows and on windows `pyFFTW` is slow compared to the Intel MKL-FFT (by a factor of three on my machine). FFTW is basically written for `gcc` and the newest `gcc` on Windows is to the best of my knowledge v5.1 with TDM-GCC. Furthermore FFTW has seen any updates for a couple of processor architecture generations. The Intel MKL-FFT basically takes better advantage of the modern AVX512 SIMD instructions my processor has available.

For matrix operations we’re going to use the NumExpr-3.0 development branch. While Intel provides a NumExpr 2.6.2 build in their conda channel, we really want the auto-vectorization and the `complex64` support that the 3.0 branch brings. Moreover for a fair test we have to use `complex64` because my Nvidia 1080 GTX doesn’t perform well with double-precision numbers. For the tests, I made a new `conda` environment as follows:

``````conda config --add channel intel

conda create --name intel --channel intel --override-channels intelpython3_core

(source) activate intel

conda install numba pyculib
``````

Currently NumExpr3 is built from source, as follows:

``````
git clone https://github.com/pydata/numexpr.git

cd numexpr

git branch numexpr-3.0

python setup.py install
``````

For whatever reason if you build NumExpr3 with Visual Studio on Windows (but not GCC on Linux) it leaks memory with the Intel Python distribution, whereas with a conventional Anaconda build it doesn’t. My motherboard doesn’t support Ubuntu yet, so while I have a VirtualBox install, the performance inside VirtualBox is far worse than with a native OS. Also I couldn’t get CUDA 9.0 working with Ubuntu inside VirtualBox either, as the appropriate ver.384 drivers aren’t available yet with `apt`. Ideally I’d like to try and compile NumExpr3 with ICC instead and see if that fixes the problem, but I don’t have a copy.

## The benchmark

I wrote some benchmark scripts, one `cpu.py`, using MKL-FFT and NumExpr-3:

``````import numpy as np
from time import perf_counter
import sys, os, os.path, pickle

N = int(sys.argv)
tries = int(sys.argv)
timings = np.zeros(tries)

import mrcz
imageShape = frames.shape[1:]

for I in np.arange(tries):
t0 = perf_counter()
fft_template = np.empty( imageShape, dtype='complex64' )
fft_frame = np.empty_like(fft_template)
xcorr2 = ne3.NumExpr( 'fft_frame=fft_frame*fft_template/abs(fft_frame*fft_template)' )

fft_template = np.fft.fft2(frames[0,:,:])
fft_template = fft_template.conj()
for K in np.arange(1,N):
fft_frame = np.fft.fft2(frames[K,:,:])
# There's a memory leak with NE3 with the Intel Python distro.
xcorr2( fft_frame=fft_frame, fft_template=fft_template)
output = np.fft.ifft2( fft_frame )

t1 = perf_counter()
print( f'Computed {N} phase correlations on CPU in {t1-t0} s' )
timings[I] = t1-t0

if os.path.isfile( 'cpu_times.pkl' ):
with open( 'cpu_times.pkl', 'rb' ) as fh:
if not N in cpu_times:
cpu_times[N] = []
cpu_times[N].append( np.min(timings) )
else:
cpu_times = {N:[np.min(timings)]}

with open( 'cpu_times.pkl', 'wb') as fh:
pickle.dump(cpu_times, fh)
``````

and another `gpu.py`, using `numba.cuda` and `pyculib.fft`:

``````import numpy as np
import numba.cuda as cuda
import pyculib.fft as cfft
from time import perf_counter
import sys, os, os.path, pickle

N = int(sys.argv)
tries = int(sys.argv)
timings = np.zeros(tries)

import mrcz
imageShape = frames.shape[1:]

# 1080 GTX seems to return errors above 20 compute units and 64 threads per compute unit,
# even though it is supposed to have 128 threads per.
Gdim = 20  # Thread-blocks per grid
Bdim = 64 # Threads per block

for I in range(tries):
t0 = perf_counter() # Don't count file load time in benchmark

@cuda.jit('void(complex64[:,:], complex64[:,:])')
def xcorr2(frame, conj_template):
I, J = cuda.grid(2)
frame[I,J] *= conj_template[I,J] / abs( frame[I,J] * conj_template[I,J] )

@cuda.jit('void(complex64[:,:])')
def conj_inplace(frame):
I, J = cuda.grid(2)
frame[I,J] = frame[I,J].real - frame[I,J].imag

stream = cuda.stream()
fftPlan = cfft.FFTPlan( shape=imageShape, itype='complex64', otype='complex64', stream=stream )
with stream.auto_synchronize():
fft_template = cuda.to_device(frames[0,:,:].astype('complex64'), stream=stream)

fftPlan.forward(fft_template, fft_template)
conj_inplace[Gdim,Bdim] ( fft_template )

# Run template match
for K in np.arange(1,N):
fft_frame = cuda.to_device(frames[K,:,:], stream=stream)
fftPlan.forward(fft_frame, fft_frame)
xcorr2[Gdim,Bdim] (fft_frame, fft_template)
fftPlan.inverse(fft_frame, fft_frame)
frames[K,:,:] = fft_frame.copy_to_host()

t1 = perf_counter()
print( f'Computed {N} phase correlations on GPU in {t1-t0} s' )
timings[I] = t1-t0

if os.path.isfile( 'gpu_times.pkl' ):
with open( 'gpu_times.pkl', 'rb' ) as fh:
if not N in gpu_times:
gpu_times[N] = []
gpu_times[N].append( np.min(timings) )
else:
gpu_times = {N:[np.min(timings)]}

with open( 'gpu_times.pkl', 'wb') as fh:
pickle.dump(gpu_times, fh)
``````

And finally a simple `runner.py` to execute the subprocesses:

``````import subprocess as sub
import os

try: os.remove('gpu_times.pkl')
except OSError: pass
try: os.remove('cpu_times.pkl')
except OSError: pass

unique_tries = 5
repeat_tries = 1
for I in range(2,90):
for J in range(unique_tries):
sub.call( f'python cpu.py {I} {repeat_tries}' )
sub.call( f'python gpu.py {I} {repeat_tries}' )

``````

We have to run our tests in subprocesses to get a real view of the time taken to startup the components. For the FFTs, there’s planning involved (although since these are power of 2 shaped images that should be minimal) and then for the GPU solution we also have the time for Numba to transform the Python functions into a CUDA kernel. There are ways around this cost if you are running the logic more than once (say via `dask` over a cluster), typically via serialization with `pickle`. Both NumExpr-3 and Numba support pickling of their output functions. Historically FFTW offered a means of saving a plan to a file, known as wisdom, but I don’t see an option to do that for either the CUDA FFT or the Intel MKL FFT bindings featured here.

## Results

Here are the results for executing 1 to 88 cross-correlations, all for 2k x 2k images. I did 5 tries per step, and took the minimum time taken: It takes about 60 frames (or ~ 1GB of data) for the GPU solution to catch the CPU. This is largely due to the longer setup time for the GPU solution. Numba has to actually compile those functions, whereas NumExpr has its own internal virtual machine and it can parse Python code very quickly. The initialization time for the CPU solution is about 0.2 ms, whereas for the GPU solution is about 600 ms, or a factor of 3000x difference. About half of the GPU setup time is for the FFT plan, whereas the Intel FFT doesn’t seem to have a setup time. Likely the Intel FFT has some cached wisdom for the given image shape. Note that these benchmarks are really quite fast for both solutions. I tried the same process with `pyfftw` and it takes about 10.5 s for the full 88 cross-correlations. Using `scipy.fftpack` is far slower, clocking in at 68 s.

Sometimes one wants to calculate something once, but often we want to repeat the same calculation on different data sets. If `cpu.py` and `gpu.py` are run through their `repeat_tries` loop the initial run is slow, but then the later runs are fast. Here the GPU manages to pass the GPU around frame 20, a better performance. I’m not the world expert in either GPU programming or the use of Numba, so likely a more skilled hand could squeeze some more performance out of the GPU. Monitoring with `nvidia-smi` shows a maximum utilization on the GPU of about 30 %. In my experience it’s difficult to get 100 % utilization out of a GPU, but certainly 50-60 % isn’t unreasonable. There are optimizations that could be done. For example, we are returning the full array from the GPU to main memory. Practically we probably don’t want the entire cross-correlation, via the `fft_frame.copy_to_host()` call, as we are usually only concerned with the neighbourhood around the maximum. So we could return a small sub-array, or possibly even up-interpolate the sub-array to get sub-pixel precision in our image shifts.

## Conclusion

If the GPU really is five times faster than the CPU then the GPU slope should be 5x flatter than the CPU slope. In reality the marginal speed-up per cross-correlation is only 23 % for the GPU over the CPU. The GPU solution in this case ends up being below expectations, for a more intricate and time-consuming coding task. In my experience where GPUs really shine is for iterative algorithms where the data stays on the GPU for some time. In comparison the cross-correlation problem is not particularly well-suited to GPU processing. It involves too little calculation on the GPU, and too large a volume of arrays to transfer from main memory to the GPU and back. Ideally for pure GPU solutions we want to repeatedly crunch numbers on the board. This means the problem has to fit within the GPU’s buffer, which in this case is 8 GB. I’m thankful for the introduction of 4k computer monitors as before the memory available to GPUs was much smaller. Overall the most notable result here to me was just how fast the Intel MKL-FFT is, and it’s also extremely simple to use, as it patches `numpy.fft` directly. I really did not expect it to be so competitive. Overall Python has gotten much faster in 2017 for scientific calculation, and that’s pleasing to see. 