I’ve been working on a new branch of the venerable NumExpr module for Python.
It’s a fairly major extension, as I discuss below. The alpha release can be cloned from GitHub as follows:
```shell
git clone https://github.com/pydata/numexpr.git
cd numexpr
git checkout numexpr-3.0
python setup.py install
```

(or pip install ., but that suppresses error messages)
The operations were re-written so that gcc can auto-vectorize the loops to use SIMD instructions. Each operation now has strided and aligned branches, which improves performance on aligned arrays by ~40%. The setup time for threads has been reduced by removing an unnecessary abstraction layer, and various other minor re-factorings have improved thread scaling.
The combination of speed-ups means that NumExpr3 often runs 200-500% faster than NumExpr2.6 on a machine with AVX2 support. The break-even point with NumPy is now roughly 64k-element arrays, compared to 256-512k elements for NE2.
Example 1: Multiply-Add
Here 8 threads are used for both NE2 and NE3 (NumPy is single-threaded), and the cache dictionaries for the NumExprs have been disabled.
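The NumPy side of such a comparison can be reproduced with a small harness like the one below. This is a sketch, not the benchmark code used for the figures above; the NE3 call appears only as a comment, since the alpha API may change.

```python
import timeit
import numpy as np

# Hypothetical harness: times the multiply-add a*b + c in plain NumPy.
# The NumExpr3 equivalent would be something like ne3.evaluate('out = a*b + c').
def bench_numpy_madd(n, repeats=3):
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    c = np.random.rand(n).astype(np.float32)
    out = np.empty_like(a)

    # Using explicit 'out' arguments avoids NumPy temporaries, which is the
    # fairer comparison against NumExpr's fused, blocked loop.
    def run():
        np.multiply(a, b, out=out)
        np.add(out, c, out=out)

    # Best-of-repeats, averaged over 10 inner calls.
    return min(timeit.repeat(run, number=10, repeat=repeats)) / 10

t_64k = bench_numpy_madd(64 * 1024)    # near the NE3 break-even point
t_1m = bench_numpy_madd(1024 * 1024)
print(f"64k elements: {t_64k*1e6:.1f} us, 1M elements: {t_1m*1e6:.1f} us")
```

Absolute numbers will of course depend on the machine, compiler, and thread count.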
The gap is larger for complex-number mathematics because the complex operations in NumExpr3 were implemented entirely with vectorized functions. Vectorization hasn't been done yet for the cmath functions on real numbers (such as exp()), only for the complex functions, but arithmetic operations such as +, -, *, / are auto-vectorized by gcc. Here we are also cheating, because NumExpr2 doesn't support complex64 as a data type, so it has to up-cast the data to complex128. Similar advantages are seen in image processing with uint16, where NE2 would up-cast to int32 and promptly return wrong values in the case of under- or overflow.
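The casting behaviour described above can be demonstrated with plain NumPy, which mirrors what NE2 does internally. This is an illustrative sketch of the promotion and overflow issues, not NumExpr code.

```python
import numpy as np

a = np.ones(4, dtype=np.complex64)
b = np.ones(4, dtype=np.complex64)
# NumPy keeps complex64 * complex64 in complex64; NE2, lacking a complex64
# path, promotes both operands to complex128, doubling memory traffic.
print((a * b).dtype)  # → complex64

# uint16 arithmetic silently wraps around on overflow:
img = np.array([65000, 1000], dtype=np.uint16)
print(img + np.uint16(1000))        # → [ 464 2000]  (65000+1000 wrapped)

# Promoting to a wider type first gives the mathematically correct sum:
print(img.astype(np.int32) + 1000)  # → [66000  2000]
```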
Example 2: Mandelbrot set
Jean-Francois Puget wrote a blog post last year for IBM, “How To Quickly Compute The Mandelbrot Set In Python.” NumExpr2 didn’t perform especially well compared to Numba. There are some reasons for this. In Puget’s code, the calculation would have been performed with complex128 instead of complex64 as intended, because NE2 up-casts constants to float64 and didn’t support a complex64 datatype. It’s also not clear how many threads were used (he did the calculations on his laptop).
Here’s an example of the algorithm in a NumExpr3 format:
```python
def mandelbrot_ne3(c, maxiter):
    output = np.zeros(c.shape, dtype='float32')
    notdone = np.zeros(c.shape, dtype='bool')
    z = np.zeros(c.shape, dtype='complex64')
    # Almost 30% of the time in a comparison appears to be in the
    # cast to npy_bool
    neObj1 = ne3.NumExpr('notdone = abs2(z) < 4.0')
    neObj2 = ne3.NumExpr('z = where(notdone, z*z + c, z)')
    for it in np.arange(maxiter, dtype='float32'):
        # Here 'it' changes, but the AST parser doesn't know that and treats it
        # as a const if we use 'where(notdone, it, output)'.
        # What we really need is an iter(ops, range) function inside
        # ne3. This is an interesting case, since really here we see a
        # major limitation in NumExpr working inside a loop.
        neObj1.run(check_arrays=False)
        output[notdone] = it
        neObj2.run(check_arrays=False)
    output[output == maxiter-1] = 0
    return output

def mandelbrot_set_ne3(xmin, xmax, ymin, ymax, width, height, maxiter):
    r1 = np.linspace(xmin, xmax, width, dtype='float32')
    r2 = np.linspace(ymin, ymax, height, dtype='float32')
    c = r1 + r2[:,None]*1j
    n3 = mandelbrot_ne3(c, maxiter)
    return (r1, r2, n3.T)
```
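For reference, here is an equivalent pure-NumPy version of the same iteration, which needs no NumExpr at all. It follows the structure of the NE3 code above step for step, so it is also useful for checking correctness.

```python
import numpy as np

def mandelbrot_np(c, maxiter):
    # Plain-NumPy mirror of the NE3 loop above.
    output = np.zeros(c.shape, dtype='float32')
    z = np.zeros(c.shape, dtype='complex64')
    for it in range(maxiter):
        # abs2(z) < 4.0, computed without the sqrt in np.abs()
        notdone = (z.real * z.real + z.imag * z.imag) < 4.0
        output[notdone] = it
        # Freeze z for escaped points, iterate the rest.
        z = np.where(notdone, z * z + c, z)
    output[output == maxiter - 1] = 0
    return output
```

Each NumPy statement here allocates temporaries and makes its own pass over memory, which is precisely the overhead the fused NE3 loops avoid.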
Here I benchmarked with 4 threads on a Xeon CPU with a 2.5 GHz clock:
- NE2.6: 365 ms for set #1
- NE2.6: 9.73 s for set #2
- NE3: 138 ms for set #1
- NE3: 4.26 s for set #2
So again we see a nice 250 % speed-up for NE3 compared to NE2. There is a limitation in this algorithm for NumExpr in that the variable it changes with each iteration but would be treated as a const by NumExpr, which forces us in and out of the interpreter more than we would prefer. There’s a lot more that I believe can be done to improve NE3 performance, especially with regard to thread scaling.
More NumPy Datatypes
The program was re-factored from an ASCII-encoded byte code to a struct array, so that the operation space is now 65535 instead of 128. As such, support for many more NumPy data types was added.
NumExpr3 now uses NumPy ‘safe’ casting rules. If an operation doesn’t return the same result as NumPy, it’s a bug. In the future, other casting styles will be added if there is demand for them.
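NumPy’s ‘safe’ casting rules can be inspected directly with np.can_cast and np.result_type; the sketch below illustrates the rules NE3 now follows (this is NumPy’s own API, not NumExpr code).

```python
import numpy as np

# 'safe' casting only permits conversions that cannot lose information:
print(np.can_cast('uint16', 'int32', casting='safe'))   # → True
print(np.can_cast('int64', 'float32', casting='safe'))  # → False (loses precision)

# result_type gives the promoted dtype for a mixed-type operation:
print(np.result_type('float32', 'float64'))             # → float64
print(np.result_type('complex64', 'float32'))           # → complex64
```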
More complete function set
With the enhanced operation space, almost the entire C++11 cmath function set is supported (provided the compiler’s math library has them; only C99 is strictly expected). Bitwise operations were also added for all integer datatypes. There are now 436 operations/functions in NE3, with more to come, compared to 190 in NE2.
Also, a library-enum has been added to the op keys, which allows multiple backend libraries to be linked to the interpreter and switched on a per-expression basis, rather than, for example, picking between GNU std and Intel VML at compile time.
More complete Python language support
The Python compiler was re-written from scratch to use the CPython ast module and a functional programming approach. As such, NE3 now compiles a wider subset of the Python language. It supports multi-line evaluation and assignment with named temporaries. The new compiler spends considerably less time in Python to compile expressions: about 120 µs for 'a*b', compared to 550 µs for NE2.
Compare for example:
```python
out_ne2 = ne2.evaluate('exp(-sin(2*a**2) - cos(2*b**2) - 2*a**2*b**2)')
```

```python
neObj = ne3.NumExpr('''a2 = a*a; b2 = b*b
out_magic = exp(-sin(2*a2) - cos(2*b2) - 2*a2*b2)''')
```
This is a contrived example, but the multi-line approach will allow cleaner code and more sophisticated algorithms to be encapsulated in a single NumExpr call. The convention is that intermediate assignment targets are named temporaries if they do not exist in the calling frame, and full assignment targets if they do, which provides a method for multiple returns. Single-level attribute access (e.g. self.data) is also supported, for increased convenience and cleaner code. Slicing still needs to be performed above the ne3.evaluate() or ne3.NumExpr() call.
The code base was generally refactored to increase the prevalence of single-point declarations, such that modifications don’t require extensive knowledge of the code. In NE2 a lot of code was generated by the pre-processor using nested macros. That has been replaced by an object-oriented Python code generator called by setup.py, which generates about 15k lines of C code from 1k lines of Python. The use of generated code with defined line numbers makes debugging threaded code simpler.
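The generate-C-from-Python approach can be sketched in a few lines. The template and names below are hypothetical and much simpler than the real NE3 generator; the point is only to show how one Python loop over (operation × dtype) replaces nested C pre-processor macros.

```python
# Toy sketch of template-based C code generation (hypothetical template;
# the real NE3 generator is far more elaborate).
C_TEMPLATE = """\
static void ne_{name}_{type}(npy_intp n, const {ctype} *a, const {ctype} *b, {ctype} *out) {{
    for (npy_intp i = 0; i < n; i++) out[i] = a[i] {op} b[i];
}}
"""

def generate_ops(ops, ctypes):
    # One kernel per (operation, dtype) pair -- the single point of
    # declaration: adding a dtype here regenerates every kernel.
    chunks = []
    for name, op in ops.items():
        for npy_type, ctype in ctypes.items():
            chunks.append(C_TEMPLATE.format(name=name, type=npy_type,
                                            ctype=ctype, op=op))
    return "\n".join(chunks)

source = generate_ops({'add': '+', 'mult': '*'},
                      {'f32': 'npy_float32', 'f64': 'npy_float64'})
print(source.count('static void'))  # → 4 (2 ops x 2 types)
```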
The generator also builds the autotest portion of the test submodule, for checking equivalence between NumPy and NumExpr3 operations and functions.
What’s TODO compared to NE2?
- strided complex functions
- Intel VML support (less necessary now with gcc auto-vectorization)
- reductions (e.g. sum, prod)
What I’m looking for feedback on
- String arrays: How do you use them? How would unicode differ from bytes strings?
- Interface: We now have a more object-oriented interface underneath the familiar evaluate() interface. How would you like to use this interface? Francesc suggested generator support, as currently it’s more difficult to use NumExpr within a loop than it should be.
Ideas for the future
- vectorize real functions (such as log) similar to the complex functions
- add a keyword (likely yield) to indicate that a token is intended to be changed by a generator inside a loop with each call to run()
If you have any thoughts or find any issues, please don’t hesitate to open an issue at the GitHub repo. Although unit tests have been run over the operation space, there are undoubtedly a number of bugs to squash.