This week has been hectic, but also very fruitful. I have the functionality I set out for, it works and it is fast! It still needs testing, polish and bug-fixing and is by no means finished, but it is now possible to define indexed math-expressions, translate to code, compile it, wrap it as a python module and import it again for fast numerical calculations. All of this can be done without leaving the isympy command interpreter.
I’ve been hacking on the automatic wrapping of binary routines, refactored the code printers, implemented handling of indexed powers, enabled scalar broadcasting and ufunc-ish support for arbitrary compiled functions. For next week I will focus on physics inspired examples, testing of binary code and improving the documentation. But first, it is time for a little demonstration.
The latest code is here, if you want to try for yourself. There are parts that don’t not work properly, for instance the Cython wrapper cannot handle numpy arrays yet, but as you will see below there are enough working parts that you can do some fancy stuff.
The trace of a Matrix — faster that numpy!
In an earlier blog post, I showed you that the C printer is able to print the necessary loops to implement summations over array indices. The Fortran printer has also this functionality now, so it can be used with f2py. Let’s start with a simple function to calculate numerically the trace of a matrix while running isympy:
In [1]: from sympy.utilities.autowrap import autowrap
In [2]: m, n = symbols('m n', integer=True)
In [3]: i = Idx('i', m)
In [4]: j = Idx('j', n)
In [5]: A = IndexedBase('A')
In [6]: trace = autowrap(A[i, i])
That’s it. Now lets have a look at the function we created,
In [7]: trace?
Type: fortran
String Form: <fortran object at 0xa93bea8>
Namespace: Interactive
Docstring:
autofunc - Function signature:
autofunc = autofunc(a,[m])
Required arguments:
a : input rank-2 array('d') with bounds (m,m)
Optional arguments:
m := shape(a,0) input int
Return objects:
autofunc : float
This informative docstring is created by f2py, and only slightly distorted by the blogging software. It tells us everything we need to know. Let’s try this function on a numpy array:
In [8]: import numpy as np In [9]: I = np.asarray(np.eye(1000), order='F') In [10]: trace(I) Out[10]: 1000.0 In [11]: timeit trace(I) 10000 loops, best of 3: 30.3 us per loop In [12]: np.trace(I) Out[12]: 1000.0 In [13]: timeit np.trace(I) 1000 loops, best of 3: 230 us per loop
First of all, please notice the parameter “order=’F'” passed to the numpy array. This ensures that the array is created with column-major memory order which is the Fortran format. Without that, the array is created with row-major order, and our Fortran function would run much, much slower.
Now to the fun part: Did you notice the speedup? The Sympy generated function is almost a factor of 8 faster that numpy! This is great, and I think I can be really happy about this result. But it is also time for a little warning: Please don’t take these numbers as some sort of absolute truth. For all I know I may be comparing apples and oranges.
In fact, the numbers I present are probably more relevant for a comparison of compilers and optimization flags. But, whatever the reason for the speedups, faster python callable functions is no doubt a good thing. For the record: I am running this on a little Acer Aspire One using numpy 1.3.0 shipped with Ubuntu 10.04. The autowrapped binaries are compiled with Ubuntu’s gfortran 4.4.3.
Matrix-Matrix product
Now lets look at a more expensive array operation. Whereas the trace of an m*m matrix has a cost of order O(m), the product of two m*m matrices scales like O(m^3). First we need to express a matrix-matrix product in terms of Indexed Sympy objects and create the binary function.
In [14]: B = IndexedBase('B')
In [15]: C = IndexedBase('C')
In [16]: o = symbols('o', integer=True)
In [17]: k = Idx('k', o)
In [18]: expr_mat_mat = Eq(C[i, j], A[i, k]*B[k, j]); expr_mat_mat
Out[18]: C[i, j] = A[i, k]⋅B[k, j]
The matrix-matrix product displayed at Out[18] is defined in terms of an Equality instance. This is a handy way to provide the code printers with both an expression and the variable it should be assigned to. For indexed expressions it is important that the indices on left hand side and right hand side are compatible. (Compatible means usually that the set of non-dummy indices must be identical, but with the exception that the right hand side can be scalar.) For the binary function, we get:
In [19]: matmat = autowrap(expr_mat_mat)
In [20]: matmat?
Type: fortran
String Form: <fortran object at 0x8f35290>
Namespace: Interactive
Docstring:
autofunc - Function signature:
c = autofunc(a,b,[m,n,o])
Required arguments:
a : input rank-2 array('d') with bounds (m,o)
b : input rank-2 array('d') with bounds (o,n)
Optional arguments:
m := shape(a,0) input int
n := shape(b,1) input int
o := shape(a,1) input int
Return objects:
c : rank-2 array('d') with bounds (m,n)
Now let’s test it with a QR factorization on some numpy arrays:
In [21]: M = np.asarray(np.random.rand(100, 100), order='F') In [22]: Q, R = np.linalg.qr(M) In [23]: M_np = np.dot(Q, R) In [24]: M_my = matmat(Q, R) In [25]: np.linalg.norm(M_my - M_np) Out[25]: 1.7522660836e-14
Here is the timing:
In [26]: %timeit M_np = np.dot(Q, R) 100 loops, best of 3: 10.9 ms per loop In [27]: %timeit M_my = matmat(Q, R) 100 loops, best of 3: 6.11 ms per loop
This looks impressive! But, I also found that numpy has a much better scaling than what I get from autowrap on this computer:
In [53]: M = np.asarray(np.random.rand(400, 400), order='F') In [54]: Q, R = np.linalg.qr(M) In [55]: %timeit M_np = np.dot(Q, R) 1 loops, best of 3: 687 ms per loop In [56]: %timeit M_my = matmat(Q, R) 1 loops, best of 3: 1.65 s per loop
I believe that this bad scaling must be an issue with gfortran, or rather an issue with how I use it. The theoretical scaling of O(m^3) correspond to a factor of 64, and while numpy follows theory quite closely, (687/10.9=63.0) the autowrapped function misses by far. (1650/6.11=270) . On another computer, I tested the same calculations with Intels ifort compiler and got much better results that were consistently better than numpy and had the expected scaling.
I have more things I want to show, but this blog post is becoming rather long, so stay tuned for more exciting new functionality in the coming days.
Congratulations on the speedup over numpy!
I found on my WordPress blog that you can wrap things in [code] and [/code], it gives you nice code blocks, and [code=’py’] colors it. For in-line, you should use <code> and </code>. And you can use $latex math$ to get
math.
(and I had to use fancy HTML to get those tags to render in this comment, so let’s hope it comes out right, or that I can edit it if it doesn’t)
I guess I can’t edit it. It’s latex math_expression to get LaTeX math.
Oh come on! Put $ around the “latex math_expression”.
Yeah, I’ve noticed the nice math formulas on your blog. Thanks for sharing, and esp. the code block is cool.
I’ll try to fix your comment, give me a shout if it comes out wrong
Yeah, I meant to type
.
You get the idea anyway. WordPress really needs to add a preview feature to its comments.
I wanted to try your example in my computer to see if I got similar timings, but I got this (I’m assuming autowrap3 (6442080) is the correct branch):
In [6]: trace = autowrap(A[i, i]) --------------------------------------------------------------------------- ImportError Traceback (most recent call last) /Users/aaronmeurer/Documents/python/sympy/sympy-scratch/<ipython console> in <module>() /Users/aaronmeurer/Documents/python/sympy/sympy-scratch/sympy/utilities/autowrap.py in autowrap(expr, language, backend, tempdir, args, flags, quiet, help_routines) 329 helpers.append(Routine(name, expr, args)) 330 --> 331 return code_wrapper.wrap_code(routine, helpers=helpers) 332 333 def binary_function(symfunc, expr, **kwargs): /Users/aaronmeurer/Documents/python/sympy/sympy-scratch/sympy/utilities/autowrap.py in wrap_code(self, routine, helpers) 120 self._prepare_files(routine) 121 self._process_files(routine) --> 122 mod = __import__(self.module_name) 123 finally: 124 sys.path.pop() ImportError: dynamic module does not define init function (initwrapper_module_0)(that’s what a [code] block looks like, by the way).
So is it a bug, or do I need to install something, or what?
You need to have f2py installed and a fortran compiler, C and Cython is not ready for numpy arrays yet. I pushed to github a minute ago, and in the current HEAD, autowrap takes an optional quiet flag. I’d suggest to try again with quiet=False.
OK, fink claims that f2py is installed as a virtual package to numpy.
This is my
quiet=Falseoutput.Sorry, I actually don’t know anything about numpy, Fortran, or code generation, but seeing SymPy faster than the other guy always perks my interest
Hmm.. this was a bit disappointing. Thanks for reporting the output, but as you see f2py is rather verbose and a bit difficult to decipher. Could you do me a favor and check on little thing?
autowrap also takes an argument tempdir, and if given the files are not removed after compilation. If you run it again and use tempdir=’/tmp/trace’ you will find the code and the f2py wrapper files in that directory. Could you do that and try to compile the .f90 file? (gfortran -c wrapped_code_?.f90)
I did that and gfortran quietly produced a wrapped_code_0.o file (i.e., it succeeded).
By the way, I have gfortran (GCC) version 4.4.4, and numpy version 1.4.1.
Thanks a lot! So then it is the wrapper framework that needs attention. I’ll see what I can do.
Note that O(m^3) is only the complexity of the naive matrix multiplication algorithm. There are in fact already at least two algorithms which are essentially O(m^2), but it seems that their correctness for general m is only conjectural at the moment. The algorithm used in practical BLAS implementations is a combination of the naive one and Strassen’s which scales approximately as O(m^2.8). I’m guessing that the reason you observed scaling closer to m^3 rather than m^2.8 is either that 400 is too close to the size when Strassen starts being used, or the algorithm is suffering from no longer being able to work entirely within your CPU’s caches.
PS: great project, I hope to be using it soon
Thanks, I was not aware that BLAS could achieve this scaling, I just knew from experience that I could never beat it with my own implementations.
It would be really nice to have Strassen’s O(2.8) available with autowrap. It is on the todo list in the tensor module, where the expression analysis takes place, to identify standard array operations in an expression. So, eventually the code printers should be able to determine and use appropriate library calls, bringing Strassen right home to Sympy.