Good news, everyone!

This week has been hectic, but also very fruitful. I have the functionality I set out for: it works, and it is fast! It still needs testing, polish, and bug-fixing, and is by no means finished, but it is now possible to define indexed math expressions, translate them to code, compile the code, wrap it as a Python module, and import it again for fast numerical calculations. All of this can be done without leaving the isympy command interpreter.

I’ve been hacking on the automatic wrapping of binary routines; I have also refactored the code printers, implemented handling of indexed powers, and enabled scalar broadcasting and ufunc-ish support for arbitrary compiled functions. Next week I will focus on physics-inspired examples, testing of the binary code, and improving the documentation. But first, it is time for a little demonstration.

The latest code is here, if you want to try it for yourself. There are parts that don’t work properly yet; for instance, the Cython wrapper cannot handle numpy arrays. But as you will see below, there are enough working parts that you can do some fancy stuff.

The trace of a matrix — faster than numpy!

In an earlier blog post, I showed you that the C printer is able to print the necessary loops to implement summations over array indices. The Fortran printer now has this functionality as well, so it can be used with f2py.
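To get a feel for what those loops look like, here is a small sketch of my own (not part of the original post, and the exact output may differ between Sympy versions) that asks the Fortran printer directly about the trace contraction:

# A hedged sketch: ask Sympy's Fortran printer for the summation loop
# implied by a repeated ("dummy") index.  Formatting may vary by version.
from sympy import symbols, fcode
from sympy.tensor import IndexedBase, Idx

m = symbols('m', integer=True)
i = Idx('i', m)          # the repeated index below implies summation over i
A = IndexedBase('A')
t = symbols('t')         # the variable the result is assigned to

print(fcode(A[i, i], assign_to=t))
# Expected: a do-loop accumulating t = t + A(i, i) for i = 1, m

Let's start with a simple function to calculate numerically the trace of a matrix, all while running isympy: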

In [1]: from sympy.utilities.autowrap import autowrap
In [2]: m, n = symbols('m n', integer=True)
In [3]: i = Idx('i', m)
In [4]: j = Idx('j', n)
In [5]: A = IndexedBase('A')
In [6]: trace = autowrap(A[i, i])

That’s it. Now let’s have a look at the function we created:

In [7]: trace?
Type:        fortran
String Form:    <fortran object at 0xa93bea8>
Namespace:    Interactive
Docstring:
 autofunc - Function signature:
 autofunc = autofunc(a,[m])
 Required arguments:
 a : input rank-2 array('d') with bounds (m,m)
 Optional arguments:
 m := shape(a,0) input int
 Return objects:
 autofunc : float

This informative docstring is created by f2py, and only slightly distorted by the blogging software.  It tells us everything we need to know.  Let’s try this function on a numpy array:

In [8]: import numpy as np
In [9]: I = np.asarray(np.eye(1000), order='F')
In [10]: trace(I)
Out[10]: 1000.0
In [11]: timeit trace(I)
10000 loops, best of 3: 30.3 us per loop
In [12]: np.trace(I)
Out[12]: 1000.0
In [13]: timeit np.trace(I)
1000 loops, best of 3: 230 us per loop

First of all, please notice the parameter order='F' passed to the numpy array. This ensures that the array is created with column-major memory order, which is the Fortran format. Without it, the array would be created in row-major order, and our Fortran function would run much, much slower.
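If you are unsure which layout an array has, the flags attribute will tell you. A quick illustration (plain numpy, nothing autowrap-specific):

import numpy as np

I_c = np.eye(1000)                # numpy default: row-major (C order)
I_f = np.asarray(I_c, order='F')  # column-major copy (Fortran order)

print(I_c.flags['F_CONTIGUOUS'])  # False
print(I_f.flags['F_CONTIGUOUS'])  # True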

Now to the fun part: did you notice the speedup? The Sympy-generated function is almost a factor of 8 faster than numpy! This is great, and I think I can be really happy about this result. But it is also time for a little warning: please don’t take these numbers as some sort of absolute truth. For all I know, I may be comparing apples and oranges.

In fact, the numbers I present are probably more relevant for a comparison of compilers and optimization flags. But whatever the reason for the speedups, faster Python-callable functions are no doubt a good thing. For the record: I am running this on a little Acer Aspire One using numpy 1.3.0 shipped with Ubuntu 10.04. The autowrapped binaries are compiled with Ubuntu’s gfortran 4.4.3.

Matrix-Matrix product

Now let’s look at a more expensive array operation. Whereas the trace of an m*m matrix has a cost of order O(m), the product of two m*m matrices scales like O(m^3). First we need to express a matrix-matrix product in terms of Indexed Sympy objects and create the binary function.

In [14]: B = IndexedBase('B')
In [15]: C = IndexedBase('C')
In [16]: o = symbols('o', integer=True)
In [17]: k = Idx('k', o)
In [18]: expr_mat_mat = Eq(C[i, j], A[i, k]*B[k, j]); expr_mat_mat
Out[18]: C[i, j] = A[i, k]⋅B[k, j]

The matrix-matrix product displayed at Out[18] is defined in terms of an Equality instance. This is a handy way to provide the code printers with both an expression and the variable it should be assigned to. For indexed expressions, it is important that the indices on the left-hand side and the right-hand side are compatible. (Compatible usually means that the sets of non-dummy indices must be identical, with the exception that the right-hand side may be scalar.)
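To make the compatibility rule concrete, here is a small sketch of my own (these lines are not from the original session; the commented-out line shows a combination the code generation machinery should reject):

from sympy import Eq, symbols
from sympy.tensor import IndexedBase, Idx

m, n, o = symbols('m n o', integer=True)
i, j, k = Idx('i', m), Idx('j', n), Idx('k', o)
A, B, C = IndexedBase('A'), IndexedBase('B'), IndexedBase('C')
x = symbols('x')

Eq(C[i, j], A[i, k]*B[k, j])   # compatible: k is a dummy index, i and j match
Eq(C[i, j], 2*x)               # compatible: a scalar rhs broadcasts over i, j
# Eq(C[i], A[i, j]*x)          # incompatible: j is free on the rhs only

For the binary function, we get: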

In [19]: matmat = autowrap(expr_mat_mat)
In [20]: matmat?
Type:        fortran
String Form:    <fortran object at 0x8f35290>
Namespace:    Interactive
Docstring:
 autofunc - Function signature:
 c = autofunc(a,b,[m,n,o])
 Required arguments:
 a : input rank-2 array('d') with bounds (m,o)
 b : input rank-2 array('d') with bounds (o,n)
 Optional arguments:
 m := shape(a,0) input int
 n := shape(b,1) input int
 o := shape(a,1) input int
 Return objects:
 c : rank-2 array('d') with bounds (m,n)

Now let’s test it with a QR factorization on some numpy arrays:

In [21]: M = np.asarray(np.random.rand(100, 100), order='F')
In [22]: Q, R = np.linalg.qr(M)
In [23]: M_np = np.dot(Q, R)
In [24]: M_my = matmat(Q, R)
In [25]: np.linalg.norm(M_my - M_np)
Out[25]: 1.7522660836e-14

Here is the timing:

In [26]: %timeit M_np = np.dot(Q, R)
100 loops, best of 3: 10.9 ms per loop
In [27]: %timeit M_my = matmat(Q, R)
100 loops, best of 3: 6.11 ms per loop

This looks impressive! But I also found that numpy scales much better than what I get from autowrap on this computer:

In [53]: M = np.asarray(np.random.rand(400, 400), order='F')
In [54]: Q, R = np.linalg.qr(M)
In [55]: %timeit M_np = np.dot(Q, R)
1 loops, best of 3: 687 ms per loop
In [56]: %timeit M_my = matmat(Q, R)
1 loops, best of 3: 1.65 s per loop

I believe that this bad scaling must be an issue with gfortran, or rather with how I use it. The theoretical O(m^3) cost corresponds to a factor of 64 when going from 100 to 400, and while numpy follows theory quite closely (687/10.9 ≈ 63), the autowrapped function misses by far (1650/6.11 ≈ 270). On another computer, I tested the same calculations with Intel’s ifort compiler and got much better results that were consistently better than numpy and had the expected scaling.

I have more things I want to show, but this blog post is becoming rather long, so stay tuned for more exciting new functionality in the coming days.


13 Responses to Good news, everyone!

  1. asmeurer says:

    Congratulations on the speedup over numpy!

    I found on my WordPress blog that you can wrap things in [code] and [/code], it gives you nice code blocks, and [code='py'] colors it. For in-line code, you should use <code> and </code>. And you can use $latex math$ to get \LaTeX math.

    (and I had to use fancy HTML to get those tags to render in this comment, so let’s hope it comes out right, or that I can edit it if it doesn’t)

  2. asmeurer says:

    Yeah, I meant to type \LaTeX.

    You get the idea anyway. WordPress really needs to add a preview feature to its comments.

  3. asmeurer says:

    I wanted to try your example on my computer to see if I got similar timings, but I got this (I’m assuming autowrap3 (6442080) is the correct branch):

    In [6]: trace = autowrap(A[i, i])
    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    
    /Users/aaronmeurer/Documents/python/sympy/sympy-scratch/<ipython console> in <module>()
    
    /Users/aaronmeurer/Documents/python/sympy/sympy-scratch/sympy/utilities/autowrap.py in autowrap(expr, language, backend, tempdir, args, flags, quiet, help_routines)
        329         helpers.append(Routine(name, expr, args))
        330 
    --> 331     return code_wrapper.wrap_code(routine, helpers=helpers)
        332 
        333 def binary_function(symfunc, expr, **kwargs):
    
    /Users/aaronmeurer/Documents/python/sympy/sympy-scratch/sympy/utilities/autowrap.py in wrap_code(self, routine, helpers)
        120             self._prepare_files(routine)
        121             self._process_files(routine)
    --> 122             mod = __import__(self.module_name)
        123         finally:
        124             sys.path.pop()
    
    ImportError: dynamic module does not define init function (initwrapper_module_0)
    

    (that’s what a [code] block looks like, by the way).

    So is it a bug, or do I need to install something, or what?

    • jegerjensen says:

      You need to have f2py and a Fortran compiler installed; the C and Cython backends are not ready for numpy arrays yet. I pushed to github a minute ago, and in the current HEAD autowrap takes an optional quiet flag. I’d suggest trying again with quiet=False.

      • asmeurer says:

        OK, fink claims that f2py is installed as a virtual package to numpy.

        This is my quiet=False output.

        Sorry, I actually don’t know anything about numpy, Fortran, or code generation, but seeing SymPy faster than the other guy always perks my interest :)

        • jegerjensen says:

          Hmm.. this was a bit disappointing. Thanks for reporting the output, but as you can see, f2py is rather verbose and a bit difficult to decipher. Could you do me a favor and check one little thing?

          autowrap also takes a tempdir argument; if it is given, the files are not removed after compilation. If you run it again with tempdir='/tmp/trace' you will find the code and the f2py wrapper files in that directory. Could you do that and try to compile the .f90 file? (gfortran -c wrapped_code_?.f90)
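          Something like this, reusing A and i from the blog post (the exact file name will vary):

          from sympy.utilities.autowrap import autowrap
          trace = autowrap(A[i, i], tempdir='/tmp/trace')
          # /tmp/trace now keeps wrapped_code_?.f90 and the f2py wrapper files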

  4. pdf says:

    Note that O(m^3) is only the complexity of the naive matrix multiplication algorithm. There are in fact already at least two algorithms which are essentially O(m^2), but it seems that their correctness for general m is only conjectural at the moment. The algorithm used in practical BLAS implementations is a combination of the naive one and Strassen’s which scales approximately as O(m^2.8). I’m guessing that the reason you observed scaling closer to m^3 rather than m^2.8 is either that 400 is too close to the size when Strassen starts being used, or the algorithm is suffering from no longer being able to work entirely within your CPU’s caches.

    PS: great project, I hope to be using it soon

    • jegerjensen says:

      Thanks, I was not aware that BLAS could achieve this scaling; I just knew from experience that I could never beat it with my own implementations. :-)

      It would be really nice to have Strassen’s O(m^2.8) available with autowrap. Identifying standard array operations in an expression is on the todo list of the tensor module, where the expression analysis takes place. So eventually the code printers should be able to determine and use appropriate library calls, bringing Strassen right home to Sympy.
