Again, it will be interesting to see how we compare with the array initialization features of numpy. One lesson I learned while preparing my previous post is that the timing results depend heavily on the compiler. To avoid the scaling issues we observed with the matrix-matrix product last time, I will do all examples on a computer with ifort 11.0. The Numpy installation on that computer is version 1.1.0. The warning applies again: I report the timing results I get on my computer, but this is not a rigorous benchmark. Also, I may not be using Numpy in the optimal way. If you know how I could make it run faster, I’d love hear about it.

First, lets try to create our own compiled linspace function that we can compare with numpy.linspace. The Numpy function creates an array of evenly spaced numbers, like this:

In [1]: import numpy as np In [2]: np.linspace(0, 1, 5) Out[2]: array([ 0. , 0.25, 0.5 , 0.75, 1. ])

The same functionality can be created by autowrapping a Sympy expression that maps the array index to an interval on the real axis:

In [3]: from sympy.utilities.autowrap import autowrap In [4]: a, b = symbols('a b') In [5]: i = Idx('i', m) In [6]: linearmap = a + (i - 1)*(a - b)/(1 - m) In [7]: x = IndexedBase('x') In [8]: linspace = autowrap(Eq(x[i], linearmap)) In [9]: linspace(0, 1, 5) Out[9]: [ 0. 0.25 0.5 0.75 1. ]

So, it worked. The array has been initialized according to the linear map defined by the Sympy expression at In[6]. In order to map the index to the interval correctly one must remember that the index of a Fortran array starts at 1, while C arrays start at 0. This must be be taken into account when the initialization expression is defined, so a slightly different expression must be used with the C backend.

Now, you may be wondering how I could know the order of the arguments in the wrapped linspace function, and why it was so conveniently compatible with the interface for numpy.linspace. The answer, is that unless autowrap gets the optional keyword argument ‘args’, the arguments are sorted according to their string representation. I carefully chose the symbols in the linear map so that my linspace would automatically get the same argument sequence as Numpy’s.

In [14]: %timeit np.linspace(0, 5, 100000) 1000 loops, best of 3: 916 µs per loop In [15]: %timeit linspace(0, 5, 100000) 1000 loops, best of 3: 381 µs per loop

Sweet! Evenly spaced numbers are of course very useful, but let us also try something a bit more juicy.

One of the really useful concepts in Numpy is the “universal function” or ufunc. To put it short, ufuncs operate on arrays by applying a scalar function elementwise. It is actually more than that, as numpy ufuncs are required to support type casting, broadcasting and more, but we will ignore that and focus on the following quote from the Numpy docs:

That is, a ufunc is a “vectorized” wrapper for a function that takes a fixed number of scalar inputs and produces a fixed number of scalar outputs.

Can we make such functions with the autowrap framework? As you can probably guess, the answer is positive, so the question is rather: can we beat Numpy on speed? Let’s find out.

Very recently, Sympy got functionality to generate radial electronic wave functions for the Hydrogen atom. That is, quantum mechanical solutions to the attractive Coulomb potential. These are well known, and long since tabulated functions, that are routinely used in quantum mechanics. Wouldn’t it be nice to have the Hydrogen wave functions available as super fast ufunc-like binary functions?

The failsafe way to build a fast binary function that works element-wise on an array, is to construct a symbolic Lambda function that contains the initialization expression. The Lambda instance should then be attached to a regular Sympy Function. This is done with a call to implemented_function(‘f’, Lambda(…)) as in the following:

In [2]: from sympy.utilities.lambdify import implemented_function In [3]: from sympy.physics.hydrogen import R_nl In [4]: a, r = symbols('a r') In [5]: psi_nl = implemented_function('psi_nl', Lambda([a, r], R_nl(1, 0, a, r))) In [6]: psi_nl(a, r) Out[6]: psi_nl(a, r) In [7]: psi_nl._imp_(a, r) Out[7]: -r ____ -- / 1 a 2* / -- *e / 3 \/ a

As you would guess from Out[6], psi_nl is just a regular Sympy Function, but with an additional attribute, _imp_, that works as displayed in In[7]. The implemented_function trick landed in Sympy just a few weeks ago. It was intended for numeric implementations, but in my autowrap3 branch, I (ab)use it to store a symbolic Lambda instead.

In [8]: x = IndexedBase('x') In [9]: y = IndexedBase('y') In [10]: i = Idx('i', m) In [11]: hydrogen = autowrap(Eq(y[i], psi_nl(a, x[i])), tempdir='/tmp/hydro')

The argument ‘tempdir’ tells autowrap to compile the code in a specific directory, and leave the files intact when finished. Checking the Fortran source code reveals that the wave function is calculated like this:

do i = 1, m y(i) = 2*sqrt(a**(-3))*exp(-x(i)/a) end do

The implemented function has been printed as an *inline* expression, avoiding a costly function call in the loop. (If you wonder why I didn’t just print the expression directly, the explanation is here.) Now that we have the functions available, this is how they look:

The parameters and are quantum numbers related to the electron’s energy and orbital angular momentum respectively.

The autowrapped Hydrogen functions can also be compared with numpy equivalents, as all waves are expressed in terms of functions that are universal in numpy. We use Sympy’s ‘lambdify’ to create a Python lambda function that calls the relevant ufuncs from numpy’s namespace:

In [31]: from sympy.utilities.lambdify import lambdify In [32]: psi_lambda = lambdify([a, r], R_nl(1, 0, a, r), 'numpy') In [33]: grid = np.linspace(0,3,100) In [34]: np.linalg.norm(psi_lambda(0.5, grid) - hydrogen(0.5, grid)) Out[34]: 2.21416777433e-15

But there are many solutions to the Hydrogen atom, so let’s study them a bit more systematically. The following is the output from a loop over the energy quantum number . I used only the s-wave functions , corresponding to an electron that moves straight through the nucleus and out on the other side before turning back again to repeat the cycle. For each , I calculated:

- %timeit for application of lambda function with Numpy ufuncs
- %timeit for application of the compiled and wrapped Sympy function
- the absolute difference between the solutions:

Here is the output:

n = 1 Numpy: 100000 loops, best of 3: 16.8 µs per loop Sympy: 1000000 loops, best of 3: 1.92 µs per loop difference: 1.07708980623e-14 n = 2 Numpy: 10000 loops, best of 3: 30 µs per loop Sympy: 1000000 loops, best of 3: 2 µs per loop difference: 5.13651621237e-15 n = 3 Numpy: 10000 loops, best of 3: 48.1 µs per loop Sympy: 100000 loops, best of 3: 3.37 µs per loop difference: 2.28896762655e-15 n = 4 Numpy: 10000 loops, best of 3: 70.9 µs per loop Sympy: 100000 loops, best of 3: 2.7 µs per loop difference: 6.18114500556e-15 n = 5 Numpy: 10000 loops, best of 3: 116 µs per loop Sympy: 100000 loops, best of 3: 4.61 µs per loop difference: 5.68967616077e-15 n = 6 Numpy: 10000 loops, best of 3: 142 µs per loop Sympy: 100000 loops, best of 3: 5.02 µs per loop difference: 1.21523884705e-14 n = 7 Numpy: 10000 loops, best of 3: 183 µs per loop Sympy: 100000 loops, best of 3: 5.95 µs per loop difference: 1.04406500806e-14

Awesome! For we beat Numpy by a factor of 30. But the really interesting thing here is the *scaling* with expression complexity. Look at this:

For higher the wave function expressions increase in complexity, and so does the execution time of both Numpy and Sympy. However, while the autowrapped Sympy expression needs 3 times longer for than for , the execution time for a lambda of native Numpy ufuncs increases tenfold!

This last figure shows the linear relation between execution times of the compiled Sympy expressions and native Numpy lambda function. The plot supports the idea that the it is the inherent complexity of the mathematical expression that determines the execution time. By extrapolation, we can expect that for an expression that would take Sympy 9 times longer, Numpy would be hit by a factor of 100.

The main reason for the speedup, is that while Numpy calculates the wave function by applying a series of ufuncs, the Sympy way is to create a new “ufunc” and apply it once. It would be very cool to try this in an iterative solution scheme. With a super fast ufunc in a single python loop, one could do complex and cpu-intensive calculations with respectable performance from within isympy. That would be awesome!

]]>

I’ve been hacking on the automatic wrapping of binary routines, refactored the code printers, implemented handling of indexed powers, enabled scalar broadcasting and ufunc-ish support for arbitrary compiled functions. For next week I will focus on physics inspired examples, testing of binary code and improving the documentation. But first, it is time for a little demonstration.

The latest code is here, if you want to try for yourself. There are parts that don’t not work properly, for instance the Cython wrapper cannot handle numpy arrays yet, but as you will see below there are enough working parts that you can do some fancy stuff.

In an earlier blog post, I showed you that the C printer is able to print the necessary loops to implement summations over array indices. The Fortran printer has also this functionality now, so it can be used with f2py. Let’s start with a simple function to calculate numerically the trace of a matrix while running isympy:

In [1]: from sympy.utilities.autowrap import autowrap In [2]: m, n = symbols('m n', integer=True) In [3]: i = Idx('i', m) In [4]: j = Idx('j', n) In [5]: A = IndexedBase('A') In [6]: trace = autowrap(A[i, i])

That’s it. Now lets have a look at the function we created,

In [7]: trace? Type: fortran String Form: <fortran object at 0xa93bea8> Namespace: Interactive Docstring: autofunc - Function signature: autofunc = autofunc(a,[m]) Required arguments: a : input rank-2 array('d') with bounds (m,m) Optional arguments: m := shape(a,0) input int Return objects: autofunc : float

This informative docstring is created by f2py, and only slightly distorted by the blogging software. It tells us everything we need to know. Let’s try this function on a numpy array:

In [8]: import numpy as np In [9]: I = np.asarray(np.eye(1000), order='F') In [10]: trace(I) Out[10]: 1000.0 In [11]: timeit trace(I) 10000 loops, best of 3: 30.3 us per loop In [12]: np.trace(I) Out[12]: 1000.0 In [13]: timeit np.trace(I) 1000 loops, best of 3: 230 us per loop

First of all, please notice the parameter “order=’F'” passed to the numpy array. This ensures that the array is created with column-major memory order which is the Fortran format. Without that, the array is created with row-major order, and our Fortran function would run much, much slower.

Now to the fun part: Did you notice the speedup? The Sympy generated function is almost a factor of 8 faster that numpy! This is great, and I think I can be really happy about this result. But it is also time for a little warning: Please don’t take these numbers as some sort of absolute truth. For all I know I may be comparing apples and oranges.

In fact, the numbers I present are probably more relevant for a comparison of compilers and optimization flags. But, whatever the reason for the speedups, faster python callable functions is no doubt a good thing. For the record: I am running this on a little Acer Aspire One using numpy 1.3.0 shipped with Ubuntu 10.04. The autowrapped binaries are compiled with Ubuntu’s gfortran 4.4.3.

Now lets look at a more expensive array operation. Whereas the trace of an m*m matrix has a cost of order O(m), the product of two m*m matrices scales like O(m^3). First we need to express a matrix-matrix product in terms of Indexed Sympy objects and create the binary function.

In [14]: B = IndexedBase('B') In [15]: C = IndexedBase('C') In [16]: o = symbols('o', integer=True) In [17]: k = Idx('k', o) In [18]: expr_mat_mat = Eq(C[i, j], A[i, k]*B[k, j]); expr_mat_mat Out[18]: C[i, j] = A[i, k]⋅B[k, j]

The matrix-matrix product displayed at Out[18] is defined in terms of an Equality instance. This is a handy way to provide the code printers with both an expression and the variable it should be assigned to. For indexed expressions it is important that the indices on left hand side and right hand side are compatible. (Compatible means usually that the set of non-dummy indices must be identical, but with the exception that the right hand side can be scalar.) For the binary function, we get:

In [19]: matmat = autowrap(expr_mat_mat) In [20]: matmat? Type: fortran String Form: <fortran object at 0x8f35290> Namespace: Interactive Docstring: autofunc - Function signature: c = autofunc(a,b,[m,n,o]) Required arguments: a : input rank-2 array('d') with bounds (m,o) b : input rank-2 array('d') with bounds (o,n) Optional arguments: m := shape(a,0) input int n := shape(b,1) input int o := shape(a,1) input int Return objects: c : rank-2 array('d') with bounds (m,n)

Now let’s test it with a QR factorization on some numpy arrays:

In [21]: M = np.asarray(np.random.rand(100, 100), order='F') In [22]: Q, R = np.linalg.qr(M) In [23]: M_np = np.dot(Q, R) In [24]: M_my = matmat(Q, R) In [25]: np.linalg.norm(M_my - M_np) Out[25]: 1.7522660836e-14

Here is the timing:

In [26]: %timeit M_np = np.dot(Q, R) 100 loops, best of 3: 10.9 ms per loop In [27]: %timeit M_my = matmat(Q, R) 100 loops, best of 3: 6.11 ms per loop

This looks impressive! But, I also found that numpy has a much better scaling than what I get from autowrap on this computer:

In [53]: M = np.asarray(np.random.rand(400, 400), order='F') In [54]: Q, R = np.linalg.qr(M) In [55]: %timeit M_np = np.dot(Q, R) 1 loops, best of 3: 687 ms per loop In [56]: %timeit M_my = matmat(Q, R) 1 loops, best of 3: 1.65 s per loop

I believe that this bad scaling must be an issue with gfortran, or rather an issue with how I use it. The theoretical scaling of O(m^3) correspond to a factor of 64, and while numpy follows theory quite closely, (687/10.9=63.0) the autowrapped function misses by far. (1650/6.11=270) . On another computer, I tested the same calculations with Intels ifort compiler and got much better results that were consistently better than numpy and had the expected scaling.

I have more things I want to show, but this blog post is becoming rather long, so stay tuned for more exciting new functionality in the coming days.

]]>

I put up these improvements for review, and got valuable feedback that I spent the rest of the time implementing. In addition, I have addressed Ronan’s comments to my last blog post, and this has lead to stylistic improvements:

- Now an indexed expression is written with [] instead of (). This is more pythonic, as [] is used with all the builtin sequences. This is also consistent with numpy.
- Some classes have been renamed: IndexedElement is now Indexed, and the old Indexed class is now called IndexedBase. The new terminology should be less confusing.

The C code generator has become quite versatile and capable of handling complicated array expressions. Here is a little demonstration of my current codegen_C3 branch:

In [1]: A, x, y = map(IndexedBase, ['A', 'x', 'y']) In [2]: i = Idx('i', m) In [3]: j = Idx('j', n) In [4]: print ccode(A[i, i], assign_to=z) #trace of a matrix z = 0; for (int i=0; i<m; i++){ z = z + A[i + i*m]; } In [5]: print ccode(A[i, j]*x[j] + y[i], assign_to=y[i]) #matrix-vector 1 for (int i=0; i<m; i++){ for (int j=0; j<n; j++){ y[i] = x[j]*A[j + i*n] + y[i]; } } In [6]: print ccode(A[i, j]*x[j], assign_to=y[i]) #matrix-vector 2 for (int i=0; i<m; i++){ y[i] = 0; } for (int i=0; i<m; i++){ for (int j=0; j<n; j++){ y[i] = x[j]*A[j + i*n] + y[i]; } }

The code printer takes care of introducing and initialization of accumulator variables if needed, not the codegen module as previaously. The initialization is skipped if the left hand side variable is also present on the right hand side (as in statement In[5]).

Finally, you may have noticed that this post was a little bit delayed. This is because I was traveling in the weekend and didn’t get as much time for blogging as I had planned. I’ll make up for it by blogging twice this week.

]]>

The principal weakness of the Fortran code generator is that the support of array arguments has only limited testability. It is currently testable only on two levels:

1) The properties and behavior of IndexedElement instances,

2) The output from the Fortran code generator.

The missing piece in there, is the testability of complex expressions composed of IndexedElement instances. It is important to be able to test the consistency of indices in an expression, as it is very easy to create ambigous or even meaningless expessions of IndexedElement objects. For instance, the innocent looking expression: (which is considered undefined in my development branch)

x(i) + x(j)

would be ambigous at best. It could be (1) a scalar being the sum of components i and j of array x or (2) a symmetric rank 2 object a(i, j) where every component is made by adding components i and j of x. To avoid this kind of ambiguities, it is necessary to implement checks of index consistency. And it is preferable to implement it in the indexed module so that these checks are available whenever the module is in use. This gives the bonus of another level of testability:

3) The contraction structure of indexed expressions and the algorithms used to analyze the expressions.

With this in place, the reliability of array based code from the C code generator depends directly on how robust the analysis of indexed expressions is. The third level of testability is extremely important to make Sympy generated code a viable option for scientific applications.

In my branch codegen_C on github, these thoughts have materialized in a new file sympy/tensor/indexed_methods.py, as well as in modifications of the C code printer. I have implemented testable high-level interfaces that can analyze expressions and provide information about the index structure. The code printers should now be able to generate loops over arrays in a much more reliable fashion.

]]>

Also, Brian introduced me to the Python library called Theano. It “…allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.” Now, that description could very well have been a statement about my doings so far this summer. The Theano project has an emphasis on optimization, and is also able to perform calculations on a GPU if available, so it would be very nice if Theano could be used together with Sympy. After a little discussion with Brian and Ronan I was inspired to examine the possibility to use Theano as a back-end.

The expression tree structures in Theano and Sympy differ slightly, so I wrote a little module to convert the tree of a Sympy expression to a data structure with identical topology as the corresponding Theano expression. The idea is that expression trees on equivalent form can be converted between Sympy and Theano by a trivial mapping of the nodes. However, I was not able to follow this track further as I had to prioritize the review process of the Fortran code generation.

I’ve spent the rest of the time to bring the C generator up to par with the Fortran facilities. The current result is here, but it is not ready yet as the tests of array arguments are still XFAIL-ing. I don’t want to quick fix this by copying the approach I used for Fortran arrays, because there is need for a more robust implementation. This demands improvements in the indexed module, in particular functions to inspect and validate expressions. The code printers should rely on these functions to determine the correct implementation of the mathematical expressions.

The indexed module has become a central piece in my project as the way to represent arrays in Sympy. I think that the Indexed objects can be useful not only for generating loops in C and Fortran as I have done so far, but also for interfacing with other array-based systems, such as Theano, and also directly with Numpy. So improvements to the indexed module would benefit the interaction with all these array-based systems, and I should give it a high priority.

]]>

- Improved the Indexed and Idx classes, including tests and more information in docstring.
- Added syntactic sugar that allows creation of the indexed symbol (the `stem’) independent of the indices.
- Reorganized codegen by pushing logic up to the language agnostic super class.
- Reorganized tests of codegen so that all tests will be performed with several compilers and for both C and Fortran. It is now easy to add another compiler to the list and it will run through all existing tests.

The fixes are available in this branch.

In addition, I fleshed out the basics of a framework for automatic compilation and wrapping of SymPy expressions. Spurred by Ondrej’s comment on my previous post, I decided it was necessary to allow several backends. So the implementation is just a friendly function that will invoke backends for f2py, fwrap or Cython as appropriate. (Only f2py implemented so far.)

Apart from the coding I also spent time reviewing Matt’s work on Hilbert spaces. His work is very interesting to me as I plan to use his framework to implement quantum mechanical tests and examples of my own code generation efforts in the second part of the summer. After reviewing I am even more optimistic about that plan

]]>

Right after finishing a first implementation of array arguments last week, and bragging about it in a blog post, I realized it had some embarrassing shortcomings: The generated code was not compilable. Luckily, this was actually a minor problem that could be fixed by a one-line patch. More serious was it, that the matrix-vector product was not correctly implemented. To correct it I had to change the code logic, and it was a good opportunity to start refactoring the codegen module.

The goal with the refactoring was to make the data structures representing abstract code a bit more object oriented. For instance, the Routine class that represents a general routine in any programming language, was previously used just like a struct. It was used to store data describing the routine, but no logic was implemented in instance methods. By moving and adapting existing code lines I arranged it so that any Routine instance should be capable of initializing itself. To setup a Routine instance in my development branch, you only need to supply a name for the routine and a mathematical expression.

The Routine class is a very important piece of the codegen module. Viewing the codegen utility as a translator of mathematical expressions into a set of statements in a programming language, the Routine instances are responsible for storing information about how the math *has been interpreted*. So, figuratively speaking, there is need for something between the math expression and the Routine instance.

By placing the code that interprets math expressions in the __init__() method of the Routine class, it will be possible to enforce a consistent interpretation of the math, and a consistent storage of the data describing the routine. A consistent storage of conclusions from the interpretation step is important, so that a code generator that implement the routine in a specific language can get the information reliably.

This Monday I shifted my focus away from code generation, and to the task of compiling the generated Fortran and C code. I have been reading selected chapters and sections in a book by Hans Petter Langtangen. (In that book, by the way, a Sympy example session is listed on page 185 )

From reading that book, and after reading in the f2py user documentation, I have started to think that f2py can be the tool of choice for compiling SymPy expressions into python callable binary subroutines. Apparently, f2py can be used to compile and wrap both Fortran and C functions, so it covers everything I need for the GSOC project. Langtangen also states that f2py is easier to learn and less error prone than SWIG.

F2py is integrated with numpy. In fact, since 2007 it is maintained as part of the numpy project. SymPy already has some interaction with numpy, such as the symarray, so by relying on f2py I can extend the functionality of SymPy without introducing new dependencies. If the goals of my project can be realized with f2py, it looks like the perfect option.

Oh, and by the way, the code generated from the matrix-vector product in test_codegen.py now compiles, and the implementation is correct. It even compiles with f2py, after which it can be imported in and called from a python session. Using appropriate numpy arrays, you can calculate matrix vector products in compiled fortran. Check it out!

]]>

]]>

I spent a few days reading (GiNaC, distutils, numpy, fwrap, f2py, swig) and thinking before I set down to code on Thursday. By then I had reached the conclusion that I would need to implement Indexed objects. These will represent arrays in the Sympy expressions before code generation. Indexed expressions, e.g. a matrix-vector product denoted as A(i,j)x(j), contain information that is not explicit in the corresponding abstract notation Ax. In order to generate the code unambiguously, all relevant information must be explicit.

Having a specialized Sympy object that correspond to arrays in the code means that the user gets access to all the available expression manipulation functionality before the code is generated. The downside is that you need to work with indexed expressions, instead of abstract symbols. However, to translate non-indexed expressions to the corresponding indexed expression is just a call to Expr.subs, so it can be done just in time before code generation.

Over the weekend I managed to implement array argument handling in the fortran code printer, and FCodeGen, but I have not yet tested how the code compiles. This week hope to get the fortran to compile and also to implement everything in the C code generator. Maybe I can also refactor a little and move language agnostic code to the CodeGen super class.

[oy@ubuntulaptop sympy (fortran_codegen3)]$ git shortlog --since=1.weeks --author=Øyvind Øyvind Jensen (21): Fix issue 1920: SymTuple doesn't rebuild itself Moved SymTuple to sympy/core/symtuple.py + imported in core/__init__.py First implementation of Indexed and Idx indexed.py: Added example with matrix-vector product to module docstring move function wrap_fortran() to FCodePrinter._wrap_fortran() Implemented optional free-form source format for Fortran code printer Added tests for free-form fortran code + fix bug in comment line wrapping Created codegenerator FCodeGen for generating compilable fortran code Fixed double space after '=' in generated free form code fix fortran line wrapping for human=False FCodeGen should should use printer object, not fcode Added loop functionality to FCodeGen FCodePrinter: implemented code indentation for fortran free-form Fixes failing test due to AttributeError in matplotlib version 0.91.2 (Is Implemented Fortran declarations of array input arguments Added property `dimensions' to Indexed Implemented OutputArgument and InOutArgument in FCodeGen Move loop generation into FCodePrinter + add declaration of local variabl Fix leading blank in fortran codegen Added test for FCodeGen with array arguments (matrix-vector product) Added debug helper function for string comparison

]]>

]]>