I am working on something here where I need to normalize a large number of small vectors. That means picking up lots of pairs of numbers, calculating the sum of squares, and then taking the square root and dividing the numbers by it. To get the norm you can also divide the squared sum by the square root, what is the same as calculating the square rood directly.

Square roots are a pain to calculate, and so are divisions. If we could transform all that into an approximate calculation using just additions and multiplications it would be all much better.

This is an old problem of mankind, and many methods to calculate the square root of a number, and also more specifically the reciprocal of the square root of a number (x^-1/2) have been devised. Many of them are simply implementations of general methods to fin the roots of functions, such as the Newton-Raphson method. There is a curious and now famous algorithm that performs a single iteration of this method after calculating a good starting point making a very beautiful use of how the floating-point values are represented, and you should read if you like to brush bits (more info on Wikipedia).

Today we can use something instead of this code. Most modern processors have an instruction to aid in this calculation of approximate reciprocal square root, because it’s so much important for ocmputer graphics and other things. In Intel’s SSE instruction set this is the rsqrt instruction, and there is a similar one in ARM’s NEON too.

In my work I am using Python, and according to the profiler, there were a coupls of places where my code could use some optimization. One of them is precisely the calculation of the reciprocal square root, and I could use only a fast approximation. So I decided to try to write a module that would allow me to make the calculation from a numpy array, using this instruction.

There are many ways to do that. you can use scipy.weave, blitz and other things. It seems the most interesting to me would be using Cython, which is the continuation of Pyrex. It is program that lets you write your code in a kind of a python-like language, aiming at full python compatibility, and build a python module based in C. In fact, it generates a C code that you later compile into a python module, and as such you can easily integrate with other C code and system libraries.

So what I have done was first writing in C this function that calculates the approximate reciprocal square root using that special SSE instruction. I used this code from a guy at stackoverflow to make the compiler run the instruction using the so-called intrinsics, and then extended it with the Newton-Raphson step that was missing to improve the precision.

// file rsqrt.c // // Contains function SSE_rsqrt to calculate reciprocal of square root // using SSE helper instruction plus a single Newton-Raphson iteration. // // Ripped off from example posted by dude at stackoverflow, and // extended with a Newton-Raphson step. // To use with Cython, compile with gcc -fPIC -c rsqrt.c -o rsqrt.o // stolen and ressocialized by nwerneck@gmail.com - 16/09/2010 #include inline void SSE_rsqrt( float *pOut, float *pIn ) { // compiles to movss, sqrtss, movss _mm_store_ss( pOut, _mm_rsqrt_ss( _mm_load_ss( pIn ) ) ); // Newton-Raphson step *pOut *= ((3.0f - *pOut * *pOut * *pIn) * 0.5f); }

After you write that, you need to get ready to use Cython. First you write the setup.py file to use distutils to make your life easier.

from distutils.core import setup from distutils.extension import Extension from Cython.Distutils import build_ext ext_modules = [Extension("rsqrt", ["rsqrt_test.pyx"], include_dirs=[], extra_objects=['rsqrt.o'],libraries=[])] setup( name = 'rsqrt test', cmdclass = {'build_ext': build_ext}, ext_modules = ext_modules )

Notice how I included an *extra_objects* option with the object file we build previously. Now we go onto write our Cython pyx file.

def extern void SSE_rsqrt(float*, float*) def test_rsqrt(float y): cdef float x = 0.0 SSE_rsqrt(&x,&y) return x

This code has first an *extern* line with the declaration of the function at the file we compiled in C. Then it declares a Python function that we use to call from within Python. Finally, here is the test code.

import math import rsqrt sai= rsqrt.test_rsqrt(0.5) print sai print 1/math.sqrt(0.5) print sai-1/math.sqrt(0.5)

The output is

`1.41421341896`

1.41421356237

-1.43412523634e-07

The second line is the correct answer, but the value we calculated was pretty close, as we expected.

I tried to measure the time of the instruction, but it wasn’t too good. There are probably variable conversions in the way that make it slower. Where it really brights is at processing large vectors, as I will test soon.

I loved to do that because this shows Python’s great strengths, and it’s a good example of something that is possible to do but I never had really done before. Python is a very nice and clear prototyping language that can be extended in all sorts of ways. It means you can have good line profilers for example, that I used in first place to detect that the rsqrt step was indeed one of the critical ones in my procedure, and also that you can easily write stuff in C to make your code faster. More than that, this C code can even go all the way down to assembly! Currently numpy has no support for this kind of SSE rsqrt based calculation that I know of, but we have the freedom to try it using things like Cython, and it’s not too difficult once you learn how it works.

Pingback: How to generate floating-point random numbers efficiently « xor

Henry GomersallI am actually of the opinion that Python is a great language to do far more than just prototype. Architect in Python and optimise in C is a great way to write code.