Sharing some homeworks

So, I’ve spent some time last weeks doing some homeworks for courses at Politecnico di Milano. One of them is for the course “Image analysis and synthesis” in which I did together with a friend of mine some animations in Blender.

The goal was to make some animations of sport scenes and after apply a kind of “filter” in Matlab to simulate a high exposure of the camera, or, if you prefer, to draw the trace of the ball. As the only object moving in scene was the ball, these two are roughly the same thing. This is done by taking the mean of W frames and making a new one from them, where W is the “window” of the algorithm.

We took four sports: basketball, table tennis, golf and bowling. The amount of things produced is \~ 400MB, so I’m not putting it online for everyone. If anyone is interested in the Blender models, ask me privately (you find my email in the “About” link above). I put some animations in youtube, sou you can see them:
Table tennis:

3D animation is one of the areas I admire, but definitely isn’t an area I’m good at. Also, this was the first time I did something serious with blender, so don’t expect superb results (and continue reading this post if you gave up after seeing the videos ;-) ).

After thinking a while about the time Matlab took to make the final results I decided that it could be better… Two things could be improved: (i) use more threads to calculate the new images and (ii) use CPU optimized CPU instructions. I think it’s a shame to have a dual core with lot of specialized instructions to do things faster and don’t use them. These two goals brought me to code the algorithm in C.

I used pthreads to divide the work among threads (you say at command line the number of threads you want, or just say it to use the number of available processors) and freeimage to load and save images. I did made another version that uses gdk-pixbuf to load and store images and it took me so disappointed. To simply load and save images DIFFERENT images (so, things that are not correlated) it needs to use synchronization mechanisms. This leads to an algorithm not scaling very well as it could. I also tested another one with CImg… and the result was so disappointing too. I think not every developer is performance-ish as I am. Or not everyone cares about it as long as it is easy to use. So, stop talking and show the results:

[caption id=”attachment_125” align=”aligncenter” width=”640” caption=”Time to compute in each implementation”]Time to compute in

The best version run in less than half of the time required by Matlab! I could also optimize more the algorithm in C to get even better results, but I was already ok with that (read: I have other things to do). It’s also worth noting the improvement when optimizing the code for my CPU: with one thread I got a boost of \~20% and with 2 threads of \~18%. Particularly in this type of algorithm, it’s important to use SSE/SSE2 instructions to get a faster mean of the images. In fact, viewing the assembly code generated in this case we find these instructions as below (the xorps, movaps are SSE instructions that operate with xmmN registers).

 8049517:       0f 57 f6                xorps  %xmm6,%xmm6
 804951a:       8b 41 04                mov    0x4(%ecx),%eax
 804951d:       8b 4d 08                mov    0x8(%ebp),%ecx
 8049520:       39 41 08                cmp    %eax,0x8(%ecx)
 8049523:       0f 82 36 06 00 00       jb     8049b5f <worker_thread+0x6ff>
 8049529:       8d 7d cc                lea    -0x34(%ebp),%edi
 804952c:       0f 29 b5 08 ff ff ff    movaps %xmm6,-0xf8(%ebp)
 8049533:       89 bd 04 ff ff ff       mov    %edi,-0xfc(%ebp)
 8049539:       89 3c 24                mov    %edi,(%esp)
 804953c:       89 44 24 08             mov    %eax,0x8(%esp)
 8049540:       c7 44 24 04 64 a2 04    movl   $0x804a264,0x4(%esp

Awesome!! GCC automatically generated a SSE optimized code. If you got interested, see the wikipedia article about SSE.

The code is available under GPLv2 license at github: Let me know if it’s useful to you.