So, I’ve spent some time last weeks doing some homeworks for courses at
Politecnico di Milano. One of them is for the course “Image analysis and
synthesis” in which I did together with a friend of mine some animations
in Blender.
The goal was to make some animations of sport scenes and after apply a
kind of “filter” in Matlab to simulate a high exposure of the camera,
or, if you prefer, to draw the trace of the ball. As the only object
moving in scene was the ball, these two are roughly the same thing. This
is done by taking the mean of W frames and making a new one from them,
where W is the “window” of the algorithm.
We took four sports: basketball, table tennis, golf and bowling. The
amount of things produced is \~ 400MB, so I’m not putting it online for
everyone. If anyone is interested in the Blender models, ask me
privately (you find my email in the “About” link above). I put some
animations in youtube, sou you can see them:
Bowling: http://www.youtube.com/watch?v=YrgkLCTrsVg
Table tennis: http://www.youtube.com/watch?v=HmrDPdDevRM
Basketball: http://www.youtube.com/watch?v=TjeOXhxceWA
3D animation is one of the areas I admire, but definitely isn’t an area
I’m good at. Also, this was the first time I did something serious with
blender, so don’t expect superb results (and continue reading this post
if you gave up after seeing the videos ;-) ).
After thinking a while about the time Matlab took to make the final
results I decided that it could be better… Two things could be
improved: (i) use more threads to calculate the new images and (ii) use
CPU optimized CPU instructions. I think it’s a shame to have a dual core
with lot of specialized instructions to do things faster and don’t use
them. These two goals brought me to code the algorithm in C.
I used pthreads to divide the work among threads (you say at command
line the number of threads you want, or just say it to use the number of
available processors) and freeimage to load and save images. I did made
another version that uses gdk-pixbuf to load and store images and it
took me so disappointed. To simply load and save images DIFFERENT images
(so, things that are not correlated) it needs to use synchronization
mechanisms. This leads to an algorithm not scaling very well as it
could. I also tested another one with CImg… and the result was so
disappointing too. I think not every developer is performance-ish as I
am. Or not everyone cares about it as long as it is easy to use. So,
stop talking and show the results:
[caption id=”attachment_125” align=”aligncenter” width=”640”
caption=”Time to compute in each implementation”][/caption]
The best version run in less than half of the time required by Matlab! I
could also optimize more the algorithm in C to get even better results,
but I was already ok with that (read: I have other things to do). It’s
also worth noting the improvement when optimizing the code for my CPU:
with one thread I got a boost of \~20% and with 2 threads of \~18%.
Particularly in this type of algorithm, it’s important to use SSE/SSE2
instructions to get a faster mean of the images. In fact, viewing the
assembly code generated in this case we find these instructions as below
(the xorps, movaps are SSE instructions that operate with xmmN registers).
8049517: 0f 57 f6 xorps %xmm6,%xmm6
804951a: 8b 41 04 mov 0x4(%ecx),%eax
804951d: 8b 4d 08 mov 0x8(%ebp),%ecx
8049520: 39 41 08 cmp %eax,0x8(%ecx)
8049523: 0f 82 36 06 00 00 jb 8049b5f <worker_thread+0x6ff>
8049529: 8d 7d cc lea -0x34(%ebp),%edi
804952c: 0f 29 b5 08 ff ff ff movaps %xmm6,-0xf8(%ebp)
8049533: 89 bd 04 ff ff ff mov %edi,-0xfc(%ebp)
8049539: 89 3c 24 mov %edi,(%esp)
804953c: 89 44 24 08 mov %eax,0x8(%esp)
8049540: c7 44 24 04 64 a2 04 movl $0x804a264,0x4(%esp
Awesome!! GCC automatically generated a SSE optimized code. If you got
interested, see the wikipedia article about
SSE.
The code is available under GPLv2 license at github:
http://github.com/lucasdemarchi/imgmean. Let me know if it’s useful to you.