I finished all my exams on time, so last Monday I could starting working full-time on Krita. I’ve now been at it for a full week! How does it go?
First week according the plan was aimed at measuring the speed of Krita. We talked about our bottlenecks on IRC regulary. We also talked about them in Oslo. But we didn’t have any numbers. Sven Langkamp did some benchmarks using QTime on iterators, there were some performence tests scattered in unit tests etc. Boudewijn has decided to use new benchmark features from Qt4.6. So I wrote 10 benchmark classes where we benchmarked stuff like internal data memory management for images – our tile engine. Some access classes which access pixels for use called iterators. They allow to iterate over pixels in various ways. Vertically, horizontally , in small rectangles, randomly etc.
Another important thing is compositing. That is the work of class called KisPainter (something like QPainter in Qt but with different complicated features not available in QPainter). We benchmarked the speed of bitBlt operation with two types of memory storage. First is KisPaintDevice and second one is called KisFixedPaintDevice. The later one is lightweight version of the first one. It is similar to QPaintDevice in Qt but again with more complicated features.
Øyvind Kolås a.k.a. pippin, Gegl developer is around. Pippin shared his knowledge about benchmarking with us on irc (btw come and visit as at #krita on irc.freenode.net) We decided to make tests for blur and brightness/contrast filter. The first one is convolution filter, the second one is here because we wanted to be able to compare to GIMP.
We also made a benchmark for the image projection. The projection benchmark loads an image in Krita’s native format, computes the whole image constructed from various types of layers like group, filter, adjustment layer etc. and in the end we again save the file into native Krita format. Our focus through this plan is to speed up painting. So we can’t avoid stroke benchmarks.
Thanks go to Sven Langkamp for his work on presets saving/loading. Using paintop presets, we can use the benchmark code I did, for any paintop. We benchmarked our autobrush default paintop. It is most used paintop for digital painters. I’m very happy about this benchmark as I can test my other paintops easily, all I need to do is create a preset for any paintop, save it in Krita and run the benchmark with the preset.
There results of the benchmars are on our wiki.
So it looks: The data manager, which is responsible undo/redo and basically for storing and retrieving data is fast. It allows us to read/write data at very high speeds. From 1333.3 Mb/s to 1628.4 Mb/s, according to the benchmark results. We benchmarked on 4096×4094 RGBA image (64 Mb) which was read/wrote 100 times. For comparison memcpy for two buffers of the same size as the image we used is almost the same speed. There is benchmark code for memcpy in KisDatamanagerBenchmark, you can try it yourself if you want
Horizontal and vertical iterators are 11 times slower then Rectangular iterator. The reason is that there is no caching. From valgrind logs we see that fetching and switching tiles is very slow, so we need to implement caching there. Every 64 pixels a new tile is fetched and switched to. Why? Our tile is size of 64 pixels. This slow downs the iteration. We will cache the both iterators to avoid switching and fetching. We will cache tiles, we don’t do that now. The rectangular iterator is quite fast, does not offer so many opportunities to improve. The random iterator on the other hand is the slowest one. Again fetching and switching to tiles is expensive. Some caching strategy would be handy for moving around the image. But the use case for using the random accessor is different so the cache strategy should be somehow adaptive. The random accessor is 13-times slower than rectangular one.
The compositing operation also known as BitBlt of KisPainter is used very often. There is room for improvements, because currently it uses a slow iterator – random iterator. The speed is very decent, but we will try to make a lot faster.
Filters are very slow compared to GIMP. At least according the numbers pippin provided. But first we need to improve underlying issues like the aforementioned iterators to improve their performance. We blur the image with speed of 0.8 Mb/s. Here we need to optimize the convolution painter. Also the speed could be better when we optimize the horizontal iterator.
Strokes are slow because of the recomputation of the brush mask is needed every dab – every time brush touches the canvas in certain point. That’s slow. You can see from the valgrind log, that the math function atan2 is slow. We have to cache the result to avoid this.
The conclusion at the end of the first week is that we need to cache iterators and cache the brush mask. Some parts of Krita have very nice speed like the tile engine. Then we have slow parts like iterators. I think we can gather a good performance boost with our plan.
About iterators: iirc, KisRandomAccessor uses memmove for for storing cyclic cache instead of (i%N). I thing this is the easiest thing to fix =)
About bitBlt: we need to implement directBitBlt inside datamanager, that will be used when color spaces coincide. It will share the tiles directly. It should be really fast on tiles3.
Wouldn’t that only work for composite_copy or composite_over with 100% opacity? Or do you want to pass the compositeop object to the datamanager?
just a tipp for the speedup of the atan2 function:
in the old amiga/atari days we used precomputed memory maps , like x = sinus_array[y] , which should also be possible for arc-tan (split to sin/cos)
this give a huuuge speedup there.
It might be worth tracking which compiler version and compiler options you are using for this. Obviously that could make some difference, although I readily concede it may not be very much in the end.
@chickenpump: yes, we decided to do lookup table for atan2. Problem is that it accept two parameters so it will be 2D lookup table and quite big probably.
But we plan to add lazy initialization, so it maybe quite fast.
Wouldn’t using OpenSolaris with DTrace be better for finding performance bottlenecks?
My guess would be that to find hardware that works with OpenSolaris, installing it, making Lukas familiar with opensolaris, then learning dtrace could easily eat up a month or two, maybe three. Systemtap might be an option since Lukas is already using fedora. But I doubt that systemtap would help us more than valgrind at this stage. Maybe when we’ve optimized the hell out of our algorithms, memory and cache usages and are faced with weird latencies or io issues.
Maybe approximating the atan with something else would be an option? E.g. use some integration formula on 1/(1+x²), or use a polynomial approximation.