The second week of our action plan for optimizing Krita was devoted to optimizing painting in Krita. Although there are many great paintops in Krita, digital painters tend to use most of the time the simple default brush engine which we call Pixel Brush. Painters can use GIMP brushes here, the text brush, but the most used brush tip is called Autobrush. You can setup the brush attributes like shape (circle, rectangle) and you can change the ratio to get an ellipse. Then you can change softness by vertical and horizontal fading. If you play with spikes and ratio, you get stars and other funny shapes. The brush has many dynamic attributes thanks Cyrille Berger’s work on concept called sensors. E.g. tablet can control the size by pressure, by tilt or anything you want (e.g. interesting is drawing angle) and it can be tuned by curve.
The algorithm, which computes the brush mask, stamped on the canvas regular as you stroke, is computed by the KisCircleMask::valueAt(). It is a computationally expensive function according valgrind logs we did week ago and many times before. David Revoy, team member of the Durian project, said that using 70px brush on 2500×2500 image was very slow in Krita. So we needed to optimize that.
I started with exploration of the code. I’m not the author of the autobrush, though I did most of the paintops in Krita (10 paintops are mine out of 19, many are experimental). First catch was the interpolation in the brush mask computation. We called valueAt() 4 times per pixel of the brush mask. We found out with Cyrille that the valueAt function used to take integer parameters a long long time ago and double values of the brush mask pixel positions were computed with interpolation. So I decided to remove the interpolation as the function has been capable to take double input long time ago. And the results of the valueAt() are more precise then interpolation. The benefit was great. Painting was 4x faster. Benchmark for random lines with changing size according pressure dropped from 18 seconds to 4 seconds on 4096×4096 image. Check it in the wiki, related table is called Just with performance fix.
From the valgrind logs we noticed that the atan2 function is called too often. “Chickenpump” suggested some cool old school tricks in comments. And so we gave that a try. I used double hashing with QHash in QHash for the 2D function atan2, but that was very slow due to low cache hit ratio and expensive hashing. Then Cyrille posted some links with free code which implemented a fast atan2 with an internal lookup table. So I ported that code to Krita. Cyrille did some magic stuff like computation with fixed precision on library loading time and some little tune ups to speed the fast atan2 computation and we managed to get more speed up around 1.3x faster then without fast atan2 function. There is probably some more room for optimizations as the fast atan2 implementation uses a quite small lookup table. Also I tested some other implementations, but it had problem with precision. It had 3 degrees error. That is too much for us, so I dropped that.
I remembered a quite interesting magic function for fast inverse square root used in Quake III. So I gave it a try as we use inverse square root in valueAt() too. I found out by benchmark that fast inverse square root is slower then directly computing the inverse square root (1/sqrt(x)). It used to be 4 times faster a long time ago. Probably Intel implemented that in processors already. Or the optimization done by compiler was not so effective in case of fast version. Again we dropped that.
Most of the use cases for painting include brush masks which are symmetrical. The algorithm could compute just 1/4 of the mask. Next step was implementing this.
First version used 4 pointers to the memory and compute 1 pixel and copy 3 pixels to the right region. I managed to get another 1.7x speed up (from 3.555 ms to 1.9 ms).
Memory access is very important and can slow down computation. It is like when you use setPixel/pixel method to access pixels in pixel buffer – you supposed to use scanlines, that is faster. Here is some interesting article about it. If you don’t have something to read, here is also nice CPU memory bible.
First version used 4 iterators over image pixels. One computation per pixel. And copy the values.
So I decided to make it little faster just by using two pointers. I compute 1/4 of the mask and copy this part to the NW region. And then I copy the rows in the lower part of the mask in correct order – I mirror it.
Improved version used two iteratiors and memcpy the second half of the brush mask.
I found out on friday evening that it does not work though. Circle masks seems symmetrical from user point of view, but they are not. The brush mask respects sub-pixel precision in Krita, so the edge pixels of the circle are not symmetrical. The sub-pixel precision is visible when you work with big zoom level. I have an idea for computation 1/4 of the brush mask, but I decided to post-pone it.
Other possibilities are still around:
- mip-mapping : pre-compute various levels of brush mask to buffer and interpolate the masks. We do this for Gimp brush. We would interpolate two computed brush masks instead of computate the single mask. Maybe it could be faster, maybe not. The reason for mipmap in GIMP brush painting we have, was to increase the quality of the scaled brushes as Adrian Page, hacker in the Krita team, wrote me in an email. Mip-mapping would require to split rotation from the mask computation. This can lead to different results regarding of the brush mask. Now we consider the rotation in the mask computation. Then we would rotate un-rotated mask by image processing – rotate image. Some conformation rendering test would be needed. The advantage would be support for rotation of the gimp brushes in Krita.
- cache the brush mask for mouse users: cache the dab as the mask doesn’t change. This would be nice if we did not compute sub-pixel precision. But we do that, so the cache hits ratio would be very small. It would be usable for 100% zoom, when sub-pixel position is zero. And of course big condition for checking of the parameter changes would be required.
- Compute the mask with graphics card – use shaders: that would be cool, I have some initial experience with shaders but integration would be harder and probably too experimental for our plan. I’m mentioning this as we discussed this with Sven Langkamp in Oslo and so that it is not forgotten.
- We will probably do some garbage recycling – memory allocation is slow, we can benefit from recycling memory. It is a matter of discussion on IRC at #krita on freenode. You are welcome to join.
Final time of the computation in benchmark for random lines is 1,449.2 ms. It dropped from 18,576 ms. So the painting was 16xtimes faster.But I revert the 1/4 of the brush mask speed up, so the current speed is 3.555 ms. Painting will be 6xtimes faster. The speed is considered to be usable for big brushes now. I invite you to do check-out of the trunk and try to play with big brushes. 200 px is now very usable on my laptop. What about yours?
I updated my WordPress blog. I dropped the previous classic WordPress theme and selected the default one – lazy developer. I did not like the font in the previous one. I don’t have much time to play with web-designing these days. But at least I customized the default Kubrick theme. I changed the fixed width of the theme to wider values. I did also simple custom header with some random strokes with my paintops in Krita. I hope you will like it. Every image in the blogpost is made in Krita.
Very detailed and very nice article. Keep on rocking.
Ou great work&great english -)
Is sub-pixel precision even a desired feature at all? If I zoom an image when editing, it’s to make it easier to hit the right pixel, I sure don’t expect nor want the results to depend on where in the zoomed pixel I’m clicking!
@Kevin: yes, sub-pixel precision is desired feature by digital painters. If you ignore it in brush mask computation, you get serious artefacts. Now you are clicking to the right pixel and the mask is computed so that it is anti-aliased correctly.
Quite interesting!
You write about the thought of using gpus to speed up things. Do you have thought about using OpenCL? I know at this stage most drivers are of beta quality but in the future that is for sure to change.
Advantages of OpenCL are that it also works on a range of CPUs if for example the gpu does not support it. One backdraw might be portability of the code as some architectures are not supported. So you would end with a lot of ifdefs.
OpenCL has build in support for many math-functions like atan2 so that might be quite fast. Further one fantastic advantage is automatic support for SSE2 on cpus if you use the build in vector types like int4 or float4.
http://developer.amd.com/gpu/ATIStreamSDK/ImageConvolutionOpenCL/Pages/ImageConvolutionUsingOpenCL.aspx is an example of OpenCL and performance though keep in mind that the drivers at that time were only of beta quality and not optimized that much. Btw. as the author suggests you should really use the C++ bindings as they make error checking amongst others a lot easier. And as an endnote I only have very little experience with OpenCL just tried some very small programs but I like the overall concept.
I have seen CUDA a little and I studied OpenCL bit. Drivers are problem. E.g. I have NVidia graphics and only version 190.31 or so support OpenCL. So it is kinda weird now :/
You can use the ATIStreamSDK SDK for using OpenCL with cpus even if you have a cpu from intel, iirc you only need libopencl or what it is called and set the correct env vars.
It only goes to show where there’s will there’s a way. Keep on trying.
Here, http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/ the author computes sqrt by multiplying the inverse square root of x by x.
Q3’s fast sqrt * x is around 3.5 times slower than SSE rsqrtss * x. GCC likely generates a call to rsqrtss for your 1.0f / sqrt(x) code.
Pingback: Parallel brush mask processing « Sven's Blog