Week 8: Vectorization cancelled

In week 8 I started working on a part of Krita I had never touched before: a library inside KOffice called Pigment. Pigment is responsible for color management. It contains some colorspaces we don’t actually use in Krita because they are too simple for Krita’s needs; they are intended for the other KOffice applications. So far those applications do not have color management, but they can add it if their developers want to. Pigment contains many useful classes that operate on colors in a colorspace in various ways, like doing the channel math or computing histograms. It also contains the implementation of the composite operations, and that is where my interest was: we wanted to try to vectorize the composite operations using SSE instructions and the vectorization features of GCC 4.x.

So first I wrote benchmarks for the various composite operations, and then I started to work with the GCC feature. Vectorization of composite operations is already implemented in GEGL by Øyvind Kolås, and GIMP 2.6.x uses MMX, SSE and SSE2, so I had inspiration and tried to map those approaches onto our implementation of the composite operations. GEGL uses nice, readable code, but GIMP uses assembler directly, and I don’t have much experience with assembler. I wish I had taken the assembler course at university last year. It is not hard to code something in assembler, but it is hard to do it correctly: I read somewhere on the Inkscape mailing list that they had some assembler code to speed up some work, and it ended up being slower than the code optimized by the compiler.

I had quite a hard time and did not manage to implement the vectorization, even with the help I got regularly from Cyrille, boud, other Krita hackers and a GEGL hacker. We stopped on Wednesday because we discovered that the issue is more complicated than we thought and would require much more than two days to finish, maybe another week or even weeks, and the result might not even be faster. One of the problems we discovered is that the RGBA 8-bit colorspace uses the unsigned char datatype (quint8 in Qt) for memory storage, but when you do a composite operation you have to widen it to a bigger data type like an int32. Why? If you have one pixel with value 255 and another with value 200 and you add or multiply them, you overflow the data type and the result is wrong. If you widen first, the problem is solved. I studied the GIMP code and how it is implemented there: it uses MMX instructions for this case, since the MMX technology supports both saturating and wraparound arithmetic. You can read about that in more detail here. Another issue is that GIMP does not vectorize every composite operation; only some of them are implemented this way.
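The wraparound problem and the widening fix can be sketched in plain C++ (a minimal illustration, not Pigment’s actual code; `quint8` is just Qt’s name for an unsigned 8-bit integer):

```cpp
#include <cstdint>

typedef uint8_t quint8;  // Qt's typedef for an unsigned 8-bit integer

// 8-bit addition wraps around: 255 + 200 = 455 does not fit in a
// quint8, so the stored result is 455 - 256 = 199.
quint8 addWrapped(quint8 a, quint8 b) {
    return quint8(a + b);
}

// Widening to int first lets us clamp (saturate) at 255 instead.
quint8 addWidened(quint8 a, quint8 b) {
    int sum = int(a) + int(b);
    return quint8(sum > 255 ? 255 : sum);
}
```

The saturating variant is exactly what the MMX/SSE instructions mentioned above do in hardware, sixteen bytes at a time.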

So we decided to start work on another item in the action plan: mirroring of the canvas, and possibly rotation. On Thursday and Friday I was back in the canvas code. So far I have working code for mirroring the events from input devices. Now the hardest part will be to implement mirroring in the projection. The projection is the code responsible for correctly displaying the zoomed image; it computes the image you see in the canvas when you scroll, when a tool paints its outline, or when some part of the image is changed by a tool. The task will also continue for the OpenGL canvas, as we have two canvases in Krita.

You see, the vectorization week did not bring any speedup, but I don’t want you to be sad, so I decided to write a blog post about other Krita work I do in my spare time. Read it here.

This entry was posted in Krita.

13 Responses to Week 8: Vectorization cancelled

  1. Jos says:

    The Eigen math library has very nice code to do vectorization. If you use the vector and matrix classes from Eigen, you get vectorization for free. Eigen is a header-only library, so there is no runtime dependency. KSpread already uses Eigen.

  2. LukasT says:

    We already use Eigen2 in Krita, but it is hard to map it to composite operations. Maybe we already benefit from vectorization in some places, but not in composite operations.

  3. I am interested in more details of what problems you have to solve. Maybe they would make for good practical courses or topics for a Bachelor thesis or so… :-)

    Anyway, do you know about the PADDSB PADDUSB PSUBSB PSUBUSB instructions for saturated +/- on signed and unsigned chars? They are part of SSE2 and can be accessed easily from C++ code via (the almost standard) intrinsics.
    AFAIK multiplication is not supported for 8bit types with SSE, but you can convert to 16bit shorts and then use PMULLW. Conversion between 8bit and 16bit can be done with PACKSSWB/PACKUSWB and PUNPCKHBW/PUNPCKLBW.
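    The intrinsics Matthias mentions are straightforward to use from C++; here is a minimal sketch (illustrative, not Krita code) of `_mm_adds_epu8`, which compiles to PADDUSB and clamps each unsigned byte at 255 instead of wrapping:

    ```cpp
    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstdint>

    // Saturating byte-wise add of src into dst, 16 bytes per step.
    void addSaturated(const uint8_t *src, uint8_t *dst, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
            __m128i b = _mm_loadu_si128((const __m128i *)(dst + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(a, b));
        }
        // (a scalar tail loop for the last n % 16 bytes is omitted)
    }
    ```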

    You also might be interested in Vc (http://gitorious.org/vc) which is a library (well, mostly just headers) to make writing portable SIMD code easy. I wrote it to vectorize an application that makes a bad fit for the SSEx instructions, still keeping a clean code-base. Please contact me if you’re interested in using it. It is not ready for real work with 8 bit integers, but I’m sure it can be made useful for your task.

  4. LukasT says:

    Hi, the problem is that I wanted to avoid using pure assembler; I wanted to do it the way GEGL solves it here: http://git.gnome.org/browse/gegl/tree/gegl/gegl-simd.h

    Compositing in GEGL is implemented without some of the abstraction we have in our colorspaces, so I can’t just do something like (vec1 + vec2)/vec3, because we do some checks per channel and the math differs according to the colorspace and its alpha channel position. The other problem is that we store the data as unsigned char no matter what the colorspace bit depth is, and then do the compositing math according to the bit depth, so we reinterpret the data. I did not know how to do this fast, so that a pixel buffer stored as Qt’s quint8 becomes GCC’s vector of int, short, float or double.

    I didn’t want to go into pure assembler, as I noted in the blog post, and I had only one week for this.
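    For the record, one way the quint8-buffer-to-wider-lanes step could look with GCC’s vector extensions (a sketch with made-up helper names, not Pigment API; needs GCC or Clang):

    ```cpp
    #include <cstdint>

    // 8 x 16-bit lanes via GCC/Clang vector extensions
    typedef int16_t v8hi __attribute__((vector_size(16)));

    // Widen 8 quint8 values into 16-bit lanes; when the compiler
    // vectorizes this loop it maps to PUNPCKLBW with a zero register.
    v8hi widen(const uint8_t *p) {
        v8hi r;
        for (int i = 0; i < 8; ++i)
            r[i] = p[i];
        return r;
    }

    // Narrow back with saturation, like PACKUSWB does in hardware.
    void narrowSaturated(v8hi v, uint8_t *p) {
        for (int i = 0; i < 8; ++i) {
            int16_t x = v[i];
            p[i] = x < 0 ? 0 : (x > 255 ? 255 : uint8_t(x));
        }
    }
    ```

    Between `widen` and `narrowSaturated` the actual compose math can be written as ordinary vector arithmetic (`a + b`, `a * b`, …) without any wraparound in the 8-bit lanes.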

  5. Alex says:

    Another library you may want to look at is Orc: http://code.entropywave.com/projects/orc/
    It’s been used to great effect in the Schrödinger implementation of Dirac.

  6. Robert says:

    Hmm. I’ve had good experiences using the GCC vector extensions & various intrinsics.

  7. Can you give a pointer to the code that you wanted to vectorize?

  8. LukasT says:

    @Matthias Kretz:
    Here are all composite ops
    I started with this one

    The function composeColorChannels gets one pixel (RGBA8, RGBA16, …) to process.

    Feel free to pop up in #krita at irc.freenode.net, I’m around as LukasT there if you have any questions ;), thanks for your interest!

  9. I expect the best performance if you can change the API to allow processing of multiple pixels, e.g. by extending the API with a function that additionally passes the number of consecutive pixels that can be accessed at src and dst. If you can do this, let me know and I’ll give you an idea of how to vectorize it.

    Otherwise: You need to create a mask because of the alpha_pos and channelFlags tests. For performance reasons you should make sure that the compiler is able to cache this mask between consecutive calls of the compose functions (you’d need to look at the resulting assembler to see this).

    As far as I know the GCC extension for vectors: you can’t do much with masks with it. But I might be wrong – never really used it. Better use intrinsics (not asm): http://www.intel.com/software/products/compilers/docs/clin/main_cls/mergedprojects/intref_cls/whnjs.htm . Intrinsics also work with MSVC and ICC. (Or even better Vc, but it would need some more features to support your problem.)

    The idea of the vectorization:
    – change the for loop to a mask creation
    – make an unaligned load from src/dst to a vector (careful with out-of-bounds accesses, might need 64 bit or 32 bit moves)
    – if needed convert to one or two vectors with 16 or 32 bit integers
    – do the blend operation
    – convert the resulting dst vector(s) back to the original integer width (using the saturating pack instruction)
    – blend the original dst vector with the calculated dst vector using the mask
    – store the dst vector to memory (again: be careful with out-of-bounds writes)
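    The steps above can be sketched with SSE2 intrinsics for one group of 8 channel values (purely illustrative: the “compose” step here is a plain add, and the precomputed byte mask stands in for the channelFlags/alpha_pos tests):

    ```cpp
    #include <emmintrin.h>
    #include <cstdint>

    // mask holds 0xFF in each byte that may be written, 0 elsewhere.
    void composeMasked(const uint8_t *src, uint8_t *dst, __m128i mask) {
        __m128i zero = _mm_setzero_si128();
        // unaligned 64-bit loads, zero-extended to 16-bit lanes (PUNPCKLBW)
        __m128i s = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)src), zero);
        __m128i d = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)dst), zero);
        __m128i r16 = _mm_add_epi16(s, d);         // the compose op, in 16 bit
        __m128i r8  = _mm_packus_epi16(r16, r16);  // saturating pack to 8 bit
        __m128i old = _mm_loadl_epi64((const __m128i *)dst);
        // blend: keep old bytes where mask is 0, new bytes where it is 0xFF
        __m128i out = _mm_or_si128(_mm_and_si128(mask, r8),
                                   _mm_andnot_si128(mask, old));
        _mm_storel_epi64((__m128i *)dst, out);
    }
    ```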

    I might be able to get a student to work on this, if you’re interested. But probably no earlier than fall/winter – if at all.

  10. s/do the blend operation/do the compose operation/

  11. Andre says:

    Matthias: Doesn’t this already operate on multiple pixels?
    See KoComposite.h

    @param dstRowStart pointer to the start of the byte array we will composite the source on
    @param dstRowStride length of the rows of the block of destination pixels in bytes
    @param srcRowStart pointer to the start of the byte array we will mix with dest
    @param srcRowStride length of the rows of the block of src pixels in bytes (may be different from the rowstride of the dst pixels, in which case the smaller value is used). If srcRowStride is null it is assumed that the source is a constant color.
    @param maskRowStart start of the byte mask that determines whether and if so, then how much of src is used for blending
    @param maskRowStride length of the mask scanlines in bytes
    @param rows number of scanlines to blend
    @param numColumns length of the row of pixels in pixels
    @param opacity transparency with which to blend
    virtual void composite(quint8 *dstRowStart, qint32 dstRowStride,
    const quint8 *srcRowStart, qint32 srcRowStride,
    const quint8 *maskRowStart, qint32 maskRowStride,
    qint32 rows, qint32 numColumns,
    quint8 opacity) const;

    I guess this is the method that is called by applications. The composeColorChannels() method is only called by implementations of the AlphaBase op. Should be no problem to change that I guess.

  12. @Andre: Yes, that looks good for API compatibility to applications. Still the composeColorChannels function gets called per pixel and it is where the calculation that should be vectorized happens. composite could be modified to call an overload of composeColorChannels that e.g. blends a whole scanline or so. But this requires a lot of code changes.

    Anyway, it’s up to the maintainer of the library to decide what is allowed to change there, and I don’t know whether composeColorChannels is internal API or not… My recommendation for allowing efficient vectorization is to process at least one scanline per composeColorChannels call.

    Please correct me if I’m saying something stupid – I did not take the time to review the whole API…

  13. LukasT says:

    Andre is right: you can implement a KoCompositeOp which operates on a buffer of pixels. composeColorChannels is introduced by KoCompositeOpAlphaBase, where some alpha handling is done to avoid code duplication.

    Cyrille Berger is the Pigment maintainer, so he can answer any of your questions. He is CyrilleB on IRC.
