Garmaine Staff asked 2 years ago

I have a game engine that indexes the colors of some bitmaps, which allows using some of the crazy effects of olde (color strobing etc.). Sadly, the indexing algorithm is neither slow nor fast, but since the spritesheets these days are gigantic, it really adds up. Currently, loading a single large spritesheet can take 150+ milliseconds, which is an eternity, relatively speaking.

This is the algorithm:

auto& palette = p->pal;     // vector
auto& lookup  = p->lookup;  // vector_map

palette.reserve(200); // There are on average ~100 unique colors
palette.push_back(0); // Index zero is the blank color

uint32_t lastColor = 0;
uint32_t lastPalette = 0;

for (size_t i = 0; i < pixels; i++)
{
    const auto color = data[i];
    if (color == lastColor)
    {
        // Same color as the previous indexed pixel: reuse its encoded index
        data[i] = lastPalette;
        continue;
    }
    else if (color == 0)
    {
        // Blank pixel: it already holds index zero, nothing to write
        continue;
    }
    uint32_t j = 0;
    const auto it = lookup.find(color);
    if (it != lookup.end()) {
        j = it->second;
    }
    else
    {
        // First occurrence: append to the palette and remember its index
        j = palette.size();
        palette.push_back(color);
        lookup.emplace(color, j);
    }

    lastColor = color;
    // Write the index back to the bitmap:
    // (this is just a GPU texture encoding, don't mind it)
    data[i] = (j & 255) | ((j >> 8) << (8 + 6));
    lastPalette = data[i];
}

The base algorithm is fairly straightforward: go through each pixel, find or create a palette entry for its color, and write the entry's index back to the image.

Now, can you parallelize this? Probably not. I have tried with OMP and regular threads. It's simply not going to be fast because regardless of how much time you save by going through each portion of the image separately, at the end you have to have a common set of indexes that apply throughout the whole image, and those indexes have to be written back to the image. Sadly, finding the unique colors first and then writing back using parallelization is also slower than doing it once, sequentially. Makes sense, doesn't it?
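For reference, the two-pass variant described above might look roughly like the sketch below. It is only an illustration of the structure, not a claim that it is faster (it was measured as slower here). The names `Palette`, `build_palette`, `encode_index` and `remap_chunk` are hypothetical, and `std::unordered_map` stands in for the `vector_map` used in the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Palette {
    std::vector<std::uint32_t> colors;                       // index -> color
    std::unordered_map<std::uint32_t, std::uint32_t> index;  // color -> index
};

// Same GPU texture encoding as in the loop from the question.
std::uint32_t encode_index(std::uint32_t j) {
    return (j & 255) | ((j >> 8) << (8 + 6));
}

// Pass 1 (sequential): collect unique colors in first-occurrence order.
// Index zero is reserved for the blank color.
Palette build_palette(const std::uint32_t* data, std::size_t pixels) {
    Palette p;
    p.colors.push_back(0);
    p.index.emplace(0u, 0u);
    std::uint32_t last = 0;
    for (std::size_t i = 0; i < pixels; ++i) {
        const std::uint32_t color = data[i];
        if (color == last) continue;  // skip runs of the same color
        last = color;
        const auto next = static_cast<std::uint32_t>(p.colors.size());
        if (p.index.emplace(color, next).second) {
            p.colors.push_back(color);  // newly seen color
        }
    }
    return p;
}

// Pass 2: rewrite pixels in [begin, end) using the finished palette.
// The palette is only read, so disjoint chunks could be given to
// separate threads.
void remap_chunk(std::uint32_t* data, std::size_t begin, std::size_t end,
                 const Palette& p) {
    assert(begin <= end);
    std::uint32_t lastColor = 0;
    std::uint32_t lastEncoded = 0;
    for (std::size_t i = begin; i < end; ++i) {
        const std::uint32_t color = data[i];
        if (color != lastColor) {
            lastColor = color;
            lastEncoded = encode_index(p.index.at(color));
        }
        data[i] = lastEncoded;
    }
}
```

Every pixel is read twice and looked up at least once per run in each pass, which is consistent with this being slower than the single sequential loop.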

A bitset is of little use here. Knowing whether a color has already been seen is useful, but the colors are 32-bit, which makes for 2^32 bits (512 MiB, roughly 537 MB). In contrast, 24 bits needs only 2 MiB, which might be a micro-optimization. I'm not really looking for that anyway. I need to cut the time by 10x.
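To make the bitset arithmetic concrete, here is a minimal sketch. `bitset_bytes` and `SeenRgb` are hypothetical names, and the 24-bit case assumes the alpha channel can be dropped before the test, which may not hold for every spritesheet.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bytes needed for a bitset with one bit per possible color value.
constexpr std::uint64_t bitset_bytes(unsigned bits_per_color) {
    return (std::uint64_t{1} << bits_per_color) / 8;
}

static_assert(bitset_bytes(32) == 512ull * 1024 * 1024, "32-bit colors: 512 MiB");
static_assert(bitset_bytes(24) == 2ull * 1024 * 1024, "24-bit colors: 2 MiB");

// A 2 MiB "seen" set for 24-bit RGB values (assumption: alpha is ignored).
class SeenRgb {
    std::vector<std::uint64_t> bits_ =
        std::vector<std::uint64_t>((1u << 24) / 64);

public:
    // Marks rgb as seen and returns whether it had been seen before.
    bool test_and_set(std::uint32_t rgb) {
        assert(rgb < (1u << 24));
        std::uint64_t& word = bits_[rgb >> 6];
        const std::uint64_t mask = std::uint64_t{1} << (rgb & 63);
        const bool seen = (word & mask) != 0;
        word |= mask;
        return seen;
    }
};
```

Note that such a set only answers "is this color new?". It does not give the palette index, so the lookup for already-seen colors is still needed.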

So, any ideas? Would it be possible to process 4 or 8 colors at the same time using SSE/AVX?
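To make the SSE idea concrete: the map lookup itself does not vectorize easily, but the `color == lastColor` shortcut can be checked for 4 pixels at a time. The sketch below is one possible shape for that, assuming runs of identical colors are common in the spritesheets. `run_length` and `HAVE_SSE2` are hypothetical names; the intrinsics are the standard SSE2 ones from `<emmintrin.h>`, with a scalar fallback when SSE2 is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// Returns how many leading pixels of data[0..n) are equal to `color`.
std::size_t run_length(const std::uint32_t* data, std::size_t n,
                       std::uint32_t color) {
    assert(data != nullptr || n == 0);
    std::size_t i = 0;
#ifdef HAVE_SSE2
    const __m128i needle = _mm_set1_epi32(static_cast<int>(color));
    for (; i + 4 <= n; i += 4) {
        // Unaligned load of 4 pixels, compared lane by lane with `color`.
        const __m128i px =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi32(px, needle));
        if (mask != 0xFFFF) break;  // at least one of the 4 pixels differs
    }
#endif
    // Scalar loop finds the exact end of the run (and handles the tail).
    while (i < n && data[i] == color) ++i;
    return i;
}
```

In the indexing loop, after a color has been looked up at position `i`, one could call `run_length(data + i, pixels - i, color)`, fill that many pixels with the encoded index, and skip ahead by the same amount. Whether this beats the plain scalar comparison depends on how long the runs actually are, so it would need to be measured.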