Dictionary.ContainsKey() vs. large arrays used for lookup, any alternative?

65 Views Asked by At

I have a span of over 64 millions of ints (these are RGBA32 colors read as int32 from a 8192^2 PNG file, all positive and 2^31 - 1 excluded). I need to enumerate the different ints with a dictionary and have a bijection ints <-> [0, count[.

Span<int> intSpan = Resources.Load<Texture2D>("PNGFile").GetRawTextureData<int>();

Dictionary<int, int> dictionary = new();

    for (int i = 0, count = 0; i < intSpan.Length; i++)
    {
        if (!dictionary.ContainsKey(intSpan[i]))
        {
            dictionary.Add(intSpan[i], count);

            count++;
        }
    }

511 ms to retrieve the 3412 different ints from the 8192^2 length span.

Now, let use a 2^31 - 1 length bool array instead.

bool[] bools = new bool[int.MaxValue];

for (int i = 0, count = 0; i < intSpan.Length; i++)
{
    if (!bools[intSpan[i]])
    {
        bools[intSpan[i]] = true;

        dictionary.Add(intSpan[i], count);

        count++;
    }
}

235 ms. Less than half. But that array takes over 2 GB of memory.

Using a 2^31 - 1 length Bitarray instead.

BitArray bitarray = new(int.MaxValue);

256 ms. Barely slower but 8 times less memory.

Using a regular 2^26 int array with bit shitfing.

int[] intArray = new int[(int)Mathf.Pow(2, 26)];

for (int i = 0, count = 0; i < intSpan.Length; i++)
{
    if (((intArray[intSpan[i] / 32] >> (intSpan[i] % 32)) & 1) == 0)
    {
        intArray[intSpan[i] / 32] |= (1 << (intSpan[i] % 32));

        dictionary.Add(intSpan[i], count);

        count++;
    }
}

160 ms.

Please challenge my apparent misconceptions and tell me what are better ways, in any sense, to do what these pieces of code do. Problems of enumeration of the different elements of very large collections has taken a significant part of my time for years.

The poor performance of Dictionary.ContainsKey() or Hashset.Contains() is even more apparent (and become an obstacle) with less primitive structs such as Vector3, or pairs of vectors (tuples), etc. where straight replacement by lookup arrays is not possible for size reasons.

0

There are 0 best solutions below