Implementing memcmp

The following is the Microsoft CRT implementation of memcmp:

int memcmp(const void* buf1,
           const void* buf2,
           size_t count)
{
    if(!count)
        return(0);

    while(--count && *(char*)buf1 == *(char*)buf2 ) {
        buf1 = (char*)buf1 + 1;
        buf2 = (char*)buf2 + 1;
    }

    return(*((unsigned char*)buf1) - *((unsigned char*)buf2));
}

It basically performs a byte-by-byte comparison.

My question is in two parts:

  1. Is there any reason not to change this to an int-by-int comparison until count < sizeof(int), then do a byte-by-byte comparison for what remains?
  2. If I were to do 1, are there any potential/obvious problems?

Notes: I'm not using the CRT at all, so I have to implement this function anyway. I'm just looking for advice on how to implement it correctly.

There are 8 answers below.

0

Another idea is to optimize for the processor cache and its fetching behaviour. Processors like to fetch memory in large chunks rather than as individual bytes at random times. Although the cache hardware may already account for this, it is a good exercise anyway. Always profile to determine the most efficient solution.

Pseudocode:

while bytes remaining > (cache size) / 2 do // Half the cache for source, other for dest.
  fetch source bytes
  fetch destination bytes
  perform comparison using fetched bytes
end-while
perform byte by byte comparison for remainder.
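
A minimal C sketch of that pseudocode might look like the following (the function name and the 64-byte chunk size are assumptions, not anything standard): whole chunks are copied into local word arrays with memcpy, compared word by word, and a plain byte scan handles both the remainder and the exact return value when a chunk differs.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

int memcmp_chunked(const void* buf1, const void* buf2, size_t count)
{
    const unsigned char* p1 = buf1;
    const unsigned char* p2 = buf2;
    enum { CHUNK = 64 };                  /* assumed cache-line size */

    while (count >= CHUNK) {
        uint64_t a[CHUNK / 8], b[CHUNK / 8];   /* uint64_t is 8 bytes */
        memcpy(a, p1, CHUNK);             /* "fetch source bytes"      */
        memcpy(b, p2, CHUNK);             /* "fetch destination bytes" */

        for (size_t i = 0; i < CHUNK / 8; i++) {
            if (a[i] != b[i]) {
                /* Mismatch inside this chunk: locate the first differing
                   byte so the sign of the result comes out right. */
                const unsigned char* c1 = p1 + i * 8;
                const unsigned char* c2 = p2 + i * 8;
                while (*c1 == *c2) { c1++; c2++; }
                return *c1 - *c2;
            }
        }
        p1 += CHUNK;
        p2 += CHUNK;
        count -= CHUNK;
    }

    /* Remainder: plain byte-by-byte comparison. */
    while (count--) {
        if (*p1 != *p2)
            return *p1 - *p2;
        p1++;
        p2++;
    }
    return 0;
}

Whether this actually beats a simple loop depends on the hardware and the compiler, so, as above, profile.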

For more information, search the web for "data-oriented design" and "data-oriented programming".

Some processors, such as the ARM family, allow for conditional execution of instructions (in 32-bit, non-Thumb mode). The processor fetches the instructions but only executes them if their conditions are satisfied. In this case, try rephrasing the comparison in terms of boolean assignments. This may also reduce the number of branches taken, which improves performance.
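
As a rough illustration in portable C rather than ARM assembly (the function name is invented), the early-exit branch can be replaced with a conditionally updated result, which a compiler targeting 32-bit ARM may turn into predicated instructions or a conditional move. The trade-off is that the loop always scans the whole buffer.

#include <stddef.h>

int memcmp_branchless(const void* buf1, const void* buf2, size_t count)
{
    const unsigned char* p1 = buf1;
    const unsigned char* p2 = buf2;
    int result = 0;

    for (size_t i = 0; i < count; i++) {
        int d = p1[i] - p2[i];
        if (result == 0)      /* candidate for predication / conditional move */
            result = d;
    }
    return result;            /* difference of the first differing bytes, or 0 */
}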

See also loop unrolling.
See also assembly language.

You can gain a lot of performance by tailoring the algorithm to a specific processor, but you lose portability.

2

You could do it as an int-by-int comparison or an even wider data type if you wish.

The two things you have to watch out for (at a minimum) are an overhang at the start as well as the end, and whether the alignments are different between the two areas.

Some processors run slower if you access values without following their alignment rules (some even crash if you try it).

So your code could probably do char comparisons up to an int alignment boundary, then int comparisons, then char comparisons again for the tail but, again, the relative alignment of the two areas will probably matter.

Whether that extra code complexity is worth whatever savings you will get depends on many factors outside your control. A possible method would be to detect the ideal case where both areas are aligned identically and do it a fast way, otherwise just do it character by character.
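
A sketch of that approach (function name invented, C99 assumed): if both pointers have the same offset from a word boundary, compare bytes up to that boundary, then words, then bytes for the tail; otherwise fall straight through to the byte loop. Note that the word loads type-pun the buffers, which C library implementations get away with, but strictly conforming code would route them through memcpy.

#include <stddef.h>
#include <stdint.h>

int memcmp_aligned(const void* buf1, const void* buf2, size_t count)
{
    const unsigned char* p1 = buf1;
    const unsigned char* p2 = buf2;
    const size_t mask = sizeof(size_t) - 1;

    /* Word-at-a-time only helps when both pointers can reach word
       alignment together. */
    if (((uintptr_t)p1 & mask) == ((uintptr_t)p2 & mask)) {
        /* Overhang at the start: bytes until the buffers are aligned. */
        while (count && ((uintptr_t)p1 & mask)) {
            if (*p1 != *p2)
                return *p1 - *p2;
            p1++; p2++; count--;
        }

        /* Main loop: one word at a time. */
        while (count >= sizeof(size_t)) {
            if (*(const size_t*)p1 != *(const size_t*)p2)
                break;                /* the byte loop below finds the spot */
            p1 += sizeof(size_t);
            p2 += sizeof(size_t);
            count -= sizeof(size_t);
        }
    }

    /* Overhang at the end, the differently-aligned case, and locating
       the mismatch inside a differing word. */
    while (count--) {
        if (*p1 != *p2)
            return *p1 - *p2;
        p1++; p2++;
    }
    return 0;
}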

5

The optimization you propose is very common. The biggest concern would be if you try to run it on a processor that doesn't allow unaligned accesses for anything other than a single byte, or is slower in that mode; the x86 family doesn't have that problem.

It's also more complicated, and thus more likely to contain a bug.

0

If you compare as int, you will need to check alignment and check whether count is divisible by sizeof(int), comparing any remaining bytes as char.

4

Is that really their implementation? I have other issues besides not doing it int-wise:

  • casting away constness.
  • does that return statement work? unsigned char - unsigned char = signed int?

int-at-a-time only works if the pointers have the same alignment, or if you can read a few bytes from the front of each so that both become aligned; e.g. if both are one byte before an alignment boundary, you can read one char from each and then go int-at-a-time. But if they are aligned differently (e.g. one is aligned and one is not), there is no way to do this.

memcmp is at its most inefficient (i.e. it takes the longest) when the buffers actually compare equal (it has to scan to the end) and the data is long.

I would not write my own, but if you are going to be comparing large blocks of data you could do things like ensure alignment and even pad the ends, then go word-at-a-time, if you want.

0

Don't forget that when you find a mismatch within a larger chunk, you must then identify the first differing char within that chunk so that you can calculate the correct return value (memcmp() returns the difference of the first differing bytes, treated as unsigned char values).
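
For instance, once two words are known to differ you cannot simply subtract them: on a little-endian machine the most significant part of each word holds the bytes that come last in memory, so the sign would often be wrong, and a 64-bit difference does not fit in an int anyway. A small helper (name invented) that re-scans the chunk byte by byte is the simple fix.

/* The caller guarantees that the two chunks differ somewhere. */
static int first_difference(const unsigned char* a, const unsigned char* b)
{
    while (*a == *b) {
        a++;
        b++;
    }
    return *a - *b;   /* difference of the first differing bytes */
}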

1

Many processors have string-comparison instructions that do most of this work in a single instruction. If you can guarantee which processor you're running on, it can be implemented with a line or two of inline assembler.
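
For example, x86 has the repe cmpsb string instruction, which performs the byte-by-byte scan in hardware. A GCC/Clang-style sketch might look like the following (the function name is made up, and on modern CPUs this is usually slower than a well-vectorized memcmp, so treat it as an illustration rather than an optimization).

#include <stddef.h>

static int memcmp_x86(const void* buf1, const void* buf2, size_t count)
{
    const unsigned char* p1 = buf1;
    const unsigned char* p2 = buf2;

    if (count == 0)
        return 0;

    /* repe cmpsb: compare the bytes at (%rsi) and (%rdi), advancing both
       pointers and decrementing %rcx, until the bytes differ or the count
       reaches zero.  Assumes the direction flag is clear, as the usual
       calling conventions guarantee. */
    __asm__ volatile ("repe cmpsb"
                      : "+S"(p1), "+D"(p2), "+c"(count)
                      :
                      : "cc", "memory");

    /* The pointers now sit one past the last bytes compared. */
    return p1[-1] - p2[-1];
}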

0

The code you found is just a debug implementation of memcmp; it is optimized for simplicity and readability, not for performance.

The intrinsic compiler implementation is platform specific and smart enough to generate processor instructions that compare dwords or qwords (depending on the target architecture) at once whenever possible. Also, an intrinsic implementation may return immediately if both buffers have the same address (buf1 == buf2). This check is also missing in the debug implementation.

Finally, even when you know exactly which platform you'll be running on, the best implementation is still the least generic one, as it depends on a number of factors that are specific to the rest of your program:

  • What is the minimum guaranteed buffer alignment?
  • Can you read any padding bytes past the end of a buffer without triggering an access violation?
  • May the buffer parameters be identical?
  • May the buffer size be 0?
  • Do you only need to compare buffer contents for equality (see the sketch after this list)? Or do you also need to know which one is larger (return value < 0 or > 0)?
  • ...
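
To illustrate the equality-only case from the list above (the function name is made up): such a routine can OR together the XOR of each byte pair and skip the ordering logic entirely, which also keeps the loop branch-free.

#include <stdbool.h>
#include <stddef.h>

static bool buffers_equal(const void* buf1, const void* buf2, size_t count)
{
    const unsigned char* p1 = buf1;
    const unsigned char* p2 = buf2;
    unsigned char acc = 0;

    for (size_t i = 0; i < count; i++)
        acc |= p1[i] ^ p2[i];   /* becomes non-zero if any byte differs */

    return acc == 0;
}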

If performance is a concern, I suggest writing the comparison routine in assembly. Most compilers give you an option to see the assembly listing that they generate for a source file. You could take that code and adapt it to your needs.