According to "Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)" (Intel, 2008), Intel added instructions (in SSE4.2) to assist in character searches and comparisons on two 16-byte operands at a time. I wrote some basic strlen() and strcmp() functions in C, but they seem slower than glibc's.
I would like to experiment with inline assembly to see how it affects my project's performance when reading and writing XML.
I've read (on here) that using SIMD for things like strlen() is rife with potential problems (memory alignment, for example), so I'm a little concerned about using it in production code.
glibc's implementations will be hard to beat. These functions are carefully optimized and include pieces hand-written in assembly. Here is glibc's x86_64 implementation of strcmp, using AVX2 instructions. Be warned, it is 800 lines:
https://github.com/lattera/glibc/blob/master/sysdeps/x86_64/multiarch/strcmp-avx2.S
For more detail, see also Peter Cordes' fantastic explanation of glibc's implementation.
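
To make the alignment concern from the question concrete, here is a minimal sketch of a SIMD strlen. It is not glibc's code; the name `sse2_strlen` is made up for illustration, and it uses plain SSE2 intrinsics rather than the SSE4.2 string instructions or inline assembly. The idea is to round the pointer down to a 16-byte boundary so that every load is aligned. An aligned 16-byte load cannot straddle a page boundary, so it cannot fault on an unmapped page that lies beyond the terminator. It assumes GCC or Clang (for `__builtin_ctz`) and falls back to plain strlen() when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics; SSE2 is baseline on x86-64 */

/* Hypothetical helper, not part of glibc: strlen with aligned 16-byte loads. */
size_t sse2_strlen(const char *s)
{
    assert(s != NULL);

    const __m128i zero = _mm_setzero_si128();

    /* Round s down to a 16-byte boundary so the first load is aligned. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned offset = (unsigned)((uintptr_t)s & 15);

    /* Bit i of mask is set when byte i of the 16-byte block is zero. */
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));

    /* Ignore any zero bytes that sit before the start of the string. */
    mask &= 0xFFFFu << offset;

    /* Scan one aligned block at a time until a zero byte shows up. */
    while (mask == 0) {
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }

    /* The lowest set bit is the position of the terminator in this block. */
    return (size_t)((p + __builtin_ctz(mask)) - s);
}
#else
/* Portable fallback when SSE2 is not available. */
size_t sse2_strlen(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
#endif
```

Calling `sse2_strlen("hello")` returns 5. Note that the function reads bytes outside the string itself (before the start and after the terminator, within the same aligned block). That is harmless at the hardware level, but it is out of bounds as far as ISO C is concerned, and tools such as Valgrind or AddressSanitizer may complain. That is one reason glibc writes these routines in assembly, and a good reason to benchmark against glibc before putting anything like this into production.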