I have two separate applications, one in Java and the other in C++. I am using MurmurHash3 in both, but for the same string the C++ code gives a different result than the Java code.
Here is the one from C++: https://code.google.com/p/smhasher/source/browse/trunk/MurmurHash3.cpp?r=144
I am using the following function:
void MurmurHash3_x86_32 ( const void * key, int len,
uint32_t seed, void * out )
Here is the one for Java: http://search-hadoop.com/c/HBase:hbase-common/src/main/java/org/apache/hadoop/hbase/util/MurmurHash3.java||server+void+%2522hash
There are many versions of the same Java code above.
This is how I am making a call for Java:
String s = new String("b2622f5e1310a0aa14b7f957fe4246fa");
System.out.println(MurmurHash3.murmurhash3_x86_32(s.getBytes(), 0, s.length(), 2147368987));
The output I get from Java: -1868221715
The output I get from C++: 3297211900
When I tested some other sample strings, like "7c6c5be91430a56187060e06fd64dcb8" and "7e7e5f2613d0a2a8c591f101fe8c7351", the Java and C++ results match.
Any pointers are appreciated.
There are two problems I can see. First, the C++ code is using uint32_t and gives you the value 3,297,211,900. That number is larger than will fit in a signed 32-bit int, and Java only has signed integers. However, -1,868,221,715 is not equal to 3,297,211,900 even after accounting for the difference between signed and unsigned ints. (Java 8 added Integer.toUnsignedString(int), which converts a signed 32-bit int to its unsigned string representation. In earlier versions of Java, you can cast the int to a long and then mask off the high bits: ((long) i) & 0xffffffffL.)
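For example, this quick check (plain Java, using the value your program printed) shows that the unsigned reading of your Java result is 2,426,745,581, which still does not match the C++ output:

int h = -1868221715;                               // the value Java printed
System.out.println(Integer.toUnsignedString(h));   // Java 8+: prints 2426745581
System.out.println(((long) h) & 0xffffffffL);      // pre-Java 8: prints 2426745581

So the signed/unsigned difference alone does not explain the mismatch.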
The second problem is that you are using the wrong version of getBytes(). The version that takes no argument converts a Unicode String to a byte[] using the platform's default encoding, which varies depending on how your system is set up: it could be giving you UTF-8, Latin-1, Windows-1252, KOI8-R, Shift-JIS, EBCDIC, etc.

Never, ever, ever call the no-argument version of String.getBytes(), under any circumstances. It should be deprecated, decimated, defenestrated, destroyed, and deleted. Use s.getBytes("UTF-8") (or whatever encoding you are actually expecting) instead. As the Zen of Python says, "Explicit is better than implicit."
I can't tell whether there are any other problems beyond these two.