Why do string hash codes change for each execution in .NET?


Consider the following code:

Console.WriteLine("Hello, World!".GetHashCode());

First run:

139068974

Second run:

-263623806

Now consider the same thing written in Kotlin:

println("Hello, World!".hashCode())

First run:

1498789909

Second run:

1498789909

Why do hash codes for strings change on every execution in .NET, but not on other runtimes like the JVM?

shingo On BEST ANSWER

Why do hash codes for string change for every execution in .NET

In short, to prevent hash collision attacks. The docs for the <UseRandomizedStringHashAlgorithm> configuration element give the rough reasoning:

The string lookup in a hash table is typically an O(1) operation. However, when a large number of collisions occur, the lookup can become an O(n²) operation. You can use the configuration element to generate a random hashing algorithm per application domain, which in turn limits the number of potential collisions, particularly when the keys from which the hash codes are calculated are based on data input by users.
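To make the cost of collisions concrete, here is a minimal sketch that counts how many key comparisons a Dictionary performs while it is being filled. CollisionDemo and CountingComparer are hypothetical names made up for this illustration, and the constant hash function stands in for an attacker who manages to make every key collide.

```csharp
using System;
using System.Collections.Generic;

// Fill two dictionaries with the same 2000 keys and compare the work done.
Console.WriteLine($"spread-out hashes: {CollisionDemo.CountEqualsCalls(2000, s => s.GetHashCode())} Equals calls");
Console.WriteLine($"one shared hash:   {CollisionDemo.CountEqualsCalls(2000, _ => 0)} Equals calls");

// Hypothetical helper for this illustration only.
public static class CollisionDemo
{
    // Inserts keyCount distinct keys and returns how often the dictionary
    // had to fall back to a full key comparison.
    public static long CountEqualsCalls(int keyCount, Func<string, int> hash)
    {
        var comparer = new CountingComparer(hash);
        var map = new Dictionary<string, int>(comparer);
        for (int i = 0; i < keyCount; i++)
            map["key" + i] = i;
        return comparer.EqualsCalls;
    }

    // Wraps a caller-supplied hash function and counts Equals invocations.
    private sealed class CountingComparer : IEqualityComparer<string>
    {
        private readonly Func<string, int> _hash;
        public long EqualsCalls { get; private set; }

        public CountingComparer(Func<string, int> hash) => _hash = hash;

        public bool Equals(string? x, string? y)
        {
            EqualsCalls++;
            return string.Equals(x, y, StringComparison.Ordinal);
        }

        public int GetHashCode(string obj) => _hash(obj);
    }
}
```

When every key shares one hash code, each insert has to compare against all keys already stored, so the total work grows roughly quadratically with the number of keys. With well-spread hash codes, full comparisons are rare.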

but not on other runtimes like the JVM?

Not exactly. Other runtimes randomize too: Python, for example, randomizes string hashes per process by default. Conversely, C# also produces stable string hash codes in .NET Framework, .NET Core 1.0, and .NET Core 2.0 when <UseRandomizedStringHashAlgorithm> is not enabled.

For Java, it is probably a historical issue: the algorithm is public, so it cannot easily be changed, and it is not a good one; read this.
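Why a public, fixed algorithm is a problem can be sketched in C# by reimplementing the formula documented for Java's String.hashCode (s[0]*31^(n-1) + ... + s[n-1] in wrapping 32-bit arithmetic). JavaStyleHash and CollidingKeys are hypothetical names for this illustration, not part of any library. Because "Aa" and "BB" hash to the same value under this formula, any sequence of those two blocks collides as well, so anyone can generate as many colliding keys as they like.

```csharp
using System;
using System.Collections.Generic;

// Matches the Kotlin output shown in the question.
Console.WriteLine(JavaStyleHash.Compute("Hello, World!")); // prints 1498789909

// 2^3 = 8 distinct keys that all share one hash code.
foreach (var key in JavaStyleHash.CollidingKeys(blocks: 3))
    Console.WriteLine($"{key} -> {JavaStyleHash.Compute(key)}");

// Hypothetical helper for this illustration only.
public static class JavaStyleHash
{
    // Reimplementation of the documented 31-based polynomial hash
    // over UTF-16 code units, with 32-bit overflow wrapping around.
    public static int Compute(string s)
    {
        int h = 0;
        foreach (char c in s)
            h = unchecked(31 * h + c);
        return h;
    }

    // Builds 2^blocks distinct strings from the colliding blocks "Aa" and "BB".
    public static List<string> CollidingKeys(int blocks)
    {
        var keys = new List<string> { "" };
        for (int i = 0; i < blocks; i++)
        {
            var next = new List<string>(keys.Count * 2);
            foreach (var prefix in keys)
            {
                next.Add(prefix + "Aa");
                next.Add(prefix + "BB");
            }
            keys = next;
        }
        return keys;
    }
}
```

A per-process random seed, as .NET uses for strings, defeats this trick because the attacker cannot precompute which inputs will collide.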

minnmass On

Why do hash codes change for every execution in .NET?

Because changing the hash code of strings (and other objects!) on each run is a very strong hint to developers that hash codes do not have any meaning outside of the process that generated the hash.

Specifically, the documentation says:

Furthermore, .NET does not guarantee the default implementation of the GetHashCode method, and the value this method returns may differ between .NET implementations, such as different versions of .NET Framework and .NET Core, and platforms, such as 32-bit and 64-bit platforms. For these reasons, do not use the default implementation of this method as a unique object identifier for hashing purposes. Two consequences follow from this:

  • You should not assume that equal hash codes imply object equality.
  • You should never persist or use a hash code outside the application domain in which it was created, because the same object may hash differently across application domains, processes, and platforms.

By changing the hash code of a given object from one run to the next, the runtime is telling the developer not to use the hash code for anything that crosses a process/app-domain boundary. That will help to insulate developers from bugs stemming from changes to the GetHashCode algorithms used by standard classes.

Having hash codes change from one run to the next also discourages things like persisting the hash code for use as a "did this thing change" short-cut. This prevents two kinds of bugs: those caused by changes to the underlying algorithms, and those caused by assuming that two objects of the same type with the same hash code are equal, when no such guarantee is made (in fact, no such guarantee can be made for any data structure which requires or allows more than 32 bits, due to the pigeonhole principle).
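If a value really does have to survive across runs, one option is to compute it with an explicitly specified algorithm instead of GetHashCode. The sketch below uses 32-bit FNV-1a over the UTF-8 bytes purely as an example. That choice is an assumption of this illustration, not something the documentation recommends, and StableHash is a hypothetical name.

```csharp
using System;
using System.Text;

// Same input, same output, on every run and every platform.
Console.WriteLine(StableHash.Fnv1a32("Hello, World!"));

// Hypothetical helper for this illustration only.
public static class StableHash
{
    // 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
    public static uint Fnv1a32(string text)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            hash ^= b;
            hash = unchecked(hash * prime); // wrap around on overflow
        }
        return hash;
    }
}
```

This only fixes the stability problem. The result is still 32 bits wide, so equal values do not imply equal inputs, and FNV-1a is not designed to resist deliberately crafted collisions.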

Why do other languages generate stable hash codes?

Without a thorough language-by-language review, I can only speculate, but the major reasons are likely to be some combination of:

  • historical inertia (read: "backwards compatibility")
  • the disadvantages of stable hash codes were insufficiently understood when the language spec was defined
  • adding instability to hash codes was too computationally expensive when the language spec was defined
  • hash codes were less visible to developers