Function call to c_str() vs const char* in hash function

361 Views Asked by At

I was looking at hash functions on stackoverflow when I found one that was pretty interesting. It involves casting a const char* to a size_t* and then de-referencing the size_t. This is then bit shifted to a certain precision. This works for const char*, producing the same value each time. However, when I use an actual string type, and call c_str() instead, the two values produced do not match. Furthermore, on each run of the code, the string produces different values each run. Anyone have an idea of why this is occurring?

const string l = "BA";
const char* k = l.c_str();
const char* p = "BA";
cout << k << " " << *((size_t*)k) << endl;
cout << p << " " << *((size_t*)p) << endl;

Run 1:

BA 140736766951746
BA 7162260525311607106

Run 2:

BA 140736985055554
BA 7162260525311607106

Original question: Have a good hash function for a C++ hash table?

3

There are 3 best solutions below

0
On BEST ANSWER

*((size_t*)k) causes undefined behaviour by violating the strict aliasing rule. This code is only valid if k actually points to an object of type size_t.

Being undefined behaviour, seeing weird numbers is a possible result (as would be anything else).


I guess you intended something akin to:

size_t x;
memcpy(&x, k, sizeof x);
cout << k << " " << x << '\n';

It should now be clear what the problem is. Your string only contains 3 characters (2 plus the null terminator), however you attempt to read more than 3 characters which also causes undefined behaviour.

0
On

I'll start with saying that in:

const string l = "BA";
const char* k = l.c_str();
const char* p = "BA";
cout << k << " " << *((size_t*)k) << endl;
cout << p << " " << *((size_t*)p) << endl;

Both *((size_t*)k) and *((size_t*)p) invoke undefined behavior. This is so, since on most systems it will access data beyond the boundary of the char array. Note that, sizeof(size_t) > 3 * sizeof(char) for 32 and 64 bit system, so that *((size_t*)k) accesses at least one byte beyond the boundary.

In the whole example, the string literals (on your system) are possibly aligned to at least sizeof(size_t), with zero padding (don't count on it, but it seems so). This means the junk after the string literal "BA" (and the NUL terminator) is NUL character(s). This is consistent across runs.

In case of k, which comes from std::string you are not so lucky. The string is short, so most system will employ short string optimization. This means that that char buffer is in the std::string object. In your case, the string is so short that the remainder of it is still in the buffer dedicated for the short string optimization. As it seems, the remainder of the buffer is not initialized, and contains junk. The junk had been left over from before the function was called. As a result other than the first 3 bytes of BA\0, the rest is random junk.

You were lucky that this case of undefined behavior ends up with some additional junk, and not something more perplexing (like always returning zero, or calling unrelated functions). Don't rely on UB, ever.

2
On
// Simple null terminated character that is represented in memory as:
//
// ['B', 'A', '\0']
const char* p = "BA";

// From the other side `std::string` isn't so simple
//
// c_str() returns a pointer to some kind of buffer.
//
// ['B', 'A', '\0', ... reserved_memory]
//
const std::string l = "BA";
const char* k = l.c_str();

// Then you do a C-style cast.
//
// (size_t*)k that gives you the address to the beginning of the underlying
// data of the std::string (possibly it will be pointer on the heap or on
// stack depending on the SSO) and after that you dereference it to receive
// the value. BTW it can lead to the undefined behavior because you
// attempt to receive the value for 8 bytes (depending on the size_t size)
// but your actual string may be less than it, e.g. 4 bytes. As a result
// you will receive the garbage.
std::cout << k << " " << *((size_t*)k) << std::endl;

// Two strings created as
//
// const char* foo = "foo";
// const char* bar = "foo";
//
// are stored in the Read only segment of data in your executable. Actually
// two different pointers will point to the same string in this segment. Also
// note the same undefined behavior mentioned earlier.
std::cout << p << " " << *((size_t*)p) << std::endl;