I use this hash function a lot, e.g. to record the value of a data frame. I wanted to see if I could break it. Why aren't these hash values identical?
This requires the digest package.
Plain text output:
> digest(Inf-Inf)
[1] "0d59b2dae9351c1ce6c76133295322d7"
> digest(NaN)
[1] "4e9653ddf814f0d16b72624aeb85bc20"
> digest(1)
[1] "6717f2823d3202449301145073ab8719"
> digest(1 + 0)
[1] "6717f2823d3202449301145073ab8719"
> digest(5)
[1] "5e338704a8e069ebd8b38ca71991cf94"
> digest(sum(1, 1, 1, 1, 1))
[1] "5e338704a8e069ebd8b38ca71991cf94"
> digest(1^0)
[1] "6717f2823d3202449301145073ab8719"
> 1^0
[1] 1
> digest(1)
[1] "6717f2823d3202449301145073ab8719"
Additional weirdness: calculations that evaluate to NaN have identical hash values, but the hash of NaN itself does not match them:
> Inf - Inf
[1] NaN
> 0/0
[1] NaN
> digest(Inf - Inf)
[1] "0d59b2dae9351c1ce6c76133295322d7"
> digest(0/0)
[1] "0d59b2dae9351c1ce6c76133295322d7"
> digest(NaN)
[1] "4e9653ddf814f0d16b72624aeb85bc20"
tl;dr this has to do with very deep details of how NaNs are represented in binary. You could work around it by using digest(., ascii=TRUE) ...

Following up on @Jozef's answer: note boldfaced digits ...
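One way to look at the binary representation without extra packages is to dump the raw bytes of a double with base R's writeBin(). This is only a sketch: double_hex() and sign_bit() are hypothetical helper names, not part of digest or base R, and the bytes you see for the NaN cases depend on the platform and compiler.

```r
# Hypothetical helper: the 8 bytes of a single double, big-endian, as hex.
double_hex <- function(x) {
  stopifnot(is.double(x), length(x) == 1L)
  paste(as.character(writeBin(x, raw(), endian = "big")), collapse = " ")
}

# Hypothetical helper: the IEEE 754 sign bit (the top bit of the first byte).
sign_bit <- function(x) {
  first_byte <- writeBin(x, raw(), endian = "big")[1]
  as.integer(as.integer(first_byte) >= 128L)
}

double_hex(1)          # "3f f0 00 00 00 00 00 00"

# Platform dependent: compare these three by eye.
double_hex(NaN)
double_hex(Inf - Inf)
double_hex(0 / 0)

# If the sign bits differ, the serialized bytes (and so the hashes) differ.
sign_bit(NaN)
sign_bit(Inf - Inf)
```

If the last two calls return different values on your machine, that matches the differing digests shown in the question.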
Alternatively, using pryr::bytes() ...

The Wikipedia article on floating point format/NaNs says:
The sign is the first bit; the exponent is the next 11 bits; the fraction is the last 52 bits. Translating the first four hex digits given above to binary, Inf-Inf is 1111 1111 1111 0100 (sign=1; exponent is all ones, as required; fraction starts with 0100) whereas NaN is 0111 1111 1111 0100 (the same, but with sign=0).

To understand why Inf-Inf ends up with sign bit 1 and NaN has sign bit 0 you'd probably have to dig more deeply into the way floating point arithmetic is implemented on this platform ...

It might be worth raising an issue on the digest GitHub repo about this; I can't think of an elegant way to do it, but it seems reasonable that objects where
identical(x, y) is TRUE in R should have identical hashes ...

Note that identical() specifically ignores these differences in bit patterns via the single.NA (default TRUE) argument; with single.NA = FALSE it differentiates bit patterns.

Within the C code, it looks like R first tests both values with ISNAN and treats any two NaN values as equal unless bitwise comparison is enabled, in which case it does an explicit comparison of the underlying bytes: see here. That is, the equivalence of different kinds of NaN values comes from that explicit check rather than from C's != operator, since NaN != NaN is always true in C ...
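A rough sketch of both points, using base R only. The first part shows how single.NA changes what identical() reports. The second is one possible workaround, not something digest provides: canonicalize_nan() is a hypothetical helper that overwrites every NaN with R's NaN constant before hashing, on the assumption that the constant has one fixed bit pattern within a given R session.

```r
a <- Inf - Inf
b <- NaN

identical(a, b)                     # TRUE: all NaNs are treated as one value
identical(a, b, single.NA = FALSE)  # compares bit patterns; platform dependent

# Hypothetical helper (not part of digest): give every NaN in a double
# vector the same bit pattern. NA_real_ is left alone, because is.nan()
# is FALSE for NA.
canonicalize_nan <- function(x) {
  if (is.double(x)) {
    x[is.nan(x)] <- NaN
  }
  x
}

x <- c(1, Inf - Inf, 0 / 0, NaN, NA)
y <- canonicalize_nan(x)

# After canonicalizing, the computed NaNs carry the same bytes as the constant.
identical(writeBin(y[2], raw()), writeBin(NaN, raw()))

# With the digest package installed, the normalized value can then be hashed:
# digest::digest(canonicalize_nan(Inf - Inf))
```

For a data frame, the same helper could be applied column by column, e.g. df[] <- lapply(df, canonicalize_nan), before calling digest().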