I came across a sql code, which creates primary keys with hashbytes function and md5 algorithm. The code looks like this:
SELECT
CONVERT(VARBINARY(32),
CONVERT( CHAR(32),
HASHBYTES('MD5',
(LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
),
2)
)
FROM database.schema.table
I find it hard to understand for what is the result from hashbytes function is converted to char and then to varbinary, when we get directly varbinary from hashbytes function. Is there any good reason to do so?
Short Version
This code pads a hash with
0x20bytes which is rather strange and most likely due to misunderstandings by the initial author. Using hashes as keys is a terrible idea anywayLong Version
Hashes are completely inappropriate for generating primary keys. In fact, since the same hash can be generated from different original data, this code is guaranteed to produce duplicate values, causing collisions at best.
Worst case, you end up updating or deleting the wrong row, resulting in data loss. In fact, given that MD5 was broken over 20 years ago, one can calculate the values that would result in collisions. This has been used to hack systems in the past and even generate rogue CA certificates as far back as 2008.
And even worse, the concatenation expression :
Will create the same initial string for multiple different column values.
On top of that, given the random nature of hash values, this results in table fragmentation and an index that can't be used for range queries. Primary keys most of the time are clustered keys as well, which means they specify the order rows are stored on disk. Using essentially random values for a PK means new rows can be added at the middle or even the start of a table's data pages.
This also harms caching, as data is loaded from disk in pages. With a meaningful clustered key, it's highly likely that loading a specific row will also load rows that will be needed very soon. Loading eg 50 rows while paging may only need to load a single page. With an essentially random key, you could end up loading 50 pages.
Using a GUID generated with
NEWID()would provide a key value without collisions. UsingNEWSEQUENTIALID()would generate sequential GUID values eliminating fragmentation and once again allowing range searches.An even better solution would be to just create a PK from the two columns :
Or just add an IDENTITY-generated ID column. A
bigintis large enough to handle all scenarios :If the intention was to ignore spaces in column values there are better options:
CHECKconstraint can be added to each column to ensure the columns can't have leading or trailing spaces.INSTEAD OFtrigger can be used to trim them.Column1_Cleaned as TRIM(Column1) PERSISTED. Persisted columns can be used in indexes and primary keysAs for what it does:
MD5is deprecated)0x20bytes. A rather ... unusual way of padding data. I suspect whoever first wrote this wanted to pad the hash to 32 bytes but used some copy-pasta code without understanding the implications.You can check the results by hashing any value. The following queries
A space in ASCII is the byte
0x20. Casting to binary replaces spaces with0x20, not0x00If one wanted to pad a 16-byte value to 32 bytes, it would make more sense to use
0x00. The result is no better than the original thoughTo get a real 32-byte hash,
SHA2_256can be used :