SHA1 value for Latin characters in SQL Server not matching the SHA1 from Snowflake


I am trying to match the SHA1 hashes of certain table data between SQL Server and Snowflake.

I computed the SHA1 of a Latin character in SQL Server as follows:

select sys.fn_varbintohexsubstring(0, HASHBYTES('SHA1', cast('á' as varchar(1))), 1, 0)

This returns b753d636f6ee46bb9242d01ff8b61f715e9a88c3

The SHA1 function in Snowflake returns a different value for the same character:

select sha1(cast('á' as varchar))

This returns 2b9cc8d86a48fd3e4e76e117b1bd08884ec9691d.

Note - The datatype in SQL Server is nvarchar, while the datatype in Snowflake is varchar with the default collation. For English characters, the SHA1 values match after casting nvarchar to varchar; however, this is not the case for Latin (accented) characters.
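The two digests above can be reproduced outside either database, which suggests the difference is only in which bytes each engine feeds into SHA1. A short Python sketch (standard library only; the encodings are my assumption about what each engine actually stores):

```python
import hashlib

s = 'á'  # U+00E1

# SQL Server varchar under a Latin1 collation stores the single code-page byte 0xE1
latin1_digest = hashlib.sha1(s.encode('latin-1')).hexdigest()
print(latin1_digest)  # b753d636f6ee46bb9242d01ff8b61f715e9a88c3

# Snowflake stores strings as UTF-8, so 'á' becomes the two bytes 0xC3 0xA1
utf8_digest = hashlib.sha1(s.encode('utf-8')).hexdigest()
print(utf8_digest)    # 2b9cc8d86a48fd3e4e76e117b1bd08884ec9691d
```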

Is there a way to match SHA1 values for non-English characters? I need to get the value 2b9cc8d86a48fd3e4e76e117b1bd08884ec9691d in SQL Server 2017 and below, since that is what other databases such as Oracle, Snowflake, and Hive return.

Thanks

Answer by Roger Wolf

TL;DR: never use varchar when calculating hashes. There are simply too many pitfalls you can run into in the process.

Just as an example, I adapted your code for easier understanding and ran it in the context of a database whose default collation is Latin1_General_100_CI_AS:

declare @a nchar(1) = N'á';
declare @b char(1) = cast(@a as char(1));

select @b as [Char], ascii(@b) as [A], unicode(@b) as [U], HASHBYTES('SHA1',@b) as [Hash]
union all
select @a, ascii(@a), unicode(@a), HASHBYTES('SHA1',@a);

The result is:

Char    A    U Hash
---- ---- ---- ------------------------------------------
á     225  225 0xB753D636F6EE46BB9242D01FF8B61F715E9A88C3
á     225  225 0xA4BCF633D5ECCD3F2A55CD0AD3D109A108A45F02

However, if I change the database context to another DB, with the Cyrillic_General_100_CI_AS collation, the same code suddenly returns different values:

Char    A    U Hash
---- ---- ---- ------------------------------------------
a      97   97 0x86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8
á      97  225 0xA4BCF633D5ECCD3F2A55CD0AD3D109A108A45F02

As you can see, the [Char] in the first row is now a different character (a plain small Latin "a"): the cast to char went through the Cyrillic code page, which has no "á". This kind of implicit code-page adjustment cannot be prevented unless your data is in Unicode, or in a binary form.
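The second hash in both result sets is simply the digest of the UTF-16LE bytes SQL Server stores for nchar/nvarchar. A one-line Python check (assuming 0xE1 0x00 as the stored byte pair for U+00E1):

```python
import hashlib

# nchar/nvarchar values are stored as UTF-16LE, so 'á' (U+00E1) is the byte pair 0xE1 0x00
nchar_digest = hashlib.sha1('á'.encode('utf-16-le')).hexdigest()
print(nchar_digest)  # a4bcf633d5eccd3f2a55cd0ad3d109a108a45f02
```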


Your options

  1. Upgrade to MS SQL Server 2019, or move to Azure SQL Database. Starting with that version, you can actually store strings in UTF-8 encoding (via the _UTF8 collations), although you'll probably take a performance hit for it (whether it's noticeable depends on your usage patterns).
  2. Calculate hashes externally (that is, not in SQL). You can write a CLR function in C#, or something similar in Java (see Elliott Brossard's answer). This will increase the complexity of your solution, and putting external code into your database might not be allowed by your company's policies. Plus, maintaining external assemblies is usually a hassle.
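For option 2, the external code only has to hash the UTF-8 bytes of the string to reproduce what Snowflake returns for the asker's example. A minimal Python sketch (the function name is illustrative):

```python
import hashlib

def sha1_utf8(text: str) -> str:
    """SHA1 over the UTF-8 bytes of a string, matching Snowflake's sha1()."""
    return hashlib.sha1(text.encode('utf-8')).hexdigest()

print(sha1_utf8('á'))  # 2b9cc8d86a48fd3e4e76e117b1bd08884ec9691d
```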
Answer by Elliott Brossard

You can compute SHA1 hashes of Latin-1 strings in Snowflake using a Java UDF. Here is an example:

create function latin1sha1(str varchar)
returns varbinary
language java
handler = 'Latin1Sha1.compute'
as $$
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Latin1Sha1 {
  public byte[] compute(String str) throws NoSuchAlgorithmException, UnsupportedEncodingException {
    MessageDigest hash = MessageDigest.getInstance("SHA-1");
    hash.update(str.getBytes("ISO-8859-1"));  // AKA Latin-1
    return hash.digest();
  }
}
$$;

select hex_encode(latin1sha1('á'));

This returns B753D636F6EE46BB9242D01FF8B61F715E9A88C3, the same value SQL Server produces for the varchar cast.