How can I use dynamic data masking ro protect a column but still allow it to be used in joins?

735 Views Asked by At

This seems like a pretty common use-case. Let's say we have sensitive PII that we want to protect, such as SSNs. We mask that data using dynamic data masking in Snowflake. Now we have an engineer that is writing data transformations, and they need to join two tables using SSN. They don't have clearance to view the SSNs, but they can view the other information on both tables. I want the engineer to be able to join the two tables, and see all the combined unsecured data, while keeping the SSN secret from the engineer. I'm really not sure why Snowflake doesn't use real values for joins behind the scenes while refusing to return them in results. Is there a workaround?

One idea is to make the masking policy return a hash of the initial value. That has a couple of limitations. First, it is explicitly warned against in the Snowflake docs. Second, it requires runtime hashing of all the values, which slows down query execution seemingly needlessly. Third, there is the issue of hash collisions which could break joins. This could result in an engineer spending days working to track down a bug in their code, only to realize that the extra rows in their dataset are the result of a hash collision.

Another potential solution is using an external tokenization provider (docs). I don't understand this option well, but it appears that this would mean that I would need to store the actual values and their tokenized form with a third party service, then make an API call each time I wanted to use the values in a query. That seems less than ideal. I'd rather the solution be contained within Snowflake.

I'd love to hear any thoughts, thanks in advance.

1

There are 1 best solutions below

0
On

If you care about database integrity and avoid errors: Don't use SSNs as identifiers.

A SSN can be a property of a person, but don't use it as their primary key.

As the United States Social Security Administration says:

A 1990 OIG, HHS study indicated that 45% of organizations, both public and private, using SSNs make no effort to verify SSN accuracy. This leads to the real possibility that transfers of data from one organization to another could be inaccurate; computer matching of data between different organizations could be invalid; and innocent persons could be subjected to unwarranted intrusions into their privacy or improper changes in their benefits or services or even misidentified with serious results.

Also:

The SSN is the single most widely used record identifier for both government and the private sector, exerting a broad influence on the lives of most Americans. However, by itself, it is not a personal identifier because it lacks systematic assignment to every person and the means to authenticate a person's identity.

Instead you could create a unique id for each person within your database, and use that key for joins.