How to identify and redact all instances of a matching pattern in T-SQL

522 Views Asked by At

I have a requirement to run a function over certain fields to identify and redact any numbers which are 5 digits or longer, ensuring all but the last 4 digits are replaced with *

For example: "Some text with 12345 and 1234 and 12345678" would become "Some text with *2345 and 1234 and ****5678"

I've used PATINDEX to identify the the starting character of the pattern:

PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', TEST_TEXT)

I can recursively call that to get the starting character of all the occurrences, but I'm struggling with the actual redaction.

Does anyone have any pointers on how this can be done? I know to use REPLACE to insert the *s where they need to be, it's just the identification of what I should actually be replacing I'm struggling with.

Could do it on a program, but I need it to be T-SQL (can be a function if needed).

Any tips greatly appreciated!

3

There are 3 best solutions below

7
On BEST ANSWER

You can do this using the built in functions of SQL Server. All of which used in this example are present in SQL Server 2008 and higher.

DECLARE @String VARCHAR(500) = 'Example Input: 1234567890, 1234, 12345, 123456, 1234567, 123asd456'
DECLARE @StartPos INT = 1, @EndPos INT = 1;
DECLARE @Input VARCHAR(500) = ISNULL(@String, '') + ' '; --Sets input field and adds a control character at the end to make the loop easier.
DECLARE @OutputString VARCHAR(500) = ''; --Initalize an empty string to avoid string null errors

WHILE (@StartPOS <> 0)
BEGIN
    SET @StartPOS = PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @Input);
    IF @StartPOS <> 0
    BEGIN
        SET @OutputString += SUBSTRING(@Input, 1, @StartPOS - 1); --Seperate all contents before the first occurance of our filter
        SET @Input = SUBSTRING(@Input, @StartPOS, 500); --Cut the entire string to the end. Last value must be greater than the original string length to simply cut it all.

        SET @EndPos = (PATINDEX('%[0-9][0-9][0-9][0-9][^0-9]%', @Input)); --First occurance of 4 numbers with a not number behind it.
        SET @Input = STUFF(@Input, 1, (@EndPos - 1), REPLICATE('*', (@EndPos - 1))); --@EndPos - 1 gives us the amount of chars we want to replace.
    END
END
SET @OutputString += @Input; --Append the last element

SET @OutputString = LEFT(@OutputString, LEN(@OutputString))
SELECT @OutputString;

Which outputs the following:

Example Input: ******7890, 1234, *2345, **3456, ***4567, 123asd456

This entire code could also be made as a function since it only requires an input text.

3
On

A dirty solution with recursive CTE

DECLARE 
  @tags nvarchar(max) = N'Some text with 12345 and 1234 and 12345678',
  @c nchar(1) = N' ';
;
WITH Process (s, i)
as
(
SELECT @tags, PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @tags)
UNION ALL 
SELECT value,  PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', value)
FROM
(SELECT SUBSTRING(s,0,i)+'*'+SUBSTRING(s,i+4,len(s)) value
FROM Process
WHERE i >0) calc
  -- we surround the value and the string with leading/trailing ,
  -- so that cloth isn't a false positive for clothing
) 
SELECT * FROM Process
WHERE i=0

I think a better solution it's to add clr function in Ms SQL Server to manage regexp. sql-clr/RegEx

2
On

Here is an option using the DelimitedSplit8K_LEAD which can be found here. https://www.sqlservercentral.com/articles/reaping-the-benefits-of-the-window-functions-in-t-sql-2 This is an extension of Jeff Moden's splitter that is even a little bit faster than the original. The big advantage this splitter has over most of the others is that it returns the ordinal position of each element. One caveat to this is that I am using a space to split on based on your sample data. If you had numbers crammed in the middle of other characters this will ignore them. That may be good or bad depending on you specific requirements.

declare @Something varchar(100) = 'Some text with 12345 and 1234 and 12345678';

with MyCTE as
(
    select x.ItemNumber 
        , Result = isnull(case when TRY_CONVERT(bigint, x.Item) is not null then isnull(replicate('*', len(convert(varchar(20), TRY_CONVERT(bigint, x.Item))) - 4), '') + right(convert(varchar(20), TRY_CONVERT(bigint, x.Item)), 4) end, x.Item)
    from dbo.DelimitedSplit8K_LEAD(@Something, ' ') x
)
select Output = stuff((select ' ' + Result 
                        from MyCTE 
                        order by ItemNumber
                        FOR XML PATH('')), 1, 1, '')

This produces: Some text with *2345 and 1234 and ****5678