Unicode regex expression to POSIX expression conversion to support in Redshift/ Postgresql

466 Views Asked by At

I've been trying to convert a unicode regex to POSIX regex to remove \p{So} , \p{Cs}, \p{Cn} and \x1A type of characters from a column.

In Informatica I was using reg_replace (col_name,'[\p{So}\p{Cs}\p{Cn}\x1A]',' ') function to filter out these kind of characters and replace them with a single space.

However, when migrating to RedShift the same regex is not working with REGEXP_REPLACE() as it does not support the Unicode blocks.

For e.g. this is the string :-

'INT¡®L n°1 di KONGRESSE Ä Ê Í EUROREGIONE. V.Ž?#'

From above string, I need to keep the Latin / French letters, alphanumeric & punctuations and just remove the other symbols.

1

There are 1 best solutions below

4
Wiktor Stribiżew On

You can use

[^À-Ž[:punct:][:alnum:]]

It matches any single char other than letters from À to Ž, punctuation and alphanumerics. If [:alnum:] is Unicode-aware in your environment, you may remove À-Ž.

It also appears that you still can use a PCRE regex with Amazon Redshift REGEXP_REPLACE provided you pass the p parameters option:

REGEXP_REPLACE(col_name,'[\\p{So}\\p{Cs}\\p{Cn}\\x1A]+',' ', 1, 'p')

The backslashes need doubling, 1 param is the position in string to start matching from and the p option tells REGEXP_REPLACE to use PCRE regex syntax.