Hive Regex is acting greedy


I want to match only 911 or 1911 from a string with any number of preceding or ending * or #.

My Regex:

[^0-9]\*[1-9]{3,4}[^0-9]*

The test code below returns true when I was expecting it to return false:

select Digits
from (select '*11911#' as Digits) A
where Digits rlike '[^0-9]\*[1-9]{3,4}[^0-9]*'

What am I doing wrong?

There is 1 best solution below

leftjoin On

BTW, when escaping in Hive you should use a double backslash (\\*) or a character class ([*]) to avoid unpredictable behavior: a single backslash sometimes works as an escape and sometimes does not, while a double backslash always works as an escape in Hive.

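This pattern, '[^0-9]\\*[1-9]{3,4}[^0-9]*', does not match: the * is correctly escaped, so it is a literal asterisk, and [^0-9] requires one non-digit character in front of it, but nothing precedes the * in the string.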
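The difference between the escaped and unescaped asterisk can be checked outside Hive. The sketch below uses Python's `re.search` as a stand-in for `rlike`, assuming both look for the pattern anywhere in the string. The patterns are written as the regex engine would see them, and the case where the single backslash is lost is an assumed scenario, not confirmed Hive behavior.

```python
import re

digits = "*11911#"

# Assumed scenario: the single backslash is lost, so * becomes a quantifier on [^0-9].
unescaped = r"[^0-9]*[1-9]{3,4}[^0-9]*"
# Correctly escaped: \* is a literal asterisk that must follow one non-digit character.
escaped = r"[^0-9]\*[1-9]{3,4}[^0-9]*"

unescaped_hit = re.search(unescaped, digits)
escaped_hit = re.search(escaped, digits)

print(unescaped_hit is not None)  # True: '*1191' satisfies the pattern
print(escaped_hit is not None)    # False: no non-digit is followed by a literal *
```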

Let's remove [^0-9] before \\* and check again:

This returns no rows:

select Digits
from (select '*11911#' as Digits) A
where Digits rlike '\\*[1-9]{3,4}[^0-9]'

The pattern '\\*[1-9]{3,4}[^0-9]+' does not match either: after the * and three or four digits, the next character in the string is still a digit.

And this matches:

'\\*[1-9]{3,4}[^0-9]*' 

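Because the * at the end means zero or more times, [^0-9]* is allowed to match nothing at all. The pattern therefore matches the literal * followed by four [1-9] digits (1191) and zero non-digits. rlike only needs to find the pattern somewhere in the string, so this partial match is enough.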
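The three variants can be compared side by side. This is a minimal sketch in Python, again assuming `re.search` behaves like `rlike` for these simple patterns; the dictionary labels are only illustrative.

```python
import re

digits = "*11911#"

# The same prefix with three different quantifiers on the trailing [^0-9].
patterns = {
    "one non-digit": r"\*[1-9]{3,4}[^0-9]",
    "one or more non-digits": r"\*[1-9]{3,4}[^0-9]+",
    "zero or more non-digits": r"\*[1-9]{3,4}[^0-9]*",
}

# Record the matched substring for each pattern, or None when nothing matches.
results = {}
for label, pattern in patterns.items():
    hit = re.search(pattern, digits)
    results[label] = hit.group() if hit else None
    print(label, "->", results[label])
```

Only the last variant matches, and the matched substring is `*1191`: the trailing digit and the `#` are simply left outside the match.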

It works the same on regex101: the trailing * is what makes it match.
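One way to get the behavior asked for in the question is to anchor the pattern so that the whole string has to match. The sketch below is only a suggestion and is checked in Python rather than Hive; the pattern and the helper name `is_911_or_1911` are hypothetical and not part of the original answer.

```python
import re

# Hypothetical pattern: any number of * or #, then 911 or 1911, then any number of * or #.
# ^ and $ force the whole string to match instead of just a substring.
anchored = r"^[*#]*1?911[*#]*$"

def is_911_or_1911(digits: str) -> bool:
    """Return True when digits is 911 or 1911 wrapped only in * or # characters."""
    return re.search(anchored, digits) is not None

for sample in ["*11911#", "*911#", "1911", "##1911**", "9111"]:
    print(sample, "->", is_911_or_1911(sample))
```

Putting the * inside a character class ([*#]) means no backslash is needed at all, which sidesteps the escaping question. In Hive the same pattern would presumably be written as rlike '^[*#]*1?911[*#]*$', though that has not been verified here.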