qdapRegex::rm_nchar_words returns different results when non-English letters are involved?


Please help me with the following confusion:

qdapRegex::rm_nchar_words("è ûé", "1,2")
[1] "è ûé"

qdapRegex::rm_nchar_words('k ku ppp d', "1,2")
[1] "ppp"

Why doesn't the first call return "" while the second one works as expected? What am I missing here? The only thing I can think of is that the string in the first call is built from non-English letters.

Any solution?


SteveS On BEST ANSWER

As mentioned by the author of the package:

It uses \w to define letters which is defined as [A-Za-z0-9_]. You would need to write your own custom regex to handle the non-ascii letters
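The difference between \w and a Unicode letter class can be checked in base R without the package. This is only a minimal sketch: it assumes the string is handled as UTF-8 (the letters are written as \u escapes to guarantee that) and that the regex is run with perl = TRUE. The variable name x is just for illustration.

```r
# "è ûé" written with Unicode escapes so the string is marked as UTF-8
x <- "\u00e8 \u00fb\u00e9"

# Does the ASCII-oriented word class find anything in the string?
has_w <- grepl("\\w", x, perl = TRUE)

# Does the Unicode "any letter" property find anything?
has_pL <- grepl("\\pL", x, perl = TRUE)

print(c(w = has_w, pL = has_pL))
```

If \w reports no match while \pL does, the accented letters are not being treated as word characters, which would explain why nothing is removed.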

UPDATE:

On my Win 7 machine the output is as expected.

One possible way to solve it is to use the pattern "[\\pL_]", which matches any letter in any language (plus the underscore):

rm_nchar_words("è ûé", "1,2", pattern = "[\\pL_]")
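For anyone who wants to see the same idea without qdapRegex, here is a rough base R sketch. The helper rm_short_words is hypothetical and is not the package's implementation; it assumes a "word" is a run of Unicode letters or underscores and that leftover whitespace should be collapsed and trimmed.

```r
# Hypothetical helper (not part of qdapRegex): drop words whose length is
# between min_len and max_len, where a word is a run of letters/underscores.
rm_short_words <- function(x, min_len = 1, max_len = 2) {
  # Lookarounds make sure the run is not part of a longer word
  pattern <- sprintf("(?<![\\pL_])[\\pL_]{%d,%d}(?![\\pL_])", min_len, max_len)
  out <- gsub(pattern, "", x, perl = TRUE)
  # Collapse the whitespace left behind and trim the ends
  trimws(gsub("\\s+", " ", out, perl = TRUE))
}

res_accented <- rm_short_words("\u00e8 \u00fb\u00e9")  # "è ûé"
res_ascii    <- rm_short_words("k ku ppp d")
print(res_accented)
print(res_ascii)
```

Because the letter class is \pL rather than \w, accented words are counted the same way as ASCII ones.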

Locale on Win machine:

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252  

I will keep investigating this and post updates to my answer.

UPDATE 2:

rm_nchar_words("è ûé", "1,2", pattern = "[\\pL_]")
""

This also works on my Ubuntu 18.04 machine.