I'm trying to match all high-ASCII and special UTF-8 characters using PowerShell:
gc $file -readcount 0 | select-string -allmatches -pattern "[\x80-\uffff]"
which should find all the characters I want. However, the regular expression seems to be failing, as it's matching the characters "i" and "I".
I ran this to test and I'm baffled:
"abcdefghijklmnopqrstuvwxyz" | select-string -allmatches -pattern "[\x80-\uffff]"
Why is it matching "i"? What I also don't get is that if you cast the character to an int, the value is 105, which is clearly not within the range specified.
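The reason is that "i" is matched on U+0130 (İ, "Latin Capital Letter I with dot above"), a variant of capital "I" found in Turkish. Select-String matches case-insensitively by default, and U+0130 lies inside the range [\x80-\uffff], so the engine can treat "i" as a case variant of a character in that range.
Try with an inverted pattern: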
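A minimal sketch of one way to read this, assuming "inverted" means negating the ASCII range rather than listing the high range, and matching case-sensitively so that case folding cannot pull ASCII letters in. The variable names and the sample string are hypothetical.

```powershell
# Hypothetical samples: the plain alphabet, and a word ending in U+00E9,
# built from its code point so the script file's encoding does not matter.
$alphabet = 'abcdefghijklmnopqrstuvwxyz'
$sample   = 'caf' + [char]0xE9

# Negated ASCII class, matched case-sensitively with -cmatch.
$asciiHit  = $alphabet -cmatch '[^\x00-\x7f]'
$sampleHit = $sample   -cmatch '[^\x00-\x7f]'

# The same idea with Select-String, which needs -CaseSensitive to turn off
# its default case-insensitive matching.
$found = $sample | Select-String -AllMatches -CaseSensitive -Pattern '[^\x00-\x7f]'
$codes = @($found.Matches | ForEach-Object { [int][char]$_.Value })
```

With these inputs, `$asciiHit` is False, `$sampleHit` is True, and `$codes` holds the single code point 233 (U+00E9).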
Here is how I found out:
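A sketch of how such a search could be done, not necessarily the approach originally used: test every code point in the range against a case-insensitive pattern for "i" and collect the ones that match. Which code points turn up may depend on the .NET version and the current culture, so no output is shown. `$rx` and `$hits` are illustrative names.

```powershell
# Case-insensitive regex for the letter "i"; (?i) is the inline IgnoreCase option.
$rx = [regex]'(?i)i'

# Walk the code points covered by [\x80-\uffff] and keep those that the
# engine treats as a case variant of "i".
$hits = @()
for ($cp = 0x80; $cp -le 0xFFFF; $cp++) {
    if ($rx.IsMatch([string][char]$cp)) {
        $hits += $cp
    }
}

# Print each hit as U+XXXX alongside the character itself.
$hits | ForEach-Object { 'U+{0:X4}  {1}' -f $_, [char]$_ }
```

Every entry in `$hits` is a character above 0x7F that a case-insensitive match considers equivalent to "i", which is exactly the kind of character that makes the original range match plain ASCII text.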