Match high ASCII characters (but not the letter i)

I'm trying to match all high-ASCII and other non-ASCII (UTF-8) characters using PowerShell:

gc $file -readcount 0 | select-string -allmatches -pattern "[\x80-\uffff]"

This should find all the characters I want. However, the regular expression seems to be failing, as it also matches the characters "i" and "I".

I ran this to test and I'm baffled:

"abcdefghijklmnopqrstuvwxyz" | select-string -allmatches -pattern "[\x80-\uffff]"

Why is it matching "i"? What I also don't get is that if you cast the character to an int, the value is 105, which is clearly not within the specified range.

There are 2 best solutions below

BEST ANSWER

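The reason is that PowerShell regex matching is case-insensitive by default, and under case-insensitive comparison i is matched by U+0130 (İ, "Latin Capital Letter I with dot above"), a variant of capital I used in Turkish, which falls inside your range: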

PS C:\> 'i' -match '[\u0130]'
True
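
A minimal sketch of why case handling matters here, assuming only that -match ignores case by default and that -cmatch is its case-sensitive counterpart. The variable names are illustrative. The first result is the one reported in the question and may vary with the .NET version and current culture, so no value is claimed for it.

```powershell
$pattern = '[\x80-\uffff]'

# Case-insensitive (the default): 'i' may be folded onto U+0130, which is in range.
$insensitive = 'i' -match $pattern

# Case-sensitive: only the literal code point is compared.
$sensitiveI = 'i' -cmatch $pattern                    # False, 0x69 is below 0x80
$sensitiveE = ([string][char]0xE9) -cmatch $pattern   # True, U+00E9 is in range

"default: $insensitive; case-sensitive i: $sensitiveI; case-sensitive U+00E9: $sensitiveE"
```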

Try an inverted pattern that excludes the whole ASCII range (0x00-0x7F) instead:

"abcdefghijklmnopqrstuvwxyz" | Select-String -AllMatches -Pattern "[^\x00-\x7F]"

Here is how I found out:

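# Test every code point in the range against 'i' (-match is case-insensitive)
0x80..0xffff | ForEach-Object {
    $CharCode = $_.ToString("X4")   # four hex digits, e.g. 0130
    if ('i' -match "[\u$CharCode]") {
        "U+$CharCode matches"
    }
}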

SECOND ANSWER

Making the match case-sensitive is another workaround for this weird Turkish İ behavior. (The capital letter has a little dot on top.)

"abcdefghijklmnopqrstuvwxyz" | 
  select-string -allmatches -pattern "[\x80-\uffff]" -casesensitive
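
To see exactly which characters were flagged, the match objects can be expanded into code points. A rough sketch, assuming a made-up sample string ($text is hypothetical) built from [char] casts so that the script does not depend on the file's encoding:

```powershell
# Sample text containing U+00EF and U+00E9 plus ordinary ASCII i and I.
$text = "na$([char]0xEF)ve caf$([char]0xE9), i and I"

$hits = $text |
    Select-String -AllMatches -CaseSensitive -Pattern '[\x80-\uffff]' |
    ForEach-Object { $_.Matches } |
    ForEach-Object { 'U+{0:X4}' -f [int][char]$_.Value }

$hits   # one line per non-ASCII character found
```

With -CaseSensitive in place, the plain i and I in the sample are left alone and only the two accented letters are listed.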

Or use this instead, although without -casesensitive the letter i (lowercase or capital) would still be matched:

# \P{IsBasicLatin}: any character outside U+0000-U+007F (not 0-127)
"abcdefghijklmnopqrstuvwxyz" | 
  select-string -allmatches -pattern "\P{IsBasicLatin}" -casesensitive

The lowercase form of İ is considered to be the plain ASCII small letter i, but the mapping doesn't go the other way (in culture en-US): i uppercases to the ordinary I.

'İ'.tolower()     
i

'i'.toupper()
I

The Kelvin sign (K, U+212A) also seems problematic. Its lowercase is a regular small 'k', so it's taken as ASCII when case is ignored. I'm not sure why it behaves differently from the Turkish İ.

[char]0x212a | select-string '\P{IsBasicLatin}' # no output
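
For comparison, calling the .NET Regex class directly sidesteps the issue, since it is case-sensitive unless RegexOptions.IgnoreCase is passed. A small sketch with a hypothetical test string containing i, I, İ (U+0130) and the Kelvin sign (U+212A):

```powershell
# ASCII i and I, followed by U+0130 and U+212A, separated by spaces.
$text = "i I $([char]0x130) $([char]0x212A)"

# [regex]::Matches does not ignore case by default, so no case folding occurs.
$found = [regex]::Matches($text, '\P{IsBasicLatin}') |
    ForEach-Object { 'U+{0:X4}' -f [int][char]$_.Value }

$found   # U+0130 and U+212A; the ASCII i and I are not reported
```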