I'm using PowerShell 7 to work with the Wikipedia enwik9 1 GB UTF-8 text file. I have no experience with Unicode/UTF-8. I've captured the offsets and values of the non-ASCII characters into a dict, and they seem to come in groups of 2, 4, and 6 when I use the code below and increment $i++.
- Is $line.Length valid for this string?
- When $i is at a multibyte char and the loop moves to the next iteration, is $i still a valid index?
- How do I know how many "chars" one such character occupies? Is it Substring($i,1), Substring($i,2), or maybe Substring($i,6)?
$text = Get-Content 'enwik9.txt' -Raw -Encoding utf8
$line = $text.Substring(0, 10000000)  # first 10,000,000 chars ($i was not set yet at this point)
$total_cnt = 0
for ($i = 0; $i -lt $line.Length; $i++) {
    $total_cnt++
    $s = $line.Substring($i, 1)
    $n = [int][char]$s  # I wanted [byte][char] here
    if ($n -ge 128) {
        # Now $n is not what I want: it is not ASCII and can be > 255 (a Unicode/multibyte character)
    }
}
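For anyone with the same three questions, here is a minimal sketch of how they can be checked on a small string. It assumes the text was decoded correctly and is well-formed (no unpaired surrogates). The sample string and the names $sample, $rows, and $piece are made up for illustration. A .NET string is a sequence of UTF-16 code units, so .Length counts those units, not UTF-8 bytes. One character takes either 1 or 2 [char] values, never 6.

```powershell
# Hypothetical sample: 'a', U+00E9, U+20AC, U+1F600 (1, 2, 3 and 4 bytes in UTF-8)
$sample = 'a' + [char]0x00E9 + [char]0x20AC + [char]::ConvertFromUtf32(0x1F600)
$utf8   = [System.Text.Encoding]::UTF8

$rows = for ($i = 0; $i -lt $sample.Length; ) {
    # A surrogate pair means this code point occupies two [char] values
    $len   = if ([char]::IsSurrogatePair($sample, $i)) { 2 } else { 1 }
    $piece = $sample.Substring($i, $len)
    [pscustomobject]@{
        Index      = $i
        CodePoint  = [char]::ConvertToUtf32($sample, $i)
        Utf16Units = $len
        Utf8Bytes  = $utf8.GetByteCount($piece)
    }
    $i += $len  # step over the whole code point, not always by 1
}
$rows | Format-Table
```

So Substring($i, 1) is right except at a surrogate pair, where it has to be Substring($i, 2). The number of UTF-8 bytes is a separate question and comes from the encoder, not from the string index.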
I was able to answer my own questions and find a working solution based on the information on this page: Ã © and other codes
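As far as I understand it, codes like "Ã©" appear when UTF-8 bytes are decoded with a single-byte encoding, so each byte of a multibyte character becomes its own char. The sketch below is only an illustration of that effect, not the solution from that page. It assumes ISO-8859-1 as the wrong encoding, and the variable names are mine.

```powershell
$utf8   = [System.Text.Encoding]::UTF8
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')

$original = [string][char]0x00E9          # one char, U+00E9
$bytes    = $utf8.GetBytes($original)     # two bytes: 0xC3 0xA9
$garbled  = $latin1.GetString($bytes)     # two chars, U+00C3 U+00A9, displayed as "Ã©"

# Reversing the wrong decode step recovers the original character
$repaired = $utf8.GetString($latin1.GetBytes($garbled))

'{0} byte(s) -> {1} char(s) when misread, repaired: {2}' -f $bytes.Length, $garbled.Length, ($repaired -ceq $original)
```

If a file were read this way, every non-ASCII character would show up as a run of two or more values of 128 or above, which may be one way to end up with the kind of grouping described above.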