How can I iterate over a string by runes in Go?

I wanted to do this:

for i := 0; i < len(str); i++ {
    dosomethingwithrune(str[i]) // takes a rune
}

But it turns out that str[i] has type byte (uint8) rather than rune.

How can I iterate over the string by runes rather than bytes?

There are 5 best solutions below

0
On BEST ANSWER

See this example from Effective Go:

for pos, char := range "日本語" {
    fmt.Printf("character %c starts at byte position %d\n", char, pos)
}

This prints:

character 日 starts at byte position 0
character 本 starts at byte position 3
character 語 starts at byte position 6

For strings, the range does more work for you, breaking out individual Unicode code points by parsing the UTF-8.
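To make the decoding step concrete, here is a minimal sketch that collects what range yields for a string. The helper names runeOffsets and runesOf are made up for this illustration. The second call relies on the behaviour described in the Go spec for range over strings: a byte that is not valid UTF-8 yields U+FFFD and the loop advances by one byte.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper for this illustration: it returns the
// byte offset at which each rune of s starts, as reported by range.
func runeOffsets(s string) []int {
	var offsets []int
	for pos := range s {
		offsets = append(offsets, pos)
	}
	return offsets
}

// runesOf is a hypothetical helper that collects the runes range decodes from s.
func runesOf(s string) []rune {
	var rs []rune
	for _, r := range s {
		rs = append(rs, r)
	}
	return rs
}

func main() {
	fmt.Println(runeOffsets("日本語")) // [0 3 6]

	// "\xff" is not valid UTF-8, so range yields U+FFFD for that byte.
	for _, r := range runesOf("a\xffb") {
		fmt.Printf("%U\n", r)
	}
}
```

Note that the index from range is a byte offset, not a rune count, which is why the positions for "日本語" step by 3.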

0
On

To mirror an example given at golang.org, Go allows you to easily convert a string to a slice of runes and then iterate over that, just like you wanted to originally:

runes := []rune("Hello, 世界")
for i := 0; i < len(runes); i++ {
    fmt.Printf("Rune %v is '%c'\n", i, runes[i])
}

Of course, we could also use a range loop like in the other examples here, but this more closely follows your original syntax. In any case, this will output:

Rune 0 is 'H'
Rune 1 is 'e'
Rune 2 is 'l'
Rune 3 is 'l'
Rune 4 is 'o'
Rune 5 is ','
Rune 6 is ' '
Rune 7 is '世'
Rune 8 is '界'

Note that since the rune type is an alias for int32, we must use %c instead of the usual %v in the Printf statement, or we will see the integer representation of the Unicode code point (see A Tour of Go).
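As a small sketch of that difference, the following formats the same rune with several fmt verbs. The helper name describe is hypothetical; the verbs %v, %c, %q and %U are the standard fmt ones.

```go
package main

import "fmt"

// describe is a hypothetical helper that formats one rune with several verbs:
// %v prints the integer value, %c the character, %q the quoted character,
// and %U the Unicode notation.
func describe(r rune) string {
	return fmt.Sprintf("%v %c %q %U", r, r, r, r)
}

func main() {
	fmt.Println(describe('世')) // 19990 世 '世' U+4E16
	fmt.Println(describe('A')) // 65 A 'A' U+0041
}
```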

0
On

For example:

package main

import "fmt"

func main() {
        // i is the byte offset of each rune; r avoids shadowing the built-in type name rune.
        for i, r := range "Hello, 世界" {
                fmt.Printf("%d: %c\n", i, r)
        }
}


Output (the index is the byte offset of each rune, so it jumps from 7 to 10 after the 3-byte 世):

0: H
1: e
2: l
3: l
4: o
5: ,
6:  
7: 世
10: 界
0
On

You can check the doc.

rune is an alias for the int32 type:

type rune = int32

String literals in Go source code are encoded in UTF-8, which stores each Unicode code point in one to four bytes:

UTF-8 byte sequence                   code point range
0xxxxxxx                              0 to 127
110xxxxx 10xxxxxx                     128 to 2047
1110xxxx 10xxxxxx 10xxxxxx            2048 to 65535
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   65536 to 1114111 (0x10FFFF)

As you can see, this encoding takes 1 to 4 bytes per code point. A rune holds the decoded code point itself, not its UTF-8 bytes. Code points go up to 0x10FFFF, which needs 21 bits, so a 32-bit integer (rune = int32) is wide enough for any of them.
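The byte counts in the table can be checked with utf8.RuneLen from the standard library. This is only a sketch; the wrapper name encodedLen and the sample characters are chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen is a hypothetical wrapper: it reports how many bytes UTF-8 needs
// to encode r, or -1 if r is not a value that can be encoded.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample per row of the table: 1, 2, 3 and 4 bytes.
	for _, r := range []rune{'A', '\u00e9', '世', '\U0001F600'} {
		fmt.Printf("%U needs %d byte(s)\n", r, encodedLen(r))
	}
}
```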

From the encoding table above you can see that if your string contains only ASCII characters, the number of runes equals the number of bytes, because each ASCII character is encoded in a single byte. That is no longer true once the string contains non-ASCII characters:

import (
    "fmt"
    "unicode/utf8"
)

func countRunes() {
    s := "Hello, 世界"
    fmt.Println(len(s))                    // 13: length in bytes
    fmt.Println(utf8.RuneCountInString(s)) // 9: number of runes
}

Strings are built from bytes, so indexing them yields bytes, not characters. To access the characters (runes), you can use one of the options below:

  1. The built-in way, a range loop over the string:
for i, r := range "Hello, 世界" {
    fmt.Printf("%d\t%q\t%d\n", i, r, r) // byte offset, quoted rune, code point
}
  2. By converting the string to a slice of runes:
runes := []rune("Hello, 世界")
for i := 0; i < len(runes); i++ {
    fmt.Printf("Rune : '%c'\n", runes[i])
}
  3. By using the standard library package unicode/utf8:
s := "Hello, 世界"
for i := 0; i < len(s); {
    r, size := utf8.DecodeRuneInString(s[i:]) // decode the rune that starts at byte i
    fmt.Printf("%d\t%c\n", i, r)
    i += size // advance by the rune's width in bytes
}
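To compare the three options side by side, here is a minimal sketch that wraps each one in a function and checks that they decode the same runes. The function names viaRange, viaConversion and viaDecode are made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// viaRange collects runes with a range loop over the string.
func viaRange(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

// viaConversion converts the whole string to a rune slice in one step.
func viaConversion(s string) []rune {
	return []rune(s)
}

// viaDecode walks the string manually with utf8.DecodeRuneInString.
func viaDecode(s string) []rune {
	var out []rune
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		i += size
	}
	return out
}

func main() {
	s := "Hello, 世界"
	fmt.Println(len(viaRange(s)), len(viaConversion(s)), len(viaDecode(s))) // 9 9 9
	fmt.Println(string(viaRange(s)) == s, string(viaDecode(s)) == s)        // true true
}
```

As a rough guide, the range loop and the manual decode walk the string in place, while the []rune conversion generally allocates a new slice but lets you index by rune position.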
0
On

Alternatively, here is an example that doesn't use the fmt package:

package main

func main() {
    for _, r := range "Hello, 世界" {
        println(string(r)) // the builtin println writes to standard error
    }
}

In the loop, the variable r holds the current rune. We convert it to a string with the string(r) conversion before printing it.
