How to retrieve the first “complete” character of a []rune?


I am trying to write a function

func Anonymize(name string) string

that anonymizes names. Here are some examples of pairs of input and output so you get an idea of what it is supposed to do:

Müller → M.
von der Linden → v. d. L.
Meyer-Schulze → M.-S.

This function is supposed to work with names composed of arbitrary characters. While implementing it, I ran into the following question:

Given a []rune or string, how do I figure out how many runes I have to take to get a complete character, complete in the sense that all modifiers and combining accents belonging to the character are included? For instance, if the input is []rune{0x0041, 0x0308, 0x0066, 0x0067} (corresponding to the string Äfg, where Ä is represented as an A followed by a combining diaeresis), the function should return 2 because the first two runes form the first character, Ä. If I took only the first rune, I would get A, which is incorrect.

I need an answer to this question because the name I want to anonymize might begin with an accented character and I don't want to remove the accent.

Best answer

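You can try the following function (inspired by "Go language string length"), which needs the regexp package:

// firstGrapheme matches a non-mark rune plus any marks that follow it, or else any single rune.
var firstGrapheme = regexp.MustCompile(`\PM\pM*|.`)

// FirstGraphemeLen returns the number of runes in the first grapheme of str, or 0 if str is empty.
func FirstGraphemeLen(str string) int {
    return len([]rune(firstGrapheme.FindString(str)))
}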

See this example:

r := []rune{0x0041, 0x0308, 0x0066, 0x0041, 0x0308, 0x0067}
s := string(r)
fmt.Println(s, len(r), FirstGraphemeLen(s))

Output:

ÄfÄg 6 2

The string consists of 6 runes, but its first grapheme spans only 2 of them.


The OP, FUZxxl, took another approach based on unicode.IsMark(r):

IsMark reports whether the rune is a mark character (category M).
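As a minimal sketch of that idea applied directly to a []rune, the count asked for in the question can be obtained by taking the first rune and then every following rune for which unicode.IsMark returns true. The helper name firstCharLen is hypothetical, and the sketch only accounts for combining marks (category M), not for every grapheme-cluster rule that Unicode defines.

```go
package main

import (
	"fmt"
	"unicode"
)

// firstCharLen is a hypothetical helper: it returns how many runes make up
// the first character of rs, counting the base rune plus every mark
// (Unicode category M) that directly follows it. It returns 0 for empty input.
func firstCharLen(rs []rune) int {
	if len(rs) == 0 {
		return 0
	}
	n := 1
	for n < len(rs) && unicode.IsMark(rs[n]) {
		n++
	}
	return n
}

func main() {
	// A + combining diaeresis, then f and g.
	rs := []rune{0x0041, 0x0308, 0x0066, 0x0067}
	fmt.Println(firstCharLen(rs)) // prints 2
}
```

Working on the rune slice avoids the regular expression entirely, at the cost of handling only the "base rune plus marks" case.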

The source (from FUZxxl's play.golang.org) includes:

// take one character including all modifiers from the last name
r, _, err := ln.ReadRune()
if err != nil {
    /* ... */
}

aln = append(aln, r)

for {
    r, _, err = ln.ReadRune()
    if err != nil {
        goto done
    }

    if !unicode.IsMark(r) {
        break
    }

    aln = append(aln, r)
}

aln = append(aln, '.')
/* ... */
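Putting the pieces together, here is one possible sketch of the Anonymize function from the question, built on the same unicode.IsMark check. It is not FUZxxl's actual code. It assumes that name parts are separated only by spaces or hyphens, which are copied to the output unchanged, and that a "character" is a base rune plus the marks that follow it.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Anonymize abbreviates every part of name to its first character
// (base rune plus any combining marks) followed by a period.
// Assumption: parts are separated only by ' ' or '-'.
func Anonymize(name string) string {
	var b strings.Builder
	rs := []rune(name)
	for i := 0; i < len(rs); {
		// Separators are copied through unchanged.
		if rs[i] == ' ' || rs[i] == '-' {
			b.WriteRune(rs[i])
			i++
			continue
		}
		// Keep the base rune and every mark attached to it.
		b.WriteRune(rs[i])
		i++
		for i < len(rs) && unicode.IsMark(rs[i]) {
			b.WriteRune(rs[i])
			i++
		}
		b.WriteByte('.')
		// Skip the remainder of this part.
		for i < len(rs) && rs[i] != ' ' && rs[i] != '-' {
			i++
		}
	}
	return b.String()
}

func main() {
	names := []string{"Müller", "von der Linden", "Meyer-Schulze", "A\u0308rmel"}
	for _, n := range names {
		fmt.Println(Anonymize(n))
	}
}
```

For the three examples in the question this yields M., v. d. L. and M.-S.; for a name that starts with A followed by a combining diaeresis, both runes are kept before the period.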