How to iterate over grapheme clusters in Crystal?


The Unicode standard defines a grapheme cluster as an algorithmic approximation to a "user-perceived character". A grapheme cluster more or less corresponds to what people think of as a single "character" in text. Therefore it is a natural and important requirement in programming to be able to operate on strings as sequences of grapheme clusters.

The best general-purpose definition is the extended grapheme cluster; tailored grapheme clusters also exist, which adapt the algorithm for specific locales or use cases.

In Crystal, how can I iterate over (or otherwise operate on) a String as a sequence of grapheme clusters?


This answer is based on a thread in the Crystal forum.

As of Crystal 1.0.0, the standard library unfortunately has no dedicated API for this.

However, Crystal's regex engine does support it through the \X pattern, which matches a single extended grapheme cluster:

"\u0067\u0308\u1100\u1161\u11A8".scan(/\X/) do |match|
  grapheme = match[0]
  puts grapheme
end

# Output:
# g̈
# 각
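
To see why this matters, here is a minimal sketch comparing the code point count with the grapheme cluster count for the same string. It assumes only String#size and String#scan from the standard library; the variable names are purely illustrative.

```crystal
# Same string as above: "g" + combining diaeresis, followed by three
# Hangul jamo that together form a single syllable block.
s = "\u0067\u0308\u1100\u1161\u11A8"

codepoint_count = s.size           # String#size counts Unicode code points
grapheme_count = s.scan(/\X/).size # one regex match per grapheme cluster

puts codepoint_count # 5
puts grapheme_count  # 2
```

The string holds five code points but only two user-perceived characters, which is the gap that \X closes.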


You can wrap this up in a nicer API as follows:

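# Yields each extended grapheme cluster of `s` as a String.
def each_grapheme(s : String, &)
  s.scan(/\X/) do |match|
    # match[0] is the full text matched by \X, i.e. one cluster.
    yield match[0]
  end
end

# Returns all extended grapheme clusters of `s` as an array.
def graphemes(s : String) : Array(String)
  result = [] of String
  each_grapheme(s) { |g| result << g }
  result
end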

# Example from https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html
s = "\u{E9}\u{65}\u{301}\u{D55C}\u{1112}\u{1161}\u{11AB}"
each_grapheme(s) do |g|
  puts "#{g}\t#{g.codepoints}"
end

# Output:
# é [233]
# é    [101, 769]
# 한 [54620]
# 한   [4370, 4449, 4523]
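
As one example of operating on a string cluster by cluster, here is a rough sketch of a grapheme-aware reverse built on the same \X approach. The helper name reverse_graphemes is hypothetical and not part of Crystal or of the forum thread; it relies only on String#scan, Array#reverse and Array#join.

```crystal
# Hypothetical helper (an assumption for illustration): reverses a string
# one grapheme cluster at a time, so combining marks stay attached to
# their base character.
def reverse_graphemes(s : String) : String
  s.scan(/\X/).map(&.[0]).reverse.join
end

s = "\u0067\u0308\u1100\u1161\u11A8"
reversed = reverse_graphemes(s)

# The Hangul cluster now comes first, and each cluster keeps its
# internal code point order.
p reversed.codepoints # [4352, 4449, 4520, 103, 776]
```

Reversing code point by code point instead would put the combining diaeresis in front of the "g" and scramble the jamo order, which is why the cluster boundary is the right unit here.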
