"d̪".chars.to_a
gives me
["d"," ̪"]
How do I get Ruby to split it by graphemes?
["d̪"]
Use Unicode::text_elements from the unicode gem, which is documented at http://www.yoshidam.net/unicode.txt.
irb(main):001:0> require 'unicode'
=> true
irb(main):006:0> s = "abčd̪é"
=> "abčd̪é"
irb(main):007:0> s.chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):009:0> Unicode.nfc(s).chars.to_a
=> ["a", "b", "č", "d", "̪", "é"]
irb(main):010:0> Unicode.nfd(s).chars.to_a
=> ["a", "b", "c", "̌", "d", "̪", "e", "́"]
irb(main):017:0> Unicode.text_elements(s)
=> ["a", "b", "č", "d̪", "é"]
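The session above depends on the unicode gem. As a hedged sketch of the same point using only the standard library (this swaps in String#unicode_normalize, which assumes Ruby 2.2 or later, and is not the gem's API), normalization alone cannot merge "d" with the combining bridge below, because Unicode has no precomposed character for that pair:

```ruby
# "d" followed by U+032A COMBINING BRIDGE BELOW, written with an escape
# so the example does not depend on how the source file is encoded.
s = "d\u032A"

# No precomposed form exists, so NFC and NFD both leave two code points.
nfc_chars = s.unicode_normalize(:nfc).chars
nfd_chars = s.unicode_normalize(:nfd).chars

# By contrast, "e" + U+0301 COMBINING ACUTE ACCENT composes to U+00E9 under NFC.
composed = "e\u0301".unicode_normalize(:nfc)

puts nfc_chars.length
puts nfd_chars.length
puts composed.length
```

This is why splitting on grapheme clusters, rather than normalizing and calling chars, is needed for strings like this one.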
Edit: As @michau's answer notes, Ruby 2.5 introduced the grapheme_clusters method, as well as each_grapheme_cluster if you just want to iterate/enumerate without necessarily creating an array.

In Ruby 2.0 or above you can use
str.scan /\X/
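As a minimal sketch of the built-in approaches mentioned above, assuming Ruby 2.5 or later so that both the \X regex and the grapheme cluster methods are available (the sample string and variable names are illustrative only):

```ruby
# "ab" + U+010D (c with caron) + "d" with U+032A COMBINING BRIDGE BELOW + U+00E9 (e with acute)
s = "ab\u010Dd\u032A\u00E9"

# Ruby 2.0+: \X matches one extended grapheme cluster per match.
by_regex = s.scan(/\X/)

# Ruby 2.5+: the same split as a built-in method.
by_method = s.grapheme_clusters

# Ruby 2.5+: without a block this returns an Enumerator, so the clusters
# can be processed without first building the array of clusters.
lengths = s.each_grapheme_cluster.map { |g| g.length }

p by_regex
p by_method
p lengths
```

Writing the call as scan(/\X/) with parentheses should also avoid the "ambiguous first argument" warning that Ruby can emit in verbose mode for the unparenthesized form.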
If you want to match the grapheme boundaries for any reason, you can use (?=\X) in your regex, for instance:

ActiveSupport (which is included in Rails) also has a way if you can't use \X for some reason: