Why does #match return the correct result, but #scan finds only one letter?


Background for Question

I'm using Ruby 3.2.1, and I know that changes have been made to Ruby's regexp engine. That may or may not be relevant here. However, I get unexpectedly different behavior from String#match and String#scan when using backreferences, and I don't understand why. See example code below.

My Examples, with Comments and Expectations

Working Result with Match

# Using #match finds the longest string of
# repeated letters, and the letter that is
# repeated.
"gjhfgfcttttdfs".match /(\p{alpha})\1{2,}/
=> #<MatchData "tttt" 1:"t">

Non-Working, Unexpected Result with Scan

# Here, #scan returns only a single sub-
# array with a single letter, which is
# the correct letter. However, I was
# expecting an array-of-arrays with all
# repeated letters.
"gjhfgfcttttdfs".scan /(\p{alpha})\1{2,}/
=> [["t"]]

Clarifying the Question

Assuming the problem exists between the keyboard and chair, why is String#scan not returning more matches, or even a single longer match? I'm assuming it's a mistake on my part in the capture expression, but I can't really figure out what I did wrong here.

There are 2 best solutions below

Alex On

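From the String#scan documentation (https://rubyapi.org/3.2/o/string#method-i-scan): "If the pattern contains groups, each result is an array containing one entry per group."

That wording is a little vague, but I think scan has behaved this way for a long time. In other words, once the pattern contains a capture group, scan returns only the captures for each match, not the whole matched text.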
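To make the documented rule concrete, here is a minimal sketch comparing scan with and without capture groups. The simplified patterns are illustrative assumptions chosen for this example, not the pattern from the question.

```ruby
text = "aaabccc"

# No capture groups: each result is the matched substring itself.
without_groups = text.scan(/[ac]{3}/)
p without_groups # => ["aaa", "ccc"]

# Two capture groups: each result is an array with one entry per group,
# and the overall match is not reported separately.
with_groups = text.scan(/([ac])([ac]{2})/)
p with_groups # => [["a", "aa"], ["c", "cc"]]
```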

>> "aaabccc".match /(\p{alpha})\1{2,}/
=> #<MatchData "aaa" 1:"a">
>> "aaabccc".scan /(\p{alpha})\1{2,}/
=> [["a"], ["c"]]

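To get the whole match, wrap the entire pattern in its own capture group so that it becomes part of the scan results. Note that the backreference then has to change from \1 to \2: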

>> "aaabccc".match /((\p{alpha})\2{2,})/
=> #<MatchData "aaa" 1:"aaa" 2:"a">
#                    ^       ^
# two captures will return two captured results for each match
>> "aaabccc".scan /((\p{alpha})\2{2,})/
=> [["aaa", "a"], ["ccc", "c"]]

# that's the best I could come up with for this
>> "aaabccc".scan(/((\p{alpha})\2{2,})/).map(&:first)
=> ["aaa", "ccc"]

Without the extra group, the full match is only available inside a scan block, via $&:

>> a = []; "aaabccc".scan(/(\p{alpha})\1{2,}/) { a << $& }; a
=> ["aaa", "ccc"]
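As a possible alternative, offered only as a sketch: String#gsub called with just a pattern (no replacement and no block) returns an Enumerator that yields the full text of each match, so the extra capture group and the block can both be avoided. The variable names here are illustrative.

```ruby
pattern = /(\p{alpha})\1{2,}/

# gsub with only a pattern returns an Enumerator over the full matched
# strings, regardless of how many capture groups the pattern contains.
full_matches = "aaabccc".gsub(pattern).to_a
p full_matches # => ["aaa", "ccc"]
```

This does not modify the string; the enumerator is only used to walk the matches.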

Stefan On

The two methods differ in their return values: match returns (or yields) a MatchData object, whereas scan returns (or yields) strings, or arrays of strings corresponding to the capture groups if the pattern has any. scan's behavior can be somewhat inconvenient, but you can work around it.
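Applied to the string from the question, the difference looks roughly like this. It is a small sketch; the variable names are illustrative.

```ruby
input = "gjhfgfcttttdfs"
pattern = /(\p{alpha})\1{2,}/

# match returns a MatchData object holding the whole match and the captures.
md = input.match(pattern)
p md[0] # => "tttt" (the whole match)
p md[1] # => "t"    (capture group 1)

# scan returns only the capture groups for each match. "tttt" is the only
# run of three or more identical letters, so there is exactly one sub-array.
p input.scan(pattern) # => [["t"]]
```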

If you want the whole MatchData object (the way match works) but for each match, you can call scan with a block and use $~ (or Regexp.last_match) to retrieve it:

"aaabccc".scan(/(\p{alpha})\1{2,}/) { p $~ }
# #<MatchData "aaa" 1:"a">
# #<MatchData "ccc" 1:"c">

To get an array of MatchData objects, you can utilize enum_for to get an "each match" enumerator and then map them to their respective MatchData via $~:

"aaabccc".enum_for(:scan, /(\p{alpha})\1{2,}/).map { $~ }
#=> [#<MatchData "aaa" 1:"a">, #<MatchData "ccc" 1:"c">]
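If relying on the $~ global feels too implicit, the same array of MatchData objects can be built with an explicit loop over Regexp#match and its start-position argument. This is a minimal sketch; all_matches is a hypothetical helper name, not a built-in method.

```ruby
# Hypothetical helper: collects a MatchData object for every
# non-overlapping match of pattern in string, scanning left to right.
def all_matches(string, pattern)
  matches = []
  pos = 0
  while (m = pattern.match(string, pos))
    matches << m
    # Continue after this match; step one extra character past an empty
    # match so the loop cannot get stuck at the same position.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  matches
end

found = all_matches("aaabccc", /(\p{alpha})\1{2,}/)
p found.map { |m| m[0] } # => ["aaa", "ccc"]
p found.map { |m| m[1] } # => ["a", "c"]
```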