ActiveSupport::Inflector::camelize - help in understanding regex

Short version:

I am having a hard time understanding two fairly complex regular expressions in the ActiveSupport::Inflector::camelize method.

This is the definition of the camelize method:

def camelize(term, uppercase_first_letter = true)
  string = term.to_s
  if uppercase_first_letter
    string = string.sub(/^[a-z\d]*/) { inflections.acronyms[$&] || $&.capitalize }
  else
    string = string.sub(/^(?:#{inflections.acronym_regex}(?=\b|[A-Z_])|\w)/) { $&.downcase }
  end
  string.gsub(/(?:_|(\/))([a-z\d]*)/i) { "#{$1}#{inflections.acronyms[$2] || $2.capitalize}" }.gsub('/', '::')
end

I have some difficulty understanding:

string = string.sub(/^(?:#{inflections.acronym_regex}(?=\b|[A-Z_])|\w)/) { $&.downcase }

and:

string.gsub(/(?:_|(\/))([a-z\d]*)/i) { "#{$1}#{inflections.acronyms[$2] || $2.capitalize}" }.gsub('/', '::')

Please explain to me what they mean. Thank you.

Long version

This shows my attempt to understand the regexes and what I interpret them to mean. It would be very helpful if you could go through this and correct my mistakes.

For the first regex

string = string.sub(/^(?:#{inflections.acronym_regex}(?=\b|[A-Z_])|\w)/) { $&.downcase }

Based on what I am seeing, inflections.acronym_regex is from the Inflections class in the ActiveSupport::Inflector module, and in the initialize method of the Inflections class,

def initialize
  @plurals, @singulars, @uncountables, @humans, @acronyms, @acronym_regex = [], [], [], [], {}, /(?=a)b/
end

acronym_regex is assigned /(?=a)b/. From what I understand from http://www.ruby-doc.org/core-2.0.0/Regexp.html#class-Regexp-label-Anchors,

(?=pat) - Positive lookahead assertion: ensures that the following characters match pat, but doesn't include those characters in the matched text

So /(?=a)b/ ensures that character a is inside the text, but we don't include character a inside the matched text, and what immediately follows character a must be character b. In other words, "abc" would match this regex, but "bbc" would not match this regex, and the matched text for "abc" would be "b" (instead of "ab").

So, substituting the value of inflections.acronym_regex into the regex /^(?:#{inflections.acronym_regex}(?=\b|[A-Z_])|\w)/, I do not know which of the following two regexes results:

A. /^(?:/(?=a)b/(?=\b|[A-Z_])|\w)/

B. /^(?:(?=a)b(?=\b|[A-Z_])|\w)/

although I think it is B. From what I understand, (?: provides grouping without capturing, (?= is a positive lookahead assertion, and \b matches a word boundary when outside brackets and a backspace when inside brackets. So in English terms, regex B, when matched against a text, will find a string that begins with an a character, followed by a b character, and then one of: (1) a backspace [whatever that may mean], (2) any uppercase character or underscore, (3) any English alphabetic character, digit, or underscore.

However, I find it strange that passing uppercase_first_letter = false to the camelize function should cause it to match a string starting with the characters ab, given that that does not seem to be how the camelize function behaves.

For the second regex

string.gsub(/(?:_|(\/))([a-z\d]*)/i) { "#{$1}#{inflections.acronyms[$2] || $2.capitalize}" }.gsub('/', '::')

The regex is:

/(?:_|(\/))([a-z\d]*)/i

I am guessing that this regex will match a substring that starts with either an _ or a /, followed by 0 or more upper- or lowercase English alphabetic characters or digits. Furthermore, for the first group (?:_|(\/)), whether we match the _ or the /, the ([a-z\d]*) capturing group will always be regarded as the second group. I do understand the part where the block tries to look up inflections.acronyms[$2] and, on failure, calls $2.capitalize.

Since (?: means grouping without capturing, what is the value of $1 when we match _ ? Is it still _ ? And for the .gsub('/', '::') portion, I am guessing that it gets applied for each match in the initial gsub, instead of being applied to the overall string after the outer gsub call is done?

Apologies for the really long post. Please point out my errors in understanding the 2 regular expressions, or explain them in a better way if you can do it.

Thank you.

There is 1 best solution below

However, I find it strange that passing uppercase_first_letter = false to the camelize function should cause it to match a string starting with the characters ab, given that that does not seem to be how the camelize function behaves.

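With the default acronym_regex it does not match a string starting with ab. /(?=a)b/ can never match: the lookahead requires the next character to be a, and b then has to match that same character. It is a placeholder for the case where no acronyms are defined. Interpolation embeds it as a group (your option B), so the first alternative always fails and \w matches a single word character at the start of the string. (?: ... ) groups without capturing, so the matched text is read from $& and the block downcases it. Note also that \b in (?=\b|[A-Z_]) sits outside a character class, so it is a word boundary, not a backspace.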
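
This can be checked in plain Ruby without loading Rails. The sketch below is only an illustration: the constant names and the downcase_first helper are made up for this example, and the placeholder value is copied from the initialize method quoted in the question.

```ruby
# Placeholder that Inflections#initialize assigns in the quoted source.
ACRONYM_REGEX = /(?=a)b/

# Interpolating a Regexp embeds its source in a non-capturing group with
# explicit flags, i.e. option B from the question rather than option A.
FIRST_CHAR_PATTERN = /^(?:#{ACRONYM_REGEX}(?=\b|[A-Z_])|\w)/

# Hypothetical helper mirroring the `else` branch of camelize.
def downcase_first(string)
  string.sub(FIRST_CHAR_PATTERN) { $&.downcase }
end

puts FIRST_CHAR_PATTERN.source   # ^(?:(?-mix:(?=a)b)(?=\b|[A-Z_])|\w)

# The lookahead needs the next character to be "a" while "b" must match that
# same character, so the placeholder cannot match any of these strings.
puts(%w[abc bbc ab b].none? { |s| s.match?(ACRONYM_REGEX) })   # true

# Only the \w branch can succeed, so just the first character is downcased.
puts downcase_first("ActiveRecord")   # activeRecord
```

Once acronyms are registered, the real acronym_regex is presumably replaced with a pattern built from them, so the first alternative can then match; that part is not shown in the quoted source.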

Since (?: means grouping without capturing, what is the value of $1 when we match _ ? Is it still _ ?

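$1 is nil when _ matches, because the only capturing group inside the alternation, (\/), did not take part in the match. The underscore itself is not captured anywhere; $2 holds the letters and digits that follow it. Since nil interpolates as an empty string, the underscore is simply dropped from the result.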
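
To see the capture groups for both separators, String#scan can be used, since it returns the groups of every match. This is a small sketch; SEPARATOR_PATTERN and separator_captures are names invented for the example, and the input string is arbitrary.

```ruby
# The second regex from camelize, copied from the question.
SEPARATOR_PATTERN = /(?:_|(\/))([a-z\d]*)/i

# Hypothetical helper: collect [$1, $2] for every match in the string.
def separator_captures(string)
  string.scan(SEPARATOR_PATTERN)
end

p separator_captures("active_record/base")
# [[nil, "record"], ["/", "base"]]

# nil interpolates as an empty string, so "_" disappears from the
# replacement while "/" is carried over via $1.
puts "#{nil}Record"   # Record
```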

And for the .gsub('/', '::') portion, I am guessing that it gets applied for each match in the initial gsub, instead of being applied to the overall string after the outer gsub call is done?

It's applied to the overall result: gsub with a block returns a new string once all matches have been replaced, and .gsub('/', '::') is chained on that return value, outside the block.
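
Splitting the chained call into two steps makes the order visible. The helper below is hypothetical and leaves out the inflections.acronyms lookup, so it always capitalizes $2; it is a sketch of the last line of camelize, not the library code itself.

```ruby
# Hypothetical helper: the last line of camelize without the acronym lookup,
# split into two steps to show when each gsub runs.
def camelize_tail_steps(string)
  # Step 1: the block runs once per match and builds each replacement.
  after_block_gsub = string.gsub(/(?:_|(\/))([a-z\d]*)/i) { "#{$1}#{$2.capitalize}" }
  # Step 2: runs once, on the complete string returned by step 1.
  after_slash_gsub = after_block_gsub.gsub('/', '::')
  [after_block_gsub, after_slash_gsub]
end

p camelize_tail_steps("Active_record/base_errors")
# ["ActiveRecord/BaseErrors", "ActiveRecord::BaseErrors"]
```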