Regular Expression in R: extract characters and numbers from a column

I am working with a retail dataset which has a size description column. My task is to clean the column and separate the numeric size from the characters in the string. Is there a way to do it through regular expressions? I need to save both the number and any other character string present in the column in two different columns.

Observations about the data:

  • The column contains sizes of three broad categories: footwear, topwear and bottom wear.
  • Footwear: the number in the cell is generally the size, and anything else is to be stored separately. The unique cases look like: EU 36 (EU means it is a European size, so conversion is required), UK 8 (similar conversion required), 19 Wide, 10 Kids, 19(-25F) (in this case, I do not need to save the -25F info).
  • Topwear: the sizes here are generally XXS, XS, S, M, L, XL, XXL, XXXL. Any other string along with the size, like Tall, inseam etc., needs to be stored separately. A size like XXL can also be represented as 2XL.
  • Bottomwear: the size here generally occurs at the beginning. It can be a number (32) or a letter size (XL, as in topwear). If any other character string follows it, that string should be stored separately.

Thanks!

Answer by LukStorms:

Here's a single regex that covers those cases.
It matches all of the examples below.

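# Sample size descriptions taken from the question
details <- c("EU 36", "UK 8", "19 Wide", "10 Kids", "19(-25F)", "XXS", "XS is Extra Small", "S", "M", "L", "XL", "XXL", "XXXL", "2XL", "32")

# Letter sizes (optionally prefixed with 2), EU/UK sizes, or a 2-digit number with optional Kids/Wide
pattern <- "\\b(?:(?:(?:2?X*(?:S|L))|M|(?:EU|UK) [0-9]+)|(?:[0-9]{2}(?: (?:Kids|Wide))?))\\b"

# Locate the first match in each element (-1 where there is none)
matches <- regexpr(pattern, details)

# Extract the matched text (elements without a match are dropped from the result)
regmatches(details, matches)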
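
The question asks for the size and the leftover text in two separate columns, whereas the call above returns only the matched sizes. A minimal sketch of one way to get both from the same pattern follows. The helper `split_size()` and the column names `raw`, `size` and `extra` are hypothetical, and `perl = TRUE` is an assumption made here so that the `(?:...)` groups are handled by the PCRE engine. Note that with this pattern, qualifiers such as "EU" or "Wide" are part of the match, so they end up in `size` rather than `extra`.

```r
details <- c("EU 36", "UK 8", "19 Wide", "10 Kids", "19(-25F)", "XXS",
             "XS is Extra Small", "S", "M", "L", "XL", "XXL", "XXXL", "2XL", "32")

pattern <- "\\b(?:(?:(?:2?X*(?:S|L))|M|(?:EU|UK) [0-9]+)|(?:[0-9]{2}(?: (?:Kids|Wide))?))\\b"

# Hypothetical helper: split each description into the matched size and the rest.
split_size <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)     # first match per element, -1 if none
  size <- rep(NA_character_, length(x))     # NA keeps rows aligned when nothing matches
  size[m != -1] <- regmatches(x, m)         # fill in only the elements that matched
  extra <- trimws(sub(pattern, "", x, perl = TRUE))  # remove the first match, keep the rest
  data.frame(raw = x, size = size, extra = extra, stringsAsFactors = FALSE)
}

res <- split_size(details, pattern)
print(res)
```

Filling `size` with `NA` first keeps the result the same length as the input, so it can be bound back onto the original data frame even when some rows contain no recognisable size.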

Breakdown of the regex:

\b    # Word boundary: a position between a word and non-word character 
      # (includes the start/end of the line).
  (?:       # a non-capturing group
    (?:     # ditto
      (?:   # ditto
         2?  # 0 or 1 "2" characters
           X*  # 0 or more "X" characters
             (?:S|L) # "S" or an "L" character
      )
      |    # or
       M   # the "M" character
      |    # or
       (?:EU|UK) [0-9]+  # "EU" or "UK", followed by a space and 1 or more digits
    )
    |      # or
    (?:[0-9]{2}(?: (?:Kids|Wide))?)  # 2 digits optionally followed by " Kids" or " Wide"
  )
\b  # Word boundary
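
The `2?X*` part of the pattern accepts both `XXL` and `2XL`, but each is returned as written. If the two spellings should compare equal, as the question suggests, one option is to expand the numeric prefix after extraction. This is a sketch only: `normalize_size()` is a hypothetical helper, and it assumes a single digit from 2 to 9 followed by `XS` or `XL`.

```r
# Hypothetical helper: rewrite sizes such as "2XL" or "3XL" as "XXL" or "XXXL".
normalize_size <- function(size) {
  is_num <- grepl("^[2-9]X[SL]$", size)           # e.g. "2XL", "3XS"; NA gives FALSE
  n <- as.integer(substr(size[is_num], 1, 1))     # how many X's the digit stands for
  size[is_num] <- paste0(strrep("X", n), substr(size[is_num], 3, 3))
  size                                            # everything else is returned unchanged
}

normalize_size(c("2XL", "XXL", "3XL", "2XS", "32", NA))
```

Applied to the extracted sizes, this leaves numeric sizes and already-expanded letter sizes untouched.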