I am working with a retail dataset that has a size description column. My task is to clean this column and separate the size itself from the rest of the string. Is there a way to do this with regular expressions? I need to store the size and any other text present in the cell in two different columns.
Observations about the data:
- The column contains sizes of three broad categories: footwear, topwear and bottom wear.
- Footwear: the number in the cell is generally the size, and anything else should be stored separately. The unique cases look like: EU 36 (EU means it is a European size, so a conversion is required), UK 8 (a similar conversion is required), 19 Wide, 10 Kids, 19(-25F) (in this case I do not need to keep the -25F part).
- Topwear: the sizes here are generally XXS, XS, S, M, L, XL, XXL, XXXL. Any other string that comes with the size, such as Tall or inseam, needs to be stored separately. A size like XXL can also be written as 2XL.
- Bottomwear: the size generally occurs at the beginning. It can be a number (32) or letters (XL, as in topwear). If any other text follows it, that text should be stored separately.
Thanks!
Here's a regex for those multiple cases.
It works for the examples.
Breakdown of the regex:
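As a way to make the cases from the question concrete, here is a minimal Python sketch with each part of the pattern commented via `re.VERBOSE`. The pattern, the group names, and the `split_size` helper are illustrative assumptions and not necessarily the regex this answer refers to. It also assumes the only region prefixes are EU and UK, that a parenthesised note right after the size can be dropped, and that 2XL-style sizes should be normalised to XXL.

```python
import re

# Hypothetical pattern covering the examples in the question.
SIZE_RE = re.compile(
    r"""
    ^\s*
    (?:(?P<region>EU|UK)\s*)?        # optional region prefix, e.g. "EU 36"
    (?P<size>
        (?:[2-9]X|X{1,3})?[SL]\b     # letter sizes: S, L, XS, XXL, 2XL, ...
      | M\b                          # medium has no X prefix
      | \d+(?:\.\d+)?                # numeric sizes: 8, 32, 10.5
    )
    \s*
    (?:\([^)]*\))?                   # drop a parenthesised note such as "(-25F)"
    \s*
    (?P<extra>.*?)                   # whatever text remains, e.g. "Wide", "Tall"
    \s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def split_size(text):
    """Return (size, extra); size is None when no size is recognised."""
    m = SIZE_RE.match(text)
    if m is None:
        return None, text.strip()
    size = m.group("size").upper()
    # Normalise "2XL" to "XXL" (an assumption about the desired output).
    if size[0].isdigit() and size[-1] in "SL":
        size = "X" * int(size[0]) + size[-1]
    region = (m.group("region") or "").upper()
    # Keep the region with the extra text so a conversion can be applied later.
    extra = " ".join(part for part in (region, m.group("extra")) if part)
    return size, extra


for example in ["EU 36", "UK 8", "19 Wide", "10 Kids", "19(-25F)", "2XL", "XL Tall", "32"]:
    print(repr(example), "->", split_size(example))
```

The letter-size alternative is tried before the numeric one so that the digit in 2XL is not mistaken for a numeric size. To fill two columns in a pandas DataFrame, the helper could be applied to the size column row by row and the resulting tuples unpacked into the two new columns.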