I have a pandas dataframe loaded from a CSV file and I want to clean it using regex in Python. The data that I have looks like this:
Name | Date | Status | Number |
---|---|---|---|
A/bCDef | 2022-07-11 | Yes | io123-07 |
GhIjK-l | 2022-07-12 | No | io456-08 |
I'm trying to clean the dataframe so it will be easier to process, but the thing is, my code deletes the date, the word 'no', and the hyphen.
This is the output that I have got so far:
name | date | status | number |
---|---|---|---|
abcdef | yes | io | |
ghijkl | no | io |
This is the code that I found on the internet and tried on my dataframe:
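    import re
    import string

    import nltk
    from nltk.corpus import stopwords


    def regex_values(cols):
        nltk.download("stopwords")
        stemmer = nltk.SnowballStemmer('english')  # created but never used below
        stopword = set(stopwords.words('english'))
        cols = str(cols).lower()
        cols = re.sub(r'\[.*?\]', '', cols)  # drop text inside square brackets
        cols = re.sub(r'https?://\S+|www\.\S+', '', cols)  # drop URLs
        cols = re.sub(r'<.*?>+/', '', cols)  # meant for HTML tags (as written, the match must end in '/')
        cols = re.sub('[%s]' % re.escape(string.punctuation), '', cols)  # drop all punctuation, including '-' and '/'
        cols = re.sub(r'\n', '', cols)  # drop newlines
        cols = re.sub(r'\w*\d\w*', '', cols)  # drop every word that contains a digit
        cols = re.sub(r'^\s+|\s+$', '', cols)  # trim leading and trailing whitespace
        cols = re.sub(' +', ' ', cols)  # collapse repeated spaces
        cols = re.sub(r'\b(\w+)(?:\W\1\b)+', r'\1', cols, flags=re.IGNORECASE)  # collapse repeated words
        cols = [word for word in cols.split(' ') if word not in stopword]  # drop English stopwords
        cols = " ".join(cols)
        return cols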
This is the pandas dataframe that I wish to have at the end:
name | date | status | number |
---|---|---|---|
abcdef | 2022-07-11 | yes | io123-07 |
ghijkl | 2022-07-12 | no | io456-08 |
I'm new to regex, so I would appreciate any help writing the right code. If there is a simpler way to clean my data, I would much appreciate that too. Thanks in advance.
Can you try this:
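A minimal sketch of one possible approach, assuming the four columns shown in the question and assuming that only `Name` should lose its punctuation. The function in the question is built for free text: it strips every punctuation character (which removes `-` and `/`), deletes any word containing a digit (which removes the dates and numbers), and drops English stopwords (NLTK's list includes `no`). Cleaning each column separately avoids all three problems. The helper name `clean_frame` is hypothetical, and the sample frame is just the data from the question rebuilt by hand instead of being read from the CSV.

```python
import pandas as pd

# Sample data copied from the question (stand-in for the real CSV).
df = pd.DataFrame(
    {
        "Name": ["A/bCDef", "GhIjK-l"],
        "Date": ["2022-07-11", "2022-07-12"],
        "Status": ["Yes", "No"],
        "Number": ["io123-07", "io456-08"],
    }
)


def clean_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: lowercase headers and text, strip non-letters from 'name' only."""
    out = frame.copy()

    # Normalise the column headers: "Name" -> "name", etc.
    out.columns = out.columns.str.strip().str.lower()

    # Assumption: only the name column should lose punctuation and symbols.
    out["name"] = (
        out["name"].astype(str).str.lower().str.replace(r"[^a-z]", "", regex=True)
    )

    # Status and number are only trimmed and lowercased, so "no" and "-" survive.
    out["status"] = out["status"].astype(str).str.strip().str.lower()
    out["number"] = out["number"].astype(str).str.strip().str.lower()

    # The date column is deliberately left untouched.
    return out


cleaned = clean_frame(df)
print(cleaned)
```

If the real file has more text columns, the same pattern applies: pick the columns that need a given rule and apply it only to them, rather than running one free-text cleaner over every cell.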