Cannot Keep My Datetime Data and 'No' Word in My Pandas DataFrame

I have a pandas DataFrame loaded from a CSV file, and I want to clean it using regex in Python. My data looks like this:

Name     Date        Status  Number
A/bCDef  2022-07-11  Yes     io123-07
GhIjK-l  2022-07-12  No      io456-08

I'm trying to clean the DataFrame so it will be easier to process, but my code deletes the dates, the word 'no', and the hyphens.

This is the output I get so far:

name    date  status  number
abcdef        yes     io
ghijkl        no      io

This is the code that I found on the internet and tried on my dataframe:

def regex_values(cols):
    nltk.download("stopwords")
    stemmer = nltk.SnowballStemmer('english')
    stopword = set(stopwords.words('english'))

    cols = str(cols).lower()
    cols = re.sub('\[.*?\]', '', cols)
    cols = re.sub('https?://\S+|www\.\S+', '', cols)
    cols = re.sub('<.*?>+/', '', cols)
    cols = re.sub('[%s]' % re.escape(string.punctuation), '', cols)
    cols = re.sub('\n', '', cols)
    cols = re.sub('\w*\d\w*', '', cols)
    cols = re.sub(r'^\s+|\s+$', '', cols)
    cols = re.sub(' +', ' ', cols)
    cols = re.sub(r'\b(\w+)(?:\W\1\b)+', 'r\1', cols, flags = re.IGNORECASE)
    cols = [word for word in cols.split(' ') if word not in stopword]
    cols = " ".join(cols)
    
    return cols

This is the pandas dataframe that I wish to have at the end:

name    date        status  number
abcdef  2022-07-11  yes     io123-07
ghijkl  2022-07-12  no      io456-08

I'm new to regex, so I would appreciate help writing the right code. If there is a simpler way to clean my data, I would appreciate that as well. Thanks in advance.

There is 1 best solution below

Can you try this:

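# Lowercase every string value (applymap was renamed to DataFrame.map in pandas 2.1)
df = df.applymap(lambda s: s.lower() if isinstance(s, str) else s)
df.columns = df.columns.str.lower()  # lowercase the column names
# Strip non-word characters from 'name' only; regex=True is needed since pandas 2.0, where the default became False
df['name'] = df['name'].str.replace(r'\W+', '', regex=True)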

#output
'''
     name        date status    number
0  abcdef  2022-07-11    yes  io123-07
1  ghijkl  2022-07-12     no  io456-08
'''
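
For anyone wondering why the function in the question loses data: it strips all punctuation (so the hyphens go), then removes every token that contains a digit (so the dates and the digits in "number" go), and 'no' is in NLTK's English stopword list, so the stopword filter drops it. Below is a minimal, self-contained sketch of this, followed by the per-column approach from the answer. The sample rows are copied from the question; the helper name `clean_frame` is hypothetical and only there for illustration, and it assumes every column should be treated as text.

```python
import re
import string

import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "Name": ["A/bCDef", "GhIjK-l"],
        "Date": ["2022-07-11", "2022-07-12"],
        "Status": ["Yes", "No"],
        "Number": ["io123-07", "io456-08"],
    }
)

# Two steps from the question's function, applied to one date value:
# 1) the punctuation step removes the hyphens
no_punct = re.sub("[%s]" % re.escape(string.punctuation), "", "2022-07-11")
# 2) the digit step removes every token that contains a digit
no_digits = re.sub(r"\w*\d\w*", "", no_punct)


def clean_frame(frame):
    """Hypothetical helper: lowercase everything, clean only the 'name' column."""
    out = frame.copy()
    out.columns = out.columns.str.lower()
    for col in out.columns:
        # Assumption: all columns are text; astype(str) guards against mixed types.
        out[col] = out[col].astype(str).str.lower()
    # Only 'name' has non-word characters removed, so dates and hyphens survive.
    out["name"] = out["name"].str.replace(r"\W+", "", regex=True)
    return out


cleaned = clean_frame(df)
print(cleaned)
```

The main design choice is to clean each column according to what it holds instead of running one text-cleaning function built for free-form sentences over every cell.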