I'm a Python 3.x new user and try to get a little help on data cleaning code.
Here is the simulation data:
I got a database of customers'orders and have to clean the columns UserPhone. In this columns, a value can be a str (ie: 0909111111, 0909.111.111) or number (ie: 909111111, 909111111.0, 84909111111)
.
I want the result to be: 909111111
.
To do that, I must:
- remove'.0' from all values
- remove all not-digit characters (space, text if any) from all the values
- remove '0' if a value starts with '0'
- remove '84' if a value starts with '84'
I guess the 4 steps have the same structure. For example, to solve task 2, I try to write some code like this:
import pandas as pd
test=pd.read_excel('D:/relay-foods.xlsx')
b=test['UserPhone']
for i in test['UserPhone']:
if i:
i = str(i)
for j in i:
if not j.isdigit():
index_i=test.index(i)
i = i.replace(j, '')
b[index_i]=i
But it results an error
TypeError: 'RangeIndex' object is not callable
I searched and realized that the index(i) works only for List while 'UserPhone' is Series type. I tried test['UserPhone']=test['UserPhone'].tolist()
but it does not work. Could anyone please help me with this please?
And one more thing that's worth notice is that 'UserPhone' contains duplicate that can result the same index for the same 'UserPhone' value at different rows.
Thank you very much for your help!
Try converting the
UserPhone
column to a string column and then using the.str.replace
method to replace anything meeting your outlined conditions. The following should do:For the pattern provided here:
\.0$
matches '.0' that is at the end of the phone number;^0
matches any zero that is at the beginning of the phone number;^84
matches any "84" that is at the beginning of the phone number;[^0-9]+
matches one or more character that is not a number.The
|
operator is the equivalent ofOR
.Also, feel free to remove the
.astype(int)
part from the code if you want the output to remain a string.I hope this helps.