Data cleaning: How to index a Series value and deal with duplicate values indices in Python?

231 Views Asked by At

I'm a Python 3.x new user and try to get a little help on data cleaning code.

Here is the simulation data:

I got a database of customers'orders and have to clean the columns UserPhone. In this columns, a value can be a str (ie: 0909111111, 0909.111.111) or number (ie: 909111111, 909111111.0, 84909111111).

I want the result to be: 909111111.

To do that, I must:

  1. remove'.0' from all values
  2. remove all not-digit characters (space, text if any) from all the values
  3. remove '0' if a value starts with '0'
  4. remove '84' if a value starts with '84'

I guess the 4 steps have the same structure. For example, to solve task 2, I try to write some code like this:

import pandas as pd
test=pd.read_excel('D:/relay-foods.xlsx')
b=test['UserPhone']
for i in test['UserPhone']:
    if i:
        i = str(i)
        for j in i:
            if not j.isdigit():
                index_i=test.index(i)
                i = i.replace(j, '')
                b[index_i]=i

But it results an error TypeError: 'RangeIndex' object is not callable I searched and realized that the index(i) works only for List while 'UserPhone' is Series type. I tried test['UserPhone']=test['UserPhone'].tolist() but it does not work. Could anyone please help me with this please?

And one more thing that's worth notice is that 'UserPhone' contains duplicate that can result the same index for the same 'UserPhone' value at different rows.

Thank you very much for your help!

2

There are 2 best solutions below

1
On BEST ANSWER

Try converting the UserPhone column to a string column and then using the .str.replace method to replace anything meeting your outlined conditions. The following should do:

test['UserPhone'] = test['UserPhone'].astype(str).str.replace('\.0$|^0|^84|[^0-9]+‌​','').astype(int)

For the pattern provided here:

\.0$ matches '.0' that is at the end of the phone number;

^0 matches any zero that is at the beginning of the phone number;

^84 matches any "84" that is at the beginning of the phone number;

[^0-9]+‌​ matches one or more character that is not a number.

The | operator is the equivalent of OR.

Also, feel free to remove the .astype(int) part from the code if you want the output to remain a string.

I hope this helps.

1
On

I made an edit based on @Abdou 's code and this code works:

test['UserPhone'] = test['UserPhone'].astype(str).str.replace('\.0|^0|^84|[^0-9]+‌​','')

The only different is the \.0 instead of .0 because the the dot (.) is a metacharacter, we have to put a '\' before it to remove its special meaning. Thank you @Abdou again!