How to replace non-duplicated values in columns of CSV files with stars ("*")?


Hi everybody. I need to anonymize a raw table to produce an anonymized table. In other words, I need to replace the non-duplicated entries with stars.

Actually, I have run this code:

    for j in range(len(zz_new)):
        for i in range(len(zz)):
            if zz_new.iloc[j][0] != zz.iloc[i][0]:
                zz_new.iat[j,0]="*"

            if zz_new.iloc[j][1] != zz.iloc[i][1]:
                zz_new.iat[j,1]="*"

            if zz_new.iloc[j][2] != zz.iloc[i][2]:
                zz_new.iat[j,2]="*"

            if zz_new.iloc[j][3] != zz.iloc[i][3]:
                zz_new.iat[j,3]="*"

            if zz_new.iloc[j][4] != zz.iloc[i][4]:
                zz_new.iat[j,4]="*"

However, the result looks like this: "My anonymized table". Could you help me produce the correct anonymized table?


There are 2 best solutions below

On BEST ANSWER

Use the value_counts() method:

    df
         age  education
    0  30-39    HS-grad
    1  40-49  Bachelors
    2  30-39    HS-grad
    3  30-39       11th

    vcnt = df.education.value_counts().eq(1)

    HS-grad      False
    Bachelors     True
    11th          True
    Name: education, dtype: bool

    df["education"] = df.education.replace(vcnt.loc[vcnt].index, "*")

         age education
    0  30-39   HS-grad
    1  40-49         *
    2  30-39   HS-grad
    3  30-39         *
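
The question asks about several columns, while the snippet above handles only `education`. As a rough sketch, the same counting idea could be applied to every column in a loop. The helper name `star_unique_values` and its parameters are made up for illustration, and the sample data is the small table from this answer. Note that missing values (NaN) would also be replaced by the placeholder here, which may or may not be what you want.

```python
import pandas as pd


def star_unique_values(df, columns=None, placeholder="*"):
    """Return a copy of df in which values that occur only once
    within their column are replaced by `placeholder`.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    for col in (columns if columns is not None else out.columns):
        # For each row, look up how often its value occurs in this column.
        counts = out[col].map(out[col].value_counts())
        # Keep values seen more than once, replace everything else.
        out[col] = out[col].where(counts > 1, placeholder)
    return out


df = pd.DataFrame({
    "age": ["30-39", "40-49", "30-39", "30-39"],
    "education": ["HS-grad", "Bachelors", "HS-grad", "11th"],
})

anonymized = star_unique_values(df)
print(anonymized)
```

Because the function works on a copy, the original `df` is left untouched, which makes it easy to compare the raw and anonymized tables.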

What you need to do is iterate over each row and find out which rows are duplicates. There are many ways of doing this, but the brute-force algorithm looks like this:

  • start with an empty list, non_duplicate_id, that keeps track of the non-duplicated rows
  • iterate over each row and check whether there is another row that is exactly identical to the current one
  • if there is an identical row, do nothing; if not, add the id of the current row to the non_duplicate_id list
  • iterate over your non_duplicate_id list and set the two fields of interest (age and education) of each of those rows to a star
  • save the new anonymized table
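
The steps above could be sketched in pandas roughly as follows. This is only an illustration: the function name `star_non_duplicate_rows_bruteforce` is invented, the sample table is borrowed from the other answer, and the output file name in the last comment is a placeholder.

```python
import pandas as pd


def star_non_duplicate_rows_bruteforce(df, columns, placeholder="*"):
    """Brute-force version: compare every row with every other row.

    Hypothetical helper for illustration; runs in O(n^2) comparisons.
    """
    out = df.copy()
    # Cast to object so a string placeholder can be stored in any column.
    out[columns] = out[columns].astype(object)

    # One plain tuple per row, restricted to the columns of interest.
    rows = list(df[columns].itertuples(index=False, name=None))

    # Steps 1-3: collect positions of rows that have no identical twin.
    non_duplicate_id = []
    for i, row in enumerate(rows):
        has_twin = any(i != j and row == other for j, other in enumerate(rows))
        if not has_twin:
            non_duplicate_id.append(i)

    # Step 4: overwrite the fields of interest in those rows.
    col_positions = [out.columns.get_loc(c) for c in columns]
    for i in non_duplicate_id:
        for c in col_positions:
            out.iat[i, c] = placeholder
    return out


raw = pd.DataFrame({
    "age": ["30-39", "40-49", "30-39", "30-39"],
    "education": ["HS-grad", "Bachelors", "HS-grad", "11th"],
})
starred = star_non_duplicate_rows_bruteforce(raw, ["age", "education"])
print(starred)

# Step 5 (file name is a placeholder):
# starred.to_csv("anonymized.csv", index=False)
```

Note that this treats a row as a whole: the last row keeps neither its age nor its education, because the pair (30-39, 11th) appears only once even though 30-39 is a common age.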

However, this solution does a lot of redundant lookups in steps 2 and 3, and it might not scale well if your dataset is large.
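
One way to avoid the pairwise comparisons, assuming the data fits in a pandas DataFrame, is to let `DataFrame.duplicated` with `keep=False` flag every row that has at least one identical twin. The sketch below uses an invented sample table and variable names, and assumes the columns of interest hold strings.

```python
import pandas as pd

table = pd.DataFrame({
    "age": ["30-39", "40-49", "30-39", "30-39"],
    "education": ["HS-grad", "Bachelors", "HS-grad", "11th"],
})
fields = ["age", "education"]

# True for every row whose (age, education) pair occurs more than once;
# keep=False marks all members of a duplicate group, not just the repeats.
is_repeated = table.duplicated(subset=fields, keep=False)

# Star out the fields of interest in rows that have no twin.
masked = table.copy()
masked.loc[~is_repeated, fields] = "*"
print(masked)
```

This produces the same row-level result as the brute-force loop, but the duplicate detection is done in a single pass inside pandas.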