I have the following data:
| Name | Age | Country | Occupation | Hobby | |
|---|---|---|---|---|---|
| 0 | A | 23 | DE | Job holder | Fishing |
| 1 | A | 23 | DE | Job holder | Gardening |
| 2 | A | 23 | DE | Job holder | Fishing |
| 3 | A | 23 | DE | Job holder | Reading |
| 4 | B | 15 | SW | Job holder | Fishing |
| 5 | B | 15 | SW | Job holder | Playing |
| 6 | C | 23 | DD | Job holder | Coding |
| 7 | B | 23 | AA | Job holder | Fishing |
| 8 | D | 34 | GH | Job holder | Fishing |
| 9 | D | 33 | TR | Other | Fishing |
When there is a duplicate in the "Name" columns, I would like to compare them and I would like to know the reason for their uniqueness. I want to add one new column which column should contain the responsible column header names for this uniqueness. If a row is found one time then write 'Unique' in the reason column.
The output should look like this:
| Name | Age | Country | Occupation | Hobby | Reason | |
|---|---|---|---|---|---|---|
| 0 | A | 23 | DE | Job holder | Fishing | Occupation, Hobby |
| 1 | A | 23 | DE | Job holder | Gardening | Occupation, Hobby |
| 2 | A | 23 | DE | Student | Fishing | Occupation, Hobby |
| 3 | A | 23 | DE | Job holder | Reading | Occupation, Hobby |
| 4 | B | 15 | SW | Job holder | Fishing | Hobby |
| 5 | B | 15 | SW | Job holder | Playing | Hobby |
| 6 | C | 23 | DD | Job holder | Coding | Unique |
| 7 | B | 23 | AA | Job holder | Fishing | Age, Country |
| 8 | D | 34 | GH | Job holder | Fishing | Age, Country, Occupation |
| 9 | D | 33 | TR | Other | Fishing | Age, Occupation, Occupation |
I trying this way:
dif = [i for i, (x,y) in enumerate(zip(df.loc[0].values, df.loc[9,:].values)) if x!=y ] df.iloc[:, dif]
but it's not giving the way output should look like. I am new in python, any help would be appreciated!
You can identify the non-unique reasons, then merge as a string.
There are multiple ways to do this.
Using
groupby.nuniqueand dot product:With
groupby.nuniqueandmelt:With
groupby.apply:Output: