How to remove every possible accent from a column in Python


I am new to Python. I have a data frame with a column named 'Name'. The column contains various accented characters, and I am trying to remove the accents. For example, rubén => ruben, zuñiga => zuniga, etc. I wrote the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import unicodedata


data=pd.read_csv('transactions.csv')

data.head()

nm=data['Name']
normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')

I am getting error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-1410866bc2c5> in <module>()
      1 nm=data['Name']
----> 2 normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')

TypeError: normalize() argument 2 must be unicode, not Series

There are 3 best solutions below


Try this for one column:

df[column_name] = df[column_name].apply(lambda x: unicodedata.normalize(u'NFKD', str(x)).encode('ascii', 'ignore').decode('utf-8'))

Change the column name according to your data columns.
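
One thing to watch with the one-liner above: str(x) also converts missing values, so a NaN ends up as the text 'nan'. A minimal sketch of a variant that leaves non-string values alone, assuming Python 3 (the helper name strip_accents and the sample data are made up for illustration):

```python
import unicodedata

import pandas as pd


def strip_accents(value):
    """Return value without accents; non-string values are returned unchanged."""
    if not isinstance(value, str):
        return value  # e.g. NaN or None stays as it is
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', value)
    # Encoding to ASCII with 'ignore' drops the marks (and any other non-ASCII character).
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Hypothetical sample data standing in for transactions.csv
df = pd.DataFrame({'Name': ['rubén', 'zuñiga', None]})
df['Name'] = df['Name'].map(strip_accents)
print(df['Name'].tolist()[:2])  # ['ruben', 'zuniga']
```

The missing value in the third row is passed through untouched instead of being turned into a string.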


The error occurs because unicodedata.normalize requires a single string as its second argument, and you are passing it a pandas Series. On a single string it works as expected. I found an example of this online:

unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'
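
Note that encoding to ASCII with 'ignore' removes more than accents: letters that have no decomposition, such as Ø or ß, are dropped entirely. If that matters for your data, one alternative is to filter out only the combining marks after normalizing. A rough sketch, assuming Python 3 (the function names ascii_only and strip_marks are just for illustration):

```python
import unicodedata


def ascii_only(text):
    """NFKD-normalize, then drop every character that is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def strip_marks(text):
    """NFKD-normalize, then drop only the combining marks (the accents)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_only('Durrës Åland Islands'))   # Durres Aland Islands
print(ascii_only('Øre straße'))             # re strae
print(strip_marks('Øre straße'))            # Øre straße
print(strip_marks('Durrës Åland Islands'))  # Durres Aland Islands
```

Which one you want depends on whether the result has to be pure ASCII or only free of accents.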

Try this for one column:

nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

Try this for multiple columns:

obj_cols = data.select_dtypes(include=['O']).columns
data[obj_cols] = data[obj_cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
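
Here is a small self-contained run of the multi-column version, selecting the columns with data[obj_cols] (plain .loc[obj_cols] would look the names up as row labels). The frame below is invented sample data, and the text columns are created with dtype=object on purpose so that select_dtypes picks them up regardless of pandas version:

```python
import pandas as pd

# Hypothetical sample data; only the object (text) columns should be changed.
data = pd.DataFrame({
    'Name': pd.Series(['rubén', 'zuñiga'], dtype=object),
    'City': pd.Series(['Bogotá', 'Málaga'], dtype=object),
    'Amount': [10, 20],
})

# Pick the object-dtype columns, then strip accents column by column.
obj_cols = data.select_dtypes(include=['object']).columns
data[obj_cols] = data[obj_cols].apply(
    lambda col: col.str.normalize('NFKD')
                   .str.encode('ascii', errors='ignore')
                   .str.decode('ascii')
)

print(data['Name'].tolist())    # ['ruben', 'zuniga']
print(data['City'].tolist())    # ['Bogota', 'Malaga']
print(data['Amount'].tolist())  # [10, 20]
```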