unidecode a text column from postgres in python


I am new to Python, and I want to take the "user_name" column from a PostgreSQL database and remove all the accents from the names. Postgres has an unaccent function (provided by the unaccent extension), but it doesn't seem to work on my database, so I resorted to Python.
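
For reference, this is the kind of unaccent call I was expecting to work (just a sketch; 'XYZABC' is a placeholder connection string, like in my code below):

from sqlalchemy import create_engine

engine = create_engine('XYZABC')  # placeholder connection string
connection = engine.connect()
# enable the extension if it isn't installed yet (needs sufficient privileges)
connection.execute("CREATE EXTENSION IF NOT EXISTS unaccent")
# strip accents directly in SQL; "user" is quoted because it is a reserved word
result = connection.execute('select unaccent(user_name) from "user"')
print(result.fetchall())
connection.close()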

So far I have:

from sqlalchemy import create_engine
from pandas import DataFrame
import unidecode
engine_gear = create_engine('XYZABC')
connection = engine_gear.connect()
member = 1
result = connection.execute("select user_name from user") 
df = DataFrame(result.fetchall())
df.columns = result.keys()
connection.close()
df['n'] = df['user_name'].apply(unidecode)

When I run this piece of code I get the following error:

Traceback (most recent call last):
File "C:/Users/s/PycharmProjects/test/name_matching_test.py", line 20, in <module>
df['n'] = df['user_name'].apply(unidecode)
File "C:\Python\lib\site-packages\pandas\core\series.py", line 2355, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas\_libs\src\inference.pyx", line 1574, in pandas._libs.lib.map_infer (pandas\_libs\lib.c:66645)
TypeError: 'module' object is not callable

At first, I thought I should convert the user_name column to strings, so I used df['user_name'].astype('str'), but I still get the same error after doing so.

Any help or guidance would be appreciated.

Data Sample:

user_name
Linda
Alonso

TestUser1
Arjang "RJ"
XI(DAPHNE)
Ajuah-AJ
Anthony "Tony"
Joseph-Patrick
Zoë 
André

There are 2 best solutions below

You have two slight problems: first, "unidecode" in your code is the module; you want the unidecode function from that module. Second, the function needs to be applied to each element rather than called on the whole Series/column, so:

df.applymap(unidecode.unidecode)
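
A minimal, self-contained sketch of the same idea applied to the question's single column (sample names taken from the question's data; note that Series.apply is element-wise too, so either call works):

import pandas as pd
import unidecode

df = pd.DataFrame({'user_name': ['Zoë', 'André', 'Linda']})
df['n'] = df['user_name'].apply(unidecode.unidecode)
print(df['n'].tolist())  # ['Zoe', 'Andre', 'Linda']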

Try something like this:

df[col] = df[col].str.decode('utf8')
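
For instance, with a hypothetical column of UTF-8 encoded bytes (not data from the question):

import pandas as pd

s = pd.Series([b'Jos\xc3\xa9', b'Zo\xc3\xab'], name='user_name')
print(s.str.decode('utf8').tolist())  # ['José', 'Zoë']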

I query a KDB database, so I'm not sure how it would be in Postgres, but in my case strings always come back as bytes, and sometimes I need to use the 'latin-1' decoder instead of 'utf-8' (Spanish and French names in my case). What I do is run a function after each query that loops over every column stored as "object" and decodes it. Something like this:

def cleanup_datatypes(df, decoder='latin-1'):
    """
    KDB returns all strings as bytes; decode them into readable strings.
    The default is latin-1, which covers French, but it can be UTF-8 as well.
    """
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].str.decode(decoder)
    return df
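
A quick usage sketch with made-up byte strings (not data from the question), just to show what the helper does:

import pandas as pd

df = pd.DataFrame({'user_name': [b'Andr\xe9', b'Zo\xeb'], 'member_id': [1, 2]})
df = cleanup_datatypes(df)       # object (byte-string) columns get decoded, latin-1 by default
print(df['user_name'].tolist())  # ['André', 'Zoë']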