I am trying to add data from a csv to a PostgreSQL DB using the python lib. sqlalchemy, however I always receive the error: 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 79: invalid start byte' when executing
main.drop_duplicates().to_sql('main', engine, if_exists='replace', index=False)
(full code below).
What I did so far:
- changed in postgresql.conf: client_encoding = utf8
- tried with and without ?charset=utf8" after the database URI, as suggested here
- added .read_csv(…, encoding_errors="replace") (thanks for the comment!)
- load test data not from a csv, but directly create it in pandas
data = {
'PROJECT_NUMBER': ['R01HL093399', 'R01HL093399', 'R01HL093399', 'R01HL093399', 'R01HL095924'],
'PATENT_ID': [8840150, 10022225, 9283301, 10293082, 9556432],
'PMC_ID': [3132660, 3132660, 3132660, 3132660, 4121125],
'UNIQUE_ID': ['R01HL093399_8840150_3132660', 'R01HL093399_10022225_3132660', 'R01HL093399_9283301_3132660', 'R01HL093399_10293082_3132660', 'R01HL095924_9556432_4121125']
}
df = pd.DataFrame(data)
- use chardet to 'guess' the encoding of the csv (utf-8)
PostgreSQL db/table creation:
CREATE DATABASE Post_db
WITH
OWNER = admin
ENCODING = 'UTF8'
LC_COLLATE = 'en_US.UTF-8'
LC_CTYPE = 'en_US.UTF-8'
TEMPLATE = template0;
CREATE TABLE main (
PROJECT_NUMBER VARCHAR(255),
PRODUCT_ID INT,
CUSTOMERID INT,
UNIQUE_ID VARCHAR(255) PRIMARY KEY,
UNIQUE (PRODUCT_ID),
UNIQUE (CUSTOMERID)
);
Python code:
import pandas as pd
import os
from sqlalchemy import create_engine
from sqlalchemy.engine import URL
DATA_PATH = '../../Data/'
CSV_FILE = 'customer_data.csv'
url = URL.create(
drivername="postgresql",
username="admin",
host="localhost",
database="Post_db",
password="PASSWORT"
)
engine = create_engine(url)
df = pd.read_csv(os.path.join(DATA_PATH, CSV_FILE), encoding='utf8')
df.drop_duplicates().to_sql('main', engine, if_exists='replace', index=False)
Example of CSV file:
PROJECT_NUMBER PRODUCT_ID CUSTOMERID UNIQUE_ID
0 R01HL093399 8840150 3132660 R01HL093399_8840150_3132660
1 R01HL093399 10022225 3132660 R01HL093399_10022225_3132660
Copy paste from CSV file:
,PROJECT_NUMBER,PRODUCT_ID,CUSTOMERID,UNIQUE_ID
0,R01HL093399,8840150,3132660,R01HL093399_8840150_3132660
1,R01HL093399,10022225,3132660,R01HL093399_10022225_3132660
2,R01HL093399,9283301,3132660,R01HL093399_9283301_3132660
3,R01HL093399,10293082,3132660,R01HL093399_10293082_3132660
4,R01HL095924,9556432,4121125,R01HL095924_9556432_4121125
Even if I try the to_sql command on the rows only, I still get the exact same error as before.