I have been reading about pandas `to_sql` solutions for not adding duplicate records to a database. I am working with CSV files of logs; each time a new log file is uploaded, I read the data and make some changes with pandas, creating a new DataFrame.
Then I execute `to_sql('Logs', con=db.engine, if_exists='append', index=True)`. With the `if_exists` argument I make sure that each time, the DataFrame created from the new file is appended to the existing table. The problem is that it keeps adding duplicate values. I want to make sure that if a file which has already been uploaded is uploaded again by mistake, it won't be appended to the database. I want to do this directly when creating the database, without resorting to a workaround like just checking whether the filename has been used before.
I am working with Flask-SQLAlchemy.
Thank you.
Your best bet is to catch duplicates by setting up your index as a primary key, and then using `try`/`except` to catch uniqueness violations. You mentioned another post that suggested watching for `IntegrityError` exceptions, and I agree that's the best approach. You can combine that with a de-duplication function to make sure your table updates run smoothly.

Demonstrating the problem
Here's a toy example:
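A minimal sketch, assuming an in-memory SQLite database; the table name `foo`, the column `A`, and the `id` primary key are placeholders for your own schema. Note that `to_sql()` does not create a primary key on its own, so the table is defined up front:

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError

engine = create_engine('sqlite://')  # in-memory database, just for the demo

# to_sql() won't add a PRIMARY KEY constraint itself, so create the
# table first with the index column as the key:
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE foo (id INTEGER PRIMARY KEY, A INTEGER)"))
```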
Now, two example data frames, `df` and `df2`:
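The exact values are arbitrary; what matters is that the two frames overlap at index 1, simulating a partial re-upload:

```python
df = pd.DataFrame({'A': [10, 20]}, index=[0, 1])
df2 = pd.DataFrame({'A': [20, 30]}, index=[1, 2])  # index 1 duplicates df
```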
Move `df` into table `foo`:
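Using the same placeholder names as above, `index_label` maps the frame's index onto the `id` primary-key column:

```python
df.to_sql('foo', con=engine, if_exists='append', index=True, index_label='id')
```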
Now, when we try to append `df2`, we catch the `IntegrityError`:
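Continuing the sketch:

```python
try:
    df2.to_sql('foo', con=engine, if_exists='append', index=True, index_label='id')
except IntegrityError as e:
    print("Caught IntegrityError:", e.orig)
```

Output:

```
Caught IntegrityError: UNIQUE constraint failed: foo.id
```

(The exact message is SQLite's; other backends word it differently.)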
Suggested Solution
On `IntegrityError`, you can pull the existing table data, remove the duplicate entries of your new data, and then retry the append statement. Use `apply()` for this:
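Here is a sketch of that recovery step; the helper function and its name are my own, not part of pandas:

```python
def append_without_duplicates(new_df, table, engine, index_label='id'):
    try:
        new_df.to_sql(table, con=engine, if_exists='append',
                      index=True, index_label=index_label)
    except IntegrityError:
        # The failed insert is rolled back (default single chunk),
        # so nothing was written yet. Pull the existing rows and keep
        # only new index values; apply() runs the check row by row.
        existing = pd.read_sql_table(table, con=engine, index_col=index_label)
        keep = ~new_df.index.to_series().apply(lambda ix: ix in existing.index)
        new_df[keep].to_sql(table, con=engine, if_exists='append',
                            index=True, index_label=index_label)

append_without_duplicates(df2, 'foo', engine)
print(pd.read_sql_table('foo', con=engine, index_col='id'))
```

Output:

```
     A
id
0   10
1   20
2   30
```

Reading back the whole table is fine for a toy example; on a large log table you would want to query only the candidate ids instead of pulling every row.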