Avoiding duplicate data in HDF5 or SQLite

I am taking 30 days of historical data and modifying it.

Ideally, I can read the historical data filtered to a dynamic rolling window of the last 30 days. The 'DateTime' value is a column in the raw data.

df_new = pd.read_csv(loc+filename)
max_date = df_new['DateTime'].max()
date_range = max_date - pd.Timedelta(30, unit='d')

df_old = pd.read_hdf(loc+filename, 'TableName', where='DateTime > date_range')

Then I read the new data, which is a separate file that always contains month-to-date values (all of June, for example; this file is replaced daily with the latest data), and concat it with the old dataframe.

frames = [df_old, df_new]
df = pd.concat(frames)

Then I do some things to the file (I am checking if certain values repeat within a 30 day window, if they do then I place a timestamp in a column).

Now I want to write this modified data back into my original file (it was HDF5, but it could be a .sqlite file too). There are bound to be a ton of duplicates, since I am reading both the previous 30 days of data and the MTD data. How do I manage this?

My only solution is to read the entire file (df_old along with the new data I added) and then drop duplicates and then overwrite it again. This isn't very efficient.
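If you stay with pandas, the rewrite can at least be cheapened by dropping duplicates on the three key columns instead of comparing whole rows, with keep='last' so the freshly modified rows win. A minimal sketch (the column names Date, EmpID, CustomerID come from the question; the sample values and the Flag column are made up for illustration):

```python
import pandas as pd

# Hypothetical data: df_old stands in for the stored history,
# df_new for the month-to-date file.
df_old = pd.DataFrame({
    "Date": ["2023-06-01", "2023-06-02"],
    "EmpID": [1, 2],
    "CustomerID": [10, 20],
    "Flag": [None, None],
})
df_new = pd.DataFrame({
    "Date": ["2023-06-02", "2023-06-03"],
    "EmpID": [2, 3],
    "CustomerID": [20, 30],
    "Flag": ["2023-06-02 09:00", None],
})

combined = pd.concat([df_old, df_new], ignore_index=True)
# Keep only the last (newest) copy of each (Date, EmpID, CustomerID) key.
deduped = combined.drop_duplicates(
    subset=["Date", "EmpID", "CustomerID"], keep="last"
).reset_index(drop=True)
print(len(deduped))  # 3 unique keys remain out of 4 rows
```

This still rewrites the file, but the duplicate check touches only the key columns, which is noticeably cheaper than full-row comparison on wide frames.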

Can the .sqlite or .hdf formats enforce non-duplicates? If so, I have three columns that together identify a unique row (Date, EmpID, CustomerID). I do not want exact duplicate rows.

1 Answer

Define those columns as a composite primary key in SQLite. It won't allow you to insert rows with a non-unique primary key.

e.g.

CREATE TABLE mytable (
a INT,
b INT,
c INT,
PRIMARY KEY(a, b)
);

won't allow duplicate (a, b) pairs to be added to the data. Then use INSERT OR IGNORE to add data, and any duplicates will be silently skipped.
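Since the question uses pandas, the same idea can be driven from Python's built-in sqlite3 module. A sketch (the table name and the three key columns are taken from the question; the Flag column and sample rows are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # swap for a file path in practice
conn.execute("""
    CREATE TABLE IF NOT EXISTS visits (
        Date TEXT,
        EmpID INTEGER,
        CustomerID INTEGER,
        Flag TEXT,
        PRIMARY KEY (Date, EmpID, CustomerID)
    )
""")

rows = [
    ("2023-06-01", 1, 10, None),
    ("2023-06-02", 2, 20, None),
    ("2023-06-02", 2, 20, "duplicate key"),  # same key -> silently skipped
]
# INSERT OR IGNORE drops rows whose primary key already exists.
conn.executemany("INSERT OR IGNORE INTO visits VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
print(count)  # 2
```

With this in place you can dump the concatenated dataframe in every day; only genuinely new (Date, EmpID, CustomerID) combinations are stored, so no read-dedupe-rewrite pass is needed.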

http://sqlite.org/lang_insert.html