I'm currently working on a dataframe that I want to create out of two different dataframes with the difficulty that I'm constantly running into memory error. I have already posted a question about it here:
In Python: How can I merge these two dataframes without running into Memory Error?
This is how the two dataframes look like:
event_df = {'ID_1': [1, 1, 1, 2, 3],
'ID_2': [1, 1, 2, 1, 1],
'Timestamp': ['1984-05-11 14:30:00',
'1984-05-11 15:30:00',
'1990-12-11 09:10:00',
'1975-01-08 23:23:00',
'1984-05-11 14:30:00'],
'Event': [0, 1, 0, 1, 1]
}
event_df = pd.DataFrame(event_df)
event_df['Timestamp'] = event_df['Timestamp'].astype('datetime64[s]')
lab_df = {'ID_1': [1, 1, 1, 2, 3],
'ID_2': [1, 1, 1, 1, 1],
'Timestamp_Lab': ['1984-05-11 14:00:00',
'1984-05-11 14:15:00',
'1984-05-11 15:00:00',
'1975-01-08 20:00:00',
'1984-05-11 14:00:00'],
'Hemoglobin': [np.nan, 14, 13, 10, 11],
'Leukocytes': [123, np.nan, 123, 50, 110],
'Platelets': [50, 50, 50, 110, 50]
}
lab_df = pd.DataFrame(lab_df)
lab_df['Timestamp_Lab'] = lab_df['Timestamp_Lab'].astype('datetime64[s]')
This is how I want the end result to look like:
result = {'ID_1': [1, 1, 1, 2, 3],
'ID_2': [1, 1, 2, 1, 1],
'Timestamp': ['1984-05-11 14:30:00',
'1984-05-11 15:30:00',
'1990-12-11 09:10:00',
'1975-01-08 23:23:00',
'1984-05-11 14:30:00'],
'Event': [0, 1, 0, 1, 1],
'Hemoglobin_1': [14, 14, np.nan, 10, 11],
'Hemoglobin_2': [np.nan, 13, np.nan, np.nan, np.nan],
'Leukocytes_1': [123, 123, np.nan, 50, 110],
'Leukocytes_2': [np.nan, 123, np.nan, np.nan, np.nan],
'Platelets_1': [50, 50, np.nan, 110, 50],
'Platelets_2': [50, 50, np.nan, np.nan, np.nan],
'Platelets_3': [np.nan, 50, np.nan, np.nan, np.nan]
}
result = pd.DataFrame(result)
Although comments from the last post have been helpful, the memory issue has not been solved. I therefore tried to use DuckDB in Python.
With Python code I originally tried to join the two dataframes on ID_2 and then remove some rows under the condition that the Timestamp_Lab is "larger" than the Timestamp from event_df. In SQL this can be done in one go, which is quite nice. Afterwards I grouped the Timestamps and SMID in Python and then indexed them. I did that so that I could pivot it wider afterwards using the newly created index, like this:
parameter = merged_df.columns[5:] #select all parameter names
merged_df['Index'] = merged_df.groupby(['Timestamp', 'ID_2']).cumcount() + 1
join = merged_df.pivot_table(index='Timestamp', columns='Index', values = parameter)
event_df = event_df[event_df.columns[0:5]]
event_df = event_df.merge(right=join, how='left',on= ['Timestamp', 'ID_2'])
In SQL I added the index like this:
SELECT *,
ROW_NUMBER() OVER (PARTITION BY event_df.ID_2, Timestamp ORDER BY Timestamp_Lab) AS GroupIndex
FROM event_df
LEFT JOIN lab_df
ON event_df.ID_2 = lab_df.ID_2 AND event_df.Timestamp >= lab_df.Timestamp_Lab
But now I'm having trouble pivoting. I tried the following, but I gives me a ParserError (I'm not very familiar with the syntax):
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY event_df.ID_2, Timestamp ORDER BY Timestamp_Lab) AS GroupIndex
FROM (
SELECT *
FROM event_df
LEFT JOIN lab_df
ON event_df.ID_2 = lab_df.ID_2 AND event_df.Timestamp >= lab_df.Timestamp_Lab
) subquery
)
PIVOT (
ON GroupIndex
USING (Hemoglobin)
GROUP BY even_df.ID_2, Timestamp
) AS pivoted_data
My question therefore is: Is a pivot on the index (as I did in Python) possible in SQL and if it is, how would it look like? I tried it, but couldn't get it to work. Even focusing on one parameter (e.g. Hemoglobin) and leaving out the rest would already help me a lot.
I tried with the following code
I am joining the pivoted table again, to get the columns which were dropped during and later on dropping the columns, which we pivoted.