Is pivoting wider in DuckDB on an index column possible?

221 Views Asked by At

I'm currently working on a dataframe that I want to create out of two different dataframes with the difficulty that I'm constantly running into memory error. I have already posted a question about it here:
In Python: How can I merge these two dataframes without running into Memory Error?

This is how the two dataframes look like:

event_df = {'ID_1': [1, 1, 1, 2, 3],
            'ID_2': [1, 1, 2, 1, 1],
            'Timestamp': ['1984-05-11 14:30:00', 
                          '1984-05-11 15:30:00', 
                          '1990-12-11 09:10:00', 
                          '1975-01-08 23:23:00', 
                          '1984-05-11 14:30:00'],
                'Event': [0, 1, 0, 1, 1]
               }
event_df = pd.DataFrame(event_df)
event_df['Timestamp'] = event_df['Timestamp'].astype('datetime64[s]')

lab_df = {'ID_1': [1, 1, 1, 2, 3],
          'ID_2': [1, 1, 1, 1, 1],
          'Timestamp_Lab': ['1984-05-11 14:00:00', 
                            '1984-05-11 14:15:00', 
                            '1984-05-11 15:00:00',
                            '1975-01-08 20:00:00',
                            '1984-05-11 14:00:00'],
          'Hemoglobin': [np.nan, 14, 13, 10, 11],
          'Leukocytes': [123, np.nan, 123, 50, 110],
          'Platelets': [50, 50, 50, 110, 50]
         }
lab_df = pd.DataFrame(lab_df)
lab_df['Timestamp_Lab'] = lab_df['Timestamp_Lab'].astype('datetime64[s]')

This is how I want the end result to look like:

result = {'ID_1': [1, 1, 1, 2, 3],
          'ID_2': [1, 1, 2, 1, 1],
          'Timestamp': ['1984-05-11 14:30:00', 
                        '1984-05-11 15:30:00', 
                        '1990-12-11 09:10:00', 
                        '1975-01-08 23:23:00', 
                        '1984-05-11 14:30:00'],
          'Event': [0, 1, 0, 1, 1],
          'Hemoglobin_1': [14, 14, np.nan, 10, 11],
          'Hemoglobin_2': [np.nan, 13, np.nan, np.nan, np.nan],
          'Leukocytes_1': [123, 123, np.nan, 50, 110],
          'Leukocytes_2': [np.nan, 123, np.nan, np.nan, np.nan],
          'Platelets_1': [50, 50, np.nan, 110, 50],
          'Platelets_2': [50, 50, np.nan, np.nan, np.nan],
          'Platelets_3': [np.nan, 50, np.nan, np.nan, np.nan]          
}
result = pd.DataFrame(result)

Although comments from the last post have been helpful, the memory issue has not been solved. I therefore tried to use DuckDB in Python.

With Python code I originally tried to join the two dataframes on ID_2 and then remove some rows under the condition that the Timestamp_Lab is "larger" than the Timestamp from event_df. In SQL this can be done in one go, which is quite nice. Afterwards I grouped the Timestamps and SMID in Python and then indexed them. I did that so that I could pivot it wider afterwards using the newly created index, like this:

parameter = merged_df.columns[5:] #select all parameter names
merged_df['Index'] = merged_df.groupby(['Timestamp', 'ID_2']).cumcount() + 1
join = merged_df.pivot_table(index='Timestamp', columns='Index', values = parameter)
event_df = event_df[event_df.columns[0:5]]
event_df = event_df.merge(right=join, how='left',on= ['Timestamp', 'ID_2'])

In SQL I added the index like this:

SELECT *,
    ROW_NUMBER() OVER (PARTITION BY event_df.ID_2, Timestamp ORDER BY Timestamp_Lab) AS GroupIndex
FROM event_df
    LEFT JOIN lab_df
        ON event_df.ID_2 = lab_df.ID_2 AND event_df.Timestamp >= lab_df.Timestamp_Lab

But now I'm having trouble pivoting. I tried the following, but I gives me a ParserError (I'm not very familiar with the syntax):

SELECT *
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY event_df.ID_2, Timestamp ORDER BY Timestamp_Lab) AS GroupIndex
    FROM (
        SELECT *
          FROM event_df
               LEFT JOIN lab_df
                  ON event_df.ID_2 = lab_df.ID_2 AND event_df.Timestamp >= lab_df.Timestamp_Lab
    ) subquery
)
PIVOT (
    ON GroupIndex
    USING (Hemoglobin)
    GROUP BY even_df.ID_2, Timestamp
) AS pivoted_data

My question therefore is: Is a pivot on the index (as I did in Python) possible in SQL and if it is, how would it look like? I tried it, but couldn't get it to work. Even focusing on one parameter (e.g. Hemoglobin) and leaving out the rest would already help me a lot.

1

There are 1 best solutions below

0
On

I tried with the following code

import numpy as np
import pandas as pd

event_df = {'ID_1': [1, 1, 1, 2, 3],
            'ID_2': [1, 1, 2, 1, 1],
            'Timestamp': ['1984-05-11 14:30:00', 
                          '1984-05-11 15:30:00', 
                          '1990-12-11 09:10:00', 
                          '1975-01-08 23:23:00', 
                          '1984-05-11 14:30:00'],
                'Event': [0, 1, 0, 1, 1]
               }
event_df = pd.DataFrame(event_df)
event_df['Timestamp'] = event_df['Timestamp'].astype('datetime64[s]')

lab_df = {'ID_1': [1, 1, 1, 2, 3],
          'ID_2': [1, 1, 1, 1, 1],
          'Timestamp_Lab': ['1984-05-11 14:00:00', 
                            '1984-05-11 14:15:00', 
                            '1984-05-11 15:00:00',
                            '1975-01-08 20:00:00',
                            '1984-05-11 14:00:00'],
          'Hemoglobin': [np.nan, 14, 13, 10, 11],
          'Leukocytes': [123, np.nan, 123, 50, 110],
          'Platelets': [50, 50, 50, 110, 50]
        }
lab_df = pd.DataFrame(lab_df)
lab_df['Timestamp_Lab'] = lab_df['Timestamp_Lab'].astype('datetime64[s]')

print(lab_df.pivot_table(index=lab_df.index, columns=['ID_1'], values=['Hemoglobin', 'Leukocytes', 'Platelets']).join(lab_df).drop(['Hemoglobin', 'Leukocytes', 'Platelets'], axis=1))

I am joining the pivoted table again, to get the columns which were dropped during and later on dropping the columns, which we pivoted.

   (Hemoglobin, 1)  (Hemoglobin, 2)  (Hemoglobin, 3)  (Leukocytes, 1)  (Leukocytes, 2)  (Leukocytes, 3)  (Platelets, 1)  (Platelets, 2)  (Platelets, 3)  ID_1  ID_2       Timestamp_Lab
0              NaN              NaN              NaN            123.0              NaN              NaN            50.0             NaN             NaN     1     1 1984-05-11 14:00:00
1             14.0              NaN              NaN              NaN              NaN              NaN            50.0             NaN             NaN     1     1 1984-05-11 14:15:00
2             13.0              NaN              NaN            123.0              NaN              NaN            50.0             NaN             NaN     1     1 1984-05-11 15:00:00
3              NaN             10.0              NaN              NaN             50.0              NaN             NaN           110.0             NaN     2     1 1975-01-08 20:00:00
4              NaN              NaN             11.0              NaN              NaN            110.0             NaN             NaN            50.0     3     1 1984-05-11 14:00:00