I have a large pandas DataFrame of ~2 million rows. Columns A and B are both ID columns, where ID==1 in column A is the same entity as ID==1 in column B. Column C is the target value column.
Is there an efficient way, using pandas' native functions such as groupby or vectorised methods, and avoiding looping, to do the following?
For each row, look up the previous occurrence in column A of that row's column B ID value, and get column C's value from that earlier row.
I've tried various groupby/rolling methods to no avail. Many thanks in advance.
You can use a combination of `pd.get_dummies` and `DataFrame.mul`. A simple example:
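As a stand-in for the example data, here is a minimal hypothetical frame. The ID and `C` values are invented for illustration and chosen only so that they fit the description that follows:

```python
import pandas as pd

# Hypothetical data: IDs in B at rows 6-9 have appeared earlier in A.
# Note that A repeats ID 1 at index 0 and index 7.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 1, 8, 9],
    "B": [11, 12, 13, 14, 15, 16, 1, 1, 1, 5],
    "C": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})
print(df)
```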
Here, indices 6-9 contain IDs in column `B` that occur in previous rows of column `A`: 6 and 7 match index 0, 8 matches index 7, and 9 matches index 4.
Breaking this down, the first step creates a `pd.DataFrame` of 1s and 0s for the IDs of column `A`. This is then multiplied by the values in column `C`.
Zeros are forward-filled and the rows are shifted once (otherwise index 7 in this example would return its own value in `C`). The values are then multiplied by the 1s and 0s of `B`.
Finally, the maximum value in each row will always be the matching one (as all others will be `0` or `NaN`, assuming the values in `C` are positive). The resulting `NaN` values are filled with 0 for consistency.
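Putting these steps together, a minimal sketch might look like the following. The frame is hypothetical (invented IDs and `C` values arranged to fit the description above), and the variable names are illustrative. One substitution is made: instead of replacing zeros before forward-filling, `.where(a == 1)` blanks out the non-matching cells, so a genuine 0 in `C` is not mistaken for a gap. The final `max` step still assumes the values in `C` are positive.

```python
import pandas as pd

# Hypothetical data: A repeats ID 1 at index 0 and 7; rows 6-9 of B refer to earlier A IDs.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 1, 8, 9],
    "B": [11, 12, 13, 14, 15, 16, 1, 1, 1, 5],
    "C": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# Step 1: one-hot encode A and multiply each row by its C value.
a = pd.get_dummies(df["A"], dtype=float)
vals = a.mul(df["C"], axis=0)

# Step 2: keep C only where the row's A matches the column ID, forward-fill the
# last seen value per ID, then shift so each row sees strictly earlier rows only.
prev = vals.where(a == 1).ffill().shift(1)

# Step 3: one-hot encode B and multiply; columns are aligned by ID label,
# and IDs present in only one of the two frames become all-NaN columns.
b = pd.get_dummies(df["B"], dtype=float)
matched = prev.mul(b)

# Step 4: the row-wise maximum picks out the matching value (assumes C > 0);
# rows with no earlier match end up as 0.
df["prev_C"] = matched.max(axis=1).fillna(0)
print(df)
```

With this data, `prev_C` is 0 for rows 0-5, then 10, 10, 80 and 50 for rows 6-9. Bear in mind that the dummy frames hold one column per distinct ID, so memory use grows with the number of rows times the number of unique IDs.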