How to retain rows from multiple dataframes with common column value?

625 Views Asked by At

Suppose I have multiple dataframes:

print (df1)
                datetime    A
0       2012-08-14 07:00    1
1       2012-08-14 07:01    2
2       2012-08-14 08:15    3
                     ...  ...
192908  2013-08-14 16:00  600
192948  2013-08-14 16:15  700
192949  2013-08-14 16:57  900

print (df2)
               datetime    B    
0      2012-08-14 07:00  100
1      2012-08-14 07:15  200
2      2012-08-14 07:30  300
                    ...  ...
12140  2013-09-24 15:45   50
12141  2013-09-24 16:00   60
12142  2013-09-24 16:15   70

How do I create a new df containing only the rows for which there is a value in columns A and B at the same datetime? I tried using the isin function:

df1 = df1[df1['date'].isin(df2['date'])]

but this only does a one-way check, i.e. only values of A for which there exists a value of B at the same datetime are retained, but if there are extra values in B for datetimes that do not exist in A then these are left in df2.

I can repeat the operation in the opposite direction to resolve this:

df2 = df2[df2['date'].isin(df1['date'])]

but for >2 dataframes (I have around fifty in my present work) this becomes extremely long and inefficient because it would be necessary to go through every possible paired combination between the full set of dataframes. For example, a third dataframe, df3, would first need to be checked against df1 and df2, but if it contained datetimes that exist neither in df1 nor df2 then df1 and df2 would in turn need to be re-checked back against df3.

Desired output is to have re-defined all of the dataframes such that they contain only values of A, B, etc. with a matching datetime value.

1

There are 1 best solutions below

3
On BEST ANSWER

This is a join / merge operation. Standard Codd relational theory/algebra.

import io
df1 = pd.read_csv(io.StringIO("""                datetime    A
0       2012-08-14 07:00    1
1       2012-08-14 07:01    2
2       2012-08-14 08:15    3
192908  2013-08-14 16:00  600
192948  2013-08-14 16:15  700
192949  2013-08-14 16:57  900"""), sep="\s\s+", engine="python")

df2 = pd.read_csv(io.StringIO("""               datetime    B    
0      2012-08-14 07:00  100
1      2012-08-14 07:15  200
2      2012-08-14 07:30  300
12140  2013-09-24 15:45   50
12141  2013-09-24 16:00   60
12142  2013-09-24 16:15   70"""), sep="\s\s+", engine="python")

pd.merge(df1,df2, on="datetime", how="inner")

output

    datetime    A   B
0   2012-08-14 07:00    1   100

want to merge many data frames

import io, random, functools

# generate a list of dataframes for merge... start with two sample ones
dfs = [df1, df2]
# generate longer list of dataframes, rename columns to add some interest for merge :-)
dfs = [dfs[random.randint(0, len(dfs)-1)].pipe(lambda d: d.rename(columns={d.columns[1]:f"{d.columns[1]}_{i}"})) for i in range(8)]

# and one line merge the whole list of dataframes
functools.reduce(lambda left,right: pd.merge(left,right,on='datetime'), dfs)

datetime A_0 B_1 B_2 B_3 A_4 B_5 B_6 A_7
0 2012-08-14 07:00 1 100 100 100 1 100 100 1