How to merge two pandas data frames where there are some with no matches and some with multiple?

71 Views Asked by At

I have two pandas data frames, one with two columns (let’s say “a” and “b”). This data frame was pulled in from an excel and the goal is to ultimately export a cop5 of this excel with variable “c” as a third column.

The second data frame also has variable “b” as well as the variable “c” that I want to merge into the first data frame when the rows match via the “b” amount.

The issue is, there are some “b”s that exist in the first data frame that are not on the second. And there are some “b”s in the first data frame that have multiple matches in the second data frame. (The first data frame only has one line for each b but the second data frame often has multiple lines of the same “b” with different amounts in the “c”.

Here is what I want to have done: I want a third column to be created in the first data frame.

When there is a row in data frame 1 where the “b” value doesn’t exist in data frame 2, I want it to put “0” in the third column for this row.

When there is a row in data frame 1 where the “b” value does exist and match in data frame 2, and it only appears once, I want the “c” amount from data frame 2 to get out into the third column for that row in data frame 1 (just like a standard merge).

Then when there is a row in data frame 1 where the “b” value does exist and match data frame 2, and that b appears on multiple rows in data frame 2, I want it to put a string in the third column for that row in data frame one where it shows all the amounts. For example, if the “b” identifier appeared twice in data frame 2, where one row has 10 in column c and the other row has 50 in column c, I want it to put the following string, “10, 50”, into the third column on data frame 1.

I’ve thought of maybe iterating through all the rows of data frame one, comparing it against the second data frame, and either pulling in information when applicable or putting a zero when it doesn’t appear.

I’ve also thought of just starting with a merge.

I just don’t want to over complicate things, especially because I want this to run quickly and smoothly for large datasets.

Here is an example of what I want to accomplish.

I pull in excel into df1, which looks like this:

A    B
XY   123
ZA   456
BD   789
EF   001
GH   555

I then have df2 which looks like this:

B     C
123   1000
456   200
789   50
456   75
555   90
456   100
123   2000

As you can see, sometimes a B exists in df1 that doesn’t exist in df2 (but never the other way around), and sometimes a B exists in df1 that appears multiple times in df2, with different amounts.

I want to create the following df from these two datasets:

A    B    C
XY   123  1000, 2000
ZA   456  75, 100
BD   789  50
EF   001  0
GH   555  90

Note that the A column needs to be sorted the same way as it was in df1, because it is going to be copied back into the excel where all the initial data came from. I can do that part though. I just need help getting to this result in the data frame! Thanks!

2

There are 2 best solutions below

0
Nick On

You can group df2 by B (aggregating as a list) and then merge that with df1, filling NaN with [0]:

res = df1.merge(df2.groupby('B').agg(list), on='B', how='left')
res.loc[res['C'].isnull(), 'C'] = pd.Series([[0]] * res['C'].isnull().sum()).values

Output for your sample data:

    A    B               C
0  XY  123    [1000, 2000]
1  ZA  456  [200, 75, 100]
2  BD  789            [50]
3  EF    1             [0]
4  GH  555            [90]

If you want a comma separated string as output you just need to convert the C values to str to join them:

res = df1.merge(df2['C'].astype(str).groupby(df2['B']).agg(', '.join), on='B', how='left').fillna('0')

Output for your sample data:

    A    B             C
0  XY  123    1000, 2000
1  ZA  456  200, 75, 100
2  BD  789            50
3  EF    1             0
4  GH  555            90
1
taller On

Please try.

import pandas as pd
data1 = {
    'A': ['XY', 'ZA', 'BD', 'EF', 'GH'],
    'B': [123, 456, 789, 1, 555]
}
data2 = {
    'B': [123, 456, 789, 456, 555, 456, 123],
    'C': [1000, 200, 50, 75, 90, 100, 2000]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2 = df2.groupby('B')['C'].agg(lambda x: ','.join(map(str, x))).reset_index()
result = pd.merge(df1, df2, on='B', how='left').fillna(0)
print(result)

Output

    A    B           C                                                                                  
0  XY  123   1000,2000                                                                                  
1  ZA  456  200,75,100     ** 200 is missing on OP                                                                             
2  BD  789          50                                                                                  
3  EF    1           0                                                                                  
4  GH  555          90