I have two pandas data frames, one with two columns (let’s say “a” and “b”). This data frame was pulled in from an excel and the goal is to ultimately export a cop5 of this excel with variable “c” as a third column.
The second data frame also has variable “b” as well as the variable “c” that I want to merge into the first data frame when the rows match via the “b” amount.
The issue is, there are some “b”s that exist in the first data frame that are not on the second. And there are some “b”s in the first data frame that have multiple matches in the second data frame. (The first data frame only has one line for each b but the second data frame often has multiple lines of the same “b” with different amounts in the “c”.
Here is what I want to have done: I want a third column to be created in the first data frame.
When there is a row in data frame 1 where the “b” value doesn’t exist in data frame 2, I want it to put “0” in the third column for this row.
When there is a row in data frame 1 where the “b” value does exist and match in data frame 2, and it only appears once, I want the “c” amount from data frame 2 to get out into the third column for that row in data frame 1 (just like a standard merge).
Then when there is a row in data frame 1 where the “b” value does exist and match data frame 2, and that b appears on multiple rows in data frame 2, I want it to put a string in the third column for that row in data frame one where it shows all the amounts. For example, if the “b” identifier appeared twice in data frame 2, where one row has 10 in column c and the other row has 50 in column c, I want it to put the following string, “10, 50”, into the third column on data frame 1.
I’ve thought of maybe iterating through all the rows of data frame one, comparing it against the second data frame, and either pulling in information when applicable or putting a zero when it doesn’t appear.
I’ve also thought of just starting with a merge.
I just don’t want to over complicate things, especially because I want this to run quickly and smoothly for large datasets.
Here is an example of what I want to accomplish.
I pull in excel into df1, which looks like this:
A B
XY 123
ZA 456
BD 789
EF 001
GH 555
I then have df2 which looks like this:
B C
123 1000
456 200
789 50
456 75
555 90
456 100
123 2000
As you can see, sometimes a B exists in df1 that doesn’t exist in df2 (but never the other way around), and sometimes a B exists in df1 that appears multiple times in df2, with different amounts.
I want to create the following df from these two datasets:
A B C
XY 123 1000, 2000
ZA 456 75, 100
BD 789 50
EF 001 0
GH 555 90
Note that the A column needs to be sorted the same way as it was in df1, because it is going to be copied back into the excel where all the initial data came from. I can do that part though. I just need help getting to this result in the data frame! Thanks!
You can group
df2byB(aggregating as a list) and then merge that withdf1, fillingNaNwith[0]:Output for your sample data:
If you want a comma separated string as output you just need to convert the
Cvalues tostrtojointhem:Output for your sample data: