List of tuples: combine two dataframe columns if first tuple elements match

108 Views Asked by At

I have two separate lists each of n elements, with one being ID numbers and the second being pandas dataframes. I will define them as id and dfs. The dataframes in dfs have the same format with columns A, B, and C but with different numerical values. I have zipped the two lists together as such:

df_groups = list(zip(id, dfs))

With this list I am trying to locate any instances where id is the same and then add columns A and B together for those dataframes and merge into one dataframe. For an example, I will use the following:

id = ['a','b','c','a','d']

The corresponding dataframes I have may look as such:

dfs[0]
A  B  C
0  0  1
0  0  1

dfs[1]
A  B  C
0  1  1
0  1  1

dfs[2]
A  B  C
1  1  1
1  2  1

dfs[3]
A  B  C
5  6  1
11 8  1

dfs[4]
A  B  C
3  5  2
3  18 2

Then, as can be seen above, id[0] is the same as id[3]. As such, I want to create a new list of tuples such that dfs[0]['A'] and dfs[3]['A'] are added together (similarly also for column B and the duplicate id value is dropped.

Thus, it should look like this:

id = ['a','b','c','d']

dfs[0]
A  B  C
5  6  1
11 8  1

dfs[1]
A  B  C
0  1  1
0  1  1

dfs[2]
A  B  C
1  1  1
1  2  1

dfs[3]
A  B  C
3  5  2
3  18 2

The following worked for removing the duplicate values of id but I am not quite sure how to go about the column operations on dfs. I will of course need to add the columns A and B first before running the below:

from itertools import groupby

df_groups_b = ([next(b) for a, b in groupby(df_groups, lambda x: x[0])])

Any assistance would be much appreciated, thank you!

Edit: to clarify, column C from the original dataframe would be retained as is. In the case where the first tuple elements match, column C from the corresponding dataframes will be identical.

4

There are 4 best solutions below

8
On BEST ANSWER

You can write a custom summarising function to go through all dataframes in a group and return a sum.

I don't entirely love this solution, because in converts C to float, but you can play with it further if needed:

from itertools import groupby
import pandas as pd

ids = ['a','b','c','a','d']
dfs = [
    pd.DataFrame({"A": [0], "B": [0], "C": [0]}),
    pd.DataFrame({"A": [1], "B": [1], "C": [1]}),
    pd.DataFrame({"A": [2], "B": [2], "C": [2]}),
    pd.DataFrame({"A": [3], "B": [3], "C": [3]}),
    pd.DataFrame({"A": [4], "B": [4], "C": [4]})
]
df_groups = list(zip(ids, dfs))
df_groups = sorted(df_groups, key=lambda x: x[0])
def summarise_cols(group, cols_to_summarise=["A", "B"]):
    _, df = next(group)
    for _, next_df in group:
        df = df.add(next_df[cols_to_summarise], fill_value=0)
    return df

df_groups_b = ([summarise_cols(group)
                for _, group in groupby(df_groups, lambda x: x[0])])

for d in df_groups_b:
    print(d)

Output:

   A  B    C
0  3  3  0.0
   A  B  C
0  1  1  1
   A  B  C
0  2  2  2
   A  B  C
0  4  4  4

UPD: just noticed your update

Edit: to clarify, column C from the original dataframe would be retained as is. In the case where the first tuple elements match, column C from the corresponding dataframes will be identical.

You can then do this instead and the column types will be unchanged

def summarise_cols(group, cols_to_keep=["C"]):
    _, df = next(group)
    for _, next_df in group:
        df += next_df
        df[cols_to_keep] = next_df[cols_to_keep]
    return df

UPD2: Returning group_id and unified dataframe together as a tuple.

def summarise_cols(group, cols_to_keep=["C"]):
    group_id, df = next(group)
    for _, next_df in group:
        df += next_df
        df[cols_to_keep] = next_df[cols_to_keep]
    return group_id, df
0
On

Another way is to use pandas grouping:

a = (pd.concat([d.assign(id = i).reset_index() for i, d in df_groups])
       .groupby(['id', 'index'])
       .agg({'A':'sum', 'B':'sum', 'C':'first'})
       .reset_index(0).rename_axis(index = {'index':''})
       .pipe(lambda x: x.groupby(x.pop('id'))))

new_id, new_df_groups = zip(*a)
print(new_id)

('a', 'b', 'c', 'd')

for d in new_df_groups:
  print(d, "\n\n")

    A  B  C
0   5  6  1
1  11  8  1

   A  B  C
0  0  1  1
1  0  1  1

   A  B  C
0  1  1  1
1  1  2  1

   A   B  C
0  3   5  2
1  3  18  2
0
On

You can keep track of repeated ids using a dict, then join them using concat and groupby:

def add_dfs(df1: pd.DataFrame, df2: pd.DataFrame):
    to_sum = ["A", "B"]
    df = pd.concat([df1.loc[:, to_sum], df2[to_sum]]).reset_index()
    df = (
        df.groupby("index")
        .sum()
        .reset_index()
        .merge(df1["C"].reset_index(), how="left", on="index")
        .drop(columns="index")
    )
    return df


d = {}

for id, df in zip(id, dfs):
    if id not in d.keys():
        d[id] = df
    else:
        df2 = d[id]
        df = add_dfs(df, df2)
        d[id] = df

dfs = list(d.values())
0
On

I would first concat then use a simple dictionary comprehension with groupby:

ids = ['a','b','c','a','d']

out = {k: g.groupby(level=1).agg(dict.fromkeys(g.columns, 'sum')|{'C': 'first'})
       for k,g in pd.concat(dfs, keys=ids, names=['id', None]).groupby('id')}

NB. dict.fromkeys(g.columns, 'sum')|{'C': 'first'} is creating {'A': 'sum', 'B': 'sum', 'C': 'first'} programmatically. All columns as 'sum', except C as 'first'.

Or with pure python to group the dataframes with the same id (will be more memory efficient for many dfs:

d = {}
for k, df in zip(ids, dfs):
    d.setdefault(k, []).append(df)
    
out = {k: pd.concat(v).groupby(level=0)
            .agg({'A': 'sum', 'B': 'sum', 'C': 'first'})
       for k, v in d.items()}

Output:

{'a':     A  B  C
      0   5  6  1
      1  11  8  1,
 'b':    A  B  C
      0  0  1  1
      1  0  1  1,
 'c':    A  B  C
      0  1  1  1
      1  1  2  1,
 'd':    A   B  C
      0  3   5  2
      1  3  18  2}