Join big dataframes based on partial string-match between columns

82 Views Asked by At

Two DataFrames have gene and isoform names that are not formatted the same way. I'd like to do a join and add the df2 columns name, isoform for all partial string matches between the isoform (df2) and the name (df1) in both DataFrames. df2 is a key for the isoforms/genes, where a gene can have many isoforms. In df1, basically an output from a gene-quantification software (SALMON) the name field has both, the gene and isoform in it. I cant use regex since isoforms have variable suffixs, such as ".","_", "-", and many others. Another important piece of information is that each df1["Name"] cell has a unique isoform.

Piece of dfs to merge:

import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'AT1G01010;AT1G01010.1;Isoseq::Chr1:3616-5846', 1: 'AT1G01010;AT1G01010_2;Isoseq::Chr1:3630-5894', 2: 'AT1G01010;AT1G01010.3;Isoseq::Chr1:3635-5849', 3: 'AT1G01020;AT1G01020.11;Isoseq::Chr1:6803-8713', 4: 'AT1G01020;AT1G01020.13;Isoseq::Chr1:6811-8713'}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
df2 = pd.DataFrame({'gene': {0: 'AT1G01010', 14: 'AT1G01010', 30: 'AT1G01010', 46: 'AT1G01020', 62: 'AT1G01020', 80: 'AT1G01020', 100: 'AT1G01020', 116: 'AT1G01020', 138: 'AT1G01020', 156: 'AT1G01020'}, 'isoform': {0: 'AT1G01010.1', 14: 'AT1G01010_2', 30: 'AT1G01010.3', 46: 'AT1G01020.1', 62: 'AT1G01020.10', 80: 'AT1G01020.11', 100: 'AT1G01020.12', 116: 'AT1G01020.13', 138: 'AT1G01020.14', 156: 'AT1G01020.15'}})
display(df1)
display(df2)

Desired output:

df3 = pd.DataFrame({'gene': {0: 'AT1G01010', 1:"AT1G01010", 2:"AT1G01010", 3:"AT1G01020", 4:"AT1G01020"},'isoform': {0: 'AT1G01010.1',1:"AT1G01010_2", 2:"AT1G01010.3", 3:"AT1G01020.11", 4:"AT1G01020.13"}, 'Length': {0: 2230, 1: 2264, 2: 2214, 3: 1910, 4: 1902}, 'EffectiveLength': {0: 1980.0, 1: 2014.0, 2: 1964.0, 3: 1660.0, 4: 1652.0}, 'TPM': {0: 2.997776, 1: 1.58178, 2: 0.0, 3: 4.317311, 4: 0.0}, 'NumReads': {0: 154.876, 1: 83.124, 2: 0.0, 3: 187.0, 4: 0.0}})
#"Name" column from df1 is not necessary anymore. (the idea is to replace it for gene and isoform)
display(df3)

Real dfs size:

df1 = 143646 rows × 5 columns

df2 = 169499 rows × 2 columns

(since df1 may not have all the isoforms detected, it's always smaller than df2)

I tried some answers i found online, but since this dfs have a huge size, many need 50gb of RAM or so...

Already checked: Merge Dataframes Based on Partial Substrings Match, Join to Dataframes based on partial string matches in python, Join dataframes based on partial string-match between columns

Thanks for the help!

0

There are 0 best solutions below