I would like to fuzzy match two Pandas dataframes based on n parameter columns, all of which are present in both dataframes. The output should pair each filename with a unique matched filename. The parameter sets are not necessarily unique and may contain duplicates. A match should only count if the string similarity reaches a minimum threshold.

Dataframe 1:

    FILENAME       PARAMETER1    PARAMETER2    PARAMETER3    PARAMETER4    
1   xyz111.txt     3.82          Rock          No            red
2   abc222.txt     14.20         Tree          Yes           green
3   def999.txt     6.91          House         Yes           yellow
4   uvw567.txt     2.11          Car           No            green
5   asd222.txt     13.90         Ball          Yes           blue
...

Dataframe 2:

    FILENAME       PARAMETER1    PARAMETER2    PARAMETER3    PARAMETER4    
1   stv999.txt     12.17         Car           Yes           red
2   hij888.txt     5.64          Tree          No            red
3   klh123.txt     7.21          House         No            green
4   qrs543.txt     3.20          Car           No            green
5   lmn111.txt     17.17         House         Yes           yellow
...

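The preferred output pairs each filename from Dataframe 1 with at most one filename from Dataframe 2, so that no filename from Dataframe 2 is matched more than once (None where nothing reaches the threshold):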

Preferred output:

    FILENAME       FILENAME_MATCHED    
1   xyz111.txt     hij888.txt        
2   abc222.txt     stv999.txt       
3   def999.txt     klh123.txt      
4   uvw567.txt     None    
5   asd222.txt     lmn111.txt
...

I was able to do the fuzzy matching with libraries such as "FuzzyWuzzy" or "rapidfuzz" by combining the parameters into a new string column. However, I am having a hard time dealing with duplicates in the matched filenames: the parameter sets might not be unique, but the filenames are.
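
To make the uniqueness requirement concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes a greedy strategy: score every pair, then accept pairs from best to worst, skipping any row that is already taken. It uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer (a rapidfuzz scorer could be swapped in), and works on plain lists of dicts such as those returned by `df.to_dict("records")`. The names `combined_key` and `match_unique`, the threshold value, and the sample rows are all hypothetical.

```python
from difflib import SequenceMatcher

PARAMS = ["PARAMETER1", "PARAMETER2", "PARAMETER3", "PARAMETER4"]


def combined_key(row, params=PARAMS):
    """Join the parameter values of one row into a single string."""
    return " ".join(str(row[p]) for p in params)


def match_unique(rows1, rows2, threshold=0.6, params=PARAMS):
    """Greedy one-to-one matching (an assumed strategy, not an optimal assignment).

    Returns a dict mapping every FILENAME in rows1 to a FILENAME in rows2,
    or to None if no unused candidate reaches the threshold.
    """
    keys2 = [combined_key(r, params) for r in rows2]

    # Score every pair and keep only those at or above the threshold.
    scored = []
    for i, r1 in enumerate(rows1):
        key1 = combined_key(r1, params)
        for j, key2 in enumerate(keys2):
            score = SequenceMatcher(None, key1, key2).ratio()  # 0.0 .. 1.0
            if score >= threshold:
                scored.append((score, i, j))

    # Best scores first; ties are broken by row order.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    result = {r["FILENAME"]: None for r in rows1}
    used1, used2 = set(), set()
    for score, i, j in scored:
        if i in used1 or j in used2:
            continue  # one side of this pair is already matched
        used1.add(i)
        used2.add(j)
        result[rows1[i]["FILENAME"]] = rows2[j]["FILENAME"]
    return result


# Hypothetical data: two rows with identical parameter sets compete for x.txt.
rows1 = [
    {"FILENAME": "a.txt", "PARAMETER1": "2.11", "PARAMETER2": "Car",
     "PARAMETER3": "No", "PARAMETER4": "green"},
    {"FILENAME": "b.txt", "PARAMETER1": "2.11", "PARAMETER2": "Car",
     "PARAMETER3": "No", "PARAMETER4": "green"},
]
rows2 = [
    {"FILENAME": "x.txt", "PARAMETER1": "2.11", "PARAMETER2": "Car",
     "PARAMETER3": "No", "PARAMETER4": "green"},
    {"FILENAME": "y.txt", "PARAMETER1": "99", "PARAMETER2": "QQQ",
     "PARAMETER3": "WWW", "PARAMETER4": "ZZZZ"},
]

matches = match_unique(rows1, rows2, threshold=0.9)
print(matches)  # {'a.txt': 'x.txt', 'b.txt': None}
```

A greedy pass is simple but does not necessarily maximise the total similarity over all pairs. If that matters, the same score matrix could presumably be fed to an assignment solver instead.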
