I would like to fuzzy match two Pandas dataframes on n parameter columns, all of which are present in both dataframes. The output should pair each filename with exactly one unique matched filename. The parameter sets are not necessarily unique and may contain duplicates. A pair should only count as a match if its string similarity reaches a minimum threshold.
Dataframe 1:
   FILENAME    PARAMETER1  PARAMETER2  PARAMETER3  PARAMETER4
1  xyz111.txt  3.82        Rock        No          red
2  abc222.txt  14.20       Tree        Yes         green
3  def999.txt  6.91        House       Yes         yellow
4  uvw567.txt  2.11        Car         No          green
5  asd222.txt  13.90       Ball        Yes         blue
...
Dataframe 2:
   FILENAME    PARAMETER1  PARAMETER2  PARAMETER3  PARAMETER4
1  stv999.txt  12.17       Car         Yes         red
2  hij888.txt  5.64        Tree        No          red
3  klh123.txt  7.21        House       No          green
4  qrs543.txt  3.20        Car         No          green
5  lmn111.txt  17.17       House       Yes         yellow
...
The preferred output lists each filename from Dataframe 1 together with exactly one unique matched filename from Dataframe 2 (or None if there is no match), like this:
Preferred output:
   FILENAME    FILENAME_MATCHED
1  xyz111.txt  hij888.txt
2  abc222.txt  stv999.txt
3  def999.txt  klh123.txt
4  uvw567.txt  None
5  asd222.txt  lmn111.txt
...
I was able to do fuzzy matching with libraries such as "FuzzyWuzzy" or "rapidfuzz" by combining the parameters into a new string column. However, I am struggling to handle duplicates in the matched filenames: the parameter sets might not be unique, but the filenames are.
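To make the problem concrete, here is a minimal sketch of the kind of approach I have in mind, not a finished solution. It combines the parameter columns into one string per row, scores every pair of rows, and then greedily assigns the best-scoring pairs so that no filename is used twice. The helper names `similarity` and `match_unique`, the threshold of 60, and the use of `difflib.SequenceMatcher` as a stand-in scorer are all assumptions for illustration; a rapidfuzz scorer that returns 0-100 could probably be passed as `scorer` instead. PARAMETER1 is compared as text here, which is also an assumption.

```python
from difflib import SequenceMatcher
from itertools import product

import pandas as pd

PARAM_COLS = ["PARAMETER1", "PARAMETER2", "PARAMETER3", "PARAMETER4"]


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (hypothetical stand-in scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match_unique(df1, df2, cols=PARAM_COLS, threshold=60.0, scorer=similarity):
    """Greedily pair each FILENAME in df1 with at most one FILENAME in df2."""
    # Combine the parameter columns into one comparison string per row.
    keys1 = df1[cols].astype(str).apply(" ".join, axis=1).tolist()
    keys2 = df2[cols].astype(str).apply(" ".join, axis=1).tolist()

    # Score every pair of rows; keep only pairs at or above the threshold.
    scored = []
    for (i, k1), (j, k2) in product(enumerate(keys1), enumerate(keys2)):
        score = scorer(k1, k2)
        if score >= threshold:
            scored.append((score, i, j))

    # Best scores first; ties are broken by row position for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    # Accept a pair only if neither row has been matched already.
    used2 = set()
    matched = {}
    for _, i, j in scored:
        if i not in matched and j not in used2:
            matched[i] = j
            used2.add(j)

    files2 = df2["FILENAME"].tolist()
    return pd.DataFrame({
        "FILENAME": df1["FILENAME"].tolist(),
        "FILENAME_MATCHED": [
            files2[matched[i]] if i in matched else None
            for i in range(len(df1))
        ],
    })


# Sample rows from the question (the elided rows are left out).
df1 = pd.DataFrame({
    "FILENAME": ["xyz111.txt", "abc222.txt", "def999.txt", "uvw567.txt", "asd222.txt"],
    "PARAMETER1": [3.82, 14.20, 6.91, 2.11, 13.90],
    "PARAMETER2": ["Rock", "Tree", "House", "Car", "Ball"],
    "PARAMETER3": ["No", "Yes", "Yes", "No", "Yes"],
    "PARAMETER4": ["red", "green", "yellow", "green", "blue"],
})
df2 = pd.DataFrame({
    "FILENAME": ["stv999.txt", "hij888.txt", "klh123.txt", "qrs543.txt", "lmn111.txt"],
    "PARAMETER1": [12.17, 5.64, 7.21, 3.20, 17.17],
    "PARAMETER2": ["Car", "Tree", "House", "Car", "House"],
    "PARAMETER3": ["Yes", "No", "No", "No", "Yes"],
    "PARAMETER4": ["red", "red", "green", "green", "yellow"],
})

result = match_unique(df1, df2)
print(result)
```

The greedy pass guarantees uniqueness, but it does not necessarily maximise the total similarity across all pairs, and scoring every pair grows quadratically with the number of rows. I would be happy with either this kind of greedy matching or a globally optimal assignment.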