I am having some difficulty finding matching texts in the dataset below (note that Sim is my current output, generated by running the code below; it shows the wrong match).
         ID          Text                                                  Sim
    13   fsad        amazing ...                                           fsd
    14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...    fdsfdgte3e
    18   gsd         wonderful                                             fast
    21   dfsfs       i love this its incredible ...                        reds
    23   gwe         wonderful end ever seen you ...                       add
    ...  ...         ...                                                   ...
    261  add         wonderful                                             gwe
    261  add         wonderful                                             gsd
    261  add         wonderful                                             fdsdf
    267  fdsfdgte3e  best match ever its a masterpiece                     fdsdf
    277  hgdfgre     terrible destroys everything ...                      tm28
As shown above, Sim does not give the ID of the user who wrote the matching text. For example, add should match with gsd and vice versa. But my output says that add matches with gwe, and this is not true.
The code I am using is the following:
    from fuzzywuzzy import fuzz

    def sim (nm, df): # this function finds matches between texts based on a threshold, which is 100. The logic is fuzzywuzzy, specifically partial ratio. The output should be IDs whether texts match, based on the threshold.
        matches = dataset.apply(lambda row: ((fuzz.partial_ratio(row['Text'], nm)) = 100), axis=1)
        return [df.ID[i] for i, x in enumerate(matches) if x]

    df['L_Text']=df['Text'].str.lower()
    df['Sim']=df.apply(lambda row: sim(row['L_Text'], df), axis=1)
    df=df.assign(
        Sim = df.apply(lambda x: [s for s in x['Sim'] if s != x['ID']], axis=1)
    )

    def tr (row): # this function assign a similarity score for each text applying partial_ratio similarity
        return (df.loc[:row.name-1, 'L_Text']
                .apply(lambda name: fuzz.partial_ratio(name, row['L_Text'])))

    t = (df.loc[1:].apply(tr, axis=1)
         .reindex(index=df.index,
                  columns=df.index)
         .fillna(0)
         .add_prefix('txt')
         )
    t += t.to_numpy().T + np.diag(np.ones(t.shape[0]))
Could you please help me understand the error in my code? Unfortunately I cannot see it.
My expected output would be as follows:
         ID          Text                                                  Sim
    13   fsad        amazing ...
    14   fdsdf       best sport everand the gane of the year❤️❤️❤️❤️...
    18   gsd         wonderful                                             add
    21   dfsfs       i love this its incredible ...
    23   gwe         wonderful end ever seen you ...
    ...  ...         ...                                                   ...
    261  add         wonderful                                             gsd
    261  add         wonderful                                             gsd
    261  add         wonderful                                             gsd
    267  fdsfdgte3e  best match ever its a masterpiece
    277  hgdfgre     terrible destroys everything ...
since a perfect match (=1) is required in the sim function.
Initial assumption
First off, as your question was not a hundred percent clear to me, I assume that you would like a pairwise comparison of all rows and, if the score of a match is 100, you would like to add the key of the matching row. If this is not the case, please correct me.
Syntactic problems
So there are multiple problems with your code above. First, if one just copies and pastes it, it cannot run as written. In the sim() function, use df instead of dataset, and == instead of =. The redundant parentheses can also be removed for better readability.
Semantic problems
If I then run your code and print t (which does not seem to be the end result), the output seems correct to me, as fuzz.partial_ratio("wonderful end ever seen you", "wonderful") returns 100 (a partial match already counts as a score of 100). For consistency, you could put 100 rather than 1 on the diagonal of t, as all elements should perfectly match themselves. So when you said that the match between add and gwe is not true, it actually is true in the sense of fuzz.partial_ratio(). If you only want whole-text matches, you might want to consider using fuzz.ratio() instead. Also, there might be an error when converting t to the new Sim column, but there seems to be no code for that in the provided example.
Alternative implementation
Also, as some comments suggested, it sometimes helps to restructure your code so that it is easier for people to help you, for example as a short, self-contained script with sample data.
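As a rough illustration of such a restructuring, here is a minimal sketch of the pairwise comparison with the scorer passed in as a parameter, so the effect of partial versus whole-text matching is easy to see. The names find_matches, full_ratio and substring_ratio are hypothetical, and the two scorers are simplified difflib-based stand-ins (not the fuzzywuzzy implementations) so that the snippet runs without third-party packages. Assuming fuzzywuzzy is installed, fuzz.ratio or fuzz.partial_ratio could be passed as the scorer instead.

```python
from difflib import SequenceMatcher


def full_ratio(a, b):
    """Whole-string similarity on a 0-100 scale (simplified stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def substring_ratio(a, b):
    """Crude stand-in for partial matching: 100 if the shorter string
    occurs inside the longer one, otherwise the whole-string score."""
    shorter, longer = sorted((a, b), key=len)
    return 100 if shorter in longer else full_ratio(a, b)


def find_matches(rows, scorer, threshold=100):
    """Map each ID to the sorted IDs of other users whose text scores >= threshold."""
    result = {}
    for user_id, text in rows:
        others = set()
        for other_id, other_text in rows:
            if other_id == user_id:
                continue  # never report a user as matching themselves
            if scorer(text.lower(), other_text.lower()) >= threshold:
                others.add(other_id)
        result[user_id] = sorted(others)
    return result


# Sample (ID, Text) pairs based on the rows of the question that are not elided.
rows = [
    ("gsd", "wonderful"),
    ("gwe", "wonderful end ever seen you"),
    ("add", "wonderful"),
    ("fdsfdgte3e", "best match ever its a masterpiece"),
]

exact = find_matches(rows, full_ratio)
partial = find_matches(rows, substring_ratio)

print(exact["add"])    # ['gsd']
print(partial["add"])  # ['gsd', 'gwe']
```

With the whole-text scorer, add matches only gsd, which is the expected output in the question. With the partial scorer, gwe is reported as well, because "wonderful" is fully contained in the longer text.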