I'm trying to use some code that runs the Jaro-Winkler function to compare the similarity of two strings. If I just hard-code two values, john and jon, the logic below works without problems. However, what I want is to read a csv file and compare all of the names. When I try that, I get:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# Python3 implementation of above approach
from math import floor
import pandas as pd

# Function to calculate the
# Jaro Similarity of two strings
def jaro_distance(s1, s2):
    # If the strings are equal
    if (s1 == s2):
        return 1.0;
    # Length of two strings
    len1 = len(s1);
    len2 = len(s2);
    if (len1 == 0 or len2 == 0):
        return 0.0;
    # Maximum distance upto which matching
    # is allowed
    max_dist = (max(len(s1), len(s2)) // 2) - 1;
    # Count of matches
    match = 0;
    # Hash for matches
    hash_s1 = [0] * len(s1);
    hash_s2 = [0] * len(s2);
    # Traverse through the first string
    for i in range(len1):
        # Check if there is any matches
        for j in range(max(0, i - max_dist),
                       min(len2, i + max_dist + 1)):
            # If there is a match
            if (s1[i] == s2[j] and hash_s2[j] == 0):
                hash_s1[i] = 1;
                hash_s2[j] = 1;
                match += 1;
                break;
    # If there is no match
    if (match == 0):
        return 0.0;
    # Number of transpositions
    t = 0;
    point = 0;
    # Count number of occurrences
    # where two characters match but
    # there is a third matched character
    # in between the indices
    for i in range(len1):
        if (hash_s1[i]):
            # Find the next matched character
            # in second string
            while (hash_s2[point] == 0):
                point += 1;
            if (s1[i] != s2[point]):
                point += 1;
                t += 1;
            else:
                point += 1;
    t /= 2;
    # Return the Jaro Similarity
    return ((match / len1 + match / len2 +
             (match - t) / match) / 3.0);

# Jaro Winkler Similarity
def jaro_Winkler(s1, s2):
    jaro_dist = jaro_distance(s1, s2);
    # If the jaro Similarity is above a threshold
    if (jaro_dist > 0.7):
        # Find the length of common prefix
        prefix = 0;
        for i in range(min(len(s1), len(s2))):
            # If the characters match
            if (s1[i] == s2[i]):
                prefix += 1;
            # Else break
            else:
                break;
        # Maximum of 4 characters are allowed in prefix
        prefix = min(4, prefix);
        # Calculate jaro winkler Similarity
        jaro_dist += 0.1 * prefix * (1 - jaro_dist);
    return jaro_dist;

# Driver code
if __name__ == "__main__":
    df = pd.read_csv('names.csv')
    # s1 = 'john'  -- this works
    # s2 = 'jon'   -- this works
    s1 = df['name1']  # this doesn't. csv contains header row name1, name2, and a few rows in each
    s2 = df['name2']  # this doesn't
    print("Jaro-Winkler Similarity =", jaro_Winkler(s1, s2));
Traceback (most recent call last):
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 113, in <module>
    print("Jaro-Winkler Similarity =", jaro_Winkler(s1, s2));
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 80, in jaro_Winkler
    jaro_dist = jaro_distance(s1, s2);
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 9, in jaro_distance
    if (s1 == s2):
  File "C:\Users\john\PycharmProjects\heatMap\venv\lib\site-packages\pandas\core\generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Process finished with exit code 1
Sample from the csv: [image]
The main problem you're facing is not with your function, but with the logic of the comparison.

Let's say we want to evaluate whether a statement is true or false, for example by comparing two numbers. With single values it's easy: we just compare them and that's it (1 == 1, 1 != 2, ...). But what if we want to compare a list of values to a single value, for example [1, 2, 3, 4] to 1? In our minds it's easy: we compare each number in turn, so 1 == 1, 1 != 2, and so on. But if we ask whether the list as a whole is equal to 1, we hit a problem, because the answer is true for some elements and false for others.

This is the main reason you're getting that error. df['name1'] and df['name2'] are pandas Series, so s1 == s2 produces a Series of booleans (one per row), and the if statement cannot reduce that to a single True or False. The error message suggests:

a.empty, a.bool(), a.item(), a.any() or a.all()

These all tell the code how to reduce the Series to a single value: checking whether the Series is empty, taking the boolean or the value of a single-element Series, checking whether any value is true, or checking whether all values are true (respectively).
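As a rough illustration of the point above, here is a minimal sketch that reproduces the error on two small Series and then reduces the comparison with .any() and .all(). The names and values are made up for the example and stand in for the name1 and name2 columns.

```python
import pandas as pd

# Hypothetical data standing in for the name1 / name2 columns of names.csv
s1 = pd.Series(["john", "mary"])
s2 = pd.Series(["jon", "mary"])

# Comparing two Series is element-wise: the result is a Series of booleans
equal = s1 == s2
print(equal.tolist())  # [False, True]

# `if` needs one True/False, so using the Series directly raises ValueError
try:
    if equal:
        pass
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Reducing the Series to a single boolean makes the intent explicit
any_equal = bool(equal.any())  # True if at least one pair matches
all_equal = bool(equal.all())  # True only if every pair matches
print(any_equal, all_equal)
```

Note that neither reduction gives a per-row similarity score; they only make the truth test unambiguous.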
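Another option is to use the .apply() method, as suggested by @Nick Odell in a comment on this post. Called on the dataframe with axis=1, it applies a function to every row, so the function receives one pair of strings at a time and the comparison inside it yields a single True or False. That should solve the problem, since the truth value is checked row by row.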
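A minimal sketch of the row-by-row approach might look like the following. To keep it short and self-contained, it uses a stand-in scorer built on difflib.SequenceMatcher (an assumption for illustration, not Jaro-Winkler) and an in-memory dataframe with made-up names instead of names.csv; in the real script you would pass jaro_Winkler from the question in place of the hypothetical similarity function.

```python
import pandas as pd
from difflib import SequenceMatcher


def similarity(s1, s2):
    # Stand-in scorer for illustration only; swap in jaro_Winkler here.
    return SequenceMatcher(None, s1, s2).ratio()


# Hypothetical data in place of pd.read_csv('names.csv')
df = pd.DataFrame({"name1": ["john", "mary"], "name2": ["jon", "mary"]})

# axis=1 hands each row to the lambda, so the scorer gets two plain strings
df["score"] = df.apply(
    lambda row: similarity(row["name1"], row["name2"]), axis=1
)

# Same result without apply: walk the two columns in parallel
scores = [similarity(a, b) for a, b in zip(df["name1"], df["name2"])]

print(df)
```

Either form calls the scorer once per pair of names, so the s1 == s2 check inside it compares two strings rather than two Series. If the csv can contain missing values, those rows would need to be dropped or filled first, since the scorer expects strings.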