Pandas Dataframe - Ambiguous

Asked by kicksixwde on 17 December 2021 at 18:51 · 260 views

I'm trying to use some code that runs the Jaro-Winkler function to compare the similarity of two strings. If I just hard-code two values, john and jon, I get no problems using the logic below. However, what I want is to read a csv file and compare all of the names. When I try that, I get:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

# Python3 implementation of above approach
from math import floor
import pandas as pd

# Function to calculate the
# Jaro Similarity of two strings
def jaro_distance(s1, s2):
    # If the strings are equal
    if (s1 == s2):
        return 1.0;

    # Length of two strings
    len1 = len(s1);
    len2 = len(s2);

    if (len1 == 0 or len2 == 0):
        return 0.0;

    # Maximum distance upto which matching
    # is allowed
    max_dist = (max(len(s1), len(s2)) // 2) - 1;

    # Count of matches
    match = 0;

    # Hash for matches
    hash_s1 = [0] * len(s1);
    hash_s2 = [0] * len(s2);

    # Traverse through the first string
    for i in range(len1):

        # Check if there is any matches
        for j in range(max(0, i - max_dist),
                       min(len2, i + max_dist + 1)):

            # If there is a match
            if (s1[i] == s2[j] and hash_s2[j] == 0):
                hash_s1[i] = 1;
                hash_s2[j] = 1;
                match += 1;
                break;

    # If there is no match
    if (match == 0):
        return 0.0;

    # Number of transpositions
    t = 0;

    point = 0;

    # Count number of occurrences
    # where two characters match but
    # there is a third matched character
    # in between the indices
    for i in range(len1):
        if (hash_s1[i]):

            # Find the next matched character
            # in second string
            while (hash_s2[point] == 0):
                point += 1;

            if (s1[i] != s2[point]):
                point += 1;
                t += 1;
            else:
                point += 1;

    # Halve the count once, after the loop has finished
    t /= 2;

    # Return the Jaro Similarity
    return ((match / len1 + match / len2 +
             (match - t) / match) / 3.0);


# Jaro Winkler Similarity
def jaro_Winkler(s1, s2):
    jaro_dist = jaro_distance(s1, s2);

    # If the jaro Similarity is above a threshold
    if (jaro_dist > 0.7):

        # Find the length of common prefix
        prefix = 0;

        for i in range(min(len(s1), len(s2))):

            # If the characters match
            if (s1[i] == s2[i]):
                prefix += 1;

            # Else break
            else:
                break;

        # Maximum of 4 characters are allowed in prefix
        prefix = min(4, prefix);

        # Calculate jaro winkler Similarity
        jaro_dist += 0.1 * prefix * (1 - jaro_dist);

    return jaro_dist;


# Driver code
if __name__ == "__main__":
    df = pd.read_csv('names.csv')
    # s1 = 'john'  -- this works
    # s2 = 'jon'   -- this works
    s1 = df['name1']  # this doesn't; the csv has a header row (name1, name2) and a few rows under each
    s2 = df['name2']  # this doesn't

    print("Jaro-Winkler Similarity =", jaro_Winkler(s1, s2));

Traceback (most recent call last):
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 113, in <module>
    print("Jaro-Winkler Similarity =", jaro_Winkler(s1, s2));
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 80, in jaro_Winkler
    jaro_dist = jaro_distance(s1, s2);
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 9, in jaro_distance
    if (s1 == s2):
  File "C:\Users\john\PycharmProjects\heatMap\venv\lib\site-packages\pandas\core\generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Process finished with exit code 1

Sample from csv: (image not included)

**David Siret Marqués** · Answer 1 · 2023-10-04T12:49:34.020000

The main problem isn't your function but how it is called: it expects two strings, and you are passing it two whole columns.

Say we want to evaluate whether a statement is true or false, for example by comparing two numbers. With single values it's easy: we compare them and get one answer (1 == 1 is true, 1 == 2 is false).

But what if we compare a list of values to a single value, for example [1, 2, 3, 4] to 1?

Element by element it's still easy: 1 == 1 is true, 2 == 1 is false, and so on. But if we ask whether the list as a whole is equal to 1, there is no single answer, because some elements are equal and others are not.
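The same situation can be reproduced in a few lines. This is only a minimal sketch: the two Series below are made-up stand-ins for the name1 and name2 columns of the csv, and error_message is just an illustrative variable name.

```python
import pandas as pd

# Made-up stand-ins for df['name1'] and df['name2']
s1 = pd.Series(["john", "mary", "alex"])
s2 = pd.Series(["jon", "mary", "alexa"])

# Comparing two Series gives one boolean per row, not a single True/False
mask = s1 == s2
print(mask.tolist())  # [False, True, False]

# An if statement needs exactly one boolean, so pandas refuses to guess
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```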

This is why you get the error. With two pandas Series, s1 == s2 compares element by element and returns a Series of booleans, and the if statement cannot reduce that to a single True or False. The error message suggests:

Use a.empty, a.bool(), a.item(), a.any() or a.all().

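These are all ways to reduce a Series to a single value: a.empty is True when the Series has no elements, a.bool() returns the boolean of a Series holding exactly one boolean (it is deprecated in recent pandas releases), a.item() returns the only element of a one-element Series, a.any() is True if at least one element is true, and a.all() is True only if every element is. None of them is really what you want here, though, since you want one score per pair of names rather than a single answer for the whole column.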
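As a rough illustration of those reductions, here is a small sketch on a hand-written boolean Series (the variable names are illustrative only). a.bool() is left out because it is deprecated in newer pandas versions.

```python
import pandas as pd

# A boolean Series like the one produced by s1 == s2
mask = pd.Series([False, True, False])

any_true = bool(mask.any())   # at least one element is True
all_true = bool(mask.all())   # every element is True
is_empty = mask.empty         # the Series has no elements at all

# .item() only works on a Series with exactly one element
single = pd.Series([True]).item()

print(any_true, all_true, is_empty, single)  # True False False True
```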

Another option is the .apply() method, as suggested by @Nick Odell in a comment on this post. With axis=1 it calls a function once per row of the DataFrame, so your function receives one pair of strings at a time and the comparison inside it yields a single True or False.
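A minimal sketch of the row-by-row approach follows. To keep it short and self-contained it makes two assumptions: the DataFrame is built in memory instead of being read from names.csv, and similarity() is a hypothetical stand-in scorer based on difflib from the standard library. In your script you would pass jaro_Winkler in its place.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for jaro_Winkler: takes two strings, returns one float."""
    return SequenceMatcher(None, a, b).ratio()


# In-memory stand-in for pd.read_csv('names.csv')
df = pd.DataFrame({"name1": ["john", "mary"], "name2": ["jon", "mary"]})

# axis=1 passes one row at a time, so the scorer only ever sees two strings
df["score"] = df.apply(lambda row: similarity(row["name1"], row["name2"]), axis=1)

# Equivalent plain-Python loop over the two columns
scores = [similarity(a, b) for a, b in zip(df["name1"], df["name2"])]

print(df)
```

Either form gives one score per row. The zip version avoids building a row object for every call, which can matter on larger files.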
