Apply function to pandas series given varying arguments

Question

Apply function to pandas series given varying arguments

103 Views Asked by Cappo At 24 October 2019 at 11:26

Initial question

I want to calculate the Levenshtein distance between multiple strings, one in a series, the other in a list. I tried my hands on map, zip, etc., but I only got the desired result using a for loop and apply. Is there a way to improve style and especially speed?

Here is what I tried and it does what it is supposed to do, but lacks of speed given a large series.

import stringdist

strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: stringdist.levenshtein(x, w))

## Result: ##
        me  mine  Friend
Hello    4     5       6
my       1     3       6
Friend   5     4       0
I        2     4       6
am       2     4       6

Solution

Thanks to @Dames and @molybdenum42, I can provide the solution I used, directly beneath the question. For more insights, please check their great answers below.

import stringdist
from itertools import product

strings = ['Hello', 'my', 'Friend', 'I', 'am']
s = pd.Series(data=strings, index=strings)
c = ['me', 'mine', 'Friend']

word_combinations = np.array(list(product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:, 0],       
word_combinations[:, 1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)

This results in the desired data frame.

Original Q&A

There are 2 best solutions below

molybdenum42 On 24 October 2019 at 12:46

Using itertools, you can at least get all the required combinations. Using a vectorized version of stringcount.levenshtein (made using numpy.vectorize()) you can then get your desired result without looping at all, though I haven't tested the performance of the vectorized levenshtein function.

The code could look something like this:

import stringdist
import numpy as np
import pandas as pd
import itertools

s = pd.Series(["Hello", "my","Friend"])
c = ['me', 'mine', 'Friend']

word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])

At this point you have the results in a numpy array, each corresponding to one of all the possible combinations of your two intial arrays. If you want to get it into the shape you have in your example, there's some pandas trickery to be done:

df = pd.DataFrame([word_combinations[:,0], word_combinations[:,1], result]).T

### initially looks like: ###
#         0       1  2
# 0   Hello      me  4
# 1   Hello    mine  5
# 2   Hello  Friend  6
# 3      my      me  1
# 4      my    mine  3
# 5      my  Friend  6
# 6  Friend      me  5
# 7  Friend    mine  4
# 8  Friend  Friend  0

df = df.set_index([0,1])[2].unstack()

### Now looks like: ###
#        Friend Hello my
# Friend      0     6  6
# me          5     4  1
# mine        4     5  3

Again, I haven't tested the performance of this method, so I recommend checking that out - it should be faster than iteration though.

EDIT: User @Dames has a better suggestion for making the result all pretty-like:

result = result.reshape(len(c), len(s))
df = pd.DataFrame(result, columns=c, index=s)

**Dames** · Accepted Answer · 2019-10-24T12:35:47.473000

Setup:

import stringdist
import pandas as pd
import numpy as np
import itertools

s = pd.Series(data=['Hello', 'my', 'Friend'],
              index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']

Options

option: an easy one-liner

df = pd.DataFrame([s.apply(lambda x: stringdist.levenshtein(x, w)) for w in c])

option: np.fromfunction (thanks to @baccandr)

@np.vectorize
def lavdist(a, b):
    return stringdist.levenshtein(c[a], s[b])

df = pd.DataFrame(np.fromfunction(lavdist, (len(c), len(s)), dtype = int), 
                  columns=c, index=s)

option: see @molybdenum42

word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(stringdist.levenshtein)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,1], word_combinations[:,1], result])
df = df.set_index([0,1])[2].unstack()

(the best) option: modified option 3

word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)

Performance testing:

import timeit
from Levenshtein import distance
import pandas as pd
import numpy as np
import itertools

s = pd.Series(data=['Hello', 'my', 'Friend'],
              index=['Hello', 'my', 'Friend'])
c = ['me', 'mine', 'Friend']

test_code0 = """
df = pd.DataFrame()
for w in c:
    df[w] = s.apply(lambda x: distance(x, w))
"""

test_code1 = """
df = pd.DataFrame({w:s.apply(lambda x: distance(x, w)) for w in c})
"""

test_code2 = """
@np.vectorize
def lavdist(a, b):
    return distance(c[a], s[b])

df = pd.DataFrame(np.fromfunction(lavdist, (len(c), len(s)), dtype = int), 
                  columns=c, index=s)
"""

test_code3 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
df = pd.DataFrame([word_combinations[:,1], word_combinations[:,1], result])
df = df.set_index([0,1])[2] #.unstack() produces error
"""

test_code4 = """
word_combinations = np.array(list(itertools.product(s.values, c)))
vectorized_levenshtein = np.vectorize(distance)
result = vectorized_levenshtein(word_combinations[:,0], word_combinations[:,1])
result = result.reshape((len(s), len(c)))
df = pd.DataFrame(result, columns=c, index=s)
"""

test_setup = "from __main__ import distance, s, c, pd, np, itertools"

print("test0", timeit.timeit(test_code0, number = 1000, setup = test_setup))
print("test1", timeit.timeit(test_code1, number = 1000, setup = test_setup))
print("test2", timeit.timeit(test_code2, number = 1000, setup = test_setup))
print("test3", timeit.timeit(test_code3, number = 1000, setup = test_setup))
print("test4", timeit.timeit(test_code4, number = 1000, setup = test_setup))

Results

# results
# test0 1.3671939949999796
# test1 0.5982696900009614
# test2 0.3246431229999871
# test3 2.0100400850005826
# test4 0.23796007100099814

Apply function to pandas series given varying arguments

Initial question

Solution

There are 2 best solutions below

Setup:

Options

Performance testing:

Results

Related Questions in PYTHON-3.X

Related Questions in PANDAS

Related Questions in APPLY

Related Questions in MULTIPLE-ARGUMENTS

Trending Questions

Popular # Hahtags

Popular Questions