Can a Python dictionary comprehension be used to create a dictionary of substrings and their locations?


Given a string of characters, I want to create a dictionary of all the n-character substrings contained in the string, where the dictionary key is the substring and the value is a list. The first element of the list is the number of occurrences of the substring, and the second element of the list is a list of starting locations for those occurrences.

For example, with n=3, the string 'abcdabcxdabc' results in this dictionary:

{'abc': [3, [0, 4, 9]], 
 'cda': [1, [2]], 
 'dab': [2, [3, 8]], 
 'bcd': [1, [1]], 
 'cxd': [1, [6]], 
 'bcx': [1, [5]], 
 'xda': [1, [7]]}

The code below works and is efficient, since it only walks through the string once, but I'm wondering if there is a more elegant and/or more Pythonic way to do this, maybe using a dictionary comprehension. I'm pretty new to Python and still trying to figure out when it makes sense (or is even possible) to use comprehensions.

text = 'abcdabcxdabc'
n = 3
d = {}
for i in range(len(text) - n + 1):
    sub = text[i:i + n]
    if sub in d:
        d[sub][0] += 1
        d[sub][1].append(i)
    else:
        d[sub] = [1, [i]]
print(d)    

Update: Thanks for all the replies. They generally confirm my suspicion that this is too complex to be efficiently implemented in a single comprehension (but thanks to volcano for showing that it is possible if efficiency isn't a concern). Thanks to RemcoGerlich and Ignacio Vazquez-Abrams for pointing me to defaultdict. I'll have to dig into that more. My timing results suggest that there's a little more overhead in initializing a defaultdict compared to a dict, but that the running time might be slightly faster, at least for this example. (Timing results posted in a separate comment below.) Now I'm wondering if there are any situations where dict would be preferred over defaultdict. Also, thanks to Narcolei for pointing me to the timeit functions.
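
As for when a plain dict might still be preferable: one thing I've since noticed is that a defaultdict inserts a default entry on any lookup of a missing key, so even read-only lookups can silently grow the dictionary. A small illustration (my own snippet, based on defaultdict's documented behavior):

from collections import defaultdict

d = defaultdict(list)
d['abc'].append(0)

# merely reading a missing key creates it, which a plain dict never does
_ = d['xyz']       # no KeyError; d['xyz'] is now []
print(len(d))      # 2, even though only 'abc' was ever populated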

There are 5 answers below.

Accepted answer:

The problem is that v[0] depends on the length of v[1], which means that either the operation generating v[1] would have to run twice, or the dictionary would have to be iterated over afterwards to fill in v[0], replacing a dummy value inserted the first time.

Another problem is that dict comprehensions expect the entire key and value to be available immediately, which means that you would have to run a list comprehension to collect all the indexes of each substring, making the entire operation O(n²).
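
Concretely, such a comprehension would look something like this (a sketch to make the cost visible, not a recommendation; note that the substring scan runs twice per key, once for the count and once for the index list):

text = 'abcdabcxdabc'
n = 3

# every distinct substring forces a full rescan of text -- twice
d = {sub: [len([j for j in range(len(text) - n + 1) if text[j:j + n] == sub]),
           [j for j in range(len(text) - n + 1) if text[j:j + n] == sub]]
     for sub in set(text[i:i + n] for i in range(len(text) - n + 1))}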

The only optimization I would make would be to replace the creation of d so that you don't need to check for key containment.

import collections
d = collections.defaultdict(lambda: [0, []])  # missing keys start out as [0, []]
Answer:

As @Ignacio said, any comprehension that tries to solve this problem is going to have quadratic runtime, as seen in @volcano's answer. The cleanest way to solve the problem is something like this:

def substrings(text, n):
    d = collections.defaultdict(list)
    for i in range(len(text)-n+1):
        d[text[i:i+n]].append(i)
    return d

Note that you don't need to store the number of substrings, since you can just do len(d['abc']) to get the number of occurrences of abc.
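
If you still want the asker's exact [count, positions] layout at the end, it can be derived in one extra pass over the dictionary (a sketch of mine, not part of the original answer):

d = substrings('abcdabcxdabc', 3)  # the function defined above
result = {sub: [len(positions), positions] for sub, positions in d.items()}
# result == {'abc': [3, [0, 4, 9]], 'bcd': [1, [1]], 'cda': [1, [2]], ...}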

Here are some timings of this approach versus the comprehension:

>>> import collections
>>>
>>> def substrings(text, n):
...     d = collections.defaultdict(list)
...     for i in range(len(text)-n+1):
...         d[text[i:i+n]].append(i)
...     return d
...
>>> def substrings2(text, n):
...     return {substr: [text.replace(substr, ' '*n, c).index(substr)
...                      for c in range(text.count(substr))]
...             for substr in set(text[idx:idx+n] for idx in range(len(text)-n+1))}
...
>>> text = 'abcdabcxdabc'
>>> 
>>> %timeit substrings(text, 3)
100000 loops, best of 3: 9.51 µs per loop
>>> %timeit substrings2(text, 3)
10000 loops, best of 3: 26.3 µs per loop
>>> text = text * 100
>>> %timeit substrings(text, 3)
1000 loops, best of 3: 440 µs per loop
>>> %timeit substrings2(text, 3)
100 loops, best of 3: 8.68 ms per loop

Notice how the runtime of the comprehension grows by roughly 330x for a 100x increase in input size, while the defaultdict version grows by only about 46x.

Answer:

It is scary, but yes, it can be done. (I store just the offsets; the number of occurrences can be recovered from the length of each offset list.)

In [83]: my_str = 'abcdabcxdabc'

In [84]: n=3

In [85]: {substr: [my_str.replace(substr, ' '*n, c).index(substr)
                   for c in range(my_str.count(substr))]
   ....: for substr in set(my_str[idx:idx+n] for idx in range(len(my_str)-n+1))}
Out[85]: 
{'abc': [0, 4, 9],
 'bcd': [1],
 'bcx': [5],
 'cda': [2],
 'cxd': [6],
 'dab': [3, 8],
 'xda': [7]}
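
What makes this work is that str.replace takes an optional count: blanking out the first c occurrences makes index land on occurrence c+1. A short demonstration (my annotation, not part of the original answer):

s = 'abcdabcxdabc'
s.replace('abc', '   ', 0).index('abc')   # 0 -- nothing blanked, finds the 1st
s.replace('abc', '   ', 1).index('abc')   # 4 -- 1st blanked, finds the 2nd
s.replace('abc', '   ', 2).index('abc')   # 9 -- first two blanked, finds the 3rd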
Answer:

I implemented a couple of variations using defaultdict and parts of volcano's comprehension and ran some timing tests.

Original version (test1):

d1 = {}
for i in range(len(text) - n + 1):
    sub = text[i:i + n]
    if sub in d1:
        d1[sub][0] += 1
        d1[sub][1].append(i)
    else:
        d1[sub] = [1, [i]]

First variation (test2):

from collections import defaultdict

d = defaultdict(lambda: [0, []])
for i in range(len(text) - n + 1):
    sub = text[i:i + n]
    d[sub][0] += 1
    d[sub][1].append(i)       

Second variation (test3):

d = defaultdict(lambda: [0, []])
for i, sub in [(i, text[i:i + n]) for i in range(len(text) - n + 1)]:
    d[sub][0] += 1
    d[sub][1].append(i)       

Third variation (test4):

d = {sub: [text.replace(sub, ' '*n, c).index(sub) for c in range(text.count(sub))]
     for sub in set(text[i:i + n] for i in range(len(text) - n + 1))}

Here are the timing results (execution time per loop, in seconds):

text = "abcdabcxdabc":
10000 loops, best of 3, function test1: 7.37486786334e-06
10000 loops, best of 3, function test2: 1.02725863892e-05
10000 loops, best of 3, function test3: 1.16522984082e-05
10000 loops, best of 3, function test4: 1.98546753287e-05

text = "abcdabcxdabc"*10:
10000 loops, best of 3, function test1: 7.16965834334e-05
10000 loops, best of 3, function test2: 6.8394193171e-05
10000 loops, best of 3, function test3: 7.63521044367e-05
10000 loops, best of 3, function test4: 0.00016625460554

text = "abcdabcxdabc"*100:
1000 loops, best of 3, function test1: 0.000708709217238
1000 loops, best of 3, function test2: 0.000623426932274
1000 loops, best of 3, function test3: 0.000695915822531
1000 loops, best of 3, function test4: 0.00419154787196

text = "abcdabcxdabc"*1000:
1000 loops, best of 3, function test1: 0.00700270379835
1000 loops, best of 3, function test2: 0.00615744327104
1000 loops, best of 3, function test3: 0.00712429980398
1000 loops, best of 3, function test4: 0.296075626815

The original version and the first two variations appear to be O(n), while the third variation (test4) is closer to O(n²). I guess I'll go with the second variation (test3), since it's the most compact of the O(n) versions.
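
For reference, a timeit harness along these lines reproduces the numbers above; this is a sketch that assumes each variant has been wrapped in a function test1(text, n) through test4(text, n) (my actual harness isn't shown here):

import timeit

def time_variants(text, n=3, loops=1000):
    # assumes test1..test4 are defined as functions taking (text, n)
    for func in (test1, test2, test3, test4):
        # best of 3 repeats, normalized to seconds per loop
        best = min(timeit.repeat(lambda: func(text, n), repeat=3, number=loops))
        print('%d loops, best of 3, function %s: %g' % (loops, func.__name__, best / loops))

time_variants('abcdabcxdabc' * 100)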

Answer:

For the record, another one-liner:

>>> n, s = 3, 'abcdabcxdabc'
>>> L=[(s[i:i+n], i) for i in range(len(s)-n+1)]
>>> L
[('abc', 0), ('bcd', 1), ('cda', 2), ('dab', 3), ('abc', 4), ('bcx', 5), ('cxd', 6), ('xda', 7), ('dab', 8), ('abc', 9)]
>>> d={t:[i for u, i in L if u == t] for t, _ in L}
>>> d
{'abc': [0, 4, 9], 'bcd': [1], 'cda': [2], 'dab': [3, 8], 'bcx': [5], 'cxd': [6], 'xda': [7]}
>>> {k:(len(v), v) for k, v in d.items()}
{'abc': (3, [0, 4, 9]), 'bcd': (1, [1]), 'cda': (1, [2]), 'dab': (2, [3, 8]), 'bcx': (1, [5]), 'cxd': (1, [6]), 'xda': (1, [7])}

In one line:

>>> {k:(len(v), v) for L in ([(s[i:i+n], i) for i in range(len(s)-n+1)],) for k, v in ((t, [i for u, i in L if u == t]) for t, _ in L)}
{'abc': (3, [0, 4, 9]), 'bcd': (1, [1]), 'cda': (1, [2]), 'dab': (2, [3, 8]), 'bcx': (1, [5]), 'cxd': (1, [6]), 'xda': (1, [7])}

How I would do it in the "real world":

>>> def substrings(s, n):
...     d = {}
...     tis = ((s[i:i+n], i) for i in range(len(s)-n+1))
...     for t, i in tis:
...         d.setdefault(t, []).append(i)
...     return {k:(len(v), v) for k, v in d.items()}
... 
>>> substrings(s, n)
{'abc': (3, [0, 4, 9]), 'bcd': (1, [1]), 'cda': (1, [2]), 'dab': (2, [3, 8]), 'bcx': (1, [5]), 'cxd': (1, [6]), 'xda': (1, [7])}

The "real world" version differs from the one-liner on one point: the dict is built in O(n) vs O(n^2)