Given a string of characters, I want to create a dictionary of all the n-character substrings contained in the string, where the dictionary key is the substring and the value is a list. The first element of the list is the number of occurrences of the substring, and the second element of the list is a list of starting locations for those occurrences.
For example, with n=3
, the string 'abcdabcxdabc'
results in this dictionary:
{'abc': [3, [0, 4, 9]],
'cda': [1, [2]],
'dab': [2, [3, 8]],
'bcd': [1, [1]],
'cxd': [1, [6]],
'bcx': [1, [5]],
'xda': [1, [7]]}
The code below works and is efficient since it only walks through the string once, but I'm wondering if there is a more elegant and/or more pythonic way to do this, maybe using a dictionary comprehension. I'm pretty new to python and still trying to figure out when it makes sense (or is even possible) to use comprehensions, etc.
text = 'abcdabcxdabc'
n = 3
d = {}
for i in range(len(text) - n + 1):
sub = text[i:i + n]
if sub in d:
d[sub][0] += 1
d[sub][1].append(i)
else:
d[sub] = [1, [i]]
print(d)
Update: Thanks for all the replies. They generally confirm my suspicion that this is too complex to be efficiently implemented in a single comprehension (but thanks to volcano for showing that it is possible if efficiency isn't a concern). Thanks to RemcoGerlich and Ignacio Vazquez-Abrams for pointing me to defaultdict. I'll have to dig into that more. My timing results suggest that there's a little more overhead in initializing a defaultdict compared to a dict, but that the running time might be slightly faster, at least for this example. (Timing results posted in a separate comment below.) Now I'm wondering if there are any situations where dict would be preferred over defaultdict. Also, thanks to Narcolei for pointing me to the timeit functions.
The problem is that
v[0]
depends on the length orv[1]
, which means that either the operation to generatev[1]
would have to operate twice, or that the dictionary would have to be iterated over in order to fill inv[0]
to replace the dummy value included the first time.Another problem is that dict comprehensions expect the entire key and value to be available immediately, which means that you would have to run a list comprehension to get all the indexes of the character, which means that the entire operation becomes O(n2).
The only optimization I would make would be to replace the creation of
d
so that you don't need to check for key containment.