Given a list of characters, list('Hello▁world▁'), and a list of character tuples, i.e.
[('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
The objective is to iterate through the tuples and merge adjacent items in the character list whenever they match a tuple. I've tried this and it works:
import copy

def matcher(s, ngram):
    while s:
        window = tuple(s[:2])  # Since the string tuples are in pairs.
        if window == ngram:
            yield "".join(window)  # Merge the matching pair into one token.
            s = s[2:]
        else:
            yield s[0]
            s = s[1:]
def combine_ngrams(s, vocab):
    prev = copy.copy(s)
    while True:
        # One pass: try every vocab pair over the current sequence.
        for v in vocab:
            s = list(matcher(s, v))
        if s == prev:  # Nothing merged in this pass, so we have converged.
            break
        else:
            prev = s
    return s
vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'),
('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
s = list('Hello▁world▁')
combine_ngrams(s, vocab)
[out]:
['Hello▁', 'world▁']
But the multiple while loops in both the outer combine_ngrams() function and the inner matcher() look like something that could easily be simplified. Or maybe the operation doesn't need to loop through the tuples at all, and some regex method that iteratively applies the vocab substitutions would work. Is there a way to simplify the nested while loops in the combine_ngrams function?
Here are more input/output examples:
[in]:
s = list('abcde'); vocab = [('a', 'b'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]
s = list('abcde'); vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]
s = list('aaab'); vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]
s = list('Hello▁ポケモンセンター▁world▁'); vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
[out]:
['ab', 'c', 'd', 'e']
['abcde']
['aa', 'a', 'b']
['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁', 'world▁']
P/S: For anyone interested, this is related to the byte-pair encoding (BPE) algorithm. If there's a more algorithmic (rather than Pythonic, loop-based) way to solve this problem, please do suggest it.
Instead of regular expressions, you can use plain string replacement.
This approach requires a delimiter that never occurs in the input. The delimiter also should not contain a repeated sequence, so that str.split works correctly. I have not come up with a general algorithm for choosing an appropriate delimiter, but for practical purposes you can define a string (sequence) that is guaranteed never to appear in the input, for example a character from the Unicode Private Use Area. Note that you do not have to forbid certain characters in the input altogether, only the use of one particular string.
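For example, a minimal sketch of this idea could look like the following (the delimiter "\ue000" from the Private Use Area and the name combine_ngrams_replace are illustrative choices, not required by the approach). Every token is wrapped in the delimiter, so a vocab pair can only match two whole adjacent tokens, and merging a pair simply deletes the delimiter between them:
def combine_ngrams_replace(s, vocab, delim="\ue000"):
    # Wrap every token in the delimiter so (a, b) only matches two whole
    # adjacent tokens, never a substring of a longer token.
    text = delim + delim.join(s) + delim
    while True:
        prev = text
        for a, b in vocab:
            # Merging the pair (a, b) means deleting the delimiter between them.
            text = text.replace(delim + a + delim + b + delim,
                                delim + a + b + delim)
        if text == prev:  # Fixpoint: no pair was merged in this pass.
            break
    return text.split(delim)[1:-1]  # Drop the empty ends from the outer delimiters.

combine_ngrams_replace(list('Hello▁world▁'), vocab)
# ['Hello▁', 'world▁']
Because two adjacent matches share a delimiter, a single replace pass can merge fewer pairs than the generator version, but looping until nothing changes converges to the same outputs for the examples above.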
This solution may seem like a dumb implementation, but str.replace is so well optimized that it would be hard to find a better alternative.
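If you want to check that claim on your own data, a quick comparison with timeit could look like this (purely illustrative; the numbers depend on the input and the size of the vocab):
import timeit

# Compare the generator-based version with the replace-based sketch above.
print(timeit.timeit("combine_ngrams(list('Hello▁world▁'), vocab)",
                    globals=globals(), number=10_000))
print(timeit.timeit("combine_ngrams_replace(list('Hello▁world▁'), vocab)",
                    globals=globals(), number=10_000))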