Simplifying ngram loops to compress the string given a fixed set of ngrams


Given a list of characters, e.g. list('Hello▁world▁'), and a list of character tuples, i.e.

[('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

The objective is to iterate through the tuples and collapse adjacent items in the character list whenever they match a tuple. I've tried this and it works:

import copy

def matcher(s, ngram):
  # Scan left to right: wherever the next two items equal `ngram`,
  # emit them merged into one string; otherwise emit a single item.
  while s:
    window = tuple(s[:2]) # Since the vocab tuples are in pairs.
    if window == ngram:
      yield "".join(window)
      s = s[2:]
    else:
      yield s[0]
      s = s[1:]

def combine_ngrams(s, vocab):
  prev = copy.copy(s)
  while True:
    # Sweep the whole vocab, then repeat until a full sweep changes nothing.
    for v in vocab:
      s = list(matcher(s, v))
    if s == prev:
      break
    else:
      prev = s
  return s


vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), 
         ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

s = list('Hello▁world▁')

combine_ngrams(s, vocab)

[out]:

['Hello▁', 'world▁']

But the multiple while loops in both the outer function combine_ngrams() and the inner matcher() look like something that can be easily simplified.

Or maybe the operation doesn't need to loop through the tuples at all, and some regex method that iteratively applies the vocab substitutions would work. Is there a way to simplify the nested while loops in the combine_ngrams() function?


Here are more input/output examples:

[in]:

s = list('abcde'); vocab = [('a', 'b'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]

s = list('abcde'); vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'),  ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]

s = list('aaab'); vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]

s = list('Hello▁ポケモンセンター▁world▁'); vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]

[out]:

['ab', 'c', 'd', 'e']

['abcde']

['aa', 'a', 'b']

['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁', 'world▁']

P/S: For anyone interested, this is related to the byte-pair encoding (BPE) algorithm. If there's a more algorithmic (rather than Pythonic loop-based) way to solve this problem, please do suggest it.
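On the BPE connection: standard BPE implementations treat the position of each pair in the vocab as a merge rank and repeatedly merge the adjacent pair with the best (lowest) rank, rather than sweeping the vocab in order. A minimal sketch of that rank-based loop (the name bpe_merge is mine; for the examples above it happens to produce the same results as the sweep, though the two orderings are not equivalent for every input):

```python
def bpe_merge(s, vocab):
    """Greedy BPE-style merging: repeatedly merge the adjacent token
    pair whose vocab index (rank) is lowest, leftmost first on ties."""
    ranks = {pair: i for i, pair in enumerate(vocab)}
    tokens = list(s)
    while len(tokens) > 1:
        # Rank every adjacent pair; pairs not in the vocab get an infinite rank.
        ranked = [(ranks.get(pair, float("inf")), i)
                  for i, pair in enumerate(zip(tokens, tokens[1:]))]
        best_rank, i = min(ranked)
        if best_rank == float("inf"):
            break  # no mergeable pair left
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

print(bpe_merge(list('aaab'), [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]))
# ['aa', 'a', 'b']
```

This avoids the nested fixpoint loops entirely: each iteration makes exactly one merge, and the loop ends when no adjacent pair is in the vocab.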

There are 2 best solutions below

Instead of regular expressions, you can use plain-text string replacement.

This approach requires a delimiter that never occurs in the input. The delimiter also must not contain the same sequence repeatedly, so that str.split splits it unambiguously.

I have not come up with a general algorithm for choosing an appropriate delimiter. However, for practical purposes, you could define a string (sequence) that cannot appear in the input, using the Unicode Private Use Area. Note that you do not have to forbid individual characters, only one particular string.

def combine_ngrams2(s, vocab):
    sep = "\uE000"
    ss = "".join(s)
    while sep in ss:  # grow the delimiter until it is absent from the input
        sep += chr(0xE000 + len(sep))
    assert sep not in ss and len(set(sep)) == len(sep)

    prev = None
    current = f"{sep}{sep.join(s)}{sep}"  # surround every token with the delimiter
    while prev != current:  # repeat until a full sweep changes nothing
        prev = current
        for v1, v2 in vocab:
            # Merge adjacent tokens v1, v2 by dropping the delimiter between them.
            current = current.replace(f"{sep}{v1}{sep}{v2}{sep}", f"{sep}{v1}{v2}{sep}")

    return current[len(sep) : -len(sep)].split(sep)

This may look like a naive implementation, but str.replace is so well optimized that it is hard to beat.
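The delimiter-growing step at the top of combine_ngrams2 can be pulled out and tested on its own; a minimal sketch (the helper name pick_delimiter is mine):

```python
def pick_delimiter(text):
    # Grow a candidate from distinct Private Use Area characters until
    # it no longer occurs in `text`. No individual character is banned
    # from the input, only the final multi-character string.
    sep = "\uE000"
    while sep in text:
        sep += chr(0xE000 + len(sep))
    return sep

pick_delimiter("Hello▁world▁")   # → "\ue000" (input contains no PUA characters)
pick_delimiter("abc\ue000def")   # → "\ue000\ue001"
```

Because each appended character is a new code point (0xE000 + current length), the result always consists of distinct characters, satisfying the assert in combine_ngrams2.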


Yes, there is a way to simplify the nested while loops in the combine_ngrams() function. One approach is to join the characters with a delimiter that never occurs in the input and use re.sub to merge each vocabulary pair wherever it appears. This eliminates the matcher() function and the inner while loop.

import re

def combine_ngrams(s, vocab):
    sep = "\x00"  # assumes the input never contains a NUL character
    text = sep + sep.join(s) + sep  # surround every token with the delimiter
    prev = None
    while prev != text:  # repeat until a full sweep makes no change
        prev = text
        for a, b in vocab:
            # Merge adjacent tokens a, b by removing the delimiter between them.
            pattern = re.escape(f"{sep}{a}{sep}{b}{sep}")
            text = re.sub(pattern, lambda m: f"{sep}{a}{b}{sep}", text)
    return text[1:-1].split(sep)

Here's what happens when we feed the inputs to the function:

s = list('Hello▁world▁')
vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
print(combine_ngrams(s, vocab)) # output: ['Hello▁', 'world▁']

s = list('abcde')
vocab = [('a', 'b'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]
print(combine_ngrams(s, vocab)) # output: ['ab', 'c', 'd', 'e']

s = list('abcde')
vocab = [('a', 'b'), ('ab', 'c'), ('b', 'c'), ('a', 'bc'), ('abc', 'd'), ('abcd', 'e')]
print(combine_ngrams(s, vocab)) # output: ['abcde']

s = list('aaab')
vocab = [('a', 'a'), ('a', 'aa'), ('aaa', 'b')]
print(combine_ngrams(s, vocab)) # output: ['aa', 'a', 'b']

s = list('Hello▁ポケモンセンター▁world▁')
vocab = [('l', 'l'), ('ell', 'o▁'), ('Hell', 'o▁'), ('w', 'or'), ('o', 'r'), ('e', 'l'), ('el', 'l'), ('H', 'ell'), ('H', 'e'), ('He', 'll'), ('worl', 'd▁'), ('wor', 'l'), ('l', 'd▁'), ('d', '▁'), ('wor', 'ld▁'), ('H', 'el'), ('o', '▁'), ('w', 'o'), ('l', 'o▁'), ('l', 'o')]
print(combine_ngrams(s, vocab)) # output: ['Hello▁', 'ポ', 'ケ', 'モ', 'ン', 'セ', 'ン', 'タ', 'ー', '▁', 'world▁']

Hope this helps you accomplish this with regex patterns.