How to turn a list of strings into a list of tuples by finding the largest repeating patterns in them in python3?

78 Views Asked by At

I was trying to find solution for my problem for some time but couldn't find any good idea yet..

I have a list of large strings where some character patterns (there are no spaces) are repeating. For my needs I need to reduce the memory these strings take by finding these patterns and converting the list of strings into two lists: one list that has all of the pattens typed once, and one list that has tuples/lists of pointers to the pattern list so that the original string can be recreated from the tuple.

For example, for the list of strings ['FOOBAR', 'BARFOO'] I would like to get ['FOO', 'BAR'], [(0, 1), (1, 0)]

The words in the pattern list should have length of at least 2 (unless we have no other choice, for example if there is only single character between two repeating patterns or entire input string has only length of 1) - or best as long as possible (cause addressing takes memory too so if some word happens once, it should have just one pointer instead of few).

Also the algorythm needs to be fast (best linear complexity) as my script performs this operation on user's input - and I don't want my user to wait too long.

Below I show example script of how it should work:

def getLists(str_list):
    # code here
    return pointers, out_strings


strings = ["FGJohnyRFGDERT", "VBSJohnR", "AAERFGR"]
pointers, out_strings = getLists(strings)
print(pointers, out_strings)
# [(0, 1, 2, 0, 3, 4, 5), (6, 1, 7), (8, 4, 0, 7)]["FG", "John", "yR", "D", "ER", "T", "VBS", "R", "AA"]

Thank you in advance for help! <3

EDIT: Someone proposed compressions like zlib. Sadly I need to unpack original strings in a low memory and very simple language that has no zlib support. So while compression algorythm can be very complex , the decompression has to be as simple as possible.

1

There are 1 best solutions below

9
On

This sounds like a compression problem. There's plenty of compression libraries out there. zlib is one that comes with the standard library & does a decent job of shrinking things.

Another thought is that you could change data structures. For instance, instead of a list of str, try a list of bytes, or a bytearray with a custom seeking & data segmentation protocol.

Anywho, here's an example of zlib in use:

import zlib

compressed_strings = [
    zlib.compress(string.encode(), 9) 
    for string in strings
]
assert strings == [
    zlib.decompress(compressed_string) 
    for compressed_string in compressed_strings
]

Best,