How to extract words from repeating strings


Here I have a string in a list:

['aaaaaaappppppprrrrrriiiiiilll']

I want to get the word 'april' out of the list, not just once, but as many times as the word 'april' actually occurs in the string.

The output should be something like:

['aprilaprilapril']

Because the word 'april' occurred three times in that string.

Well, the word itself didn't actually occur three times; each of its characters did. So I want to arrange these characters into 'april' as many times as they appear in the string.

My idea is basically to extract words from arbitrary strings: not just one copy of the word, but every copy that the string's characters can form, with the characters of each copy ordered the way I want.

But there are some annoying conditions. You can't delete all the elements in the list and then replace them with the word 'april' (that is, you can't replace the whole string with the word 'april'); you can only extract 'april' from the string. You also can't delete the list that holds the string. Think of the string as very important data: we only want some of it, that part must be ordered, and we need to delete everything that doesn't match our "data chain" (the word 'april'). But if you delete the whole string, you lose all the important data. You don't know how to make another one of these "data chains", so we can't just put the word 'april' back in the list.

If anyone knows how to solve my weird problem, please help me out. I am a beginner Python programmer. Thank you!


There are 4 best solutions below


One way is to use itertools.groupby, which groups consecutive identical characters. Unpacking the groups into zip then yields one tuple per round, and zip stops after n rounds, where n is the size of the smallest group (i.e. the group with the fewest characters):

from itertools import groupby

text = 'aaaaaaappppppprrrrrriiiiiilll'
result = ''
# groupby yields one run per letter; zip takes one character from each run per round
for each in zip(*[list(g) for k, g in groupby(text)]):
    result += ''.join(each)

# result == 'aprilaprilapril'
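To see why zip stops after three rounds, it can help to look at the run lengths directly. The following is only a small sketch; the names text, runs and shortest are chosen here for illustration and are not part of the answer above.

```python
from itertools import groupby

text = 'aaaaaaappppppprrrrrriiiiiilll'

# One (character, run length) pair per run of identical characters
runs = [(char, len(list(group))) for char, group in groupby(text)]

# The shortest run limits how many complete words can be formed
shortest = min(length for _, length in runs)
word = ''.join(char for char, _ in runs)

print(word * shortest)
```

Here the shortest run is 'lll', so the word can be formed three times.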

Another possible solution is a custom counter that counts each run of identical characters. Note that this relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on (in CPython 3.6 it is an implementation detail) and is not guaranteed in earlier versions:

def getCounts(strng):
    if not strng:
        return [], 0
    counts = {}
    current = strng[0]
    for c in strng:
        if c in counts:
            # Only extend the count while still inside the same run
            if current == c:
                counts[c] += 1
        else:
            # First time we see this character: start a new run
            current = c
            counts[c] = 1
    # Keys keep first-seen order; the shortest run limits the repeats
    return counts.keys(), min(counts.values())

result = ''
letters, repeats = getCounts('aaaaaaappppppprrrrrriiiiiilll')
for i in range(repeats):
    result += ''.join(letters)

# result == 'aprilaprilapril'

How about using regex?

import re

word = 'april'
text = 'aaaaaaappppppprrrrrriiiiiilll'

# One capturing group per letter; re.escape keeps special characters literal
regex = "".join(f"({re.escape(c)}+)" for c in word)
match = re.match(regex, text)

if match:
    # Find the lowest amount of character repeats
    lowest_amount = min(len(g) for g in match.groups())
    print(word * lowest_amount)
else:
    print("no match")

Outputs:

aprilaprilapril

Works like a charm
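One caveat: re.match only matches at the start of the text, so any leading characters that are not part of the word make it fail. A possible variation, sketched below under the assumption that such leading noise should be ignored, uses re.search instead. The function name extract_word is hypothetical and not part of the answer above.

```python
import re

def extract_word(word, text):
    """Return `word` repeated once per complete set of letter runs found in `text`."""
    if not word:
        return ""
    # One capturing group per letter; re.escape keeps special characters literal
    pattern = "".join(f"({re.escape(c)}+)" for c in word)
    # re.search scans the whole text instead of anchoring at position 0
    match = re.search(pattern, text)
    if match is None:
        return ""
    # The shortest run limits the number of complete words
    return word * min(len(g) for g in match.groups())

print(extract_word("april", "zzzaaaaaaappppppprrrrrriiiiiilllzzz"))
```

Like the original, this sketch assumes each letter of the word appears once; a word with a doubled letter such as 'appril' would need different handling.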


Here is a more native approach, with plain iteration.

It runs in O(n + k) time, where n is the length of the search string and k is the length of the key.

It uses an outer loop to iterate over the characters in the search key, then an inner while loop that consumes all consecutive occurrences of that character in the search string while maintaining a counter. Once all consecutive occurrences of the current letter have been consumed, it updates minLetterCount to the minimum of its previous value and this new count. Once we have iterated over all letters in the key, we return this accumulated minimum.

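def countCompleteSequenceOccurences(searchString, key):
    left = 0
    minLetterCount = 0
    letterCount = 0
    # Skip leading characters that come before the first letter of the key,
    # so that inputs such as "zzzaaa..." are still counted
    while key and left < len(searchString) and searchString[left] != key[0]:
        left += 1
    for i, searchChar in enumerate(key):
        # Consume the run of the current key letter
        while left < len(searchString) and searchString[left] == searchChar:
            letterCount += 1
            left += 1

        minLetterCount = letterCount if i == 0 else min(minLetterCount, letterCount)
        letterCount = 0

    return minLetterCount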

Testing:

testCasesToOracles = {
    "aaaaaaappppppprrrrrriiiiiilll": 3,
    "ppppppprrrrrriiiiiilll": 0,
    "aaaaaaappppppprrrrrriiiiii": 0,
    "aaaaaaapppppppzzzrrrrrriiiiiilll": 0,
    "pppppppaaaaaaarrrrrriiiiiilll": 0,
    "zaaaaaaappppppprrrrrriiiiiilll": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilll": 3,
    "aaaaaaappppppprrrrrriiiiiilllzzz": 3,
    "zzzaaaaaaappppppprrrrrriiiiiilllzzz": 3,
}

key = "april"
for case, oracle in testCasesToOracles.items():
    result = countCompleteSequenceOccurences(case, key)
    assert result == oracle

Usage:

key = "april"
result = countCompleteSequenceOccurences("aaaaaaappppppprrrrrriiiiiilll", key)
print(result * key)

Output:

aprilaprilapril


A word can only occur as many times as its least frequent letter. To account for the possibility of repeated letters in the word (for example, appril), you need to factor this count out. Here is one way of doing this using collections.Counter:

from collections import Counter

def count_recurrence(kernel, string):
    # we need to count both strings
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)

    # now get the effective count by dividing the occurrence in string by the
    # occurrence in kernel (integer division discards incomplete copies)
    effective_counter = {
        k: string_counter.get(k, 0) // v
        for k, v in kernel_counter.items()
    }

    # min occurrence of kernel is min of effective counter
    min_recurring_count = min(effective_counter.values())

    return kernel * min_recurring_count
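As a usage sketch, the block below restates the same idea in condensed form so that it runs on its own, then tries it on the question's string and on a word with a doubled letter. The input 'aapppprriill' is made up for illustration.

```python
from collections import Counter

def count_recurrence(kernel, string):
    # Condensed restatement of the Counter approach above
    kernel_counter = Counter(kernel)
    string_counter = Counter(string)
    # For each letter, how many complete copies of the kernel the string can supply
    copies = min(string_counter[k] // v for k, v in kernel_counter.items())
    return kernel * copies

print(count_recurrence('april', 'aaaaaaappppppprrrrrriiiiiilll'))
# 'appril' needs two p's per copy, so four p's allow two copies
print(count_recurrence('appril', 'aapppprriill'))
```

Note that Counter only looks at letter totals, not at their order, so count_recurrence('april', 'lirpa') also returns 'april'. If the letters must appear in the word's order, one of the run-based approaches above may be a better fit.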