How to find a multi-word string within a string and label it in Python?


For example, given the sentence "The corporate balance sheets data are available on an annual basis", I need to label "corporate balance sheets", which is a substring of that sentence.

So, the pattern that I need to find is:

"corporate balance sheets"

Given the string:

"The corporate balance sheets data are available on an annual basis".

The output label sequence I want is:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

There are a lot of sentences (more than 2 GB) and a lot of patterns I need to find. I have no idea how to do this efficiently in Python. Can someone suggest a good algorithm?


There are 2 best solutions below

0
On BEST ANSWER

Since all words in the substring have to match, you can use all to check that and update the appropriate indices as you iterate through the sentence:

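def encode(sub, sent):
    subwords, sentwords = sub.split(), sent.split()
    res = [0 for _ in sentwords]
    # Try every start position where the whole substring still fits.
    # (Slicing with [:-len(subwords) + 1] would be empty for a one-word substring.)
    for i in range(len(sentwords) - len(subwords) + 1):
        if all(x == y for x, y in zip(subwords, sentwords[i:i + len(subwords)])):
            # Mark every word covered by the match
            for j in range(len(subwords)):
                res[i + j] = 1
    return res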


sub = "corporate balance sheets"
sent = "The corporate balance sheets data are available on an annual basis"
print(encode(sub, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

sent = "The corporate balance data are available on an annual basis sheets"
print(encode(sub, sent))
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
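
Since the question mentions many patterns, here is a minimal sketch of one way to extend the same idea: group the patterns by their first word, so each sentence position is only compared against patterns that could start there. The names build_index and encode_many and the pattern list are hypothetical, chosen for illustration, and matching is still on whitespace-separated words.

```python
from collections import defaultdict


def build_index(patterns):
    """Group tokenized patterns by their first word for quick lookup."""
    index = defaultdict(list)
    for pattern in patterns:
        words = pattern.split()
        if words:  # skip empty patterns
            index[words[0]].append(words)
    return index


def encode_many(index, sent):
    """Label every word of `sent` that is covered by any indexed pattern."""
    sentwords = sent.split()
    res = [0] * len(sentwords)
    for i, word in enumerate(sentwords):
        # Only patterns starting with this word can match at position i.
        for words in index.get(word, ()):
            if sentwords[i:i + len(words)] == words:
                res[i:i + len(words)] = [1] * len(words)
    return res


patterns = ["corporate balance sheets", "annual basis"]  # hypothetical pattern list
index = build_index(patterns)
sent = "The corporate balance sheets data are available on an annual basis"
print(encode_many(index, sent))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
```

The index is built once and reused for every sentence, so a large file could be read line by line rather than loaded into memory at once.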
8
On

Using re.sub, split, and list comprehensions:

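import re

search_word = 'corporate balance sheets'
# Escape the phrase and anchor it on word boundaries so it is matched literally
p = re.compile(r'\b' + re.escape(search_word) + r'\b')
sentence = "The corporate balance sheets data are available on an annual basis"

# One label per word of the phrase
lst = [1] * len(search_word.split())
# Collapse each match into a single placeholder token, then split into words
vect = [lst if item == '__match_word' else 0 for item in p.sub('__match_word', sentence).split()]
# Wrap the 0s in lists so that every element can be flattened the same way
vectlstoflst = [[vec] if isinstance(vec, int) else vec for vec in vect]
flattened = [val for sublist in vectlstoflst for val in sublist]
print(flattened)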

Output:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

With sentence = "The corporate balance sheets data are available on an annual basis sheets"

Output:

[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
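
If several patterns need to be labeled with a regex, one possible variation is to compile them into a single alternation and map each match's character span back onto word positions. This is only a sketch under a few assumptions: encode_regex is a hypothetical name, the pattern list is non-empty, words in both patterns and sentences are separated by single spaces, and overlapping matches are not handled because finditer returns non-overlapping matches only.

```python
import re


def encode_regex(patterns, sentence):
    """Label words covered by any pattern, using one compiled alternation."""
    # Longest patterns first so the alternation prefers the longest match.
    ordered = sorted(patterns, key=len, reverse=True)
    # The lookarounds require whitespace (or a string edge) around the match.
    regex = re.compile(
        r"(?<!\S)(?:" + "|".join(map(re.escape, ordered)) + r")(?!\S)"
    )
    # Character span of every whitespace-separated word.
    spans = [m.span() for m in re.finditer(r"\S+", sentence)]
    labels = [0] * len(spans)
    for match in regex.finditer(sentence):
        for k, (start, end) in enumerate(spans):
            # A word is labeled when it lies entirely inside the match.
            if start >= match.start() and end <= match.end():
                labels[k] = 1
    return labels


sentence = "The corporate balance sheets data are available on an annual basis sheets"
print(encode_regex(["corporate balance sheets"], sentence))
# [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

Unlike the placeholder approach, this keeps one label per original word even when a match sits next to other text, at the cost of comparing every match against every word span.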