For example, given the sentence "The corporate balance sheets data are available on an annual basis", I need to label "corporate balance sheets", which is a substring of that sentence.
So, the pattern that I need to find is:
"corporate balance sheets"
Given the string:
"The corporate balance sheets data are available on an annual basis".
The output label sequence I want will be:
[0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
I have a large number of sentences (more than 2 GB) and many patterns to find, and I have no idea how to do this efficiently in Python. Can someone suggest a good algorithm?
Since all words in the substring have to match, you can use `all` to check that and update the appropriate indices as you iterate through the sentence:
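A minimal sketch of that idea, assuming words are separated by whitespace and must match exactly (case-sensitive). The function name `label_sentence` and its parameters are hypothetical, chosen only for illustration:

```python
def label_sentence(sentence, pattern):
    """Return one 0/1 label per word; 1 marks words inside a match of pattern."""
    words = sentence.split()      # assumption: whitespace tokenization
    target = pattern.split()
    labels = [0] * len(words)
    n = len(target)
    if n == 0:
        return labels             # an empty pattern labels nothing
    # Slide a window of n words over the sentence.
    for i in range(len(words) - n + 1):
        # all() stops at the first word that does not match.
        if all(words[i + j] == target[j] for j in range(n)):
            labels[i:i + n] = [1] * n
    return labels


sentence = "The corporate balance sheets data are available on an annual basis"
print(label_sentence(sentence, "corporate balance sheets"))
# prints [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```

For input on the order of gigabytes, one option is to read the file line by line and call the function on each line, so the whole corpus never has to sit in memory. With many patterns, this sketch would have to be called once per pattern, which may be too slow; grouping patterns by their first word is one possible way to reduce the number of comparisons.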