Best way to understand the input text before applying ngram

283 Views Asked by Pyd At 09 October 2017 at 07:25

I am currently reading text from an Excel file and generating bigrams from it. In the sample code below, finalList holds the list of input words read from the Excel file.

I removed the stop words from the input with the help of the following library:

from nltk.corpus import stopwords

I then applied the bigram logic to the list of input words:

from nltk.util import ngrams
bigram = ngrams(finalList, 2)

input text: I completed my end-to-end process.

Current output: Completed end, end end, end process.

Desired output: completed end-to-end, end-to-end process.

That means some groups of words, such as end-to-end, should be treated as a single word.


There is 1 best solution below

Mohammed On 12 October 2017 at 22:43 BEST ANSWER

To solve your problem, strip the unwanted punctuation with a regex that leaves hyphens alone, then split on whitespace so that end-to-end stays a single token. See this example:

 import re
 text = 'I completed my end-to-end process..:?'
 # Remove runs of '.', ':' and '?'; the hyphen is not in the character class, so it is kept.
 pattern = re.compile(r"[.:?]+")
 new_text = re.sub(pattern, '', text)
 print(new_text)
 # I completed my end-to-end process


 # Now you can generate bigrams manually.
 # 1. Tokenize the new text
 tok = new_text.split()
 print(tok)  # If the token list is huge, print only the first five, e.g. print(tok[:5])
 # ['I', 'completed', 'my', 'end-to-end', 'process']

 # 2. Loop over the list and generate bigrams, store them in a var called bigrams
 bigrams = []
 for i in range(len(tok) - 1):  # -1 to avoid index error
     bigram = tok[i] + ' ' + tok[i + 1]  
     bigrams.append(bigram)


 # 3. Print your bigrams
 for bi in bigrams:
     print(bi, end = ', ')

I completed, completed my, my end-to-end, end-to-end process,
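If you also want the stop words dropped so that the result matches the desired output in the question, the same idea can be sketched as below. This is only a minimal sketch under assumptions: the STOP_WORDS set and the hyphen_aware_bigrams helper are hypothetical names made up for illustration (you would normally use the NLTK stop word list instead), and the token pattern simply treats letters and digits joined by internal hyphens as one word.

```python
import re

# Hypothetical stop-word set for illustration only; in practice you would
# likely use the NLTK English stop word list instead.
STOP_WORDS = {"i", "my", "the", "a", "an"}

# A token is a run of letters/digits, optionally joined by internal hyphens,
# so "end-to-end" is matched as one token and trailing punctuation is skipped.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def hyphen_aware_bigrams(text, stop_words=STOP_WORDS):
    """Return bigrams as tuples, keeping hyphenated words as single tokens."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    tokens = [t for t in tokens if t not in stop_words]
    # Pair each token with its successor.
    return list(zip(tokens, tokens[1:]))


pairs = hyphen_aware_bigrams("I completed my end-to-end process.")
print(", ".join(" ".join(p) for p in pairs))
# completed end-to-end, end-to-end process
```

The filtered token list could equally be passed to ngrams(tokens, 2) from NLTK; the important part is that tokenization keeps the hyphen before the bigrams are built.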

I hope this helps!
