Keeping punctuation as its own unit in Preprocessed Text

What code splits a sentence into a list of its constituent words AND punctuation marks? Most text preprocessing tools remove punctuation.

For example, if I enter this:

"Punctuations to be included as its own unit."

The desired output would be:

result = ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

Many thanks!

There are 2 answers below

Best answer

You might want to consider using the Natural Language Toolkit (NLTK). Its word_tokenize function keeps punctuation marks as separate tokens.

Try this:

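import nltk

# word_tokenize needs NLTK's Punkt tokenizer data. If it is missing, download it once
# with nltk.download("punkt") (newer NLTK releases name the resource "punkt_tab").

sentence = "Punctuations to be included as its own unit."
tokens = nltk.word_tokenize(sentence)  # words and punctuation marks become separate tokens
print(tokens)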

Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']

Another answer

The following snippet uses a regular expression to split the text into a list of words and punctuation marks.

import re
import string

# Match either a run of word characters or a single punctuation character.
# re.escape keeps characters such as ], \, ^ and - literal inside the character class.
pattern = r"\w+|[" + re.escape(string.punctuation) + "]"

content = "Punctuations to be included as its own unit."
tokens = re.findall(pattern, content)
print(tokens)

Output: ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
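
The regular-expression approach above can also be written without string.punctuation. The sketch below is one possible variation, not part of the original answer: it treats any character that is neither a word character nor whitespace as punctuation, so non-ASCII marks such as "…" are kept as well. The names TOKEN_PATTERN and split_words_and_punctuation are hypothetical, chosen only for this illustration.

```python
import re

# Assumption for this sketch: a "word" is a run of \w characters, and anything
# that is neither a word character nor whitespace counts as one punctuation token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def split_words_and_punctuation(text):
    """Return the words and individual punctuation marks of text, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = split_words_and_punctuation("Punctuations to be included as its own unit.")
print(tokens)
# ['Punctuations', 'to', 'be', 'included', 'as', 'its', 'own', 'unit', '.']
```

Note that this pattern splits contractions at the apostrophe ("don't" becomes 'don', "'", 't') and treats the underscore as a word character, so it may need adjusting for text where those cases matter.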