NLTK's tokenizer is acting very strangely when I try to tokenize a text containing quotation marks.
I have a .txt file like this:
This text has some "quotation marks" to really 'make things' difficult.
I've written this Python script to read the file and tokenize the text:
from nltk.tokenize import word_tokenize

with open("test_file.txt", encoding='utf-8') as test_file:
    test_text = test_file.read()

print(word_tokenize(test_text))
I would have expected (or, at least, hoped for) a result like this, where the quotation marks are treated like punctuation characters:
['This', 'text', 'has', 'some', '"', 'quotation', 'marks', '"', 'to', 'really', "'", 'make', 'things', "'", 'difficult', '.']
Instead this gives a weird result:
['This', 'text', 'has', 'some', '``', 'quotation', 'marks', "''", 'to', 'really', "'make", 'things', "'", 'difficult', '.']
Specifically, the tokens representing the quotation marks seem odd, as each is represented differently:
['``', "''", "'make", "'"]
The double quotes have been changed into different characters, the first single quote is attached to the following word token, and the second has become a separate token of its own. Maybe this is expected behaviour, but it makes a mess when I try to find the indexes of tokens within the original string, because '``' and "''" are two characters long, whereas '"' is just one.
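For what it's worth, even a crude regex split (a minimal sketch using only the standard library, and obviously too naive for real text, since it would also break contractions like "I've") produces the one-character quote tokens I was hoping for:

```python
import re

text = 'This text has some "quotation marks" to really \'make things\' difficult.'

# Grab runs of word characters, or any single non-space punctuation character
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['This', 'text', 'has', 'some', '"', 'quotation', 'marks', '"', 'to',
#  'really', "'", 'make', 'things', "'", 'difficult', '.']
```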
I've tried escaping the quotation marks in various ways before tokenizing:
if '"' in test_text:
    test_text = '\"'.join(test_text.split('"'))
and
if '"' in test_text:
    test_text = '\\"'.join(test_text.split('"'))
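As far as I can tell, neither of these actually helps: '\"' is the same one-character string as '"', so the first version is a no-op, and the second just inserts a literal backslash before each quote, which the tokenizer then splits off as yet another punctuation token. A quick check:

```python
s = 'some "quotation marks" here'

# '\"' is just '"', so splitting on '"' and re-joining with '\"' is a no-op
assert '\"'.join(s.split('"')) == s

# '\\"' re-joins with a literal backslash before each quote
assert '\\"'.join(s.split('"')) == 'some \\"quotation marks\\" here'
```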
I've also tried embedding the string in triple quotes before tokenizing it:
test_text = f"""{test_text}"""
In the end I just wrote the following script to reverse the changes made by NLTK's tokenizer, and to separate the single quote from the following word token, all after tokenizing the string:
token_list = word_tokenize(test_text)

# Map the two-character quote tokens back to plain double quotes
token_list = [tok if tok not in ("``", "''") else '"' for tok in token_list]

# Split a leading single quote off any token it's attached to
fixed_tokens = []
for token in token_list:
    if len(token) > 1 and token[0] == "'":
        fixed_tokens.extend([token[0], token[1:]])
    else:
        fixed_tokens.append(token)
token_list = fixed_tokens
This solution seems really janky though. Is there any more elegant way to get the quotation marks to be treated like any other punctuation characters?