How to use NLTK tokeniser on text with quotation marks?


NLTK's tokeniser is acting very strangely when I try to tokenise a text with quotation marks in it.

I have a .txt file like this:

This text has some "quotation marks" to really 'make things' difficult.

I've written this Python script to read the file and tokenize the text:

from nltk.tokenize import word_tokenize

with open("test_file.txt", encoding='utf-8') as test_file:
    test_text = test_file.read()

print(word_tokenize(test_text))

I would have expected (or, at least, hoped for) a result like this, where the quotation marks are treated like punctuation characters:

['This', 'text', 'has', 'some', '"', 'quotation', 'marks', '"', 'to', 'really', "'", 'make', 'things', "'", 'difficult', '.']

Instead, it gives this weird result:

['This', 'text', 'has', 'some', '``', 'quotation', 'marks', "''", 'to', 'really', "'make", 'things', "'", 'difficult', '.']

Specifically, the tokens representing the four quotation marks look odd, since each one is represented differently:

['``', "''", "'make", "'"]

The double quotes are changed into different characters entirely ('``' for the opening quote, "''" for the closing one). The first single quote is attached to the start of the following word token, while the second becomes a separate token on its own. Maybe this is expected behaviour, but it makes a mess when I try to map tokens back to their indexes in the original string, because '``' and "''" are two characters long whereas the '"' they replace is just one.
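(From what I can tell, the '``' and "''" tokens are the Penn Treebank quoting convention that NLTK's default tokenizer follows, so this does seem to be deliberate.) For comparison, NLTK's purely regex-based wordpunct_tokenize leaves the quote characters untouched and, on this sentence at least, produces exactly the output I was hoping for:

```python
from nltk.tokenize import wordpunct_tokenize

text = 'This text has some "quotation marks" to really \'make things\' difficult.'
print(wordpunct_tokenize(text))
# ['This', 'text', 'has', 'some', '"', 'quotation', 'marks', '"',
#  'to', 'really', "'", 'make', 'things', "'", 'difficult', '.']
```

One caveat: its pattern matches runs of punctuation, so adjacent punctuation like ," would come out as a single two-character token.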

I've tried escaping the quotation marks in various ways before tokenizing. Neither of these helps: the first is a no-op, since '\"' is the same one-character string as '"', and the second just inserts literal backslashes into the text.

if '"' in test_text:
    test_text = '\"'.join(test_text.split('"'))

and

if '"' in test_text:
    test_text = '\\"'.join(test_text.split('"'))

I've also tried embedding the string in triple quotes before tokenizing, although this just rebuilds the same string and changes nothing:

test_text = f"""{test_text}"""

In the end I just wrote the following to undo the tokenizer's substitutions and to split the leading single quote off the following word token, all after tokenizing the string:

token_list = word_tokenize(test_text)
# Map the Treebank-style quote tokens back to plain double quotes
token_list = ['"' if tok in ("``", "''") else tok for tok in token_list]
# Split a leading single quote off tokens like "'make" (building a new list,
# since splicing into token_list mid-loop shifts the indexes of later tokens)
fixed = []
for token in token_list:
    fixed += ["'", token[1:]] if len(token) > 1 and token[0] == "'" else [token]
token_list = fixed

This solution seems really janky though. Is there any more elegant way to get the quotation marks to be treated like any other punctuation characters?
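One option I've considered (only tested on this example, so I may be missing side effects) is sidestepping word_tokenize entirely with a RegexpTokenizer whose pattern makes every single punctuation character its own token:

```python
from nltk.tokenize import RegexpTokenizer

# One token per word, or per individual punctuation character
tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")
text = 'This text has some "quotation marks" to really \'make things\' difficult.'
print(tokenizer.tokenize(text))
# ['This', 'text', 'has', 'some', '"', 'quotation', 'marks', '"',
#  'to', 'really', "'", 'make', 'things', "'", 'difficult', '.']
```

But I assume this loses other behaviour of word_tokenize, such as splitting contractions like "don't" into "do" and "n't".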
