Difference between tokenized and normal text in Python NLTK


I am using the WordPunct Tokenizer to tokenize this sentence:

في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء

My code is:

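import nltk
sentence = " في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"
# wordpunct_tokenize returns a list of word and punctuation tokens
wordsArray = nltk.tokenize.wordpunct_tokenize(sentence)
# Joining the tokens with spaces turns the list back into a single string
print(" ".join(wordsArray))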

I noticed that the printed output is the same as the input sentence, so why use the tokenizer at all? Also, would it make any difference whether I build a machine translation system (MOSES) from tokenized files or from normal text files?

There are 1 best solutions below

The tokenizer returns a list of tokens (wordsArray). Your code then joins those tokens back into a single string, which is why the output looks like the input sentence:

print(" ".join(wordsArray))

To see the tokens themselves, print the list instead:

print(wordsArray)
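To make the difference concrete, here is a minimal sketch that uses only the standard re module. It assumes the pattern `\w+|[^\w\s]+`, which is the regular expression NLTK documents for its WordPunctTokenizer. The helper name wordpunct_like_tokenize is hypothetical and only stands in for nltk.tokenize.wordpunct_tokenize. It is written for Python 3, where `\w` matches Arabic letters by default.

```python
import re

# Assumed to mirror NLTK's WordPunctTokenizer pattern: runs of word
# characters, or runs of characters that are neither word nor whitespace.
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"


def wordpunct_like_tokenize(text):
    """Hypothetical stand-in for nltk.tokenize.wordpunct_tokenize."""
    return re.findall(WORDPUNCT_PATTERN, text)


# A short fragment of the sentence from the question
sentence = "يضيع ...ادور على شاحن"

tokens = wordpunct_like_tokenize(sentence)
joined = " ".join(tokens)

print(tokens)              # the list of tokens
print(len(tokens))         # 5: the "..." becomes a token of its own
print(joined == sentence)  # False: joining adds a space after "..."
```

The list is what later processing steps usually work with. The joined string only looks like the original because most tokens were already separated by spaces. Here it still differs in one place, where the punctuation was split from the following word.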

Your second question, about MOSES, is not clear. Please try to be more specific.