I am using the NLTK WordPunctTokenizer to tokenize this sentence:
في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء
My code is:
# -*- coding: utf-8 -*-
import nltk

# a unicode literal, so the tokenizer's \w pattern matches Arabic letters in Python 2
sentence = u"في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"
wordsArray = nltk.tokenize.wordpunct_tokenize(sentence)
print " ".join(wordsArray)
I noticed that the printed output is the same as the input sentence, so why use the tokenizer at all? Also, would there be any difference in building a machine translation system (Moses) from the token files rather than from plain text files?
The output of the tokenizer is a list of tokens (wordsArray). What your code then does is join the tokens in that list back into one string with the command:

" ".join(wordsArray)

which is why the printed output looks like the input sentence again. Replace this with:

print wordsArray

to see the individual tokens the tokenizer produced.
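To see the difference concretely, here is a minimal sketch with a made-up English sample. wordpunct_tokenize splits text on the pattern \w+|[^\w\s]+, so runs of punctuation become tokens of their own:

import nltk

sample = "wait...what?!"
tokens = nltk.tokenize.wordpunct_tokenize(sample)
print tokens            # ['wait', '...', 'what', '?!']
print " ".join(tokens)  # wait ... what ?!

Note that the joined string is not identical to the input: the join inserts spaces around the punctuation tokens, which is easy to miss in right-to-left Arabic text.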
Your second question, regarding Moses, is not clear; please try to be more specific.