I have a corpus formatted like this:
sentence in english \t sentence in french \t score
sentence in english \t sentence in french \t score
Each sentence is already tokenized (tokens are separated by whitespace).
Now I need to load these sentences using NLTK. How can I do that? Which CorpusReader class or method should I use?
For example, I can load the comtrans corpus provided by NLTK like this:
from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import AlignedCorpusReader
comtrans = LazyCorpusLoader(
    'comtrans', AlignedCorpusReader, r'(?!\.).*\.txt',
    encoding='iso-8859-1')
# First aligned English-French sentence pair
fe = comtrans.aligned_sents('alignment-en-fr.txt')[0]
print(fe)
What I actually need is to do the same thing with a file I created myself.
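To make the goal concrete, a minimal sketch of the result I'm after, written in plain Python 3 without any CorpusReader, might look like the following. `ScoredPair` and `read_scored_pairs` are names I made up for illustration (they are not part of NLTK), the two sample lines are invented, and I'm assuming the score is a float.

```python
import io
from collections import namedtuple

# Hypothetical container for one line of the file; not an NLTK class.
ScoredPair = namedtuple("ScoredPair", ["english", "french", "score"])

def read_scored_pairs(lines):
    """Yield one ScoredPair per non-empty 'english<TAB>french<TAB>score' line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        english, french, score = line.split("\t")
        # Sentences are already tokenized, so splitting on whitespace is enough.
        yield ScoredPair(english.split(), french.split(), float(score))

# Invented sample data standing in for my real file.
sample = io.StringIO(
    "the house is small\tla maison est petite\t0.9\n"
    "i like cheese\tj' aime le fromage\t0.75\n"
)
pairs = list(read_scored_pairs(sample))
print(pairs[0].english)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle. What I'd like to know is whether NLTK can give me something equivalent directly.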
As a final step, I need to lemmatize each word of the English sentences.