Load parallel corpora with NLTK and lemmatize English sentences


I have a corpus that is formatted like this:

sentence in english \t sentence in french \t score
sentence in english \t sentence in french \t score

Each sentence is already tokenized (tokens are separated by whitespace).

Now I need to load these sentences using NLTK. How can I do that? Which CorpusReader method should I use?

For example, this is how I can load the comtrans corpus provided by NLTK:

from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import AlignedCorpusReader

comtrans = LazyCorpusLoader(
    'comtrans', AlignedCorpusReader, r'(?!\.).*\.txt',
     encoding='iso-8859-1')

# First aligned English-French sentence pair in the file
fe = comtrans.aligned_sents('alignment-en-fr.txt')[0]
print(fe)

In fact, I need to do the same thing, but with a file I created myself.
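To make the goal concrete, here is a minimal sketch of reading such a file with plain Python instead of a CorpusReader. As far as I can tell, AlignedCorpusReader expects comtrans-style files rather than tab-separated lines, so this may be the simpler route. The function name read_parallel_corpus, the UTF-8 default encoding, and the assumption that the score is a number are all mine, not part of NLTK.

```python
import io


def read_parallel_corpus(path, encoding="utf-8"):
    """Yield (english_tokens, french_tokens, score) for each non-empty line.

    Assumes one sentence pair per line in the form
    "english \t french \t score", with tokens separated by whitespace
    and a numeric score (hypothetical layout, adjust as needed).
    """
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            english, french, score = line.split("\t")
            # The sentences are already tokenized, so split() is enough.
            yield english.split(), french.split(), float(score)
```

Each yielded pair of token lists could then, if an NLTK object is wanted, probably be wrapped with AlignedSent from nltk.translate, which takes the two token lists as its first arguments.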

As a last step, I need to lemmatize each word of the English sentences.
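For the lemmatization step, something along these lines is what I have in mind. It is only a sketch: lemmatize_tokens and make_wordnet_lemmatizer are hypothetical helper names, and the WordNet-based lemmatizer assumes that NLTK and its wordnet data are installed.

```python
def lemmatize_tokens(tokens, lemmatize):
    """Apply a lemmatizing function to every token of one sentence."""
    return [lemmatize(token) for token in tokens]


def make_wordnet_lemmatizer():
    """Return NLTK's WordNet lemmatizing function.

    Imported lazily so the rest of the code works without NLTK.
    Requires the wordnet data, e.g. via nltk.download('wordnet').
    """
    from nltk.stem import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize


# Example usage (not executed here):
#   lemmatize = make_wordnet_lemmatizer()
#   lemmas = lemmatize_tokens("the cats sleep".split(), lemmatize)
```

Note that WordNetLemmatizer.lemmatize treats every word as a noun unless a part-of-speech argument is passed, so verbs and adjectives would need POS tags to be reduced properly.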
