Load parallel corpora with NLTK and lemmatize English sentences

935 Views Asked by Chester Mc Allister At 03 June 2025 at 10:15

I have a corpus that is formatted like this:

sentence in english \t sentence in french \t score
sentence in english \t sentence in french \t score

Each sentence is already tokenized (tokens are separated by a whitespace).

Now I need to load these sentences using NLTK. How can I do that? Which CorpusReader method should I use?

In this example, I can load the comtrans corpus provided by NLTK:

from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import AlignedCorpusReader

# Load every .txt file of the comtrans corpus, skipping hidden files
comtrans = LazyCorpusLoader(
    'comtrans', AlignedCorpusReader, r'(?!\.).*\.txt',
    encoding='iso-8859-1')

# First aligned English-French sentence pair
fe = comtrans.aligned_sents('alignment-en-fr.txt')[0]
print(fe)

In fact, I need to do the same thing, but with a file I created myself.
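To make the file format above concrete, here is a minimal sketch of reading it with plain Python rather than a CorpusReader. The names SentencePair, parse_parallel_lines and load_parallel_corpus are hypothetical, and the sketch assumes that every non-empty line has exactly three tab-separated fields, that the score is numeric, and that the file is UTF-8 encoded:

```python
import io
from collections import namedtuple

# Hypothetical record type for one line of the file.
SentencePair = namedtuple("SentencePair", ["english", "french", "score"])


def parse_parallel_lines(lines):
    """Yield one SentencePair per non-empty 'english \\t french \\t score' line."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Assumption: exactly three tab-separated fields per line.
        english, french, score = line.rstrip("\n").split("\t")
        # The sentences are already tokenized, so splitting on whitespace
        # recovers the tokens. Assumption: the score parses as a float.
        yield SentencePair(english.split(), french.split(), float(score))


def load_parallel_corpus(path, encoding="utf-8"):
    """Read the whole file at `path` into a list of SentencePair records."""
    with io.open(path, encoding=encoding) as handle:
        return list(parse_parallel_lines(handle))
```

Calling load_parallel_corpus on the file's path would then give a list whose items expose .english and .french as token lists and .score as a float.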

As a last step, I need to lemmatize each word of the English sentences.
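For the lemmatization step, one possible sketch uses NLTK's WordNetLemmatizer. The helper names lemmatize_tokens and make_wordnet_lemmatizer are hypothetical, lowercasing the tokens first is an assumption, and the WordNet data must already be installed (for example via nltk.download('wordnet')):

```python
def lemmatize_tokens(tokens, lemmatize):
    """Apply the callable `lemmatize` to every token of one sentence.

    Assumption: tokens are lowercased before being lemmatized.
    """
    return [lemmatize(token.lower()) for token in tokens]


def make_wordnet_lemmatizer():
    """Return the lemmatize method of NLTK's WordNetLemmatizer.

    Imported lazily so that lemmatize_tokens can also be used with any
    other word -> lemma function.
    """
    from nltk.stem import WordNetLemmatizer
    return WordNetLemmatizer().lemmatize


if __name__ == "__main__":
    lemmatize = make_wordnet_lemmatizer()
    print(lemmatize_tokens(["The", "cats", "are", "sleeping"], lemmatize))
```

Called with a single argument, WordNetLemmatizer.lemmatize treats each word as a noun, so verb forms are generally left unchanged unless a part-of-speech tag is passed through its pos parameter.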
