Sentence segmentation and alignment in a noisy text corpus

Asked by htaghizadeh on 31 January 2013 at 12:48

I have a parallel corpus which contains about 100,000 aligned paragraphs in Arabic and Persian.

The corpus is noisy: the paragraphs are incomplete translations of each other (i.e., parts of the Arabic paragraphs are not translated into Persian, and the punctuation marks do not match either).

To divide the paragraphs into sentences, I split on the punctuation marks, but the sentence counts on the two sides do not match.
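For illustration, punctuation-based splitting of this kind could look roughly like the sketch below. The set of sentence-final marks (Latin `.`, `!`, `?`, the Arabic question mark U+061F and the Arabic full stop U+06D4) and the function name `split_sentences` are assumptions made for this example, not the exact rules used on the corpus.

```python
import re
from typing import List

# Assumed sentence-final marks: . ! ? plus the Arabic question mark (U+061F)
# and the Arabic full stop (U+06D4). Extend this set to suit the corpus.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph at whitespace that follows a sentence-final mark."""
    parts = SENTENCE_END.split(paragraph.strip())
    # Drop empty strings, which appear when the paragraph itself is empty.
    return [p for p in parts if p]


if __name__ == "__main__":
    source_side = "First sentence. Second sentence? Third sentence!"
    target_side = "First sentence. Third sentence!"
    # The two sides produce different counts, which is the mismatch described above.
    print(len(split_sentences(source_side)), len(split_sentences(target_side)))
```

Because each side is split independently, any untranslated sentence or missing punctuation mark on one side shifts the counts, so the split output cannot simply be paired up by position.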

Then I used Microsoft Aligner to align the sentences, but the results contain many errors.

How can I segment and align the sentences of this corpus?


Answered by Ben Allison on 06 February 2013 at 09:47

You've used the GIZA++ tag in your question: have you looked at using the alignment tools from there? The other option that I know quite a few people use is Moses, a fully featured statistical MT package, but I believe you can invoke its alignment models in isolation if that is really all you want.
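To make the sentence-level problem concrete, here is a minimal sketch of a length-based dynamic-programming aligner that tolerates untranslated sentences. It does not call GIZA++ or Moses and is not taken from either package; it is a simplified stand-in in the spirit of length-based sentence aligners. The bead shapes, the penalty values in `BEADS`, and the names `length_cost` and `align` are all assumptions chosen for illustration and would need tuning on real Arabic and Persian data.

```python
from typing import Dict, List, Tuple

# Allowed bead shapes (source sentences, target sentences) and fixed penalties.
# (1, 0) and (0, 1) let a sentence stay unaligned, which models untranslated text.
# The penalty values are illustrative assumptions, not tuned parameters.
BEADS: Dict[Tuple[int, int], float] = {
    (1, 1): 0.0,
    (1, 0): 0.45,
    (0, 1): 0.45,
    (2, 1): 0.1,
    (1, 2): 0.1,
}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative character-length mismatch in [0, 1); zero when one side is empty."""
    if src_len == 0 or tgt_len == 0:
        return 0.0  # only the fixed bead penalty applies
    return abs(src_len - tgt_len) / (src_len + tgt_len)


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return beads as (source indices, target indices) with minimal total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the chosen beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    source = ["a" * 14, "b" * 20, "c" * 9]  # middle sentence has no translation
    target = ["A" * 14, "C" * 9]
    print(align(source, target))
```

Character length alone is a weak signal for a corpus this noisy, so a sketch like this is only a starting point; lexical evidence, for example from a bilingual dictionary, would normally be added to the cost.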
