Sentence segmentation and alignment in a noisy text corpus

I have a parallel corpus which contains about 100,000 aligned paragraphs in Arabic and Persian.

The corpus is noisy: the paragraphs are incomplete translations of each other (i.e., parts of the Arabic paragraphs are not translated into Persian), and the punctuation marks do not match either.

To split the paragraphs into sentences, I used the punctuation marks, but the sentence counts on the two sides do not match.
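For reference, the punctuation-based splitting described above can be sketched roughly as follows. This is a minimal illustration, not the exact script I ran: the set of sentence-ending marks and the function name `split_sentences` are assumptions, and the Arabic comma (،) and semicolon (؛) are deliberately not treated as sentence boundaries.

```python
import re
from typing import List

# Assumed sentence-ending marks: . ! ? plus the Arabic question mark and the ellipsis.
# A split happens only where one of these marks is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?؟…])\s+")


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph into sentences, keeping the closing punctuation."""
    parts = SENTENCE_END.split(paragraph.strip())
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    arabic = "هذا كتاب. هل قرأته؟ نعم!"
    for sentence in split_sentences(arabic):
        print(sentence)
```

Because the two sides punctuate differently and some Arabic sentences have no Persian counterpart, running this on each side independently gives different sentence counts per paragraph.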

Then I used Microsoft Aligner to align the sentences, but the output contains many errors.

How can I segment and align the sentences of this corpus?

There is 1 best solution below

You tagged your question with GIZA++: have you tried the alignment tools from there? The other option that I know quite a few people use is Moses, a full-featured statistical MT package; I believe you can invoke its alignment models in isolation if that is all you need.
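One caveat: as far as I know, GIZA++ and the Moses training pipeline compute word alignments and expect sentence-aligned input, so a sentence-level step is still needed inside each paragraph. Below is a minimal sketch of a length-based dynamic-programming aligner in the spirit of Gale and Church, not the GIZA++ or Moses algorithm. The bead penalties in `BEADS` and the names `align` and `length_cost` are hypothetical and would need tuning on your data; the 1-0 and 0-1 beads are what let untranslated sentences drop out.

```python
import math
from typing import List, Tuple

# Allowed bead shapes (source count, target count) with hypothetical penalties.
# 1-0 and 0-1 model sentences that have no translation on the other side.
BEADS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 1.5, (0, 1): 1.5}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Penalty that grows as the two character lengths diverge."""
    if src_len == 0 or tgt_len == 0:
        return 0.0  # omissions are charged only through the bead penalty
    return abs(math.log(src_len / tgt_len))


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return beads as (source indices, target indices) with minimal total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end of both sides to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    # Toy data: the middle source sentence has no counterpart on the target side.
    source = ["a" * 10, "b" * 50, "c" * 60]
    target = ["x" * 10, "y" * 60]
    for src_ids, tgt_ids in align(source, target):
        print(src_ids, tgt_ids)
```

Running it per paragraph keeps the search small, and keeping only the 1-1, 2-1 and 1-2 beads gives sentence pairs that can then be passed to a word aligner. Raw character length is a weak signal on its own, so spot-check the output before relying on it.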