How to resolve mkcls taking up lots of memory and time for word alignment using GIZA++?

I am using GIZA++ to align words in bitexts from the Europarl corpus.

Before I train the alignment model with GIZA++, I need to run mkcls to build the word classes required by the Hidden Markov Model alignment algorithm, like so:

mkcls -n10 -pcorp.tok.low.src -Vcorp.tok.low.src.vcb.classes

I tried it on a small 1,000-line corpus and it worked properly, completing in a few minutes. Now I'm trying it on a corpus with 1,500,000 lines, and it's taking up 100% of one of my CPU cores (Six-Core AMD Opteron(tm) Processor 2431 × 12).

Before making the classes, I took the necessary steps: tokenizing, lowercasing everything, and filtering out lines with more than 40 words.
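For reference, the lowercasing and length-filtering steps amount to roughly the following. This is a minimal sketch, not the exact script that was run: the function name `clean_bitext` is hypothetical, the input is assumed to be already tokenized (tokens separated by whitespace), and dropping pairs with an empty side is an extra assumption added here.

```python
def clean_bitext(src_lines, tgt_lines, max_words=40):
    """Lowercase both sides of a tokenized bitext and drop over-long pairs.

    A pair is dropped when either side has more than `max_words` tokens,
    so the source and target files stay aligned line by line.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_tok = src.lower().split()
        tgt_tok = tgt.lower().split()
        # Assumption: pairs with an empty side are dropped as well.
        if not src_tok or not tgt_tok:
            continue
        if len(src_tok) > max_words or len(tgt_tok) > max_words:
            continue
        kept.append((" ".join(src_tok), " ".join(tgt_tok)))
    return kept


pairs = clean_bitext(
    ["Resumption of the session", "A " * 41],
    ["Reprise de la session", "court"],
)
print(pairs)  # [('resumption of the session', 'reprise de la session')]
```

The filter is applied to sentence pairs rather than to each file separately, because removing a line from only one side would shift every alignment after it.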

Does anyone have similar experience with mkcls for GIZA++? How did you solve it? If anyone has done the same on the Europarl corpus, how long did mkcls take to run?

There are 2 best solutions below

Because the mkcls tool shipped with MOSES and GIZA++ isn't parallelized, and given the number of sentences and words in the 1.5 million lines of the Europarl corpus, it takes around 1-2 hours to make the vocabulary classes.

The other pre-GIZA++ processing steps (namely plain2snt and snt2cooc) take far less time and processing power.
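If you want a rough idea of the run time before committing to the full corpus, you can time mkcls on a smaller sample first. The following is a minimal sketch, assuming mkcls is on your PATH. The helper names `mkcls_command` and `timed_mkcls` are hypothetical. The `-n`, `-p` and `-V` options are taken from the command in the question; the `-c` option (number of classes) and the reading of `-n` as the number of optimization runs are my recollection of the mkcls usage text, so check them against your build.

```python
import subprocess
import time


def mkcls_command(corpus, n_runs=2, n_classes=None):
    """Build the mkcls argument list for a tokenized, lowercased corpus file."""
    cmd = ["mkcls", f"-n{n_runs}", f"-p{corpus}", f"-V{corpus}.vcb.classes"]
    if n_classes is not None:
        # Assumption: -c sets the number of word classes.
        cmd.insert(1, f"-c{n_classes}")
    return cmd


def timed_mkcls(corpus, n_runs=2, n_classes=None):
    """Run mkcls once and return the wall-clock time in seconds."""
    cmd = mkcls_command(corpus, n_runs, n_classes)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return time.perf_counter() - start


# Reproduces the command from the question; nothing is executed here.
print(" ".join(mkcls_command("corp.tok.low.src", n_runs=10)))
# mkcls -n10 -pcorp.tok.low.src -Vcorp.tok.low.src.vcb.classes
```

Calling `timed_mkcls` on, say, a 100,000-line sample with a lower `-n` value gives a data point to extrapolate from, though the cost need not scale linearly with the number of lines.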

Try mgiza (http://www.kyloo.net/software/doku.php/mgiza:overview ), which supports multi-threading. It should significantly decrease the amount of time needed to accomplish your task.