How are words clustered into word classes in the mkcls files of GIZA++, and on what basis are they grouped?

  1. What is the use of mkcls in GIZA++?

  2. When I run mkcls, four files are generated: *.vcb.classes and *.vcb.classes.cats, for each of the source and target languages.

The output of *.vcb.classes is:

.      9
book  10
gave   4
he     3
him    5
i      7
loved  8
read   8
the    2

What do these numbers refer to? Are they word class numbers? If so, how are they generated, and on what basis are the words assigned to different classes?

There are 1 best solutions below

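The 'mkcls' program groups the words of a corpus into equivalence classes (word classes), and GIZA++ uses these classes during word alignment. The number next to each word in *.vcb.classes is the ID of the class that word was assigned to, so words with the same number, such as "loved" and "read" (both 8), belong to the same class. The grouping is learned automatically from the training text rather than from any linguistic annotation. For details, see Franz Josef Och, "An Efficient Method for Determining Bilingual Word Classes" (EACL 1999).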
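To see which words ended up together, it can help to invert the word-to-class listing. The following is only a minimal sketch, assuming each line of a *.vcb.classes file is a word followed by whitespace and an integer class ID, as in the output shown in the question. The function names `read_classes` and `group_by_class` are made up for illustration and are not part of mkcls or GIZA++.

```python
from collections import defaultdict


def read_classes(lines):
    """Map each word to its class ID.

    Assumes every useful line has the form '<word> <class_id>';
    lines that do not split into exactly two fields are skipped.
    """
    word_to_class = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        word, class_id = parts
        word_to_class[word] = int(class_id)
    return word_to_class


def group_by_class(word_to_class):
    """Invert the mapping: class ID -> sorted list of member words."""
    groups = defaultdict(list)
    for word, class_id in word_to_class.items():
        groups[class_id].append(word)
    return {cid: sorted(words) for cid, words in sorted(groups.items())}


# Sample data copied from the *.vcb.classes output in the question.
sample = """\
.      9
book  10
gave   4
he     3
him    5
i      7
loved  8
read   8
the    2
""".splitlines()

word_to_class = read_classes(sample)
classes = group_by_class(word_to_class)

for class_id, words in classes.items():
    print(class_id, words)
```

For a real file, the same functions should work on an open file handle in place of `sample`, since iterating over a text file yields its lines.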