Some questions about the usage of the Hunspell data format in the Hungarian Hunspell dictionary


After reading through the Hunspell docs, I started looking at what seems to be the most advanced set of Hunspell dictionary files. The Hungarian one (Hun-garian Spell) appears to be the most robust.

I have a few questions that seem to be unanswered by the 17-page PDF docs (which appear to be the only real resource on Hunspell, other than the source code).

1. The meaning of the decimal numbers?

For example, the number 1547. We see it here:

AF @ # 1547

And it is used in PFX but not SFX:

PFX r 0 legújra/1547 . 24583
PFX r 0 legújjá/1547 . 24584
PFX r 0 legössze/1547 . 24585
PFX r 0 legát/1547 . 24586
PFX r 0 legáltal/1547 . 24587
PFX r 0 legvégig/1547 . 24588
PFX r 0 legvégbe/1547 . 24589
...

The thing after the slash is a flag, as far as I have learned, but where is that flag defined? The line AF @ # 1547 has 1547 only as a comment, so I'm not sure. Looking further at AF, it appears that the first line, AF 1548, means that 1548 AF values follow, and AF @ is the second to last one in the list, so maybe that's it?!

So then what does the @ symbol mean with regard to AF, which the docs describe as:

Hunspell can substitute affix flag sets with ordinal numbers in affix rules (alias compression, see makealias tool).

I'm not following....

2. The meaning of the last decimal numbers on PFX?

Like we have from above:

PFX r 0 legát/1547 . 24586

That is the only place 24586 appears in the .aff file. So what does it mean? Same for all the numbers in that position. Line #24586 in the .dic file doesn't seem related either:

lódenkabát/39   1

3. What does the /number mean in the .dic file?

Regarding that last example:

lódenkabát/39   1

What do the /39 and the 1 mean? Where are those defined? I would have assumed I'd find a PFX 39 or SFX 39 defined in the .aff file, but I don't see one.

Best answer

I learned more by looking at the alias2 test files in the Hunspell source (alias2.aff and the other alias2 files):

Files

alias2.aff:

AF 2
AF AB
AF A

AM 3
AM is:affix_x
AM ds:affix_y
AM po:noun xx:other_data

SFX A Y 1
SFX A 0 x . 1

SFX B Y 1
SFX B 0 y/2 . 2

alias2.dic:

1
foo/1   3

alias2.good:

foo
foox
fooy
fooyx

alias2.morph:

> foo
analyze(foo) =  st:foo po:noun xx:other_data
stem(foo) = foo
> foox
analyze(foox) =  st:foo po:noun xx:other_data is:affix_x
stem(foox) = foo
> fooy
analyze(fooy) =  st:foo po:noun xx:other_data ds:affix_y
stem(fooy) = fooy
> fooyx
analyze(fooyx) =  st:foo po:noun xx:other_data ds:affix_y is:affix_x
stem(fooyx) = fooy

Explanation

Explaining the AM

Stands for "morphological alias"?

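So this is saying the numbers are positions in the AM and AF lists, counted from 1 starting at the first entry after the count line (AM 3 or AF 2). In other words, they are line numbers relative to where the AM and AF lists start! That is crazy to me, so brittle. But anyway....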
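To make the "position in the list" idea concrete, here is a minimal Python sketch that reads the AF and AM tables out of the alias2.aff text and looks an entry up by its 1-based number. The function names parse_alias_table and resolve_alias are made up for this illustration; this is not Hunspell's own code, and it assumes that an AF flag set is a single whitespace-free token (so a trailing "# 1547" style comment is ignored).

```python
ALIAS2_AFF = """\
AF 2
AF AB
AF A

AM 3
AM is:affix_x
AM ds:affix_y
AM po:noun xx:other_data
"""


def parse_alias_table(aff_text, keyword):
    """Collect the entries of an AF or AM table in file order.

    The first matching line carries the entry count; every later matching
    line is one entry. Hypothetical helper, not part of Hunspell.
    """
    entries = []
    declared = None
    for line in aff_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] != keyword:
            continue
        if declared is None:
            declared = int(fields[1])          # e.g. "AF 2" declares 2 entries
        elif keyword == "AF":
            entries.append(fields[1])          # assumed: flag set is one token
        else:
            entries.append(" ".join(fields[1:]))  # AM keeps all its fields
    if declared is not None and declared != len(entries):
        raise ValueError(f"{keyword}: declared {declared}, found {len(entries)}")
    return entries


def resolve_alias(entries, number):
    """Aliases are 1-based positions in the table."""
    return entries[number - 1]


af_table = parse_alias_table(ALIAS2_AFF, "AF")
am_table = parse_alias_table(ALIAS2_AFF, "AM")
print(resolve_alias(af_table, 1))  # AB
print(resolve_alias(am_table, 3))  # po:noun xx:other_data
```

Under the same assumption, a line such as AF @ # 1547 would be read as the flag set @, with the trailing number being just a human-readable reminder of its position.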

SFX A 0 x . 1

That 1 is referring to AM morphological_fields (from the docs). So it is marking this suffix as AM 1 which is the first AM: is:affix_x. That corresponds to our alias2.morph file, where it shows:

> foox
analyze(foox) =  st:foo po:noun xx:other_data is:affix_x
stem(foox) = foo

Notice the is:affix_x.

Now, foox has more. This is because in the .dic file, it says:

foo/1   3

That 3 (the field after the whitespace) is pointing to another AM, the third and last one:

po:noun xx:other_data

So combining the two AM entries (AM 3 from the .dic entry, then AM 1 from the suffix) gives the fields shown after st:foo in alias2.morph:

po:noun xx:other_data is:affix_x
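The combination can be sketched in a few lines of Python. This only mimics the ordering visible in alias2.morph (stem, then the .dic entry's fields, then each suffix's fields in the order the suffixes are attached); the helper names expand_morph and analyze are invented for the example, and nothing here claims to be how Hunspell implements analysis internally.

```python
# AM table from alias2.aff, in file order (alias 1 is the first entry).
AM = ["is:affix_x", "ds:affix_y", "po:noun xx:other_data"]


def expand_morph(field, am_table):
    """Treat an all-digit field as a 1-based AM alias; leave anything else as is."""
    if field.isdigit():
        return am_table[int(field) - 1]
    return field


def analyze(stem, dic_morph, suffix_morphs, am_table):
    """Join the stem, the .dic morph field and the suffix morph fields."""
    parts = ["st:" + stem, expand_morph(dic_morph, am_table)]
    parts += [expand_morph(m, am_table) for m in suffix_morphs]
    return " ".join(parts)


# foo/1 3 in the .dic, plus "SFX A 0 x . 1"
print(analyze("foo", "3", ["1"], AM))
# st:foo po:noun xx:other_data is:affix_x

# foo/1 3 in the .dic, plus "SFX B 0 y/2 . 2" and then "SFX A 0 x . 1"
print(analyze("foo", "3", ["2", "1"], AM))
# st:foo po:noun xx:other_data ds:affix_y is:affix_x
```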

Explaining the AF

Stands for "affix flag" alias: each AF line is a set of flags that can then be referred to by its position number.

The /1 here in the .dic references the AF position:

foo/1

And the /2 in the .aff does as well:

SFX B 0 y/2 . 2

So for the y/2, that is saying that a word formed with suffix y carries flag set 2. AF 2 is AF A, which links to SFX A, which is the x suffix. In other words, x can come after y, which is why fooyx is in alias2.good.

I was a bit confused at foo/1, which is an alias for foo/AB. Couldn't you just write foo/A? No: foo/A would only allow foox. foo/AB says that both SFX A and SFX B can attach directly to foo (giving foox and fooy), and the /2 on the y suffix is what then allows x after y (fooyx). The x suffix carries no flags of its own, so nothing can follow it and fooxy is not generated.
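As a rough check of that reading, the following Python sketch expands foo/1 using the AF table and the two suffix rules and reproduces the four words in alias2.good. It is a toy under stated assumptions: single-character flags, suffixes that strip nothing and have no condition (as in alias2.aff), and a depth limit of two suffixes, which I believe matches Hunspell's two-level suffix handling but have not verified in the source. The names flags_for and generate are hypothetical.

```python
# AF table from alias2.aff: alias 1 -> "AB", alias 2 -> "A".
AF = ["AB", "A"]

# flag -> (text appended, continuation flag alias or "" for none)
# SFX A 0 x . 1      -> appends "x", no continuation flags
# SFX B 0 y/2 . 2    -> appends "y", result carries flag alias 2
SUFFIXES = {"A": ("x", ""), "B": ("y", "2")}


def flags_for(alias, af_table):
    """Resolve a 1-based AF alias to its flag set; empty alias means no flags."""
    return af_table[int(alias) - 1] if alias else ""


def generate(word, alias, af_table, suffixes, depth=2):
    """List the word plus every form reachable by attaching allowed suffixes."""
    forms = [word]
    if depth == 0:
        return forms
    for flag in flags_for(alias, af_table):   # assumed: one character per flag
        if flag in suffixes:
            text, continuation = suffixes[flag]
            forms += generate(word + text, continuation, af_table, suffixes, depth - 1)
    return forms


print(generate("foo", "1", AF, SUFFIXES))
# ['foo', 'foox', 'fooy', 'fooyx']
```

If alias 1 is changed from AB to just A, the same function returns only foo and foox, which is the difference between foo/AB and foo/A.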