Get vocabulary list in Galago


I am using the Galago retrieval toolkit (part of the Lemur project) and I need a list of all vocabulary terms in the collection (all unique terms), ideally as a List<String> or Set<String>. How can I obtain such a list?

There is 1 best solution below

The `DumpKeysFn` class seems to print all the keys (unique terms) of an index part. Following the same approach, the code should look like this, where fileName is the path to the index part to read:

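import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Package names as found in Galago 3.x; they may differ in other versions.
import org.lemurproject.galago.core.index.IndexPartReader;
import org.lemurproject.galago.core.index.KeyIterator;
import org.lemurproject.galago.core.index.disk.DiskIndex;

public static Set<String> getAllVocabularyTerms(String fileName) throws IOException {
    Set<String> result = new HashSet<>();
    IndexPartReader reader = DiskIndex.openIndexPart(fileName);
    try {
        // An empty index part has no keys, so return the empty set.
        if (reader.getManifest().get("emptyIndexFile", false)) {
            return result;
        }

        // Walk over every key in the part; each key is one unique term.
        KeyIterator iterator = reader.getIterator();
        while (!iterator.isDone()) {
            result.add(iterator.getKeyString());
            iterator.nextKey();
        }
    } finally {
        // Release the reader even if iteration fails.
        reader.close();
    }
    return result;
}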
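
Since the question asks for either a Set<String> or a List<String>, here is a minimal sketch of turning the returned set into an alphabetically sorted list. The class name `VocabularyListDemo` and the helper `toSortedList` are hypothetical, and the hard-coded terms only stand in for the result of `getAllVocabularyTerms`; nothing here calls Galago itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class VocabularyListDemo {

    // Hypothetical helper: copies the terms into a list sorted in
    // natural (lexicographic) String order.
    public static List<String> toSortedList(Set<String> terms) {
        return new ArrayList<>(new TreeSet<>(terms));
    }

    public static void main(String[] args) {
        // Stand-in data; in practice this would be the set returned by
        // getAllVocabularyTerms(pathToIndexPart).
        Set<String> terms = new HashSet<>(Arrays.asList("retrieval", "galago", "index"));

        List<String> vocabulary = toSortedList(terms);
        System.out.println(vocabulary);
    }
}
```

If only ordered iteration is needed, creating `result` as a `TreeSet` inside `getAllVocabularyTerms` would avoid the extra copy, at the cost of slower inserts on a large vocabulary.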