How to get a list of all tokens from Lucene 8.6.1 index?

669 Views Asked by PSK At 19 November 2020 at 22:38

I have looked at "How to get a list of all tokens from Solr/Lucene index?", but Lucene 8.6.1 doesn't seem to offer IndexReader.terms(). Has it been moved or replaced? Is there an easier way than this answer?


There is 1 best solution below

andrewJames On 20 November 2020 at 02:03 BEST ANSWER

Some History

You asked: I'm just wondering if IndexReader.terms() has moved or been replaced by an alternative.

The Lucene v3 method IndexReader.terms() was moved to AtomicReader in Lucene v4. This was documented in the v4 alpha release notes.

(Bear in mind that Lucene v4 was released way back in 2012.)

The method in AtomicReader in v4 takes a field name.

As the v4 release notes state:

One big difference is that field and terms are now enumerated separately: a TermsEnum provides a BytesRef (wraps a byte[]) per term within a single field, not a Term.

The key part there is "per term within a single field". So from that point onward there was no longer a single API call to retrieve all terms from an index.

This approach has carried through to later releases - except that the AtomicReader and AtomicReaderContext classes were renamed to LeafReader and LeafReaderContext in Lucene v5.0.0. See LUCENE-5569.

Recent Releases

That leaves us with the ability to access lists of terms - but only on a per-field basis.

The following Java code is based on Lucene 8.7.0 (the latest release at the time of writing), but should also hold true for the version you mention (8.6.1):

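import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

private void getTokensForField(IndexReader reader, String fieldName) throws IOException {
    // Each leaf is one index segment, so a term that occurs in
    // several segments is printed once per segment.
    List<LeafReaderContext> list = reader.leaves();

    for (LeafReaderContext lrc : list) {
        // terms() returns null if this segment has no terms for the field.
        Terms terms = lrc.reader().terms(fieldName);
        if (terms != null) {
            TermsEnum termsEnum = terms.iterator();

            // next() returns null once the enumeration is exhausted.
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                System.out.println(term.utf8ToString());
            }
        }
    }
}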

The above example assumes an index as follows:

private static final String INDEX_PATH = "/path/to/index/directory";
...
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));

If you need to enumerate field names, the code in this question may provide a starting point.
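As a rough sketch of how field enumeration could be combined with the per-field loop above: each segment exposes its field metadata through LeafReader.getFieldInfos(), so one option is to walk the fields of every segment and collect their terms into a sorted set per field. The class name AllTerms and the method name collect are made up for illustration, and the sketch assumes every term is valid UTF-8 text. It has not been tested against a large index, where holding every term in memory may not be practical.

```java
import java.io.IOException;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene.
public final class AllTerms {

    // Returns field name -> distinct terms, merged across all segments.
    public static Map<String, SortedSet<String>> collect(IndexReader reader) throws IOException {
        Map<String, SortedSet<String>> termsByField = new TreeMap<>();

        for (LeafReaderContext lrc : reader.leaves()) {
            LeafReader leaf = lrc.reader();

            for (FieldInfo info : leaf.getFieldInfos()) {
                // Skip fields that are stored or doc-values only (no inverted index).
                if (info.getIndexOptions() == IndexOptions.NONE) {
                    continue;
                }
                Terms terms = leaf.terms(info.name);
                if (terms == null) {
                    continue;
                }
                SortedSet<String> fieldTerms =
                        termsByField.computeIfAbsent(info.name, k -> new TreeSet<>());

                TermsEnum termsEnum = terms.iterator();
                BytesRef term;
                while ((term = termsEnum.next()) != null) {
                    // Assumes the term bytes are UTF-8 text.
                    fieldTerms.add(term.utf8ToString());
                }
            }
        }
        return termsByField;
    }
}
```

Collecting into a set also removes the duplicates that appear when the same term exists in more than one segment.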

Final Note

I guess you can also access terms on a per document basis, instead of a per field basis, as mentioned in the comments. I have not tried this.
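For what it is worth, a per-document lookup might look something like the sketch below. It relies on IndexReader.getTermVector(docID, field), which only returns data if the field was indexed with term vectors enabled; otherwise it returns null. The class name DocTerms and the method name termsForDocument are invented for this example, and like the note above, this is an untested outline rather than a recommendation.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene.
public final class DocTerms {

    // Returns the terms of one field of one document, or an empty list
    // if no term vector was stored for that field.
    public static List<String> termsForDocument(IndexReader reader, int docId, String fieldName)
            throws IOException {
        List<String> result = new ArrayList<>();

        Terms vector = reader.getTermVector(docId, fieldName);
        if (vector == null) {
            return result;
        }

        TermsEnum termsEnum = vector.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            // Assumes the term bytes are UTF-8 text.
            result.add(term.utf8ToString());
        }
        return result;
    }
}
```

Here docId is Lucene's internal document number (0 up to reader.maxDoc() - 1), not an application-level identifier.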
