How to view Lucene Index


I am trying to learn how Lucene works and what is inside a Lucene index. Basically, I want to see how the data is represented inside the index.

I am using lucene-core 8.6.0 as a dependency.

Below is my very basic Lucene code:

    private Document create(File file) throws IOException {
        Document document = new Document();

        Field field = new Field("contents", new FileReader(file), TextField.TYPE_NOT_STORED);
        Field fieldPath = new Field("path", file.getAbsolutePath(), TextField.TYPE_STORED);
        Field fieldName = new Field("name", file.getName(), TextField.TYPE_STORED);

        document.add(field);
        document.add(fieldPath);
        document.add(fieldName);

        //Create analyzer
        Analyzer analyzer = new StandardAnalyzer();

        //Create IndexWriter pass the analyzer

        Path indexPath = Files.createTempDirectory("tempIndex");
        Directory directory = FSDirectory.open(indexPath);
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, indexWriterConfig);
        iwriter.addDocument(document);
        iwriter.close();
        return document;
    }

Note: I understand the concept behind Lucene (the inverted index), but I do not understand how the Lucene library applies this concept, or how it lays out the index files so that searching is easy and efficient.

I tried LIMO, but it was of no use. It simply did not work, even though I gave the index location in the web.xml.


There are 2 best solutions below

0
On

If the index is large (e.g. hundreds of GBs), Luke sometimes fails to open it. There is a command-line alternative to Luke, called I-REX. It was developed for research in Information Retrieval. Here is the link to it: https://github.com/souravsaha/I-REX/tree/shell-lucene8

Feel free to add to or edit the code.

0
On

If you would like to see a good introductory code example, using the current version of Lucene (building an index and then using it), you can start with the basic demo (choose your version - this link is for Lucene 8.6).

The source code for the demo (using the latest version of Lucene) can be found here on GitHub.

If you would like to explore your indexed data once it has been created, you can use Luke. In case you have not used it before: to run Luke, download a binary release of Lucene from the main download page, unzip the file, and navigate to the luke directory. Then run the relevant script (luke.bat or luke.sh).

(The only version of the LIMO tool I could find is this one on SourceForge. Given that it is from 2007, it is almost certainly no longer compatible with the latest Lucene index files. Maybe there is a more recent version somewhere.)

If you would like an overview of the files in a typical Lucene index, you can start here.
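As a rough companion to that overview, here is a minimal sketch that lists the files in an index directory and labels each one by its extension. It uses only the JDK, not Lucene. The class name `IndexFileLister` and the extension table are my own assumptions for illustration: the table is partial, written from memory of the Lucene 8.x file-format documentation, and an index that uses compound files will mostly show just `.cfs`, `.cfe` and `.si` files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper (not part of Lucene): labels index files by extension.
class IndexFileLister {

    // Partial, hand-written table: an assumption, check it against the
    // file-format documentation for your Lucene version.
    private static final Map<String, String> ROLES = new TreeMap<>();
    static {
        ROLES.put("si", "segment info");
        ROLES.put("cfs", "compound file (data)");
        ROLES.put("cfe", "compound file (entry table)");
        ROLES.put("fnm", "field infos");
        ROLES.put("fdt", "stored field data");
        ROLES.put("fdx", "stored field index");
        ROLES.put("tim", "term dictionary");
        ROLES.put("tip", "term index");
        ROLES.put("doc", "postings: doc ids and frequencies");
        ROLES.put("pos", "postings: positions");
        ROLES.put("nvd", "norms data");
        ROLES.put("nvm", "norms metadata");
        ROLES.put("dvd", "doc values data");
        ROLES.put("dvm", "doc values metadata");
        ROLES.put("liv", "live documents");
    }

    // Returns a short description for one file name.
    static String describe(String fileName) {
        if (fileName.startsWith("segments_")) {
            return "commit point (list of segments)";
        }
        if (fileName.equals("write.lock")) {
            return "write lock";
        }
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1);
        return ROLES.getOrDefault(ext, "unknown to this sketch");
    }

    // Returns one formatted line per regular file in the directory, sorted by path.
    static List<String> list(Path indexDir) throws IOException {
        List<String> lines = new ArrayList<>();
        try (Stream<Path> files = Files.list(indexDir)) {
            files.filter(Files::isRegularFile).sorted().forEach(p -> {
                String name = p.getFileName().toString();
                lines.add(String.format("%-24s %s", name, describe(name)));
            });
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Pass the index directory as the first argument (defaults to the current directory).
        list(Paths.get(args.length > 0 ? args[0] : ".")).forEach(System.out::println);
    }
}
```

Run it against the directory you passed to `FSDirectory.open`. Note that the code in the question writes to a temporary directory, so you would need to print `indexPath` first to know where to look.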

Many specific questions can be answered by looking at the API documentation for relevant packages and classes.

Personally, I have also found the Solr and Elasticsearch documentation to be very useful for explaining specific concepts, which are often directly relevant to Lucene.

Beyond that, I don't worry too much about how Lucene manages its internal index data structures. Instead I focus on the different types of analyzer and query which can be used to access that data.


Update: SimpleTextCodec

It is now a few months later, but here is one more way to explore Lucene's index data: SimpleTextCodec. The standard Lucene codec (how data is written to index files and read from them) uses a binary format - and is therefore not human readable. You can't just open an index file and see what's in there.
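To make "binary format" a little more concrete: as far as I know, most files written by the standard codec begin with a small codec header, which is a 4-byte magic number (`0x3fd76c17`, big-endian), followed by a length-prefixed ASCII name of the format that wrote the file. The sketch below uses only the JDK to peek at that header. The class name `CodecHeaderPeek` is hypothetical, and the header layout is my reading of Lucene 8.x's `CodecUtil`, so treat it as an assumption rather than a specification.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper (not part of Lucene): reads the start of an index file.
class CodecHeaderPeek {

    // Magic number assumed to open a Lucene codec header.
    static final int CODEC_MAGIC = 0x3fd76c17;

    // Returns the format name stored in the header, or empty if the file
    // does not start with the expected magic number.
    static Optional<String> formatName(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            if (in.readInt() != CODEC_MAGIC) {      // readInt() is big-endian
                return Optional.empty();
            }
            // The name length is a variable-length int; names are short,
            // so this sketch only handles the single-byte case.
            int length = in.readUnsignedByte();
            if (length > 127) {
                return Optional.empty();
            }
            byte[] name = new byte[length];
            in.readFully(name);
            return Optional.of(new String(name, StandardCharsets.US_ASCII));
        } catch (EOFException e) {
            return Optional.empty();                // file too short to hold a header
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more index file paths as arguments.
        for (String arg : args) {
            String name = formatName(Paths.get(arg)).orElse("(no codec header found)");
            System.out.println(arg + " -> " + name);
        }
    }
}
```

Everything after the header is format-specific binary data, which is why a text editor shows mostly unreadable bytes.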

However, if you change the codec to SimpleTextCodec, then Lucene will create plain-text index files, where you can see the structure more clearly.

This codec is provided purely for information/education, and should not be used in production.

To use the codec, you first need to include the relevant dependency - for example, like this:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-codecs</artifactId>
    <version>8.7.0</version>
</dependency>

Now you can use this new codec as follows:

iwc.setCodec(new SimpleTextCodec());

So, for example:

final String indexPath = "/path/to/index_dir";
final String docsPath = "/path/to/inputs_dir";
final Path docDir = Paths.get(docsPath);
Directory dir = FSDirectory.open(Paths.get(indexPath));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
iwc.setCodec(new SimpleTextCodec());
System.out.println(iwc.getCodec().getName());
try (IndexWriter writer = new IndexWriter(dir, iwc)) {
    // read documents, and write index data
    // (indexDocs is the helper method from the Lucene demo's IndexFiles class):
    indexDocs(writer, docDir);
}

You are now free to inspect the resulting index files in a text editor (e.g. Notepad++).

In my case, the index data resulted in several files - but the one I was interested in here was my *.scf file - a "compound" file, containing various “virtual file” sections, where the human-readable index data was stored.
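If the file is too long to browse comfortably, a small program can pull out just the parts you care about. The sketch below (JDK only) collects the terms listed under each field. It assumes the postings section uses lines of the form `field <name>` followed by indented `term <text>` lines, which is what I saw in my own output. That layout is an observation, not a documented contract, and the class name `SimpleTextTermScan` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): scans a plain-text index file.
class SimpleTextTermScan {

    // Returns field name -> terms, in the order they appear in the file.
    static Map<String, List<String>> termsByField(Path file) throws IOException {
        Map<String, List<String>> result = new LinkedHashMap<>();
        String currentField = null;
        // ISO-8859-1 never fails on unexpected bytes; non-ASCII terms will
        // look garbled, which is acceptable for a quick inspection.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.startsWith("field ")) {
                    // Other sections also contain "field" lines, so a field is
                    // only recorded once a term shows up underneath it.
                    currentField = trimmed.substring("field ".length());
                } else if (trimmed.startsWith("term ") && currentField != null) {
                    result.computeIfAbsent(currentField, k -> new ArrayList<>())
                          .add(trimmed.substring("term ".length()));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of the plain-text index file (e.g. the *.scf file).
        termsByField(Paths.get(args[0]))
                .forEach((field, terms) -> System.out.println(field + ": " + terms));
    }
}
```

Because the compound file concatenates several sections, the scan is a heuristic: it is good enough to see which terms the analyzer produced for each field, but it is no substitute for reading the file itself.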