I am trying to learn how Lucene works and what is inside a Lucene index. Basically, I want to see how the data is represented inside the index.
I am using lucene-core 8.6.0 as a dependency.
Below is my very basic Lucene code:
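import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

private Document create(File file) throws IOException {
    // Build a document with an indexed-only body and two stored fields
    Document document = new Document();
    Field field = new Field("contents", new FileReader(file), TextField.TYPE_NOT_STORED);
    Field fieldPath = new Field("path", file.getAbsolutePath(), TextField.TYPE_STORED);
    Field fieldName = new Field("name", file.getName(), TextField.TYPE_STORED);
    document.add(field);
    document.add(fieldPath);
    document.add(fieldName);
    // Create the analyzer
    Analyzer analyzer = new StandardAnalyzer();
    // Create the IndexWriter, passing it the analyzer
    Path indexPath = Files.createTempDirectory("tempIndex");
    Directory directory = FSDirectory.open(indexPath);
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, indexWriterConfig);
    iwriter.addDocument(document);
    iwriter.close();
    return document;
}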
Note: I understand the concept behind Lucene (the inverted index), but I do not understand how the Lucene library uses this concept, or how the index files are created so that search is easy and feasible.
I tried Limo, but to no avail. It just did not work, even though I gave the index location in the web.xml.
If you would like to see a good introductory code example, using the current version of Lucene (building an index and then using it), you can start with the basic demo (choose your version - this link is for Lucene 8.6).
The source code for the demo (using the latest version of Lucene) can be found here on GitHub.
If you would like to explore your indexed data, once it has been created, you can use Luke. In case you have not used it before: To run Luke, you need to download a binary release from the main download page. Unzip the file, and then navigate to the luke directory. Then run the relevant script (luke.bat or luke.sh).
(The only version of the LIMO tool I could find is this one on Sourceforge. Given it is from 2007, it is almost certainly no longer compatible with the latest Lucene index files. Maybe there is a more updated version somewhere.)
If you would like an overview of the files in a typical Lucene index, you can start here.
Many specific questions can be answered by looking at the API documentation for relevant packages and classes.
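As one illustration of what the API exposes, here is a minimal sketch that opens an existing index and lists every term in one field together with its document frequency, which is roughly the term dictionary of the inverted index. It assumes Lucene 8.x with lucene-core on the classpath; the class name TermDumper and the method listTerms are made up for this example, and "contents" is the field name from the question's code.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene.
public class TermDumper {

    // Returns term -> number of documents containing it, in index (sorted) order.
    static Map<String, Integer> listTerms(Path indexPath, String field) throws IOException {
        Map<String, Integer> result = new LinkedHashMap<>();
        try (Directory directory = FSDirectory.open(indexPath);
             IndexReader reader = DirectoryReader.open(directory)) {
            // MultiTerms merges the per-segment term dictionaries into one view.
            Terms terms = MultiTerms.getTerms(reader, field);
            if (terms == null) {
                return result; // the field does not exist or has no indexed terms
            }
            TermsEnum termsEnum = terms.iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                result.put(term.utf8ToString(), termsEnum.docFreq());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java TermDumper <index directory>");
            return;
        }
        listTerms(Paths.get(args[0]), "contents")
                .forEach((term, docFreq) -> System.out.println(term + " -> " + docFreq));
    }
}
```

Run it with the index directory as the only argument. A stored-only field would produce no terms here, because only indexed fields contribute to the term dictionary.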
Personally, I have also found the Solr and ElasticSearch documentation to be very useful for explaining specific concepts, which are often directly relevant to Lucene.
Beyond that, I don't worry too much about how Lucene manages its internal index data structures. Instead I focus on the different types of analyzer and query which can be used to access that data.
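To make the analyzer side of that concrete, here is a small sketch that prints the tokens an analyzer produces for a piece of text; these tokens are what end up as terms in the index. It assumes Lucene 8.x (where StandardAnalyzer ships in lucene-core); the class name AnalyzerTokens and the method tokens are invented for this example, and the field name passed to tokenStream is arbitrary here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, not part of Lucene.
public class AnalyzerTokens {

    // Runs the text through the analyzer and collects the resulting tokens.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("contents", text)) {
            CharTermAttribute termAttribute = stream.addAttribute(CharTermAttribute.class);
            stream.reset(); // required before the first incrementToken()
            while (stream.incrementToken()) {
                result.add(termAttribute.toString());
            }
            stream.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer()) {
            System.out.println(tokens(analyzer, "Quick Brown FOX"));
        }
    }
}
```

Swapping in a different Analyzer implementation and comparing the output is a quick way to see why a given query does or does not match.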
Update: SimpleTextCodec
It is now a few months later, but here is one more way to explore Lucene's index data: SimpleTextCodec.
The standard Lucene codec (how data is written to index files and read from them) uses a binary format - and is therefore not human readable. You can't just open an index file and see what's in there.
However, if you change the codec to SimpleTextCodec, then Lucene will create plain-text index files, where you can see the structure more clearly.
This codec is provided purely for information/education, and should not be used in production.
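To use the codec, you first need to include the relevant dependency: the lucene-codecs artifact (group ID org.apache.lucene), at the same version as your lucene-core dependency.
Now you can use this new codec by passing an instance of it to IndexWriterConfig.setCodec() before creating the IndexWriter.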
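A minimal sketch of this, assuming Lucene 8.x with both lucene-core and lucene-codecs on the classpath. The class name SimpleTextIndexDemo, the method buildIndex, and the sample text are made up for illustration; the only line specific to this technique is the setCodec call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical demo class, not part of Lucene.
public class SimpleTextIndexDemo {

    // Indexes one document with the plain-text codec and returns the index directory.
    static Path buildIndex(String text) throws IOException {
        Path indexPath = Files.createTempDirectory("simpleTextIndex");
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Replace the default binary codec with the human-readable one.
        config.setCodec(new SimpleTextCodec());
        try (Directory directory = FSDirectory.open(indexPath);
             IndexWriter writer = new IndexWriter(directory, config)) {
            Document document = new Document();
            document.add(new TextField("contents", text, Field.Store.YES));
            writer.addDocument(document);
        }
        return indexPath;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Index written to " + buildIndex("the quick brown fox"));
    }
}
```

The program prints the temporary directory it wrote to, so you know where to look for the generated files.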
You are now free to inspect the resulting index files in a text reader (e.g. Notepad++).
In my case, the index data resulted in several files - but the one I was interested in here was my *.scf file - a "compound" file, containing various "virtual file" sections, where the human-readable index data was stored.