Lucene Wikipedia Dump


I am currently indexing the Wikipedia dump (actually one from 2012, but the format is the same regardless) and would like to find out about performance costs (size and processing time).

I am using Lucene for Java v4.x and store all of the dump's fields inside the index. I am working on a machine with an i5 processor and 8 GB of RAM. I just finished indexing 5,000 articles, which took about 10 minutes and produced an index of about 5 GB.
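For reference, the indexing setup looks roughly like the following simplified sketch (the field names and the single hard-coded document are placeholders for the real loop over the dump):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class WikiIndexer {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("wiki-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
            IndexWriter writer = new IndexWriter(dir, cfg);

            // One Lucene document per article; every field is both indexed and stored.
            Document doc = new Document();
            doc.add(new StringField("title", "Some article title", Field.Store.YES));
            doc.add(new TextField("text", "Full article text goes here...", Field.Store.YES));
            writer.addDocument(doc);

            writer.close();
        }
    }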

This means that for 3.5 million articles I would end up with a 3.5 TB index, and it would take about 5 days, assuming indexing time scales linearly (which it does not). I wonder whether that is normal, given that the raw Wikipedia dump file is only 35 GB...

1 Answer

We used to have the same issue here; we did a lot of research on it, so let me share some of the things we learned along the way.

First, regarding the speed of the indexing process: you could use a multithreaded solution, or split your index into categories. Either way, you can design a solution that indexes your articles concurrently.

Examples:

1- We separated our data into categories and subcategories. That allowed us to open a separate index writer for each subcategory at the same time, which roughly multiplied our indexing throughput by the number of subcategories (see the first sketch after this list).

2- We designed a multithreaded solution: we created a fixed-size pool of threads that all use the same writer to index documents from the same category of data, and then commit the indexed data at once at the end (see the second sketch below).
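A minimal sketch of point 1, assuming one index directory (and one IndexWriter) per subcategory; the category names and the body of indexCategory are only illustrations:

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PerCategoryIndexing {
        public static void main(String[] args) {
            // One independent index per subcategory, so they can be built in parallel.
            List<String> subCategories = Arrays.asList("history", "science", "sports");
            for (final String category : subCategories) {
                new Thread(new Runnable() {
                    public void run() {
                        indexCategory(category);
                    }
                }).start();
            }
        }

        static void indexCategory(String category) {
            try {
                FSDirectory dir = FSDirectory.open(new File("index-" + category));
                IndexWriterConfig cfg = new IndexWriterConfig(
                        Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
                IndexWriter writer = new IndexWriter(dir, cfg);

                // Iterate over the articles that belong to this subcategory
                // and add each one to this subcategory's own index.
                Document doc = new Document();
                doc.add(new TextField("text", "article text for " + category, Field.Store.YES));
                writer.addDocument(doc);

                writer.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }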
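And a sketch of point 2: a fixed-size thread pool whose workers all share a single IndexWriter (which is thread-safe), with one commit once everything has been added. The pool size and field names are assumptions:

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PooledIndexing {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("wiki-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
            final IndexWriter writer = new IndexWriter(dir, cfg);

            ExecutorService pool = Executors.newFixedThreadPool(4); // fixed-size pool
            for (int i = 0; i < 4; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // Each worker would pull articles (e.g. from a shared queue)
                            // and add them through the shared writer.
                            Document doc = new Document();
                            doc.add(new TextField("text", "article text", Field.Store.YES));
                            writer.addDocument(doc);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            writer.commit(); // commit the indexed data at once, after all workers finish
            writer.close();
        }
    }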

Second, regarding the size of the index files: there is not much you can do about it directly, because you have no control over how Lucene lays out its files. In our case we decided to move to the new Lucene 4.x versions, whose format changes reduced our index size by around 60%.