Reindexing a large SQL Server database to Lucene

1k Views Asked by Andrey At 24 February 2011 at 15:45

We have a web service method that accepts some data and puts it into a Lucene index. We use it to index new and updated entries from our ASP.NET web app.

These entries are stored in a large SQL Server table (20M rows and growing), and I need a way to reindex the whole table in case the current index gets deleted or corrupted. I'm not sure what the optimal way is to retrieve chunks of data from a large table. Currently, we rely on the fact that the table has an auto-increment PK, so we fetch chunks of 1000 rows until the query starts returning nothing. Something like this (in pseudocode):

i = 0
while (true)
{
    SELECT col1, col2, col3 FROM mytable WHERE pk > i AND pk <= i + 1000
    .... if result is empty 20 times in a row, break ....
    .... otherwise send result to web service to reindex ....
    i = i + 1000
}

This way, we don't need SELECT COUNT(*), which would be a big performance killer; we just move up the pk values until we stop getting results. This has its downside: if there is a hole of more than 20,000 values somewhere in the table, indexing will stop on the assumption that it has reached the end, but that's a tradeoff we have to live with for now.

Can anyone suggest a more efficient way of getting data from a table to index? I would assume we are not the first ones facing this problem - search engines are widely used nowadays :)

Andrey On 24 February 2011 at 19:59 BEST ANSWER

I actually just figured it out - I can use IDENT_CURRENT('table_name') to get the last generated identity value, and use that instead of MAX() or COUNT(). This method should blow the other two away :)
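
For illustration, here is a rough Python sketch of how the chunked loop could look once the upper bound is known up front. It is only a sketch under some assumptions: `cursor` is any DB-API cursor that uses `?` placeholders (pyodbc does), `send_to_index` is a hypothetical stand-in for the web service call, the identity seed is assumed to be 1, and the table and column names are the ones from the question's pseudocode.

```python
from typing import Iterator, Tuple

# Query text is illustrative; table and column names come from the question.
MAX_ID_SQL = "SELECT IDENT_CURRENT('mytable')"
CHUNK_SQL = "SELECT col1, col2, col3 FROM mytable WHERE pk BETWEEN ? AND ?"


def chunk_ranges(last_id: int, chunk_size: int = 1000,
                 first_id: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, non-overlapping (low, high) pk ranges up to last_id."""
    low = first_id
    while low <= last_id:
        high = min(low + chunk_size - 1, last_id)
        yield low, high
        low = high + 1


def reindex_all(cursor, send_to_index, chunk_size: int = 1000) -> None:
    """Walk the table in pk order; holes of any size just give empty chunks."""
    cursor.execute(MAX_ID_SQL)
    # Assumes the table has an identity column, so the value is not NULL.
    last_id = int(cursor.fetchone()[0])
    for low, high in chunk_ranges(last_id, chunk_size):
        cursor.execute(CHUNK_SQL, (low, high))
        rows = cursor.fetchall()
        if rows:  # skip empty ranges instead of counting them
            send_to_index(rows)
```

Because the loop stops at a known last id rather than after 20 empty chunks, a large hole in the pk values no longer ends the run early. Rows inserted after the last id was read are not covered by this pass and would have to come through the normal indexing path.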

sisve On 24 February 2011 at 17:43

Why is COUNT(*) a performance killer? What about MAX(id)? I'd think an index would provide the information needed for those queries. You do have an index on your primary key, right?

mindas On 24 February 2011 at 22:08

For what we do with Lucene, we rarely need to reindex everything. I can't remember coming across any case where the whole index was corrupted (Lucene is actually quite safe/good at this), but there have been many times when individual items needed to be reindexed for one reason or another. I'd say the most frequent reindexing patterns are:

reindex items by given id (or set of ids)
reindex items by given period of time

The latter, of course, requires a separate db index on the relevant date field(s), which would be a bit costly for 20M+ records, but we decided to go for it (our biggest deployment had up to 10M records) as disk space is cheap these days anyway.

EDIT: added a few explanations as per the question author's comment.

If the source data structure changes, requiring reindexing of all records, our approach is to roll out new code which ensures all new data is correct (basically, it forms a correct Lucene Document from that moment on). After that, we can reindex things in batches (either automatically or by hand) by providing the relevant period ranges. To a certain extent, this also applies to Lucene version changes.
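
To make the "period ranges" idea a bit more concrete, here is a minimal Python sketch that splits a period into batch windows. This is not our actual code: the function name, the window size, and the `modified_at` column mentioned in the comment are all made up for the example.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def period_windows(start: datetime, end: datetime,
                   step: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield half-open [window_start, window_end) slices covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    current = start
    while current < end:
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


# Each window could then drive a query such as (hypothetical column name):
#   SELECT ... FROM mytable WHERE modified_at >= ? AND modified_at < ?
for window_start, window_end in period_windows(
        datetime(2011, 2, 1), datetime(2011, 2, 3, 12), timedelta(days=1)):
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Half-open windows mean that a record whose timestamp falls exactly on a boundary lands in one batch only, so a batch can be rerun on its own if it fails.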
