sphinxsearch Delta index updates

461 Views Asked by At

I have a problem with Delta-index updates.
If the document id is less than the max_doc_id, is not included in the delta-index, so as long as main-index is not updated, the changes will not apply this data.
Suppose, we have 1000 data.
If fiftieth document is changed, there will be no changes in the delta-index.
How will delta-index include documents changes that their id is less than max_doc_id?
Is there a way that delta-index includes the data are updated so that we do not have to wait main-index run?

CREATE TABLE sph_counter
(
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
);
source main
{
    # ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query = SELECT id, title, body FROM documents \
     WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

source delta : main
{
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, title, body FROM documents \
     WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}
2

There are 2 best solutions below

1
On

If you want to track document updates along with insertions, you should have a separate column for a document revision. Revision values should be unique across the document table, so it's a good idea to use global sequence to generate them.

When you update an existing document or insert a new one, you should take the next value from the revision sequence and save it in the document revision column. Sometimes it's a good idea to have DB triggers for automatic revision updates.

Then in sql_query_pre section you can save min and max revision values into sph_counter table and use them to create a proper delta index.

0
On

A really simply way I like for this is just to add a timestamp column to automatically track changed documents.

Add a column...

ALTER TABLE documents 
   ADD updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
   INDEX(updated);

The default is also important, so it newly created documents are also included.

Then can just use that in queries, with a kill list. The main will include include all documents at time of indexing. But the delta will include new and changed documents. The kill list means the old version in main, is ignored.

CREATE TABLE sph_counter
(
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL,
    indexing_time DATETIME NOT NULL
);
source main
{
    # ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id), NOW() FROM documents
    sql_query = SELECT id, title, body FROM documents
}

source delta : main
{
    sql_query_pre = SET NAMES utf8

    sql_query = SELECT id, title, body FROM documents \
     WHERE updated > ( SELECT indexing_time FROM sph_counter WHERE counter_id=1 )

    sql_query_killlist = SELECT id FROM documents \
     WHERE updated > ( SELECT indexing_time FROM sph_counter WHERE counter_id=1 )
}

(as have the kill list, no point filtering the main, duplicates wont matter. Also dont neve need max_doc_id - so sph_counter could be simplified along with the sql_query_pre. In many way its a shame you have to repeat the query in the kill list. Can't just tell sphinx to use all the docs in the index as a kill list)