Solr 8 delta query in DIH when there is no last insertion time available in MySQL database


I have to import a database of about 4 TB into Apache Solr 8. The database is MySQL, and I join three tables to get the information I need. Solr is running in cloud mode. After configuring the Solr DIH (Data Import Handler) using this guide, I was able to run a full import into Solr. My first questions:

  1. Is DIH a good choice for such a large amount of data?
  2. Is there any better option for this?

Next, I have to make sure that the Solr index stays completely in sync with the DB. It should cover the following scenarios:

  1. If a new record is added, only that record should be indexed in Solr.
  2. If a record is deleted from the DB, it should also be deleted from Solr.
  3. If an existing record is updated, the change should be visible in Solr as well.

According to my reference, all of the above is easy to handle if the MySQL tables have a column that records the insertion or modification time. In my case, however, there are only primary keys and other text data. How can I meet these requirements without any timestamp field in the database?

Note: Due to some limitations, it is not possible to add a new column to the database.

Answer by Abhijit Bashetti:

DIH is a good option here, no doubt about it.

Is DIH a good choice for such a large amount of data? Yes, there are no issues with it. You can easily use DIH.

Is there any better option for this? There is no better option. The alternative is to convert the data from the database to CSV/JSON format and then push it to Solr. Some people choose that route, but I think it is a repetitive job that adds overhead. I would suggest going with DIH.

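If a new record is added, only that record should be indexed in Solr: You can sort the data by the id field and import only the rows whose id is greater than the last id you indexed. This assumes the primary key is assigned in increasing order (for example, an auto-increment column).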
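One way to apply the "greater than the previous id" idea with DIH is to pass the last indexed id as a request parameter and run a `full-import` with `clean=false`, so existing documents are kept. The sketch below only builds such a request URL. It assumes the DIH entity query ends with something like `WHERE id > ${dataimporter.request.last_id}`, that the handler is registered at `/dataimport`, and that the host and the collection name `mycollection` are placeholders. The function name is hypothetical.

```python
from urllib.parse import urlencode


def build_incremental_import_url(solr_base, collection, last_id):
    """Build a DIH request URL that imports only rows with id > last_id.

    Assumption: the DIH entity query filters on
    ${dataimporter.request.last_id}, and ids grow monotonically.
    """
    params = {
        "command": "full-import",
        "clean": "false",  # do not wipe documents already in the index
        "commit": "true",
        "last_id": str(last_id),  # custom parameter read by the entity query
    }
    return f"{solr_base}/{collection}/dataimport?{urlencode(params)}"


# Placeholder host and collection; 1000 stands for the highest id indexed so far.
url = build_incremental_import_url("http://localhost:8983/solr", "mycollection", 1000)
print(url)
```

The caller has to remember the highest id after each run (for example, in a small file or a one-row table outside the source database) and pass it in on the next run.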

If a record is deleted from the DB, it should also be deleted from Solr: At the same time you delete the record from the DB, send a delete request to Solr with the same id to remove the document from the index.
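As a rough illustration of the delete step, Solr's JSON update format accepts a list of ids under a `delete` key. The helper below (a hypothetical name) only builds that request body. Sending it with a POST to the collection's `/update` handler using `Content-Type: application/json`, and deciding when to commit, is left to the application that performs the database delete.

```python
import json


def build_delete_payload(ids):
    """Return the JSON body for a Solr delete-by-id request.

    Ids are converted to strings so they match a string uniqueKey field;
    that the uniqueKey is a string is an assumption about the schema.
    """
    return json.dumps({"delete": [str(i) for i in ids]})


# Example: the rows with primary keys 7 and 42 were just deleted from MySQL.
payload = build_delete_payload([7, 42])
print(payload)
```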

If an existing record is updated, the change should be visible in Solr as well: At the same time you update the record in the DB, update it in Solr. This gives you real-time modification. Alternatively, store the ids of the updated records somewhere, then schedule a job (after some interval, at end of day, or nightly) that reindexes the records for those stored ids.
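The "store the ids and update them later" variant could look roughly like this. It is a minimal in-memory sketch: the `PendingUpdates` class is hypothetical, and a real setup would persist the ids somewhere durable outside the source tables, since the question rules out schema changes there. The scheduled job would take each batch, re-read those rows from MySQL, and send them to Solr.

```python
class PendingUpdates:
    """Collects ids of records changed in the DB until the next scheduled run."""

    def __init__(self):
        self._ids = set()  # a set, so repeated updates to one record count once

    def mark(self, record_id):
        """Remember that this record must be reindexed."""
        self._ids.add(record_id)

    def drain(self, batch_size=500):
        """Return all pending ids as sorted batches and clear the store."""
        ids = sorted(self._ids)
        self._ids.clear()
        return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


pending = PendingUpdates()
for changed_id in (3, 1, 3, 2):  # record 3 was updated twice during the day
    pending.mark(changed_id)

batches = pending.drain(batch_size=2)
print(batches)
```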