FTS document-level indexing, obtaining page-level results (with dtSearch example)


This is not necessarily a dtSearch-specific question (it's more of a general FTS-engine question), but it deals with a way of indexing documents composed of multiple pages and obtaining page-level hit results.

I've googled and searched and found nothing, hence my question:

We have n scanned TIFF pages from m books. We OCR them, full-text index them, and perform searches.

We want the search results to be at book level (i.e. each search result should be one book), but we also want to obtain the found items at page level, so that we can efficiently perform hit highlighting (e.g. the term SomeTerm was found on Page 1, Page 2, and Page 7).

And here comes the problem:

  • if we index each page's text as a separate document, and Page 1 of BookA contains Term1 while Page 2 of BookA contains Term2, the search Term1 AND Term2 yields no results, which is expected
  • if we index all of a book's pages as one large text block, we cannot tell which page a found term belongs to.

dtSearch Desktop has such a feature for PDF indexing: it can index all pages' text as a single document, yet still report the page on which a hit occurred, using the %%Page%% symbol.

We're using a custom DataSource to feed the indexer, but we can't work out how to structure the document text in order to achieve the desired result.
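
For illustration, if the engine you end up with can report hit offsets (character positions) within a document, one possible structure for the text fed from the DataSource is to concatenate the pages yourself and keep a table of where each page starts. Below is a rough, engine-agnostic sketch of that idea in Java; it does not use any dtSearch API, and the class and method names are made up:

    import java.util.ArrayList;
    import java.util.List;

    public class BookText {
        private final StringBuilder text = new StringBuilder();
        private final List<Integer> pageStartOffsets = new ArrayList<>();

        // Append one OCR'd page and remember where it starts in the combined text.
        public void addPage(String pageText) {
            pageStartOffsets.add(text.length());
            text.append(pageText).append('\n');
        }

        // Full text to hand to the indexer as a single book-level document.
        public String fullText() {
            return text.toString();
        }

        // Map a character offset reported for a hit back to a 1-based page number.
        public int pageForOffset(int hitOffset) {
            int page = 1;
            for (int i = 0; i < pageStartOffsets.size(); i++) {
                if (pageStartOffsets.get(i) <= hitOffset) {
                    page = i + 1;
                } else {
                    break;
                }
            }
            return page;
        }
    }

The offset table lives outside the index, so the only requirement on the engine is that it hands back hit positions relative to the combined text; if it reports word positions instead of character offsets, the same table can be kept in words.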

If you were using any other FTS engine (e.g. Lucene or Sphinx), how would you approach the above problem (at the risk of repeating myself):

  1. You need to index pages' content
  2. Pages are logically grouped into documents
  3. You need to obtain results by document
  4. The highlight results must contain the page number

Thank you for any suggestions, George

PS: sorry for the long message


2 Answers


As a long-time dtSearch user, I think I would go back to basics by generating and indexing a paged PDF file, each page of which corresponds to an OCR'd text page of your book.

This way, you are totally independent of the search-engine technology, letting it do what it does best on the well-known PDF format.

Your index will also not be flooded with meaningless single-page documents, whose sheer number would skew best-result ordering when searching for books.
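
For example, if the PDF generation were done on the Java side, a minimal sketch with Apache PDFBox (assuming PDFBox 2.x; the font, margins, and output path are placeholders, and the standard Helvetica font only handles text that fits WinAnsi encoding) could look like this:

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.font.PDType1Font;

    import java.io.IOException;
    import java.util.List;

    public class BookPdfBuilder {

        // pageTexts: one OCR'd text block per scanned TIFF page, in reading order.
        public static void buildPdf(List<String> pageTexts, String outputPath) throws IOException {
            try (PDDocument doc = new PDDocument()) {
                for (String pageText : pageTexts) {
                    PDPage page = new PDPage();
                    doc.addPage(page);
                    try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                        cs.beginText();
                        cs.setFont(PDType1Font.HELVETICA, 10);
                        cs.setLeading(12f);
                        cs.newLineAtOffset(50, 750);
                        // showText() rejects line breaks, so the page is written line by line.
                        // (No word wrapping here; real OCR text would need wrapping/overflow handling.)
                        for (String line : pageText.split("\\r?\\n")) {
                            cs.showText(line);
                            cs.newLine();
                        }
                        cs.endText();
                    }
                }
                doc.save(outputPath);
            }
        }
    }

Each generated PDF page then corresponds one-to-one to an OCR'd page, so whatever page-level hit reporting the engine offers for PDFs applies directly.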

Hope this helps, and sorry for my broken English.


A brute-force approach would be to have two types of indexed documents:

  • Page-level documents with the text of the page, the page number, the name of the book, and a flag indicating that this is a page-level document.
  • Book-level documents with the text of the book, the name of the book, and a flag indicating that this is a book-level document.

You would first search only the book-level documents to find the matching books. Then you would search only the page-level documents for those books to find the matching pages. This would let you say "termX and termY appear in book Z, with termX appearing on pages 2, 47, and 293, and termY on pages 1, 3, 5, and 293."
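
As a minimal sketch of that two-pass scheme with Lucene (the field names, the directory, and the analyzer are assumptions, and the query terms are expected to already be lowercased to match StandardAnalyzer's output), it could look roughly like this:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.Directory;

    import java.io.IOException;
    import java.util.List;

    public class TwoLevelIndex {

        public static void index(Directory dir, String book, List<String> pageTexts) throws IOException {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                // Book-level document: all pages concatenated, used for the first pass.
                Document bookDoc = new Document();
                bookDoc.add(new StringField("type", "book", Field.Store.YES));
                bookDoc.add(new StringField("book", book, Field.Store.YES));
                bookDoc.add(new TextField("content", String.join("\n", pageTexts), Field.Store.NO));
                writer.addDocument(bookDoc);

                // Page-level documents: one per page, used for the second pass and highlighting.
                for (int i = 0; i < pageTexts.size(); i++) {
                    Document pageDoc = new Document();
                    pageDoc.add(new StringField("type", "page", Field.Store.YES));
                    pageDoc.add(new StringField("book", book, Field.Store.YES));
                    pageDoc.add(new StringField("page", Integer.toString(i + 1), Field.Store.YES));
                    pageDoc.add(new TextField("content", pageTexts.get(i), Field.Store.YES));
                    writer.addDocument(pageDoc);
                }
            }
        }

        public static void search(Directory dir, String term1, String term2) throws IOException {
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));

            // Pass 1: "term1 AND term2", restricted to book-level documents.
            Query bookQuery = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("type", "book")), BooleanClause.Occur.FILTER)
                    .add(new TermQuery(new Term("content", term1)), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("content", term2)), BooleanClause.Occur.MUST)
                    .build();

            for (ScoreDoc bookHit : searcher.search(bookQuery, 10).scoreDocs) {
                String book = searcher.doc(bookHit.doc).get("book");

                // Pass 2: pages of this book that contain either term.
                Query pageQuery = new BooleanQuery.Builder()
                        .add(new TermQuery(new Term("type", "page")), BooleanClause.Occur.FILTER)
                        .add(new TermQuery(new Term("book", book)), BooleanClause.Occur.FILTER)
                        .add(new TermQuery(new Term("content", term1)), BooleanClause.Occur.SHOULD)
                        .add(new TermQuery(new Term("content", term2)), BooleanClause.Occur.SHOULD)
                        .build();

                for (ScoreDoc pageHit : searcher.search(pageQuery, 1000).scoreDocs) {
                    System.out.println(book + ", page " + searcher.doc(pageHit.doc).get("page"));
                }
            }
        }
    }

Highlighting can then run against the stored page-level content, so the page number is carried along with each snippet.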

One drawback to this approach is that you end up indexing the contents of each page twice.