This is not necessarily a dtSearch-specific question (it is more of a general full-text search (FTS) engine question), but it deals with indexing documents composed of multiple pages and obtaining page-level hit results.
I've googled and searched and found nothing, hence my question:
We have n scanned TIFF pages from m books. We OCR them, full-text index them and perform a search.
We want the search results to be book-level (i.e. a matching book should appear as a single result), but we also want to obtain the hits at page level (in order to perform hit-highlighting efficiently, e.g. the term SomeTerm was found on Page 1, Page 2 and Page 7).
And here comes the problem:
- if we index each page's text as a separate document, and Page1 of BookA contains Term1 while Page2, also of BookA, contains Term2, the search Term1 AND Term2 yields no results, which is expected since no single page contains both terms
- if we index all the pages of a book as one large text block, we cannot determine which page a found term belongs to.
dtSearch Desktop has such a feature for PDF indexing: it indexes the text of all pages of a single document, but can also report the page on which a hit occurred by using the %%Page%% symbol.
We're using a custom DataSource to feed the indexer, but we haven't been able to work out what document structure to use to achieve the desired result.
If you were using any other FTS engine (e.g. Lucene/Sphinx), how would you approach the above problem? To summarize (at the risk of repeating myself):
- You need to index pages' content
- Pages are logically grouped into documents
- You need to obtain results by document
- The highlight results must contain the page number
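To make the four requirements above concrete, here is a minimal, engine-agnostic sketch in plain Python. It is not dtSearch, Lucene or Sphinx code; the names `build_index` and `search_all` and the toy whitespace tokenizer are assumptions made purely for illustration. The idea is to keep postings at page level but evaluate AND at book level:

```python
from collections import defaultdict


def build_index(pages):
    """Build a toy inverted index: term -> {book_id: set of page numbers}.

    `pages` is an iterable of (book_id, page_no, text) tuples. Tokenizing
    is a naive lowercase whitespace split (an assumption for this sketch).
    """
    index = defaultdict(lambda: defaultdict(set))
    for book_id, page_no, text in pages:
        for term in text.lower().split():
            index[term][book_id].add(page_no)
    return index


def search_all(index, terms):
    """AND search evaluated per book, with page-level hits per term.

    Returns {book_id: {term: sorted page numbers}} for every book that
    contains all terms somewhere, not necessarily on the same page.
    """
    terms = [t.lower() for t in terms]
    if not terms:
        return {}
    books = None
    for term in terms:
        found = set(index.get(term, {}))
        books = found if books is None else books & found
    return {b: {t: sorted(index[t][b]) for t in terms} for b in sorted(books)}
```

With Page1 of BookA containing Term1 and Page2 containing Term2, `search_all(index, ["Term1", "Term2"])` returns BookA once, together with the page on which each term was found.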
Thanking you for any suggestions, George
PS: sorry for the long message
As a long-time dtSearch user, I think I would go back to basics: generate and index a paged PDF file for each book, with each PDF page corresponding to one OCR text page of the book.
This way, you are independent of the search engine technology and let it do what it does best on the well-known PDF format.
Your index will not be flooded with meaningless single-page documents, whose sheer number would distort result ranking when searching for books.
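If generating a PDF is not convenient, the same idea (one document per book, with page boundaries kept on the side) can be sketched roughly as follows. This is only an illustration in plain Python, not dtSearch API code: the function names are made up, the form-feed separator is an arbitrary choice, and it assumes your engine can report the character offset of each hit in the indexed text.

```python
import bisect
import re


def join_pages(page_texts):
    """Concatenate the page texts of one book into a single document.

    Returns (document, starts), where starts[i] is the character offset
    at which page i + 1 begins. Pages are joined with a form feed.
    """
    starts, pos = [], 0
    for text in page_texts:
        starts.append(pos)
        pos += len(text) + 1  # +1 accounts for the separator character
    return "\f".join(page_texts), starts


def page_of(starts, offset):
    """Map a character offset in the document to a 1-based page number."""
    return bisect.bisect_right(starts, offset)


def hits_by_page(document, starts, term):
    """Return the distinct page numbers on which `term` occurs.

    A case-insensitive regex scan stands in for the engine's hit offsets.
    """
    pages = []
    for match in re.finditer(re.escape(term), document, re.IGNORECASE):
        page = page_of(starts, match.start())
        if page not in pages:
            pages.append(page)
    return pages
```

The book is indexed once as `document`, so Term1 AND Term2 matches at book level, while `starts` (stored next to the index) translates each hit offset back into a page number for highlighting.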
Hope this helps, and sorry for my broken English