I intend to use Solr's Data Import Handler to create documents from RDBMS records. One of the RDBMS columns holds the path to a PDF/Word file. I would like to parse that file with Tika and store the extracted text in another field of the same document, so that each final document contains both the RDBMS data and the Tika-imported data.
For example:
Document fields from db: author, publish_year, e-mail
Document fields from tika: plain_text
Is this possible with a single document configuration in the Data Import Handler, or should I run separate imports (SQL and Tika as separate document types) and then join them in my queries?
Yes, it is possible. After some trial and error, the following configuration works:
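Two data sources of different types work together in a nested entity configuration: the outer entity uses the database data source to fetch each row, including the file path, and the inner entity uses the file data source to read that file's contents for the Tika processor. Fields from both entities end up in the same Solr document.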
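The nesting described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the exact configuration that was used: the JDBC driver, URL, and credentials, the table name `documents`, and the column names `id`, `email`, and `file_path` are all assumptions to adapt to your own schema. `JdbcDataSource`, `BinFileDataSource`, and `TikaEntityProcessor` are the standard DIH classes for this kind of setup.

```xml
<dataConfig>
  <!-- Database data source; driver, url, user and password are placeholders -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb"
              user="solr" password="secret"/>

  <!-- Binary file data source that hands file contents to Tika -->
  <dataSource name="file" type="BinFileDataSource"/>

  <document>
    <!-- Outer entity: one Solr document per database row (hypothetical table and columns) -->
    <entity name="record" dataSource="db"
            query="SELECT id, author, publish_year, email, file_path FROM documents">
      <field column="id" name="id"/>
      <field column="author" name="author"/>
      <field column="publish_year" name="publish_year"/>
      <field column="email" name="email"/>

      <!-- Inner entity: parse the file named in the row and add its text to the same document -->
      <entity name="tika" dataSource="file"
              processor="TikaEntityProcessor"
              url="${record.file_path}"
              format="text"
              onError="skip">
        <field column="text" name="plain_text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

In this sketch, `${record.file_path}` resolves to the path returned by the outer query for the current row, and `onError="skip"` is one way to keep a single unreadable file from aborting the whole import. The target fields (`author`, `publish_year`, `email`, `plain_text`) would also need to be defined in the Solr schema.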