Is there a way for the Solr data import handler to get metadata from an RDBMS and related file content from Tika?

Question

Asked by jim on 22 July 2021 at 06:07 · 160 views

I intend to use Solr's data import handler to create documents from RDBMS records. One of the RDBMS columns holds the path to a PDF or Word file. I would like to parse that file with Tika and store the extracted text in another field of the same document, so that each final document contains both the RDBMS data and the Tika-extracted text.

For example

Document fields from db: author, publish_year, e-mail

Document fields from tika: plain_text

Is this possible with a single document configuration in the data import handler, or should I run separate imports (SQL and Tika as separate document types) and then join the results at query time?

**jim** · Accepted Answer · 2021-08-19T09:06:33.420000

Yes, it is. After some trial and error, the following configuration works:

<dataConfig>
    <!-- JDBC data source that supplies the metadata rows -->
    <dataSource name="ds-db" driver="org.mariadb.jdbc.Driver" url="jdbc:mysql://localhost:3306/eepyakm?user=root" user="root" password="root"/>
    <!-- Binary file data source that hands file streams to the Tika processor -->
    <dataSource name="ds-file" type="BinFileDataSource"/>
    <document>
        <!-- Parent entity: one Solr document per supplier row -->
        <entity name="supplier" query="select * from suppliers_tmp_view" dataSource="ds-db"
                deltaQuery="select id from suppliers_tmp_view where last_modified > '${dataimporter.last_index_time}'"
                deltaImportQuery="select * from suppliers_tmp_view where id='${dataimporter.delta.id}'">

            <!-- Child entity: attachment rows that belong to the current supplier -->
            <entity name="attachment" dataSource="ds-db"
                    query="select * from suppliers_tmp_files_view where supplier_tmp_id='${supplier.id}' and path is not null"
                    deltaQuery="select id,supplier_tmp_id from suppliers_tmp_files_view where last_modified > '${dataimporter.last_index_time}' and path is not null"
                    parentDeltaQuery="select id from suppliers_tmp_view where id='${attachment.supplier_tmp_id}'">

                <field name="path" column="path"/>

                <!-- Innermost entity: Tika parses the file found at attachment.path;
                     onError="skip" keeps one bad file from aborting the whole import -->
                <entity name="file" onError="skip" processor="TikaEntityProcessor" url="${attachment.path}" format="text" dataSource="ds-file">

                    <field column="text"/>
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>

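Two data sources of different types work together in a nested entity configuration. The database data source supplies the file path from each attachment row, and the file data source reads that file's contents for the Tika processor. The extracted text ends up in the same Solr document as the columns of the parent row.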
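For the nested entities to land in one document, the schema needs matching fields. The fragment below is only a sketch of what that might look like: the field names `author`, `publish_year`, `email` and `plain_text` are taken from the question's example rather than from my actual views, and the types `string`, `pint` and `text_general` assume the field types shipped in Solr's default configset. Because one supplier can have several attachments, the fields filled by the inner entities are declared multi-valued here.

```xml
<!-- Hypothetical schema fields; names follow the question's example, not the real views -->
<field name="author" type="string" indexed="true" stored="true"/>
<field name="publish_year" type="pint" indexed="true" stored="true"/>
<field name="email" type="string" indexed="true" stored="true"/>

<!-- Filled by the nested "attachment" and "file" entities:
     one value per attachment row, hence multiValued -->
<field name="path" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="plain_text" type="text_general" indexed="true" stored="true" multiValued="true"/>
```

With a schema like this, the Tika output would be routed to `plain_text` by writing `<field column="text" name="plain_text"/>` in the `file` entity. Without a `name` attribute, the `text` column goes to a schema field that is also called `text`.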
