How to index documents with their metadata in a DB using Solr 5.1.0


I'm using Apache Solr to index documents for a search engine. These documents are stored locally on my file system. To support faceted search, I also have to include these documents' metadata, which is stored in a MySQL DB.

Is there a way to simultaneously index these documents in the file system while also attaching/indexing their corresponding metadata from the DB for the faceted search?

If not, what is the alternative? Thanks in advance.

Best answer:

I'm not saying that Drew's answer is incorrect, but I've found a more direct way to solve this problem.

After a couple of days of searching and posting on the Lucene forums, I was able to come up with a pretty comprehensive answer to this question. If you want to index a database and a file system together and produce ONE comprehensive Solr document per file, combining the file's content with its metadata, there are two ways to go about it. One is better than the other.

The first way is to configure the DataImportHandler (DIH). This involves editing solrconfig.xml to enable the DIH, and then creating a new .xml config file in the conf directory of the core you are using (a sketch of the solrconfig.xml change follows the list below). This enables you to:

1) Tap into multiple data sources.
2) Use data from the database to locate the file in the filesystem, i.e. in this case the file path.
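
As a minimal sketch of the solrconfig.xml change (the config file name data-config.xml and the dist/ lib path are assumptions; adjust them to your install):

    <!-- load the DIH jars shipped in Solr's dist/ directory -->
    <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

    <!-- register the /dataimport request handler pointing at the DIH config file -->
    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config.xml</str>
      </lst>
    </requestHandler>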

This link will help you configure multiple data sources and understand the capabilities of the DIH:

Data Import Handler Documentation

This link will help you set up the DIH and connect it to a database. There are two parts; I recommend looking at both:

Configuring the data import handler and connecting it to a database

This is my final DIH config file, for reference:

    <dataConfig>

      <dataSource name="ds-db" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost:3306/EDMS_Metadata"
                  user="root"
                  password="*************" />

      <dataSource name="ds-file" type="BinFileDataSource" />

      <document name="doc1">
        <entity name="db-data"
                dataSource="ds-db"
                onError="skip"
                query="select TextContentURL as 'id', Title, AuthorCreator from MasterIndex">

          <!-- TextContentURL is aliased to 'id' in the query, so map the aliased column -->
          <field column="id" name="id" />
          <field column="Title" name="title" />
          <field column="AuthorCreator" name="author" />

          <!-- nested entity: use the DB row's file path to read and parse the file itself -->
          <entity name="file"
                  dataSource="ds-file"
                  onError="skip"
                  processor="TikaEntityProcessor"
                  url="${db-data.id}"
                  format="text">

            <field column="text" name="text" />

          </entity>

        </entity>

      </document>

    </dataConfig>
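
Once the config file is saved and the core reloaded, you trigger the import through the DIH request handler; for a full import, hit a URL like the following (the core name EDMS is a placeholder for your own core):

    http://localhost:8983/solr/EDMS/dataimport?command=full-import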

BE WARNED: with LARGE PDFs this makes Solr SLOW and may ultimately kill it, because the documents are being processed inside Solr itself and big files overwhelm Tika. This is why I ultimately could not use this method, which leads me to the next method, the one I recommend to anyone indexing rich documents.

You have to create your own indexer. I used SolrJ, a Java client API that gives you access to Solr. Going into detail would take too long, but below is a link to a skeleton of SolrJ code that indexes a file system and a database SEPARATELY. I was able to combine the two to create a single Solr document holding both the metadata from the database and the file content from the filesystem. I prefer this because it processes quickly and gives me more control over my fields.
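
As a rough sketch of the combined approach (the table, column, and field names are taken from the DIH config above; the core URL and DB credentials are placeholders, and error handling is omitted):

    import java.io.File;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class CombinedIndexer {
        public static void main(String[] args) throws Exception {
            // SolrJ 5.x client pointed at the target core (URL is a placeholder)
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/EDMS");
            Tika tika = new Tika(); // extracts plain text from rich documents outside Solr

            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/EDMS_Metadata", "root", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "select TextContentURL, Title, AuthorCreator from MasterIndex")) {

                while (rs.next()) {
                    String path = rs.getString("TextContentURL");

                    // one Solr document per DB row: metadata plus extracted file text
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", path);
                    doc.addField("title", rs.getString("Title"));
                    doc.addField("author", rs.getString("AuthorCreator"));
                    doc.addField("text", tika.parseToString(new File(path)));

                    solr.add(doc);
                }
                solr.commit();
            }
            solr.close();
        }
    }

Because Tika runs in your own process rather than inside Solr, large PDFs slow down only the indexer, not the search server, which is the main advantage over the DIH approach above.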

Here's a link to the skeleton tutorial. Good luck. Hope this helps.

Indexing a file system and database using SolrJ