Is there a best practice schema.xml for SOLR when importing rich documents?


I'm working with SOLR on a project where we import a batch (~40k items) of rich documents, mainly MS Word, PowerPoint, Excel and PDF files.

Is there a best practice schema.xml and/or solrconfig.xml to use in SOLR when using the ExtractingRequestHandler?

I have been tweaking the default schema to try to get faceting working on document modification dates. Even without that, I figure there must be a good example of how these files should look when the default output from Tika is enough.

If there is no such thing as a best-practice schema.xml and/or solrconfig.xml, I'm also interested in good examples, preferably from existing open source projects, or in good blog posts.

Any pointers are welcome!

There is 1 answer below


The book Taming Text (http://www.manning.com/ingersoll/) has some references to ExtractingRequestHandler. It is about processing text using open source tools such as Solr, Tika and Lucene.

I've read up to chapter 5, and so far the book explains how to extend Solr's functionality by modifying schema.xml to create different types of fields and to apply processing at query time or at index time.
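
To give an idea of what that kind of schema.xml change can look like, here is a minimal sketch. It is not taken from the book and is not an official best-practice schema. The names text_docs, tdate, content and last_modified are made up for illustration, and it assumes a Solr version that still ships solr.TrieDateField (newer releases use solr.DatePointField instead).

```xml
<!-- Fragment only: these elements belong inside an existing schema.xml. -->
<!-- All type and field names below are illustrative assumptions. -->

<!-- A text type with separate analysis chains for index time and query time. -->
<fieldType name="text_docs" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- A date type suited to range queries and date faceting. -->
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>

<!-- Extracted body text, searchable but not stored. -->
<field name="content" type="text_docs" indexed="true" stored="false" multiValued="true"/>

<!-- Modification date, kept as a real date so it can be faceted on. -->
<field name="last_modified" type="tdate" indexed="true" stored="true"/>
```

With a field like last_modified declared as a date type, the ExtractingRequestHandler's fmap.* parameters can map a Tika metadata name onto it, and range faceting (facet.range) can then be tried on that field. Which metadata names Tika actually emits depends on the document type, so check the output for your own files first.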