Is there a best practice schema.xml for SOLR when importing rich documents?

746 Views Asked by Pål Brattberg At 28 July 2025 at 05:14

I'm working with SOLR on a project where we import a bunch (~40k items) of rich documents, mainly MS Word, PowerPoint, Excel and PDF files.

Is there a best practice schema.xml and/or solrconfig.xml to use in SOLR when using the ExtractingRequestHandler?

I have been tweaking the default schema to try to get facets working on document modification dates, but even without that, I figure there must be a good example of how these files should look when the default output from Tika is enough.

If there is no such thing as a best-practice schema.xml and/or solrconfig.xml, I'm also interested in good examples, preferably from existing open source projects, or even good blog posts.

Any pointers are welcome!

josegil On 09 December 2011 at 14:04

The book Taming Text (http://www.manning.com/ingersoll/) has some references to ExtractingRequestHandler. The book is about processing text using open source tools such as Solr, Tika and Lucene.

I've read up to chapter 5, and so far the book explains how to extend Solr's functionality by modifying schema.xml to create different types of fields and to apply processing at query time or at index time.
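
To give an idea of what that kind of schema.xml change looks like, here is a rough sketch, not taken from the book and not a best-practice schema. It assumes a custom field type with separate index-time and query-time analyzers, plus fields to receive Tika's output. The names `text_rich`, `content` and `attr_*` are my own assumptions, as is the presence of a `stopwords.txt` file in the core's conf directory, so adapt them to your setup and Solr version.

```xml
<!-- schema.xml fragment (sketch). All names below are illustrative assumptions. -->

<!-- A text field type with separately declared index and query analyzers. -->
<fieldType name="text_rich" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-only filters (for example synonym expansion) would go here. -->
  </analyzer>
</fieldType>

<!-- Field for the extracted document body. -->
<field name="content" type="text_rich" indexed="true" stored="true" multiValued="true"/>

<!-- Catch-all for Tika metadata the schema does not declare explicitly. -->
<dynamicField name="attr_*" type="text_rich" indexed="true" stored="true" multiValued="true"/>
```

For the fields to be filled, the ExtractingRequestHandler would also need matching parameters in solrconfig.xml or on the request, for example `fmap.content=content` and `uprefix=attr_`. For faceting on modification dates you would additionally declare a date-typed field and map the relevant Tika metadata field to it; which date field class to use depends on your Solr version.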
