I'm working with SOLR on a project where we import a bunch (~40k items) of rich documents, mainly MS Word, PowerPoint, Excel and PDF files.
Is there a best-practice schema.xml and/or solrconfig.xml to use in SOLR when using the ExtractingRequestHandler?
I have been tweaking the default schema to try to get facets working on modification dates, but even without that, I figure there could well be a good example of how these files should look when the default output from Tika is enough.
If there is no such thing as a best-practice schema.xml and/or solrconfig.xml, I'm also interested in good examples, preferably from existing open source projects or even good blog posts.
Any pointers are welcome!
The book Taming Text (http://www.manning.com/ingersoll/) has some references to the ExtractingRequestHandler. It is about processing text using open source tools such as Solr, Tika or Lucene.
I've read up to chapter 5, and so far the book explains how to extend Solr's functionality by modifying schema.xml to create different types of fields and to set up processing at query time or at index time.
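This is not from the book, but as a rough sketch of how extracted metadata can be pointed at fields you define in schema.xml, here is a small Python helper that only builds the request URL for the ExtractingRequestHandler. The parameters `literal.id`, `lowernames`, `uprefix` and `fmap.*` are standard for that handler; the base URL, the helper name `build_extract_url` and the target field `last_modified_dt` are assumptions for illustration, and the date field would have to exist in your own schema.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, field_map=None, unknown_prefix="attr_"):
    """Build a URL for Solr's /update/extract handler (hypothetical helper).

    field_map maps Tika metadata names (already lowercased, because
    lowernames=true is sent) to Solr field names from your schema.xml.
    """
    params = [
        ("literal.id", doc_id),       # unique key for the document
        ("lowernames", "true"),       # e.g. Last-Modified -> last_modified
        ("uprefix", unknown_prefix),  # prefix for fields not in the schema
    ]
    for tika_name, solr_field in (field_map or {}).items():
        params.append((f"fmap.{tika_name}", solr_field))
    return f"{base_url.rstrip('/')}/update/extract?{urlencode(params)}"


# Assumed base URL and field names; adjust to your own install and schema.
url = build_extract_url(
    "http://localhost:8983/solr",
    "doc1",
    {"content": "text", "last_modified": "last_modified_dt"},
)
print(url)
```

The file itself would then be sent in the body of a POST to that URL. Faceting on the modification date should only work if the mapped field is declared as a date type in schema.xml.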