We just upgraded from Solr 6.3 to 7.5. With no changes to the schema or config, we are getting a 400 error on just about every PDF file that we try to index. These are files that Solr 6.3 had no problems indexing. All other types of complex file are indexed as before; only the PDF files cause the problem.
Clue #1: Out of ~1900 pdf files, only 2 were successfully processed. Most of our pdfs have a subject and a title, but these 2 did not.
Clue #2: In the console log we see failure messages like this: RequestHandlerBase org.apache.solr.common.SolrException: undefined field: "pdf_docinfo_title"
I can't find a field with that name in the schema. A Google search for pdf_docinfo_title didn't turn up anything useful.
Since you don't have a field with that name, and no catch-all definition, Solr barfs when Tika hands it back a document with the field pdf_docinfo_title set.
Tika is upgraded between Solr versions where possible, so the older version of Tika bundled with 6.3 did not emit this field, while the version bundled with 7.5 does. It holds the document title of the PDF file.
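If you would rather keep the title that Tika now extracts, one option is to define the field in your schema. Here is a minimal sketch, assuming a managed schema (so the Schema API is available), a hypothetical core named mycore, and a field type called string in your schema. The snippet only builds the JSON payload and leaves sending it to you:

```python
import json

# Hypothetical endpoint: replace host and core name with your own.
SCHEMA_URL = "http://localhost:8983/solr/mycore/schema"

# Schema API "add-field" command defining the field Tika now returns.
# The type name "string" is an assumption; use a field type your schema defines.
add_field_command = {
    "add-field": {
        "name": "pdf_docinfo_title",
        "type": "string",
        "stored": True,
        "indexed": True,
    }
}

payload = json.dumps(add_field_command, indent=2)
print(payload)
```

The payload would be POSTed to SCHEMA_URL with a Content-Type of application/json. With a hand-edited schema.xml, the equivalent is a field element with the same name and type.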
You can also use the fmap parameter to map fields from Tika to a different field in your schema.
Alternatively, use the uprefix parameter to get the Tika module to prefix all unknown fields with a common prefix.
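As a rough sketch of how the two parameters look on an extract request, the snippet below builds the request URLs without contacting Solr. The host, the core name mycore, the document id, the target field title, and the prefix ignored_ are all assumptions for illustration. It also assumes the handler is registered at /update/extract and that Tika field names arrive lowercased with underscores, which is how pdf_docinfo_title appears in the error.

```python
from urllib.parse import urlencode

# Hypothetical values: adjust host, core name, and field names to your setup.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycore/update/extract"


def build_extract_url(doc_id, field_map=None, unknown_prefix=None):
    """Build an extract-handler URL with optional fmap.* and uprefix params."""
    params = {"literal.id": doc_id, "commit": "true"}
    # fmap.<source>=<target> renames a field produced by Tika.
    for source, target in (field_map or {}).items():
        params["fmap." + source] = target
    # uprefix=<prefix> is prepended to fields that the schema does not define.
    if unknown_prefix:
        params["uprefix"] = unknown_prefix
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


# Option 1: map the Tika field onto an existing schema field (assumed: "title").
mapped_url = build_extract_url("doc1", field_map={"pdf_docinfo_title": "title"})

# Option 2: prefix unknown fields so a dynamic field such as "ignored_*" catches them.
prefixed_url = build_extract_url("doc1", unknown_prefix="ignored_")

print(mapped_url)
print(prefixed_url)
```

The PDF itself would be sent as the body of a POST to one of these URLs. With uprefix, the schema still needs a matching dynamic field (for example ignored_*) so that the prefixed names are accepted. The same parameters can also be set as defaults on the request handler in solrconfig.xml instead of on each request.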