SOLR has a module called Cell. It uses Tika to extract content from documents and index it with SOLR.
From the sources at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction , I conclude that Cell places the raw extracted document text into a field called "content". The field is indexed by SOLR, but not stored. When you query for documents, "content" doesn't come up.
My SOLR instance is schemaless (I left the default schema in place).
I'm trying to implement a similar kind of behavior using the default UpdateRequestHandler (POST to /solr/corename/update). The POST request goes:
<add commitWithin="60000">
<doc>
<field name="content">lorem ipsum</field>
<field name="id">123456</field>
<field name="someotherfield_i">17</field>
</doc>
</add>
With documents added in this manner, the content field is indexed and stored. It's present in query results. I don't want it to be; it's a waste of space.
What am I missing about the way Cell adds documents?
The Cell code indeed adds the content to the document as "content", but there's a built-in field translation rule that replaces "content" with _text_. In the schemaless SOLR, _text_ is marked as not for storing.
The rule is invoked by the following line in SolrContentHandler.addField():
In the params object, there's a rule that fmap.content should be treated as _text_. It comes from corename\conf\solrconfig.xml, where by default there's the following fragment:
Meanwhile, in corename\conf\managed_schema there's a line:
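For orientation, in a stock schemaless configset the mapping and the field definition typically look roughly like the sketch below. The handler attributes, the lowernames default and the field type are assumptions based on a default install and vary between SOLR versions, so check them against your own files.

```xml
<!-- corename\conf\solrconfig.xml (assumed stock content; verify against your copy) -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <!-- fmap.content: Cell's "content" field is renamed to _text_ -->
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>

<!-- corename\conf\managed_schema (assumed stock content) -->
<!-- indexed="true" makes the text searchable; stored="false" keeps it out of query results -->
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
```

The part that matters for the question is stored="false" on _text_; the remaining attributes only illustrate the usual shape of these entries.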
And that's the whole story.
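To get the same behavior through the plain /update handler, one option is to send the text under _text_ instead of "content". This is a minimal sketch that reuses the request from the question and assumes the default managed_schema still declares _text_ as indexed but not stored:

```xml
<add commitWithin="60000">
  <doc>
    <!-- _text_ instead of content: indexed but not stored in the assumed default schema -->
    <field name="_text_">lorem ipsum</field>
    <field name="id">123456</field>
    <field name="someotherfield_i">17</field>
  </doc>
</add>
```

Alternatively, declaring "content" explicitly in the schema with stored="false" before indexing any documents should have the same effect, since schemaless mode only guesses definitions for fields it has not seen.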