How does SOLR Cell add document content?

Asked by Seva Alekseyev on 31 October 2016 at 15:49

SOLR has a module called Cell. It uses Tika to extract content from documents and index it with SOLR.

From the sources at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction , I conclude that Cell places the raw extracted document text into a field called "content". The field is indexed by SOLR, but not stored. When you query for documents, "content" doesn't come up.

My SOLR instance has no custom schema (I left the default schema in place).

I'm trying to implement a similar kind of behavior using the default UpdateRequestHandler (POST to /solr/corename/update). The POST request goes:

<add commitWithin="60000">
    <doc>
        <field name="content">lorem ipsum</field>
        <field name="id">123456</field>
        <field name="someotherfield_i">17</field>
    </doc>
</add>

With documents added in this manner, the content field is indexed and stored. It's present in query results. I don't want it to be; it's a waste of space.

What am I missing about the way Cell adds documents?

MatsLindh On 31 October 2016 at 16:31

If you don't want your field to store the contents, you have to set the field as stored="false".

Since you're using schemaless mode (there is still a schema; it's just generated dynamically as new fields are added), you'll have to use the Schema API to change the field.

You can do this by issuing a replace-field command:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field": {
    "name": "content",
    "type": "text",
    "stored": false
  }
}' http://localhost:8983/solr/collection/schema

You can see the defined fields by issuing a request against /collection/schema/fields.
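
As a rough sketch of checking the result from a script, the Python snippet below builds the same replace-field body and the URL of a single field definition. The base URL, the collection name and the helper names are assumptions for illustration. The top-level "field" key read by fetch_field is what I'd expect the Schema API to return, so verify it against your SOLR version.

```python
import json
import urllib.request

# Assumed values; adjust to your own instance.
BASE_URL = "http://localhost:8983/solr"
COLLECTION = "collection"


def replace_field_body(name, field_type, stored):
    """Build the JSON body for a Schema API replace-field command."""
    return json.dumps(
        {"replace-field": {"name": name, "type": field_type, "stored": stored}}
    )


def field_url(name):
    """URL of a single field definition under /schema/fields."""
    return f"{BASE_URL}/{COLLECTION}/schema/fields/{name}"


def fetch_field(name):
    """GET one field definition; needs a running SOLR instance."""
    with urllib.request.urlopen(field_url(name)) as resp:
        # Assumption: the definition sits under a top-level "field" key.
        return json.load(resp)["field"]


if __name__ == "__main__":
    print(replace_field_body("content", "text", False))
    print(field_url("content"))
```

With a running instance, fetch_field("content") should show whether stored is now false.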

Seva Alekseyev On 31 October 2016 at 18:05 (Accepted Answer)

The Cell code does add the extracted text to the document under the name content, but a field mapping rule in the default configuration renames content to _text_. In schemaless SOLR, _text_ is declared with stored="false".

The rule is applied by the following line in SolrContentHandler.addField():

String name = findMappedName(fname);

In the params object, there's a rule that fmap.content should be treated as _text_. It comes from corename\conf\solrconfig.xml, where by default there's the following fragment:

<requestHandler name="/update/extract"
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str> <!-- This one! -->
  </lst>
</requestHandler>
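
The effect of that default can be sketched in a few lines of Python. This is only an illustration of the lookup as described here, not the actual SOLR source: find_mapped_name is a hypothetical stand-in for the Java method, and the plain dict stands in for the params object.

```python
def find_mapped_name(params, fname):
    """Return the target field for fname, honoring an fmap.<fname> rule if one exists."""
    return params.get("fmap." + fname, fname)


# The defaults from the /update/extract handler shown above.
defaults = {
    "lowernames": "true",
    "fmap.meta": "ignored_",
    "fmap.content": "_text_",
}

print(find_mapped_name(defaults, "content"))  # _text_
print(find_mapped_name(defaults, "title"))    # title (no rule, name unchanged)
```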

Meanwhile, in corename\conf\managed_schema there's a line:

<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>

And that's the whole story.
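
It follows that the plain update handler should give the same indexed-but-not-stored behavior if the text is sent in _text_ rather than content. Below is a minimal sketch that builds such a request body, assuming the default managed_schema line quoted above is in place. The helper names and the update URL in the comment are assumptions, and I haven't run the POST against a live instance.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(fields, commit_within_ms=60000):
    """Build an <add><doc>...</doc></add> body for the XML update handler."""
    add = ET.Element("add", commitWithin=str(commit_within_ms))
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_update(update_url, body):
    """POST the XML body to an update handler; needs a running SOLR instance."""
    req = urllib.request.Request(
        update_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Same document as in the question, with the text sent in _text_ instead of content.
body = build_add_xml(
    {"_text_": "lorem ipsum", "id": "123456", "someotherfield_i": 17}
)
print(body)
# post_update("http://localhost:8983/solr/corename/update", body)  # assumed URL
```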
