I am using Solr's ExtractingRequestHandler to extract and index HTML content. My issue is with the extracted links it produces: the returned content has "rect" entries inserted that do not exist in the HTML source.
My Solr Cell configuration in solrconfig.xml is as follows:
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
And my Solr schema.xml has the following entries:
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="meta" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="content_encoding" type="string" indexed="false" stored="true" multiValued="false"/>
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
I post the following HTML to Solr Cell:
<!DOCTYPE html>
<html>
<body>
<h1>Heading1</h1><a href="http://www.google.com">Link to Google</a><a href=
"http://www.google.com">Link to Google2</a><a href="http://www.google.com">Link to
Google3</a><a href="http://www.google.com">Link to Google</a>
<p>Paragraph1</p>
</body>
</html>
Solr has the following indexed:
{
"meta": [
"Content-Encoding",
"ISO-8859-1",
"ignored_hbaseindexer_mime_type",
"text/html",
"Content-Type",
"text/html; charset=ISO-8859-1"
],
"links": [
"rect",
"http://www.google.com",
"rect",
"http://www.google.com",
"rect",
"http://www.google.com",
"rect",
"http://www.google.com"
],
"content_encoding": "ISO-8859-1",
"content_type": [
"text/html; charset=ISO-8859-1"
],
"content": [
" Heading1 Link to Google Link to Google2 Link to Google3 Link to Google Paragraph1 "
],
"id": "row69",
"_version_": 1461665607851180000
}
Notice the "rect" before every link. Why is Solr Cell or Tika inserting these? I am not defining a Tika config file to use. Do I need to configure Tika?
Although this is an old question, I also encountered this issue while indexing HTML documents with Solr 8.7.0.
HTML:
Result:
[I am posting/indexing on the Linux command line:
solr restart; sleep 1; post -c gettingstarted /mnt/Vancouver/programming/datasci/solr/test/solr_test9.html ]
I grepped (ripgrep: rg --color=always -w -e 'rect' . | less) the Solr code for that word but found nothing, so the source of rect http... in indexed URLs eludes me.
My solution was to add a regex processor to my solrconfig.xml:
As alluded to in my comments in that processor, I am extracting <p />-formatted HTML content to a p field (field: p | type: text_general).
That content did not parse with the RegexReplaceProcessorFactory processor.
In the Solr Admin UI I noted that title and content were copied as strings (e.g.: field: content | type: text_general | copied to: content_str), so I made a copy field (p >> p_str) that resolved the regex issue.
For completeness, here are the relevant parts of my solrconfig.xml related to HTML document indexing, ... noting again that I added fields to the managed-schema via the Solr Admin UI.
Result:
See also:
re: <requestHandler name="/update/extract"...: Solr 8.6.3 could not index html file
https://lucene.apache.org/solr/guide/8_6/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-extractingrequesthandler-in-solrconfig-xml
My answer here (which deals with peculiarities associated with the <updateRequestProcessorChain />, above) when switching from Solr's managed-schema to the classic schema.xml
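The regex processor described above is not reproduced in the text, so the following is only a minimal sketch of what such a chain might look like. It is not the actual configuration from this answer. It assumes the links field from the question's schema. The chain name strip-rect and the pattern are hypothetical, and it is worth verifying against your Solr version that an empty replacement string is accepted.

```xml
<!-- Hypothetical chain name; goes in solrconfig.xml alongside the request handlers. -->
<updateRequestProcessorChain name="strip-rect">
  <!-- Remove a leading "rect" token (and any whitespace after it) from each value of links. -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">links</str>
    <str name="pattern">^rect\s*</str>
    <str name="replacement"></str>
    <bool name="literalReplacement">true</bool>
  </processor>
  <!-- Values that consisted only of "rect" are now empty strings; drop them. -->
  <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A chain defined this way is not used by default. It would have to be selected per request with update.chain=strip-rect, or by adding <str name="update.chain">strip-rect</str> to the defaults of the /update/extract handler. Note that the regex operates on string values, which matches the observation above that it only took effect once the content was available in a string-typed copy field.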