I familiarized myself with crawling using Apache Nutch and Solr, but realized that while HTTP and HTTPS links appear in the content field of Solr query results, magnet links do not. I adjusted conf/regex-urlfilter.txt to be
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# for linuxtracker.org
+^https?://*linuxtracker.org/(.+)*$
#+^magnet:\?xt=(.+)*$
# causes magnet links to be ignored/not appear in content field
+^magnet:*$
# reject anything else
-.
and don't see why magnet links shouldn't be included inside content. As you can see, I'm investigating this using http://linuxtracker.org, which has, for example, the magnet link magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P on http://linuxtracker.org/?page=torrent-details&id=24c76d5e7f3a758f0798e9b5895cc2e9ac9797cf.
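As a side check, the two magnet rules from the filter file can be tried against the example link outside Nutch. This is only a rough sketch: it assumes the filter applies each pattern with a search-style match and that the first matching rule decides, and it uses Python's re module rather than Java's regex engine. The accepts helper and the two rule lists are made up for illustration and are not part of Nutch.

```python
import re

MAGNET = "magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P"

def accepts(rules, url):
    """Return True if the first rule whose pattern matches is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False

# The rule that is active in the file, followed by the final catch-all.
active = ["+^magnet:*$", "-."]
# The rule that is commented out in the file, followed by the catch-all.
commented_out = [r"+^magnet:\?xt=(.+)*$", "-."]

print(accepts(active, MAGNET))         # False
print(accepts(commented_out, MAGNET))  # True
```

Under these assumptions, ^magnet:*$ only matches the word magnet followed by zero or more colons and nothing else, so the example link falls through to the final reject rule. Whether Nutch ever passes magnet links to the filter in the first place is a separate matter.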
After crawling with bin/crawl, there are no magnet links when querying Solr with pysolr as follows:
import pysolr

# solr_core_url is the URL of the Solr core that Nutch indexed into
solr = pysolr.Solr(solr_core_url, timeout=10)
results = solr.search('*:*')
for result in results:
    print(result)
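Rather than reading every printed document by eye, the check can be narrowed to documents whose content mentions a magnet link. A minimal sketch, assuming each result behaves like a dict and that the text lives in a field named content (possibly multi-valued). The docs_with_magnet helper and the two sample documents are made up for illustration and are not real crawl output.

```python
def docs_with_magnet(docs, field="content"):
    """Return the documents whose `field` contains the text 'magnet:?'."""
    hits = []
    for doc in docs:
        value = doc.get(field, "")
        # Solr may return a multi-valued field as a list of strings.
        text = " ".join(value) if isinstance(value, list) else str(value)
        if "magnet:?" in text:
            hits.append(doc)
    return hits

# Made-up sample documents standing in for Solr results.
sample = [
    {"id": "a", "content": "see http://linuxtracker.org/ for details"},
    {"id": "b", "content": ["magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P"]},
]
print([d["id"] for d in docs_with_magnet(sample)])  # ['b']
```

With a live core, the same function could be called on the results object returned by solr.search.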
I'm using Apache Nutch release-1.13-73-g9446b1e1 and Solr 6.6.1 on Ubuntu 17.04.
Short answer: magnet links are not "normal" links and are not supported out of the box by Nutch.
Long answer:
The configuration that you've changed gets applied after the links are extracted. If you're using parse-html, the parse plugin tries to evaluate whether each possible outlink is a valid link, which basically just creates a java.net.URL. java.net.URL, on the other hand, doesn't support magnet links out of the box, according to the javadocs. If you're using parse-tika, something similar is happening.
If you only want to have the links indexed in Solr/ES (for search), then you could write your own HtmlParseFilter and add those links in a separate field, for instance.
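To give an idea of the extraction logic such a filter would need, here is a rough sketch. It is plain Python using the standard library's html.parser, not Nutch code, and a real HtmlParseFilter would be written in Java against Nutch's plugin API. The class name, the extract_magnets helper, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser

class MagnetLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with 'magnet:'."""

    def __init__(self):
        super().__init__()
        self.magnets = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes that have no value
            if name == "href" and value and value.lower().startswith("magnet:"):
                self.magnets.append(value)

def extract_magnets(html):
    """Return all magnet hrefs found in the given HTML string."""
    collector = MagnetLinkCollector()
    collector.feed(html)
    collector.close()
    return collector.magnets

# Made-up sample page with one ordinary link and one magnet link.
page = (
    '<a href="/index.php">home</a> '
    '<a href="magnet:?xt=urn:btih:ETDW2XT7HJ2Y6B4Y5G2YSXGC5GWJPF6P">magnet</a>'
)
print(extract_magnets(page))
```

The collected values are what would go into the separate field (the field name is up to you), so they become searchable without having to pass the outlink validation.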