How to store custom meta tags in an Elasticsearch index from a website using StormCrawler


I am crawling intranet websites using StormCrawler (v2.10) and storing the data in Elasticsearch (v7.8.0), with Kibana for visualization. The intranet pages have custom meta tags like these:

<meta name="Article_PublishedDate" content="2023-07-14T00:00:00Z" />
<meta name="Article_Year" content="2023" />
<meta name="Article_Heading" content="AWARDS RELEASE 2023" />
<meta name="Article_Description" content="BUSINESS AWARDS RELEASE 2023" />
<meta name="Article_Type" content="PressRelease" />

I want to store these in the Elasticsearch index "crawler-content", but none of these fields show up in Kibana/Elasticsearch.

My updated index mapping:

{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "default_pipeline": "timestamp"
    }
  },
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "content": {
        "type": "text"
      },
      "description": {
        "type": "text"
      },
      "domain": {
        "type": "keyword"
      },
      "format": {
        "type": "keyword"
      },
      "keywords": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "url": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "date_optional_time"
      },
      "metatag": {
        "properties": {
          "article_description": {
            "type": "text"
          },
          "article_heading": {
            "type": "text"
          },
          "article_publisheddate": {
            "type": "date"
          },
          "article_type": {
            "type": "text"
          },
          "article_year": {
            "type": "text"
          }
        }
      }
    }
  }
}
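
For completeness, a mapping like this takes effect when the index is (re)created. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200 and the JSON above is saved as mapping.json (both are assumptions, adjust to your setup):

# create the index with the settings and mappings above
curl -X PUT "http://localhost:9200/crawler-content" \
  -H "Content-Type: application/json" \
  -d @mapping.json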

In jsoupfilters.json I added:

"parse.article_description": "//META[@name=\"Article_Description\"]/@content",
"parse.article_heading": "//META[@name=\"Article_Heading\"]/@content",
"parse.article_publisheddate": "//META[@name=\"Article_PublishedDate\"]/@content",
"parse.article_type": "//META[@name=\"Article_Type\"]/@content",
"parse.article_year": "//META[@name=\"Article_Year\"]/@content"

In crawler-conf.yaml I added:

indexer.md.mapping:
  - parse.title=title
  - parse.search=search
  - parse.keywords=keywords
  - parse.description=description
  - parse.article_description=metatag.article_description
  - parse.article_heading=metatag.article_heading
  - parse.article_publisheddate=metatag.article_publisheddate
  - parse.article_type=metatag.article_type
  - parse.article_year=metatag.article_year
  - domain
  - format
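
Once a page has gone through the indexer, the quickest check is to query the index directly. A sketch, assuming the same localhost:9200 defaults as above:

# search for a document by one of the custom fields
curl "http://localhost:9200/crawler-content/_search?q=metatag.article_type:PressRelease&pretty"
# confirm which mapping the index actually ended up with
curl "http://localhost:9200/crawler-content/_mapping?pretty"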

1 Answer

I can't see anything obviously incorrect in your setup. You could run the class https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSoupFilters.java on a single URL to check that the extraction works. It would also be useful to test the output of the protocol on the command line; see our recent blog post for an example.
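
For example, something along these lines. This is only a sketch: the main() arguments of these utility classes vary between StormCrawler releases, and the jar name and URL here are placeholders, so check the class in your release before running:

# hypothetical invocation of the JSoup parse filters on a single page;
# verify the exact arguments expected by JSoupFilters' main() in your version
java -cp target/crawler-2.10.jar \
  com.digitalpebble.stormcrawler.parse.JSoupFilters \
  https://intranet.example.com/news/awards-2023.html
# the protocol implementations can similarly be run standalone to dump what
# the fetcher actually retrieves; see the blog post mentioned above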