How to store custom meta tags in an Elasticsearch index from a website using StormCrawler


I am crawling intranet websites using StormCrawler (v2.10) and storing the data in Elasticsearch (v7.8.0), with Kibana for visualization. The intranet pages have custom meta tags like these:

<meta name="Article_PublishedDate" content="2023-07-14T00:00:00Z" />
<meta name="Article_Year" content="2023" />
<meta name="Article_Heading" content="AWARDS RELEASE 2023" />
<meta name="Article_Description" content="BUSINESS AWARDS RELEASE 2023" />
<meta name="Article_Type" content="PressRelease" />

I want to store these in the Elasticsearch index "crawler-content", but none of these fields show up in Kibana/Elasticsearch.

My updated index mapping:

{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "default_pipeline": "timestamp"
    }
  },
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "content": {
        "type": "text"
      },
      "description": {
        "type": "text"
      },
      "domain": {
        "type": "keyword"
      },
      "format": {
        "type": "keyword"
      },
      "keywords": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "url": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "date_optional_time"
      },
      "metatag": {
        "properties": {
          "article_description": {
            "type": "text"
          },
          "article_heading": {
            "type": "text"
          },
          "article_publisheddate": {
            "type": "date"
          },
          "article_type": {
            "type": "text"
          },
          "article_year": {
            "type": "text"
          }
        }
      }
    }
  }
}
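
For completeness, a mapping like this takes effect when the index is (re)created. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200 and the JSON above is saved as mapping.json (both are assumptions, adjust to your setup):

# create the index with the settings and mappings above
curl -X PUT "http://localhost:9200/crawler-content" \
  -H "Content-Type: application/json" \
  -d @mapping.json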

In jsoupfilters.json I added:

"parse.article_description": "//META[@name=\"Article_Description\"]/@content",
"parse.article_heading": "//META[@name=\"Article_Heading\"]/@content",
"parse.article_publisheddate": "//META[@name=\"Article_PublishedDate\"]/@content",
"parse.article_type": "//META[@name=\"Article_Type\"]/@content",
"parse.article_year": "//META[@name=\"Article_Year\"]/@content"

In crawler-conf.yaml I added:

indexer.md.mapping:
  - parse.title=title
  - parse.search=search
  - parse.keywords=keywords
  - parse.description=description
  - parse.article_description=metatag.article_description
  - parse.article_heading=metatag.article_heading
  - parse.article_publisheddate=metatag.article_publisheddate
  - parse.article_type=metatag.article_type
  - parse.article_year=metatag.article_year
  - domain
  - format
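
Once a page has gone through the indexer, the quickest check is to query the index directly. A sketch, assuming the same localhost:9200 defaults as above:

# search for a document by one of the custom fields
curl "http://localhost:9200/crawler-content/_search?q=metatag.article_type:PressRelease&pretty"
# confirm which mapping the index actually ended up with
curl "http://localhost:9200/crawler-content/_mapping?pretty"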

1 Answer

I can't see anything obviously incorrect in your setup. You could run the class https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSoupFilters.java on a single URL to check that the extraction works. It would also be useful to test the output of the protocol on the command line; see our recent blog post for an example.
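
For example, something along these lines. This is only a sketch: the main() arguments of these utility classes vary between StormCrawler releases, and the jar name and URL here are placeholders, so check the class in your release before running:

# hypothetical invocation of the JSoup parse filters on a single page;
# verify the exact arguments expected by JSoupFilters' main() in your version
java -cp target/crawler-2.10.jar \
  com.digitalpebble.stormcrawler.parse.JSoupFilters \
  https://intranet.example.com/news/awards-2023.html
# the protocol implementations can similarly be run standalone to dump what
# the fetcher actually retrieves; see the blog post mentioned above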