Scraping data from multiple websites, merging the data and indexing in Elasticsearch


I'm using Scrapy to scrape product data (product name and manufacturer) from a website. I then use a pipeline (http://github.com/noplay/scrapy-elasticsearch) to index the data directly into Elasticsearch. I'd also like to scrape data from another site (either via an API or with Scrapy again) that provides information about manufacturers and their reputation (for example, a simple ranking of the top 250 manufacturers). In the Elasticsearch index, an example document might then have the following fields:

product name: ifruit 7 (scraped from site A)
product manufacturer: pear (scraped from site A and site B)
manufacturer ranking: 17 (scraped from site B)
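
For concreteness, the merged item I'm aiming for looks roughly like this (the field names are just placeholders):

```python
import scrapy

class ProductItem(scrapy.Item):
    product_name = scrapy.Field()          # scraped from site A
    product_manufacturer = scrapy.Field()  # scraped from site A (and site B)
    manufacturer_ranking = scrapy.Field()  # taken from site B during the merge
```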

What is the simplest way to combine the scraped data so that each document in the Elasticsearch index holds the product name, the manufacturer and the manufacturer ranking? Is it best to:

- merge the data within the scraping process,
- combine the two JSON exports afterwards,
- adapt the pipeline (see the sketch after this list), or
- manipulate the data once it has all been indexed in Elasticsearch?

Or is there a better solution altogether?
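
To make the "adapt the pipeline" option concrete, this is roughly what I imagine: run the site-B spider first, export its results to a JSON file, and then put an enrichment pipeline ahead of the Elasticsearch pipeline in ITEM_PIPELINES. The file name, its format and the RANKINGS_FILE setting below are all made up for the sketch:

```python
import json

class ManufacturerRankingPipeline:
    """Fills in manufacturer_ranking on each product item, using a lookup
    table exported earlier by the site-B spider (hypothetical rankings.json)."""

    def __init__(self, rankings_file):
        self.rankings_file = rankings_file
        self.rankings = {}

    @classmethod
    def from_crawler(cls, crawler):
        # RANKINGS_FILE is a made-up custom setting, not a Scrapy built-in
        return cls(crawler.settings.get('RANKINGS_FILE', 'rankings.json'))

    def open_spider(self, spider):
        # Assumed format: [{"manufacturer": "pear", "ranking": 17}, ...]
        with open(self.rankings_file) as f:
            for row in json.load(f):
                self.rankings[row['manufacturer'].lower()] = row['ranking']

    def process_item(self, item, spider):
        manufacturer = item.get('product_manufacturer', '').lower()
        item['manufacturer_ranking'] = self.rankings.get(manufacturer)
        return item
```

This pipeline would need a lower number than the Elasticsearch pipeline in ITEM_PIPELINES so the ranking is added before the item is indexed.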

It's also possible that the manufacturer names are spelled or phrased differently in the two data sets. How would that mismatch best be handled?
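
The only idea I have for that so far is a crude normalize-then-fuzzy-match step against site B's manufacturer list, along these lines (the suffix list and the cutoff are guesses):

```python
import difflib

def normalize(name):
    """Crude normalization: lower-case and strip common company suffixes."""
    name = name.lower().strip()
    for suffix in (' inc.', ' inc', ' ltd.', ' ltd', ' corp.', ' corp'):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name.strip()

def match_manufacturer(scraped_name, known_names, cutoff=0.85):
    """Return the closest name from site B's manufacturer list, or None."""
    lookup = {normalize(n): n for n in known_names}
    hits = difflib.get_close_matches(
        normalize(scraped_name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None
```

So match_manufacturer('Pear Inc.', ['pear', 'banana electronics']) would return 'pear'. Would something like this be good enough, or is there a more standard approach?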
