NodeJS Bluebird Cheerio Web Scraper, Proper Indexing and Parsing of scraped data

Functional Overview

I am scraping thumbnail image URL strings from an insecure site (http), downloading the images, and then uploading them to a secure website server (https). I am using cheerio and bluebird to scrape a list of website URLs with a mapped promise request; my code is shown below. I push the thumbnail image URL strings from each website URL into an array stored in the "json" object and then write the contained JSON data to a suppImages.json file.
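
For reference, the output.json file I am reading the URLs from is structured roughly like this (the URLs below are only placeholders for the actual Realty Warp page URLs):

{
      "urli": [
            "http://example-site.com/listing-1",
            "http://example-site.com/listing-2"
      ]
}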

Current Issue I am trying to address

There is a variable number of thumbnail images (around 20 each) on the website URLs I am scraping. Right now, my code aggregates all of the thumbnail image URLs into one array. What I would like my code to do is parse the thumbnail image URLs into separate arrays, one per website URL. So instead of the output looking like one blob of aggregate data from all the website URLs, I want there to be several arrays, each containing the discrete thumbnail images displayed on a given website URL, as sketched below.
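
To illustrate, suppImages.json currently comes out as one flat array of every thumbnail from every page, roughly:

{
      "pictureThumb": [
            "page1-thumb1.jpg",
            "page1-thumb2.jpg",
            "page2-thumb1.jpg",
            "page2-thumb2.jpg"
      ]
}

What I would like instead is one array per scraped page (the filenames here are only placeholders; the grouping is what matters):

{
      "pictureThumb": [
            ["page1-thumb1.jpg", "page1-thumb2.jpg"],
            ["page2-thumb1.jpg", "page2-thumb2.jpg"]
      ]
}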

My code

const fs = require('fs');
const requestPromise = require('request-promise');
const Promise = require('bluebird');
const cheerio = require('cheerio');
const suppURL = require('./output.json');

// Array of website URLs to scrape, read from output.json
const urls = suppURL.urli;
console.log("Currently reading URLs from buttons of Realty Warp: " + urls);

// All thumbnail image URLs currently end up in this single array
var json = { pictureThumb: [] };

const scraper = () =>
  Promise.map(urls, requestPromise)
    .map((htmlOnePage, index) => {
      const $ = cheerio.load(htmlOnePage);
      const linksPic = $(".thumb img");

      // Collect the src of every thumbnail image on this page
      linksPic.each(function (i, link) {
        const sop = $(this).attr('src');
        console.log("sop: " + sop);
        json.pictureThumb.push(sop);
      });

      // Write out the aggregated data after each page is processed
      fs.writeFile('suppImages.json', JSON.stringify(json, null, 6), function (err) {
        if (err) console.log(err);
        console.log('wrote file');
      });

      console.log("URL " + index + ': Scrape Complete');
    })
    .catch((e) => console.log('We encountered an error: ' + e));

scraper();