How to download stock price data only when it is not erroneous (404)?

The script downloads historical stock prices from finance.yahoo.com. It loops through an array of tickers, creates a link for each ticker, and downloads the data associated with it. However, some of the ticker symbols are no longer current, and for those Yahoo delivers a 404 page instead of a CSV containing price information. The error page is then stored in a CSV file and saved to my computer. To avoid downloading these files, I look for the string 'Sorry, the page you requested was not found.', which appears on each of Yahoo's error pages, as an indicator of a 404 page.

Behaviour of the code (output, see below code):

The code runs through all tickers and downloads a stock price .csv for each. This works fine for most tickers, but some ticker symbols are no longer used by Yahoo. For such a symbol, the program downloads a .csv containing Yahoo's 404 page. All files (including the good ones containing actual data) are downloaded to the directory c:\Users\W7ADM\stock-price-leecher\data2.

Problem:

I would like the code not to save the 404 page as a CSV file, but to do nothing in this case and move on to the next ticker symbol in the loop. I am trying to achieve this with the if-condition that looks for the string "Sorry, the page you requested was not found." that is displayed on Yahoo's 404 pages. In the end I hope to download the CSVs for all tickers that actually exist and save them to my hard drive.

var url_begin = 'http://real-chart.finance.yahoo.com/table.csv?s=';
var url_end = '&a=00&b=1&c=1950&d=11&e=31&f=2050&g=d&ignore=.csv';
var tickers = [];
var link_created = '';

var casper = require('casper').create({
    pageSettings: {
        webSecurityEnabled: false
    }
});                   

casper.start('http://www.google.de', function() {              
        tickers = ['ADS.DE', '0AM.DE']; //ADS.DE is retrievable, 0AM.DE is not
        //loop through all ticker symbols
        for (var i in tickers){
                //create a link with the current ticker
                link_created=url_begin + tickers[i] + url_end;
                //check to see, if the created link returns a 404 page
                this.open(link_created);
                var content = this.getHTML();
                //If it is a 404 page, jump to the next iteration of the for loop
                if (content.indexOf('Sorry, the page you requested was not found.')>-1){
                        console.log('No Page found.');
                        continue; //At this point I want to jump to the next iteration of the loop.
                }
                //Otherwise download file to local hdd
                else {
                        console.log(link_created);
                        this.download(link_created, 'stock-price-leecher\\data2\\'+tickers[i]+'.csv');
                }
        }
});

casper.run(function() {
        this.echo('Ende...').exit();
});

The Output:

C:\Users\Win7ADM>casperjs spl_old.js
ADS.DE,0AM.DE
http://real-chart.finance.yahoo.com/table.csv?s=ADS.DE&a=00&b=1&c=1950&d=11&e=31
&f=2050&g=d&ignore=.csv
http://real-chart.finance.yahoo.com/table.csv?s=0AM.DE&a=00&b=1&c=1950&d=11&e=31
&f=2050&g=d&ignore=.csv
Ende...

C:\Users\Win7ADM>
There is 1 best solution below

casper.open is asynchronous (non-blocking), but you are using it as if it were blocking: the loop calls getHTML right after open, before the requested page has loaded. Use casper.thenOpen instead. It takes a callback that is called once the page has loaded, so you can inspect the content there.

casper.start("http://example.com");

tickers = ['ADS.DE', '0AM.DE']; //ADS.DE is still retrievable, 0AM.DE is not
tickers.forEach(function(ticker){
    var link_created = url_begin + ticker + url_end;
    casper.thenOpen(link_created, function(){
        console.log("open", link_created);
        var content = this.getHTML();
        if (content.indexOf('Sorry, the page you requested was not found.') > -1) {
            console.log('No Page found.');
        } else {
            console.log("downloading...");
            this.download(link_created, 'test14_'+ticker+'.csv');
        }
    });
});

casper.run();
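
To illustrate why the callbacks above run only after the forEach loop has finished, here is a toy model of a step queue in plain JavaScript. It is a sketch for illustration only, not CasperJS's actual implementation; StepQueue and its log array are hypothetical names.

```javascript
// Toy model of a step queue (hypothetical; NOT CasperJS's real implementation).
// then() only records a callback; run() executes the recorded callbacks in order.
function StepQueue() {
    this.steps = [];
    this.log = [];
}

StepQueue.prototype.then = function (fn) {
    this.steps.push(fn);
    return this;
};

StepQueue.prototype.run = function () {
    while (this.steps.length > 0) {
        this.steps.shift().call(this);
    }
};

var queue = new StepQueue();

['ADS.DE', '0AM.DE'].forEach(function (ticker) {
    // This line runs immediately, while the loop is still going.
    queue.log.push('scheduled ' + ticker);
    // This callback runs later, when run() reaches it. Because forEach gives
    // every iteration its own ticker variable, each callback sees its own value.
    queue.then(function () {
        this.log.push('processing ' + ticker);
    });
});

queue.run();
console.log(queue.log.join('\n'));
```

Both "scheduled" entries are logged before any "processing" entry, which is the same ordering you get when thenOpen steps are registered in a loop and executed by casper.run().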

Instead of using the thenOpen callback, you can also listen for the page.resource.received event and download the resource only when its status is 200. The event handler has no access to ticker, though, so you either have to store it in a global variable or parse it from resource.url.

var i = 0;
casper.on("page.resource.received", function(resource){
    if (resource.stage === "end" && resource.status === 200) {
        this.download(resource.url, 'test14_'+(i++)+'.csv');
    }
});

casper.start("http://example.com");

tickers = ['ADS.DE', '0AM.DE']; //ADS.DE is still retrievable, 0AM.DE is not
tickers.forEach(function(ticker){
    var link_created = url_begin + ticker + url_end;
    casper.thenOpen(link_created);
});

casper.run();
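
If you prefer parsing the ticker from resource.url over a running counter, a minimal sketch could look like the following. The helper name tickerFromUrl is hypothetical, and it assumes the ticker is always passed in the s query parameter, as in the URLs from the question.

```javascript
// Hypothetical helper: extract the ticker from the "s" query parameter of a URL.
// Returns null if the URL has no such parameter.
function tickerFromUrl(url) {
    var match = /[?&]s=([^&#]*)/.exec(url);
    return match ? decodeURIComponent(match[1]) : null;
}

console.log(tickerFromUrl(
    'http://real-chart.finance.yahoo.com/table.csv?s=ADS.DE&a=00&b=1&c=1950&d=11&e=31&f=2050&g=d&ignore=.csv'
));
```

Inside the event handler you could then build the file name from tickerFromUrl(resource.url) and skip resources for which it returns null.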

I don't think you should download the files by navigating to them with open or thenOpen. It may work on PhantomJS, but probably not on SlimerJS.


I actually tried it, and your page is unusual in that the download doesn't succeed. Instead, you can load a dummy page such as example.com, fetch the CSV files yourself using __utils__.sendAJAX (which is only accessible from the page context), and write them using the fs module. Write a file only if the response does not contain the specific 404 error page text that you identified:

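var fs = require('fs'); // needed for fs.write below

// The cross-domain AJAX request relies on webSecurityEnabled: false,
// as set in the casper options in the question.
casper.start("http://example.com");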

casper.then(function(){
    tickers = ['ADS.DE', '0AM.DE']; //ADS.DE is still retrievable, 0AM.DE is not
    tickers.forEach(function(ticker){
        var link_created = url_begin + ticker + url_end;
        var content = casper.evaluate(function(url){
            return __utils__.sendAJAX(url, "GET");
        }, link_created);
        console.log("len: ", content.length);
        if (content.indexOf('Sorry, the page you requested was not found.') > -1) {
            console.log('No Page found.');
        } else {
            console.log("writing...");
            fs.write('test14_'+ticker+'.csv', content);
        }
    });
});

casper.run();
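
The URL construction and the 404 check are repeated in each snippet above, so they could be factored into two small helpers. This is only a sketch: buildUrl, isNotFoundPage and the constant names are hypothetical, encodeURIComponent is an addition that the original code does not use, and treating a non-string response as "not found" is an assumption about how you want to handle a failed request.

```javascript
var URL_BEGIN = 'http://real-chart.finance.yahoo.com/table.csv?s=';
var URL_END = '&a=00&b=1&c=1950&d=11&e=31&f=2050&g=d&ignore=.csv';
var NOT_FOUND_MARKER = 'Sorry, the page you requested was not found.';

// Build the download link for one ticker symbol.
function buildUrl(ticker) {
    return URL_BEGIN + encodeURIComponent(ticker) + URL_END;
}

// True if the response should be skipped: either it is not a string
// (assumption: the request failed) or it contains the 404 marker text.
function isNotFoundPage(content) {
    return typeof content !== 'string' || content.indexOf(NOT_FOUND_MARKER) > -1;
}

console.log(buildUrl('ADS.DE'));
console.log(isNotFoundPage('<p>Sorry, the page you requested was not found.</p>'));
```

In the last snippet, the condition would then read if (isNotFoundPage(content)) and the link would come from buildUrl(ticker).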