What's the best way to write a maintainable web scraping app?


I wrote a Perl script a while ago which logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it using just Perl and curl, and it was quite complicated and hard to maintain. After a few instances of my bank changing their webpage, I got fed up with debugging it to keep it up to date.

So what's the best way of writing such a program in such a way that it's easy to maintain? I'd like to write a nice well engineered version in either Perl or Java which will be easy to update when the bank inevitably fiddle with their web site.

7

There are 7 best solutions below

0
On BEST ANSWER

In Perl, something like WWW::Mechanize can already make your script simpler and more robust, because it can find the HTML forms in previous responses from the website. You fill in these forms to prepare the next request, instead of building it by hand. For example:

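use strict;
use warnings;
use WWW::Mechanize;

# $url and $password are assumed to be defined elsewhere.
my $mech = WWW::Mechanize->new();
$mech->get($url);
# Fill in and submit the first form on the page.
$mech->submit_form(
    form_number => 1,
    fields      => { password => $password },
);
die "Login failed: HTTP status ", $mech->status, "\n" unless $mech->success;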
5
On

Hmm, just found

Finance::Bank::Natwest

Which is a Perl module specifically for my bank! Wasn't expecting it to be quite that easy.

3
On

If I were to give you one piece of advice, it would be to use XPath for all your scraping needs. Avoid regexes.
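As a rough sketch of what that might look like in Perl, assuming the HTML::TreeBuilder::XPath module from CPAN: the HTML below, the element ids and class names, and the `%xpath` hash are all made up for illustration, and a real bank page will look different. The idea is that every selector lives in one place, so a site redesign means editing a couple of strings rather than hunting through regexes.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical statement page; a real bank's markup will differ.
my $html = <<'HTML';
<html><body>
  <span id="balance">1,234.56</span>
  <table id="statement">
    <tr><td class="date">01 Jan</td><td class="desc">Coffee</td><td class="amount">-2.50</td></tr>
    <tr><td class="date">02 Jan</td><td class="desc">Salary</td><td class="amount">1500.00</td></tr>
  </table>
</body></html>
HTML

# Keep all page-level selectors together so they are easy to update.
my %xpath = (
    balance => '//span[@id="balance"]',
    rows    => '//table[@id="statement"]//tr',
);

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Fail loudly if the page no longer matches, instead of emailing garbage.
my $balance = $tree->findvalue( $xpath{balance} );
die "Balance not found; the page layout may have changed\n"
    unless length $balance;

my @transactions;
for my $row ( $tree->findnodes( $xpath{rows} ) ) {
    push @transactions, {
        date   => $row->findvalue('./td[@class="date"]'),
        desc   => $row->findvalue('./td[@class="desc"]'),
        amount => $row->findvalue('./td[@class="amount"]'),
    };
}
$tree->delete;    # free the parse tree

print "Balance: $balance\n";
print "$_->{date}  $_->{desc}  $_->{amount}\n" for @transactions;
```

In a real script the `$html` string would come from whatever fetched the page, for example `$mech->content` if you are using WWW::Mechanize.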

1
On

WWW::Mechanize and Web::Scraper are the two tools that make me most productive, especially in combination. There's a nice article about that combination at catalyzed.org.
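For a flavour of the Web::Scraper half, here is a minimal sketch. The HTML, the CSS selectors and the field names are invented for the example; only the `scraper`/`process`/`scrape` calls come from the module itself. The scraper is a declarative description of the page, which tends to be much easier to adjust than procedural parsing code when the site changes.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical statement page; a real bank's markup will differ.
my $html = <<'HTML';
<html><body>
  <span id="balance">1,234.56</span>
  <table id="statement">
    <tr><td class="date">01 Jan</td><td class="amount">-2.50</td></tr>
    <tr><td class="date">02 Jan</td><td class="amount">1500.00</td></tr>
  </table>
</body></html>
HTML

# Describe what to pull out of the page: one balance, and a list of
# transactions, each with a date and an amount.
my $statement = scraper {
    process '#balance', balance => 'TEXT';
    process '#statement tr', 'transactions[]' => scraper {
        process 'td.date',   date   => 'TEXT';
        process 'td.amount', amount => 'TEXT';
    };
};

# scrape() accepts an HTML string; with WWW::Mechanize you could pass
# $mech->content here after logging in.
my $data = $statement->scrape($html);

print "Balance: $data->{balance}\n";
print "$_->{date}  $_->{amount}\n" for @{ $data->{transactions} };
```

WWW::Mechanize would handle the login and navigation, and the scraper above would then be handed the page content.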

1
On

A lot of banks publish their data in a standard format, which is commonly used by personal finance packages such as MS Money or Quicken to download transaction information. You could look for that hook and download using the same API, and then parse the data on your end (e.g. parse Excel documents with Spreadsheet::ParseExcel, and Quicken docs with Finance::QIF).

Edit (reply to comment): Have you considered contacting your bank and asking them how you can programmatically log into your account in order to download the financial data? Many/most banks have an API for this (which Quicken etc make use of, as described above).
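If your bank does offer a QIF download, reading it could look roughly like the sketch below. It assumes the Finance::QIF module from CPAN; the sample data is made up, and it is written to a temporary file only so the example stands on its own. The loop prints whatever fields the module returns for each record rather than assuming particular field names.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Finance::QIF;

# Made-up sample export; in practice this would be the file downloaded
# from the bank.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} <<'QIF';
!Type:Bank
D01/02/2009
T-2.50
PCoffee
^
D01/03/2009
T1500.00
PSalary
^
QIF
close $fh;

my $qif = Finance::QIF->new( file => $filename );

my @records;
while ( my $record = $qif->next ) {
    push @records, $record;
    # Print every simple (non-reference) field of the record.
    for my $key ( sort keys %{$record} ) {
        next if ref $record->{$key};
        print "$key: $record->{$key}\n";
    }
    print "\n";
}
```

Compared with scraping, this only depends on the export format, so it should survive redesigns of the bank's web pages.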

0
On

There's a currently up to date Ruby implementation here:

http://github.com/warm/NatWoogle

0
On

Use Perl and the Web::Scraper package.