Web Data Extraction / Screen Scraping (Open Source)


I need to do the following in code, using a screen scraping or web extraction framework:

  1. I go to a web page.
  2. Enter a value to search for an entity.
  3. Once the results are displayed, they need to be captured and returned as output.

Can someone suggest good open source web extraction tools (ones you have actually used) that support this kind of search-and-extract task?

Any help/pointers will be greatly appreciated.


There are 3 best solutions below

0

Selenium may be what you are looking for, since it drives a real browser and can fill in the search form for you. Alternatively, you can send the HTTP requests and parse the responses yourself in whatever language you are working in.
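To illustrate the plain-HTTP route, here is a minimal sketch in Python using only the standard library. The query parameter name `q`, the `<li class="result">` markup, and the helper names (`build_search_url`, `parse_results`, `search`) are assumptions made up for the example; a real site will have its own form fields and result markup, and pages rendered by JavaScript would still need a browser-based tool.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen


class ResultParser(HTMLParser):
    """Collects the text of each <li class="result"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._depth = 0   # > 0 while we are inside a result <li>
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "li":
                self._depth += 1          # nested <li> inside a result
        elif tag == "li" and "result" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self._depth and tag == "li":
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace and store the finished result text.
                self.results.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def build_search_url(base_url, query, param="q"):
    """Step 2: encode the search value into the query string (param name is assumed)."""
    return f"{base_url}?{urlencode({param: query})}"


def parse_results(html):
    """Step 3: capture the displayed results from the returned HTML."""
    parser = ResultParser()
    parser.feed(html)
    parser.close()
    return parser.results


def search(base_url, query):
    """Steps 1-3 together: fetch the results page and return the extracted rows."""
    with urlopen(build_search_url(base_url, query), timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return parse_results(resp.read().decode(charset, errors="replace"))
```

Calling something like `search("https://example.com/search", "acme corp")` would then return a list of result strings, assuming the site matches the markup above.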

0

If you are looking for a solution that works generally for any website, it's a hard problem. The requirements would include finding the search box, identifying each separate result, separating the fields within each result, and visiting every page of results in order. For that, you'd want something such as ScreenSlicer (disclaimer: I made this project).

However, if you just want a way to submit queries to specific sites and get the resulting HTML, I'd recommend investigating the OpenSearch standard. Site operators implement OpenSearch, and consumers then get programmatic access. Firefox is one such consumer; see "Creating OpenSearch plugins for Firefox". Keep in mind that, unfortunately, very few site operators have implemented every feature the standard allows (such as paging through results, returning Atom-formatted results, etc.).
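As a rough sketch of what consuming OpenSearch looks like: a site publishes a description document whose `Url` elements carry a URL template with placeholders such as `{searchTerms}` and optional ones such as `{startPage?}`. The Python below reads a template from such a document and fills it in. The helper names `find_template` and `fill_template` are made up for this example, and it handles only simple placeholders, not everything the specification permits.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Namespace of OpenSearch 1.1 description documents.
OS_NS = "{http://a9.com/-/spec/opensearch/1.1/}"


def find_template(description_xml, mime_type="text/html"):
    """Return the URL template for the given result type, or None if absent."""
    root = ET.fromstring(description_xml)
    for url in root.iter(OS_NS + "Url"):
        if url.get("type") == mime_type:
            return url.get("template")
    return None


def fill_template(template, search_terms, **params):
    """Substitute {name} and {name?} placeholders with URL-encoded values."""
    values = {"searchTerms": search_terms, **params}

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""          # optional parameters may be left empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{([A-Za-z:]+)(\?)?\}", substitute, template)
```

With a template such as `https://example.com/search?q={searchTerms}&p={startPage?}`, paging would be a matter of calling `fill_template(template, "acme corp", startPage=2)` and so on, provided the site actually supports that parameter.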

1

XtractData is a new venture of PPTS, where we specialize in extracting data from various public domains and making it easily accessible and user-friendly for all your data needs.