How to scrape different URLs from database with Nokogiri with different requirements

135 Views Asked by Dave C At 16 January 2017 at 09:49

I tried using Feedjira to help with content analysis of news feeds, but it appears that RSS feeds now only link to the content rather than including it in the feed, as I found out in "Feedjira not adding content and author". I plan to use Feedjira to get the URL for each article, and then use Nokogiri to scrape the article and pick out the relevant parts.

The problem is that each media outlet formats its pages differently. I need to know the best way for Nokogiri to take the URL from the database (supplied by Feedjira) and, depending on the associated feed title (also stored in the database by the Feedjira sync), scrape the page in a specific way and save the result to a separate table in the database. Does anyone have any suggestions?


There are 2 best solutions below

PascalTurbo On 16 January 2017 at 09:59

I don't know your specific use case, but I'm also doing content analysis using news feeds. You might have a look at Readability, which provides a generic content scraper.

the Tin Man On 17 January 2017 at 18:43

The problem you've encountered is that every feed generator does it a bit differently, just as with HTML generators. You can assume certain fields are going to be in place in an RDF, RSS or Atom feed; however, the author of the feed could use optional tags that you would find very useful, so you have to write code to look for them.

I wrote several feed aggregators in the past, including one that was handling well over 1000 feeds daily. By sniffing out the feed type (Atom vs. RSS vs. RDF), I could make sensible checks for the fields that were interesting in that format, and extract the data if it was available.

Pre-canned parsers get it wrong too often, either grabbing data you don't want and making a mess of the output, or skipping data you do want and leaving gaps in the output, so be prepared to write code if you want it done correctly.

You'll probably want to take advantage of a backing database too, to keep track of what you looked at last and when you're supposed to look at it again; that's part of being a good network citizen. You'll also want to keep track of whether a feed was down the last n times you looked, so you can trim out dead sites.
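The bookkeeping described above might look something like the following sketch. It uses an in-memory `Struct` as a stand-in for a database row; the `FeedStatus` fields, the one-hour `CHECK_INTERVAL`, and the `MAX_FAILURES` threshold of 5 are all assumptions chosen for illustration, not values from the answer.

```ruby
# Stand-in for a database row; a real app would persist these columns.
FeedStatus = Struct.new(:url, :last_checked_at, :next_check_at, :failures,
                        keyword_init: true)

CHECK_INTERVAL = 60 * 60  # assumed polling interval, in seconds
MAX_FAILURES   = 5        # assumed value of "n" consecutive failures

# Record the outcome of one fetch attempt and schedule the next one.
def record_fetch(status, success:, now: Time.now)
  status.last_checked_at = now
  status.failures        = success ? 0 : status.failures + 1
  status.next_check_at   = now + CHECK_INTERVAL
  status
end

# A feed is considered dead after MAX_FAILURES consecutive failed fetches.
def dead?(status)
  status.failures >= MAX_FAILURES
end

# A feed is due if it is not dead and its scheduled time has arrived
# (or it has never been checked).
def due?(status, now: Time.now)
  !dead?(status) && (status.next_check_at.nil? || status.next_check_at <= now)
end
```

A polling job would select only the feeds for which `due?` is true, call `record_fetch` after each attempt, and periodically prune the ones for which `dead?` is true.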
