Fast scraping in PHP


My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an if statement to check for keywords. It works, but it is very slow! The code is below. Is there a better way to go about this, and if so, how would it be written?

Thanks in advance.

<?php
require 'simple_html_dom.php';

// URL and CSS selector for the headline links
$syds = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';

// Debug
$i = 0;

// Check each headline for the letter "a" (case-insensitive)
foreach ($syds->find($syds_key) as $element) {
   if (stripos($element->plaintext, 'a') !== false) {
      echo $element->href . '<br>';
      $i++;
   }
}

echo "<h1>$i matches were found</h1>";
?>

2 Answers

Accepted answer

How slow are we talking?

1-2 seconds would be pretty good.

If you're using this for a website, I'd advise splitting the crawling and the display into two separate scripts and caching the results of each crawl.

You could:

  • have a crawl.php file that runs periodically to update your links.
  • then have a webpage.php that reads the results of the last crawl and displays them however you need for your website.

This way:

  • Every time you refresh your webpage, it doesn't re-request info from the news site.
  • It matters less if the news site is slow to respond.

Decouple crawling/display

You will want to decouple crawling and display completely. Have a crawler.php that runs over all the news sites one at a time, saving the raw links to a file. It can run every 5-10 minutes to keep the news updated, for example via cron as sketched below; be warned that at intervals under a minute, some news sites may get annoyed!
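A minimal crontab entry for this, assuming the script lives at /path/to/crawler.php (the path is hypothetical):

# run crawler.php every 5 minutes
*/5 * * * * php /path/to/crawler.php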

crawler.php

<?php
// Run this file from cli every 5-10 minutes
// doesn't matter if it takes 20-30 seconds

require 'simple_html_dom.php';

$html_output = ""; // use this to build up html output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    /* more sites go here, like this */
    // array('URL', 'KEY')
);

// loop over each site
foreach ($sites as $site){
   $url = $site[0];
   $key = $site[1];
   // fetch site (file_get_html returns false on failure)
   $syds = file_get_html($url);
   if (!$syds) {
      continue; // skip this site if the fetch failed
   }

   // loop over each link
   foreach ($syds->find($key) as $element) {
     // add link to $html_output (double quotes so \n is a real newline)
     $html_output .= $element->href . "<br>\n";
   }
}
// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>

display.php

<!-- other display stuff here -->
<?php
// include the file of links
include("links.php");
?>
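As a hypothetical extension (not part of the original answer), display.php could also show how fresh the cache is, assuming links.php sits in the same directory:

<?php
// show when the crawler last updated the cached links
$cache = __DIR__ . '/links.php';
if (file_exists($cache)) {
    echo '<p>Last updated: ' . date('Y-m-d H:i', filemtime($cache)) . '</p>';
    include $cache; // the pre-built list of links
} else {
    echo '<p>No links crawled yet.</p>';
}
?>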

Still want faster?

If you want it any faster, I'd suggest looking into node.js; it's much faster at TCP connections and HTML parsing.

Another answer

The bottlenecks are:

  • blocking IO - you can switch to an asynchronous HTTP library like Guzzle and start all requests at once (see the sketch after this list)

  • parsing - you can switch to a faster parser, such as PHP's built-in DOMDocument/DOMXPath (also shown below)
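A minimal sketch combining both ideas: Guzzle issues the requests concurrently, and PHP's built-in DOMXPath does the parsing. The site list and the newsday__title class are carried over from the accepted answer; everything else (the Guzzle usage, the XPath query) is an illustrative assumption, not the original poster's code:

<?php
// composer require guzzlehttp/guzzle
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$sites = [
    // ['URL', 'class of the title links']
    ['http://www.sydsvenskan.se/nyhetsdygnet', 'newsday__title'],
];

$client = new Client(['timeout' => 10]);

// start all requests at once instead of one after another
$promises = [];
foreach ($sites as $i => $site) {
    $promises[$i] = $client->getAsync($site[0]);
}

// wait for every request to finish (fulfilled or failed)
$results = Utils::settle($promises)->wait();

$html_output = '';
foreach ($results as $i => $result) {
    if ($result['state'] !== 'fulfilled') {
        continue; // skip sites that failed to respond
    }
    $dom = new DOMDocument();
    // suppress warnings from real-world, invalid HTML
    @$dom->loadHTML((string) $result['value']->getBody());
    $xpath = new DOMXPath($dom);

    // find <a> elements carrying the configured class
    $class = $sites[$i][1];
    $query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    foreach ($xpath->query($query) as $link) {
        $html_output .= $link->getAttribute('href') . "<br>\n";
    }
}

file_put_contents('links.php', $html_output);
?>

Because the requests are started together, the total fetch time is roughly that of the slowest site rather than the sum of all of them.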