My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an if statement to check for keywords. It works, but it is very slow! The code is to be found below. Is there a better way to go about this, and if so, how would it be written?
Thanks in advance.
<?php
require 'simple_html_dom.php';

// URL and CSS selector for the headlines
$syds = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';

// Debug counter
$i = 0;

// Check each headline for the keyword "a"/"A" (case-insensitive).
// Note: match against $element->plaintext, not $element itself, so the
// surrounding HTML markup can't cause false positives.
foreach ($syds->find($syds_key) as $element) {
    if (stripos($element->plaintext, 'a') !== false) {
        echo $element->href . '<br>';
        $i++;
    }
}

echo "<h1>$i were found</h1>";
?>
How slow are we talking?
1-2 seconds would be pretty good.
If you're using this for a website, I'd advise splitting the crawling and the display into two separate scripts and caching the results of each crawl. You could have:

- a crawler.php file that runs periodically to update your links, and
- a display.php file that reads the results of the last crawl and displays it however you need for your website.

This way the page itself stays fast, because it never has to crawl anything while a visitor is waiting.
Decouple crawling/display

You will want to decouple crawling and display 100%. Have a crawler.php that runs over all the news sites one at a time, saving the raw links to a file. This can run every 5-10 minutes to keep the news updated. Be warned: less than 1 minute and some news sites may get annoyed!

crawler.php
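A rough sketch of what that could look like; the $sites array and the links.json cache file are assumed names here, not anything fixed:

<?php
// crawler.php -- run periodically (e.g. via cron every 5-10 minutes).
require 'simple_html_dom.php';

// Assumed structure: each outlet is URL => CSS selector for its headlines.
$sites = [
    'http://www.sydsvenskan.se/nyhetsdygnet' => 'a.newsday__title',
    // add more outlets here
];

$links = [];
foreach ($sites as $url => $selector) {
    $html = file_get_html($url);
    if (!$html) {
        continue; // skip any site that failed to load
    }
    foreach ($html->find($selector) as $element) {
        // Save both the headline text and its link
        $links[] = [
            'title' => trim($element->plaintext),
            'href'  => $element->href,
        ];
    }
    $html->clear(); // free Simple HTML DOM's memory between sites
}

// Write the raw results; display.php filters and renders them later.
file_put_contents('links.json', json_encode($links));
?>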
display.php
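The display side then becomes a plain file read with no network access, so it returns near-instantly (again assuming the links.json name from the sketch above):

<?php
// display.php -- serves cached crawl results; never touches the network.
$raw = @file_get_contents('links.json'); // @ hides the warning if no crawl has run yet
$links = $raw ? json_decode($raw, true) : [];

$i = 0;
foreach ($links as $link) {
    // Same case-insensitive keyword check as the original
    if (stripos($link['title'], 'a') !== false) {
        echo $link['href'] . '<br>';
        $i++;
    }
}
echo "<h1>$i were found</h1>";
?>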
Still want faster?

If you want it any faster, I'd suggest looking into node.js; it's much faster at TCP connections and HTML parsing.