I am using SimpleHTMLDOM to scrape pages (on servers other than mine).
The basic implementation is:
try {
    $html = file_get_html(urldecode(trim($url)));
} catch (Exception $e) {
    echo $url;
}

foreach ($html->find('img') as $element) {
    $src = $element->src;
    if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
        $images[] = $src;
    }
}
This works fine, but it returns all images from the page, including small avatars, icons, and button images, which I'd like to avoid.
I then tried adding a size check within the loop, as follows:
...
if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
    $size = getimagesize($src);
    if ($size[0] > 200) {
        $images[] = $src;
    }
}
...
That works well on a page like http://cnn.com, but on others it returns numerous errors.
For example
http://www.huffingtonpost.com/2012/05/27/alan-simpson-republicans_n_1549604.html
gives a bunch of errors like
<p>Severity: Warning</p>
<p>Message: getimagesize(/images/snn-logo-comments.png): failed to open stream: No such file or directory
<p>Severity: Warning</p>
<p>Message: getimagesize(/images/close-gray.png): failed to open stream: No such file or directory
which seem to be happening because some of the images have relative URLs. The problem is that this crashes the script, so no images are loaded and my Ajax box keeps loading forever.
Do you have any ideas how to troubleshoot this?
The problem is that those image URLs are relative to the site root, so your server can't resolve them in order to fetch the files and read their sizes. You could refer to this question to see how to convert relative URLs to absolute ones. You should also check the return value of getimagesize() — it returns false (with a warning) on failure — so that one bad URL doesn't abort the whole loop.
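A minimal sketch of one way to resolve those URLs before calling getimagesize(). The helper name resolve_url is my own invention (not part of SimpleHTMLDOM), and it only handles the common cases — absolute, root-relative, and document-relative paths — not every corner of RFC 3986:

```php
<?php
// Resolve an image src against the URL of the page it was scraped from.
// $base is the page's absolute URL, $src is the raw src attribute.
function resolve_url($base, $src) {
    // Already absolute: use as-is.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative, e.g. "/images/close-gray.png".
    if (substr($src, 0, 1) === '/') {
        return $root . $src;
    }
    // Document-relative, e.g. "images/foo.png": resolve against the
    // directory of the page's path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = rtrim(dirname($path), '/');
    return $root . $dir . '/' . $src;
}
```

Inside your loop you would then call getimagesize(resolve_url($url, $src)) and skip the image when it returns false, instead of letting the warning kill the request.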