How to scrape only the largest images from the DOM?


I am using SimpleHTMLDOM to scrape pages (on servers other than mine).

The basic implementation is

try {
    $html = file_get_html(urldecode(trim($url)));
} catch (Exception $e) {
    echo $url;
}

foreach ($html->find('img') as $element) {
    $src = $element->src;
    if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
        $images[] = $src;
    }
}

This works fine but it returns all images from the page, including small avatars, icons, and button images. Of course I'd like to avoid these.

I then tried inserting a size check within the loop, as follows

...

if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
    // getimagesize() fetches the image; $size[0] is the width in pixels
    $size = getimagesize($src);
    if ($size[0] > 200) {
        $images[] = $src;
    }
}
...

That works well on a page like http://cnn.com, but on others it produces numerous errors. For example,

http://www.huffingtonpost.com/2012/05/27/alan-simpson-republicans_n_1549604.html

gives a bunch of errors like

Severity: Warning
Message: getimagesize(/images/snn-logo-comments.png): failed to open stream: No such file or directory
Severity: Warning
Message: getimagesize(/images/close-gray.png): failed to open stream: No such file or directory

which seem to happen because some images use relative URLs. The problem is that these warnings halt the script, so no images are loaded at all and my Ajax box keeps loading forever.

Do you have any ideas on how to troubleshoot this?

There are 3 answers below.

Answer 1

The problem is that those image URLs are relative to the site root, so getimagesize() treats something like /images/close-gray.png as a local file path instead of fetching it from the remote site. You could refer to this question to figure out how to get absolute URLs from relative ones.
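For illustration, a minimal sketch of that conversion, assuming the page URL is still available in $url. resolve_src() is a hypothetical helper, and only the protocol-relative and root-relative cases seen in the warnings above are handled:

// Hypothetical helper (not from the linked question): turn protocol-relative
// and root-relative src values into absolute URLs.
function resolve_src($src, $pageUrl) {
    $parts = parse_url($pageUrl); // gives 'scheme', 'host', etc.
    if (strpos($src, '//') === 0) {
        // protocol-relative, e.g. "//cdn.example.com/img.png"
        return $parts['scheme'] . ':' . $src;
    }
    if (strpos($src, '/') === 0) {
        // root-relative, e.g. "/images/close-gray.png"
        return $parts['scheme'] . '://' . $parts['host'] . $src;
    }
    return $src; // already absolute, or path-relative (not handled here)
}

$size = getimagesize(resolve_src($src, $url));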

Answer 2

The approach you tried with image size checking is correct.

However, for it to work on all sites, you need to resolve relative URLs into absolute ones first.

I don't know of any libraries for this, but here's a quick overview of how to do it:

  • Find the scheme and domain part of the URL you're scraping
  • Treat any URL starting with / as root-relative. You can resolve these simply by concatenating the scheme and domain with the path
  • Treat any other URL without a scheme as path-relative. You may need to resolve any ../ segments in the URL to locate the expected path
  • Check for a <base> tag in the document: if the document has one, it anchors all relative paths to the URL defined in the tag

You may be able to find a library that converts relative paths into absolute ones, but in most cases it will not account for the <base> tag mentioned in the last point. A rough sketch of the whole procedure follows.
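For what it's worth, here is one way those steps might look in PHP. Everything in it is illustrative rather than a tested implementation: make_absolute() is a made-up name, the <base> handling assumes the tag carries an absolute href, and query strings and other edge cases are ignored:

function make_absolute($src, $pageUrl, $html) {
    // Already absolute? Nothing to do.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }

    $page = parse_url($pageUrl);
    $root = $page['scheme'] . '://' . $page['host'];

    // Root-relative: concatenate scheme + domain and the path.
    if (strpos($src, '/') === 0) {
        return $root . $src;
    }

    // Path-relative: anchor to the <base> tag if the document has one,
    // otherwise to the directory of the page URL.
    $base = $html->find('base', 0); // SimpleHTMLDOM element lookup
    if ($base && $base->href) {
        $dir = rtrim($base->href, '/') . '/';
    } else {
        $path = isset($page['path']) ? $page['path'] : '/';
        $dir  = $root . rtrim(dirname($path), '/') . '/';
    }

    // Resolve leading ../ segments by walking up one directory each time.
    while (strpos($src, '../') === 0) {
        $src = substr($src, 3);
        $dir = preg_replace('#[^/]+/$#', '', $dir);
    }

    return $dir . $src;
}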

Answer 3

Try something like this, assuming a URL of http://somedomain.com...

$parts  = explode('/', $url);
$domain = $parts[0] . '//' . $parts[2]; // scheme + host, e.g. "http://somedomain.com"

// ... snip ...

if (preg_match("/\.(?:jpe?g|png)$/i", $src)) {
    // make root-relative paths absolute *before* calling getimagesize()
    if (strpos($src, '/') === 0) {
        $src = $domain . $src;
    }

    $size = @getimagesize($src); // @ suppresses the warning if the fetch fails
    if ($size !== false && $size[0] > 200) {
        $images[] = $src;
    }
}

This will help some, but it won't be fool-proof. I can't think of many domains using ../.. style relative paths to images, but I'm sure someone is. Of course, you could test for anything other than the domain in the image's src attribute and try prepending the domain, but no promises that will work every time either. I would think there's a better way... perhaps have a default method and load a config with predefined "fixes" for troublesome domains, as sketched below.
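As a loose sketch of that last idea (every domain and prefix below is invented for illustration):

$host = parse_url($url, PHP_URL_HOST); // e.g. "somedomain.com"

// Hypothetical config of per-domain "fixes"; in practice you would load
// this from a file and add entries as troublesome domains turn up.
$fixes = array(
    'somedomain.com' => 'http://cdn.somedomain.com',
);

if (strpos($src, '/') === 0) {
    $prefix = isset($fixes[$host]) ? $fixes[$host] : 'http://' . $host;
    $src = $prefix . $src;
}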