Forcing a desktop version of site when scraping using file_get_contents()


I am scraping websites using the FriendsOfPHP/Goutte package. Everything works great. I'm scraping the sites for open graph tags like image, title, etc., when a user pastes a URL into an input.

The problem occurs when a user copies the URL from a mobile device: the URL is then a mobile URL, such as https://m.datpiff.com/tape/818948, and that page has no Open Graph tags.

When I access the same URL from a desktop with the subdomain m replaced by www, e.g. https://www.datpiff.com/tape/818948, it redirects me to http://www.datpiff.com/Chance-The-Rapper-Jeremih-Merry-Christmas-Lil-Mama-mixtape.818948.html, and this desktop URL does contain Open Graph tags.

Is there a way I can get my server to force or trick the receiving server into redirecting all URLs to the desktop version, so that I can use the Open Graph tags? The receiving server already redirects to the proper URL, but only when I enter the URL directly in a desktop browser.

Here's the code I'm using - it works great. I just need to be able to redirect the URL I'm scraping to the desktop version.

First, I replace the m subdomain with www in my JS like so:

fullurl = fullurl.replace('//m.', '//www.');

That converts https://m.datpiff.com/tape/818948 into https://www.datpiff.com/tape/818948.

Then, in my PHP code, I'm using something like this:

use Goutte\Client;

$url_to_scrape = $urltoscrape;
$client = new Client();

// Fetch the page the user pasted
$crawler = $client->request('GET', $url_to_scrape);

// Read the Open Graph image and the page title
$opengraphImage = $crawler->filterXPath('//meta[@property="og:image"]')->attr('content');
$title = $crawler->filter('title')->text();

There are 3 best solutions below

0
On

You can set your client to follow redirect responses (HTTP status 3XX + Location header). Add this line after instantiating $client:

$client->followRedirects(true);

The site doesn't redirect mobile links when they are requested from a desktop browser, so you still need to replace m. with www.
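The hostname swap can also be done server-side, so it no longer depends on the client-side JS. A minimal sketch, assuming the mobile host always starts with the label m. (toDesktopUrl is a hypothetical helper name, not part of Goutte):

```php
<?php
// Hypothetical helper: rewrite a leading "m." host label to "www.".
// URLs whose host does not start with "m." are returned unchanged.
function toDesktopUrl(string $url): string
{
    // Anchor on the scheme so only the first host label can match,
    // never an "m." that happens to appear later in the path or query.
    $rewritten = preg_replace('#^(https?://)m\.#i', '${1}www.', $url, 1);

    // preg_replace() returns null on a regex error; fall back to the input.
    return $rewritten ?? $url;
}

echo toDesktopUrl('https://m.datpiff.com/tape/818948'), PHP_EOL;
```

The result can then be passed to $client->request('GET', ...) with redirects enabled as described above.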

0
On

You need to send this cookie so that the site redirects you to the desktop version:

name      value    domain              path
mredir    0        .www.datpiff.com    /

It's strange that replacing m. with www. doesn't work. Try sending a desktop User-Agent as well.
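As a rough illustration of sending both the cookie and a desktop User-Agent, here is a sketch that uses plain file_get_contents() with a stream context rather than Goutte. The helper names and the User-Agent string are assumptions for illustration, and whether the site actually honours mredir=0 is not verified here:

```php
<?php
// Hypothetical helper: build an HTTP stream context that sends a desktop
// User-Agent and the mredir=0 cookie described above.
function desktopContext()
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'follow_location' => 1,  // follow 3XX redirects
            'max_redirects'   => 5,
            'timeout'         => 10, // seconds
            'header'          => implode("\r\n", [
                // Example desktop browser string (an assumption; any desktop UA should do).
                'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
                'Cookie: mredir=0',
            ]),
        ],
    ]);
}

// Hypothetical helper: fetch a page with the context above.
// Returns the HTML as a string, or false if the request fails.
function fetchDesktopHtml(string $url)
{
    return @file_get_contents($url, false, desktopContext());
}
```

Calling fetchDesktopHtml('https://www.datpiff.com/tape/818948') performs the network request; the returned HTML can then be parsed for the og: meta tags.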

0
On

Unless you need to use that Client class, you can use file_get_contents() along with DOMDocument (borrowing code from this answer) to get a SimpleXMLElement and call SimpleXMLElement::xpath() to access the open graph tags.

$url = 'https://www.datpiff.com/tape/818948';
$html = file_get_contents($url);
print substr(htmlspecialchars($html), 0, 400).'<br />';
$doc = new DOMDocument();
//suppress errors when loading html
@$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$images = $xml->xpath('//meta[@property="og:image"]');
if (sizeof($images)) {
    $opengraphImage = (string)$images[0]['content'];
    echo 'opengraph image: '.$opengraphImage.'<br /><br />';
}
$titles = $xml->xpath('//title');
if (sizeof($titles)) {
    $title = (string)$titles[0];
    echo 'title: '.$title.'<br />';
}

See it demonstrated in this playground example.