Symfony + DomCrawler - how to extract data attributes from a <div>

I'm using Symfony 2.8 and DomCrawler to parse a website, and I'm having trouble reading data attributes from an HTML element. It might be as simple as a specific convention for data attributes, but I haven't found any references or examples on the web that discuss how to retrieve data attributes via DomCrawler.

Here are the details:

I have encountered an instance of this construct in the HTML I am parsing (from another web site, so I can't modify this HTML):

  <div class='slideshowclass' id='slideshow'>           
    <div data-thumb='http://www.example.com/thumbs/1.jpg'
        data-src='http://www.example.com/thumbs/1.jpg'></div>
    <div data-thumb='http://www.example.com/thumbs/2.jpg'
        data-src='http://www.example.com/thumbs/2.jpg'></div>
    <div data-thumb='http://www.example.com/thumbs/3.jpg'
        data-src='http://www.example.com/thumbs/3.jpg'></div>
    <div data-thumb='http://www.example.com/thumbs/4.jpg'
        data-src='http://www.example.com/thumbs/4.jpg'></div>
    <div data-thumb='http://www.example.com/thumbs/5.jpg'
        data-src='http://www.example.com/thumbs/5.jpg'></div>
    <div data-thumb='http://www.example.com/thumbs/6.jpg'
        data-src='http://www.example.com/6.jpg'></div>
  </div>

I'm using this code to search the block of divs and return the data-src values:

use Symfony\Component\DomCrawler\Crawler;

function getList( Crawler $pWebDoc ) {
    $list = $pWebDoc->filter( 'div#slideshow' );
    // filter() always returns a Crawler object, so test the match count
    // rather than the object's truthiness.
    if ( $list->count() === 0 ) {
        return null;
    }

    // Build one "index:data-src" string per child <div> of the slideshow container.
    return $list->children()->each( function ( Crawler $item, $i ) {
        return "$i:" . $item->attr( 'data-src' );
    });
}

From the DomCrawler docs I expect the attr() function to return the data-src attribute value, but it returns null. My function returns an array of 6 elements, each containing just the index number and no additional text.

Thanks in advance for your help.

There is 1 best solution below

This can be done easily with PHP's DOMDocument and DOMXPath classes. An XPath expression can select the attribute nodes themselves rather than the elements that carry them, so you can collect the values directly.

/**
 * Filters the list of nodes with an XPath expression.
 *
 * The XPath expression should already be processed to apply it in the context of each node.
 *
 * @param string $xpath
 *
 * @return Crawler
 */
private function filterRelativeXPath($xpath)
{
    $prefixes = $this->findNamespacePrefixes($xpath);
    $crawler = $this->createSubCrawler(null);
    foreach ($this->nodes as $node) {
        $domxpath = $this->createDOMXPath($node->ownerDocument, $prefixes);
        $crawler->add($domxpath->query($xpath, $node));
    }
    return $crawler;
}

The function above is from Symfony's Crawler.php. In my experience, the Crawler did not handle complex XPath expressions well, so I switched from DomCrawler to using DOMDocument and DOMXPath directly.

Your base XPath query would be something like //div/@data-src
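
As a rough illustration of that approach, here is a minimal sketch using only PHP's built-in DOM extension. The sample markup is shortened from the question, the variable names ($html, $srcs) are just placeholders, and the query is narrowed to the children of the #slideshow div, which is an assumption about what you want to match rather than something the page requires.

```php
<?php
// Sample markup modeled on the question's slideshow block (shortened to two items).
$html = <<<HTML
<div class='slideshowclass' id='slideshow'>
  <div data-thumb='http://www.example.com/thumbs/1.jpg'
       data-src='http://www.example.com/thumbs/1.jpg'></div>
  <div data-thumb='http://www.example.com/thumbs/2.jpg'
       data-src='http://www.example.com/thumbs/2.jpg'></div>
</div>
HTML;

// Collect libxml parse warnings internally instead of emitting them;
// real-world HTML often triggers them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the data-src attribute nodes of the <div> children of #slideshow.
// query() returns a DOMNodeList of attribute nodes; nodeValue holds each value.
$srcs = array();
foreach ($xpath->query("//div[@id='slideshow']/div/@data-src") as $attr) {
    $srcs[] = $attr->nodeValue;
}

print_r($srcs);
```

If you are starting from the raw page rather than a string, you would load the downloaded HTML into $doc instead of the heredoc. The broader //div/@data-src query also works, but it picks up every div on the page that has a data-src attribute.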