I'm using Symfony 2.8 and DomCrawler to parse a web site, and I'm having a problem reading data attributes from an HTML element. It might be as simple as a specific convention for data attributes, but I haven't been able to find any references or examples on the web that discuss how to retrieve data attributes via DomCrawler.
Here are the details:
I have encountered an instance of this construct in the HTML I am parsing (from another web site, so I can't modify this HTML):
<div class='slideshowclass' id='slideshow'>
<div data-thumb='http://www.example.com/thumbs/1.jpg'
data-src='http://www.example.com/thumbs/1.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/2.jpg'
data-src='http://www.example.com/thumbs/2.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/3.jpg'
data-src='http://www.example.com/thumbs/3.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/4.jpg'
data-src='http://www.example.com/thumbs/4.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/5.jpg'
data-src='http://www.example.com/thumbs/5.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/6.jpg'
data-src='http://www.example.com/6.jpg'></div>
</div>
I'm using this code to search the block of divs and return the data-src values:
function getList( Crawler $pWebDoc ) {
    $list = $pWebDoc->filter( 'div#slideshow');
    if ( !$list )
        return null;
    $retlist = null;
    $x = $list->count();
    if ( $x > 0 ) {
        /* @var $item Crawler */
        $retlist = $list->children()->each( function (Crawler $item, $i ) {
            return ( "$i:" . $item->attr( 'data-src' ));
        });
    }
    return ( $retlist );
}
From the DomCrawler docs, I expect the attr function to return the data-src attribute value, but it returns null. My function returns an array of 6 elements, each containing just the index number and no additional text.
Thanks in advance for your help.
This can be done easily with PHP's DOMDocument and DOMXPath classes. An XPath expression can select the attributes themselves rather than the elements that carry them, so you get the values back as a list.
This function is from Crawler.php. My experience has been that the Crawler wasn't happy with complex XPath expressions, which led me to switch from DomCrawler to using XPath and the DOM directly.
Your base XPath query would look like this:
//div/@data-src
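As a rough sketch of that approach, something like the following should work with plain DOMDocument and DOMXPath. The function name getDataSrcList and the two-item sample markup are made up for illustration, and the query is narrowed to the children of the slideshow div, which is an assumption about what you want to match:

```php
<?php
// Hypothetical sample markup, trimmed from the HTML in the question.
$html = <<<'HTML'
<div class='slideshowclass' id='slideshow'>
<div data-thumb='http://www.example.com/thumbs/1.jpg'
data-src='http://www.example.com/thumbs/1.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/2.jpg'
data-src='http://www.example.com/thumbs/2.jpg'></div>
</div>
HTML;

// Illustrative helper (not part of DomCrawler): returns the data-src
// values of the direct child divs of div#slideshow.
function getDataSrcList($html)
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them,
    // since real-world HTML is rarely perfectly formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $values = array();

    // The query selects the attribute nodes themselves, so each result
    // is a DOMAttr whose nodeValue is the attribute's text.
    foreach ($xpath->query("//div[@id='slideshow']/div/@data-src") as $attr) {
        $values[] = $attr->nodeValue;
    }

    return $values;
}

print_r(getDataSrcList($html));
```

If you already have the page as a string, you can pass it straight to the function. Dropping the [@id='slideshow'] predicate gives you the broader //div/@data-src query shown above.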