I'm using Symfony 2.8 and DomCrawler to parse a web site, and I'm having a problem reading data
attributes from an HTML element. It might be as simple as a specific convention for data
attributes, but I haven't been able to find any references or examples on the web that discuss how to retrieve data attributes via DomCrawler.
Here are the details:
I have encountered an instance of this construct in the HTML I am parsing (from another web site, so I can't modify this HTML):
<div class='slideshowclass' id='slideshow'>
<div data-thumb='http://www.example.com/thumbs/1.jpg'
data-src='http://www.example.com/thumbs/1.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/2.jpg'
data-src='http://www.example.com/thumbs/2.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/3.jpg'
data-src='http://www.example.com/thumbs/3.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/4.jpg'
data-src='http://www.example.com/thumbs/4.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/5.jpg'
data-src='http://www.example.com/thumbs/5.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/6.jpg'
data-src='http://www.example.com/6.jpg'></div>
</div>
I'm using this code to search the block of divs and return the data-src values:
use Symfony\Component\DomCrawler\Crawler;

function getList( Crawler $pWebDoc ) {
    // filter() always returns a Crawler (possibly empty), so test count() rather than truthiness.
    $list = $pWebDoc->filter( 'div#slideshow' );
    if ( $list->count() === 0 ) {
        return null;
    }
    // Build one "index:data-src" string per child div.
    return $list->children()->each( function ( Crawler $item, $i ) {
        return "$i:" . $item->attr( 'data-src' );
    });
}
From the DomCrawler docs I expect the attr function to return the data-src attribute value, but it returns null: my function returns an array of 6 elements containing just the index, with no additional text.
Thanks in advance for your help.
This can be done easily using the DOMDocument and XPath libraries. An XPath query can select the attributes themselves rather than the elements that carry them.
This function is from Crawler.php. In my experience the Crawler wasn't happy with complex XPath expressions, which led me to switch from DomCrawler to using XPath / DOM directly.
Your base XPath query would be something like:
//div/@data-src
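As a rough sketch of how that query could be run with plain DOMDocument and DOMXPath: the helper name getDataSrcList() and the shortened $html sample are made up for illustration, and the query is narrowed to the divs under #slideshow on the assumption that the markup matches the snippet in the question.

```php
<?php
// Hypothetical helper: returns the data-src values of the divs inside #slideshow.
function getDataSrcList($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely clean; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $values = array();
    // The query selects the attribute nodes themselves, not the div elements.
    foreach ($xpath->query("//div[@id='slideshow']/div/@data-src") as $attr) {
        $values[] = $attr->nodeValue;
    }
    return $values;
}

// Shortened stand-in for the page being parsed.
$html = <<<HTML
<div class='slideshowclass' id='slideshow'>
<div data-thumb='http://www.example.com/thumbs/1.jpg'
data-src='http://www.example.com/thumbs/1.jpg'></div>
<div data-thumb='http://www.example.com/thumbs/2.jpg'
data-src='http://www.example.com/thumbs/2.jpg'></div>
</div>
HTML;

print_r(getDataSrcList($html));
```

If this returns the URLs but the Crawler version still yields null, it may be worth checking that the HTML the Crawler receives really contains the data-src attributes, since some sites add them with JavaScript after the page loads.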