The 1&1 hosting service is withdrawing from Poland, my country, and has told every client to move out. Because there is no way to export the website, I need to parse it manually and retrieve the data I want.
Basically, the task is to export all articles together with their image attachments.
I'm trying to manipulate the HTML from this site: http://www.naszeiganie.org/lata-2014-2015/ so that each post ends up in its own div element, which would let me parse the whole document properly and retrieve the mixed data each article contains.
I figured out that every article starts with:
<div class="n module-type-header diyfeLiveArea ">
<h2>
<span class="diyfeDecoration">
and there is no repeatable marker for the end of an article. Instead, the next occurrence of the code above tells me that the current post ends and a new one begins.
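So the flat document would need to become something like this (simplified, hypothetical markup; titles and content are placeholders):

<div class="DIV-WRAP">
    <div class="n module-type-header diyfeLiveArea ">
        <h2><span class="diyfeDecoration">First article title</span></h2>
    </div>
    ...first article's content: paragraphs, images...
</div>
<div class="DIV-WRAP">
    <div class="n module-type-header diyfeLiveArea ">
        <h2><span class="diyfeDecoration">Second article title</span></h2>
    </div>
    ...second article's content...
</div>

Here is my attempt so far: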
function smi_parse_web() {
    $url = 'http://www.naszeiganie.org/lata-2014-2015/';
    $content = file_get_contents($url);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($content);
    libxml_clear_errors();

    $finder = new DOMXPath($doc);
    // Select the <h2> inside every article header div.
    $nodes = $finder->query('//div[contains(@class,"module-type-header")]/h2');

    foreach ($nodes as $anchor) {
        if ($anchor->nodeName == 'h2') {
            // Build a wrapper and try to move the <h2> into it.
            $element = $doc->createElement('div', 'x');
            $element->setAttribute('class', 'DIV-WRAP');
            $element->insertBefore($anchor);
        }
    }

    echo $doc->saveHTML();
}
I came up with the code above, but it has no effect; worse, the found $anchor has its content cleared out.
My target is to find all the HTML content between one div > h2 combination and the next, and wrap it in the DIV-WRAP div.
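I suspect the core mistake is that DOMNode::insertBefore() has to be called on a node that is already in the document (the future parent), while my code calls it on the detached wrapper, which just pulls the <h2> out of the tree. Here is a minimal sketch of the direction I'm considering instead, assuming the header divs and the article content are direct siblings (untested):

$finder = new DOMXPath($doc);
$headers = $finder->query('//div[contains(@class,"module-type-header")]');

foreach ($headers as $header) {
    // Attach the wrapper to the document first, right before the header.
    $wrap = $doc->createElement('div');
    $wrap->setAttribute('class', 'DIV-WRAP');
    $header->parentNode->insertBefore($wrap, $header);

    // Move the header and every following sibling into the wrapper,
    // stopping when the next article header begins.
    $node = $header;
    while ($node !== null) {
        $next = $node->nextSibling;
        $wrap->appendChild($node); // appendChild() moves an already-attached node
        if ($next instanceof DOMElement
            && strpos($next->getAttribute('class'), 'module-type-header') !== false) {
            break;
        }
        $node = $next;
    }
}

Does that look like a sane direction, or is there a simpler way?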
What would you suggest I do to move the project forward? Maybe I've gone wrong somewhere and the simplest way is right at hand?
Thanks a lot!
(I know how to deal with the images themselves, but I want them to be associated with each downloaded article.)
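(For what it's worth, once the wrappers exist, I assume each article's images could be collected with an XPath query made relative to its wrapper, something like:

$articles = $finder->query('//div[@class="DIV-WRAP"]');
foreach ($articles as $article) {
    $title = $finder->evaluate('string(.//h2)', $article);
    // The second argument restricts the query to this article's subtree.
    foreach ($finder->query('.//img', $article) as $img) {
        echo $title, ' => ', $img->getAttribute('src'), "\n";
    }
}

but that part I can handle once the wrapping works.)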