Keep line breaks when the HTML is on one line and the line layout is done with <div>

I need to extract content from a site. Specifically, I need the contents of

/html/body/div/div[2]/table/tbody/tr/td/div/div[2]/form/fieldset[2]/table[2]

or, equivalently,

<table class='properties'>

The HTML is visible here: http://paste.pocoo.org/show/347881/

I want the text of that table with each item on its own line. I don't care about padding or other formatting; I just want to keep the line breaks.

For example, a proper output would be:

tájékoztató
az eljárás eredményéről
A Közbeszerzések Tanácsa (Szerkesztőbizottsága) tölti ki
A hirdetmény kézhezvételének dátuma____________________
KÉ nyilvántartási szám_________________________________
I. SZAKASZ: AJÁNLATKÉRŐ
I.1) Név, cím és kapcsolattartási pont(ok) 

The problem is that the line breaks come from the <div> elements rather than from newline characters in the HTML, so they are lost when I extract the text.

Update

This will be executed by a PHP cron job, so there is no access to JavaScript.

There are 2 best solutions below

Best answer

There is a library called phpQuery: http://code.google.com/p/phpquery/

You can traverse the DOM much as you would with jQuery:

phpQuery::newDocument($htmlCode)->find('table.properties');

Then call strip_tags on the matched element's HTML to get the plain text content of that table.
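
One caveat with this approach: because the HTML is all on one line, strip_tags on its own would run the text of adjacent divs together. A minimal sketch of one way around that, assuming the table's HTML is already available as a string: insert a newline after every closing </div> before stripping the tags. The helper name htmlToLines and the sample markup are hypothetical, made up for illustration; they are not part of phpQuery or of the page in question.

```php
<?php

// Hypothetical helper: turn single-line HTML into text with one line per <div>.
function htmlToLines(string $html): string
{
    // Put a newline after each closing </div> so block boundaries survive strip_tags().
    $withBreaks = preg_replace('~</div\s*>~i', "</div>\n", $html);

    // Remove all tags, then decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($withBreaks), ENT_QUOTES, 'UTF-8');

    // Trim each line and drop the empty ones left behind by nested or empty divs.
    $lines = array_filter(array_map('trim', explode("\n", $text)), 'strlen');

    return implode("\n", $lines);
}

// Made-up sample standing in for the real table markup.
$sample = "<table class='properties'><tr><td><div><div>tájékoztató</div><div>az eljárás eredményéről</div></div></td></tr></table>";

echo htmlToLines($sample), "\n";
```

This relies only on built-in string functions, so it should work the same whether the HTML comes from phpQuery or from any other source.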

Other answer

The trick is to select the inner divs with an XPath expression, then read their textContent property:

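<?php

// Parse the page, collecting libxml's warnings about invalid HTML
// internally instead of printing them.
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents("..."));
libxml_use_internal_errors(false);

// Select the inner divs of the properties table. The path has no tbody step:
// browser tools add tbody to copied paths, but DOMDocument only sees it
// if the source HTML actually contains it.
$domx = new DOMXPath($domd);
$items = $domx->query("/html/body/div/div[2]/table/tr/td/div/div[2]/form/fieldset[2]/table[2]/tr/td/div//div/div[@style='padding-left: 0px;']");

// Append the text of each matched div, one per line.
$output = "";
foreach ($items as $item) {
  $output .= $item->textContent . "\n";
}

echo $output;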
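
The absolute path above stops matching as soon as the page layout shifts. As a hedged variation, not the original answer's code, the same idea can be sketched by anchoring on the class='properties' attribute from the question and taking only the innermost divs (those with no div descendants). The function name propertiesTableLines and the sample markup are hypothetical; the real page may need a different expression.

```php
<?php

// Hypothetical helper: return the text of each innermost <div> inside table.properties.
function propertiesTableLines(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors(false);

    $xpath = new DOMXPath($doc);
    // Divs under the table that contain no further divs, in document order.
    $nodes = $xpath->query("//table[@class='properties']//div[not(.//div)]");

    $lines = [];
    foreach ($nodes as $node) {
        $text = trim($node->textContent);
        if ($text !== '') {             // skip empty spacer divs
            $lines[] = $text;
        }
    }
    return $lines;
}

// Made-up sample; it declares its charset so loadHTML() reads the accented text as UTF-8.
$sample = "<html><head><meta http-equiv='Content-Type' content='text/html; charset=utf-8'></head>"
    . "<body><table class='properties'><tr><td><div>"
    . "<div>I. SZAKASZ: AJÁNLATKÉRŐ</div><div>I.1) Név, cím és kapcsolattartási pont(ok)</div>"
    . "</div></td></tr></table></body></html>";

echo implode("\n", propertiesTableLines($sample)), "\n";
```

Matching on the class is less brittle than a positional path, but it assumes the class value is exactly 'properties' and that each output line lives in its own innermost div.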