Using Tidy to clean HTML, HTML content is being changed, encoding problem?


I am fetching HTML from a Smarty template and need to clean it up: I simply want to remove extra whitespace and format/indent the HTML nicely. I'm using Tidy to do something like:


$html = $smarty->fetch('foo.tmpl');

$tidy = new tidy;
$tidy->parseString($html, array(
    'hide-comments' => TRUE,
    'output-xhtml' => TRUE,
    'indent' => TRUE,
    'wrap' => 0
));
$tidy->cleanRepair();
return $tidy;

While this works fine for English, multilingual content breaks it. For example, the Arabic characters in $html are fine, but after Tidy I get back garbled text:

هل أنت متأكد أنك تريد

Is there a setting in Tidy that will format the HTML but leave the content itself alone? I looked at this post: PHP "pretty print" HTML (not Tidy), but it seems like this won't work since I'm grabbing my HTML from Smarty.

Any suggestions appreciated.


There are 2 best solutions below


Try setting the encoding in parseString. It is passed as the third argument, after the config array.

http://www.php.net/manual/en/tidy.parsestring.php

$html = $smarty->fetch('foo.tmpl');

$tidy = new tidy;
$tidy->parseString($html, array(
    'hide-comments' => TRUE,
    'output-xhtml' => TRUE,
    'indent' => TRUE,
    'wrap' => 0
), 'raw');
$tidy->cleanRepair();
return $tidy;

Use raw as the encoding parameter (the third argument to parseString).
With raw, Tidy outputs values above 127 as-is, without translating them into entities, and all Arabic characters have code points above 127.
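
If the template output is known to be UTF-8 (an assumption, the question does not say), passing 'utf8' instead of 'raw' as the encoding may be the more explicit choice, since it tells Tidy how to interpret the bytes rather than just passing them through. A minimal sketch of that variant follows; tidyUtf8() is a hypothetical helper name, not part of the tidy extension, and the fallback when the extension is missing is just one possible choice.

```php
<?php
// Hypothetical helper (not part of the tidy extension): pretty-print HTML
// while treating both input and output as UTF-8.
function tidyUtf8(string $html): string
{
    // Assumption: if the tidy extension is not loaded, return the markup untouched.
    if (!extension_loaded('tidy')) {
        return $html;
    }

    $tidy = new tidy();
    $tidy->parseString($html, [
        'hide-comments' => true,
        'output-xhtml'  => true,
        'indent'        => true,
        'wrap'          => 0,
    ], 'utf8'); // third argument: encoding used for input and output
    $tidy->cleanRepair();

    // Return the repaired document as a string.
    return tidy_get_output($tidy);
}

// Usage: the Arabic text should come back as the same UTF-8 characters.
echo tidyUtf8('<p>هل أنت متأكد أنك تريد</p>');
```

In the original snippet this would amount to changing only the last argument of parseString from 'raw' to 'utf8'.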