UTF-8 encoding for XML with PHP and accented characters along with ENT_XML1

This has been an ongoing issue for over a year. I thought I had corrected it, but it has evolved into a monster.

I move large amounts of data, mainly text, between sites using XML generated on PHP systems. I ran into some basic XML characters that broke the transfer, so I used this code on all XML values.

$value=str_replace("'","&#039;",$value);
print '<'.$key.'>';
print htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
print '</'.$key.'>'; 

$key is the field name. This works perfectly for all data except anything containing an accent, such as piñata. A value with the ñ character comes out completely empty.

I have yet to locate a function that cleans text for XML formatting in PHP. I currently dump data from a database into this format, then load it into SimpleXML on the receiving side to write it back into a database.

A solution, either by cleaning all the data or possibly by JSON-encoding it instead of using XML, would be fantastic.

Thanks-Chris

There are 2 best solutions below

In my case, even though all my tables are set to UTF-8, I have to convert the values to UTF-8 when constructing my XML.

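// utf8_encode() assumes ISO-8859-1 input; it is deprecated as of PHP 8.2.
$value = utf8_encode($value);
print '<'.$key.'>';
// ENT_QUOTES already escapes single quotes. A separate str_replace() to
// &#039; beforehand would be double-encoded to &amp;#039; here.
print htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
print '</'.$key.'>';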

I am not sure where the encoding changes between reading from the table and writing the XML, but this has produced the results I required. I do not think Base64-encoding values with special characters is viable.
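The empty output is consistent with how htmlspecialchars() behaves: when the input is not valid in the requested encoding and neither ENT_IGNORE nor ENT_SUBSTITUTE is set, it returns an empty string. Below is a minimal sketch of a more defensive variant. It assumes the mbstring extension is available and that any value that is not already valid UTF-8 is ISO-8859-1, which may not hold for your data. The helper names toUtf8 and xmlElement are made up for this illustration.

```php
<?php
// Hypothetical helper: return $value as valid UTF-8.
// Assumption: anything that is not already valid UTF-8 is ISO-8859-1.
function toUtf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid, so do not encode it a second time
    }
    return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: build one escaped element.
// $key is assumed to already be a valid XML tag name.
function xmlElement(string $key, string $value): string
{
    // ENT_SUBSTITUTE replaces any remaining invalid sequence with U+FFFD
    // instead of letting htmlspecialchars() return an empty string.
    $flags = ENT_XML1 | ENT_QUOTES | ENT_SUBSTITUTE;
    return '<' . $key . '>'
        . htmlspecialchars(toUtf8($value), $flags, 'UTF-8')
        . '</' . $key . '>';
}

$latin1 = "pi\xF1ata"; // "piñata" as ISO-8859-1 bytes
print xmlElement('name', $latin1) . "\n";
print xmlElement('name', 'a < b') . "\n";
```

Unlike an unconditional utf8_encode(), the validity check leaves values that are already UTF-8 untouched. If the data comes from MySQL, the connection charset (for example via mysqli::set_charset()) is a common place for this kind of mismatch, though nothing above confirms that is the cause here.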

If you use an XML API (DOM, XMLWriter), it will take care of encoding issues for values/text content. Tag names are a different issue, however. You will have to either create a normalized tag name or use a fixed tag name and store the original field name as an attribute value.

For example, with the fixed tag name "field":

<records>
  <record>
    <field name="some field">some content</field>
  </record>
</records>

This is the cleaner variant: because there are no dynamic tag names, you can create a Schema/DTD and validate the XML.
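A rough sketch of how the fixed-tag-name layout could be produced with XMLWriter and read back with SimpleXML. The sample row is made up, and the strings are assumed to be valid UTF-8 already; the writer escapes markup characters but does not convert legacy encodings.

```php
<?php
// Hypothetical sample row; field names and values are made up.
$row = ['some field' => 'some content', 'name' => 'piñata & friends'];

$writer = new XMLWriter();
$writer->openMemory();
$writer->setIndent(true);
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('records');
$writer->startElement('record');
foreach ($row as $name => $value) {
    $writer->startElement('field');
    $writer->writeAttribute('name', $name); // original field name, escaped by the writer
    $writer->text($value);                  // text content, escaped by the writer
    $writer->endElement();                  // field
}
$writer->endElement(); // record
$writer->endElement(); // records
$writer->endDocument();
$xml = $writer->outputMemory();
print $xml;

// Receiving side: SimpleXML hands back the original strings.
$records = simplexml_load_string($xml);
foreach ($records->record->field as $field) {
    print $field['name'] . ' = ' . $field . "\n";
}
```

Because the field name travels as an attribute value, it can contain spaces or other characters that would not be allowed in a tag name.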

Or a normalized version of the field name:

<records>
  <record>
    <some-field>some content</some-field>
  </record>
</records>

This is often used as a generic way to serialize a data structure as XML. The result is only well-formed XML; you cannot define a Schema/XSD for it because the tag names depend on the data.
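One possible way to normalize field names, sketched with DOM. The function normalizeTagName and its ASCII-only rule are assumptions for illustration; the XML Name production allows more characters than this, and the sample row is made up.

```php
<?php
// Hypothetical helper: turn an arbitrary field name into a usable tag name.
// Deliberately conservative: only a-z, 0-9, "_" and "-" survive.
function normalizeTagName(string $name): string
{
    // Lowercase, then collapse every run of other characters into one hyphen.
    $tag = preg_replace('/[^a-z0-9_]+/', '-', strtolower($name));
    $tag = trim($tag, '-');
    // A tag name must not be empty and must not start with a digit.
    if ($tag === '' || !preg_match('/^[a-z_]/', $tag)) {
        $tag = '_' . $tag;
    }
    return $tag;
}

$document = new DOMDocument('1.0', 'UTF-8');
$document->formatOutput = true;
$records = $document->appendChild($document->createElement('records'));
$record = $records->appendChild($document->createElement('record'));

// Hypothetical sample row.
foreach (['some field' => 'some content', '2nd value' => 'a < b'] as $name => $value) {
    $element = $record->appendChild($document->createElement(normalizeTagName($name)));
    // Text assigned through textContent is escaped when the document is serialized.
    $element->textContent = $value;
}
print $document->saveXML();
```

Note that two different field names can normalize to the same tag name (for example "some field" and "some-field"), so this variant loses information that the attribute-based layout keeps.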