How to clean element names in an XML string


I have an XML string in which element names can contain one or more spaces and periods (.).

The XML:
    $xml='<?xml version="1.0" encoding="UTF-8"?>  
 <xmldata>  
  <SalesHeader>  
      <DocType>Order</DocType>  
      <No>1002</No>  
      <SellToCustomerNo>CustNo</SellToCustomerNo>  
      <SellToCustomerName>Customer Name</SellToCustomerName>  
      <SellToCustomerName2 />   
      <SellToEmail>[email protected]</SellToEmail>  
      <OrderDate>04/03/13</OrderDate>  
      <ExtDocNo />  
      <ShipToName>Customer Ship to</ShipToName>  
      <ShipToCountry />  
      <TaxLiable>No</TaxLiable>  
      <TaxAreaCode />  
      <RequestedDeliveryDate />  
      <Shipping Agent>UPS</Shipping Agent>  
      <Shipping Agent Service>Ground New</Shipping Agent Service>  
      <Tracking Numbers>123123212,1231231321</Tracking Numbers>  
      <SalesLine>  
        <ItemNo.>12-34343-23</ItemNo.>  
        <Description>Item Description</Description>  
        <Quantity>1</Quantity>  
        <UnitPrice>79.00</UnitPrice>  
      </SalesLine>  
      <SalesLine>  
        <ItemNo.>12-34343-23</ItemNo.>  
        <Description>Item Description</Description>  
        <Quantity>1</Quantity>  
        <UnitPrice>79.00</UnitPrice>  
      </SalesLine>  
  </SalesHeader>  
 </xmldata>';

My code:

preg_replace(array('/(<\/?)[. ]+(\w*)(\/?>)/','/(<\/?)(\w*)[. ]+(\/?>)/','/(<\/?)(\w*)[. ]+(\w*\/?>)/'),array('$1$2$3','$1$2$3','$1$2$3'),$xml);

So far I have only managed to delete a single space or period using preg_replace. What I want is to delete the periods (.) and replace the spaces with underscores (_), even when a tag contains several periods and/or spaces, in any position.

I want to get this:

Change:
<ItemNo.>12-34343-23</ItemNo.>
to:
<ItemNo>12-34343-23</ItemNo>

Change:
<Shipping Agent>UPS</Shipping Agent>
to:
<Shipping_Agent>UPS</Shipping_Agent>

Change:
<Shipping Agent Service>Ground New</Shipping Agent Service>
to:
<Shipping_Agent_Service>Ground New</Shipping_Agent_Service>

There are 3 answers below.

Best answer

Well, I solved the problem myself. This is the code:

$xml='<?xml version="1.0" encoding="UTF-8"?>  
 <xmldata xmlns="http://some.uri.com">  
  <SalesHeader>  
      <DocType name="sample">Order</DocType>  
      <No>1002</No>  
      <SellToCustomerNo>CustNo</SellToCustomerNo>  
      <SellToCustomerName>Customer Name</SellToCustomerName>  
      <SellToCustomerName2 />   
      <SellToEmail>[email protected]</SellToEmail>  
      <OrderDate>04/03/13</OrderDate>  
      <ExtDocNo />  
      <ShipToName>Customer Ship to</ShipToName>  
      <ShipToCountry />  
      <TaxLiable>No</TaxLiable>  
      <TaxAreaCode />  
      <RequestedDeliveryDate />  
      <Shipping Agent>UPS</Shipping Agent>  
      <Shipping Agent Service>Ground New</Shipping Agent Service>  
      <Tracking Numbers>123123212,1231231321</Tracking Numbers>  
      <SalesLine>  
        <ItemNo.>12-34343-23</ItemNo.>  
        <Description>Item Description</Description>  
        <Quantity>1</Quantity>  
        <UnitPrice>79.00</UnitPrice>  
      </SalesLine>  
      <SalesLine>  
        <ItemNo.>12-34343-23</ItemNo.>  
        <Description>Item Description</Description>  
        <Quantity>1</Quantity>  
        <UnitPrice>79.00</UnitPrice>  
      </SalesLine>  
  </SalesHeader>  
 </xmldata>';

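function xmlcleaner($data){
    try{
        // Run every tag (anything between < and >) through the callback.
        $xml_clean = preg_replace_callback('/(<\/?[^><]+\/?>)/',function($match){
            // Inside a tag: delete periods, then turn whitespace into "_" unless it
            // precedes "/", ">" or something that looks like an attribute (name=value).
            // Caveat: periods inside attribute values (e.g. the xmlns URI) are removed too.
            return preg_replace(array('/\./','/\s(?!\/|\>|\w+=\S+)/'),array('','_'),$match[0]);
        },$data['xml']);
        if(!empty($data['head'])){
            // The declaration was altered as well (version="1.0" became "10"),
            // so drop it and prepend the supplied one.
            $xml_clean = preg_replace('/<\?.+\?>/','',$xml_clean);
            $xml_clean = $data['head'].$xml_clean;
        }
        //now work with SimpleXMLElement
        $result = new \SimpleXMLElement((string)$xml_clean);
        return $result;
    }catch(\Exception $e){
        // On a parse failure the caller gets the error message instead of an object.
        return $e->getMessage();
    }
}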
$xml_clean = xmlcleaner(array(
    'xml'=>$xml,
    'head'=>'<?xml version="1.0" encoding="utf-8"?>'
));
print('<pre>');
print_r($xml_clean);
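
One thing to watch with the function above: it removes every period inside a tag, so with this sample the xmlns value http://some.uri.com loses its dots too. Below is a rough sketch of a variant that only rewrites the element name and copies attributes through unchanged. The name cleanElementNames() is made up for illustration, and it assumes attributes are always written as name="value" or name='value'. It also collapses a run of spaces into a single underscore, which is my reading of the requirement rather than something stated in the question.

```php
<?php
// Hypothetical helper (not part of the answer above): cleans element names only.
function cleanElementNames(string $xml): string
{
    // Group 1: optional "/" of a closing tag
    // Group 2: raw element name (may contain spaces and periods)
    // Group 3: zero or more quoted attributes, kept verbatim
    // Group 4: optional "/" of a self-closing tag
    // "?" and "!" are excluded from the name, so <?xml ...?> and <!-- --> are skipped.
    $tag = '~<(/?)([^<>=?!/"\']+?)((?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|\'[^\']*\'))*)\s*(/?)>~';

    return preg_replace_callback($tag, function (array $m): string {
        $name = str_replace('.', '', trim($m[2]));   // drop periods
        $name = preg_replace('/\s+/', '_', $name);   // whitespace runs become "_"
        return '<' . $m[1] . $name . $m[3] . ($m[4] === '/' ? ' /' : '') . '>';
    }, $xml);
}

$dirty = '<xmldata xmlns="http://some.uri.com">'
       . '<DocType name="foo.baz bang">Order</DocType>'
       . '<Shipping Agent Service>Ground New</Shipping Agent Service>'
       . '<ItemNo.>12-34343-23</ItemNo.>'
       . '<ExtDocNo />'
       . '</xmldata>';

$clean = cleanElementNames($dirty);
echo $clean, "\n";
```

The cleaned string can then be passed to new \SimpleXMLElement() as in the function above. Since the XML declaration is left alone here, the 'head' workaround should not be needed.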

Another answer

I don't think you'll have much luck coming up with a good regex for this. Even if you could, the spaces in particular are worrisome. Consider the following valid nodes:

<shipper name='baz' />
<shipper name='foo baz bang' />
<shipper name='foo.baz' />
<shipper.name />

Compared to nodes you want to correct:

<ship to name />
<ship. />

I think what you'd want to do is split the document on a regex that matches a tag, capturing the tags so that they are kept in the result, such as

$xmlParts = preg_split("/(<[^>]+>)/", $xml, -1, PREG_SPLIT_DELIM_CAPTURE);

You could then iterate through $xmlParts. If a part matches that same regex, it's an XML tag, and you could do some validation on it: check whether its spaces should be replaced with _ (because they don't separate an attribute name or value) and whether its periods should be removed (because they're not part of an attribute value). After replacing the invalid characters, append the tag to a new XML variable.

If it doesn't match the regex, assume it's content and just append it.

With all of that said, it'd be a lot easier if you could get whatever's providing you with this "XML" to provide you with valid XML to begin with...
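
For what it's worth, the split-and-rebuild idea could be sketched roughly like this. The helper names fixTag() and rebuildXml() are hypothetical. The "validation" is deliberately crude: a tag containing = is assumed to carry attributes and is left untouched, and declarations and comments are skipped. Note that preg_split() only keeps the tags when the pattern has a capture group and PREG_SPLIT_DELIM_CAPTURE is passed.

```php
<?php
// Hypothetical helper: repairs the name of a tag that has no attributes.
function fixTag(string $tag): string
{
    // Simplification: skip <?...?>, <!...> and any tag that contains "=".
    if ($tag[1] === '?' || $tag[1] === '!' || strpos($tag, '=') !== false) {
        return $tag;
    }
    // Peel off "<" or "</" and ">" or "/>" so that only the name remains.
    preg_match('~^(</?)(.*?)\s*(/?>)$~s', $tag, $m);
    $name = preg_replace('/\s+/', '_', str_replace('.', '', trim($m[2])));
    return $m[1] . $name . ($m[3] === '/>' ? ' />' : '>');
}

// Hypothetical helper: splits the document into text and tags, then rebuilds it.
function rebuildXml(string $xml): string
{
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE the result alternates:
    // text, tag, text, tag, ... so odd indexes are always tags.
    $parts = preg_split('/(<[^>]+>)/', $xml, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        $out .= ($i % 2 === 1) ? fixTag($part) : $part;
    }
    return $out;
}

echo rebuildXml("<ship to name />\n<ship.>x</ship.>\n<shipper name='foo baz bang' />\n");
```

A tag that has both a broken name and attributes would slip through this version unchanged, which is the kind of case that makes a general regex hard to get right.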

Another answer

I assume your XML text has a well-defined structure. In this case there are only a few invalid element names and all of them are known in advance.

The best solution to your problem is to create a list of replacements (wrong value => correct value) and use str_replace() to fix your XML text before parsing it with simplexml_load_string() or SimpleXMLElement:

$replacements = array(
    '<Shipping Agent>'  => '<Shipping_Agent>',
    '</Shipping Agent>' => '</Shipping_Agent>',
    '<Shipping Agent Service>'  => '<Shipping_Agent_Service>',
    '</Shipping Agent Service>' => '</Shipping_Agent_Service>',
    '<Tracking Numbers>'  => '<Tracking_Numbers>',
    '</Tracking Numbers>' => '</Tracking_Numbers>',
    '<ItemNo.>'  => '<ItemNo>',
    '</ItemNo.>' => '</ItemNo>',
);

$xml = str_replace(array_keys($replacements), array_values($replacements), $xml);

$result = new \SimpleXMLElement($xml);

Why is this the best solution?

  • It is clear to other programmers at first glance what changes are made to the input string.
  • It leaves little room for mistakes. If the format of the input string changes (new badly formed element names appear), it is easy to add the wrong opening and closing tags together with their correct forms, and the code keeps working without careful retesting. By contrast, if a new invalid element name breaks the XML naming rules in a different way, changing the regexes requires close attention and extensive testing.
  • It runs much faster than your function xmlcleaner() because it uses a single call to str_replace(), while xmlcleaner() calls preg_replace() multiple times; preg_replace() is slower than str_replace() to begin with.
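
If the list grows, the opening and closing forms could be generated from a single map of names, so each element only has to be listed once. This is just a sketch: the $names map and the shortened sample string are made up for illustration, and it only covers plain <Name> and </Name> tags (no attributes, no self-closing form).

```php
<?php
// Hypothetical map: invalid element name => valid element name.
$names = array(
    'Shipping Agent'         => 'Shipping_Agent',
    'Shipping Agent Service' => 'Shipping_Agent_Service',
    'Tracking Numbers'       => 'Tracking_Numbers',
    'ItemNo.'                => 'ItemNo',
);

// Expand each name into its opening and closing tag.
$replacements = array();
foreach ($names as $bad => $good) {
    $replacements["<$bad>"]  = "<$good>";
    $replacements["</$bad>"] = "</$good>";
}

// Shortened sample input, for illustration only.
$xml = '<SalesHeader><Shipping Agent>UPS</Shipping Agent>'
     . '<Shipping Agent Service>Ground New</Shipping Agent Service>'
     . '<ItemNo.>12-34343-23</ItemNo.></SalesHeader>';

$fixed = str_replace(array_keys($replacements), array_values($replacements), $xml);
echo $fixed, "\n";
```

Because every search string includes the angle brackets, <Shipping Agent> cannot match inside <Shipping Agent Service>, so the order of the entries does not matter here.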