I am working on a requirement where our application receives an external SOAP XML response that is written to a file as is. A particular element within the SOAP response holds embedded XML in which '<' and '>' are escaped as the HTML entities '&lt;' and '&gt;'. The goal is
- to replace all the escaped characters with '<' and '>'
- to parse each element, decode the embedded base64-encoded PDF, and combine all decoded page data into a single PDF file.
The soap response looks like this:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<RequestResponse>
<RequestResult>&lt;myAPI xmlns="http://integration.myapi.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://integration.myapi.com C:\MessageSet\DocumentInquiry.xsd"&gt;&lt;myHeader Version="1.1"&gt;test Header&lt;/myHeader&gt;&lt;Page Number="1" Format="PDF" &gt;&lt;Value&gt;JVBERi0xLjYNJeLjz9MNCjI0IDAgb2Jq==&lt;/Value&gt;&lt;/Page&gt;&lt;Page Number="2" Format="PDF" &gt;&lt;Value&gt;JVBERi0xLjYNJeLjz9MNCjI0IDAgb2Jq==&lt;/Value&gt;&lt;/Page&gt;&lt;/myAPI&gt;
</RequestResult>
</RequestResponse>
</soap:Body>
</soap:Envelope>
The embedded XML within RequestResult has '<' and '>' escaped as the HTML entities '&lt;' and '&gt;'. After unescaping them, the parsed response should look like
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<RequestResponse>
<RequestResult>
<myAPI xmlns="http://integration.myapi.com" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://integration.myapi.com C:\MessageSet\DocumentInquiry.xsd">
<myHeader Version="1.1">test Header</myHeader>
<Page Number="1" Format="PDF" >
<Value>JVBERi0xLjYNJeLjz9MNCjI0IDAgb2Jq==</Value>
</Page>
<Page Number="2" Format="PDF">
<Value>JVBERi0xLjYNJeLjz9MNCjI0IDAgb2Jq==</Value>
</Page>
</myAPI>
</RequestResult>
</RequestResponse>
</soap:Body>
</soap:Envelope>
The parsed response could have multiple Page elements, each holding a base64-encoded PDF data string.
Here's what I've come up with so far for the Perl script:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use MIME::Base64;
use Data::Dumper qw(Dumper);
my ($filename, $downloadLocation) = @ARGV;
if (not defined $filename) {
die "Need file name\n";
exit;
}
if (not defined $downloadLocation) {
die "Need download location\n";
exit;
}
# Take a back up of the file received
rename($filename, $filename.'.bak');
open(my $infh,'<:encoding(utf-8)', $filename.'.bak') or die "Error opening $filename: $!";
open(my $outfh,'>:encoding(utf-8)', $filename) or die "Error opening $filename: $!";
while(<$infh>)
{
# replace &lt; with < and &gt; with >
$_ =~ s/&lt;/</g;
$_ =~ s/&gt;/>/g;
print $outfh $_;
}
close($infh);
close($outfh);
# create a twig for elements that hold pdf data
my $t= XML::Twig->new(
keep_spaces => 1,
keep_encoding => 1,
KeepEncoding => 1,
twig_roots => { 'Page[@Format]/Value' => \&decode_n_purge },
);
$t->parse($filename);
sub decode_n_purge {
my( $t, $elt)= @_;
my $epoc = time();
# Use the open() function to create the file where pdf data will be written.
unless(open DEST_FILE, '>'.$downloadLocation/$epoc.pdf) {
# Die with error message
# if we can't open it.
die "\nUnable to create $downloadLocation\n";
}
binmode DEST_FILE;
my $buf;
open(FILE, $filename) or die "$!";
# write decoded pdf data to the destination in chunks
while (read(FILE, $buf, 4000*57)) {
print DEST_FILE decode_base64($buf);
}
close FILE;
close DEST_FILE;
$t->purge; # frees the memory
}
The problem: after running this script on the received SOAP response file, I get this error:
not well-formed (invalid token) at line 1, column 2, byte 2 at /usr/lib/perl5/vendor_perl/5.30/x86_64-cygwin-threads/XML/Parser.pm line 187.
And it points to this line from my script:
$t->parse($filename);
I suspect that after replacing the encoded HTML entities, the edited file loses its original encoding, which is why I included KeepEncoding in my twig definition. But I still get the invalid token error. Also, if I open the edited file (with the HTML entities decoded) in a browser, it renders fine with no visible errors.
Any ideas what could be wrong with the edited file? Appreciate any pointers.
You use the statement $t->parse($variable), but this is used for parsing a string. If you want to parse a file, you need to use $t->parsefile($filename). In this version of your script you're parsing the filename as if it were an XML document, but it isn't, hence the invalid token error.
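To illustrate the difference, here is a minimal sketch rather than a drop-in replacement for your script. It assumes the entities have already been unescaped, uses a small made-up sample document in place of your real response, and introduces hypothetical helper names (make_twig, collect_page). It also reads the base64 text from the element passed to the handler via $elt->text instead of re-reading the whole file, which I am assuming is what you intended.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
use MIME::Base64 qw(decode_base64);
use File::Temp qw(tempfile);

# Hypothetical sample standing in for the already-unescaped response.
my $xml = <<'XML';
<?xml version="1.0" encoding="utf-8"?>
<myAPI xmlns="http://integration.myapi.com">
<Page Number="1" Format="PDF"><Value>JVBERi0xLjY=</Value></Page>
<Page Number="2" Format="PDF"><Value>JVBERi0xLjY=</Value></Page>
</myAPI>
XML

# Write the sample to a temporary file so there is a real path to parse.
my ($fh, $path) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$fh} $xml;
close $fh;

my @pages;    # decoded pages, in document order

# Handler called once for each completed Value element.
sub collect_page {
    my ($t, $elt) = @_;
    my $page = $elt->parent;
    return unless ($page->att('Format') // '') eq 'PDF';
    push @pages, {
        number => $page->att('Number'),
        bytes  => decode_base64($elt->text),
    };
    $t->purge;    # release what has been parsed so far
}

sub make_twig {
    return XML::Twig->new(twig_handlers => { Value => \&collect_page });
}

# parse() expects XML content, so handing it a file name dies with a
# "not well-formed" error, as in the question.
eval { make_twig()->parse($path); 1 }
    or print "parse() on a file name failed: $@";

# parsefile() expects a path and reads the document from that file.
make_twig()->parsefile($path);

printf "page %s: %d bytes decoded\n", $_->{number}, length $_->{bytes}
    for @pages;
```

If you already have the XML in a scalar, $t->parse($xml) is the right call; parsefile is only for the case where you have a path.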