How to compare text between XML tags using Perl


I have XML data like this:

 <ce:affiliation id="aff1">
 <ce:label>a</ce:label>
 <ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn>
  <sa:affiliation>
 <sa:organization>Department of Urology</sa:organization>
 <sa:organization>Radboud University Nijmegen Medical Center</sa:organization>
 <sa:city>Nijmegen</sa:city>
 </sa:affiliation>

and so on.

Now I want to read the text inside "sa:affiliation". While reading, I first take the text of each tag inside sa:affiliation and join it with ", " to build a string such as "Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen". I then compare this string with the text inside "ce:textfn" ... "/ce:textfn".

I need to compare each ce:affiliation tag with its sa:affiliation in this way for multiple files, and report any mismatch to the user.


There are 4 best solutions below

4
On

Your question is a bit vague. It is not clear where each fragment of XML goes. One file? Several files? One fragment per file, or several? If the data is in several files, how do you link a ce:affiliation element with the corresponding sa:affiliation, especially if what you are checking is whether the two texts match? Why is there no country in sa:affiliation? Where are the namespaces declared?

Assuming the 2 pieces of data are in 2 files, and the namespace prefixes do not change:

#!/usr/bin/perl

use strict;

use warnings;

use XML::Twig;
use Test::More;

my $DEFAULT_COUNTRY= "The Netherlands";

# usage is <tool> <ce file> <sa file>
my( $ce_file, $sa_file)= @ARGV;

my $ce= XML::Twig->new->parsefile( $ce_file)->root;
my $ce_text = $ce->field( 'ce:textfn');

my $sa= XML::Twig->new->parsefile( $sa_file)->root;

# add the country if not present
if( ! $sa->first_child( 'sa:country')) 
  { $sa->insert_new_elt( last_child => 'sa:country' => $DEFAULT_COUNTRY); }

my $sa_text= join( ', ', $sa->children_text);

is( $ce_text, $sa_text, "checking " . $ce->id);

done_testing();
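
If the sa:affiliation element is instead nested inside each ce:affiliation in a single file (an assumption; the question does not say so explicitly), the same idea might be sketched as below. The check_affiliations helper, the doc root element, and the sample string are made up for illustration, and the sa:affiliation is kept on one line so that no whitespace text nodes appear between its children.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: returns one message per ce:affiliation that fails the check.
sub check_affiliations {
    my ($xml) = @_;
    my @report;
    my $root = XML::Twig->new->parse($xml)->root;
    for my $aff ( $root->descendants('ce:affiliation') ) {
        my $id = $aff->att('id');
        my $sa = $aff->first_child('sa:affiliation');
        if ( !$sa ) {
            push @report, "$id: sa:affiliation missing";
            next;
        }
        my $ce_text = $aff->first_child_text('ce:textfn');
        # Join the text of every child of sa:affiliation, whatever its tag.
        my $sa_text = join ', ', $sa->children_text;
        push @report, "$id: mismatch" if $ce_text ne $sa_text;
    }
    return @report;
}

# Made-up sample: aff1 matches, aff2 has no sa:affiliation.
my $xml = '<doc>'
    . '<ce:affiliation id="aff1"><ce:textfn>Stanford University, Stanford</ce:textfn>'
    . '<sa:affiliation><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city></sa:affiliation></ce:affiliation>'
    . '<ce:affiliation id="aff2"><ce:textfn>Norris Center, Los Angeles</ce:textfn></ce:affiliation>'
    . '</doc>';

print "$_\n" for check_affiliations($xml);
```

For several files, the same helper could be called once per file after reading its content, prefixing each message with the file name.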
2
On

You can use XML::XPath to find the nodes you want. Then just check whether the two nodes' string_value results differ, using Perl's string inequality operator ne.

2
On

I eventually came up with this code, but is there a way to pick up the ce:affiliation and sa:affiliation text without using if/else conditions? It fails in some cases.

#!/usr/bin/perl
@files = <*.xml>;
open my $out, '>', 'output.xml' or die $!;
foreach $file (@files) {
    open (FILE, "$file");
    $a = 1;
    while (my $line = <FILE>) {
        do {
            if ($line =~ /<ce:affiliation id=\"aff$a\">(.+?)<ce:textfn>(.+?)<\/ce:textfn><sa:affiliation>(.+?)<\/sa:affiliation><\/ce:affiliation>/) {
                $count = $3;
                $textfn = $2;
                print ("$count\n");
                print ("$textfn\n");
                if ($count =~ /<\/sa:(.+?)>/) {
                    $count =~ s/<\/sa:organization>/, /g;
                    $count =~ s/<\/sa:city>/, /g;
                    $count =~ s/<\/sa:country>/, /g;
                    $count =~ s/<\/sa:state>/, /g;
                    $count =~ s/<sa:organization>//g;
                    $count =~ s/<sa:city>//g;
                    $count =~ s/<sa:country>//g;
                    $count =~ s/<sa:state>//g;
                    chop($count);
                    chop($count);
                    if ($count ne $textfn) {
                        print $out("$file affilliation $a is mismatch\n");
                    }
                }
            }
            else {
                if ($line =~ /<ce:affiliation id=\"aff$a\">(.+?)<ce:textfn>(.+?)<\/ce:textfn><\/ce:affiliation>/) {
                    print $out("$file sa:affilliation missing for $a\n");
                }
            }
            $a = $a + 1;
        } while ($line =~ /aff$a/);
    }
}

For the following XML I am getting the wrong result:

 <ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city><sa:country>The Netherlands</sa:country></sa:affiliation></ce:affiliation><ce:affiliation id="aff2"><ce:textfn>Norris Comprehensive Cancer Center, University of Southern California Institute of Urology, Los Angeles, California</ce:textfn></ce:affiliation><ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation><ce:correspondence id="cor1"></article>
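
One likely cause, sketched here on a shortened and made-up copy of the data: "(.+?)" has to consume at least one character, so when aff2 has no ce:label the pattern for aff2 can only succeed by running on into aff3 and picking up its ce:textfn and sa:affiliation. The variable $n stands in for the script's counter.

```perl
use strict;
use warnings;

# Shortened, hypothetical sample: aff2 has no ce:label and no sa:affiliation.
my $line =
      '<ce:affiliation id="aff2"><ce:textfn>Norris Center, Los Angeles</ce:textfn></ce:affiliation>'
    . '<ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Stanford University, Stanford</ce:textfn>'
    . '<sa:affiliation><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city></sa:affiliation></ce:affiliation>';

my $n = 2;
my ( $textfn, $sa ) = ( '', '' );

# Same pattern as in the script above, tried for affiliation number $n.
if ( $line =~ /<ce:affiliation id="aff$n">(.+?)<ce:textfn>(.+?)<\/ce:textfn><sa:affiliation>(.+?)<\/sa:affiliation><\/ce:affiliation>/ ) {
    ( $textfn, $sa ) = ( $2, $3 );
}

# Prints aff3's text even though aff2 was asked for.
print "textfn captured for aff$n: $textfn\n";
```

Because the match succeeds with aff3's data, the else branch that should report the missing sa:affiliation for aff2 is never reached.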
0
On

I finally got the required output.

#!/usr/bin/perl
@files = <*.xml>;
open my $out, '>', 'output.xml' or die $!;
foreach $file (@files) {
    open (FILE, "$file");
    my $a = 1;
    while (my $line = <FILE>) {
        do {
            if ($line =~ /<ce:affiliation id=\"aff$a\">(.+?)<\/ce:affiliation>/) {
                $count = $1;
                if ($count =~ /<ce:label>/) {
                    $count =~ s/<ce:label>(.+?)<\/ce:label>//;
                }
                if ($count =~ /<sa:affiliation>/) {
                    if ($count =~ /<ce:textfn>(.+?)<\/ce:textfn><sa:affiliation>(.+?)<\/sa:affiliation>/) {
                        $textfn = $1;
                        $sff = $2;
                        $sff =~ s/<\/sa:organization>/, /g;
                        $sff =~ s/<\/sa:city>/, /g;
                        $sff =~ s/<\/sa:country>/, /g;
                        $sff =~ s/<\/sa:state>/, /g;
                        $sff =~ s/<sa:organization>//g;
                        $sff =~ s/<sa:city>//g;
                        $sff =~ s/<sa:country>//g;
                        $sff =~ s/<sa:state>//g;
                        chop($sff);
                        chop($sff);
                    }
                    if ($textfn ne $sff) {
                        print $out("$file ce:aff and sa:aff  mismatch in aff$a\n");
                    }
                    if ($textfn =~ /<ce:sup>/) {
                        print $out("$file check label aff$a\n");
                    }
                }
                else {
                    if ($line =~ /\"art520.dtd\"/) {
                        print $out("$file strct affilition missing for aff$a\n");
                    }
                }
            }
            $a = $a + 1;
        } while ($line =~ /aff$a/);
    }
}
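
The approach above (capture each ce:affiliation on its own, then look inside it) could also be written without the counter and without one substitution per tag name. This is only a sketch: the check_affiliations helper and the sample data are made up, and it assumes that id is the only attribute on ce:affiliation, that the sa:* children carry no attributes, and that elements are not nested inside themselves.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one message per ce:affiliation that fails the check.
sub check_affiliations {
    my ($xml) = @_;
    my @report;
    # Visit each ce:affiliation separately; /s lets "." cross newlines.
    while ( $xml =~ m{<ce:affiliation id="([^"]+)">(.*?)</ce:affiliation>}sg ) {
        my ( $id, $body ) = ( $1, $2 );
        my ($textfn) = $body =~ m{<ce:textfn>(.*?)</ce:textfn>}s;
        my ($sa)     = $body =~ m{<sa:affiliation>(.*?)</sa:affiliation>}s;
        if ( !defined $sa ) {
            push @report, "$id: sa:affiliation missing";
            next;
        }
        # Collect the text of every sa:* child, whatever its tag name.
        my @parts  = $sa =~ m{<sa:[\w-]+>(.*?)</sa:[\w-]+>}sg;
        my $joined = join ', ', @parts;
        push @report, "$id: mismatch"
            if !defined $textfn || $textfn ne $joined;
    }
    return @report;
}

# Made-up sample: aff1 lacks the country in sa:affiliation, aff2 has no sa:affiliation, aff3 matches.
my $xml = <<'XML';
<doc>
<ce:affiliation id="aff1"><ce:label>a</ce:label><ce:textfn>Department of Urology, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Radboud University Nijmegen Medical Center</sa:organization><sa:city>Nijmegen</sa:city></sa:affiliation></ce:affiliation>
<ce:affiliation id="aff2"><ce:textfn>Norris Comprehensive Cancer Center, Los Angeles, California</ce:textfn></ce:affiliation>
<ce:affiliation id="aff3"><ce:label>c</ce:label><ce:textfn>Department of Urology, Stanford University, Stanford, California</ce:textfn><sa:affiliation><sa:organization>Department of Urology</sa:organization><sa:organization>Stanford University</sa:organization><sa:city>Stanford</sa:city><sa:state>California</sa:state></sa:affiliation></ce:affiliation>
</doc>
XML

print "$_\n" for check_affiliations($xml);
```

On this sample it reports a mismatch for aff1 and a missing sa:affiliation for aff2. Regular expressions remain fragile for XML, so a real parser is the more robust choice when one is available.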