The script below works. It parses an XML document and looks up a particular node under the namespace "dei".
But is relying on a regex to extract the namespace definition the proper way? (I do not really know XML, so I worry that such a regex is not foolproof for all EDGAR XML files. For example, are such definitions always enclosed in double quotes and preceded by xmlns:?)
Thanks.
use strict;
use warnings;
use LWP::Simple;
use XML::LibXML;
use XML::LibXML::XPathContext;
my $url = 'https://www.sec.gov/Archives/edgar/data/1057051/000119312517099664/acef-20161231.xml';
my $xml = LWP::Simple::get($url);
my $dom = XML::LibXML->load_xml(string => $xml);
my @nsDefs = ($xml =~ /xmlns:dei="(.+?)"/g);
die "Namespace definition must be unique!\n" unless @nsDefs == 1;
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('dei', $nsDefs[0]);
my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Number of matches = ', scalar(@matches), "\n";
Output:
Number of matches = 1
The only important thing about a namespace in XML is the URI. Your code assumes a namespace prefix of dei, uses that prefix to locate the namespace declaration, and from it determines that the URI is http://xbrl.sec.gov/dei/2014-01-31. This is exactly backwards. The thing you should be hard-coding in your script is the URI - it won't change. The namespace prefix is theoretically variable, and a different prefix might be used for the same URI in other documents.
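As a rough sketch of what that could look like: hard-code the URI and register whatever prefix you like for use in your XPath expressions. The inline sample document, its "d" prefix, and the "XYZ" value are made up for illustration, so the example runs without fetching anything; the URI is the one from your filing.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hard-code the namespace URI (the one found in the filing from the question).
my $DEI_URI = 'http://xbrl.sec.gov/dei/2014-01-31';

# Hypothetical sample: the same URI bound to a different prefix ("d"),
# and declared with single quotes, which is also legal XML.
my $xml = <<'XML';
<xbrl xmlns:d='http://xbrl.sec.gov/dei/2014-01-31'>
  <d:TradingSymbol>XYZ</d:TradingSymbol>
</xbrl>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);

# The prefix registered here only matters inside our own XPath expressions;
# it does not have to match the prefix used in the document.
$xpc->registerNs('dei', $DEI_URI);

my @matches = $xpc->findnodes('//dei:TradingSymbol');
print 'Number of matches = ', scalar(@matches), "\n";
```

The lookup still finds the node even though the document never uses the prefix dei, and no regex over the raw XML text is needed.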