As a data analyst, I am constantly running across files with structured data that are in some proprietary format and resist normal XML parsing.
For example, I have an archive of about a hundred documents that all begin with this:
<!DOCTYPE DOCUMENT PUBLIC "-//Gale Research//DTD Document V2.0//EN">
I have included an abridged example of the document below; don't read it if you're offended by cloning.
At any rate, is there a way to query this without having the DTD or namespace or URI or whatever it is I need? I'm OK using SQL Server 2012+ or XQuery or, I dunno, PHP or VBA.
<!DOCTYPE DOCUMENT PUBLIC "-//Gale Research//DTD Document V2.0//EN">
<document synfileid="MCIESS0044">
<galedata><project>
<projectname>
<title>Opposing Viewpoints Resource Center</title>
</projectname>
</project></galedata>
<doc.head>
<title>Cloning</title>
</doc.head>
<doc.body>
<para>A clone is an identical copy of a plant or animal, produced from the genetic material of a single organism. In 1996 scientists in Britain created a sheep named Dolly, the first successful clone of an adult mammal. Since then, scientists have successfully cloned other animals, such as goats, mice, pigs, and rabbits. People began wondering if human beings would be next. The question of whether human cloning should be allowed, and under what conditions, raises a number of challenging scientific, legal, and ethical issues&#x2014;including what it means to be human.</para>
<head n="1">Scientific Background</head>
<para>People have been cloning plants for thousands of years. Some plants produce offspring without any genetic material from another organism. In these cases, cloning simply requires cutting pieces of the stems, roots, or leaves of the plants and then planting the cuttings. The cuttings will grow into identical copies of the originals. Many common fruits, vegetables, and ornamental plants are produced in this way from parent plants with especially desirable characteristics.</para>
<para>[lots of excluded text] Perhaps the most perplexing question of all: How would clones feel about their status? As a copy, would they lack the sense of uniqueness that is part of the human condition? As yet, such questions have no answers&#x2014;perhaps they never will. The debate about cloning, both animal and human, however, will certainly continue. The technology exists to create clones. How will society use this technology?</para>
</doc.body>
</document>
Your SGML input data is almost XML. The one difference is that XML, unlike full SGML, always requires a system identifier (a file name or URL) for the external DTD subset, and not just a public identifier such as
-//Gale Research//DTD Document V2.0//EN
So the DOCTYPE declaration must take this form in XML:
<!DOCTYPE DOCUMENT PUBLIC "-//Gale Research//DTD Document V2.0//EN" "test.dtd">
where I've added "test.dtd" as the system identifier (file name) of the external subset. Of course, a test.dtd file must now exist. It is sufficient to create an empty test.dtd file in the working directory, or test.dtd could contain some meaningful XML-style declarations (ELEMENT and ATTLIST) for your markup at hand.

But as you've found out, you can also make XML tools happy by just removing the DOCTYPE line.

Now if you wanted to process your file(s) into XML without manual editing, you could use SGML to pre-process the file(s) into compliant XML, and then use XML query tools against the produced XML.
To do so using the OpenSP/OpenJade SGML package (which can be installed on Ubuntu with, e.g.,
sudo apt-get install opensp
), you'd place a catalog file in the directory, containing the following line telling SGML to resolve the public identifier -//Gale Research//DTD Document V2.0//EN to test.dtd:
PUBLIC "-//Gale Research//DTD Document V2.0//EN" "test.dtd"

You could edit an XML-style test.dtd to contain tag omission indicators, which classic SGML requires by default. But there's another feature your test data is using which isn't enabled by default in OpenSP tools, namely support for hexadecimal character references such as &#x2014; in your example data. Hence, we're going to need a custom SGML declaration (a somewhat archaic piece of plain text specifying the SGML features your document is using) for your data anyway. So while we're at it, we're also going to declare FEATURES MINIMIZE OMITTAG NO in the SGML declaration, which makes SGML accept element declarations without tag omission indicators, so we don't have to change test.dtd from its XML form.

We could place the SGML declaration right into the document itself, but to avoid manual editing, we're going to use catalog resolution for the SGML declaration. If you add the following line to your catalog file
SGMLDECL "xml10-sgmldecl.dcl"
then OpenSP will use whatever is stored in xml10-sgmldecl.dcl as the SGML declaration. The actual SGML declaration we're going to use is the official SGML declaration for XML 1.0 (which already has all the features we want). You don't need to understand the meaning of an SGML declaration in detail. Just paste the text of that declaration into a file named xml10-sgmldecl.dcl; if you're interested in the details, see my description at http://sgmljs.net/docs/sgmlrefman.html#sgml-declaration.

Now you'll be able to invoke OpenSP's osx converter on your file to produce XML from your SGML without errors.
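
If you only need to query the documents and would rather skip the SGML tooling, the simpler route mentioned earlier (removing the DOCTYPE line) can be sketched in a few lines of Python. This is only a rough illustration, not a tested pipeline for the whole archive: the regular expression assumes the DOCTYPE declaration has no internal subset, the helper name parse_without_doctype is made up for this example, and the sample string is a shortened stand-in for one of your files.

```python
import re
import xml.etree.ElementTree as ET

# Matches a DOCTYPE declaration that has no internal subset, e.g.
# <!DOCTYPE DOCUMENT PUBLIC "-//Gale Research//DTD Document V2.0//EN">
# (assumption: none of the archive files use an internal subset in [...]).
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>", re.IGNORECASE)


def parse_without_doctype(text):
    """Drop the first DOCTYPE declaration, then parse the rest as XML.

    Hypothetical helper for illustration only.
    """
    cleaned = DOCTYPE_RE.sub("", text, count=1)
    return ET.fromstring(cleaned.strip())


# Shortened stand-in for one document from the archive.
SAMPLE = """<!DOCTYPE DOCUMENT PUBLIC "-//Gale Research//DTD Document V2.0//EN">
<document synfileid="MCIESS0044">
<doc.head>
<title>Cloning</title>
</doc.head>
<doc.body>
<para>A clone is an identical copy of a plant or animal.</para>
<head n="1">Scientific Background</head>
<para>Such questions have no answers&#x2014;perhaps they never will.</para>
</doc.body>
</document>
"""

root = parse_without_doctype(SAMPLE)

doc_id = root.get("synfileid")                    # attribute on the root element
title = root.findtext("doc.head/title")           # simple path query
headings = [h.text for h in root.iter("head")]    # all section headings
paragraphs = [p.text for p in root.iter("para")]  # all paragraphs, in order

print(doc_id, title, headings, len(paragraphs))
```

The XML parser resolves hexadecimal character references such as &#x2014; on its own, so no SGML declaration is needed on this route. The trade-off is that nothing gets validated against a DTD.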