What do Public Identifier, System Identifier, and Base system identifier refer to in XML?


The Xerces2-J XMLInputSource and the SAX InputSource both refer to public and system identifiers. The Xerces2-J XMLInputSource also refers to a base system identifier.

What do these identifiers represent?

Edit: Xerces-J, when given a file location as the SystemId, will open the file as input. If the input is instead provided as a byte stream from some other source, such as a database, is there any purpose to the public or system id?

Best answer:

If you look at the XML grammar, you will see, for example, that external entity declarations use this syntax:

ExternalID ::= 'SYSTEM' S SystemLiteral
  | 'PUBLIC' S PubidLiteral S SystemLiteral

Here's an example of this syntax in use:

<!ENTITY open-hatch
         PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN"
         "http://www.textuality.com/boilerplate/OpenHatch.xml">

References to DTDs work in the same way (in fact, an external DTD is, technically speaking, one kind of entity).

The "system identifier" is a URI that identifies where the text of an entity can be found. The "public identifier" (a hangover from SGML) is more like a name for the resource; it only helps you find the resource if you have some kind of index or catalog that tells you where to look.

System identifiers are often given as relative URI references (for example "books.dtd"), which need to be resolved against a base URI. The base URI is generally the location where the containing resource (or entity) was found; this is the role of the base system identifier in Xerces. For example, if an XML document is at http://my.com/lib/books.xml, then that location is its base URI, and the relative reference books.dtd resolves to http://my.com/lib/books.dtd.
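The resolution step can be tried out with the standard java.net.URI class. This is only a minimal sketch of the URI arithmetic, not what Xerces does internally; the class and method names (BaseUriDemo, resolve) are made up for illustration.

```java
import java.net.URI;

class BaseUriDemo {
    // Resolve a (possibly relative) system identifier against the base URI
    // of the entity that contains the reference.
    static String resolve(String baseUri, String systemId) {
        return URI.create(baseUri).resolve(systemId).toString();
    }

    public static void main(String[] args) {
        // Relative reference: the last path segment of the base is replaced.
        System.out.println(resolve("http://my.com/lib/books.xml", "books.dtd"));
        // Absolute system identifier: the base URI is ignored.
        System.out.println(resolve("http://my.com/lib/books.xml",
                                   "http://other.example/dtd/books.dtd"));
    }
}
```

An absolute system identifier comes back unchanged, which is why the base only matters for relative references.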

To answer your question "is there any purpose to the public or system id": no, if the document consists entirely of a single entity (which is often the case). But as soon as multiple entities come into play, you need identifiers to link them together, and the system identifier you attach to the byte stream then serves as the base URI for resolving relative references inside it.
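As a rough illustration of the byte-stream case, the sketch below parses a document from memory with the JDK's SAX parser and records the system identifier that the parser reports for the DTD. The class name, the helper resolvedDtdId, and the sample document are assumptions made for this example. The resolver returns an empty DTD so that nothing is fetched over the network.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class ByteStreamSystemIdDemo {
    // Parse XML from a byte array and return the system identifier that the
    // parser passes to the entity resolver for the external DTD.
    static String resolvedDtdId(byte[] xml, String documentSystemId) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String[] seen = new String[1];
        reader.setEntityResolver((publicId, systemId) -> {
            seen[0] = systemId;                            // already resolved by the parser
            return new InputSource(new StringReader(""));  // empty DTD, no network access
        });
        InputSource in = new InputSource(new ByteArrayInputStream(xml));
        in.setSystemId(documentSystemId);  // not used to open anything; acts as the base URI
        reader.parse(in);
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<!DOCTYPE books SYSTEM \"books.dtd\"><books/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(resolvedDtdId(xml, "http://my.com/lib/books.xml"));
    }
}
```

The content is read from the byte stream, yet the relative reference books.dtd is still resolved against the system identifier set on the InputSource.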

Second answer:

If the input is instead provided as a byte stream from some other source, such as a database, is there any purpose to the public or system id?

No, because if the input is a byte stream there is no need to resolve the location of the entity.

What do these identifiers represent?

I think this thread explains it pretty well:

A SYSTEM declaration can be used to specify a file on the local file
system, for example:

<!DOCTYPE RootElement SYSTEM "file:///C:/validate.dtd">

The problem with this approach is that if the file is made public the
path specified on the local file system will not have any meaning any
more. Even if the path specified in the SYSTEM declaration *is* a URL:

<!DOCTYPE RootElement SYSTEM "http://www.mihaiu.name/validate.dtd">

the parser might be unable to retrieve the DTD file if the system is
not connected to the Internet.

The PUBLIC declaration constitutes a partial solution to this problem.
The string contained in a PUBLIC declaration is not a URL but a name,
similar in spirit to a URN (Uniform Resource Name). Such a name does not
pinpoint the precise location of the resource; it only identifies it.
The *parser* of the document (or an entity resolver or catalogue
configured for it) must be able to map that name to a URL.

Example of a PUBLIC declaration:

<!DOCTYPE RootElement PUBLIC "mihaiu/validate.dtd"
"http://www.mihaiu.name/validate.dtd">

In this case, a custom parser that already has a catalogue of DTDs
published by mihaiu can generate a URL from the PUBLIC declaration. The
generated URL can look like

c:\DTDs\validate.dtd

The XML specification itself does not define how a public identifier is
mapped to a URL, so if this mapping fails because the parser has no
catalogue entry for the identifier (or for whatever other reason), the
parser will fall back to the system identifier, which in this case
resolves to

http://www.mihaiu.name/validate.dtd

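Important observation:
Since the XML specification defines no mapping from a public identifier
to a URL, PUBLIC declarations are only useful when the parser has been
given such a mapping, for example through a SAX EntityResolver or an
XML catalogue. General-purpose parsers like Xerces support both, but do
nothing with the public identifier unless one is configured.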
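
The lookup described in the thread can be sketched with a SAX EntityResolver. The one-entry catalogue below, its local file URI, and the names PublicIdCatalogDemo, lookup, and RESOLVER are hypothetical and only mirror the example above; a real setup would more likely load an OASIS XML catalogue file.

```java
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class PublicIdCatalogDemo {
    // Hypothetical catalogue: public identifier -> local copy of the DTD.
    static final Map<String, String> CATALOG =
            Map.of("mihaiu/validate.dtd", "file:///c:/DTDs/validate.dtd");

    // Return the local URI for a known public identifier, or null if unknown.
    static String lookup(String publicId) {
        return publicId == null ? null : CATALOG.get(publicId);
    }

    // Returning null tells the parser to fall back to the system identifier.
    static final EntityResolver RESOLVER = (publicId, systemId) -> {
        String local = lookup(publicId);
        return local == null ? null : new InputSource(local);
    };

    public static void main(String[] args) throws Exception {
        InputSource hit = RESOLVER.resolveEntity(
                "mihaiu/validate.dtd", "http://www.mihaiu.name/validate.dtd");
        System.out.println(hit.getSystemId());
        InputSource miss = RESOLVER.resolveEntity(
                "some/other.dtd", "http://www.mihaiu.name/other.dtd");
        System.out.println(miss);  // null: the parser would use the system identifier
    }
}
```

Such a resolver would be installed with XMLReader.setEntityResolver before parsing.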