Round-trippable encoding of XML dialects within HTML5


I am interested in TEI, a richly semantic, XML-based language. If it could be encoded in HTML in a round-trippable manner, it could be authored in web-based HTML editors or stored on HTML-based wikis (at least those that support the necessary semantic mechanisms).

I would like to know whether RDFa would work as a mechanism for fully representing an XML dialect (or several) within HTML5. The criteria are round-trippability and awareness of the hierarchical nature of XML elements (and of other critical aspects such as attributes).

I know one might be able to overload data-* attributes, Microformats, or Microdata, but none of these options can both fully represent an XML dialect, hierarchy included, and avoid the spec's warnings that the mechanism is not intended for software independent of the site (e.g., if one wished to create a search engine that searches such embedded XML in a hierarchy-aware manner).

If RDFa won't work, I think the best option might be data-* attributes, as one can easily do something like this to represent XML:

<div data-xml-ns="http://www.tei-c.org/ns/1.0"
      data-xml-ns-html="http://www.w3.org/1999/xhtml" data-xml-element="div1"
      data-xml-attribute-value="xml:id=myDiv1ID">Some TEI div1 content
      and <div data-xml-element="div2">some div2 content</div></div>

(Not a good example of semantic richness, I know, but it shows the nature of the encoding.)
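To make the round-trip requirement concrete, here is a minimal Python sketch of the data-* idea. It is only an illustration under stated assumptions: the attribute names `data-xml-element` and `data-xml-attributes` (a JSON object) are hypothetical, element names are stored in the `{namespace}local` form used by `xml.etree.ElementTree`, and comments, processing instructions, and original prefix names are not preserved.

```python
import json
import xml.etree.ElementTree as ET


def xml_to_html(elem):
    """Wrap an XML element in an HTML <div> that records its name and attributes."""
    div = ET.Element("div", {"data-xml-element": elem.tag})
    if elem.attrib:
        # Hypothetical convention: all XML attributes go into one JSON object.
        div.set("data-xml-attributes", json.dumps(dict(elem.attrib)))
    div.text = elem.text  # text before the first child
    div.tail = elem.tail  # text after the closing tag
    for child in elem:
        div.append(xml_to_html(child))
    return div


def html_to_xml(div):
    """Rebuild the XML element from the <div> produced by xml_to_html."""
    attrs = json.loads(div.get("data-xml-attributes", "{}"))
    elem = ET.Element(div.get("data-xml-element"), attrs)
    elem.text = div.text
    elem.tail = div.tail
    for child in div:
        elem.append(html_to_xml(child))
    return elem


TEI = "http://www.tei-c.org/ns/1.0"
source = (
    f'<div1 xmlns="{TEI}" xml:id="myDiv1ID">Some TEI div1 content '
    'and <div2>some div2 content</div2></div1>'
)

original = ET.fromstring(source)
html = xml_to_html(original)
restored = html_to_xml(html)

print(ET.tostring(html, encoding="unicode"))
print(ET.tostring(restored) == ET.tostring(original))
```

The nesting of the `<div>` elements mirrors the XML hierarchy, so the tree can be rebuilt without extra bookkeeping. This does not address the spec's restriction on data-* attributes; it only shows that the encoding itself can be lossless for elements, attributes, and text.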

But again, I'd prefer to avoid the limitations placed on this mechanism as stated in the HTML spec:

"These attributes are not intended for use by software that is independent of the site that uses the attributes"

"these attributes are intended for use by the site's own scripts, and are not a generic extension mechanism for publicly-usable metadata."

If RDFa will work for this, I would appreciate an example of how, e.g., the example above might be encoded to preserve the hierarchical relationships, etc.

There are 2 best solutions below

1
On

It should work, but it won't be "HTML5" per se. You can use something like your own DOMImplementation, which lets you manipulate the document as you wish, including defining your own document type to handle special tags and attributes as needed. DOMParser is another option, but it is less capable. You should be able to render either one inside an iframe as well, so that it serves as a view while the parent does the manipulation via regular HTML methods.
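As a rough illustration of the DOMImplementation idea, the sketch below uses Python's `xml.dom.minidom`, which exposes the same W3C DOM interface (`createDocument`, `createElementNS`) outside the browser. This is an assumption made for the sake of a runnable example; in a page you would presumably use the browser's `document.implementation` instead. The element names and ID are taken from the question's sample.

```python
from xml.dom.minidom import getDOMImplementation

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Create a standalone document whose root is a TEI <div1>, not an HTML element.
impl = getDOMImplementation()
doc = impl.createDocument(TEI_NS, "div1", None)  # None: no doctype in this sketch
root = doc.documentElement

# minidom does not write namespace declarations by itself, so add one explicitly.
root.setAttribute("xmlns", TEI_NS)
root.setAttribute("xml:id", "myDiv1ID")

# Build the same content as the question's example with regular DOM calls.
root.appendChild(doc.createTextNode("Some TEI div1 content and "))
div2 = doc.createElementNS(TEI_NS, "div2")
div2.appendChild(doc.createTextNode("some div2 content"))
root.appendChild(div2)

serialized = root.toxml()
print(serialized)
```

The document lives beside the HTML page rather than inside it, which is why the result is not HTML5 in the strict sense.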

0
On

Static information within the page itself is usually consumed by the browser, rather than by the site, to decide how the document is handled.

RDFa is intended to include additional meta information about page content and (hopefully) facilitate third party applications in their analysis of such documents. Little stops you from using it to express information about data that isn't rendered in the browser, if that is what you want to do.

One way to look at it is that using RDFa allows you to take mostly unstructured data (HTML5) and add as much structured data to it as you can/would like. Usually this results in data that would be described as semi-structured.

<body about="http://example.org/john-d/#me">
<h1>John's Home Page</h1>
<p>My name is <span property="foaf:nick">John D</span> and I like
  <a href="http://www.neubauten.org/" rel="foaf:interest"
    lang="de">Einstürzende Neubauten</a>.
</p>
<p>
  My <span rel="foaf:interest" resource="urn:ISBN:0752820907">favorite
  book is the inspiring <span about="urn:ISBN:0752820907"><cite
  property="dc:title">Weaving the Web</cite> by
  <span property="dc:creator">Tim Berners-Lee</span></span></span>.
</p>
<div about="http://example.com/cat">
  <span property="rdf:type" resource="rdfs:Class"></span>
</div>
</body>

In this example, RDFa uses a span to relate John D as the foaf:nick of the individual http://example.org/john-d/#me. This information is very unlikely to be used server side, and even less likely to be used by the user's web browser. You can, however, introduce third-party tools (bots, browser extensions) that understand this information and act on it.

In the <div> I am somewhat hand-wavy about namespace syntax and qnames, but what I am basically doing is creating invisible information that does not correspond to a visible element in the rendered web page. Instead, it specifies some ontological knowledge in RDF: it creates the triple stating that http://example.com/cat has rdf:type rdfs:Class.
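To show what "creating a triple" means here, the following Python sketch pulls that one statement out of the `<div>`. It is deliberately tiny and is not a conforming RDFa processor: it only handles `about` as the subject and elements that carry both `property` and `resource` (read as an IRI object, as in RDFa 1.1), and the class name `TinyRDFaReader` and the hard-coded prefix table are assumptions of the sketch.

```python
from html.parser import HTMLParser

# Prefix mappings assumed for this sketch.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
# Elements that never get an end tag, so they must not touch the subject stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


def expand(term):
    """Expand prefix:local using PREFIXES; leave anything else unchanged."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term


class TinyRDFaReader(HTMLParser):
    """Collect (subject, predicate, object) from property+resource elements."""

    def __init__(self):
        super().__init__()
        self.subjects = []  # one entry per open element: the subject in scope
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        inherited = self.subjects[-1] if self.subjects else None
        subject = a.get("about") or inherited
        if tag not in VOID:
            self.subjects.append(subject)
        if subject and "property" in a and "resource" in a:
            self.triples.append(
                (subject, expand(a["property"]), expand(a["resource"]))
            )

    def handle_endtag(self, tag):
        if tag not in VOID and self.subjects:
            self.subjects.pop()


SNIPPET = """
<div about="http://example.com/cat">
  <span property="rdf:type" resource="rdfs:Class"></span>
</div>
"""

reader = TinyRDFaReader()
reader.feed(SNIPPET)
for triple in reader.triples:
    print(triple)
```

A real deployment would use a full RDFa parser, which also covers `rel`, `typeof`, literals, and prefix declarations.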

It will still be HTML5, and it will allow you to express a large amount of information. As far as the TEI problem goes, you can enhance an existing document by adding RDFa statements to it. This is extremely useful because it leaves documents renderable for humans while making them more machine-readable.

For storing semantic information within wikis and the like, I would suggest looking at Semantic MediaWiki. It allows the expressed information to be stored in an external triple store, which makes it visible to outside applications. This may or may not be suitable for your use cases.