How to ignore HTML in an XML element when validating with RELAX NG compact syntax

How can I write a pattern that ignores HTML within an element, rather than having the validator try to validate it?

<stuff>
   <data>
      this is some text <b>with the odd</b> bit of html<p>and unclosed tags
   </data>
</stuff>

This isn't valid, but I tried things like:

datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"
start = stuff

stuff = element stuff
{
   element data { * }
}

There are 2 answers below

You can't embed arbitrary, unmodified HTML in XML. Either escape the individual special characters (see "What are the official XML reserved characters?") or wrap the HTML in a CDATA section (see "Is it possible to insert HTML content in XML document?").
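
As a rough sketch of both options, assuming Python's standard library (the sample string and variable names are illustrative, not from the question): once the HTML is escaped or wrapped in CDATA, the data element holds nothing but text, so a pattern such as element data { text } should be enough on the schema side.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Illustrative HTML fragment with an unclosed <p>, as in the question.
html = "this is some text <b>with the odd</b> bit of html<p>and unclosed tags"

# Option 1: escape the XML special characters (&, <, >).
escaped_doc = "<stuff><data>" + escape(html) + "</data></stuff>"

# Option 2: wrap the fragment in a CDATA section.
# Assumes the fragment does not itself contain the "]]>" terminator.
cdata_doc = "<stuff><data><![CDATA[" + html + "]]></data></stuff>"

# Both documents are well-formed, and the parser hands the original
# HTML back as plain text with no child elements under <data>.
escaped_text = ET.fromstring(escaped_doc).find("data").text
cdata_text = ET.fromstring(cdata_doc).find("data").text
print(escaped_text == html, cdata_text == html)  # prints: True True
```

Escaping works for any input; the CDATA route needs extra handling if the HTML can contain "]]>".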

You won't be able to validate a document that contains non-well-formed HTML, because such a document is not an XML document in the first place. But if the input you're getting is in fact XML, then you can certainly define data to allow any well-formed HTML elements, or any well-formed XML.
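
To illustrate the well-formedness point, here is a minimal sketch using Python's standard library parser. The helper name is_well_formed and the two sample strings are assumptions made for the illustration. The question's sample fails at the parsing stage, before any schema would be consulted.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Like the question's sample: <p> is never closed, so this is not XML at all.
broken = ("<stuff><data>some text <b>with the odd</b> bit of html"
          "<p>and unclosed tags</data></stuff>")

# Same content with the <p> closed: well-formed, so a schema can be applied.
fixed = ("<stuff><data>some text <b>with the odd</b> bit of html"
         "<p>and closed tags</p></data></stuff>")

print(is_well_formed(broken), is_well_formed(fixed))  # prints: False True
```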

Allowing any well-formed XML is the simplest. We define a pattern that means "any well-formed XML here": any elements encountered are validated using the same pattern, recursively:

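# Nested elements may carry any attributes; without the attribute
# pattern, an element such as <p class="x"> would be rejected.
wellformed-xml = (text
                 | element * { attribute * { text }*, wellformed-xml }
                 )*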

Now define the data element to use that pattern:

stuff = element stuff {
            element data { wellformed-xml }
        }

If you really want to ensure that it's just HTML, you'll want a name class more restrictive than "*". I've populated it with b, i, p, span, and div; adding the other elements you want is left as an exercise.

start = stuff
stuff =
  element stuff {
    element data { wellformed-html }
  }

wellformed-html =
  (text
   | element b | div | i | p | span { wellformed-html }
   )*
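
To see what the recursion in that pattern amounts to, here is a rough Python equivalent. It is only a sketch for intuition, not a substitute for a RELAX NG validator; ALLOWED and matches_wellformed_html are hypothetical names, and the sample inputs are made up.

```python
import xml.etree.ElementTree as ET

# Same names as the name class in the pattern above.
ALLOWED = {"b", "div", "i", "p", "span"}

def matches_wellformed_html(parent: ET.Element) -> bool:
    """Check the content of `parent` the way the recursive pattern does:
    text is always fine; each child element must have an allowed name,
    carry no attributes, and have content that matches recursively."""
    for child in parent:
        if child.tag not in ALLOWED or child.attrib:
            return False
        if not matches_wellformed_html(child):
            return False
    return True

ok = ET.fromstring("<data>text <b>bold <i>nested</i></b><p>para</p></data>")
bad = ET.fromstring("<data>text <script>alert(1)</script></data>")
print(matches_wellformed_html(ok), matches_wellformed_html(bad))  # prints: True False
```

The attribute check mirrors the pattern as written, which lists no attribute patterns and so rejects elements such as <p class="x">.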

If you want to be able to support XHTML input as well, you'll want to use a namespace reference; again, an exercise for the reader.