Can I use XML Schema to validate documents with no xmlns attribute?

I have a situation where I'd like to start using an XML Schema to validate documents that, until now, have never had a schema definition. As such, the existing documents I'd like to validate do not have any xmlns declaration in them.

I have no problem successfully validating a document which does include the xmlns declaration, but I'd also like to be able to validate those documents without such a declaration. I was hoping for something like this:

DocumentBuilderFactory dbf = ...;
dbf.setSchema(... my schema for namespace "foo:bar"...);
dbf.setValidating(false);
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
db.setDefaultNamespace("foo:bar");
Document doc = db.parse(input);

There is no such method DocumentBuilder.setDefaultNamespace and so the schema validation is not performed when loading documents of this type.

Is there any way to force the namespace for a document if one is not set? Or does that require essentially parsing the XML without regard to schema, checking for an existing namespace, adjusting it, then re-validating the document with the schema?

I'm currently expecting the parser to perform validation during parsing, but I have no problem parsing first and then validating afterward.

UPDATE 2021-01-13

Here is a concrete example of what I'm trying to do, as a JUnit test case.

import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.junit.Assert;
import org.junit.Test;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class XMLSchemaTest
{
    private static final String XMLNS = "http://www.example.com/schema";
    private static final String schemaDocument = "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" targetNamespace=\"" + XMLNS + "\" xmlns:e=\"" + XMLNS + "\" elementFormDefault=\"qualified\"><xs:element name=\"example\" type=\"e:exampleType\" /><xs:complexType name=\"exampleType\"><xs:sequence><xs:element name=\"test\" type=\"e:testType\" /></xs:sequence></xs:complexType><xs:complexType name=\"testType\" /></xs:schema>";

    private static Document parse(String document) throws SAXException, ParserConfigurationException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        SchemaFactory sf = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");

        Source[] sources = new Source[] {
                new StreamSource(new StringReader(schemaDocument))
        };

        Schema schema = sf.newSchema(sources);

        dbf.setSchema(schema);
        dbf.setNamespaceAware(true);

        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new MyErrorHandler());

        return db.parse(new InputSource(new StringReader(document)));

    }

    @Test
    public void testConformingDocumentWithSchema() throws Exception {
        String testDocument = "<example xmlns=\"" + XMLNS + "\"><test/></example>";

        Document doc = parse(testDocument);

        //Assert.assertEquals("Wrong document XML namespace", XMLNS, doc.getNamespaceURI());
        Element root = doc.getDocumentElement();
        Assert.assertEquals("Wrong root element XML namespace", XMLNS, root.getNamespaceURI());
        Assert.assertEquals("Wrong element name", "example", root.getLocalName());
        Assert.assertEquals("Wrong element name", "example", root.getTagName());
    }

    @Test
    public void testConformingDocumentWithoutSchema() throws Exception {
        String testDocument = "<example><test/></example>";

        Document doc = parse(testDocument);

        //Assert.assertEquals("Wrong document XML namespace", XMLNS, doc.getNamespaceURI());
        Element root = doc.getDocumentElement();
        Assert.assertEquals("Wrong root element XML namespace", XMLNS, root.getNamespaceURI());
        Assert.assertEquals("Wrong element name", "example", root.getLocalName());
        Assert.assertEquals("Wrong element name", "example", root.getTagName());
    }

    @Test
    public void testNonconformingDocumentWithSchema() throws Exception {
        String testDocument = "<example xmlns=\"" + XMLNS + "\"><random/></example>";

        try {
            parse(testDocument);

            Assert.fail("Document should not have parsed properly");
        } catch (Exception e) {
            System.out.println(e);
            // Expected
        }
    }
    @Test
    public void testNonconformingDocumentWithoutSchema() throws Exception {
        String testDocument = "<example><random/></example>";

        try {
            parse(testDocument);

            Assert.fail("Document should not have parsed properly");
        } catch (Exception e) {
            System.out.println(e);
            // Expected
        }
    }

    public static class MyErrorHandler implements ErrorHandler {

        @Override
        public void warning(SAXParseException exception) throws SAXException {
            System.err.println("WARNING: " + exception);
        }

        @Override
        public void error(SAXParseException exception) throws SAXException {
            throw exception;
        }

        @Override
        public void fatalError(SAXParseException exception) throws SAXException {
            System.err.println("FATAL: " + exception);
        }
    }
}

All of the tests pass except for testConformingDocumentWithoutSchema. I think this is kind of expected, as the document declares no namespace.
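To illustrate why that one test fails, here is a minimal sketch that parses the same two documents without any schema and prints the namespace URI the parser assigns to the root element. The class name NamespaceProbe and the helper rootNamespace are made up for this illustration and are not part of the test case above.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for this illustration.
public class NamespaceProbe {

    // Parse without a schema and return the root element's namespace URI.
    static String rootNamespace(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        // No xmlns declaration: the root element is in no namespace.
        System.out.println(rootNamespace("<example><test/></example>"));
        // With the declaration, the root element is in the schema's namespace.
        System.out.println(rootNamespace(
                "<example xmlns=\"http://www.example.com/schema\"><test/></example>"));
    }
}
```

The first call reports null, so a schema whose targetNamespace is http://www.example.com/schema has no declaration that matches the root element.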

I'm asking how the test can be changed (but not the document itself!) so that I can validate the document against a schema that was not actually declared by the document.

1 Answer

I pounded on this for a while, and I was able to come up with a hack that works. It may be possible to do this more elegantly (which was my original question), and it also may be possible to do this with less code, but this was what I was able to come up with.

If you look at the JUnit test case in the question, changing the "parse" method to the following (and adding XMLNS as the second argument to all calls to parse) will allow all tests to complete:

import java.io.StringWriter;

import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

...

    private static Document parse(String document, String namespace) throws SAXException, ParserConfigurationException, IOException {
        SchemaFactory sf = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");

        Source[] sources = new Source[] {
                new StreamSource(new StringReader(schemaDocument))
        };

        Schema schema = sf.newSchema(sources);

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setSchema(schema);
        dbf.setNamespaceAware(true);

        DocumentBuilder db = dbf.newDocumentBuilder();
        ErrorHandler errorHandler = new MyErrorHandler();
        db.setErrorHandler(errorHandler);

        try {
            return db.parse(new InputSource(new StringReader(document)));
        } catch (SAXParseException spe) {
            // Just in case this was a problem with a missing namespace
            // System.out.println("Possibly recovering from SPE " + spe);

            // New DocumentBuilder without the schema
            dbf.setSchema(null);
            db = dbf.newDocumentBuilder();
            db.setErrorHandler(errorHandler);

            Document doc = db.parse(new InputSource(new StringReader(document)));

            if(null != doc.getDocumentElement().getNamespaceURI()) {
                // Namespace URI was set; this is a fatal error
                throw spe;
            }

            // Add an xmlns declaration to the root element. The in-memory nodes
            // only pick up the namespace once the document is re-parsed below.
            doc.getDocumentElement().setAttribute("xmlns", namespace);

            // Serialize the document -> String to start over again
            DOMImplementationLS domImplementation = (DOMImplementationLS) doc.getImplementation();
            LSSerializer lsSerializer = domImplementation.createLSSerializer();
            LSOutput lsOutput = domImplementation.createLSOutput();
            lsOutput.setEncoding("UTF-8");
            StringWriter out = new StringWriter();
            lsOutput.setCharacterStream(out);

            lsSerializer.write(doc, lsOutput);

            String converted = out.toString();

            // Re-enable the schema
            dbf.setSchema(schema);
            db = dbf.newDocumentBuilder();
            db.setErrorHandler(errorHandler);

            return db.parse(new InputSource(new StringReader(converted)));
        }
    }

This works by catching SAXParseException and, because the exception exposes only a message and a location rather than a structured error code, assuming that the problem might be a missing XML namespace declaration. I then re-parse the document without schema validation, add a namespace declaration to the in-memory Document, serialize the Document to a String, and re-parse that String with schema validation re-enabled.
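A different technique that may avoid the parse, serialize, re-parse round trip is to put a SAX filter between the parser and the validator and have it report every no-namespace element as belonging to the schema's namespace. The following is only a sketch under that assumption, not the method used above: DefaultNamespaceFilter and its validate helper are hypothetical names, and the helper only validates, it does not build a Document.

```java
import java.io.StringReader;

import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: reports elements that have no namespace as if they
// were declared in the given namespace. Attributes are passed through as-is.
public class DefaultNamespaceFilter extends XMLFilterImpl {
    private final String namespace;

    public DefaultNamespaceFilter(XMLReader parent, String namespace) {
        super(parent);
        // Interned defensively: some validator implementations may compare
        // namespace strings by identity rather than by equals().
        this.namespace = namespace.intern();
    }

    private String map(String uri) {
        return (uri == null || uri.isEmpty()) ? namespace : uri;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(map(uri), localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(map(uri), localName, qName);
    }

    // Validate xml against schemaXml, treating no-namespace elements as
    // members of the given namespace. Throws SAXException if invalid.
    public static void validate(String xml, String schemaXml, String namespace)
            throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader parser = spf.newSAXParser().getXMLReader();

        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(schemaXml)));

        XMLReader filter = new DefaultNamespaceFilter(parser, namespace);
        schema.newValidator().validate(
                new SAXSource(filter, new InputSource(new StringReader(xml))));
    }
}
```

Note that this sketch rewrites every element that lacks a namespace, not just the root, so it is only suitable when the whole document is meant to live in one namespace.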

I tried to do this just by setting the XML namespace and then using Schema.newValidator().validate(new DOMSource(doc)), but this failed validation every time for me. Running through the serializer got around that problem.
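A likely reason the DOMSource attempt failed is that setAttribute("xmlns", ...) only adds an attribute: the element nodes already in memory keep a null namespace URI, and that is what a validator walking the DOM sees. Serializing and re-parsing makes the parser assign the namespace. As a sketch of an in-memory alternative, assuming the DOM implementation supports DOM Level 3 Document.renameNode, the elements can be moved into the namespace and then validated directly. RenameNodeValidation, moveToNamespace and parseAndValidate are hypothetical names for this illustration.

```java
import java.io.StringReader;

import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for this illustration.
public class RenameNodeValidation {

    // Recursively move every element that has no namespace into the given one.
    static void moveToNamespace(Document doc, Node node, String namespace) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNamespaceURI() == null) {
            // renameNode may return a replacement node, so keep the result.
            node = doc.renameNode(node, namespace, node.getNodeName());
        }
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first in case the child gets replaced.
            Node next = child.getNextSibling();
            moveToNamespace(doc, child, namespace);
            child = next;
        }
    }

    // Parse without a schema, fix up the namespace if the root has none,
    // then validate the in-memory tree. Throws SAXException if invalid.
    static Document parseAndValidate(String xml, String schemaXml,
                                     String namespace) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        if (doc.getDocumentElement().getNamespaceURI() == null) {
            moveToNamespace(doc, doc.getDocumentElement(), namespace);
        }

        Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(schemaXml)));
        schema.newValidator().validate(new DOMSource(doc));
        return doc;
    }
}
```

Calling parseAndValidate(document, schemaDocument, XMLNS) from the test case would then stand in for the parse method above, without the intermediate String.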