Is there a way when parsing XML in Python to ignore DTD in XML and use a local DTD file instead?


I'm returning to Python after a long absence.

I have a question about something I'm doing to solve another problem. I'll list both problems in case my initial problem has a better solution than what I'm trying.

I have written code to parse Evernote's ENEX export files, which are XML.

Evernote's ENEX's documentation

I'm getting this error:

lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 68908, column 61176

from this code:

from lxml import etree as ET

# enex_path is a placeholder for the path to the ENEX file
mytree = ET.parse(enex_path)

So I copied the DTD http://xml.evernote.com/pub/evernote-export3.dtd to a local file and added an entity declaration that maps nbsp to a space.

I don't want to edit every ENEX file I parse so that it points to the new DTD.

Is there a way to tell the parser to ignore the DTD referenced in the ENEX file and use the one at a path I provide instead?

Answer by mzjn:

You can use an XML catalog.

The following catalog file (let's call it catalog.xml) contains a system entry that says "whenever the parser encounters http://xml.evernote.com/pub/evernote-export3.dtd, use local-evernote-export3.dtd instead":

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <system systemId="http://xml.evernote.com/pub/evernote-export3.dtd"
          uri="local-evernote-export3.dtd"/>
</catalog>
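A relative uri in a catalog entry is resolved against the location of the catalog file itself, so the entry above only works while the two files stay together. If that is inconvenient, the catalog can be generated with an absolute file URI. The following is a minimal sketch using only the standard library; the helper names build_catalog and write_catalog are made up for illustration and are not part of lxml or of the original answer.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"
EVERNOTE_DTD_URL = "http://xml.evernote.com/pub/evernote-export3.dtd"


def build_catalog(local_dtd):
    """Return an ElementTree mapping the Evernote DTD URL to a local file.

    Hypothetical helper: local_dtd is the path to your edited copy of the DTD.
    """
    # Serialize the catalog namespace as the default namespace (no prefix).
    ET.register_namespace("", CATALOG_NS)
    catalog = ET.Element(f"{{{CATALOG_NS}}}catalog")
    ET.SubElement(
        catalog,
        f"{{{CATALOG_NS}}}system",
        {
            "systemId": EVERNOTE_DTD_URL,
            # An absolute file:// URI does not depend on where the catalog lives.
            "uri": Path(local_dtd).resolve().as_uri(),
        },
    )
    return ET.ElementTree(catalog)


def write_catalog(local_dtd, catalog_path):
    """Write the catalog file to catalog_path (hypothetical helper)."""
    build_catalog(local_dtd).write(
        catalog_path, encoding="utf-8", xml_declaration=True
    )
```

Calling write_catalog("local-evernote-export3.dtd", "catalog.xml") would produce a file equivalent to the hand-written one, except that uri holds an absolute path.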

Here is Python code that ensures that the catalog file is consulted:

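import os
from lxml import etree

# Set the catalog location before the first parse so libxml2 picks it up.
os.environ["XML_CATALOG_FILES"] = "catalog.xml"

# load_dtd=True makes the parser load the DTD, which is resolved via the catalog.
parser = etree.XMLParser(load_dtd=True)

# enex_path is a placeholder for the path to the ENEX file
tree = etree.parse(enex_path, parser=parser)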
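If setting an environment variable is not an option, a possible alternative (not part of the answer above) is lxml's custom resolver mechanism, which lets the parser redirect the DTD lookup in code. The sketch below is self-contained under some assumptions: LOCAL_DTD is a tiny stand-in for the edited local DTD that declares only the nbsp entity, LocalDTDResolver and make_parser are illustrative names, and the sample document is invented for the demonstration.

```python
import io
from lxml import etree

EVERNOTE_DTD_URL = "http://xml.evernote.com/pub/evernote-export3.dtd"

# Assumption: a stand-in for the local DTD copy; it only maps nbsp to a space.
LOCAL_DTD = '<!ENTITY nbsp " ">'


class LocalDTDResolver(etree.Resolver):
    """Serve local DTD content whenever the Evernote DTD URL is requested."""

    def resolve(self, system_url, public_id, context):
        if system_url == EVERNOTE_DTD_URL:
            return self.resolve_string(LOCAL_DTD, context)
        return None  # let lxml fall back to its default resolution


def make_parser():
    # load_dtd=True is still required so that the DTD is requested at all.
    parser = etree.XMLParser(load_dtd=True)
    parser.resolvers.add(LocalDTDResolver())
    return parser


# Invented sample document that references the remote DTD and uses &nbsp;.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE en-export SYSTEM '
    '"http://xml.evernote.com/pub/evernote-export3.dtd">\n'
    "<en-export><note><title>a&nbsp;b</title></note></en-export>"
)

tree = etree.parse(io.BytesIO(sample.encode("utf-8")), make_parser())
title = tree.findtext("note/title")
```

To serve a real DTD file instead of an inline string, the resolve method can return self.resolve_filename(path, context), where path points to the local copy.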
