I am seeing "SAXParseException: output_pdml.xml:1089:0: not well-formed (invalid token)" error message while parsing pcap log in PDML .xml file format. Actual root cause of this failure is due to this string in the decoded log message "". The characters "<><>" is making actual failure. I was going through the xml sax documentation in python, there is escape function available to replace these string in its supporting format

xml.sax.saxutils.escape(data, entities={}) Escape '&', '<', and '>' in a string of data. But I am not able to get how to invoke this function , as xml sax is completely working in event driven approach.In my code most of the parsing activities happening though startElement and endElement functions. I feel this exception is raised internally while invoking the startElement function

def startElement(self, data, attr): In startElement function just checking the type of the data element and it created a python object structure, I assume error will happen during these steps. My question here is how to invoke the escape while creating the parser instance and invoking the content handler

 parser = xml.sax.make_parser()
 parser.setContentHandler(content_handler_fun(call_back))
 parser.setFeature(xml.sax.handler.feature_external_ges, False)
 parser.parse(stdout_stream)

How do I invoke the escape function before invoking the parse() function

Tried for calling escape() method like this 
 parser = xml.sax.make_parser()
 parser.setContentHandler(cont`your text`ent_handler_fun(call_back))
 parser.setFeature(xml.sax.handler.feature_external_ges, False)
 data = xml.sax.saxutils.escape(stdout_stream)
 parser.parse(data)

Failure observed: '_io.BufferedReader' object has no attribute 'replace' I feel escape() expecting a string data, but my case I have to parse the content of stdout stream, it wont possible to read all stream content in string format and parse, because the content size is too huge, it is not practical to load complete content in to memory instead of referring a stream reference

1

There are 1 best solutions below

2
Hermann12 On

Not the answer for your question, but lxml could maybe a solution?

from lxml import etree

fail_xml = """
<pdml version="0">
  <field name="test" show="nice <d83dde0a>;" />
</pdml>"""

parser = etree.XMLParser(recover=True) # recover from bad characters.
root = etree.fromstring(fail_xml, parser=parser)
for elem in root.iter():
    print(elem.tag, elem.attrib)

Output:

pdml {'version': '0'}
field {'name': 'test', 'show': 'nice '}
d83dde0a {}

EDIT: The module xml.sax.saxutils knows escape() and for attrs quoteattr() only. XML parser have problems with not well-formed xml. If you take a html Parser or bs4 it can be handeld:

import html
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    
    def handle_starttag(self, tag, attrs):
        if tag == "field":
            print(attrs)
            print(html.escape(attrs[1][1]))


parser = MyHTMLParser()
parser.feed("""<pdml version="0">
  <field name="test" show="nice <d83dde0a>;" />
</pdml>""")

Output:

[('name', 'test'), ('show', 'nice <d83dde0a>;')]
nice &lt;d83dde0a&gt;;