using lxml ElementTree with sgml files (reuters dataset) with no root node

1.1k Views Asked by ggauravr At 05 September 2013 at 07:58

I'm working on a program that parses the SGML files of the Reuters dataset. The documents don't contain a root node that encompasses all the children; each file is just a sequence of <reuters>..</reuters> elements after the DOCTYPE declaration. So parsing the tree and calling getroot() gives only the first <reuters> element, not the whole list. How can I work around this without changing the input files? My code is given below:

import os
from lxml import etree as ET

dirname = 'dataset'

for filename in os.listdir(dirname):
    filepath = os.path.join(dirname, filename)

    parser = ET.XMLParser(encoding='utf-8', recover=True)

    tree = ET.parse(filepath, parser)

    root = tree.getroot()

This root element is just the first <reuters> element, while the SGML file looks like this:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<reuters> .. </reuters>
<reuters> .. </reuters>
<reuters> .. </reuters>

What I want is to get all the <reuters> elements, one at a time, and work on their contents.
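One possible workaround, sketched here rather than offered as a definitive answer: read each file into memory, drop the DOCTYPE line, and wrap the remaining content in a synthetic root element before handing it to lxml, so the files on disk stay untouched. The names iter_reuters and DOCTYPE_RE and the wrapper tag <root> are made up for illustration. The sketch assumes the DOCTYPE has no internal subset and that the recovering parser copes with whatever else in the file is not well-formed XML.

```python
import re
from lxml import etree as ET

# Matches a simple <!DOCTYPE ...> declaration (assumes no internal subset).
DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>", re.IGNORECASE)


def iter_reuters(data, tag="reuters"):
    """Yield every <tag> element from SGML bytes that lack a single root.

    Hypothetical helper: the input is wrapped in memory only, so the
    original files are never modified.
    """
    # Remove the DOCTYPE so it does not end up inside the wrapper element.
    body = DOCTYPE_RE.sub(b"", data, count=1)
    # Wrap all top-level elements in one synthetic root.
    wrapped = b"<root>" + body + b"</root>"
    parser = ET.XMLParser(encoding="utf-8", recover=True)
    root = ET.fromstring(wrapped, parser)
    # Tag matching is case-sensitive; pass tag="REUTERS" if the files
    # use upper-case element names.
    yield from root.iter(tag)


if __name__ == "__main__":
    sample = (
        b'<!DOCTYPE lewis SYSTEM "lewis.dtd">\n'
        b"<reuters><title>A</title></reuters>\n"
        b"<reuters><title>B</title></reuters>\n"
    )
    for doc in iter_reuters(sample):
        print(doc.findtext("title"))
```

In the loop from the question, this would replace the ET.parse call: open filepath in binary mode, pass the result of read() to iter_reuters, and process each yielded element in turn.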


