I'm working on a program that parses the SGML files of the Reuters dataset. The documents I found don't contain a root node that encompasses all the children; each file just has a set of <reuters>..</reuters> tags after the DTD. So parsing the tree and calling getroot() gives only the first <reuters> tag, not the whole list. How can I work around this without changing the input files? My code is given below:
import os
from lxml import etree as ET

dirname = 'dataset'
for filename in os.listdir(dirname):
    filepath = os.path.join(dirname, filename)
    # recover=True lets lxml continue past markup that is not well-formed XML
    parser = ET.XMLParser(encoding='utf-8', recover=True)
    tree = ET.parse(filepath, parser)
    root = tree.getroot()
This root element is just the first <reuters> tag, while the SGML file looks like this:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<reuters> .. </reuters>
<reuters> .. </reuters>
<reuters> .. </reuters>
What I want is to get all the <reuters> tags, one at a time, and work on their contents.