I'm doing my Information Retrieval project on text categorization, using the Reuters-21578 test collection. It is distributed in 22 files. Each of the first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last (reut2-021.sgm) contains 578 documents. The files are in SGML format. Each of the 22 files begins with a document type declaration line, and the DTD file lewis.dtd is included in the distribution. Following the document type declaration line are the individual Reuters articles, marked up with SGML tags.
I need help writing a Java program to read those 21578 documents, or to transform them into 21578 separate text files.
Can somebody please help me?
From about five minutes of googling, it seems that there are no free SGML parsers for Java. This is rather surprising, but there you go.
I suggest you get hold of James Clark's SX tool, from the SP package, which is not Java but which is portable C, and use it to convert the SGML to XML. You can then parse the XML with a Java XML parser.
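Once you have XML, the Java side could look something like the rough sketch below. It rests on a few assumptions you should check against your own converted files: that each article is a REUTERS element carrying a NEWID attribute, that the article text sits in a BODY element inside TEXT, that the element names come out in upper case, and that the converted file is well-formed XML with a single root element. The class name ReutersXmlSplitter and the one-file-per-NEWID naming are just my choices for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: splits one converted XML file into one text file per article.
class ReutersXmlSplitter {

    // Text of the first descendant element with the given tag name, or null if there is none.
    static String firstText(Element parent, String tag) {
        NodeList hits = parent.getElementsByTagName(tag);
        return hits.getLength() > 0 ? hits.item(0).getTextContent() : null;
    }

    // Maps each article's NEWID to its text, in document order.
    // Assumes upper-case element names; change the strings if your XML uses lower case.
    static Map<String, String> extractBodies(InputStream xml)
            throws ParserConfigurationException, SAXException, IOException {
        NodeList articles = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xml)
                .getElementsByTagName("REUTERS");
        Map<String, String> bodies = new LinkedHashMap<>();
        for (int i = 0; i < articles.getLength(); i++) {
            Element article = (Element) articles.item(i);
            String text = firstText(article, "BODY");
            if (text == null) {
                // No BODY element: fall back to everything inside TEXT.
                text = firstText(article, "TEXT");
            }
            bodies.put(article.getAttribute("NEWID"), text == null ? "" : text);
        }
        return bodies;
    }

    // args[0] = converted XML file, args[1] = output directory.
    public static void main(String[] args) throws Exception {
        Path outDir = Files.createDirectories(Paths.get(args[1]));
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (Map.Entry<String, String> e : extractBodies(in).entrySet()) {
                Files.write(outDir.resolve(e.getKey() + ".txt"),
                        e.getValue().getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```

You would run it once per converted file, for example `java ReutersXmlSplitter reut2-000.xml out` (the .xml name is only an example of what you might call the converted output). It uses the DOM parser that ships with the JDK, which loads a whole file into memory; at about 1000 articles per file that should be fine, but a SAX or StAX parser would be the usual choice if memory becomes a problem.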