TagSoup fails to parse HTML document from a StringReader (Java)


I have this function:

private Node getDOM(String str) throws SearchEngineException {
    DOMResult result = new DOMResult();
    try {
        XMLReader reader = new Parser();
        reader.setFeature(Parser.namespacesFeature, false);
        reader.setFeature(Parser.namespacePrefixesFeature, false);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new SAXSource(reader, new InputSource(new StringReader(str))), result);
    } catch (Exception ex) {
        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
    }
    return result.getNode();
}

It takes a String containing the HTML document the HTTP server sent in response to a POST request, but it fails to parse it properly: I only get about four nodes from the entire document. The String itself looks fine; if I print it out and paste it into a text file, I see the page I expected.

When I use an overloaded version of the above method:

private Node getDOM(URL url) throws SearchEngineException {
    DOMResult result = new DOMResult();
    try {
        XMLReader reader = new Parser();
        reader.setFeature(Parser.namespacesFeature, false);
        reader.setFeature(Parser.namespacePrefixesFeature, false);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
    } catch (Exception ex) {
        throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
    }
    return result.getNode();
}

then everything works just fine: I get a proper DOM tree. But I need to somehow retrieve the answer to a POST request from the server.

Storing the string in a file and reading it back does not work - still getting the same results.

What could be the problem?


There are 3 best solutions below


This seems like an encoding problem. In your example that doesn't work, you pass the input as a String, and TagSoup has trouble parsing the HTML. In the example that works, you pass the stream into the InputSource constructor; the difference is that with a stream, the SAX implementation can figure out the encoding from the stream itself.

If you want to test this you could try these steps:

  • Stream the HTML you're parsing through a java.io.InputStreamReader and call getEncoding on it to see which encoding it detects.
  • In your first example, call setEncoding on the InputSource, passing in the encoding that the InputStreamReader reported.
  • See if the first example, changed to explicitly set the encoding, parses the HTML correctly.

There's a discussion of this toward the end of an article on using the SAX InputSource.
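The steps above can be sketched with JDK classes only (the HTML bytes and charset here are made-up test data, not from the question); the resulting InputSource would then be handed to the TagSoup Parser exactly as in the question's getDOM method:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

import org.xml.sax.InputSource;

public class EncodingProbe {
    public static void main(String[] args) throws Exception {
        byte[] html = "<html><body>caf\u00e9</body></html>".getBytes("UTF-8");

        // Step 1: wrap the raw bytes in an InputStreamReader and ask which
        // charset it actually resolved to (the platform default, since none
        // was specified in the constructor).
        InputStreamReader probe = new InputStreamReader(new ByteArrayInputStream(html));
        String detected = probe.getEncoding();  // a historical name such as "UTF8"
        probe.close();
        System.out.println("reader encoding: " + detected);

        // Step 2: tell the InputSource explicitly which encoding to assume,
        // instead of letting the SAX implementation guess.
        InputSource source = new InputSource(new ByteArrayInputStream(html));
        source.setEncoding(detected);
        System.out.println("source encoding: " + source.getEncoding());

        // This InputSource can now be passed to the TagSoup Parser via
        // new SAXSource(reader, source), as in the question.
    }
}
```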


To get a POST response you first need to send a POST request; new InputSource(url.openStream()) opens a connection and reads the response of a GET request. Check out Sending a POST Request Using a URL.

Other HTTP client APIs might also be interesting to check out for sending POST requests and reading the response.
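A minimal sketch of the HttpURLConnection approach (the post helper and the form body used below are hypothetical, not from the question):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PostRequest {
    // Hypothetical helper: POSTs a url-encoded form body and returns the
    // open response stream.
    static InputStream post(URL url, String formBody) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);  // required before writing a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(formBody.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getInputStream();  // the body of the POST response
    }
}
```

The returned stream can then be wrapped in an InputSource, e.g. new InputSource(post(url, "q=test")), and fed to the working stream-based getDOM variant, so TagSoup sees the raw response bytes rather than a pre-decoded String.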


Is it maybe a problem with the XML encoding?