Clean up HTML input using HTMLcleaner

773 Views Asked by At

I am writing a java project using the HTMLCleaner library and save the output as a XML file this is the code that I wrote :

URL urlSB = new URL("http://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "google.com");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());

// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
        tagNodeRoot , "cleaned.xml", "utf-8"
);

The problem is that after running the project, cleaned.xml file is empty.

1

There are 1 best solutions below

0
On BEST ANSWER

The problem is that the page you are trying to access is configured to redirect to HTTPS. This does, for whatever reason, not work, and so the input stream is empty. If you change the URL to HTTPS, it's working fine:

URL urlSB = new URL("https://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:5.0) Gecko/20100101 Firefox/25.0");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());
new PrettyXmlSerializer(props).writeToFile(tagNodeRoot, "cleaned.xml", "utf-8");