Guaranteed way to correctly get the contents of www.bing.com/


I have been working on a program that gets the contents of www.bing.com and saves them to a file. I have tried two approaches, one using sockets and the other using HtmlUnit, but neither produces a file that shows the contents 100% correctly when I open it. I know there are other options out there, but I am looking for one that is guaranteed to get the contents of www.bing.com/ correctly. I would appreciate it if someone could point me to a way of accomplishing this.


There are 3 best solutions below

2

The differences you see are likely due to the web server providing different content to different browsers based on the user agent string and other request headers.

Try setting the User-Agent header in your socket and HtmlUnit strategies to the one you are comparing against and see if the result is as expected. Moreover, you will likely have to replicate the request headers exactly as they are sent by your target browser.
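As a rough illustration of replicating a browser's request headers, here is a minimal sketch using the JDK's URLConnection rather than raw sockets or HtmlUnit. The class name BrowserHeaders, the helper open, and all header values are assumptions made for illustration; copy the real values from your target browser's developer tools.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserHeaders {
    // Hypothetical helper: creates a connection object (without connecting yet)
    // and applies every header in the map as a request property.
    static URLConnection open(String address, Map<String, String> headers) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        for (Map.Entry<String, String> h : headers.entrySet()) {
            conn.setRequestProperty(h.getKey(), h.getValue());
        }
        return conn;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        // Placeholder values (assumptions): replace with what your browser actually sends.
        headers.put("User-Agent", "Mozilla/5.0 (placeholder; copy the real string from your browser)");
        headers.put("Accept", "text/html,application/xhtml+xml");
        headers.put("Accept-Language", "en-US,en;q=0.9");

        URLConnection conn = open("https://www.bing.com/", headers);
        // Prints the headers that will be sent; no request has been made at this point.
        System.out.println(conn.getRequestProperties());
    }
}
```

With raw sockets the same idea applies, except that each header line is written by hand as part of the HTTP request text.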

1

What is "incorrect" about what is returned? Keep in mind, Bing is probably generating some of the content via JavaScript; your client will need to make additional requests to retrieve the JavaScript files, run the JavaScript, etc.

0

You can call URL.openConnection() to create a URLConnection, then call URLConnection.getInputStream(). Read the contents of that InputStream and write them to a file.

If you need to override the User-Agent because the server uses it to serve different content, you can do so by first setting the http.agent system property to an empty string.

/* Somewhere in your code before you make requests */
System.setProperty("http.agent", ""); 

Alternatively, pass -Dhttp.agent= on your java command line.

Then set the User-Agent to something useful on the connection before you get the InputStream.

URLConnection conn = ... //Create your URL connection as described above.
String userAgent = ... //Some user-agent string here.
conn.setRequestProperty("User-Agent", userAgent);
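Putting these pieces together, a minimal end-to-end sketch might look like the following. The class name PageSaver, the helper save, and the command-line arguments are hypothetical, and the User-Agent value is a placeholder. Note that this saves only the raw response body; anything the page builds with JavaScript after loading will not be in the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class PageSaver {
    // Hypothetical helper: requests `address` with the given User-Agent and
    // writes the response body to `target`, returning the number of bytes written.
    static long save(String address, String userAgent, Path target) throws IOException {
        URLConnection conn = URI.create(address).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        try (InputStream in = conn.getInputStream()) {
            // Copy the bytes unchanged so no charset conversion alters the page.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PageSaver <url> <output-file>");
            return;
        }
        // Placeholder User-Agent (assumption): substitute your browser's real string.
        String userAgent = "Mozilla/5.0 (placeholder)";
        long bytes = save(args[0], userAgent, Paths.get(args[1]));
        System.out.println("Wrote " + bytes + " bytes to " + args[1]);
    }
}
```

For example, `java PageSaver https://www.bing.com/ bing.html` would write the response to bing.html. Copying bytes rather than decoding to text avoids guessing the page's character encoding.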