I am trying to download the first 20 comics from the xkcd website. The code I've written downloads the page as a text file, or an image if I change fileName to "xkcd.jpg" and the URL to "http://imgs.xkcd.com/comics/monty_python.jpg".
The problem is that I need to download the image embedded in each page without copying the image URL of every comic by hand, since that would defeat the purpose of the program. I am guessing I need a for-loop at some point, but I can't write one until I know how to download the embedded image from the page itself. I hope my explanation isn't too complicated.
Below is my code:
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class XkcdDownload {
    public static void main(String[] args) throws Exception {
        String fileName = "xkcd.txt";
        URL url = new URL("http://xkcd.com/16/");
        // Copy the response straight to disk, overwriting any existing file.
        // try-with-resources closes the stream even if the copy fails.
        try (InputStream in = url.openStream()) {
            Files.copy(in, Paths.get(fileName), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
This can be solved using the debugging-console of your browser and JSoup.
Finding the Image-URL
What we get from the debugging-console (Firefox here, but it should work with any browser):
This already shows pretty clearly that the path to the comic itself would be the following:
Just use "Inspect Element", or whatever it's called in your browser, from the context menu, and the respective element should be highlighted (like in the screenshot).
I'll leave it to you to figure out how to extract the relevant elements and attributes, since that's already covered in quite a few other questions and I don't want to ruin your project by doing all of it ;).
Now, creating the list of comics to download can be done in numerous ways:
The simple way:
Comics all come with a sequential ID. Simply start with the ID of the first comic you want and increment or decrement that number to get to the others. This works if you have a hard-coded link pointing to a specific comic.
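A minimal sketch of the simple way, assuming comic pages live at the base address followed by the ID, as in the question's http://xkcd.com/16/. The class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class ComicPageUrls {
    // Assumption: every comic page is reachable as <baseUrl>/<id>/.
    static List<String> pagesFrom(String baseUrl, int firstId, int count) {
        List<String> urls = new ArrayList<>();
        for (int id = firstId; id < firstId + count; id++) {
            urls.add(baseUrl + "/" + id + "/");
        }
        return urls;
    }

    public static void main(String[] args) {
        // Lists the page addresses of the first 20 comics.
        for (String url : pagesFrom("http://xkcd.com", 1, 20)) {
            System.out.println(url);
        }
    }
}
```

Each of these addresses points to an HTML page, not to the image, so the page still has to be downloaded and searched for the image URL.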
A bit harder, but more generic
Actually these are two ways, assuming you start from xkcd.com:
1.)
There's a bit of text on the site that helps with finding the ID of the respective comic:
Extracting the ID from the plain-text HTML isn't too hard, since it's prefixed and postfixed by some text that should be pretty unique on the site.
2.)
Directly extracting the path of the previous or next comic from the elements of the buttons for going to the next/previous comic. As shown above, use the development console to extract the respective information from the HTML-file. This method should be more bulletproof than the first, as it only relies on the structure of the HTML-file, contrary to the other methods.
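Both variants can be sketched with nothing but the standard library. Regular expressions are swapped in here for JSoup purely to keep the example free of dependencies; with JSoup, the second variant would more likely be a selector along the lines of a[rel=prev]. The marker text "Permanent link to this comic:" and the rel="prev" attribute are assumptions about the page's markup, so check them against the page source in the debugging-console first. All names are hypothetical:

```java
import java.net.URI;
import java.util.Optional;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ComicNavigation {
    // Variant 1. Assumed marker text that precedes the comic's own address.
    private static final Pattern PERMALINK =
            Pattern.compile("Permanent link to this comic:\\D*?xkcd\\.com/(\\d+)/");

    // Variant 2. Assumed shape of the "previous" button: <a rel="prev" href="...">,
    // with rel written before href.
    private static final Pattern PREV_LINK =
            Pattern.compile("<a\\s[^>]*?\\brel=\"prev\"[^>]*?\\bhref=\"([^\"]+)\"");

    /** Reads the comic ID out of the text around the permanent link, if present. */
    static OptionalInt findComicId(String html) {
        Matcher m = PERMALINK.matcher(html);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    /** Resolves the "previous" button's href against the address of the current page. */
    static Optional<URI> findPreviousPage(String pageUrl, String html) {
        Matcher m = PREV_LINK.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(URI.create(pageUrl).resolve(m.group(1)));
    }
}
```

Resolving the href with URI.resolve means a relative link such as /15/ becomes a full address that can be downloaded in the next step of the loop.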
Note though that any of the above methods only work by downloading the HTML-file in which a specific comic is embedded. The image-URL won't be of much help (other than brute-force searching, which you shouldn't do for a number of reasons).
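To show how the pieces fit together, here is a rough sketch of the whole round trip for one page: download the HTML, find the image address in it, and save the image. It assumes Java 11 or newer and that the comic is the first img tag after the element with id="comic", which is a guess about the markup that should be verified in the debugging-console. A regular expression again stands in for JSoup, and all names are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ComicImageSketch {
    // Assumption: the comic is the first <img> following the element with id="comic".
    private static final Pattern COMIC_IMG = Pattern.compile(
            "id=\"comic\"[^>]*>.*?<img[^>]*?\\bsrc=\"([^\"]+)\"", Pattern.DOTALL);

    /** Finds the image address in the page HTML and makes it absolute. */
    static Optional<URI> findImageUri(String pageUrl, String html) {
        Matcher m = COMIC_IMG.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(URI.create(pageUrl).resolve(m.group(1)));
    }

    /** Takes the last path segment of the image address as the local file name. */
    static String fileNameOf(URI image) {
        String path = image.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Downloads the HTML page in which the comic is embedded. */
    static String fetchHtml(String pageUrl) throws IOException {
        try (InputStream in = URI.create(pageUrl).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Downloads the image embedded in one comic page into the given directory. */
    static void downloadComic(String pageUrl, Path directory) throws IOException {
        Optional<URI> image = findImageUri(pageUrl, fetchHtml(pageUrl));
        if (image.isEmpty()) {
            return; // no recognisable comic image on this page
        }
        try (InputStream in = image.get().toURL().openStream()) {
            Files.copy(in, directory.resolve(fileNameOf(image.get())),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Calling downloadComic once per page address inside a for-loop covers the "first 20 comics" case. One caveat: URL.openStream does not follow a redirect from http to https, so if the site redirects, pass the https address instead.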