How to download embedded images from websites in Java


I am trying to download the first 20 images/comics from the xkcd website. The code I've written lets me download the website as a text file, or an image if I change the fileName to "xkcd.jpg" and the URL to "http://imgs.xkcd.com/comics/monty_python.jpg".

The problem is that I need to download the image embedded on the site without going back and forth copying the image URL of each comic over and over; that would defeat the purpose of this program. I'm guessing I need a for-loop at some point, but I can't write it until I know how to download the embedded image from the page itself. I hope my explanation isn't too complicated.

Below is my code


import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class XkcdDownload {
    public static void main(String[] args) throws IOException {
        String fileName = "xkcd.txt";
        URL url = new URL("http://xkcd.com/16/");
        // read the raw response (here the page's HTML) into memory
        InputStream in = new BufferedInputStream(url.openStream());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n = 0;
        while (-1 != (n = in.read(buf))) {
            out.write(buf, 0, n);
        }
        out.close();
        in.close();
        // dump the downloaded bytes into a local file
        byte[] response = out.toByteArray();
        FileOutputStream fos = new FileOutputStream(fileName);
        fos.write(response);
        fos.close();
    }
}

There are 2 answers below.


This can be solved using your browser's debugging console and JSoup.

Finding the Image-URL

What we get from the debugging console (Firefox here, but this should work with any browser):

[screenshot: the developer console with the comic's <img> element highlighted]

This already shows pretty clearly that the path to the comic image is the following:

html -> div with id "middleContainer" -> div with id "comic" -> image element

Just use "Inspect Element" (or whatever it's called in your browser) from the context menu, and the respective element should be highlighted (like in the screenshot).

I'll leave figuring out how to extract the relevant elements and attributes to you, since it's already covered in quite a few other questions and I don't want to ruin your project by doing all of it ;).
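That said, if you just want to sanity-check the approach, a minimal JSoup sketch of that selection could look roughly like this (the "#comic img" selector is taken from the DOM path above, and the class name is only a placeholder):

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ComicImageUrl {
        public static void main(String[] args) throws IOException {
            // Fetch the comic page and walk down to the <img> inside the div with id "comic"
            Document doc = Jsoup.connect("https://xkcd.com/16/").get();
            Element img = doc.selectFirst("#comic img");
            if (img != null) {
                // absUrl resolves protocol-relative links like //imgs.xkcd.com/comics/... into absolute URLs
                System.out.println(img.absUrl("src"));
            }
        }
    }

selectFirst returns null when nothing matches, so the null check keeps the sketch from crashing if the page layout changes.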

Now, creating a list of comic pages to work through can be done in numerous ways:

The simple way:

Comics all come with a sequential ID. Simply start from the ID of a known comic and increment or decrement that number (see the sketch below). This works if you have a hard-coded link pointing to a specific comic.
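For example, a loop like the following would produce the page URLs for the first 20 comics (the bounds are only an illustration):

    // Build the page URL for each sequential comic ID; xkcd IDs start at 1
    for (int id = 1; id <= 20; id++) {
        String pageUrl = String.format("https://xkcd.com/%d/", id);
        // download and parse pageUrl here, as described above
    }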

A bit harder, but more generic

There are actually two ways, assuming you start from xkcd.com:
1.)

There's a bit of text on the site that helps find the ID of the respective comic:

[screenshot: the text on the comic page that contains the comic's ID]

Extracting the ID from the plain-text HTML isn't too hard, since it is prefixed and suffixed by text that should be pretty unique on the site.
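For instance, assuming the marker is the "Permanent link to this comic:" line that the page source currently contains (verify the exact wording against the live page; the class and method names here are just placeholders), a small regex-based helper might look like this:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class ComicIdExtractor {
        // Assumption: the page contains text like "Permanent link to this comic: https://xkcd.com/16/"
        private static final Pattern PERMALINK =
                Pattern.compile("Permanent link to this comic:\\s*https?://xkcd\\.com/(\\d+)/");

        // Returns the comic's ID, or -1 if the marker text wasn't found
        static int extractComicId(String html) {
            Matcher m = PERMALINK.matcher(html);
            return m.find() ? Integer.parseInt(m.group(1)) : -1;
        }
    }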

2.)

Directly extract the path of the previous or next comic from the buttons for going to the next/previous comic. As shown above, use the developer console to find the respective elements in the HTML file. This method should be more bulletproof than the first, as it relies only on the structure of the HTML file, unlike the other methods.
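A minimal sketch of that idea, assuming the navigation buttons are plain links carrying rel="prev" and rel="next" attributes (confirm that in the developer console first; the class name is just a placeholder):

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    class ComicNavigation {
        // Returns the absolute URL of the previous comic, or null if no rel="prev" link is found
        static String previousComicUrl(String pageUrl) throws IOException {
            Document doc = Jsoup.connect(pageUrl).get();
            Element prev = doc.selectFirst("a[rel=prev]");
            return prev == null ? null : prev.absUrl("href");
        }
    }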

Note, though, that all of the above methods only work by downloading the HTML file in which a specific comic is embedded. The image URL alone won't be of much help (other than brute-force searching, which you shouldn't do for a number of reasons).


You could use JSoup... and it would probably be a more stable option, but if you just want to hack something together you might choose the more fragile approach of parsing the HTML yourself:

    package com.jbirdvegas.q41231970;

    import java.io.BufferedReader;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.channels.Channels;
    import java.nio.channels.ReadableByteChannel;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;
    import java.util.stream.Stream;

    public class Download {
        public static void main(String[] args) {
            Download download = new Download();
            // go through comic IDs 1 to 19 (IntStream.range excludes the end; use rangeClosed(1, 20) for the first 20)
            IntStream.range(1, 20)
                    // parse the image url from the html page
                    .mapToObj(download::findImageLinkFromHtml)
                    // download and save each item in the image url list
                    .forEach(download::downloadImage);
        }

        /**
         * Warning manual HTML parsing below...
         * <p>
         * get XKCD image url for a given pageNumber
         *
         * @param pageNumber index of a give cartoon image
         * @return url of the page's image
         */
        private String findImageLinkFromHtml(int pageNumber) {
            // text we are looking for
            String textToFind = "Image URL (for hotlinking/embedding):";
            String url = String.format("https://xkcd.com/%d/", pageNumber);
            try (InputStream inputStream = new URL(url).openConnection().getInputStream();
                 BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
                Stream<String> stream = reader.lines();
                String foundLine = stream.filter(lineOfHtml -> lineOfHtml.contains(textToFind))
                        .collect(Collectors.toList()).get(0);
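                // foundLine looks like "Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/...";
                // splitting on ':' cuts through the URL's scheme, so the last two pieces are stitched back together below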
                String[] split = foundLine.split(":");
                return String.format("%s:%s", split[1], split[2]);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return null;
        }

        /**
         * Download a url to a file
         *
         * @param url downloads an image to a local file
         */
        private void downloadImage(String url) {
            try {
                System.out.println("Downloading image url: " + url);
                URL image = new URL(url);
                String[] urlSplit = url.split("/");
                // try-with-resources closes the channel and the output file even if the copy fails
                try (ReadableByteChannel rbc = Channels.newChannel(image.openStream());
                     FileOutputStream fos = new FileOutputStream(urlSplit[urlSplit.length - 1])) {
                    fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

Outputs:

Downloading image url:  http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg
Downloading image url:  http://imgs.xkcd.com/comics/tree_cropped_(1).jpg
Downloading image url:  http://imgs.xkcd.com/comics/island_color.jpg
Downloading image url:  http://imgs.xkcd.com/comics/landscape_cropped_(1).jpg
Downloading image url:  http://imgs.xkcd.com/comics/blownapart_color.jpg
Downloading image url:  http://imgs.xkcd.com/comics/irony_color.jpg
Downloading image url:  http://imgs.xkcd.com/comics/girl_sleeping_noline_(1).jpg
Downloading image url:  http://imgs.xkcd.com/comics/red_spiders_small.jpg
Downloading image url:  http://imgs.xkcd.com/comics/firefly.jpg
Downloading image url:  http://imgs.xkcd.com/comics/pi.jpg
Downloading image url:  http://imgs.xkcd.com/comics/barrel_mommies.jpg
Downloading image url:  http://imgs.xkcd.com/comics/poisson.jpg
Downloading image url:  http://imgs.xkcd.com/comics/canyon_small.jpg
Downloading image url:  http://imgs.xkcd.com/comics/copyright.jpg
Downloading image url:  http://imgs.xkcd.com/comics/just_alerting_you.jpg
Downloading image url:  http://imgs.xkcd.com/comics/monty_python.jpg
Downloading image url:  http://imgs.xkcd.com/comics/what_if.jpg
Downloading image url:  http://imgs.xkcd.com/comics/snapple.jpg
Downloading image url:  http://imgs.xkcd.com/comics/george_clinton.jpg

Also note there are plenty of issues with parsing websites... xkcd particularly likes helping parser developers find bugs :D See https://xkcd.com/859/ for an example.