How to get images from a saved html page


I have a large number of saved HTML pages on my PC. I parsed each HTML page and extracted the image src values. I need to store the images from every HTML page in a specific structure in a separate directory. I tried Net::HTTP.get, but I get a "Filename too long" error. Is there any way to do this?

Below are the approaches I tried.

Method 1:

{
require 'open-uri'

def save_image(imgsrc)
    File.open("images/img1","w") do |f|
        asdf = open(imgsrc).read
        f.write(asdf)
    end
end
}

Method 2:

{
require 'net/http'

def save_image(imgsrc)
    File.open("images/img1","w") do |f|
        asdf = Net::HTTP.get_response(URI.parse(imgsrc))
        f.write(asdf)
    end
end
}


imgsrc => 

Best answer:

You already have the image. The value you posted in the imgsrc variable is the image itself, embedded in the page as a Base64-encoded data URI.

You only need to decode it with the base64 module and save the result to a file.

To check your image, I first decoded it with an online Base64 decoding service.

To decode it in Ruby, use the Base64.strict_decode64 method:

$ cat testb64.rb

imgsrc='/9j/4AAQS... ...oooA//2Q==' # your long value, snipped here, with
                                    # the leading "data:image/jpeg;base64,"
                                    # removed
require 'base64'
print Base64.strict_decode64(imgsrc)

$ ruby testb64.rb >img.jpg

$ xxd -p img.jpg 
ffd8ffe000104a464946....

(Valid JFIF header; the resulting JPEG opens in Gwenview and Dolphin.)
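
If you would rather not strip the prefix by hand, the same decoding step can be sketched as a small helper. This is only a sketch: decode_data_uri is a hypothetical name, it assumes the src has the plain "data:<mime>;base64,<payload>" form from the question, and the sample value is a four-byte stand-in rather than your real image.

```ruby
require 'base64'

# Hypothetical helper: splits a Base64 data URI into its MIME type and raw bytes.
# Assumes the "data:<mime>;base64,<payload>" form with no extra parameters.
def decode_data_uri(data_uri)
  match = data_uri.match(%r{\Adata:(?<mime>[\w.+-]+/[\w.+-]+);base64,(?<payload>.*)\z}m)
  raise ArgumentError, 'not a Base64 data URI' unless match

  # strict_decode64 raises ArgumentError on malformed input, including newlines.
  [match[:mime], Base64.strict_decode64(match[:payload])]
end

# Stand-in payload: only the first four bytes of a JPEG header.
sample = 'data:image/jpeg;base64,' + Base64.strict_encode64("\xFF\xD8\xFF\xE0".b)
mime, bytes = decode_data_uri(sample)
```

The returned bytes can then be written with File.binwrite, and the MIME type gives you a hint for the file extension.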
Answer 2:

This should work:

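require 'base64'

def save_image(imgsrc)
  # imgsrc is a data URI ("data:image/jpeg;base64,..."), not a URL,
  # so decode the part after the comma instead of fetching it.
  encoded = imgsrc.split(",", 2).last
  File.open("images/img1", "wb") do |fo|
    fo.write(Base64.decode64(encoded))
  end
end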

It saves to the file path "images/img1", so you'll want a separate path for each image; otherwise each one overwrites the last.
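
One way to get separate paths is to derive them from the HTML file name and the image's position on the page. The layout below (one directory per page, numbered files) and the name image_path are assumptions for illustration, not something the question specifies.

```ruby
require 'fileutils'

# Hypothetical layout: <root>/<HTML file name without extension>/img_<n>.<ext>
def image_path(root, html_file, index, ext = 'jpg')
  page_dir = File.join(root, File.basename(html_file, '.*'))
  FileUtils.mkdir_p(page_dir) # create the per-page directory if it is missing
  File.join(page_dir, format('img_%03d.%s', index, ext))
end
```

For example, image_path('images', 'saved/page1.html', 1) returns "images/page1/img_001.jpg" and creates images/page1 along the way.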

"wb" opens the output file in binary mode, which turns off the line-ending translation that text mode applies on some platforms. Without b on Windows, Ruby converts "\n" to "\r\n" on write, which corrupts a binary file. b skips that step. This is documented in the IO.new description.
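
A quick way to convince yourself is a round trip through a temporary file. This is a minimal sketch; binary_round_trip is a hypothetical helper, and the sample bytes are made up to include "\r" and "\n".

```ruby
require 'tmpdir'

# Writes data in binary mode, reads it back in binary mode, and returns it.
def binary_round_trip(data)
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'sample.bin')
    File.open(path, 'wb') { |f| f.write(data) } # no line-ending translation
    File.open(path, 'rb', &:read)               # bytes come back untouched
  end
end

# Made-up bytes containing CR, LF and a byte that is not valid UTF-8.
data = "\xFF\xD8\r\n\x00\n".b
round_trip = binary_round_trip(data)
```

File.binwrite(path, data) and File.binread(path) are shorthand for the same binary-mode write and read.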

You can't pass

imgsrc => 

as the URL for an image, because it isn't a URL. Both OpenURI and Net::HTTP expect a URL to the image, which they request, returning the response data to your code. Here the image data is already in imgsrc, so you need to Base64-decode it, which gives a binary string in memory that you can then write to a file opened in binary mode.
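
Since saved pages can mix embedded images with ordinary links, it may help to check what kind of src you have before deciding how to handle it. The classifier below is a sketch; image_source_kind and its three return values are hypothetical names chosen for this example.

```ruby
require 'uri'

# Hypothetical classifier for the src values found in saved pages.
def image_source_kind(src)
  return :data_uri if src.start_with?('data:') # already contains the image

  scheme = begin
    URI.parse(src).scheme
  rescue URI::InvalidURIError
    nil # not parseable as a URI, so treat it as a file path
  end
  %w[http https].include?(scheme&.downcase) ? :remote_url : :local_path
end
```

Only a :remote_url result needs OpenURI or Net::HTTP. A :data_uri is decoded in memory, and a :local_path points at a file the browser saved next to the page.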