Extracting excel data after unzipping using rubyzip

146 Views Asked by At

I'm trying to get spreadsheet data from zipped .xlsx files. I'm using rubyzip to access the contents of the zipfile

Zip::File.open(file_path) do |zip_file|
   zip_file.each do |entry|
     *process entry*
   end
end

My problem is that rubyzip gives a Zip::Entry object, which, I cant get to work with gems like roo or creek.

I've done something similar, but with .csv file. This was as simple as CSV.parse(entry.get_input_stream.read). However, that just gives me a string of encoded gibberish when using it on an .xlsx file.

I've looked around and the closest answer I got was temporarily extracting the files, but I want to avoid doing this since the files can get pretty large.

Does anyone have any suggestions? Thanks in advance.

1

There are 1 best solutions below

0
On BEST ANSWER

So what you need to do is convert the stream into an IO object that Roo can understand.

To determine if the object passed to Roo::Spreadsheet.open is a "stream" Roo uses the following method:

 def is_stream?(filename_or_stream)
    filename_or_stream.respond_to?(:seek)
  end

Since a Zip::InputStream does not respond to seek you cannot use this object directly. To get around this we simply need an object that does respond to seek (like a StringIO)

We can just read the input stream into the StringIO directly:

stream = StringIO.new(entry.get_input_stream.read)

Or the Zip library also provides a method to copy a Zip::InputStream to another IO object through the IOExtras module, which I think reads fairly nicely as well.

Knowing all of the above we can implement as follows:

Zip::File.open(file_path) do |zip_file|
   zip_file.each do |entry|
     # make sure Roo can handle the file (at least based on the extension)
     ext = File.extname(entry.name)&.to_sym 
     next unless Roo::CLASS_FOR_EXTENSION[ext]
     # stream = StringIO.new(entry.get_input_stream.read)
     ::Zip::IOExtras.copy_stream(stream = StringIO.new, entry.get_input_stream)
     spreadsheet = Roo::Spreadsheet.open(stream, extension: ext)
     # process file
   end
end