Ruby on Rails - How to convert some elements of a Word document to images


Context: In our platform we allow users to upload Word documents. Those documents are stored in Google Drive and then downloaded again to our platform in HTML format, to create a section where users can interact with that content.

Versions: Rails 5.0.7, Ruby 2.5.7p206, selenium-webdriver 3.142.7 (latest stable version compatible with our Ruby and Rails versions)

Problem: Some of the documents contain charts or graphics that are not processed correctly, which gives wrong results at the end of the process. We have been trying to fix this at the point where we receive the Word document, before sending it to Google Drive.

I'm looking for a simple way to export an entire chart and/or table as an image. If anyone knows of a way to do this, the advice would be much appreciated.

Edit 1: Adding some screenshots. This screenshot is from the original Word doc: (screenshot)

And this is how it looks in our system: (screenshot)

Here are the approaches I have tried that haven't worked for me so far.

Approach 1: Use Nokogiri to read the document and find the nodes that contain the charts (we found that they are called drawing), then use Selenium to navigate through the file and take a screenshot of that particular section.

The problem we found with this approach is that the versions of our gems are not compatible with the latest versions of Selenium and its web drivers (Chrome or Firefox), so it is not possible to perform this action. Another problem, which seems to be due to security restrictions, is that Selenium is not able to browse local files and open them.

      options = Selenium::WebDriver::Firefox::Options.new(binary: '/usr/bin/firefox', headless: true)
      driver = Selenium::WebDriver.for :firefox, options: options
      path = "#{Rails.root}/doc_file.docx"
      driver.navigate.to("file://#{path}")

      # First issue: the driver is not able to navigate to the file
      puts "Title: #{driver.title}"
      puts "URL: #{driver.current_url}"

      # Below is the code that I am trying to use to replace the images with the modified images
      drawing_elements = driver.find_elements(:css, 'w|drawing')
      modified_paragraphs = []
      drawing_elements.each do |drawing_element|
        paragraph_element = drawing_element.find_element(:xpath, '..')
        paragraph_element.screenshot.save('paragraph.png')
        modified_paragraph = File.read('paragraph.png')
        modified_paragraphs << modified_paragraph
      end
      driver.quit
      file = File.open(File.join(Rails.root, 'doc_file.docx'))
      doc = Nokogiri::XML(file)
      drawing_elements = doc.css('w|drawing')
      drawing_elements.each_with_index do |drawing_element, i|
        paragraph_element = drawing_element.parent
        paragraph_element.replace(modified_paragraphs[i])
      end
      File.write('modified_doc.docx', doc.to_xml)
      # File.write returns the number of bytes written, so upload the file contents instead
      s3_client.put_object(bucket: bucket, key: @document_path, body: File.binread('modified_doc.docx'))
      File.delete('doc_file.docx')
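
A possible explanation for the navigation failure, independent of gem versions: a .docx file is a ZIP package of XML parts, not a page that a browser can render, so there is likely no DOM for find_elements to search. The sketch below only checks for the ZIP signature at the start of a file. The helper name zip_package? is made up for illustration.

```ruby
# Signature that starts a ZIP local file header ("PK" followed by 0x03 0x04).
ZIP_MAGIC = "PK\x03\x04".b

# Returns true when the given bytes start with the ZIP signature,
# which is the case for .docx packages.
def zip_package?(bytes)
  bytes.byteslice(0, 4).to_s.b == ZIP_MAGIC
end
```

For a file on disk, something like zip_package?(File.binread(path, 4)) would read just the first four bytes.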

Approach 2: Use Nokogiri to get the drawing elements and then try to convert them directly to an image using rmagick or mini_magick.

This is only possible if the drawing element actually contains an image; in that case it can be converted correctly. The problem arises when the drawing element contains not an image but other elements such as graphicData, pic, blipFill, or blip. The code then needs to loop through the element and rebuild it, but at that point the element seems to be malformed and cannot be rebuilt.

Another issue with this approach arises when it finds elements that seem to make up an SVG file. It also needs to loop through all the elements and try to rebuild them, but, as with the issue above, the element seems to be malformed.

      response = s3_client.get_object(bucket: bucket, key: @document_path)
      docx = response.body.read
      Zip::File.open_buffer(docx) do |zip|
        doc = zip.find_entry("word/document.xml")
        doc_xml = doc.get_input_stream.read
        doc = Nokogiri::XML(doc_xml)
        drawing_elements = doc.xpath("//w:drawing")

        drawing_elements.each do |drawing_element|
          node = get_chil_by_name(drawing_element, "graphic")
          if node.xpath("//a:graphicData/a:pic/a:blipFill/a:blip").any?
            img_data = node.xpath("//a:graphicData/a:pic/a:blipFill/a:blip").first.attributes["r:embed"].value
            img = Magick::Image.from_blob(img_data).first
            img.write("node.jpeg")
            node.replace("<img src='#{img.to_blob}'/>")
          elsif node.xpath("//a:graphicData/a:svg").any?
            svg_data = node.xpath("//a:graphicData/a:svg").to_s
            Prawn::Document.generate("node.pdf") do |pdf|
              pdf.svg svg_data, at: [0, pdf.cursor], width: pdf.bounds.width
            end
          else
            puts "unsupported format"
          end
        end    
        # update the file in S3
        s3_client.put_object(bucket: bucket, key: @document_path, body: doc)
      end
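
One detail that may explain part of the trouble in the snippet above: the r:embed attribute of a blip is a relationship id (for example rId5), not image bytes, so it cannot be passed to Magick::Image.from_blob directly. The id is resolved through word/_rels/document.xml.rels, which maps it to a file inside the package. Also, in WordprocessingML the pic and blipFill elements normally use the pic prefix, not a. The sketch below shows the lookup on XML strings. It uses REXML instead of Nokogiri so that it runs without extra gems, and it assumes relative, internal targets and the usual r prefix. The method names are illustrative only.

```ruby
require "rexml/document"

# Namespace URIs used by the document part and by the relationships part.
NS = {
  "a"   => "http://schemas.openxmlformats.org/drawingml/2006/main",
  "rel" => "http://schemas.openxmlformats.org/package/2006/relationships"
}.freeze

# Maps each relationship Id to its Target, e.g. "rId5" => "media/image1.png".
def relationship_targets(rels_xml)
  rels = REXML::Document.new(rels_xml)
  REXML::XPath.match(rels, "//rel:Relationship", NS).each_with_object({}) do |rel, map|
    map[rel.attributes["Id"]] = rel.attributes["Target"]
  end
end

# Returns the package paths of the pictures referenced by a:blip elements.
# Assumption: targets are relative to the word/ folder and the
# relationships prefix is "r", as in the question's code.
def embedded_image_paths(document_xml, rels_xml)
  targets = relationship_targets(rels_xml)
  doc = REXML::Document.new(document_xml)
  REXML::XPath.match(doc, "//a:blip", NS).map do |blip|
    target = targets[blip.attributes["r:embed"]]
    target && "word/#{target}"
  end.compact
end
```

With rubyzip, as used in the question, the two XML strings would come from the word/document.xml and word/_rels/document.xml.rels entries, and each returned path could then be read from the archive (for example with zip.read(path)) to obtain the bytes for from_blob.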

Approach 3: Convert the elements, starting from their parents, to a PDF file and then to an image.

This has basically the same issue as approach 2: it needs to loop through all the elements and try to rebuild them, and we haven't found a way to do that.
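
Before trying to rebuild an element, it may help to check what kind of drawing it is. The uri attribute of a:graphicData distinguishes pictures from charts, and a chart is stored as a separate XML part (a chart definition) instead of a rendered image, so there are no pixels to extract from it. The following is a minimal sketch under that assumption, using REXML so that it runs without extra gems. The method and constant names are made up for illustration.

```ruby
require "rexml/document"

DRAWING_NS = {
  "w" => "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
  "a" => "http://schemas.openxmlformats.org/drawingml/2006/main"
}.freeze

# Known a:graphicData uri values and the kind of content they announce.
DRAWING_KINDS = {
  "http://schemas.openxmlformats.org/drawingml/2006/picture" => :picture,
  "http://schemas.openxmlformats.org/drawingml/2006/chart"   => :chart
}.freeze

# Returns one symbol per w:drawing element, in document order:
# :picture, :chart, or :other for anything not listed above.
def drawing_kinds(document_xml)
  doc = REXML::Document.new(document_xml)
  REXML::XPath.match(doc, "//w:drawing", DRAWING_NS).map do |drawing|
    data = REXML::XPath.first(drawing, ".//a:graphicData", DRAWING_NS)
    uri = data && data.attributes["uri"]
    DRAWING_KINDS.fetch(uri, :other)
  end
end
```

Drawings reported as :picture already have an image file in the package. Those reported as :chart would need something that can actually render the chart, which none of the three approaches provides.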
