docsplit gem pdf to text


I have basically the same problem as the one discussed here: http://blog.joshsoftware.com/2014/08/13/pdf-to-plain-text-processing-using-docsplit/ However, the solution proposed there doesn't work for me with docsplit.

 Docsplit.extract_text(filepath, {:pdf_opts => '-layout', output: 'tmp_text_file'})

The :pdf_opts => '-layout' option doesn't do anything, and I can't find any documentation for options like it, so I still get a single word per line in the output text file.

Does anyone know how to get an accurate text file?

Thank you

Best answer

If you read the blog post carefully, you'll see that the option

 :pdf_opts => '-layout'

is not yet supported by the master branch of the docsplit gem. Support for it comes from this pull request: https://github.com/documentcloud/docsplit/pull/114. To get it, point your Gemfile at the fork:

gem 'docsplit', git: 'git://github.com/narutosanjiv/docsplit.git'
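
If you would rather not depend on a fork, another option is to call pdftotext yourself, since that is the tool docsplit shells out to for text extraction and -layout is one of its own flags. Below is a minimal sketch, assuming pdftotext (from Poppler or Xpdf) is installed and on your PATH. The helper names pdftotext_command and extract_text_with_layout are hypothetical and not part of docsplit.

```ruby
require 'open3'

# Hypothetical helper: build the argument list for pdftotext.
# -layout asks pdftotext to keep the physical layout of the page,
# -enc UTF-8 sets the output encoding.
def pdftotext_command(pdf_path, text_path)
  ['pdftotext', '-layout', '-enc', 'UTF-8', pdf_path, text_path]
end

# Hypothetical helper: run pdftotext and return the extracted text.
# Assumes the pdftotext binary is available on PATH.
def extract_text_with_layout(pdf_path, text_path)
  _stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, text_path))
  raise "pdftotext failed: #{stderr}" unless status.success?

  File.read(text_path)
end
```

Calling extract_text_with_layout(filepath, 'tmp_text_file') should then write the layout-preserving text to tmp_text_file and return it as a string. The command is passed as an array, so file paths with spaces do not need shell quoting.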

Hope this helps. Let me know if you still face any issues.