Extract all content from a PDF file (not just text, but also tables/diagrams)?


I'd like to reformat the main content of a PDF, so I need to extract that content: not just text, but also tables, diagrams, etc., along with their layout information. I'm only interested in the main body. For a technical paper, for example, that means the columns of text, the tables, and the diagrams. Headers, footers, and text in the margins can be ignored.

The idea is to scan the content stream of each PDF page and classify each piece as either a text paragraph or something else. If it is a text paragraph, I may apply some formatting to it. If it is something else, such as a table, a diagram, or anything that isn't a paragraph, I'll keep it as is, or just shrink or enlarge it to fit the new display.

For example, given the following stream, I'd collect the text and note its starting point relative to the page:

stream
BT
/F1 20 Tf
120 120 Td
(Hello from Steve) Tj
ET
endstream

I'd continue decomposing the stream content in this way, organizing it into an array of document elements, each with its relative position and a note of whether it is a paragraph (so that the associated text can be reformatted).
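To make this concrete, here is a rough sketch of the kind of decomposition I have in mind, applied to the stream above. It is only an illustration under strong assumptions: the stream is already uncompressed, text is shown with plain Tj, positioning uses only Td (no Tm or cm matrices), and strings have no nested parentheses or font-specific encodings. The names TextElement and extract_text_elements are my own, not from any library.

```python
import re
from dataclasses import dataclass


@dataclass
class TextElement:
    text: str
    x: float      # text-space position set by Td, not final page coordinates
    y: float
    font: str
    size: float


NUMBER = r"[-+]?\d*\.?\d+"
# Literal strings, names, numbers, then operators (anything alphabetic).
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)|/[^\s()/\[\]<>]+|" + NUMBER + r"|[A-Za-z'\"*]+"
)


def extract_text_elements(stream: str) -> list[TextElement]:
    elements = []
    operands = []          # operands precede their operator in PDF syntax
    x = y = 0.0
    font, size = "", 0.0
    in_text = False
    for tok in TOKEN.findall(stream):
        if tok[0] in "(/" or re.fullmatch(NUMBER, tok):
            operands.append(tok)
            continue
        if tok == "BT":                       # begin text object
            in_text, x, y = True, 0.0, 0.0
        elif tok == "ET":                     # end text object
            in_text = False
        elif tok == "Tf" and len(operands) >= 2:
            font, size = operands[-2][1:], float(operands[-1])
        elif tok == "Td" and len(operands) >= 2:
            x += float(operands[-2])          # Td offsets are relative
            y += float(operands[-1])
        elif tok == "Tj" and in_text and operands:
            raw = operands[-1][1:-1]          # strip the enclosing ( )
            # Only \( \) and \\ are unescaped in this sketch.
            text = re.sub(r"\\([()\\])", r"\1", raw)
            elements.append(TextElement(text, x, y, font, size))
        operands.clear()                      # unknown operators are skipped
    return elements


sample = """stream
BT
/F1 20 Tf
120 120 Td
(Hello from Steve) Tj
ET
endstream"""

for element in extract_text_elements(sample):
    print(element)
```

Real-world streams are usually compressed and use TJ arrays, transformation matrices, and custom font encodings, so this would need a proper PDF library underneath. It only shows the shape of the output I'm after.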

I suspect that even just decomposing a stream, telling which parts are paragraphs of text, and noting their relative positions may not be trivial.
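As an illustration of why the paragraph part is a heuristic rather than something the PDF states directly, here is a minimal sketch of one possible grouping rule. It assumes I already have text lines with positions and font sizes, a single column, and that consecutive lines of a paragraph share roughly the same left edge and sit within about 1.5 font sizes of each other. The Line type, the function name, and both thresholds are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x: float     # left edge of the line
    y: float     # baseline; PDF y grows upward from the bottom of the page
    size: float  # font size


def group_into_paragraphs(lines, max_leading=1.5, x_tolerance=2.0):
    """Group lines into blocks by left alignment and vertical spacing."""
    blocks = []
    # Sort top to bottom: a larger y is higher on the page.
    for line in sorted(lines, key=lambda l: -l.y):
        if blocks:
            prev = blocks[-1][-1]
            same_column = abs(line.x - prev.x) <= x_tolerance
            close_enough = (prev.y - line.y) <= max_leading * prev.size
            if same_column and close_enough:
                blocks[-1].append(line)
                continue
        blocks.append([line])  # start a new block
    return blocks


page = [
    Line("First paragraph, line one", 72, 700, 12),
    Line("first paragraph, line two", 72, 686, 12),
    Line("first paragraph, line three", 72, 672, 12),
    Line("Second paragraph, line one", 72, 600, 12),
    Line("second paragraph, line two", 72, 586, 12),
]

for block in group_into_paragraphs(page):
    print(" ".join(line.text for line in block))
```

Multi-column layouts, indented first lines, and tables whose cells happen to line up would all break this simple rule, which is part of why I'm asking for pointers.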

I found that pdf.js's page.render() might help me achieve this goal, but I haven't figured out how it could be adapted.

pdf2htmlEX might also have a similar mechanism, since it can convert a PDF file to HTML.

But I'm not sure at what level these tools do their rendering/conversion. If they simply render the content as an image, they may not help for my purpose.

Adobe's PDF viewer on Android offers a reflow function for PDF content on a mobile phone's small screen. It may use the kind of full content capture and transformation that I'd like to have.

So my question is: can anyone give me pointers on how these requirements could be achieved?

Thanks a lot
