exist-db how to access a pdf

185 Views Asked by At

I am sure it is very simple ... I just cannot get my head around this... the exist-db Documentation is a bit fuzzy on content extraction... http://exist-db.org/exist/apps/doc/contentextraction.

I have a pdf-file, containing of about 162 high-res images (the pdf is quite big ...) and I do not know how to access any of the that are presumably created ...

please do not destroy me! I am just starting to build a database (for an Edition at Uni)I'd love to have a facsimile edition (so one Tab with the image-file and one tab with the transcribed texts)

I aim at doing something similar to what Heidelberg Universitdy did with the "Welsche Gast Digital" http://digi.ub.uni-heidelberg.de/diglit/cpg389/0190/image (the choosen image is just an example! ) This pic When clicking on faksimile the Scan opens and when clicking on Transkription the transcribed texts open!

I am quite new to Xquery, Xpath and most X-related stuff. I have a "working design" put together in exist-db and am looking at TEI for marking up the transcritpion etc, I fear I'll have to spend quite some time on this issue ... (it is not about doing my job for me, it's just about pointing me in the right direction)

1

There are 1 best solutions below

3
On BEST ANSWER

I m afraid the short answer is simply don't.

Storing a pdf in your db, and then trying to extract images from it, is kind of a recipe for disaster. Instead you should use the source images (not necessarily extracted from the pdf), and store these individually in a collection (e.g. resources/img). Those image files are then the binary resources that the documentation is actually talking about.

You might want to take a look at tei-publisher for creating digital edition in exist, especially this demo app for how to present high-res facsimiles with transcribed portions of text. I m afraid its all a bit more involved then just opening a pdf in a browser, but so is the Welsche Gast Digital