PDF to XML Conversion in MarkLogic

We are trying to convert a PDF to XML using the following query:

xquery version "1.0-ml";
(: Convert the PDF. The result is a sequence of nodes; the first one is the manifest. :)
let $results := xdmp:pdf-convert(
      xdmp:document-get("d:\CFR-2010-title48-vol1.pdf"),
      "CFR-2010-title48-vol1.xml"),
    $manifest := $results[1]
return $results

But it didn't generate the XML output for the PDF. It generated the following output files:

<parts xmlns="xdmp:pdf-convert">
  <part>CFR-2010-title48-vol1_xml.xhtml</part>
  <part>CFR-2010-title48-vol1_xml_parts/01_00.jpg</part>
  <part>CFR-2010-title48-vol1_xml_parts/01_01.jpg</part>
  <part>CFR-2010-title48-vol1_xml_parts/conv.css</part>
  <part>CFR-2010-title48-vol1_xml_parts/toc.txt</part>
</parts>

Can you please suggest how to generate the XML output for the given PDF file?

Thanks

Venkat

There is 1 best solution below

The first document returned is XML.

Were you looking to get the DocBook? For that you need to run the entire upconversion process, and the easiest way to do that is to run the document through the CPF conversion application, which runs through a series of steps and inferences to get to that point.

Or: Are you wondering why the name in the part doesn't match the name from the second parameter to xdmp:pdf-convert? The second parameter is just used to adjust the generated hrefs to images; it is not used for the conversion output itself.
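To see how the manifest relates to the returned parts, here is a rough sketch that stores each converted part in the database under the name the manifest gives it. It assumes the nodes after the manifest come back in the same order as the `<part>` entries, and that using the manifest names directly as document URIs is acceptable; adjust both if that does not hold in your setup.

```xquery
xquery version "1.0-ml";
(: Sketch only: pairs each manifest entry with the node at the same position.
   Assumption: parts are returned in manifest order. :)
let $results  := xdmp:pdf-convert(
      xdmp:document-get("d:\CFR-2010-title48-vol1.pdf"),
      "CFR-2010-title48-vol1.xml")
let $manifest := $results[1]
(: Everything after the manifest is a converted part. :)
let $parts    := fn:subsequence($results, 2)
(: Part names as listed in the manifest, e.g. CFR-2010-title48-vol1_xml.xhtml :)
let $names    := $manifest//*:part/fn:string(.)
for $part at $i in $parts
return xdmp:document-insert($names[$i], $part)
```

With the manifest from the question, the document stored as CFR-2010-title48-vol1_xml.xhtml would be the XHTML (that is, XML) rendering of the PDF; the remaining entries are its images, stylesheet, and table of contents.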

Or: If you want to target XML of some other kind (not XHTML) directly from the format conversion of xdmp:pdf-convert, you can apply a different configuration file. See the documentation on that function for more details.