Convert .pdf to .docx with the Adobe PDF Services API (using Python)


I'm trying to write a Python program that converts ".pdf" files to ".docx" files, using the Adobe PDF Services API (free trial).

I've found documentation showing how to turn any ".pdf" file into a ".zip" file containing ".txt" files (the text data) and Excel files (the tabular data).

import logging
import os.path

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation


logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))

try:
    # get base path.
    base_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath("C:/..link.../extractpdf/extract_txt_from_pdf.ipynb"))))

    # Initial setup, create credentials instance.
    credentials = Credentials.service_account_credentials_builder()\
        .from_file(os.path.join(base_path, "pdfservices-api-credentials.json")) \
        .build()

    # Create an ExecutionContext using credentials and create a new operation instance.
    execution_context = ExecutionContext.create(credentials)
    extract_pdf_operation = ExtractPDFOperation.create_new()

    # Set operation input from a source file.
    source = FileRef.create_from_local_file(base_path + "/resources/trs_pdf_file.pdf")
    extract_pdf_operation.set_input(source)

    # Build ExtractPDF options and set them into the operation
    extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
        .with_element_to_extract(ExtractElementType.TEXT) \
        .with_element_to_extract(ExtractElementType.TABLES) \
        .build()
    extract_pdf_operation.set_options(extract_pdf_options)

    # Execute the operation.
    result: FileRef = extract_pdf_operation.execute(execution_context)

    # Save the result to the specified location.
    result.save_as(base_path + "/output/Extract_TextTableau_From_trs_pdf_file.zip")
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")

But I still can't get a ".docx" file out of it, even after renaming the extracted file to name.docx.

I read the documentation for adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options.ExtractPDFOptions() but didn't find any option that changes the output from ".zip" to ".docx". What can I try next?
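
For context, a ".docx" file is itself a ZIP package with a specific internal layout, so renaming the Extract archive cannot produce a valid Word document. Below is a minimal sketch of a check that uses only the standard library. The helper name looks_like_docx and the stand-in archive (with a single structuredData.json entry) are illustrative assumptions, not part of the Adobe SDK.

```python
import io
import zipfile

# Parts that a Word (.docx) package is expected to contain.
DOCX_REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def looks_like_docx(source) -> bool:
    """Return True if `source` (a path or file-like object) is a ZIP archive
    that holds the core parts of a .docx package."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
    return all(part in names for part in DOCX_REQUIRED_PARTS)


# Hypothetical stand-in for the Extract result: a ZIP with one JSON entry.
extract_like = io.BytesIO()
with zipfile.ZipFile(extract_like, "w") as archive:
    archive.writestr("structuredData.json", "{}")

print(looks_like_docx(extract_like))  # prints False
```

Running the same check on the renamed name.docx would be expected to fail in the same way, because the file extension does not change what is inside the archive.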

Best answer, by Raymond Camden

Unfortunately, the Python SDK currently supports only the Extract portion of our PDF Services. As an alternative, you could call the services through the REST APIs (https://documentcloud.adobe.com/document-services/index.html#how-to-get-started-).
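
As a rough sketch of what the REST route could look like from Python, the snippet below only builds the export request. Everything specific in it is an assumption to verify against the current REST documentation: the base URL, the /operation/exportpdf path, the assetID and targetFormat fields, and the header names. It also assumes the PDF has already been uploaded and an access token obtained; those steps are not shown, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the PDF Services REST API; check the current docs.
BASE_URL = "https://pdf-services.adobe.io"


def build_export_request(asset_id: str, client_id: str, access_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST asking for a PDF-to-DOCX export.

    The endpoint path and JSON field names are assumptions, not confirmed here.
    """
    body = json.dumps({"assetID": asset_id, "targetFormat": "docx"}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {access_token}",
        "x-api-key": client_id,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        f"{BASE_URL}/operation/exportpdf", data=body, headers=headers, method="POST"
    )


def submit_export(request: urllib.request.Request):
    """Send the request. The job's polling URL is assumed to be returned
    in the Location response header."""
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Location")


# Placeholder values only; nothing is sent over the network here.
request = build_export_request("example-asset-id", "example-client-id", "example-token")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from submit_export makes it possible to inspect the payload before sending anything with real credentials.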