How to consolidate information into an Excel or .csv file using the Adobe PDF Services Extract API?

I have recently started using the Adobe PDF Services Extract API. I am able to extract some pieces of information from the PDF, but they are not structured. How can I consolidate the data from the PDF into separate columns of an Excel or .csv file? The data includes text as well as tables.

I tried the sample Extract PDF code given in the documentation, but the form in which the data is returned is not organised.

import logging
import os.path

from adobe.pdfservices.operation.auth.credentials import Credentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_pdf_options import ExtractPDFOptions
from adobe.pdfservices.operation.pdfops.options.extractpdf.extract_element_type import ExtractElementType
from adobe.pdfservices.operation.execution_context import ExecutionContext
from adobe.pdfservices.operation.io.file_ref import FileRef
from adobe.pdfservices.operation.pdfops.extract_pdf_operation import ExtractPDFOperation

logging.basicConfig(level=os.environ.get("LOGLEVEL", "INFO"))

try:
    # get base path.
    base_path = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

    # Initial setup, create credentials instance.
    credentials = Credentials.service_account_credentials_builder() \
        .from_file(base_path + "/pdfservices-api-credentials.json") \
        .build()

    # Create an ExecutionContext using the credentials.
    execution_context = ExecutionContext.create(credentials)

    # Process output0.pdf through output99.pdf, one file per iteration.
    for x in range(0, 100):
        s = str(x)

        # Create a new operation for each file rather than reusing an
        # instance that has already been executed.
        extract_pdf_operation = ExtractPDFOperation.create_new()

        # Set operation input from a source file.
        source = FileRef.create_from_local_file(base_path + "/resources/output" + s + ".pdf")
        extract_pdf_operation.set_input(source)

        # Build ExtractPDF options and set them into the operation.
        extract_pdf_options: ExtractPDFOptions = ExtractPDFOptions.builder() \
            .with_element_to_extract(ExtractElementType.TEXT) \
            .with_element_to_extract(ExtractElementType.TABLES) \
            .build()
        extract_pdf_operation.set_options(extract_pdf_options)

        # Execute the operation.
        result: FileRef = extract_pdf_operation.execute(execution_context)

        # Save the result to the specified location.
        result.save_as(base_path + "/output/ExtractTextTableInfoFromPDF" + s + ".zip")
except (ServiceApiException, ServiceUsageException, SdkException):
    logging.exception("Exception encountered while executing operation")

There is 1 best solution below

So this is to be expected. Given a PDF, our API can tell you about all the parts of it ("at this point x and y, we have font so and so, and text so and so"), but it can't tell you, "This is a person's first name." We'd find their name, but we wouldn't know that it is a name.

So we return structured info, but it describes the structure of the document, not what the content means.
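
To make that concrete, here is a minimal sketch of flattening the structural output into rows you can open in Excel. It assumes the result zip contains a structuredData.json file with an "elements" list whose entries carry "Page", "Path" and "Text" keys; check that against your own output before relying on it. The function names elements_to_rows and write_rows are made up for this example and are not part of the SDK.

```python
import csv
import json
import zipfile


def elements_to_rows(zip_path):
    """Return one (page, path, text) tuple per element that carries text."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the Extract result zip holds "structuredData.json".
        with archive.open("structuredData.json") as handle:
            data = json.load(handle)

    rows = []
    for element in data.get("elements", []):
        text = element.get("Text")
        if text is None:
            # Elements without a "Text" key (e.g. figures) are skipped.
            continue
        rows.append((element.get("Page"), element.get("Path"), text.strip()))
    return rows


def write_rows(rows, csv_path):
    """Write the rows to a CSV file with a simple header."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["page", "path", "text"])
        writer.writerows(rows)
```

You would call elements_to_rows on each saved zip and pass the combined rows to write_rows. Deciding which row is a name, a date, or an amount is still up to your own rules; the "Path" column (headings, paragraphs, table cells) is usually the most useful hint for that.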

If your PDF has tables, we'll return them in CSV or XLSX format, which you can parse. But we won't tell you, "oh, this is a table of cats"; you just get the tabular data, such as columns of numbers.
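
As a rough sketch of the parsing side, the snippet below gathers every table CSV from a set of result zips into one file. It assumes you requested table renditions in CSV format and that they land under a tables/ folder inside each zip; both are assumptions to verify against your own output. collect_table_rows and merge_tables are hypothetical helper names, not SDK calls.

```python
import csv
import io
import os
import zipfile


def collect_table_rows(zip_path):
    """Return every row of every tables/*.csv in the zip, tagged with its source."""
    rows = []
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: table renditions are stored as CSV files under "tables/".
        names = sorted(
            name for name in archive.namelist()
            if name.startswith("tables/") and name.endswith(".csv")
        )
        for name in names:
            with archive.open(name) as raw:
                # utf-8-sig drops a leading byte-order mark if one is present.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                for row in csv.reader(text):
                    rows.append([os.path.basename(zip_path), name] + row)
    return rows


def merge_tables(zip_paths, csv_path):
    """Write the table rows from all zips into a single CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Tables can differ in width, so only the two tag columns are named.
        writer.writerow(["source_zip", "table_file", "cell_values"])
        for zip_path in zip_paths:
            writer.writerows(collect_table_rows(zip_path))
```

The first two columns record which zip and which table file each row came from, so you can still tell the tables apart after merging. Mapping those cells onto meaningful column names is something you would add on top, since the API does not know what the table is about.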