Python - Google Cloud Document AI API - Not reading the whole .pdf file

I am trying to read a PDF stored in GCS in Python using the Google Document AI API and return the text from the PDF as a string. I do not want the parser to read tables and images, as I am only interested in the text. Below is the code I am using to parse the document.

# Import assumed (not shown in the original post); the client class below belongs to the v1beta2 API
from google.cloud import documentai_v1beta2 as documentai


def parse_invoice_1(project_id='xxx',
                    input_uri='gs://xxx/file.pdf'):
    # Authenticate with a service account key file
    client = documentai.DocumentUnderstandingServiceClient.from_service_account_json('json_file')

    # Point the request at the PDF stored in GCS
    gcs_source = documentai.types.GcsSource(uri=input_uri)
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type='application/pdf')

    key_value_pair_hints = [
        documentai.types.KeyValuePairHint(key='Emergency Contact',
                                          value_types=['NAME']),
        documentai.types.KeyValuePairHint(
            key='Referred By')
    ]

    # Setting enabled=True enables form extraction
    form_extraction_params = documentai.types.FormExtractionParams(
        enabled=True,
        key_value_pair_hints=key_value_pair_hints)

    parent = 'projects/{}/locations/us'.format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        parent=parent,
        input_config=input_config,
        form_extraction_params=form_extraction_params)

    # Synchronous call: the response contains the parsed document
    document = client.process_document(request=request)
    return document.text


string = parse_invoice_1()

The PDF file has about 410 pages, but the parsing code above reads only 6 pages. Am I missing something?

There is 1 best solution below

There is a limit of 10 pages per request when using process_document(). To process more pages at once, I suggest using batch_process_documents(), which processes your documents asynchronously and can handle a maximum of 500 pages per request.
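To make the difference concrete, here is a small back-of-the-envelope calculation using only the two limits quoted above and the page count from the question. The helper name `requests_needed` is made up for illustration and is not part of any Google library.

```python
import math

SYNC_PAGE_LIMIT = 10    # pages per process_document() request, as stated above
BATCH_PAGE_LIMIT = 500  # pages per batch_process_documents() request, as stated above


def requests_needed(total_pages, page_limit):
    """Smallest number of requests whose combined page allowance covers total_pages."""
    return math.ceil(total_pages / page_limit)


total_pages = 410  # approximate page count from the question

print(requests_needed(total_pages, SYNC_PAGE_LIMIT))   # synchronous requests
print(requests_needed(total_pages, BATCH_PAGE_LIMIT))  # batch requests
```

Under these numbers the whole file fits in a single batch request, whereas the synchronous route would require cutting the PDF into dozens of smaller files first.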

For reference, see the limits table: (image not available)

A code sample for batch processing can be found in "Large file offline processing (Python)".
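As a rough sketch of what the batch variant of the question's code could look like (this is not the official sample), assuming the same v1beta2 client library as in the question: the `destination_uri`, `key_file` and `timeout` parameters and the helper `parent_path` are placeholders introduced here, and the exact type names should be checked against the installed library version. Form extraction is left out because only the text is wanted.

```python
def parent_path(project_id, location='us'):
    """Build the parent resource name used in the question's code."""
    return 'projects/{}/locations/{}'.format(project_id, location)


def batch_parse_text(project_id, input_uri, destination_uri,
                     key_file='json_file', timeout=600):
    """Submit one asynchronous batch request and wait for it to finish.

    Assumption: the v1beta2 library from the question is installed. It is
    imported inside the function so this module still loads without it.
    """
    from google.cloud import documentai_v1beta2 as documentai

    client = documentai.DocumentUnderstandingServiceClient.from_service_account_json(key_file)

    # Same input configuration as the synchronous version
    input_config = documentai.types.InputConfig(
        gcs_source=documentai.types.GcsSource(uri=input_uri),
        mime_type='application/pdf')

    # Batch results are written to GCS instead of being returned directly
    output_config = documentai.types.OutputConfig(
        gcs_destination=documentai.types.GcsDestination(uri=destination_uri))

    request = documentai.types.ProcessDocumentRequest(
        input_config=input_config,
        output_config=output_config)

    batch_request = documentai.types.BatchProcessDocumentsRequest(
        parent=parent_path(project_id),
        requests=[request])

    # Returns a long-running operation; result() blocks until it completes
    operation = client.batch_process_documents(request=batch_request)
    operation.result(timeout=timeout)
    return destination_uri
```

A call might look like `batch_parse_text('xxx', 'gs://xxx/file.pdf', 'gs://xxx/output/')`, where the output prefix is hypothetical. The parsed documents should then appear as JSON files under that prefix, and their text fields would still need to be downloaded and concatenated to get a single string.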