I am trying to read a PDF stored in GCS in Python using the Google Document AI API and return the text from the PDF as a string. I do not want the parser to read tables and images, as I am only interested in the text. Below is the code I am using to parse the document.
from google.cloud import documentai_v1beta2 as documentai

def parse_invoice_1(project_id='xxx',
                    input_uri='gs://xxx/file.pdf'):
    client = documentai.DocumentUnderstandingServiceClient.from_service_account_json('json_file')

    gcs_source = documentai.types.GcsSource(uri=input_uri)
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type='application/pdf')

    key_value_pair_hints = [
        documentai.types.KeyValuePairHint(
            key='Emergency Contact', value_types=['NAME']),
        documentai.types.KeyValuePairHint(key='Referred By')
    ]

    # Setting enabled=True enables form extraction
    form_extraction_params = documentai.types.FormExtractionParams(
        enabled=True,
        key_value_pair_hints=key_value_pair_hints)

    parent = 'projects/{}/locations/us'.format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        parent=parent,
        input_config=input_config,
        form_extraction_params=form_extraction_params)

    document = client.process_document(request=request)
    return document.text

string = parse_invoice_1()
The PDF file has about 410 pages, but the above parsing code reads only 6 pages. Am I missing something?
There is a limit of 10 pages per request when using process_document(). To process more pages at once, use batch_process_documents(), which processes your documents asynchronously and can handle a maximum of 500 pages per request.
For the limits table, see here:
A code sample for batch processing can be seen in Large file offline processing (python).
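If it helps, here is a rough sketch of that approach using the same v1beta2 client as in your question. It is untested against your project: the destination URI, pages_per_shard, and timeout values are placeholder assumptions, and the output shards are read back from GCS as Document JSON to rebuild the full text.

import json
import re

from google.cloud import documentai_v1beta2 as documentai
from google.cloud import storage


def batch_parse_pdf(project_id='xxx',
                    input_uri='gs://xxx/file.pdf',
                    destination_uri='gs://xxx/output/',  # placeholder output folder
                    timeout=900):
    client = documentai.DocumentUnderstandingServiceClient.from_service_account_json('json_file')

    gcs_source = documentai.types.GcsSource(uri=input_uri)
    input_config = documentai.types.InputConfig(
        gcs_source=gcs_source, mime_type='application/pdf')

    # Batch (offline) processing writes sharded Document JSON files back to GCS.
    gcs_destination = documentai.types.GcsDestination(uri=destination_uri)
    output_config = documentai.types.OutputConfig(
        gcs_destination=gcs_destination, pages_per_shard=100)

    parent = 'projects/{}/locations/us'.format(project_id)
    request = documentai.types.ProcessDocumentRequest(
        parent=parent,
        input_config=input_config,
        output_config=output_config)

    # batch_process_documents() returns a long-running operation; wait for it.
    batch_request = documentai.types.BatchProcessDocumentsRequest(
        parent=parent, requests=[request])
    operation = client.batch_process_documents(request=batch_request)
    operation.result(timeout)

    # Download each output shard and concatenate the extracted text.
    match = re.match(r'gs://([^/]+)/(.+)', destination_uri)
    bucket_name, prefix = match.group(1), match.group(2)

    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)

    text = ''
    for blob in bucket.list_blobs(prefix=prefix):
        shard = json.loads(blob.download_as_bytes())
        text += shard.get('text', '')
    return text

Note that operation.result() blocks until the asynchronous job finishes, so a ~410-page file may need a larger timeout, and the final loop assumes the destination prefix contains only the shards from this run.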