Original File Name - GCP - Document AI

144 Views Asked by Camillo At 13 September 2023 at 13:03

I'm using Document AI to perform OCR on several thousand PDF documents with its Python client.

I'm uploading them into a bucket, batch processing them and a .json output is generated in another folder in the same bucket.

The issue is that after running Document AI, I have to connect each .json output to its original .pdf file name. Each PDF has only three pages, so each PDF corresponds to a single .json file. However, the .json file name is not the same as the original file name: it has slight modifications.

e.g.
input: 79169_FATTURA FORNITURA_73090036581301A 2.215,52.pdf
output: f193caf7d2af1f4c-79169_FATTURA FORNITURA_73090036581301A 2.21552-0.json

Is there a way to get the original file name in the OCR output, or to connect the two somehow?

At the moment I have the field mask set to return only text, but even with it turned off I don't see anything like that in the .json output.

I tried checking the docs but couldn't find anything!


There are 1 best solutions below

Holt Skinner On 15 September 2023 at 15:50

There is no way to customize the output JSON file name when doing batch processing with Document AI.

You can get the original input file name mapped to the output JSON files in BatchProcessMetadata.individualProcessStatuses[].inputGcsSource from the long-running Operation when the batch processing request is made.
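As a rough illustration of that mapping, here is a minimal sketch that assumes you already have the operation metadata as a plain dict, for example the JSON returned when fetching the long-running operation over REST. The function name map_inputs_to_outputs and the sample bucket paths are hypothetical; only the field names individualProcessStatuses, inputGcsSource and outputGcsDestination come from BatchProcessMetadata.

```python
from typing import Dict, Optional


def map_inputs_to_outputs(metadata: dict) -> Dict[str, Optional[str]]:
    """Map each original input URI to the output location reported for it.

    `metadata` is assumed to be the JSON form of BatchProcessMetadata.
    """
    mapping: Dict[str, Optional[str]] = {}
    for status in metadata.get("individualProcessStatuses", []):
        source = status.get("inputGcsSource")
        if source:
            # outputGcsDestination is the location the results were written to
            mapping[source] = status.get("outputGcsDestination")
    return mapping


# Hypothetical sample data, made up for illustration only
sample_metadata = {
    "individualProcessStatuses": [
        {
            "inputGcsSource": "gs://my-bucket/input/invoice.pdf",
            "outputGcsDestination": "gs://my-bucket/output/123/0",
        }
    ]
}

for pdf_uri, output_uri in map_inputs_to_outputs(sample_metadata).items():
    print(pdf_uri, "->", output_uri)
```

With the Python client, the same information should be available as snake_case attributes (individual_process_statuses, input_gcs_source, output_gcs_destination) on the operation's metadata object.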

The easiest way to access this is with the Document AI Toolbox Python SDK. You can create wrapped Document objects with the from_batch_process_operation() method, passing the operation name as input; each resulting object has a gcs_input_uri attribute, which is the original input file path. The tool is open source on GitHub if you want to see how this is extracted from the BatchProcessMetadata.
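A minimal sketch of the Toolbox approach might look like the following. It assumes the google-cloud-documentai-toolbox package is installed and that from_batch_process_operation() takes a location and an operation name and returns a list of wrapped documents. The helper names load_wrapped_documents and index_by_input_uri are hypothetical, as are the location and operation name shown in the usage comment.

```python
from typing import Dict, Iterable


def index_by_input_uri(wrapped_documents: Iterable) -> Dict[str, object]:
    """Key each wrapped document by the original input file path."""
    return {doc.gcs_input_uri: doc for doc in wrapped_documents}


def load_wrapped_documents(location: str, operation_name: str):
    """Build wrapped documents from a finished batch process operation."""
    # Imported lazily so the helper above can be used without the package
    from google.cloud.documentai_toolbox import document

    return document.Document.from_batch_process_operation(
        location=location,
        operation_name=operation_name,
    )


# Usage (placeholder values, replace with your own operation name):
# docs = load_wrapped_documents(
#     location="us",
#     operation_name="projects/PROJECT_ID/locations/us/operations/OPERATION_ID",
# )
# for input_uri, doc in index_by_input_uri(docs).items():
#     print(input_uri, len(doc.text))
```

Keying the results by gcs_input_uri gives a direct lookup from each original PDF path to its OCR output, without having to parse the generated .json file names.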
