Original File Name - GCP - Document AI


I'm using Document AI, through its Python client, to run OCR on several thousand PDF documents.

I upload them to a bucket and batch process them, and the .json output is written to another folder in the same bucket.

The issue is that after running Document AI, I have to somehow connect each .json output with its original file name (.pdf). Each PDF has only three pages, so each PDF corresponds to a single .json file. However, the .json file name is not the same as the original file name: it has slight modifications.

e.g.
input: 79169_FATTURA FORNITURA_73090036581301A 2.215,52.pdf
output: f193caf7d2af1f4c-79169_FATTURA FORNITURA_73090036581301A 2.21552-0.json

Is there a way to get the original file name in the OCR output, or to connect the two somehow?

At the moment I have the field mask set to return only text, but even with it turned off I don't see anything like that in the .json output.

I tried checking the docs but couldn't find anything!

Holt Skinner

There is no way to customize the output JSON file name when doing batch processing with Document AI.

You can get the original input file name mapped to the output JSON files from BatchProcessMetadata.individualProcessStatuses[].inputGcsSource, which is part of the metadata of the long-running Operation returned when the batch processing request is made.
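As a rough sketch of that lookup in Python: the helper names `map_outputs_to_inputs` and `original_file_name` below are made up for illustration. The helpers only assume the metadata object exposes `individual_process_statuses`, each with `input_gcs_source` and `output_gcs_destination` (the Python spellings of the fields above). My understanding is that `output_gcs_destination` is the folder prefix holding the JSON shards rather than a single file, so treat that as an assumption to verify against your own bucket.

```python
import posixpath


def map_outputs_to_inputs(metadata):
    """Map each output GCS destination to the input file it was produced from.

    `metadata` is expected to look like BatchProcessMetadata, which (as an
    assumption about your setup) you would obtain roughly like this:

        from google.cloud import documentai
        operation = client.batch_process_documents(request)
        operation.result()  # wait for the batch to finish
        metadata = documentai.BatchProcessMetadata(operation.metadata)
    """
    mapping = {}
    for status in metadata.individual_process_statuses:
        mapping[status.output_gcs_destination] = status.input_gcs_source
    return mapping


def original_file_name(input_gcs_source):
    """Return just the file name part of a gs:// URI."""
    return posixpath.basename(input_gcs_source)
```

With the mapping in hand, any JSON file found under a given output destination can be tied back to its PDF via `original_file_name(mapping[destination])`.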

The easiest way to access this is with the Document AI Toolbox Python SDK. You can create wrapped Document objects with the method from_batch_process_operation(), passing the operation name as input; each resulting object has the attribute gcs_input_uri, which is the original input file path. This tool is open source on GitHub if you want to see how this is extracted from the BatchProcessMetadata.
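A minimal sketch of the Toolbox route, assuming the `google-cloud-documentai-toolbox` package is installed and that `from_batch_process_operation()` takes `location` and `operation_name` and returns a list of wrapped documents (check this against the version you have). The helper names `load_wrapped_documents` and `text_by_input_uri` are hypothetical, and the operation name in the usage comment is a placeholder.

```python
def load_wrapped_documents(location, operation_name):
    """Load the wrapped documents produced by one batch operation."""
    # Imported here so the helper below can be used without the package.
    from google.cloud.documentai_toolbox import document

    return document.Document.from_batch_process_operation(
        location=location,
        operation_name=operation_name,
    )


def text_by_input_uri(wrapped_documents):
    """Key each document's OCR text by the original input file path."""
    return {doc.gcs_input_uri: doc.text for doc in wrapped_documents}


# Example usage (placeholder values, requires GCP credentials):
# docs = load_wrapped_documents(
#     "us", "projects/PROJECT_ID/locations/us/operations/OPERATION_ID"
# )
# for uri, text in text_by_input_uri(docs).items():
#     print(uri, len(text))
```

Keying by `gcs_input_uri` sidesteps the renamed .json files entirely, since the original path travels with each wrapped document.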