I have dumped JSON files from DocAI to Cloud Storage, but each file is stored in an individual folder, even though they are all in the same bucket. I am not able to iterate through the JSON files stored in the blobs.
I followed the 'Batch Processing' documentation on GCP, which generates the output, but the JSON files are created in separate folders. I have trained a Custom Document Extractor, which I am validating on 50 documents using the code below. My aim is to read all of these JSON files and create a single dataframe of the output by appending them.
# Reading GCS files with gcsfs
import gcsfs
import json
import pandas as pd

gcs_file_system = gcsfs.GCSFileSystem(project="project-name")
gcs_json_path = "file0.json"  # path to one exported JSON file

# Load a single JSON output and flatten the extracted entities into a dataframe
with gcs_file_system.open(gcs_json_path) as f:
    json_dict = json.load(f)

djson = pd.json_normalize(json_dict, record_path=['entities'])
djson
I am using this code to read the JSON files by providing the GCS path, but I can only read one file at a time.
It might be easier for you to use the Document AI Toolbox Python SDK. There are built-in functions for getting Document JSON files from Google Cloud Storage after batch processing, as well as functions for exporting entities to a dictionary, which can then be converted to a DataFrame.
There is a from_gcs() method, but it can currently only create a single wrapped Document from a single document's output in GCS. You can try something like this:
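A minimal sketch, assuming the batch output lives under gs://bucket-name/output-prefix/ with one subfolder per input document (both names are placeholders), and that gcs_utilities.list_gcs_document_tree() returns a mapping of each output folder to the files it contains:

import pandas as pd
from google.cloud.documentai_toolbox import document, gcs_utilities

gcs_bucket_name = "bucket-name"   # placeholder: bucket that holds the batch output
gcs_prefix = "output-prefix/"     # placeholder: prefix of the batch output folder

# Map each per-document output folder to the JSON shards inside it
output_tree = gcs_utilities.list_gcs_document_tree(
    gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix
)

frames = []
for folder, files in output_tree.items():
    # Wrap the (possibly sharded) output of a single document
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=folder
    )
    # Export the extracted entities as a dictionary and flatten it into one row
    frames.append(pd.json_normalize(wrapped_document.entities_to_dict()))

df = pd.concat(frames, ignore_index=True)
df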
Alternatively, you can use the Batch Process Operation ID to get documents from a single batch.
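For example, a sketch assuming you still have the operation name returned by the batch request (the "us" location and the operation path below are placeholders):

import pandas as pd
from google.cloud.documentai_toolbox import document

# Placeholder: full operation name returned by the batch_process_documents request
operation_name = "projects/PROJECT_NUMBER/locations/us/operations/OPERATION_ID"

# Returns one wrapped Document per input document in the batch
wrapped_documents = document.Document.from_batch_process_operation(
    location="us", operation_name=operation_name
)

df = pd.concat(
    [pd.json_normalize(doc.entities_to_dict()) for doc in wrapped_documents],
    ignore_index=True,
)
df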