I have dumped JSON files from Document AI (DocAI) to GCP, but each file is stored in its own folder, even though they are all in the same Cloud Storage bucket. I am not able to iterate through the JSON files stored in the blobs.

I followed the 'Batch Processing' documentation on GCP, which generates the output, but the JSON files end up in separate folders. I have trained a Custom Document Extractor, which I am validating on 50 documents using the code below. My aim is to read all of these JSON files and build a single dataframe by appending their output.

# Reading gcs files with gcsfs
import gcsfs
import json
import pandas as pd

gcs_file_system = gcsfs.GCSFileSystem(project="project-name")
gcs_json_path = "file0.json" 
with gcs_file_system.open(gcs_json_path) as f:
    json_dict = json.load(f)
    
    
djson = pd.json_normalize(json_dict, record_path=['entities'])
djson

I am using this code to read the JSON files by providing the GCS path, but I can only read one file at a time.

1 Answer
It might be easier for you to use the Document AI Toolbox Python SDK.

It has built-in functions for getting Document JSON files from Google Cloud Storage after batch processing, and for exporting entities to a dictionary that can be converted to a DataFrame.

There is a from_gcs() method, but it can currently only create a single Wrapped Document from a single document's output in GCS.
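For a single output folder, a minimal call would look roughly like the following (the bucket name and folder prefix here are placeholders):

from google.cloud.documentai_toolbox import document

# Placeholder: one processed document's output folder inside the bucket
single_doc = document.Document.from_gcs(
    gcs_bucket_name="bucket", gcs_prefix="path/to/folder/0/"
)
print(single_doc.entities_to_dict())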

You can try something like this:

from google.cloud.documentai_toolbox import document
from google.cloud.documentai_toolbox import gcs_utilities
import pandas as pd

# Given Document JSON files under gs://bucket/path/to/folder
gcs_bucket_name = "bucket"
gcs_prefix = "path/to/folder"

# Each "Wrapped Document" corresponds to a single input file.
wrapped_documents = [
    document.Document.from_gcs(
        gcs_bucket_name=gcs_bucket_name, gcs_prefix=directory
    )
    for directory, files in gcs_utilities.list_gcs_document_tree(
        gcs_bucket_name, gcs_prefix
    ).items()
    if files != [""]
]

# Entities from all Documents
all_entities = [wd.entities_to_dict() for wd in wrapped_documents]

df = pd.DataFrame.from_records(all_entities)

Alternatively, you can use the Batch Process Operation ID to get documents from a single batch.

from google.cloud import documentai
from google.cloud.documentai_toolbox import document
import pandas as pd

# ... Batch Processing Code
operation = client.batch_process_documents(request)
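# The batch output only exists in Cloud Storage once the operation has
# completed, so if this runs straight after the request, wait for it to
# finish first (the timeout below is just an illustrative value).
operation.result(timeout=300)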

# Format: `projects/PROJECT_ID/locations/LOCATION/operations/OPERATION_ID`
operation_name = operation.operation.name

# Each "Wrapped Document" corresponds to a single input file.
wrapped_documents = document.Document.from_batch_process_operation(operation_name)

# Entities from all Documents
all_entities = [wd.entities_to_dict() for wd in wrapped_documents]

df = pd.DataFrame.from_records(all_entities)
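
If you would rather keep the gcsfs approach from your question, a rough sketch (the bucket name and output prefix are placeholders) is to list every JSON shard under the batch output folder and concatenate the normalized frames:

import gcsfs
import json
import pandas as pd

gcs_file_system = gcsfs.GCSFileSystem(project="project-name")

# Placeholder bucket/prefix; find() lists files recursively, so the
# per-document subfolders are included
json_paths = [
    path
    for path in gcs_file_system.find("bucket/path/to/folder")
    if path.endswith(".json")
]

frames = []
for path in json_paths:
    with gcs_file_system.open(path) as f:
        json_dict = json.load(f)
    # Skip shards that contain no extracted entities
    if "entities" in json_dict:
        frames.append(pd.json_normalize(json_dict, record_path=["entities"]))

df = pd.concat(frames, ignore_index=True)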