How do you scale Google Cloud Document AI processing?

From https://cloud.google.com/document-ai/docs/process-forms, I can see some examples of processing single files. But in most cases, companies have buckets full of documents. How do you scale Document AI processing in that case? Do you use Document AI in conjunction with Spark, or is there another way?
612 views · Asked by Kevin Eid

There are 2 solutions below.

Answer from Holt Skinner:
You will need to use Batch Processing to handle multiple documents at once with Document AI.
This page in the Cloud Documentation shows how to make Batch Processing requests with REST and the Client Libraries.
https://cloud.google.com/document-ai/docs/send-request#batch-process
This codelab also illustrates how to do this in Python with the OCR Processor. https://codelabs.developers.google.com/codelabs/docai-ocr-python
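For reference, a batch request in Python looks roughly like the sketch below, based on the client library shown in the linked docs and codelab. The project, location, processor ID, and bucket URIs are placeholders you must fill in:

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# Placeholders -- substitute your own values.
PROJECT_ID = "your-project"
LOCATION = "us"  # or "eu"
PROCESSOR_ID = "your-processor-id"
GCS_INPUT_PREFIX = "gs://your-bucket/input/"
GCS_OUTPUT_URI = "gs://your-bucket/output/"

client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(
        api_endpoint=f"{LOCATION}-documentai.googleapis.com"
    )
)

request = documentai.BatchProcessRequest(
    name=client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID),
    # Process every file under the prefix instead of a single inline document.
    input_documents=documentai.BatchDocumentsInputConfig(
        gcs_prefix=documentai.GcsPrefix(gcs_uri_prefix=GCS_INPUT_PREFIX)
    ),
    # Results are written back to Cloud Storage as Document JSON files.
    document_output_config=documentai.DocumentOutputConfig(
        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
            gcs_uri=GCS_OUTPUT_URI
        )
    ),
)

# batch_process_documents returns a long-running operation; result() blocks
# until the whole batch finishes (or the timeout expires).
operation = client.batch_process_documents(request=request)
operation.result(timeout=1800)
```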
Second answer:

I could only find the following: batch_process_documents processes many documents and returns an async operation whose output is saved to Cloud Storage. From there, I think we can parametrize the job with an input bucket prefix and distribute it over several machines.

All of that could be orchestrated via Airflow, for example, as in the sketch below.
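One possible shape for that parametrized job is sketched below. This is only an illustration: list_document_uris and shard_into_requests are made-up helper names, the PDF filter is an assumption about the bucket contents, and the chunk size of 50 is a guess, since the per-request document limit varies by processor:

```python
from google.cloud import documentai, storage

def list_document_uris(bucket_name: str, prefix: str) -> list[str]:
    """List the PDFs under a bucket prefix (the .pdf filter is illustrative)."""
    client = storage.Client()
    return [
        f"gs://{bucket_name}/{blob.name}"
        for blob in client.list_blobs(bucket_name, prefix=prefix)
        if blob.name.lower().endswith(".pdf")
    ]

def shard_into_requests(uris: list[str], chunk_size: int = 50):
    """Yield one BatchDocumentsInputConfig per chunk of files.

    Batch requests cap the number of input documents per call (the exact
    limit depends on the processor), so a large bucket must be split into
    several requests.
    """
    for i in range(0, len(uris), chunk_size):
        docs = [
            documentai.GcsDocument(gcs_uri=uri, mime_type="application/pdf")
            for uri in uris[i : i + chunk_size]
        ]
        yield documentai.BatchDocumentsInputConfig(
            gcs_documents=documentai.GcsDocuments(documents=docs)
        )
```

Each yielded input config could then be passed to batch_process_documents from its own Airflow task, so the shards run in parallel across workers without needing Spark at all.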