I am trying out Google Document AI with the standard Form Parser. I processed a 60-page PDF file, and the OCR result returned entities
for the first few pages only; the remaining pages do not include entities
in the response. I couldn't find any documentation about this field. Is it an optional field?
Is there any way to force this field to be present in every response?
If not, is there any way to tell which kinds of pages are likely to have this field in the response and which are not, i.e. what a page should look like for entity
detection to run?
from google.cloud import documentai

# input_file and output_file are Cloud Storage URIs defined elsewhere
gcs_docs = [
    documentai.GcsDocument(
        gcs_uri=input_file,
        mime_type='application/pdf'
    )
]
gcs_documents = documentai.GcsDocuments(documents=gcs_docs)
input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)
gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
    gcs_uri=output_file,
    # Only these fields are written to the output JSON
    field_mask="text,entities,pages.pageNumber,pages.formFields",
    # One output shard per page, no overlap between shards
    sharding_config={"pages_per_shard": 1, "pages_overlap": 0}
)
According to this article, the Form Parser can detect 11 generic entities. Are the files images that were converted to PDF? The API might have trouble detecting some entities because of image quality, for example.
Can you try different processor versions to see which one is stable for your use case? See version management here. (If the problem persists, I would recommend filing a support case so that entity detection in the processor can be improved.)
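To narrow down which pages are affected, it may help to scan the output shards and list the pages that came back without entities, then compare those pages against the ones that did. The following is only a minimal sketch, not an official recipe. It assumes each shard has already been downloaded and parsed from JSON into a dict, and that the keys are the camelCase names `entities`, `pages` and `pageNumber` matching the field mask in the question. The function name `pages_without_entities` is made up for illustration.

```python
from typing import Iterable, List


def pages_without_entities(shards: Iterable[dict]) -> List[int]:
    """Return the page numbers of all shards that contain no entities.

    Assumption: each shard is one output Document parsed from JSON, with
    camelCase keys ("entities", "pages", "pageNumber").
    """
    missing = []
    for shard in shards:
        # A missing or empty "entities" list both count as "no entities".
        if not shard.get("entities"):
            for page in shard.get("pages", []):
                if "pageNumber" in page:
                    missing.append(page["pageNumber"])
    return sorted(missing)


# Hand-written example shards (not real API output), one page per shard.
example_shards = [
    {"pages": [{"pageNumber": 1}], "entities": [{"type": "email"}]},
    {"pages": [{"pageNumber": 2}]},
    {"pages": [{"pageNumber": 3}], "entities": []},
]
print(pages_without_entities(example_shards))
```

With one page per shard, as in the sharding config from the question, each shard maps to exactly one page, so the result is a per-page list you can check against the source PDF (scanned vs. digital pages, image quality, and so on).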