I'm trying to use the Document AI forms processor to extract the rows of a table. When I upload a document, the processor does not return each line separately; it combines multiple lines into a single "row" in the table data. The table is uploaded as a one-page PDF. See the screenshot of how the table is being processed: the parsed rows are on the left and the table outlines are on the right. The first "row" of the parsed table includes rows 1, 2, and 3. Rows 4 and 5 are correct, but then rows 6 and 7 are combined. This does not happen on all tables, but it happens often enough that I can't count on getting all the rows.
I've tried to train a custom processor to get the data from the table, but since the data is sparsely populated and the number of columns can vary, that does not work well.
Is there any way to uptrain the forms processor to correctly get each row separately, or a way to train a custom processor to read tables the way the forms processor does? Any other suggestions for getting the rows of this table properly would be welcome.
Which version of the Form Parser are you using?
I'd recommend trying the 2.0 versions, pretrained-form-parser-v2.0-2022-11-10 and pretrained-form-parser-v2.1-2023-06-26, to see if the quality improves. Version pretrained-form-parser-v2.1-2023-06-26 supports digital PDF text as well.
You can't uptrain the Form Parser, but you can create a Custom Document Extractor, which has improved performance and easier labeling with generative AI models.
Tutorial: https://cloud.google.com/document-ai/docs/workbench/cde-with-genai
There's also Quick Tables labeling which allows for quicker labeling of tabular data.
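To compare how different processor versions split the rows, it can help to dump each table row by row from the returned Document. The following is only a rough sketch, not an official recipe: it assumes the Document is available in its JSON form (camelCase keys, as in the REST response or batch output), and the helper names, the sample data, and the project and processor IDs are made up for illustration. The version resource path it builds is what you would pass as the name of the process request to pin a specific version.

```python
def processor_version_name(project_id, location, processor_id, version_id):
    """Build the resource name that pins a request to one processor version."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


def anchor_text(full_text, layout):
    """Resolve a layout's text anchor against the document text."""
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    parts = []
    for seg in segments:
        # In the JSON form the indices are strings, and startIndex is
        # omitted when it is 0.
        start = int(seg.get("startIndex", 0))
        end = int(seg["endIndex"])
        parts.append(full_text[start:end])
    return "".join(parts).strip()


def table_rows(document):
    """Return one list of rows per table; each row is a list of cell strings."""
    text = document.get("text", "")
    tables = []
    for page in document.get("pages", []):
        for table in page.get("tables", []):
            rows = []
            for row in table.get("headerRows", []) + table.get("bodyRows", []):
                rows.append(
                    [anchor_text(text, cell.get("layout", {}))
                     for cell in row.get("cells", [])]
                )
            tables.append(rows)
    return tables


def _cell(start, end):
    """Hypothetical helper that builds a cell for the sample data below."""
    segment = {"endIndex": str(end)}
    if start:
        segment["startIndex"] = str(start)
    return {"layout": {"textAnchor": {"textSegments": [segment]}}}


# Made-up sample standing in for a parsed one-page document.
sample = {
    "text": "Item Qty\nBolt 4\nNut 7\n",
    "pages": [{
        "tables": [{
            "headerRows": [{"cells": [_cell(0, 4), _cell(5, 8)]}],
            "bodyRows": [
                {"cells": [_cell(9, 13), _cell(14, 15)]},
                {"cells": [_cell(16, 19), _cell(20, 21)]},
            ],
        }]
    }],
}

# Placeholder project and processor IDs; substitute your own.
version = processor_version_name(
    "my-project", "us", "my-processor",
    "pretrained-form-parser-v2.1-2023-06-26",
)

for table in table_rows(sample):
    for row in table:
        print(row)
```

Running the same document through each version and comparing the row counts from table_rows should make it easy to see whether the merged rows go away.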