I’ve noticed that it’s possible to upload multi-page files to Document AI, so that all pages are linked to each other by being associated with the same file.
My use case is invoice files that I would like to extract data from, using a custom extractor.
Most of the invoices are a single page, but some span 2 pages. In those cases the second page is usually sparser than the first and does not contain most of the information.
My question is: will a trained model's performance differ between the following file upload mechanisms?
- Uploading each page as a separate file, even when an invoice spans multiple pages (I split it during preprocessing)
- Uploading each file as-is, without splitting it into pages
I assume that option 2 (uploading whole files) will perform at least as well as option 1 (separate pages). What I mainly want to know is whether it makes a difference at all, since uploading pages separately has its own advantages for us (our use case is a bit more complicated; I simplified it for this explanation).
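
To make the two options concrete, here is a rough sketch of what ends up being uploaded in each case. It does not call Document AI at all. The `UploadUnit` class, the two `plan_*` functions, and the invoice IDs are hypothetical names I made up for illustration, and the input is simply a mapping from invoice ID to page count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadUnit:
    """One file as it would be uploaded (hypothetical structure)."""
    invoice_id: str
    pages: tuple[int, ...]  # 1-based page numbers contained in this file


def plan_per_page(invoices: dict[str, int]) -> list[UploadUnit]:
    """Option 1: every page becomes its own single-page file."""
    return [
        UploadUnit(invoice_id, (page,))
        for invoice_id, page_count in invoices.items()
        for page in range(1, page_count + 1)
    ]


def plan_per_invoice(invoices: dict[str, int]) -> list[UploadUnit]:
    """Option 2: each invoice stays one file, however many pages it has."""
    return [
        UploadUnit(invoice_id, tuple(range(1, page_count + 1)))
        for invoice_id, page_count in invoices.items()
    ]


# Made-up example: one 1-page invoice and one 2-page invoice.
invoices = {"inv-001": 1, "inv-002": 2}

for unit in plan_per_page(invoices):
    print("option 1:", unit)
for unit in plan_per_invoice(invoices):
    print("option 2:", unit)
```

With option 1, the second page of `inv-002` is uploaded as a file of its own with no link to the first page, which is exactly the case I am unsure about.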