Document Understanding extracts data from every page of a PDF in UiPath


I am using Document Understanding in UiPath to extract data from multiple PDFs. Each PDF file contains multiple copies of the same page, which I cannot remove. The problems are:

1.) The Regex Extractor extracts data from every page of the PDF file. I only want the data from the first page.

2.) It also extracts irrelevant data that appears below the required data.

I cannot remove the duplicate pages from the PDF file, so I cannot use the ML Extractor, as it has a limit of 2 pages and 4 MB. Currently I am using the Form Extractor and the Regex Extractor, and both of them extract data from every page of the PDF file.

For some fields, irrelevant data is extracted along with the required value (this happens only with the Regex Extractor). How can I solve these two problems?

Any help is appreciated!


There is 1 best solution below


I'd recommend using the Intelligent Form Extractor, but note that it has limitations on a Community License. The workflow would follow the structure below.

  1. Load Taxonomy (where you configure the relevant fields to extract)
  2. Digitize Document - Use an OCR engine such as OmniPage OCR or Microsoft OCR
  3. Classify Document Scope - Assign a Keyword Based Classifier and configure it
  4. Data Extraction Scope - Use the Intelligent Form Extractor. You can set up the templates and use Elements, Selected Area or Anchors to define where the data should be extracted from. You will need to get an API Key from your Orchestrator tenant (see Licenses)
  5. (Optional Step) Validation Station - You can add the Validation Station, which requests validation from a human when the extraction confidence doesn't meet the requirements. You can either use the local version or utilise 'Create Document Validation Action', which will create an Action on Orchestrator. (Please note - the Create Action activity needs to be placed in 'Main.xaml', as it is a persistent activity)
  6. Export Extraction Results

You might want to split your PDF before the Digitization step so that you are only looking at page 1. You can always merge the pages back afterwards if required.
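
To illustrate the first-page-only idea, and a tighter pattern for the "irrelevant data" problem, here is a rough sketch in Python rather than a UiPath workflow. Everything in it is an assumption made for illustration: the sample text, the field labels ("Invoice No", "Total"), and the helper name `extract_first_page` are hypothetical, and the Regex Extractor uses .NET regular expressions, so treat the patterns as a starting point, not a drop-in configuration.

```python
import re

# Hypothetical sample: one string per page, the way a digitizer might return
# the text of a PDF whose pages are copies of each other.
pages = [
    "Invoice No: INV-1042\nTotal: 250.00\nThank you for your business",
    "Invoice No: INV-1042\nTotal: 250.00\nThank you for your business",
]

# Each capture group only accepts the characters the value can contain, so it
# stops before any text that follows the value on the page.
INVOICE_RE = re.compile(r"Invoice No:\s*([A-Z0-9-]+)")
TOTAL_RE = re.compile(r"Total:\s*(\d+(?:\.\d{2})?)")


def extract_first_page(page_texts):
    """Return the fields found on the first page only (None when missing)."""
    first = page_texts[0] if page_texts else ""
    fields = {}
    for name, pattern in (("invoice_no", INVOICE_RE), ("total", TOTAL_RE)):
        match = pattern.search(first)  # search() returns the first match only
        fields[name] = match.group(1) if match else None
    return fields


result = extract_first_page(pages)
print(result)
```

The two points to carry over are that only the first page's text is ever searched, and that a capture group restricted to the expected characters (instead of something greedy like `.*`) avoids pulling in the lines below the value.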