I want to understand how to train a document classifier from code using the Document AI API, but I haven't found relevant information in the documentation or the code samples. I have defined an Invoice OCR processor, but I am unsure how to specify my training and test sets.
Clients of our application need to be able to train the processor on their own, just as they could in Google's own interface, so I am looking for a workaround to achieve this.
So far I have only one idea: first send the document for processing, then download the resulting JSON file from Google Cloud Storage. In that JSON file I can probably change the values and coordinates of the fields to the ones I need, and then feed the corrected files into processor training.
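To make the idea concrete, here is a minimal sketch of the JSON-editing step, working on the decoded array rather than on the protobuf classes. The function name `applyEntityCorrections`, the shape of `$corrections`, the sample JSON, and the choice to set the confidence to 1.0 are all my own assumptions; the `entities`, `type`, `mentionText`, and `confidence` keys are the ones I see in the processor output.

```php
<?php
declare(strict_types=1);

/**
 * Overwrite the mention text of entities in a decoded Document AI JSON array.
 * $corrections maps an entity type to the corrected text (hypothetical shape).
 */
function applyEntityCorrections(array $document, array $corrections): array
{
    foreach ($document['entities'] ?? [] as $i => $entity) {
        $type = $entity['type'] ?? null;
        if ($type !== null && array_key_exists($type, $corrections)) {
            $document['entities'][$i]['mentionText'] = $corrections[$type];
            // Assumption: a manually corrected value is treated as fully trusted.
            $document['entities'][$i]['confidence'] = 1.0;
        }
    }

    return $document;
}

// Hypothetical sample resembling a processed invoice with one misread field.
$json = '{"text":"Total 100","entities":[{"type":"total_amount","mentionText":"10O","confidence":0.62}]}';
$document = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

$corrected = applyEntityCorrections($document, ['total_amount' => '100']);

// The corrected JSON would then be uploaded back to the bucket used for training.
echo json_encode($corrected, JSON_THROW_ON_ERROR), PHP_EOL;
```

Entities whose type is not listed in `$corrections` are left untouched, so a client would only need to submit the fields they actually changed.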
Here is my approximate code so far: `
putenv('GOOGLE_APPLICATION_CREDENTIALS='.$this->parameterBag->get('gmail_private_key'));
$client = new DocumentProcessorServiceClient();
$name = $client::processorVersionName(self::PROJECT_ID, self::LOCATION, self::PROCESSOR_ID, self::PROCESSOR_VERSION);
$storageClient = new StorageClient();
$outputBlobs = $storageClient->bucket(self::BUCKET_NAME)->objects(['prefix' => $prefix]);
$document = new Document();
/** @var StorageObject $blob */
foreach ($outputBlobs as $blob) {
// Document AI should only output JSON files to GCS
if ($blob->info()['contentType'] !== "application/json") {
continue;
}
// Download once as a string; mergeFromJsonString() needs a string, not a stream
$jsonText = $blob->downloadAsString();
$fields = json_decode($jsonText, true, 512, JSON_THROW_ON_ERROR);
$document->mergeFromJsonString($jsonText);
/** @var Document\Entity $entity */
foreach ($document->getEntities() as $entity) {
$entity->setType('something');
$entity->setConfidence(0.9);
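// These setters expect message objects rather than plain strings
$entity->setPageAnchor(new Document\PageAnchor());
$entity->setNormalizedValue((new Document\Entity\NormalizedValue())->setText('something'));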
}
}
$getProcessorVersionRequest = new GetProcessorVersionRequest();
$getProcessorVersionRequest->setName($name);
$processorVersion = $client->getProcessorVersion($getProcessorVersionRequest);
$trainProcessorVersionRequest = new TrainProcessorVersionRequest();
$trainProcessorVersionRequest->setProcessorVersion($processorVersion);
$gcsDocument = new GcsDocument();
$gcsDocument->setGcsUri(self::GCS_URI);
$gcsDocument->setMimeType('application/json');
$gcsDocuments = new GcsDocuments();
$gcsDocuments->setDocuments([$gcsDocument]);
$batchDocumentsInputConfig = new BatchDocumentsInputConfig();
$batchDocumentsInputConfig->setGcsDocuments($gcsDocuments);
$inputData = new InputData();
$inputData->setTrainingDocuments($batchDocumentsInputConfig);
$trainProcessorVersionRequest->setInputData($inputData);
$client->trainProcessorVersion($trainProcessorVersionRequest);`
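For the training/test question, this is a rough sketch of how I would split the corrected files before building the request. The helper `splitTrainTest`, the 20% default, and the `gs://example-bucket/...` URIs are hypothetical. As far as I can tell, `InputData` has a `setTestDocuments()` setter next to `setTrainingDocuments()`, so each list would be wrapped in its own `BatchDocumentsInputConfig`, but I have not confirmed that this is the intended way to do it.

```php
<?php
declare(strict_types=1);

/**
 * Split labeled document URIs into training and test sets.
 * The set is chosen from a hash of the URI, so a given document
 * always lands in the same set across runs.
 */
function splitTrainTest(array $gcsUris, int $testPercent = 20): array
{
    $split = ['training' => [], 'test' => []];
    foreach ($gcsUris as $uri) {
        // Map the URI to a stable number in the range 0..99.
        $slot = hexdec(substr(md5($uri), 0, 6)) % 100;
        $split[$slot < $testPercent ? 'test' : 'training'][] = $uri;
    }

    return $split;
}

// Hypothetical URIs of corrected JSON documents.
$uris = [
    'gs://example-bucket/labeled/invoice-1.json',
    'gs://example-bucket/labeled/invoice-2.json',
    'gs://example-bucket/labeled/invoice-3.json',
    'gs://example-bucket/labeled/invoice-4.json',
];

$split = splitTrainTest($uris);

// Assumption: $split['training'] would feed setTrainingDocuments()
// and $split['test'] would feed setTestDocuments() on InputData.
printf("training: %d, test: %d\n", count($split['training']), count($split['test']));
```

Hashing instead of shuffling means that re-running training after a client adds more documents does not move earlier documents between the two sets.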