Train a custom classifier on Document AI via code


I want to understand how to train a document classifier in code using the Document AI API, but I haven't found relevant information in the documentation or the code samples. I have defined an Invoice OCR processor, but I am unsure how to specify my training and test sets.

I need clients of our application to be able to train the processor themselves, just as they could in Google's own interface, so I am looking for workarounds to achieve this.

So far I have only one idea: first send the document for processing, then download the resulting JSON file from Google Cloud Storage. In that JSON file I can probably change the field values and coordinates to the ones I need and then feed the edited file into processor training.
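To make the idea concrete, here is a minimal sketch of the JSON-editing step on its own, using plain PHP arrays and no Google client. The function `relabelEntities` and the shape of `$corrections` are hypothetical names of mine, and the field names (`entities`, `mentionText`, `pageAnchor`, `pageRefs`, `boundingPoly`, `normalizedVertices`) are what I understand the exported Document JSON to contain. I have not confirmed that Document AI accepts files edited this way as training input.

```php
<?php
declare(strict_types=1);

/**
 * Overwrite selected entity fields in a decoded Document AI JSON document.
 *
 * $corrections (hypothetical structure) maps an entity index to the fields
 * to overwrite: 'type', 'mentionText', and 'box' as [x1, y1, x2, y2] in
 * normalized (0..1) page coordinates, plus an optional 'page' index.
 */
function relabelEntities(array $document, array $corrections): array
{
    foreach ($corrections as $index => $fix) {
        if (!isset($document['entities'][$index])) {
            continue; // ignore corrections that point at a missing entity
        }
        $entity = $document['entities'][$index];

        if (isset($fix['type'])) {
            $entity['type'] = $fix['type'];
        }
        if (isset($fix['mentionText'])) {
            $entity['mentionText'] = $fix['mentionText'];
        }
        if (isset($fix['box'])) {
            [$x1, $y1, $x2, $y2] = $fix['box'];
            // Rebuild the page anchor as a rectangle, clockwise from top-left.
            $entity['pageAnchor'] = ['pageRefs' => [[
                'page' => $fix['page'] ?? 0,
                'boundingPoly' => ['normalizedVertices' => [
                    ['x' => $x1, 'y' => $y1],
                    ['x' => $x2, 'y' => $y1],
                    ['x' => $x2, 'y' => $y2],
                    ['x' => $x1, 'y' => $y2],
                ]],
            ]]];
        }
        $document['entities'][$index] = $entity;
    }

    return $document;
}

// Made-up input standing in for a file downloaded from the output bucket.
$json = '{"text":"Invoice 42","entities":[{"type":"total","mentionText":"42","confidence":0.61}]}';
$document = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

$labeled = relabelEntities($document, [
    0 => ['type' => 'invoice_id', 'mentionText' => '42', 'box' => [0.1, 0.2, 0.3, 0.25]],
]);

echo json_encode($labeled, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR), PHP_EOL;
```

Working on the decoded array keeps every field I do not touch (such as `text` and `confidence`) exactly as the processor wrote it, so the edited file can be re-encoded and uploaded back to the bucket.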

Here is my approximate code for the whole flow: `

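// Assumed imports at the top of the file (from google/cloud-document-ai and google/cloud-storage):
// use Google\Cloud\DocumentAI\V1\Client\DocumentProcessorServiceClient;
// use Google\Cloud\DocumentAI\V1\{BatchDocumentsInputConfig, Document, GcsDocument, GcsDocuments};
// use Google\Cloud\DocumentAI\V1\{GetProcessorVersionRequest, TrainProcessorVersionRequest};
// use Google\Cloud\DocumentAI\V1\TrainProcessorVersionRequest\InputData;
// use Google\Cloud\Storage\{StorageClient, StorageObject};

putenv('GOOGLE_APPLICATION_CREDENTIALS='.$this->parameterBag->get('gmail_private_key'));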
$client = new DocumentProcessorServiceClient();

$name = $client::processorVersionName(self::PROJECT_ID, self::LOCATION, self::PROCESSOR_ID, self::PROCESSOR_VERSION);

$storageClient = new StorageClient();

$outputBlobs = $storageClient->bucket(self::BUCKET_NAME)->objects(['prefix' => $prefix]);

$document = new Document();

/** @var StorageObject $blob */
foreach ($outputBlobs as $blob) {
    // Document AI should only write JSON files to GCS
    if ($blob->info()['contentType'] !== "application/json") {
        continue;
    }

    // Download once as a string: mergeFromJsonString() needs a string, not a stream.
    $jsonText = $blob->downloadAsString();
    $fields = json_decode($jsonText, true, 512, JSON_THROW_ON_ERROR);
    $document->mergeFromJsonString($jsonText);


    /** @var Document\Entity $entity */
    foreach ($document->getEntities() as $entity) {
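        $entity->setType('something');
        $entity->setConfidence(0.9);
        // These two setters take message objects, not strings.
        $entity->setPageAnchor(new Document\PageAnchor());
        $entity->setNormalizedValue((new Document\Entity\NormalizedValue())->setText('something'));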
    }
}

$getProcessorVersionRequest = new GetProcessorVersionRequest();
$getProcessorVersionRequest->setName($name);
$processorVersion = $client->getProcessorVersion($getProcessorVersionRequest);

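$trainProcessorVersionRequest = new TrainProcessorVersionRequest();
// The request also needs the parent processor that will own the trained version.
$trainProcessorVersionRequest->setParent($client::processorName(self::PROJECT_ID, self::LOCATION, self::PROCESSOR_ID));
$trainProcessorVersionRequest->setProcessorVersion($processorVersion);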

$gcsDocument = new GcsDocument();
$gcsDocument->setGcsUri(self::GCS_URI);
$gcsDocument->setMimeType('application/json');

$gcsDocuments = new GcsDocuments();
$gcsDocuments->setDocuments([$gcsDocument]);

$batchDocumentsInputConfig = new BatchDocumentsInputConfig();
$batchDocumentsInputConfig->setGcsDocuments($gcsDocuments);

$inputData = new InputData();
$inputData->setTrainingDocuments($batchDocumentsInputConfig);

$trainProcessorVersionRequest->setInputData($inputData);

$client->trainProcessorVersion($trainProcessorVersionRequest);`
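On the training/test question, the `InputData` message has a test-documents field next to the training one, so one option would be to build two `BatchDocumentsInputConfig` objects from two lists of URIs. Below is a minimal sketch of only the splitting step, in plain PHP. The function `splitTrainTest`, the 20% default ratio, and the example URIs are my own assumptions, not anything the API prescribes.

```php
<?php
declare(strict_types=1);

/**
 * Split labeled-document URIs into training and test sets.
 *
 * The bucket for each URI is derived from a hash of the URI itself, so a
 * given document always lands in the same set across runs.
 */
function splitTrainTest(array $gcsUris, float $testRatio = 0.2): array
{
    $train = [];
    $test = [];

    foreach ($gcsUris as $uri) {
        // Map the first 6 hex digits of the md5 hash to a value in [0, 0.99].
        $bucket = (hexdec(substr(md5($uri), 0, 6)) % 100) / 100;

        if ($bucket < $testRatio) {
            $test[] = $uri;
        } else {
            $train[] = $uri;
        }
    }

    return ['train' => $train, 'test' => $test];
}

// Made-up URIs standing in for the edited JSON files in the bucket.
$uris = [];
for ($i = 1; $i <= 20; $i++) {
    $uris[] = sprintf('gs://example-bucket/labeled/invoice-%03d.json', $i);
}

$split = splitTrainTest($uris);

printf("train: %d, test: %d\n", count($split['train']), count($split['test']));
```

Each list could then be turned into `GcsDocument` objects the same way as in the code above, with one config passed to `setTrainingDocuments()` and the other to the corresponding test-documents setter.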