AnalyzeInvoices function in SynapseML


I am trying to understand the AnalyzeInvoices function in SynapseML, and I have a few questions. What is the difference between setImageUrlCol("source") and setImageBytesCol("data"), and when should I use one over the other? What does "source" mean here? I am trying to scan a set of invoice .jpeg files and want to flatten the data. What should the output look like?

analyzeInvoices = (AnalyzeInvoices()
        .setSubscriptionKey(cognitiveKey)
        .setLocation("eastus")
        .setImageUrlCol("source")
        .setOutputCol("invoices")
        .setConcurrency(5))

(analyzeInvoices
        .transform(imageDf)
        .withColumn("documents", explode(col("invoices.analyzeResult.documentResults.fields")))
        .select("source", "documents")).show()

There is 1 best solution below


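setImageBytesCol("data") - Tells the transformer which DataFrame column holds the raw bytes of each image (a binary column). The bytes are uploaded with the request, so, as far as I can tell, you do not need to convert the JPG to base64 yourself. Use this when the files are not reachable by URL from the service.

setImageUrlCol("source") - Tells the transformer which DataFrame column holds the URL of each image. Here "source" is just the name of a column in imageDf; every row of that column must contain a URL that the service can reach, and the service downloads the image itself.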
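The two input styles can be sketched roughly as below. This is only an illustration, not code from the SynapseML docs: the helper names (url_rows, bytes_rows, build_analyzers, build_inputs) and the "path" column are made up for the example, and the import path is an assumption that depends on your SynapseML release. The setters are the ones from the question.

```python
from pathlib import Path


def url_rows(urls):
    """One row per invoice; the only field is a URL the service can download."""
    return [(url,) for url in urls]


def bytes_rows(paths):
    """One row per invoice; fields are the file path and the raw file contents."""
    # bytearray values become a binary column when the DataFrame is created.
    return [(str(p), bytearray(Path(p).read_bytes())) for p in paths]


def build_analyzers(cognitive_key, location="eastus"):
    """Configure one transformer per input style (needs SynapseML installed)."""
    # Assumption: this import path matches your SynapseML release; other
    # releases may expose the class under a different module.
    from synapse.ml.cognitive import AnalyzeInvoices

    def base():
        return (AnalyzeInvoices()
                .setSubscriptionKey(cognitive_key)
                .setLocation(location)
                .setOutputCol("invoices")
                .setConcurrency(5))

    # Set exactly one of the two input columns on each transformer.
    return base().setImageUrlCol("source"), base().setImageBytesCol("data")


def build_inputs(spark, urls, paths):
    """Create the two input DataFrames from an existing SparkSession."""
    url_df = spark.createDataFrame(url_rows(urls), ["source"])
    bytes_df = spark.createDataFrame(bytes_rows(paths), ["path", "data"])
    return url_df, bytes_df
```

With a SparkSession named spark, something like by_url, by_bytes = build_analyzers(cognitiveKey) followed by by_url.transform(url_df) or by_bytes.transform(bytes_df) would then run the same analysis on either kind of input.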

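Set only one of the two on a given transformer. If the JPEG files are already reachable by URL, put those URLs in a string column and use setImageUrlCol. If you read the files yourself, put their contents in a binary column and use setImageBytesCol.

For the output, the "invoices" column holds the parsed service response. In the snippet from the question, invoices.analyzeResult.documentResults.fields is exploded so that each row carries the extracted fields of one invoice. To flatten it, select the individual fields you need from "documents" into their own columns.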
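To give an idea of what "flattened" could look like, here is a small plain-Python sketch that needs neither Spark nor a key. It assumes each entry in fields follows the Form Recognizer v2.1 shape (a type, a typed value such as valueString / valueNumber / valueDate, the raw text, and a confidence). The sample values, the example URL, and the helper names field_value and flatten_invoice are hypothetical.

```python
def field_value(field):
    """Return the typed value of one field, falling back to its raw text."""
    for key in ("valueString", "valueNumber", "valueDate"):
        if field.get(key) is not None:
            return field[key]
    return field.get("text")


def flatten_invoice(source, fields):
    """Turn one invoice's nested fields into a single flat record."""
    row = {"source": source}
    for name, field in fields.items():
        row[name] = field_value(field)
    return row


# Hypothetical sample shaped like one exploded "documents" value.
sample_fields = {
    "VendorName": {"type": "string", "valueString": "Contoso",
                   "text": "Contoso", "confidence": 0.97},
    "InvoiceTotal": {"type": "number", "valueNumber": 110.0,
                     "text": "$110.00", "confidence": 0.95},
    "InvoiceDate": {"type": "date", "valueDate": "2021-06-15",
                    "text": "6/15/2021", "confidence": 0.99},
}

flat = flatten_invoice("https://example.com/invoice1.jpeg", sample_fields)
print(flat)
```

In Spark the same idea would be one output column per field name, each selected from the "documents" column, so the final table has one row per invoice.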

https://mmlspark.blob.core.windows.net/docs/1.0.0-rc1/pyspark/_modules/mmlspark/cognitive/AnalyzeImage.html

https://microsoft.github.io/SynapseML/docs/features/cognitive_services/CognitiveServices%20-%20Overview/