We have a largish Azure Blob Storage account with around 400k files in it. The files are of mixed types and organized by business entity rather than by type (so an entity XYZ might have .pdf, .jpg, and .gif files with blob names like docs/xyz/image.jpg).
We'd like to iterate through this storage account and process all of the image files (creating thumbnails for them). We'd also like the same function to keep processing incoming files as they arrive.
I wrote a BlobTrigger function for this, which worked on our testing storage account with ~5k files - it automatically began going through all existing blobs, which was perfect. But it failed to scale to our production/live account, stalling after only ~10k files processed. On the large storage account, the log stream shows hundreds of entries such as:
2023-12-19T07:25:51Z [Verbose] Blob '[blobName]' will be skipped for function 'Function1' because this blob with ETag '"0x8D9B074C3D91F6D"' has already been processed. PollId: '9ba00431-9d8d-4752-8e33-e9b92bfcdbad'. Source: 'ContainerScan'.
and occasionally messages like
2023-12-19T08:00:06Z [Verbose] Poll for function 'Function1' on queue 'azure-webjobs-blobtrigger-thumbnailgenerator' with ClientRequestId 'ec4b2943-555c-4f05-88e4-10945665621c' found 0 messages in 233 ms.
2023-12-19T08:00:06Z [Verbose] Function 'Function1' will wait 60000 ms before polling queue 'azure-webjobs-blobtrigger-thumbnailgenerator'.
This is on a Consumption plan. I think what is happening is that blobs are being queued up, but when the function app times out and later restarts, the container scan re-queues blobs it has already seen, and the entire function lifetime is then wasted dequeuing and skipping already-processed blobs.
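(Aside: my understanding from the docs is that the blob trigger feeds blobs through an internal control queue, so the queues settings in host.json govern its polling and batching. A sketch of those knobs, with illustrative values only - tuning them wouldn't fix the rescan itself:)
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "maxPollingInterval": "00:00:10",
      "batchSize": 16,
      "newBatchThreshold": 8
    }
  }
}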
This table in the Azure docs mentions that BlobTrigger does not scale well and recommends Event Grid for high-scale storage accounts. I rewrote the BlobTrigger to use an EventGridTrigger instead and configured an Event Grid subscription for the BlobCreated event. Hooking the EventGridTrigger up to our testing storage, it doesn't process any of the existing blobs (which, now that I have found that table, I see is by design).
Blob Trigger questions:
- Can I rescue my BlobTrigger somehow? I am okay with latency - thumbnails only need to appear within a few hours - but I need all existing blobs processed.
- For the BlobTrigger, I noticed I don't have a BlobScanInfo folder in my azure-webjobs-hosts container. Is this related to the scan repeating? I saw "continuation tokens" mentioned online, but I don't know whether they're working or not.
For Event Grid:
- Can I easily issue events for all existing blobs? I could write a custom event-publishing app (roughly the sketch after this list), but this seems like a lot of effort.
and lastly,
- Are there other approaches I am missing? The large storage account is in active use by our production business app, so I don't want to do major re-engineering like moving processed files to a new account. I don't mind using a local PC to do an initial first pass over the files, and I could integrate this with the Event Grid solution, but I thought I'd ask before embarking on more attempted fixes.
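For context on that Event Grid bullet, here's roughly the shape of custom event-publishing app I had in mind: enumerate the existing blobs and publish synthetic BlobCreated-style events to a custom Event Grid topic that also has a subscription pointing at the function. This is only a sketch under my own assumptions - the topic endpoint, key, and connection string are placeholders, and I haven't verified that TryGetSystemEventData accepts custom-topic events carrying the Microsoft.Storage.BlobCreated event type:
// Hypothetical back-fill publisher (sketch, not tested at scale).
using System;
using Azure;
using Azure.Messaging.EventGrid;
using Azure.Storage.Blobs;

var container = new BlobContainerClient("<storage-connection-string>", "documents");
var publisher = new EventGridPublisherClient(
    new Uri("<topic-endpoint>"),
    new AzureKeyCredential("<topic-key>"));

foreach (var blob in container.GetBlobs())
{
    // Mimic the system event shape just enough for the function,
    // which only reads the blob url out of the event data.
    var evt = new EventGridEvent(
        $"/blobServices/default/containers/documents/blobs/{blob.Name}", // subject
        "Microsoft.Storage.BlobCreated",                                 // eventType
        "1.0",                                                           // dataVersion
        new { url = container.GetBlobClient(blob.Name).Uri.ToString() });
    await publisher.SendEventAsync(evt);
}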
Thanks!
Edit: code as requested in comments (lightly edited for anonymity). They're probably both a bit dodgy, as they're first iterations and Azure Functions is somewhat outside my regular working domain.
The BlobTrigger function
[FunctionName("Function1")]
public void Run([BlobTrigger("documents/{blobName}.{blobExtension}", Connection = "company-production")] Stream inputBlob,
string blobName,
string blobExtension,
ILogger log,
IBinder binder)
{
log.LogInformation($"C# Blob trigger function Processed blob\n File Name:{blobName} " +
$"\n Extension: {blobExtension}");
if (blobName.Contains("company-thumbnail")) return; //no infinite loop
var allowedExtensions = new[] { "JPG", "JPEG", "JPE", "BMP", "GIF", "PNG", "TIFF", "HEIC", "SVG" };
if (allowedExtensions.All(i=>i != blobExtension.ToUpper())) return;
using var image = new MagickImage(inputBlob);
using var memoryStream = new MemoryStream();
// Create a new blob block to hold our image
var outBlobId = $"{blobName}-company-thumbnail.jpg";
var outboundBlob = new BlobAttribute($"documents/{outBlobId}", FileAccess.Write);
var outBlobBlock = binder.Bind<CloudBlockBlob>(outboundBlob);
image.Format = MagickFormat.Jpg;
MagickGeometry size = new MagickGeometry(128, 128)
{
Greater = true
};
image.Thumbnail(size);
image.AutoOrient();
image.Write(memoryStream);
memoryStream.Position = 0;
// Upload to azure
outBlobBlock.Properties.ContentType = "image/jpeg";
outBlobBlock.UploadFromStream(memoryStream);
}
and the EventGridTrigger:
using System;
using System.IO;
using System.Linq;
using Azure.Messaging.EventGrid;
using Azure.Messaging.EventGrid.SystemEvents;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using ImageMagick;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class ThumbnailGenerator
{
    private static readonly string BLOB_STORAGE_CONNECTION_STRING = Environment.GetEnvironmentVariable("AzureWebJobsStorage");

    [FunctionName("CreateThumbnail")]
    public void Run(
        [EventGridTrigger] EventGridEvent eventGridEvent,
        [Blob("{data.url}", FileAccess.Read)] Stream inputBlob,
        ILogger log)
    {
        log.LogInformation("Event received \n" + eventGridEvent.Data.ToString());
        if (inputBlob == null)
        {
            log.LogInformation("Input blob stream was null/empty");
            return;
        }
        try
        {
            if (eventGridEvent.TryGetSystemEventData(out object systemEvent))
            {
                switch (systemEvent)
                {
                    case StorageBlobCreatedEventData blobCreated:
                        GenerateThumbnail(blobCreated.Url, inputBlob, log);
                        break;
                    default:
                        log.LogInformation(eventGridEvent.EventType);
                        break;
                }
            }
        }
        catch (Exception ex)
        {
            log.LogError(ex, ex.Message);
        }
    }

    private void GenerateThumbnail(string url, Stream inputBlob, ILogger logger)
    {
        var blobName = GetBlobNameFromUrl(url);
        if (blobName.Contains("company-thumbnail"))
        {
            logger.LogInformation("Skipping thumbnail blob");
            return; // no infinite loop
        }
        var blobExtension = Path.GetExtension(url);
        var allowedExtensions = new[] { ".JPG", ".JPEG", ".JPE", ".BMP", ".GIF", ".PNG", ".TIFF", ".HEIC", ".SVG" };
        if (!allowedExtensions.Contains(blobExtension.ToUpperInvariant()))
        {
            logger.LogInformation($"Blob extension: {blobExtension}. Skipping non-image blob");
            return;
        }

        using var image = new MagickImage(inputBlob);
        using var memoryStream = new MemoryStream();

        // Fit within 128x128, preserving aspect ratio; Greater = only shrink, never enlarge.
        image.Format = MagickFormat.Jpg;
        var size = new MagickGeometry(128, 128) { Greater = true };
        image.Thumbnail(size);
        image.AutoOrient();
        image.Write(memoryStream);
        memoryStream.Position = 0;
        logger.LogInformation("Image transformation done, uploading blob");

        // Create a new blob to hold our thumbnail.
        var outBlobId = blobName.Remove(blobName.LastIndexOf('.')) + "-company-thumbnail.jpg";
        var blobServiceClient = new BlobServiceClient(BLOB_STORAGE_CONNECTION_STRING);
        var blobContainerClient = blobServiceClient.GetBlobContainerClient("documents");
        var blobClient = blobContainerClient.GetBlobClient(outBlobId);

        // Upload to Azure with the right content type.
        var httpHeaders = new BlobHttpHeaders { ContentType = "image/jpeg" };
        blobClient.Upload(memoryStream, httpHeaders);
        logger.LogInformation("Blob uploaded");
    }

    private static string GetBlobNameFromUrl(string blobUrl)
    {
        // BlobClient parses the container and blob name out of the full URL for us.
        var uri = new Uri(blobUrl);
        var blobClient = new BlobClient(uri);
        return blobClient.Name;
    }
}
Edit 2: As a temporary measure, I ran my BlobTrigger function locally by editing my local.settings.json (after stopping all the Azure functions):
{
  "IsEncrypted": false,
  "Values": {
    //"AzureWebJobsStorage": "UseDevelopmentStorage=true",  <-- removed
    "AzureWebJobsStorage": "<actual production connection string>",
    "FUNCTIONS_WORKER_RUNTIME": "dotnet"
  }
}
When it has finished running, I will update the receipts folder in azure-webjobs-hosts from my local computer's name to the Azure function's name. That should suffice for an initial run-through, I hope!
Edit 4: Fixed up event trigger code
So, reporting back in.
In the end I used a simple console app and the blob API to process the existing blobs (as Thomas kind of suggested in the comments). This was much, much faster and more resilient than running the BlobTrigger locally; running through all the existing images took around 4-5 hours. For steady state, Event Grid is running and seems fine.
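For anyone curious, the console app was essentially the following shape (a simplified sketch: the connection string and container name are placeholders, and the real version also had retries and progress logging):
// Back-fill console app sketch (.NET 6+ top-level statements).
using System;
using System.IO;
using System.Linq;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using ImageMagick;

var container = new BlobContainerClient("<storage-connection-string>", "documents");
var allowed = new[] { ".jpg", ".jpeg", ".jpe", ".bmp", ".gif", ".png", ".tiff", ".heic", ".svg" };

foreach (BlobItem item in container.GetBlobs())
{
    if (item.Name.Contains("company-thumbnail")) continue; // skip existing thumbnails
    var ext = Path.GetExtension(item.Name).ToLowerInvariant();
    if (!allowed.Contains(ext)) continue;                   // skip non-images

    // Download the source image into memory.
    using var input = new MemoryStream();
    container.GetBlobClient(item.Name).DownloadTo(input);
    input.Position = 0;

    // Same Magick.NET transformation as in the functions above.
    using var image = new MagickImage(input);
    image.Format = MagickFormat.Jpg;
    image.Thumbnail(new MagickGeometry(128, 128) { Greater = true });
    image.AutoOrient();

    using var output = new MemoryStream();
    image.Write(output);
    output.Position = 0;

    // Upload next to the original (content type left default here for brevity).
    var destName = item.Name.Remove(item.Name.LastIndexOf('.')) + "-company-thumbnail.jpg";
    container.GetBlobClient(destName).Upload(output, overwrite: true);
    Console.WriteLine($"Processed {item.Name}");
}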
When the Azure docs say that blob triggers don't scale... they really do mean it. The wake-up-and-rescan-all-receipts problem is serious for any storage account with even ~10,000 blobs, I'd say. Restarting my local app with ~75,000 receipts took around 1.5 hours (and running it locally seemed to process receipts faster than the Consumption plan did). The blob trigger's pickup of existing blobs was also quite slow compared to the console app. All of this was further complicated for our app by an early decision to use a single container (rather than, say, a separate container for unprocessed images, or a separate container for thumbnails).
So in the end for anyone going down the same path: