I'm trying to use native blob soft delete on Azure Cognitive Search (per https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs) so that when files are deleted from a container, the corresponding documents are deleted from the index, but it is not behaving as I expect or as the documentation describes.
I have created a Storage account with "Enable Soft Delete for Blobs" turned on, then created a storage container within that account.
I have created a Data Source to use that container with the following settings (JSON):
{
"@odata.context": "https://sotestservice1.search.windows.net/$metadata#datasources/$entity",
"@odata.etag": "\"0x8DC3D5D3E87D000\"",
"name": "sotestdatasource1",
"description": null,
"type": "azureblob",
"subtype": null,
"credentials": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=soteststorageaccount;AccountKey=OMITTED;EndpointSuffix=core.windows.net"
},
"container": {
"name": "sotestcontainer"
},
"dataChangeDetectionPolicy": null,
"dataDeletionDetectionPolicy": {
"@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
},
"encryptionKey": null,
"identity": null
}
- I created an index on Azure Cognitive Search, with the following settings (JSON):
{
"@odata.context": "https://sotestservice1.search.windows.net/$metadata#indexes/$entity",
"@odata.etag": "\"0x8DC3D5CD0D66187\"",
"name": "sotestindex1",
"defaultScoringProfile": null,
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"normalizer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "title",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "standard.lucene",
"normalizer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "content",
"type": "Edm.String",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "standard.lucene",
"normalizer": null,
"dimensions": null,
"vectorSearchProfile": null,
"synonymMaps": []
},
{
"name": "titleVector",
"type": "Collection(Edm.Single)",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"normalizer": null,
"dimensions": 4,
"vectorSearchProfile": "vector-profile-1709674759253",
"synonymMaps": []
},
{
"name": "contentVector",
"type": "Collection(Edm.Single)",
"searchable": true,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"normalizer": null,
"dimensions": 4,
"vectorSearchProfile": "vector-profile-1709674759253",
"synonymMaps": []
}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [],
"normalizers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"encryptionKey": null,
"similarity": {
"@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
"k1": null,
"b": null
},
"semantic": null,
"vectorSearch": {
"algorithms": [
{
"name": "vector-config-1709674581416",
"kind": "hnsw",
"hnswParameters": {
"metric": "cosine",
"m": 4,
"efConstruction": 400,
"efSearch": 500
},
"exhaustiveKnnParameters": null
}
],
"profiles": [
{
"name": "vector-profile-1709674759253",
"algorithm": "vector-config-1709674581416",
"vectorizer": null
}
],
"vectorizers": []
}
}
- I created an indexer to use the data source and the index defined above (JSON):
{
"@odata.context": "https://sotestservice1.search.windows.net/$metadata#indexers/$entity",
"@odata.etag": "\"0x8DC3D5F31973528\"",
"name": "sotestindexer1",
"description": null,
"dataSourceName": "sotestdatasource1",
"skillsetName": null,
"targetIndexName": "sotestindex1",
"disabled": null,
"schedule": null,
"parameters": {
"batchSize": null,
"maxFailedItems": null,
"maxFailedItemsPerBatch": null,
"base64EncodeKeys": null,
"configuration": {
"indexedFileNameExtensions": ".json",
"parsingMode": "json"
}
},
"fieldMappings": [],
"outputFieldMappings": [],
"cache": null,
"encryptionKey": null
}
- I uploaded the JSON documents below:
AbrahamLincoln.json:
{
"id": "https---en-wikipedia-org-wiki-Abraham-Lincoln",
"content": "Lincoln was born into poverty in a log cabin in Kentucky and was raised on the frontier, primarily in Indiana.",
"contentVector": [-0.7, 0.3, 0.9, -0.8],
"title": "Abraham Lincoln",
"titleVector": [0.6, -0.7, 0.2, 0.4],
"@search.action": "mergeOrUpload"
}
FranklinRoosevelt.json:
{
"id": "https----en-wikipedia-org-wiki-Franklin-D-Roosevelt",
"content": "A member of the Delano family and Roosevelt family, after attending university, Roosevelt began to practice law in New York City.",
"contentVector": [0.5, 0.9, 0.3, 0.4],
"title": "Franklin D. Roosevelt",
"titleVector": [0.7, 0.1, 0.8, -0.3],
"@search.action": "mergeOrUpload"
}
I run the indexer and it reports success (2 documents), and '*' searches against the index return the two documents, exactly as expected. Up to this point, everything is good.
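(For reference, I check the document count by posting a wildcard search to the index's docs/search endpoint, e.g. POST https://sotestservice1.search.windows.net/indexes/sotestindex1/docs/search?api-version=2023-11-01 with the body below; the api-version here is just the one I happen to use:)

{
  "search": "*",
  "count": true
}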
I delete AbrahamLincoln.json from the storage container.
I re-run the indexer, which succeeds.
Here, I would expect the index to contain only a single document. Instead, it contains three: my original two documents, plus an additional document that looks like this:
{
"id": "aHR0cHM6Ly9zb3Rlc3RzdG9yYWdlYWNjb3VudC5ibG9iLmNvcmUud2luZG93cy5uZXQvc290ZXN0Y29udGFpbmVyL0ZyYW5rbGluUm9vc2V2ZWx0Lmpzb241",
"title": "Franklin D. Roosevelt",
"content": "A member of the Delano family and Roosevelt family, after attending university, Roosevelt began to practice law in New York City.",
"titleVector": [
0.7,
0.1,
0.8,
-0.3
],
"contentVector": [
0.5,
0.9,
0.3,
0.4
]
}
So now I'm confused, because there are 3 documents instead of 1, and this third document is a duplicate of an existing document. In addition, it has a new id that is the Base64-encoded URL of the blob (with a digit appended to the end indicating how many '=' padding characters there should be).
Is Cognitive Search doing something wrong here, or am I?
For native blob soft delete, there are some requirements to be satisfied. One of them is that the document key in the search index must be mapped to either a blob property or blob metadata. Your id key is populated from the JSON content of each blob rather than from the blob itself, so the deletion detection policy cannot match a soft-deleted blob back to the document it produced. So, you need to map the key to a blob property or blob metadata.
Modify your index definition like the sketch below. Here I have added the metadata_storage_path metadata property as a field and made it the key.
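A minimal sketch of such an index, reusing your field names and vector profile; the vectorSearch section is unchanged from your definition and omitted here, and attribute values other than the key change are assumptions:

{
  "name": "sotestindex1",
  "fields": [
    {
      "name": "metadata_storage_path",
      "type": "Edm.String",
      "key": true,
      "searchable": false,
      "filterable": false,
      "sortable": false,
      "facetable": false
    },
    {
      "name": "id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "standard.lucene"
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "standard.lucene"
    },
    {
      "name": "titleVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "dimensions": 4,
      "vectorSearchProfile": "vector-profile-1709674759253"
    },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "sortable": false,
      "facetable": false,
      "dimensions": 4,
      "vectorSearchProfile": "vector-profile-1709674759253"
    }
  ]
}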
Next, the Base64-encoded key you are getting is because you either set base64EncodeKeys to true or applied the base64Encode mapping function in the indexer's fieldMappings. Below is the definition of the indexer.
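A sketch of that indexer; the explicit base64Encode mapping turns the blob URL into a key-safe value (URLs contain characters such as ':' and '/' that are not allowed in document keys):

{
  "name": "sotestindexer1",
  "dataSourceName": "sotestdatasource1",
  "targetIndexName": "sotestindex1",
  "parameters": {
    "configuration": {
      "indexedFileNameExtensions": ".json",
      "parsingMode": "json"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode"
      }
    }
  ]
}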
In this case, if you set base64EncodeKeys to true or use base64Encode as the mappingFunction, you will get a Base64-encoded key. Data source definition:
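The data source can stay as in your question; the native soft-delete policy is the only part that matters here:

{
  "name": "sotestdatasource1",
  "type": "azureblob",
  "credentials": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=soteststorageaccount;AccountKey=OMITTED;EndpointSuffix=core.windows.net"
  },
  "container": {
    "name": "sotestcontainer"
  },
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
  }
}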
With this setup I got the expected output after removing the JSON file.
Initially, there were 2 documents. After deleting 1 and re-running the indexer, I got 1 document in the index.