Why does an Azure Cognitive Search indexer create Base64 names unnecessarily?


I'm trying to use native blob soft delete in Azure Cognitive Search (per https://learn.microsoft.com/en-us/azure/search/search-howto-index-changed-deleted-blobs) so that when a file is deleted from a container, the corresponding document is also deleted from the index. It is not behaving as I expect, nor as the documentation describes.

  1. I have created a Storage account with "Enable Soft Delete for Blobs" turned on. Then I create a storage container within that account.

  2. I have created a Data Source to use that container with the following settings (JSON):

{
  "@odata.context": "https://sotestservice1.search.windows.net/$metadata#datasources/$entity",
  "@odata.etag": "\"0x8DC3D5D3E87D000\"",
  "name": "sotestdatasource1",
  "description": null,
  "type": "azureblob",
  "subtype": null,
  "credentials": { 
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=soteststorageaccount;AccountKey=OMITTED;EndpointSuffix=core.windows.net"
  },
  "container": {
    "name": "sotestcontainer"
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
  },
  "encryptionKey": null,
  "identity": null
}

  3. I created an index in Azure Cognitive Search with the following settings (JSON):

{
  "@odata.context": "https://sotestservice1.search.windows.net/$metadata#indexes/$entity",
  "@odata.etag": "\"0x8DC3D5CD0D66187\"",
  "name": "sotestindex1",
  "defaultScoringProfile": null,
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "synonymMaps": []
    },
    {
      "name": "title",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "synonymMaps": []
    },
    {
      "name": "content",
      "type": "Edm.String",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": "standard.lucene",
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "synonymMaps": []
    },
    {
      "name": "titleVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": 4,
      "vectorSearchProfile": "vector-profile-1709674759253",
      "synonymMaps": []
    },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": 4,
      "vectorSearchProfile": "vector-profile-1709674759253",
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": null,
  "vectorSearch": {
    "algorithms": [
      {
        "name": "vector-config-1709674581416",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        },
        "exhaustiveKnnParameters": null
      }
    ],
    "profiles": [
      {
        "name": "vector-profile-1709674759253",
        "algorithm": "vector-config-1709674581416",
        "vectorizer": null
      }
    ],
    "vectorizers": []
  }
}

  4. I created an indexer that uses the data source from #2 and the index from #3:

{
  "@odata.context": "https://sotestservice1.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DC3D5F31973528\"",
  "name": "sotestindexer1",
  "description": null,
  "dataSourceName": "sotestdatasource1",
  "skillsetName": null,
  "targetIndexName": "sotestindex1",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": null,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "indexedFileNameExtensions": ".json",
      "parsingMode": "json"
    }
  },
  "fieldMappings": [],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}

  5. I uploaded the following JSON documents:

AbrahamLincoln.json:

{
    "id": "https---en-wikipedia-org-wiki-Abraham-Lincoln",
    "content": "Lincoln was born into poverty in a log cabin in Kentucky and was raised on the frontier, primarily in Indiana.", 
    "contentVector": [-0.7, 0.3, 0.9, -0.8], 
    "title": "Abraham Lincoln", 
    "titleVector": [0.6, -0.7, 0.2, 0.4], 
    "@search.action": "mergeOrUpload"
}

FranklinRoosevelt.json:

{
    "id": "https----en-wikipedia-org-wiki-Franklin-D-Roosevelt",
    "content": "A member of the Delano family and Roosevelt family, after attending university, Roosevelt began to practice law in New York City.", 
    "contentVector": [0.5, 0.9, 0.3, 0.4], 
    "title": "Franklin D. Roosevelt", 
    "titleVector": [0.7, 0.1, 0.8, -0.3], 
    "@search.action": "mergeOrUpload"
}

  6. I run the indexer. It reports success (2 documents), and '*' searches against the index return the two documents, exactly as expected. Up to this point, everything works.

  7. I delete AbrahamLincoln.json from the storage container.

  8. I re-run the indexer, which succeeds.

  9. At this point I would expect the index to contain a single document. Instead, it contains three: my original two documents, plus an additional document that looks like this:

   {
      "id": "aHR0cHM6Ly9zb3Rlc3RzdG9yYWdlYWNjb3VudC5ibG9iLmNvcmUud2luZG93cy5uZXQvc290ZXN0Y29udGFpbmVyL0ZyYW5rbGluUm9vc2V2ZWx0Lmpzb241",
      "title": "Franklin D. Roosevelt",
      "content": "A member of the Delano family and Roosevelt family, after attending university, Roosevelt began to practice law in New York City.",
      "titleVector": [
        0.7,
        0.1,
        0.8,
        -0.3
      ],
      "contentVector": [
        0.5,
        0.9,
        0.3,
        0.4
      ]
    }

So now I'm confused, because there are 3 documents instead of 1, and the third document is a duplicate of an existing document. In addition, its new id is the Base64-encoded URL of the blob (with a digit appended to indicate how many ='s of padding there should be).
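
To confirm what the new id contains, the key can be decoded offline. The sketch below assumes the id uses the URL-token flavour of Base64 (URL-safe alphabet, with the padding replaced by a trailing digit giving the number of = characters), which is consistent with the id shown above. decode_url_token is a hypothetical helper name, not part of any Azure SDK.

```python
import base64


def decode_url_token(token: str) -> str:
    """Decode a URL-token Base64 string (assumed format: URL-safe Base64
    body followed by one digit giving the number of '=' padding chars)."""
    body, pad_count = token[:-1], int(token[-1])
    padded = body + "=" * pad_count
    return base64.urlsafe_b64decode(padded).decode("utf-8")


# The id of the unexpected third document, copied from the search result above.
doc_id = (
    "aHR0cHM6Ly9zb3Rlc3RzdG9yYWdlYWNjb3VudC5ibG9iLmNvcmUud2luZG93cy5uZXQv"
    "c290ZXN0Y29udGFpbmVyL0ZyYW5rbGluUm9vc2V2ZWx0Lmpzb241"
)

print(decode_url_token(doc_id))
```

This prints the URL of the FranklinRoosevelt.json blob in sotestcontainer, so the third document is keyed by the blob's storage path rather than by the id field inside the JSON file.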

Is Cognitive Search doing something wrong here, or am I?

Best answer, by JayashankarGS:

For native blob soft delete, several requirements must be satisfied. One of them is:

  • Document keys for the documents in your index must be mapped to either a blob property or blob metadata, such as "metadata_storage_path".

So you need to map the document key to either a blob property or blob metadata.

Modify your index definition as shown below:

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexes/$entity",
  "@odata.etag": "\"0x8DC3D9565D5ADC9\"",
  "name": "azureblob-index-2",
  "defaultScoringProfile": "",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      .........
    },
    {
      "name": "title",
      "type": "Edm.String",
     .........
    },
    {
      "name": "content",
      "type": "Edm.String",
      .........
    },
    {
      "name": "titleVector",
      "type": "Collection(Edm.Double)",
      .........
    },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Double)",
      .........
    },
    {
      "name": "metadata_storage_path",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "analyzer": null,
      "normalizer": null,
      "dimensions": null,
      "vectorSearchProfile": null,
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "corsOptions": null,
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "encryptionKey": null,
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
    "k1": null,
    "b": null
  },
  "semantic": null,
  "vectorSearch": null
}

Here I have added the metadata field metadata_storage_path and made it the key.

Next, you are getting the Base64-encoded key because either base64EncodeKeys is set to true, or the base64Encode mapping function is specified in the fieldMappings field.

Below is the indexer definition.

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DC3D95C7751E0D\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "blobsource",
  "skillsetName": null,
  "targetIndexName": "azureblob-index-2",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "json"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}

In this case, if you set base64EncodeKeys to true or use base64Encode as the mappingFunction, you will get a Base64-encoded key.
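
To illustrate why the key is encoded when it is mapped to metadata_storage_path, here is a rough sketch of the encoding side. It rests on two assumptions: that the base64Encode mapping function produces URL-safe Base64 with the padding replaced by a count digit, and that document keys may contain only letters, digits, dashes, underscores, and equal signs. The helper name and the sample blob path are hypothetical.

```python
import base64
import re

# Assumed rule for index keys: letters, digits, '-', '_' and '=' only.
VALID_KEY = re.compile(r"^[A-Za-z0-9_\-=]+$")


def encode_url_token(value: str) -> str:
    """URL-safe Base64 with '=' padding replaced by a count digit
    (assumed to mirror the indexer's base64Encode mapping function)."""
    encoded = base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")
    stripped = encoded.rstrip("=")
    return stripped + str(len(encoded) - len(stripped))


# Hypothetical blob path, for illustration only.
path = "https://jgsblob.blob.core.windows.net/data/AbrahamLincoln.json"

print(bool(VALID_KEY.match(path)))                    # False: ':' '/' '.' are not allowed
print(bool(VALID_KEY.match(encode_url_token(path))))  # True: the encoded form is key-safe
```

Under those assumptions, the raw storage path cannot be used as a key, while its encoded form can, which is why the key looks opaque in search results.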

Data source definition:

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#datasources/$entity",
  "@odata.etag": "\"0x8DC3D923FFB589D\"",
  "name": "blobsource",
  "description": null,
  "type": "azureblob",
  "subtype": null,
  "credentials": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=jgsblob;AccountKey=..."
  },
  "container": {
    "name": "data",
    "query": "json"
  },
  "dataChangeDetectionPolicy": null,
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
  },
  "encryptionKey": null,
  "identity": null
}

With this setup, removing the JSON file produced the expected result.

Initially, there were 2 documents. After deleting 1 and re-running the indexer, I got 1 document in the index.
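
One way to check the result without the portal is to ask the service for the document count before and after the deletion. This is only a sketch: it assumes the REST docs/$count endpoint, the 2023-11-01 API version, and a query key passed in the api-key header. The service and index names come from the definitions above, and QUERY_KEY is a placeholder.

```python
import urllib.request

API_VERSION = "2023-11-01"  # assumption: adjust to the version your service uses


def count_url(service: str, index: str) -> str:
    """Build the URL that returns the number of documents in an index."""
    return (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/$count?api-version={API_VERSION}"
    )


def document_count(service: str, index: str, api_key: str) -> int:
    """Fetch the document count (performs a network call)."""
    request = urllib.request.Request(
        count_url(service, index), headers={"api-key": api_key}
    )
    with urllib.request.urlopen(request) as response:
        # The body is a plain integer; utf-8-sig tolerates a leading BOM.
        return int(response.read().decode("utf-8-sig"))


print(count_url("jgsai", "azureblob-index-2"))
# document_count("jgsai", "azureblob-index-2", "QUERY_KEY")  # needs a real key
```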