Uploading documents to Azure AI Search


I am implementing RAG using Azure AI Search. I have created the index and have 2605 document chunks in total to upload to it. The peculiar behaviour I have observed is:

  1. I cannot upload all 2605 chunks in one go.
  2. When I loop over the chunks and pass 600 per iteration, only 2000 end up uploaded. The first three iterations load 600 each, but the fourth loads just 200 and then aborts.
  3. If I increase the batch size to 900, the output shows that all chunks get loaded: 900 in each of the first two iterations and the remaining 805 in the third.
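For reference, the batching loop described above can be sketched as follows. This is a minimal illustration, not the asker's actual code: `iter_batches`, `upload_in_batches`, and the `upload_fn` parameter are hypothetical names, and `upload_fn` stands in for whatever call performs the real upload (for example `search_client.upload_documents`).

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(docs: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size documents."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(docs), batch_size):
        yield list(docs[start:start + batch_size])


def upload_in_batches(
    docs: Sequence[dict],
    upload_fn: Callable[[List[dict]], object],
    batch_size: int = 600,
) -> List[int]:
    """Send every batch through upload_fn and return the size of each batch sent."""
    sizes = []
    for batch in iter_batches(docs, batch_size):
        upload_fn(batch)  # hypothetical stand-in for the real upload call
        sizes.append(len(batch))
    return sizes


if __name__ == "__main__":
    chunks = [{"id": str(i)} for i in range(2605)]
    # A no-op upload function, used only to show how the batches are cut.
    print(upload_in_batches(chunks, lambda batch: None, batch_size=600))
    # prints [600, 600, 600, 600, 205]
```

With 2605 chunks, plain slicing gives a fourth batch of 600 and a fifth of 205, so a fourth iteration that stops at 200 suggests the upload call itself failed partway rather than the loop cutting the data incorrectly.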

I am trying to understand what goes on under the hood, because I need to write code that handles uploads as small as 10 chunks and as large as 10000 chunks. According to the documentation, Azure AI Search imposes certain limits: for example, uploaded documents cannot be larger than 16 MB, and a batch cannot exceed 1000 documents. Even together, these two limits don't explain why I cannot load all the chunks with a batch size of 600 but succeed with 900.
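One way to respect both limits at once is to cut batches by document count and by approximate payload size. The sketch below is an assumption-laden illustration: the two constants are simply the figures quoted in this question, `size_aware_batches` is a hypothetical helper, and the JSON-serialized size is only an estimate of what the service receives (the real request adds some envelope overhead, so leaving headroom is sensible).

```python
import json
from typing import Iterable, Iterator, List

MAX_DOCS_PER_BATCH = 1000            # per-batch document limit quoted in the question
MAX_BATCH_BYTES = 16 * 1024 * 1024   # 16 MB limit quoted in the question


def size_aware_batches(
    docs: Iterable[dict],
    max_docs: int = MAX_DOCS_PER_BATCH,
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[dict]]:
    """Yield batches that stay within both a document count and a byte budget."""
    batch: List[dict] = []
    batch_bytes = 0
    for doc in docs:
        # Approximate the document's size by its JSON encoding.
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if doc_bytes > max_bytes:
            raise ValueError("a single document exceeds the byte budget")
        # Close the current batch if adding this document would break a limit.
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch
```

Because the batches are cut on whichever limit is hit first, the same helper would cover both a 10-chunk and a 10000-chunk upload without a hand-tuned batch size.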

I was expecting it to load the chunks irrespective of the batch size.


There is 1 best solution below

JayashankarGS On

I used the Python SDK to upload documents, and the upload succeeded. I tried with 3k and 10k documents, and in both cases all the documents were uploaded to the index in one go.

Refer to the code below.

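import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Assumption: the endpoint and API key are read from environment variables.
# The variable names are placeholders, not part of the original answer.
service_endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
key = os.environ["AZURE_SEARCH_API_KEY"]
index_name = "hotels-2"

search_client = SearchClient(service_endpoint, index_name, AzureKeyCredential(key))

def upload_document():
    # `hotels` is assumed to be defined elsewhere: a list of dicts
    # whose fields match the index schema.
    result = search_client.upload_documents(documents=hotels)

    # `result` holds one entry per document; this prints the status of the first only.
    print("Upload of new document succeeded: {}".format(result[0].succeeded))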
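The snippet above prints only `result[0].succeeded`, so a partial failure further down the list would go unnoticed. A small sketch for checking every entry follows; `summarize_results` is a hypothetical helper, and it assumes each result object exposes `succeeded` and `key` attributes (the fake objects in the demo are stand-ins, not SDK types).

```python
from types import SimpleNamespace
from typing import Dict, Sequence


def summarize_results(results: Sequence) -> Dict[str, object]:
    """Count successes and collect the keys of documents that failed."""
    failed = [r for r in results if not r.succeeded]
    return {
        "total": len(results),
        "succeeded": len(results) - len(failed),
        "failed_keys": [r.key for r in failed],
    }


if __name__ == "__main__":
    # Fake results for illustration; real ones would come from the upload call.
    fake = [
        SimpleNamespace(key="1", succeeded=True),
        SimpleNamespace(key="2", succeeded=False),
    ]
    print(summarize_results(fake))
    # prints {'total': 2, 'succeeded': 1, 'failed_keys': ['2']}
```

Comparing `total` against the number of documents sent, batch by batch, would show exactly where an upload stops short.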

Output:

[Screenshot of the output, not reproduced here]

As the output shows, the document list has a length of 10000.

In the portal:

[Screenshot of the index in the portal, not reproduced here]

For more information, refer to this GitHub repository.