The data in question is updated daily through an HTTP API, and no metadata is available. It consists of roughly 28 columns and 10,000 rows. The API paginates its responses in pages of 30 rows. It returns JSON, which is converted to Parquet through a mapping.
The data is retrieved via this loop, created by @HimanshuSinha-MSFT and explained at: https://learn.microsoft.com/en-us/answers/questions/468561/azure-data-factory-pagination-issue.html#answer-470345
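For reference, that pattern is roughly an Until loop that copies one page per iteration and then advances an offset variable. Below is a minimal sketch; the activity and variable names, the 10,000-row cap, and the source/sink types are my assumptions, not the exact pipeline from the linked answer. A helper variable (nextOffset) avoids referencing offset inside its own update expression.

```json
{
    "name": "PaginateUntilDone",
    "type": "Until",
    "typeProperties": {
        "expression": {
            "value": "@greater(int(variables('offset')), 10000)",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "CopyOnePage",
                "type": "Copy",
                "typeProperties": {
                    "source": { "type": "JsonSource" },
                    "sink": { "type": "ParquetSink" }
                }
            },
            {
                "name": "ComputeNextOffset",
                "type": "SetVariable",
                "dependsOn": [
                    { "activity": "CopyOnePage", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "variableName": "nextOffset",
                    "value": {
                        "value": "@string(add(int(variables('offset')), 30))",
                        "type": "Expression"
                    }
                }
            },
            {
                "name": "AdvanceOffset",
                "type": "SetVariable",
                "dependsOn": [
                    { "activity": "ComputeNextOffset", "dependencyConditions": [ "Succeeded" ] }
                ],
                "typeProperties": {
                    "variableName": "offset",
                    "value": {
                        "value": "@variables('nextOffset')",
                        "type": "Expression"
                    }
                }
            }
        ]
    }
}
```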
However, there are 3 problems to solve:
1. No metadata from the HTTP API. The total number of iterations has to be derived from the total row count, which is unavailable. I temporarily worked around this by setting a very large iteration count, so ADF keeps fetching empty pages until the iterations run out. This takes extremely long. It could be fixed by reading the metadata, but that is not available for an HTTP gateway. How can I retrieve the metadata?
2. The method is time-consuming. Each iteration of the loop takes 20 seconds for 30 rows, and the iterations run sequentially. When one part of the loop fails, the whole loop fails (which wastes even more time). Can this process go faster?
3. JSON columns are not recognised. Not all column names from the JSON endpoint are recognised by ADF, because of the nested arrays. Is there a solution to this?
Thank you!
To paginate an HTTP API, you can use Azure Data Factory's built-in pagination rules rather than a ForEach loop. Note that pagination rules are a feature of the REST connector, so the source should be a REST dataset rather than an HTTP one.
I took a sample API and applied pagination rules to it.
Here I passed a range to the Offset parameter: the start is 1, the end is left blank (because the total number of rows is unknown), and the offset step is 20. For this situation, Azure Data Factory's pagination rules offer an end condition based on a value in the response body:
Here I used the end condition $.results.name with the option "does not exist", so pagination stops once that field is missing from a response. Using Azure Data Factory pagination rules should also reduce the execution time.
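In pipeline JSON, those settings land on the copy activity's source as pagination rules. A minimal sketch, assuming the source has been switched to the REST connector and that the API's paging query parameter is actually named offset; the RANGE and end-condition value strings follow the pagination-rule syntax documented for the REST connector (the portal generates this JSON when you configure the rules in the UI):

```json
"source": {
    "type": "RestSource",
    "paginationRules": {
        "QueryParameters.offset": "RANGE:1::20",
        "EndCondition:$.results.name": "NonExist"
    }
}
```

With the end of the range left blank, ADF keeps incrementing offset by 20 until the end condition fires, i.e. until $.results.name no longer appears in a response.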
To avoid the column-recognition issue, use the mapping option in the copy activity. Import the schema of the JSON output and map it explicitly, as below.
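In the copy activity JSON, such a mapping becomes a TabularTranslator whose collectionReference points at the nested array to flatten. A minimal sketch with hypothetical field names (results as the array, id and address.city as fields):

```json
"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "path": "['id']" }, "sink": { "name": "id" } },
        { "source": { "path": "['address']['city']" }, "sink": { "name": "city" } }
    ],
    "collectionReference": "$['results']"
}
```

Each element of $['results'] then becomes one row in the Parquet sink, with the nested fields promoted to flat columns.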