Azure Data Factory2's pagination rule (for HTTP API)

1.8k Views Asked by At

The concerning data is updated daily through an HTTP API, and no metadata is available. It consists of +/- 28 columns and 10,000 rows. There is a pagination of 30 applied to the API. It retrieves JSON data and is converted to Parquet through mapping.

Retrieving the data was done via this loop, created by @HimanshuSinha-MSFT and explained via: https://learn.microsoft.com/en-us/answers/questions/468561/azure-data-factory-pagination-issue.html#answer-470345

However, there are 3 problems to solve:

No metadata from HTTP API The total number of iterations is needed, based on total rows, which is unavailable. I temporarily solved this by putting in a large number for total iterations, so ADF continues getting the empty rows until the iterations are over. This takes extremely long. This can be fixed by getting the metadata, but it is not available for an HTTP gateway. How can I retrieve the metadata?

Method is time consuming Iterating through the loop takes 20 seconds, for 30 rows. The iterations are sequential. When a part in the loop fails, the whole loop fails (which takes time). Can this process go faster?

JSON columns not recognised Not all column names from the JSON endpoint are recognised by ADF. This is due to the nested arrays. Is there a solution to this?

Thank you!

1

There are 1 best solutions below

5
On

To paginate Http API, you can use Pagination rules by Azure Data Factory rather than ForEach loop.

I took one sample APi and applied Pagination to it

enter image description here

Here I passed range to Offset From 1 and the end I left blank and because I don't know when the number of total rows, an gave offset as 20

  1. No metadata from HTTP API The total number of iterations is needed, based on total rows, which is unavailable.

For this situation in Azure Data Factory Pagination rules there are end condition based on object value:

  • Empty: The pagination ends when the value of the specific node in response is empty.
  • Exist: The pagination ends when the value of the specific node in response exists.
  • Not Exist: The pagination ends when the value of the specific node in response dose not exists.
  • Const: The pagination ends when the value of the specific node in response is a user-defined const value.

Here I used End Condition as $.results.name is no exists.

enter image description here

  1. Method is time consuming Iterating through the loop takes 20 seconds, for 30 rows.

using Azure Data Factory Pagination rules will might reduce the execution time.

enter image description here

  1. JSON columns not recognised Not all column names from the JSON endpoint are recognised by ADF. This is due to the nested arrays. Is there a solution to this?

To avoid this issue use mapping option in copy activity. Import the schemas of the json output and map it properly as below.

enter image description here