How to get historical question and answer data from the Stack Exchange API (Stack Overflow) without being throttled?

I am trying to read all AAD-related questions and answers from the Stack Exchange API by passing a set of tags to /2.2/search/advanced/pagesize=100&fromdate=2019-07-01&todate=2020-10-19&site=stackoverflow&filter=!BLIw93LDFyFBUjlepdSTkMo7r6Pkpx&q=listOfTags. We need the data from July 1st, 2019 onward.

Our ADF pipeline keeps getting throttled, even when we set the wait time to 1 minute. As a result our ETL is very slow and seems to run forever.

Current Approach (very slow)

  1. I am using ADF to pull all the questions that match the tags (iterating page by page with an Until activity) and load the data into SQL.

  2. Pass each question id to this API https://api.stackexchange.com/docs/answers-on-questions#order=desc&sort=activity&ids=29433422&filter=!0U7YRMKgNJq(Exonzn(PdiZE5&site=stackoverflow&run=true to get all the answers for that question, then load the result into SQL.

Questions:

  1. Is there a direct back end (Kusto, SQL, Cosmos, etc.) we can get the data from, rather than calling the API for the questions and answers? If so, how do we get access to it?

  2. What is an efficient approach to pull the historical data from Stack Overflow without being throttled?

There are 2 best solutions below

0
On

You are probably being throttled because you have made 300 requests (the daily maximum without a key) or because the URL is invalid. FWIW, registering your application on StackApps increases your daily API quota from 300 to 10,000! You can then pass the key as a parameter: &key=.... Now, regarding the URL:

  • You are using .../advanced/pagesize=100.... It should be /advanced?pagesize=100&param=value....
  • You are passing dates as YYYY-MM-DD. They should be in Unix epoch time! In your case, fromdate should be 1561939200 and todate 1603065600. (Note: if you want to fetch results up to today, you can omit todate.)
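As a rough illustration of both fixes, here is a minimal Python sketch that converts the dates to epoch seconds (treating them as midnight UTC) and builds the query string after a `?`. The helper names `to_epoch` and `build_search_url` are made up for this example, and the optional `key` argument stands for whatever key you get from StackApps.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.2"


def to_epoch(date_str):
    """Convert a YYYY-MM-DD date, taken as midnight UTC, to Unix epoch seconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def build_search_url(fromdate, todate, query, page=1, key=None):
    """Build a /search/advanced URL (hypothetical helper, not part of any library)."""
    params = {
        "page": page,
        "pagesize": 100,
        "fromdate": to_epoch(fromdate),
        "todate": to_epoch(todate),
        "site": "stackoverflow",
        "q": query,
    }
    if key:
        # Only sent when you have registered an app and received a key.
        params["key"] = key
    # Parameters go after "?", not after another "/".
    return API_ROOT + "/search/advanced?" + urlencode(params)


print(build_search_url("2019-07-01", "2020-10-19", "listOfTags"))
```

The same conversion can of course be done inside ADF with its own expression functions; the point is only that fromdate and todate end up as integers.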

I'm not sure I understand exactly what you're trying to do. However, if the API is suitable for your task, you don't need such a long delay; less than 1 second between requests should probably be enough. What you should do is check whether the backoff field exists in each API response. If it does, wait that many seconds before making the next request.
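A minimal sketch of that loop in Python, assuming a hypothetical `fetch_page(page)` callable that performs the HTTP request and returns the parsed JSON response as a dict. The `items`, `has_more` and `backoff` fields come from the API's response wrapper; everything else here is illustrative.

```python
import time


def fetch_all_pages(fetch_page, sleep=time.sleep, max_pages=None):
    """Collect items page by page, pausing only when the API asks for it.

    fetch_page is a hypothetical callable: page number in, response dict out.
    """
    items = []
    page = 1
    while True:
        response = fetch_page(page)
        items.extend(response.get("items", []))

        # If the response carries a backoff value, wait that many seconds
        # before the next request instead of using a fixed 1-minute delay.
        backoff = response.get("backoff")
        if backoff:
            sleep(backoff)

        if not response.get("has_more"):
            break
        if max_pages is not None and page >= max_pages:
            break
        page += 1
    return items
```

In an ADF pipeline the equivalent would be a Wait activity inside the Until loop whose duration is taken from the backoff value of the previous response, when present.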

With regards to your first question... how about SEDE (the Stack Exchange Data Explorer)? You can run SQL queries there against any site you want and get the results in CSV format. Here is the help page and you can find the public schema in this Meta Stack Exchange question. If you encounter difficulties, feel free to ask a new question.

0
On

#1) I doubt there is anything like that.

#2) Throttling happens per client IP, so you could try deploying the same ADF pipeline in different regions; that may help. If you go that route, you will have to give each region's API calls a different date filter, so that no two regions query the same set of data.
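One way to picture the date-filter split is the Python sketch below, which cuts an epoch range into contiguous windows, one per region. `split_range` is a made-up helper, and the example bounds are the epoch values for 2019-07-01 and 2020-10-19. Adjacent windows share a boundary second, and how the API treats boundary values has not been verified here, so deduplicating on question id when loading into SQL is a sensible safeguard.

```python
def split_range(start_epoch, end_epoch, parts):
    """Split an epoch range into `parts` contiguous (fromdate, todate) windows."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    span = end_epoch - start_epoch
    # Integer arithmetic keeps every boundary a whole number of seconds.
    bounds = [start_epoch + span * i // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(parts)]


# Example: three pipelines, each given its own window of the full range.
for fromdate, todate in split_range(1561939200, 1603065600, 3):
    print(fromdate, todate)
```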