I am trying to read all AAD-related questions and answers from the Stack Exchange API endpoint /2.2/search/advanced/pagesize=100&fromdate=2019-07-01&todate=2020-10-19&site=stackoverflow&filter=!BLIw93LDFyFBUjlepdSTkMo7r6Pkpx&q=listOfTags
by passing a set of tags. We want the data from July 1st, 2019 onward.
Our ADF pipeline keeps getting throttled, even when we set the wait time to 1 minute, and our ETL is very slow; it runs forever.
Current Approach (very slow)
I am using ADF to pull all the questions that match the tags (iterating page by page with an Until activity) and load the data into SQL.
Pass each question ID to this API https://api.stackexchange.com/docs/answers-on-questions#order=desc&sort=activity&ids=29433422&filter=!0U7YRMKgNJq(Exonzn(PdiZE5&site=stackoverflow&run=true to get all the answers for the respective question, then load the result into SQL.
Questions:
Is there a direct back-end (Kusto, SQL, Cosmos, etc.) we can get the data from, rather than calling the API to get the questions and answers? If so, how do we get access to the back-end?
What is an efficient approach to pull the historical data from Stack Overflow without being throttled?
You are being throttled either because you have probably made 300 requests (the maximum number of calls per day without a key) or because the URL is invalid. FWIW, registering your application on StackApps increases your API quota from 300 to 10,000! You can then pass the key as a parameter: &key=...
Now, regarding the URL:
The path is written as .../advanced/pagesize=100... It should be /advanced?pagesize=100&param=value...
The dates are given as YYYY-MM-DD. They should be in Unix epoch time! In your case fromdate should be 1561939200 and todate 1603065600. (Note: if you want to fetch results until today, then you can omit this parameter.)
I'm not sure I understand what you're trying to do. However, if the API is suitable for your task, then you don't need such a big delay. It probably should be < 1 sec. What you should do is check whether a backoff field exists in the API response. If it does, then wait that many seconds before proceeding.
With regards to your first question... how about SEDE (the Stack Exchange Data Explorer)? You can run SQL queries for any site you want there and get the results in CSV format. SEDE has a help page, and the public schema is described in a Meta Stack Exchange question. If you encounter difficulties, feel free to ask a new question.
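To illustrate the URL, date, and backoff points, here is a minimal Python sketch rather than an ADF pipeline definition. The function names (to_epoch, build_search_url, fetch_json, fetch_all) are made up for this example, the dates are assumed to mean midnight UTC, and the custom filter from the question is left out on the assumption that the default filter returns the items, has_more, and backoff fields. Treat it as a starting point, not a tested client.

```python
import gzip
import json
import time
import urllib.parse
import urllib.request
from datetime import datetime, timezone

API_ROOT = "https://api.stackexchange.com/2.2"


def to_epoch(date_str):
    """Convert YYYY-MM-DD (assumed to be midnight UTC) to Unix epoch seconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def build_search_url(query, fromdate, todate=None, page=1, key=None):
    """Build a /search/advanced URL; note the '?' before the first parameter."""
    params = {
        "pagesize": 100,
        "page": page,
        "fromdate": to_epoch(fromdate),
        "site": "stackoverflow",
        "q": query,
    }
    if todate is not None:  # omit todate to fetch results up to today
        params["todate"] = to_epoch(todate)
    if key is not None:  # key obtained by registering an app on StackApps
        params["key"] = key
    return API_ROOT + "/search/advanced?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """GET a URL and parse the JSON body, decompressing it if it is gzipped."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        body = response.read()
    if body[:2] == b"\x1f\x8b":  # gzip magic number
        body = gzip.decompress(body)
    return json.loads(body)


def fetch_all(query, fromdate, todate=None, key=None,
              fetch=fetch_json, sleep=time.sleep):
    """Collect items from every page, waiting only when the API asks for it."""
    items = []
    page = 1
    while True:
        data = fetch(build_search_url(query, fromdate, todate, page, key))
        items.extend(data.get("items", []))
        if "backoff" in data:
            # The response says how many seconds to wait before the next call.
            sleep(data["backoff"])
        if not data.get("has_more"):
            return items
        page += 1
```

For example, to_epoch("2019-07-01") gives 1561939200, matching the fromdate value above. The fetch and sleep arguments are injectable only so the paging loop can be tried without network access; in normal use the defaults are enough.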
References: