I need to call the Users endpoint of the SAP API once per user to retrieve a custom column. The issue is that I have over 25,000 users, and each API call must pass a single user ID as a parameter.
Currently my code takes about 8 minutes per 500 records, which works out to more than 6 hours for the full set. I'm struggling to find a much more efficient approach, as I can't wait that long for this process to run.
I have tried using Spark, asyncio and aiohttp, but they don't seem to offer much of a performance improvement. Any guidance would be greatly appreciated.
I understand that iterating through each user ID is not efficient, but I can't find another option, as the data must come from this specific endpoint.
In the code below, `user_df` is a DataFrame that contains all of the user data. I iterate through the user IDs in this DataFrame and pass each ID as a parameter in the endpoint call.
```python
import pandas as pd
import requests
from concurrent.futures import ThreadPoolExecutor

# `headers` (auth headers) and `user_df` are defined elsewhere.
def find_cust_col(user_id):
    cust_col_endpoint = f"https://test.com/Users('{user_id}')"
    response = requests.get(cust_col_endpoint, headers=headers)
    if response.status_code == 200:
        data = response.json()
        # Pick the value of custom column number 10, if present.
        custom_data = next((item['value'] for item in data['customColumn'] if item['columnNumber'] == 10), None)
        return user_id, custom_data
    print(f"No data found for user {user_id}, status code {response.status_code}")
    return None

with ThreadPoolExecutor() as executor:
    results = list(executor.map(find_cust_col, user_df['user_id']))

# Drop users whose request failed.
results = [result for result in results if result is not None]
result_df = pd.DataFrame(results, columns=['user_id', 'cust_col'])
```
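
For comparison, here is a minimal sketch of one variation on the code above. It makes two changes: all calls go through one shared session object, so connections can be reused instead of reopened per request, and the worker count is set explicitly. The helper names (`extract_cust_col`, `fetch_cust_col`, `fetch_all`), the `base_url` parameter, the 30-second timeout and the default of 10 workers are assumptions for illustration, not part of the original code. Whether this helps depends on whether the time is spent on connection setup or on the server itself, which isn't known here.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_cust_col(data, column_number=10):
    """Return the value of the custom column with this number, or None."""
    return next(
        (item['value'] for item in data.get('customColumn', [])
         if item.get('columnNumber') == column_number),
        None,
    )


def fetch_cust_col(session, base_url, user_id, timeout=30):
    # `session` is assumed to be a requests.Session (or any object with a
    # compatible .get method); reusing it keeps connections alive.
    response = session.get(f"{base_url}/Users('{user_id}')", timeout=timeout)
    if response.status_code != 200:
        return None
    return user_id, extract_cust_col(response.json())


def fetch_all(session, base_url, user_ids, max_workers=10):
    # 10 workers matches the default connection pool size of requests;
    # a larger value would need a larger pool to benefit from reuse.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = executor.map(
            lambda uid: fetch_cust_col(session, base_url, uid), user_ids
        )
        # Drop users whose request did not return status 200.
        return [r for r in results if r is not None]


# Hypothetical usage, assuming `headers` and `user_df` exist as in the question:
#   import requests
#   with requests.Session() as session:
#       session.headers.update(headers)
#       rows = fetch_all(session, 'https://test.com', user_df['user_id'])
```

The returned list of `(user_id, value)` tuples can be passed to `pd.DataFrame(rows, columns=['user_id', 'cust_col'])` as before. Sharing one session across threads is a common pattern, but it is an assumption here that it is safe for this usage.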