Improving performance when looping through country codes in App Store Scraper


I am using the App Store Scraper to get podcast reviews from the Apple Store. My students and I realised that popular international podcasts naturally get reviews from several countries, so to catch them all we need to loop through the country codes. Since even people in Sweden or Poland may comment on a BBC podcast in English, we did not want to exclude any countries and instead use the whole set, which I have (as a starting point) hard-coded as follows:

# Select country codes
# full list of countries where Apple podcasts are available has been shared on Gitlab
countries=["DZ", "AO", "AI",
"AR", "AM", "AU",
"AT", "AZ", "BH",
"BB", "BY", "BE",
"BZ", "BM", "BO",
"BW", "BR", "VG",
"BN", "BG", "CA",
"KY", "CL", "CN",
"CO", "CR", "HR",
"CY", "CZ", "DK",
"DM", "EC", "EG",
"SV", "EE", "FI",
"FR", "DE", "GH",
"GB", "GR", "GD",
"GT", "GY", "HN",
"HK", "HU", "IS",
"IN", "ID", "IE",
"IL", "IT", "JM",
"JP", "JO", "KE",
"KW", "LV", "LB",
"LT", "LU", "MO",
"MG", "MY", "ML",
"MT", "MU", "MX",
"MS", "NP", "NL",
"NZ", "NI", "NE",
"NG", "NO", "OM",
"PK", "PA", "PY",
"PE", "PH", "PL",
"PT", "QA", "MK",
"RO", "RU", "SA",
"SN", "SG", "SK",
"SI", "ZA", "KR",
"ES", "LK", "SR",
"SE", "CH", "TW",
"TZ", "TH", "TN",
"TR", "UG", "UA",
"AE", "US", "UY",
"UZ", "VE", "VN",
"YE"]

Then I loop through this list to get the reviews, which works OK, but the process is very slow! Whenever a podcast has no reviews in a given country, the App Store Scraper (according to the notifications I get) retries the request 20 times before moving on to the next item, so the loop takes ages. How can I make the process faster, e.g. by forcing the script to move on if the first request is unsuccessful? This is what I have so far:

import os
from pprint import pprint

from app_store_scraper import Podcast

# Set podcast details
app_id = 1614435903
app_name = '28ish-days-later'
# important: country codes will be selected from the list above

# Set output path
# NOTE: `directory` (the base folder) must be defined earlier in the script
path_out = "podcast_reviews"

filename_csv = f'{app_name}_reviews_table.csv'
# os.path.join inserts the path separators that plain string concatenation omits
file_csv = os.path.join(directory, path_out, filename_csv)

# Optional: use sysk.review(how_many=n) to limit output
# otherwise all reviews are fetched

for c in countries:
    # Create class object
    sysk = Podcast(country=c, app_name=app_name, app_id=app_id)
    sysk.review()
    print(f"No. of reviews found for country {c}:")
    #pprint(sysk.reviews)
    pprint(sysk.reviews_count)

    # NOTE: the review count seen on the landing page differs from the actual number of reviews fetched.
    # This is simply because only some users who rated the app also leave reviews.
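
The snippet above builds file_csv but never writes to it. Below is a minimal sketch of how the collected reviews could be saved, assuming that sysk.reviews is a list of dictionaries (one per review). The function name write_reviews_csv and the extra "country" column are my own choices for illustration and are not part of the App Store Scraper.

```python
import csv


def write_reviews_csv(reviews_by_country, file_obj):
    """Write one row per review, with the country code as the first column.

    reviews_by_country: dict mapping a country code to a list of review dicts
    (assumed shape of sysk.reviews). Returns the number of rows written.
    """
    # Collect every key that occurs, so the header covers all reviews.
    fieldnames = ["country"]
    for reviews in reviews_by_country.values():
        for review in reviews:
            for key in review:
                if key not in fieldnames:
                    fieldnames.append(key)

    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    row_count = 0
    for country, reviews in reviews_by_country.items():
        for review in reviews:
            # Keys missing from a review are left as empty cells.
            writer.writerow({"country": country, **review})
            row_count += 1
    return row_count
```

Inside the loop one would collect reviews_by_country[c] = sysk.reviews, and afterwards open the file with open(file_csv, "w", newline="", encoding="utf-8") and pass the file object to write_reviews_csv.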

The notification I get in the output when no reviews are found is this:

ERROR:Base:Something went wrong: HTTPSConnectionPool(host='amp-api.podcasts.apple.com', port=443): Max retries exceeded with url: /v1/catalog/dz/podcasts/1614435903/reviews?l=en-GB&offset=0&limit=20 (Caused by ResponseError('too many 404 error responses'))

No. of reviews found for country DZ:
0

My first attempt was to include a try and except, but that does not stop the script from attempting the max retries before raising the error, so I got rid of it. Perhaps it is possible to give the script a "how_many=1" limitation for all country codes and write only the ones that retrieve a result to a new list before starting the loop. I will post this as an answer if it works.
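
To illustrate why the try and except did not help, here is a toy model. It is a pure-Python simulation and not the scraper's actual code: the hypothetical function fetch_with_retries stands in for a library call that retries internally and raises only once the last attempt has failed, and the default of 20 attempts is taken from the notifications I saw.

```python
class MaxRetriesExceeded(Exception):
    """Hypothetical stand-in for the error raised after the last retry."""


def fetch_with_retries(send, max_retries=20):
    """Call send() until it succeeds; raise only after all attempts failed."""
    for attempt in range(1, max_retries + 1):
        if send():
            return attempt
    raise MaxRetriesExceeded(f"gave up after {max_retries} attempts")


attempts = []


def always_404():
    attempts.append(1)  # record each simulated request
    return False        # simulate a 404 response


try:
    fetch_with_retries(always_404)
except MaxRetriesExceeded:
    # Control reaches this point only after every retry has already run.
    pass

print(len(attempts))  # prints 20
```

The except clause only sees the final error, so the time has already been spent. The retry count would have to be lowered inside the call itself, or the call avoided altogether for countries without reviews.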

There is 1 solution below:

Based on the discussion with @buran above, here is my solution for checking the existence of reviews via requests first before feeding a much shorter country code list into the app store scraper:

## check if countries have reviews at all
import requests

# URL for the podcasts
url = "https://podcasts.apple.com/{}/podcast/28ish-days-later/id1614435903?see-all=reviews"

# create a list to store countries with reviews
countries_reviewed = []

# iterate over the list of country codes
for country_code in countries:
    # format the URL with the current country code
    url_new = url.format(country_code)

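    # send HTTP GET request; the timeout stops a stalled connection from hanging the loop
    try:
        response = requests.get(url_new, timeout=10)
    except requests.RequestException as exc:
        print(f"Request failed for country code '{country_code}': {exc}")
        continue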

    # check if the request was successful
    if response.status_code == 200:
        # check if the specified review string is present in HTML
        if '"@type":"Review"' in response.text:
            # if present, add the country code to the list
            countries_reviewed.append(country_code)
            print(f"The podcast has reviews for country code '{country_code}'.")
        else:
            print(f"No reviews found for country code '{country_code}'.")
    else:
        print(f"Failed to retrieve the content for country code '{country_code}'. Status code: {response.status_code}")

# Print the final list of countries with reviews
print("Countries with reviews:", countries_reviewed)

This gives me the following output in a short amount of time (only the last part of the log is shown here):

No reviews found for country code 'LV'.
No reviews found for country code 'LB'.
No reviews found for country code 'LT'.
No reviews found for country code 'LU'.
No reviews found for country code 'MO'.
No reviews found for country code 'MG'.
No reviews found for country code 'MY'.
No reviews found for country code 'ML'.
No reviews found for country code 'MT'.
No reviews found for country code 'MU'.
No reviews found for country code 'MX'.
No reviews found for country code 'MS'.
No reviews found for country code 'NP'.
The podcast has reviews for country code 'NL'.
No reviews found for country code 'NZ'.
No reviews found for country code 'NI'.
No reviews found for country code 'NE'.
No reviews found for country code 'NG'.
No reviews found for country code 'NO'.
No reviews found for country code 'OM'.
No reviews found for country code 'PK'.
No reviews found for country code 'PA'.
No reviews found for country code 'PY'.
No reviews found for country code 'PE'.
No reviews found for country code 'PH'.
No reviews found for country code 'PL'.
No reviews found for country code 'PT'.
No reviews found for country code 'QA'.
No reviews found for country code 'MK'.
No reviews found for country code 'RO'.
No reviews found for country code 'RU'.
No reviews found for country code 'SA'.
No reviews found for country code 'SN'.
No reviews found for country code 'SG'.
No reviews found for country code 'SK'.
No reviews found for country code 'SI'.
No reviews found for country code 'ZA'.
No reviews found for country code 'KR'.
No reviews found for country code 'ES'.
No reviews found for country code 'LK'.
No reviews found for country code 'SR'.
The podcast has reviews for country code 'SE'.
The podcast has reviews for country code 'CH'.
No reviews found for country code 'TW'.
No reviews found for country code 'TZ'.
No reviews found for country code 'TH'.
No reviews found for country code 'TN'.
No reviews found for country code 'TR'.
No reviews found for country code 'UG'.
No reviews found for country code 'UA'.
No reviews found for country code 'AE'.
The podcast has reviews for country code 'US'.
No reviews found for country code 'UY'.
No reviews found for country code 'UZ'.
No reviews found for country code 'VE'.
No reviews found for country code 'VN'.
No reviews found for country code 'YE'.
Countries with reviews: ['AU', 'CA', 'DK', 'GB', 'NL', 'SE', 'CH', 'US']
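
If the pre-check itself becomes the bottleneck, the per-country requests could probably be run in parallel. The following is only a sketch under assumptions: it uses the standard library (urllib and a thread pool) instead of requests, the names fetch_page, has_reviews and countries_with_reviews are made up for this example, and the User-Agent header and the default of 8 workers are guesses rather than tested values.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

URL_TEMPLATE = ("https://podcasts.apple.com/{}/podcast/"
                "28ish-days-later/id1614435903?see-all=reviews")
REVIEW_MARKER = '"@type":"Review"'


def fetch_page(country_code, timeout=10):
    """Return the page HTML for one country, or None if the request fails."""
    request = Request(URL_TEMPLATE.format(country_code),
                      headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return None


def has_reviews(html):
    """True if the page contains the review marker used in the loop above."""
    return html is not None and REVIEW_MARKER in html


def countries_with_reviews(countries, fetch=fetch_page, max_workers=8):
    """Check all countries concurrently and keep those that have reviews."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input list.
        pages = list(pool.map(fetch, countries))
    return [c for c, page in zip(countries, pages) if has_reviews(page)]
```

Calling countries_reviewed = countries_with_reviews(countries) would then replace the sequential loop. The fetch parameter exists so that the filtering logic can be tried with a fake fetcher, without any network access. Too many parallel requests might trigger rate limiting, so a modest max_workers seems safer.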