python - BeautifulSoup and requests do not produce expected results with .findAll()

I have been writing a piece of code that retrieves a list of items and their corresponding prices from the Steam Marketplace (for the game Unturned). I am using BeautifulSoup (bs4) and the requests library. This is my code so far:

import requests
from bs4 import BeautifulSoup

items = []
prices = []
for page_num in range(1,10):
    website = 'http://steamcommunity.com/market/search?appid=304930#p'+str(page_num)+'_popular_desc'
    r = requests.get(website)
    doc = r.text.split('\n')
    soup = BeautifulSoup(''.join(doc), "html.parser")

    names = soup.findAll("span", { "class" : "market_listing_item_name" })
    for item in range(len(names)):
        items.append(names[item].contents[0])

    costs = soup.findAll("span", { "class" : "normal_price" })
    for cost in range(len(costs)):
        prices.append(costs[cost].contents[0])

Expected Output:

Festive Gift Present :  $0.32 USD
Halloween Gift Present :  $0.26 USD
Carbon Fiber Mystery Box :  $0.47 USD
Festive Hat :  $1.67 USD
Nuclear Matamorez :  $0.39 USD
... and so on

The problem with this code is that it only gets the names from the first page. If I type the URL manually into a browser with different numbers in place of page_num, the page changes and so does the HTML document. However, the code doesn't get the results from the second page onward. requests is fetching the correct URL each time, so why is the returned HTML always the same?

There is 1 best solution below

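Pages 2, 3, and so on are requested via AJAX (or similar), so their HTML isn't present in the document you get when you first load the page. The part of your URL after the # is a fragment, which is handled by the browser and never sent to the server, so requests receives the same first page every time. To work around this, we can find the AJAX URL (for example, in the browser's network tab) and parse its response directly. In this case the response is JSON-encoded, with the listing HTML in its results_html field, i.e.: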

import requests
from bs4 import BeautifulSoup

output = ""
items = []
prices = []
page_size = 10
for start in range(0, 100, page_size):
    # start is the offset of the first result; count is the number of results
    # per request (not the end offset), so it stays fixed at page_size
    url = "http://steamcommunity.com/market/search/render/?query=&start={}&count={}&search_descriptions=0&sort_column=popular&sort_dir=desc&appid=304930".format(start, page_size)
    jsonCode = requests.get(url).json()
    output += jsonCode['results_html']

# parse the concatenated HTML of all pages in one pass
soup = BeautifulSoup(output, "html.parser")

names = soup.findAll("span", { "class" : "market_listing_item_name" })
for name in names:
    items.append(name.contents[0])

costs = soup.findAll("span", { "class" : "normal_price" })
for cost in costs:
    if "Starting at" not in cost.contents[0]: # skip the "Starting at" label span, we just get the first price
        prices.append(cost.contents[0])

print(items)
['Festive Gift Present', 'Halloween Gift Present', 'Hypertech Timberwolf', 'Holiday Scarf', 'Chill Honeybadger', etc...]
print(prices)
['$0.34 USD', '$0.28 USD', '$1.77 USD', '$0.31 USD', '$0.65 USD', etc...]
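
To get from these two lists to the "name :  price" lines shown in the question's expected output, one option is to zip them together. This is only a minimal sketch: format_listings is a hypothetical helper name, and it assumes items and prices come out in the same order and with the same length, which the scraping code above does not guarantee.

```python
def format_listings(items, prices):
    """Pair each item name with its price, in order, as 'name :  price' lines."""
    # zip stops at the shorter list, so if a price is missing the
    # trailing names are dropped rather than mismatched with an error
    return ["{} :  {}".format(name, price) for name, price in zip(items, prices)]


# Sample values taken from the lists printed above
sample_items = ["Festive Gift Present", "Halloween Gift Present"]
sample_prices = ["$0.34 USD", "$0.28 USD"]
for line in format_listings(sample_items, sample_prices):
    print(line)
```

If the two lists can differ in length, it is probably safer to walk each listing row and read the name and price from the same row instead of zipping two separate lists.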

PS: Steam will temporarily ban your IP after ~50 requests.
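
One way to stay under that limit is to pause between requests and cap how many are sent in one run. The sketch below is an assumption-laden illustration, not a known-safe recipe: fetch_pages is a hypothetical helper, and the 2-second delay and the cap of 40 requests are guesses, since the only figure given here is the rough "~50 requests" above.

```python
import time


def fetch_pages(starts, fetch, delay_seconds=2.0, max_requests=40, sleep=time.sleep):
    """Call fetch(start) for each offset, pausing between calls and capping the total.

    delay_seconds and max_requests are assumed values, not documented Steam limits.
    """
    results = []
    for i, start in enumerate(starts):
        if i >= max_requests:
            break  # stop before exceeding the assumed request budget
        if i > 0:
            sleep(delay_seconds)  # pause between requests, not before the first one
        results.append(fetch(start))
    return results


# Example with a stand-in fetch function, so nothing is sent over the network
pages = fetch_pages(range(0, 30, 10), lambda start: "page at {}".format(start), delay_seconds=0)
print(pages)
```

In the scraper above, fetch would be a small function that requests the render URL for a given start and returns the results_html field of the JSON response.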