Unable to scrape Craigslist with BeautifulSoup

I am new to scraping and just learning. Yesterday I was able to scrape Craigslist with BeautifulSoup; today I am unable to.

Here is my code to scrape the first page of rental housing search results on CL.

from requests import get
from bs4 import BeautifulSoup

#get the first page of the san diego housing prices
url = 'https://sandiego.craigslist.org/search/apa?hasPic=1&availabilityMode=0&sale_date=all+dates'
response = get(url) # link excludes posts with no pictures

html_soup = BeautifulSoup(response.text, 'html.parser')

#get the macro-container for the housing posts
posts = html_soup.find_all('li', class_="result-row")
print(type(posts)) #to double check that I got a ResultSet
print(len(posts)) #to double check I got 120 (elements/page)

The html_soup is not the same as what I see when I open the URL in a browser. Instead it contains the following:

<script>
        window.cl.specialCurtainMessages = {
            unsupportedBrowser: [
                "We've detected you are using a browser that is missing critical features.",
                "Please visit craigslist from a modern browser."
            ],
            unrecoverableError: [
                "There was an error loading the page."
            ]
        };
    </script>
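
For what it's worth, here is a minimal self-contained check (just a diagnostic sketch using the same URL as above, not part of my original script) that prints the HTTP status code, how many result rows were actually parsed, and whether the "curtain" script is present:

from requests import get
from bs4 import BeautifulSoup

url = 'https://sandiego.craigslist.org/search/apa?hasPic=1&availabilityMode=0&sale_date=all+dates'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

print(response.status_code)                            # HTTP status of the response
print(len(soup.find_all('li', class_="result-row")))   # 0 when only the curtain page is returned
print('specialCurtainMessages' in response.text)       # True when the curtain script was served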

Any help would be much appreciated.

I am not sure if I've somehow been 'blocked' from scraping. I read this article about proxies and rotating IP addresses, but I do not want to break the rules if I have been blocked, and I also do not want to spend money on this. Is scraping Craigslist not allowed? I have seen so many educational tutorials on it, so I thought it was okay; one way to check the site's own crawling rules is sketched below.
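
For checking the site's crawling rules programmatically, Python's built-in robots.txt parser can be used; a minimal sketch (the '*' user agent here is just a placeholder):

from urllib.robotparser import RobotFileParser

# read Craigslist's robots.txt and ask whether the search path may be fetched
rp = RobotFileParser()
rp.set_url('https://sandiego.craigslist.org/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://sandiego.craigslist.org/search/apa'))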

There is 1 answer below. It bypasses the rendered HTML page entirely and queries Craigslist's JSON search endpoint (sapi.craigslist.org) directly:

import requests
from pprint import pp


def main(url):
    with requests.Session() as req:

        # query parameters mirroring the original search
        # (apartments/housing, pictures only, all dates)
        params = {
            "availabilityMode": "0",
            "batch": "8-0-360-0-0",
            "cc": "US",
            "hasPic": "1",
            "lang": "en",
            "sale_date": "all dates",
            "searchPath": "apa"
        }
        r = req.get(url, params=params)
        # the endpoint returns JSON; pretty-print only the first listing
        for i in r.json()['data']['items']:
            pp(i)
            break


main('https://sapi.craigslist.org/web/v7/postings/search/full')
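
If the goal is to work with every listing in the batch rather than just the first one, a small variation of the same call collects them into a list (fetch_items is just an illustrative helper name, and the response is assumed to have the same data/items structure as above):

import requests


def fetch_items(url, params):
    # sketch: return the full list of listings from the JSON response
    with requests.Session() as session:
        r = session.get(url, params=params)
        return r.json()['data']['items']


params = {
    "availabilityMode": "0",
    "batch": "8-0-360-0-0",
    "cc": "US",
    "hasPic": "1",
    "lang": "en",
    "sale_date": "all dates",
    "searchPath": "apa"
}
items = fetch_items('https://sapi.craigslist.org/web/v7/postings/search/full', params)
print(len(items))  # number of listings returned in this batch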