Logged into Stackoverflow using scrapy and pyquery but could not do further scraping


I am currently learning web scraping with scrapy and trying out various ways to log in to Stack Overflow and then extract some questions as scraping practice. I have successfully logged in to Stack Overflow using scrapy and pyquery with the following code:

import scrapy
import requests
import getpass
from pyquery import PyQuery
from scrapy import FormRequest
from scrapy.utils.response import open_in_browser


class QuoteSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/users/login']
    
    # def login_page(self):
    # Fetch the fkey
    login_page = requests.get(start_urls[0]).text
    pq = PyQuery(login_page)
    fkey = pq('input[name="fkey"]').val()

    # Prompt for email and password
    email = input("Email: ")
    password = getpass.getpass()

    # Login
    response = requests.post(
        start_urls[0],
        data = {
            'email': email,
            'password': password,
            'fkey': fkey
        })
    print(response)

    def parse(self, response):
        open_in_browser(response)
        
    def get_questions_link(self):
        pass

But the response only gives me a success status code, i.e. 200, when I run the spider with the following command:

scrapy crawl stackoverflow -L WARN
Email: [email protected]
Password: 
<Response [200]>

So how can I get the full HTML of the response page, so that I can scrape more questions/data? The parse function also runs, but it only opens the Stack Overflow login page.

There is 1 best solution below


It looks like you're using the Requests library for the final POST request. The response returned by requests.post() makes the response body available in several ways. See: https://requests.readthedocs.io/en/master/user/quickstart/#response-content. You should first check the status code for a 2XX value via response.status_code. A shortcut is to check response.ok, which is True for any status code below 400. Once you've done that, you can get the response body as text:

response.text

which is what you'll want if you're expecting a web page (HTML) to come back.

If you get back JSON, you can get the data structure decoded from that JSON by calling:

response.json()

If you're not sure what will come back, check the Content-Type header value, available as response.headers['Content-Type'].
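
Putting those checks together, here is a minimal sketch of how they might be combined. The helper name extract_body is made up for illustration and is not part of Requests; it only assumes the object passed in exposes .ok, .status_code, .headers, .text and .json(), as a requests.Response does.

```python
def extract_body(response):
    """Return the body of a Requests-style response (hypothetical helper).

    Assumes `response` has .ok, .status_code, .headers, .text and .json(),
    like the object returned by requests.get() or requests.post().
    """
    # Stop early on 4XX/5XX responses instead of parsing an error page.
    if not response.ok:
        raise RuntimeError(f"request failed with status {response.status_code}")

    # The header may carry parameters, e.g. "text/html; charset=utf-8",
    # so keep only the media type before the first semicolon.
    content_type = response.headers.get("Content-Type", "")
    media_type = content_type.split(";")[0].strip().lower()

    if media_type == "application/json":
        # Decoded JSON: usually a dict or a list.
        return response.json()

    # Anything else (such as an HTML page) is returned as text.
    return response.text
```

With the code from the question, this could be called as extract_body(response) right after the requests.post() call, and the returned HTML string could then be handed to PyQuery just as the login page was. Keep in mind that a 200 status only tells you the HTTP request succeeded; whether the login itself worked has to be judged from the content of the returned page.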