I am currently learning web scraping with Scrapy and trying various methods to log into Stack Overflow so that I can extract some questions for practice. I have successfully logged into Stack Overflow using Scrapy and PyQuery with the following code:
import scrapy
import requests
import getpass
from pyquery import PyQuery
from scrapy import FormRequest
from scrapy.utils.response import open_in_browser

class QuoteSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/users/login']

    # def login_page(self):
    # Fetch the fkey
    login_page = requests.get(start_urls[0]).text
    pq = PyQuery(login_page)
    fkey = pq('input[name="fkey"]').val()

    # Prompt for email and password
    email = input("Email: ")
    password = getpass.getpass()

    # Login
    response = requests.post(
        start_urls[0],
        data={
            'email': email,
            'password': password,
            'fkey': fkey
        })
    print(response)

    def parse(self, response):
        open_in_browser(response)

    def get_questions_link(self):
        pass
But the response only gives me the success status code, i.e. 200. I am running it with the following command:
scrapy crawl stackoverflow -L WARN
Email: [email protected]
Password:
<Response [200]>
So, how can I get the whole HTML page data from the response, so that I can scrape more questions/data? The parse function also works, but it only opens the Stack Overflow login page.
It looks like you're using the Requests library for the final POST request that you're making. The response that comes back from requests.post() makes the body of the response available in a number of ways. See: https://requests.readthedocs.io/en/master/user/quickstart/#response-content. You should check the response code for a 2XX value via response.status_code. A shortcut for that is to just check response.ok. Once you do that, you can get the response body as text via response.text, which is what you'll want if you're expecting a web page (HTML) to come back.
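Putting those pieces together, a sketch of the pattern looks like this. A tiny local server stands in for the login endpoint here purely for illustration, so the snippet runs without hitting the real site; the URL and form fields are made up:

```python
import http.server
import threading

import requests

# Hypothetical stand-in for the login endpoint (illustration only).
class _Handler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)                     # consume the form body
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>logged in</body></html>")

    def log_message(self, *args):                   # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), _Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

response = requests.post(
    f"http://127.0.0.1:{server.server_port}/users/login",
    data={"email": "user@example.com", "password": "secret"},
)
server.shutdown()

print(response.ok)           # True for any 2XX/3XX status
print(response.status_code)  # 200
print(response.text)         # the full HTML body, not just "<Response [200]>"
```

Note that printing the response object itself only shows its repr, which is why you were seeing `<Response [200]>`; response.text is what holds the page.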
If you get back JSON, you can get the resulting data structure expanded from that JSON via response.json().
If you're not sure what to expect will come back, check the Content-Type header value.
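A quick sketch of branching on that header (again with a hand-built Response for illustration):

```python
import requests

# Hand-built Response standing in for a real reply (illustration only).
resp = requests.models.Response()
resp.status_code = 200
resp.headers["Content-Type"] = "text/html; charset=utf-8"
resp._content = b"<html></html>"

content_type = resp.headers.get("Content-Type", "")
if "application/json" in content_type:
    body = resp.json()    # structured data
else:
    body = resp.text      # fall back to plain text/HTML
print(content_type)       # text/html; charset=utf-8
```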