I am trying to scrape HTML from a site that requires a login but am not getting any data


I am following this tutorial, but I can't get any data back when I run the Python script. The response has an HTTP status code of 200 and result.ok is True, yet the list of repository names comes back empty. Any help would be great. This is what the output looks like in the terminal:

[]

200

True

import requests
from lxml import html

USERNAME = "[email protected]"
PASSWORD = "legitpassword"

LOGIN_URL = "https://bitbucket.org/account/signin/?next=/"
URL = "https://bitbucket.org/dashboard/overview"

def main():
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]

    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "csrfmiddlewaretoken": authenticity_token
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload, headers=dict(referer=LOGIN_URL))

    # Scrape url
    result = session_requests.get(URL, headers=dict(referer=URL))
    tree = html.fromstring(result.content)
    bucket_elems = tree.findall(".//span[@class='repo-name']")
    bucket_names = [bucket_elem.text_content().replace("\n", "").strip() for bucket_elem in bucket_elems]

    # Print the scraped names, then the status of the last request
    print(bucket_names)
    print(result.status_code)
    print(result.ok)

if __name__ == '__main__':
    main()

There is 1 best solution below


The XPath is wrong: there is no span with the class repo-name, which is why the list comes back empty even though the request succeeds. You can get the repo names from the anchor tags with:

bucket_elems = tree.xpath("//a[@class='execute repo-list--repo-name']")
bucket_names = [bucket_elem.text_content().strip() for bucket_elem in bucket_elems]
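
To see why the old selector yields [] while the new one matches, here is a minimal sketch that runs both against a small sample. The markup and the names SAMPLE and names() are invented for illustration and may not match the real dashboard HTML. It uses the standard library's xml.etree.ElementTree in place of lxml so it runs without extra dependencies; the path syntax used here is the same one tree.findall accepts in lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE = """
<html><body>
  <ul>
    <li><a class="execute repo-list--repo-name" href="/user/project-one">
      project-one
    </a></li>
    <li><a class="execute repo-list--repo-name" href="/user/project-two">project-two</a></li>
  </ul>
</body></html>
"""


def names(tree, path):
    """Return the stripped text of every element matching path."""
    # ElementTree has no text_content(); joining itertext() is the equivalent.
    return ["".join(elem.itertext()).strip() for elem in tree.findall(path)]


tree = ET.fromstring(SAMPLE.strip())

# Selector from the question: no span carries class="repo-name".
old_names = names(tree, ".//span[@class='repo-name']")

# Selector from this answer: matches the anchors by their full class value.
new_names = names(tree, ".//a[@class='execute repo-list--repo-name']")

print(old_names)
print(new_names)
```

Note that @class='...' compares the whole attribute string, so the match breaks if the site adds, removes, or reorders a class on those anchors.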

The HTML has evidently changed since the tutorial was written.
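
Since the markup can change again, one way to find the current selector is to list which tag and class combinations actually appear in the page you got back. The following is only a sketch under assumptions: ClassCounter and SAMPLE are hypothetical names, the fragment is invented, and it uses Python 3's standard html.parser rather than lxml.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count every (tag, class attribute) pair seen in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        class_value = dict(attrs).get("class")
        if class_value:
            self.counts[(tag, class_value)] += 1


# Hypothetical fragment, invented for illustration.
SAMPLE = (
    '<div class="repo-list">'
    '<a class="execute repo-list--repo-name" href="#">one</a>'
    '<a class="execute repo-list--repo-name" href="#">two</a>'
    "</div>"
)

counter = ClassCounter()
counter.feed(SAMPLE)
class_counts = dict(counter.counts)

# Most frequent pairs first; repeated list items tend to stand out.
for (tag, class_value), n in counter.counts.most_common():
    print(tag, repr(class_value), n)
```

In the script from the question you would feed it result.text instead of SAMPLE. If the class you expect is missing, the page is probably not the one you think it is (for example, a login page served with status 200).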