Is there a way to set the pyquery user agent string?


When scraping HTML with pyquery, is there a way to set the browser string when I retrieve the page?

import pyquery
pqobj = pyquery.PyQuery(url="https://www.google.com/")
html = pqobj.html()
plain_text = pqobj.text()

The browser string can be anything; I just want my scrape to look like it comes from a real web browser.

Mike Pennington

How do I set the browser string?

Use the headers option when you construct the PyQuery object:

import pyquery
pqobj = pyquery.PyQuery(
    url="https://www.cisco.com/",
    headers={"user-agent": "Foo Browser version 0.1"})  # <--
html = pqobj.html()
plain_text = pqobj.text()

# Example: collect the href of every <a> tag inside <body>
all_hrefs = []
# "body a" matches every <a> descendant of <body>, at any depth
for anchor in pqobj("body a"):
    # Iterating a PyQuery object yields lxml elements;
    # .get() returns None for anchors that have no href
    href = anchor.get("href")
    if href is not None:
        all_hrefs.append(href)
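To check that a custom user agent really reaches the server, one option is a throwaway local server that echoes the header back. The sketch below is only an illustration under a few assumptions: it uses the Python standard library rather than pyquery to send the request, and the names EchoUserAgent and fetch_html are made up for this example. It shows the header mechanism, not pyquery's own request path.

```python
import html
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgent(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the User-Agent it received."""

    def do_GET(self):
        agent = html.escape(self.headers.get("User-Agent", ""))
        body = f"<html><body><p id='ua'>{agent}</p></body></html>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_html(url, user_agent):
    """Hypothetical helper: GET url with the given User-Agent, return the HTML."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # An empty ProxyHandler skips any system proxy, since this demo
    # only talks to a server on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=10) as response:
        return response.read().decode("utf-8")


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), EchoUserAgent)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    local_url = f"http://127.0.0.1:{server.server_port}/"
    page_html = fetch_html(local_url, "Foo Browser version 0.1")
finally:
    server.shutdown()
    server.server_close()

print(page_html)
```

While the server is still running, the same local URL could be passed to pyquery.PyQuery(url=..., headers=...) to see what pyquery sends in your environment. The returned string can also be parsed directly with pyquery.PyQuery(page_html).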