I want to get the href attribute of every a element on the webpage https://learningenglish.voanews.com/z/1581:
from lxml import html
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(channel='chrome', headless=True)
    page = browser.new_page()
    url = "https://learningenglish.voanews.com/z/1581"
    page.goto(url, wait_until="networkidle")
    doc = html.fromstring(page.content())
    elements = doc.xpath('//div[@class="media-block__content"]//a')
    for e in elements:
        print(e.attrib['href'])
This prints the href of every a element. I then tried to do the same thing with pure Playwright code, but it failed:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(channel='chrome', headless=True)
    page = browser.new_page()
    url = "https://learningenglish.voanews.com/z/1581"
    page.goto(url, wait_until="networkidle")
    elements = page.locator('//div[@class="media-block__content"]//a')
    for e in elements:
        print(e.get_attribute('href'))
It raises this error:
TypeError: 'Locator' object is not iterable
How can I fix it?
You can use evaluate_all:
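A minimal sketch of that approach follows. It assumes `.media-block__content a` is an acceptable CSS translation of the XPath in the question (it is slightly broader, since it matches any element carrying that class rather than only a `div` whose class attribute is exactly that string). The names `collect_hrefs` and `main` are chosen for illustration and are not part of Playwright.

```python
URL = "https://learningenglish.voanews.com/z/1581"
SELECTOR = ".media-block__content a"  # assumed CSS equivalent of the question's XPath


def collect_hrefs(page, selector=SELECTOR):
    """Return the href attribute of every element matching selector."""
    # evaluate_all passes all matching elements to the JS callback at once
    # and returns the callback's result as a plain Python list.
    return page.locator(selector).evaluate_all(
        "els => els.map(el => el.getAttribute('href'))"
    )


def main():
    # Imported here so collect_hrefs can be used with any page-like object.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(URL, wait_until="domcontentloaded")
        for href in collect_hrefs(page):
            print(href)
        browser.close()
```

Calling `main()` should print one href per line. Because the mapping happens inside the browser, the script never iterates over the `Locator` itself, which is what raised the `TypeError` in the question.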
Note that I've used a CSS selector rather than XPath. XPath expressions are generally harder to read and maintain, and usually aren't necessary.
Also, the data you want can be gathered without waiting for network idle, so "domcontentloaded" will run a bit faster.
In fact, since the data you want is baked into the static HTML, you probably don't need to run JS, allow requests beyond the root HTML document, or even use Playwright at all. A simple HTTP request and BeautifulSoup are sufficient to get the data:
Playwright:
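One way the leaner Playwright version might look, as a sketch rather than a definitive implementation: JavaScript is disabled on the browser context and every request other than the root document is aborted. The helper name `make_route_handler` and the CSS selector are assumptions introduced for this example.

```python
URL = "https://learningenglish.voanews.com/z/1581"
SELECTOR = ".media-block__content a"  # assumed CSS equivalent of the question's XPath


def make_route_handler(root_url):
    """Build a route handler that lets only the root document through."""

    def handle(route):
        if route.request.url == root_url:
            route.continue_()
        else:
            # Scripts, images, stylesheets, etc. are not needed for the links.
            route.abort()

    return handle


def main():
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        # Disable page JavaScript; the links are in the static HTML.
        context = browser.new_context(java_script_enabled=False)
        page = context.new_page()
        page.route("**/*", make_route_handler(URL))
        page.goto(URL, wait_until="domcontentloaded")
        hrefs = page.locator(SELECTOR).evaluate_all(
            "els => els.map(el => el.getAttribute('href'))"
        )
        for href in hrefs:
            print(href)
        browser.close()
```

Calling `main()` should print the same links as before while loading far fewer resources. If the output comes back empty, the page may depend on scripts after all, and re-enabling JavaScript would be the first thing to try.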
Requests/BeautifulSoup:
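A sketch of the browser-free version, assuming the `requests` and `beautifulsoup4` packages are installed and that the server returns the same static HTML to a plain HTTP client. The function names and the 30-second timeout are illustrative choices.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://learningenglish.voanews.com/z/1581"
SELECTOR = ".media-block__content a"  # assumed CSS equivalent of the question's XPath


def extract_hrefs(markup, selector=SELECTOR):
    """Parse an HTML string and return the href of each matching element."""
    soup = BeautifulSoup(markup, "html.parser")
    # .get returns None for a matching element that has no href attribute.
    return [a.get("href") for a in soup.select(selector)]


def main():
    response = requests.get(URL, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    for href in extract_hrefs(response.text):
        print(href)
```

Calling `main()` fetches the page once and prints each href. Keeping the parsing in `extract_hrefs` makes it easy to check the selector against a saved copy of the HTML without hitting the network.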