How can I rewrite this code using only Playwright?


I want to get the href attribute of every a element on the page https://learningenglish.voanews.com/z/1581:

from lxml import html
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(channel='chrome',headless=True)
    page = browser.new_page()
    url = "https://learningenglish.voanews.com/z/1581"
    page.goto(url,wait_until= "networkidle")
    doc = html.fromstring(page.content())
    elements = doc.xpath('//div[@class="media-block__content"]//a')
    for e in elements:
        print(e.attrib['href']) 

This prints the href of every matching a element. I tried to do the same thing using only Playwright, but it failed:

from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(channel='chrome',headless=True)
    page = browser.new_page()
    url = "https://learningenglish.voanews.com/z/1581"
    page.goto(url,wait_until= "networkidle")
    elements = page.locator('//div[@class="media-block__content"]//a')
    for e in elements:
        print(e.get_attribute('href'))

It raises this error:

TypeError: 'Locator' object is not iterable

How can I fix it?


There is 1 solution below


You can use evaluate_all:

from playwright.sync_api import sync_playwright


with sync_playwright() as p:
    browser = p.chromium.launch(channel="chrome", headless=True)
    page = browser.new_page()
    url = "<Your URL>"
    page.goto(url, wait_until="domcontentloaded")
    hrefs = (
        page.locator(".media-block__content a")
            .evaluate_all("els => els.map(e => e.href)")
    )
    print(hrefs)

Note that I've used a CSS selector rather than an XPath. XPaths are generally harder to read and maintain, and are usually not necessary.

Also, the data you want can be gathered without waiting for network idle, so "domcontentloaded" will run a bit faster.
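If you would rather keep the loop from the question, the TypeError goes away once you iterate over a list of locators instead of the Locator itself. Below is a minimal sketch, assuming Playwright 1.29 or newer (where Locator.all() is available). The helper names collect_hrefs and main are mine, not part of Playwright. Unlike e.href in evaluate_all, get_attribute("href") returns the attribute exactly as written in the HTML, so relative links stay relative.

```python
def collect_hrefs(page, selector):
    """Return the raw href attribute of every element matching selector."""
    # Locator.all() returns a list with one Locator per matching element.
    # The list can be iterated; the Locator itself cannot.
    return [link.get_attribute("href") for link in page.locator(selector).all()]


def main(url):
    # Imported here so collect_hrefs can be reused without launching a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(channel="chrome", headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        for href in collect_hrefs(page, ".media-block__content a"):
            print(href)
        browser.close()
```

Call main("<Your URL>") to run it. Keep in mind that Locator.all() does not wait for elements to appear, so it only sees what is already in the DOM when it is called.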

In fact, since the data you want is baked into the static HTML, you probably don't need to run JS, allow requests beyond the root HTML document, or even use Playwright at all. A simple HTTP request and BeautifulSoup are sufficient to get the data:

import requests
from bs4 import BeautifulSoup


url = "<Your URL>"
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
hrefs = [x["href"] for x in soup.select(".media-block__content a")]
print(hrefs)
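One difference worth knowing: in the browser, e.href gives the resolved absolute URL, while x["href"] in BeautifulSoup gives the attribute as written in the HTML, which may be a relative path. If you need absolute URLs from the Requests version, something like the following sketch should work. The helper name absolutize and the sample paths are made up for illustration, and the sketch ignores any base element the page might declare.

```python
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve each href against base_url, roughly as a browser does for a.href."""
    return [urljoin(base_url, href) for href in hrefs]


# Made-up sample values, for illustration only.
base = "https://learningenglish.voanews.com/z/1581"
sample = ["/a/example/1.html", "https://example.com/x"]
print(absolutize(base, sample))
```

In the script above, that would mean passing url and hrefs to absolutize before printing.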

Timing for the Playwright version:

real 0m0.770s
user 0m0.427s
sys  0m0.127s

Timing for the Requests/BeautifulSoup version:

real 0m0.260s
user 0m0.189s
sys  0m0.009s