I'm using scrapy-playwright to scrape a site, and it only returns 3 of the 51 product pages, logging WARNING: Closing page due to failed request: for the other 48.
import scrapy
from urllib.parse import quote


class PwspiderSpider(scrapy.Spider):
    name = 'pwspider'

    def start_requests(self):
        yield scrapy.Request('https://wearpact.com/men/apparel', meta={'playwright': True})

    def parse(self, response):
        products = response.xpath('//div[@class="card product"]/a/@href').getall()
        print(products)
        for product in products:
            yield scrapy.Request(url='https://wearpact.com' + product,
                                 meta={'playwright': True},
                                 callback=self.parse_product)

    def parse_product(self, response):
        yield {
            'link': response.url,
            # 'img': 'https:' + response.xpath(".//div[@class='product-images']//img/@src").get(),
            # 'title': response.xpath("//div[@class='product-title']/text()").get(),
            # 'price': response.xpath("//div[@class='product-price']/div/text()").get(),
        }
The start request seems to work fine, because I can print the returned product links without a problem.
Then when I send those URLs on to parse_product, it only successfully scrapes 3 of the 51 URLs and times out on the other 48.
When I remove meta={'playwright': True} from the scrapy.Request that goes to parse_product, response.url comes back fine for every page, so it seems to be something in Playwright that is timing out; without Playwright there is no timeout.
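For comparison, this is the plain (non-Playwright) request that succeeds for every product page; the only difference from the version above is the missing meta key:

yield scrapy.Request(url='https://wearpact.com' + product,
                     callback=self.parse_product)  # no playwright meta -- all 51 URLs respond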
I tried adding these settings:
PLAYWRIGHT_MAX_CONTEXTS = 1
# Set the page load timeout
PLAYWRIGHT_PAGE_LOAD_TIMEOUT = 60 # 60 seconds
# Set the script evaluation timeout
PLAYWRIGHT_SCRIPT_TIMEOUT = 60 # 60 seconds
# Set the selector timeout
PLAYWRIGHT_SELECTOR_TIMEOUT = 30 # 30 seconds
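Since all 48 product requests get launched at almost the same time, I'm also planning to throttle concurrency. This is only a sketch of the settings.py changes I have in mind; the setting names and the millisecond unit are my reading of the Scrapy / scrapy-playwright docs, and I haven't confirmed they fix anything:

# settings.py -- throttle how many Playwright pages are open at once (values are guesses)
CONCURRENT_REQUESTS = 4
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
# scrapy-playwright's navigation timeout is expressed in milliseconds
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000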
I also tried encoding the URLs with urllib's quote(), and that didn't seem to make a difference.
Here are the log stats showing the 48 warnings/errors for the timeouts:
'item_scraped_count': 3,
'log_count/DEBUG': 23256,
'log_count/ERROR': 48,
'log_count/INFO': 19,
'log_count/WARNING': 48,
Here are the downloader stats:
'downloader/exception_count': 48,
'downloader/exception_type_count/playwright._impl._api_types.TimeoutError': 48,
'downloader/request_bytes': 19880,
'downloader/request_count': 53,
'downloader/request_method_count/GET': 53,
'downloader/response_bytes': 1590559,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
The 5 status-200 responses are robots.txt, the start URL, and the 3 product pages that succeeded.
The pages to be scraped sometimes show an email-capture popup. Could it be something that simple? But then why would it work on 3 pages and time out on the rest?
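If the popup is the problem, this is the kind of request I was going to try with playwright_page_methods; the wait_for_selector target comes from my own XPaths above, and '.popup-close' is just a placeholder selector I'd still need to confirm in devtools:

from scrapy_playwright.page import PageMethod

yield scrapy.Request(url='https://wearpact.com' + product,
                     callback=self.parse_product,
                     meta={
                         'playwright': True,
                         'playwright_page_methods': [
                             # wait until the product details are rendered
                             PageMethod('wait_for_selector', 'div.product-title'),
                             # then try to dismiss the email popup (placeholder selector)
                             # PageMethod('click', '.popup-close'),
                         ],
                     })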