I'm using scrapy-playwright to scrape a site, and it only returns 3 of the 51 product pages, logging WARNING: Closing page due to failed request: for the other 48.
import scrapy
from urllib.parse import quote


class PwspiderSpider(scrapy.Spider):
    name = 'pwspider'

    def start_requests(self):
        yield scrapy.Request('https://wearpact.com/men/apparel', meta={'playwright': True})

    def parse(self, response):
        products = response.xpath('//div[@class="card product"]/a/@href').getall()
        print(products)
        for product in products:
            yield scrapy.Request(url='https://wearpact.com' + product,
                                 meta={'playwright': True},
                                 callback=self.parse_product)

    def parse_product(self, response):
        yield {
            'link': response.url,
            # 'img': 'https:' + response.xpath(".//div[@class='product-images']//img/@src").get(),
            # 'title': response.xpath("//div[@class='product-title']/text()").get(),
            # 'price': response.xpath("//div[@class='product-price']/div/text()").get(),
        }
The start request seems to work fine, because I can print the returned product links without a problem.
Then when I send those URLs on to parse_product, it only successfully scrapes 3 of the 51 URLs and times out on the other 48.
When I remove meta={'playwright': True} from the scrapy.Request that goes to parse_product, response.url comes back fine for every page, so it seems to be something in Playwright that is timing out; without Playwright there is no timeout.
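For comparison, this is the plain (non-Playwright) request that succeeds for every product page; the only difference from the version above is the missing meta key:

yield scrapy.Request(url='https://wearpact.com' + product,
                     callback=self.parse_product)  # no playwright meta -- all 51 URLs respond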
I tried adding these settings:
PLAYWRIGHT_MAX_CONTEXTS = 1
# Set the page load timeout
PLAYWRIGHT_PAGE_LOAD_TIMEOUT = 60 # 60 seconds
# Set the script evaluation timeout
PLAYWRIGHT_SCRIPT_TIMEOUT = 60 # 60 seconds
# Set the selector timeout
PLAYWRIGHT_SELECTOR_TIMEOUT = 30 # 30 seconds
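Since all 48 product requests get launched at almost the same time, I'm also planning to throttle concurrency. This is only a sketch of the settings.py changes I have in mind; the setting names and the millisecond unit are my reading of the Scrapy / scrapy-playwright docs, and I haven't confirmed they fix anything:

# settings.py -- throttle how many Playwright pages are open at once (values are guesses)
CONCURRENT_REQUESTS = 4
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4
# scrapy-playwright's navigation timeout is expressed in milliseconds
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000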
I also tried encoding the URLs with urllib's quote(), and that didn't seem to make a difference.
Here are the log stats showing the 48 warnings/errors for the timeouts:
'item_scraped_count': 3,
'log_count/DEBUG': 23256,
'log_count/ERROR': 48,
'log_count/INFO': 19,
'log_count/WARNING': 48,
Here are the downloader stats:
'downloader/exception_count': 48,
'downloader/exception_type_count/playwright._impl._api_types.TimeoutError': 48,
'downloader/request_bytes': 19880,
'downloader/request_count': 53,
'downloader/request_method_count/GET': 53,
'downloader/response_bytes': 1590559,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
The 5 status-200 responses are robots.txt, the start URL, and the 3 product pages that succeeded.
The pages to be scraped sometimes show an email-capture popup. Could it be something that simple? But then why would it work on 3 pages and time out on the rest?
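If the popup is the problem, this is the kind of request I was going to try with playwright_page_methods; the wait_for_selector target comes from my own XPaths above, and '.popup-close' is just a placeholder selector I'd still need to confirm in devtools:

from scrapy_playwright.page import PageMethod

yield scrapy.Request(url='https://wearpact.com' + product,
                     callback=self.parse_product,
                     meta={
                         'playwright': True,
                         'playwright_page_methods': [
                             # wait until the product details are rendered
                             PageMethod('wait_for_selector', 'div.product-title'),
                             # then try to dismiss the email popup (placeholder selector)
                             # PageMethod('click', '.popup-close'),
                         ],
                     })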