Hello, I'm trying to build a scraping bot for a site that uses JavaScript. I have about 20 URLs from the site and would like to scale to hundreds, and I need the URLs scraped quite often, so I tried using a Lua script to make the waiting times "dynamic". When I use the default WebKit engine, the HTML output of the site is just text saying the site doesn't support this browser, which is why I'm using the Chromium engine. Without the Lua script, scraping gave output items only with the Chromium engine, but it did work. After I tried it with Lua, I got errors with the Chromium engine, while with WebKit it executed without errors but didn't give any output items. This is the start request I'm using with the Lua script:
def start_requests(self):
    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        while not splash:select('div.o-matchRow')
            splash:wait(1)
            print('waiting...')
        end
        return {html=splash:html()}
    end
    """
    for url in self.start_urls:
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='execute',
            args={'engine': 'chromium', 'lua_source': lua_script}
        )
It's something simple I wanted to test out. Does anyone know what the deal is with Lua and the Chromium engine, or how I can use WebKit when the site doesn't support it? (Btw, sorry for my English, I'm not a native speaker.) These are the errors with the Chromium engine:
2023-12-04 21:23:54 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tipsport_scraper)
2023-12-04 21:23:54 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)
], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.5, Platform Windows-10-10.0.19045-SP0
2023-12-04 21:23:54 [scrapy.addons] INFO: Enabled addons:
[]
2023-12-04 21:23:54 [asyncio] DEBUG: Using selector: SelectSelector
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet Password: **************
2023-12-04 21:23:54 [py.warnings] WARNING: C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\extensions\feedexport.py:406: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been
deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
exporter = cls(crawler)
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-12-04 21:23:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tipsport_scraper',
'CONCURRENT_REQUESTS': 5,
'DOWNLOAD_DELAY': 5,
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'tipsport_scraper.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['tipsport_scraper.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled item pipelines:
['tipsport_scraper.pipelines.TipsportScraperPipeline']
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider opened
2023-12-04 21:23:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet console listening on **********
2023-12-04 21:23:54 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tipsport.cz/kurzy/fotbal-16?limit=1000 via http://localhost:8050/execute>
Traceback (most recent call last):
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks
result = context.run(gen.send, result)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 68, in process_response
method(request=request, response=response, spider=spider)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 412, in process_response
response = self._change_response_class(request, response)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 433, in _change_response_class
response = response.replace(cls=respcls, request=request)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\http\response\__init__.py", line 125, in replace
return cls(*args, **kwargs)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 120, in __init__
self._load_from_json()
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 174, in _load_from_json
error = self.data['info']['error']
TypeError: string indices must be integers, not 'str'
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-04 21:23:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1045,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 255,
'downloader/response_count': 1,
'downloader/response_status_count/400': 1,
'elapsed_time_seconds': 0.233518,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 847285, tzinfo=datetime.timezone.utc),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/execute/request_count': 1,
'splash/execute/response_count/400': 1,
'start_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 613767, tzinfo=datetime.timezone.utc)}
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider closed (finished)
I deleted the Telnet password and some kind of IP address, just in case they were sensitive; I replaced them with *.
For Chromium, make sure your Splash instance is set up to handle Chromium requests; Splash's Chromium engine is experimental and, as far as I know, implements only a subset of the Lua API, so a script that runs fine under WebKit can fail under Chromium. If it still doesn't work, updating Splash might help.
For WebKit, the site seems to block it based on the browser it detects, so try presenting a more common user agent. Note that with the execute endpoint, your Scrapy USER_AGENT setting applies to the request Scrapy sends to Splash, not to the requests the browser engine makes, so the user agent has to be set inside the Lua script itself, for example with splash:set_user_agent. A sketch of that approach, reusing the Chrome UA string from your settings (the 2-second wait is just a placeholder):
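def start_requests(self):
    lua_script = """
    function main(splash, args)
        -- present a common desktop browser instead of the default WebKit UA
        splash:set_user_agent(args.ua)
        assert(splash:go(args.url))
        splash:wait(2)
        return {html=splash:html()}
    end
    """
    ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
    for url in self.start_urls:
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='execute',
            # extra args like 'ua' are passed through to the script as args.ua
            args={'lua_source': lua_script, 'ua': ua},  # default WebKit engine
        )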
Also, check that the div.o-matchRow you're waiting for in your Lua script actually exists on the rendered page. If it does and you still have issues, put a limit on how long the script waits so it can't get stuck. While you're at it, note two problems in the script as posted: the while loop is missing Lua's do keyword (while <condition> do ... end), so Splash will reject it with a syntax error, and print is generally not available in Splash's sandboxed Lua. A corrected sketch with a bounded wait (the 15-try limit is arbitrary):
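lua_script = """
function main(splash, args)
    assert(splash:go(args.url))
    -- poll for the match rows, but give up after ~15 seconds instead of
    -- spinning until Splash's global timeout kills the script
    local tries = 0
    while not splash:select('div.o-matchRow') and tries < 15 do
        splash:wait(1)
        tries = tries + 1
    end
    return {html=splash:html()}
end
"""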
The TypeError in your log isn't coming from your parse callback at all: the traceback is inside scrapy_splash while it tries to read the error details out of Splash's 400 response (error = self.data['info']['error']), so it's a symptom of Splash rejecting the script, not of how you process the response. Once the script itself is accepted, that error should go away. To see Splash's actual complaint, you can bypass Scrapy and POST the script straight to the execute endpoint; a minimal sketch, assuming Splash is on localhost:8050 as in your log:
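import requests

resp = requests.post(
    'http://localhost:8050/execute',
    json={
        'lua_source': lua_script,  # the script you want Splash to run
        'url': 'https://www.tipsport.cz/kurzy/fotbal-16?limit=1000',
    },
)
print(resp.status_code)  # 400 here means Splash rejected the script
print(resp.text)         # Splash's own error description, e.g. a Lua syntax error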