Hello, I'm trying to build a scraping bot for a site that uses JavaScript. I have about 20 URLs from the site and would like to scale to hundreds, and I need the URLs scraped quite often, so I tried using a Lua script to make the waiting times "dynamic". When I use the default WebKit engine, the HTML output of the site is just text saying the site doesn't support this browser, which is why I'm using the Chromium engine. Without the Lua script, scraping gave output items only with the Chromium engine, but it did work. After I tried it with Lua, I got errors with the Chromium engine, while with WebKit it executed without errors but didn't give any output items. This is the start request I'm using with the Lua script:
def start_requests(self):
    lua_script = """
    function main(splash, args)
        assert(splash:go(args.url))
        while not splash:select('div.o-matchRow')
            splash:wait(1)
            print('waiting...')
        end
        return {html=splash:html()}
    end
    """
    for url in self.start_urls:
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='execute',
            args={'engine': 'chromium', 'lua_source': lua_script}
        )
It's something simple I wanted to test out. Does anyone know what the deal is with Lua and the Chromium engine, or how I can use WebKit when the site doesn't support it? (Btw, sorry for my English, I'm not a native speaker.) These are the errors with the Chromium engine:
2023-12-04 21:23:54 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tipsport_scraper)
2023-12-04 21:23:54 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)
], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.5, Platform Windows-10-10.0.19045-SP0
2023-12-04 21:23:54 [scrapy.addons] INFO: Enabled addons:
[]
2023-12-04 21:23:54 [asyncio] DEBUG: Using selector: SelectSelector
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet Password: **************
2023-12-04 21:23:54 [py.warnings] WARNING: C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\extensions\feedexport.py:406: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been
deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
exporter = cls(crawler)
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2023-12-04 21:23:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tipsport_scraper',
'CONCURRENT_REQUESTS': 5,
'DOWNLOAD_DELAY': 5,
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'FEED_EXPORT_ENCODING': 'utf-8',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'tipsport_scraper.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['tipsport_scraper.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled item pipelines:
['tipsport_scraper.pipelines.TipsportScraperPipeline']
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider opened
2023-12-04 21:23:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet console listening on **********
2023-12-04 21:23:54 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tipsport.cz/kurzy/fotbal-16?limit=1000 via http://localhost:8050/execute>
Traceback (most recent call last):
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks
result = context.run(gen.send, result)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 68, in process_response
method(request=request, response=response, spider=spider)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 412, in process_response
response = self._change_response_class(request, response)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 433, in _change_response_class
response = response.replace(cls=respcls, request=request)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\http\response\__init__.py", line 125, in replace
return cls(*args, **kwargs)
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 120, in __init__
self._load_from_json()
File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 174, in _load_from_json
error = self.data['info']['error']
TypeError: string indices must be integers, not 'str'
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-04 21:23:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1045,
'downloader/request_count': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 255,
'downloader/response_count': 1,
'downloader/response_status_count/400': 1,
'elapsed_time_seconds': 0.233518,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 847285, tzinfo=datetime.timezone.utc),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/execute/request_count': 1,
'splash/execute/response_count/400': 1,
'start_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 613767, tzinfo=datetime.timezone.utc)}
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider closed (finished)
I deleted the Telnet password and some kind of IP address, just in case they were sensitive; I replaced them with *.
For Chromium, make sure your Splash instance is set up to handle Chromium requests; Splash's Chromium engine is experimental and, as far as I know, implements only a subset of the Lua API, so a script that runs fine under WebKit can fail under Chromium. If it still doesn't work, updating Splash might help.
For WebKit, the site seems to block it based on the browser it detects, so try presenting a more common user agent. Note that with the execute endpoint, your Scrapy USER_AGENT setting applies to the request Scrapy sends to Splash, not to the requests the browser engine makes, so the user agent has to be set inside the Lua script itself, for example with splash:set_user_agent. A sketch of that approach, reusing the Chrome UA string from your settings (the 2-second wait is just a placeholder):
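def start_requests(self):
    lua_script = """
    function main(splash, args)
        -- present a common desktop browser instead of the default WebKit UA
        splash:set_user_agent(args.ua)
        assert(splash:go(args.url))
        splash:wait(2)
        return {html=splash:html()}
    end
    """
    ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
    for url in self.start_urls:
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='execute',
            # extra args like 'ua' are passed through to the script as args.ua
            args={'lua_source': lua_script, 'ua': ua},  # default WebKit engine
        )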
Also, check that the div.o-matchRow you're waiting for in your Lua script actually exists on the rendered page. If it does and you still have issues, put a limit on how long the script waits so it can't get stuck. While you're at it, note two problems in the script as posted: the while loop is missing Lua's do keyword (while <condition> do ... end), so Splash will reject it with a syntax error, and print is generally not available in Splash's sandboxed Lua. A corrected sketch with a bounded wait (the 15-try limit is arbitrary):
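lua_script = """
function main(splash, args)
    assert(splash:go(args.url))
    -- poll for the match rows, but give up after ~15 seconds instead of
    -- spinning until Splash's global timeout kills the script
    local tries = 0
    while not splash:select('div.o-matchRow') and tries < 15 do
        splash:wait(1)
        tries = tries + 1
    end
    return {html=splash:html()}
end
"""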
The TypeError in your log isn't coming from your parse callback at all: the traceback is inside scrapy_splash while it tries to read the error details out of Splash's 400 response (error = self.data['info']['error']), so it's a symptom of Splash rejecting the script, not of how you process the response. Once the script itself is accepted, that error should go away. To see Splash's actual complaint, you can bypass Scrapy and POST the script straight to the execute endpoint; a minimal sketch, assuming Splash is on localhost:8050 as in your log:
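import requests

resp = requests.post(
    'http://localhost:8050/execute',
    json={
        'lua_source': lua_script,  # the script you want Splash to run
        'url': 'https://www.tipsport.cz/kurzy/fotbal-16?limit=1000',
    },
)
print(resp.status_code)  # 400 here means Splash rejected the script
print(resp.text)         # Splash's own error description, e.g. a Lua syntax error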