Why is scrapy-splash not returning the expected HTML from a dynamic JavaScript page?


I'm attempting to scrape the Market table data from the following page utilizing scrapy-splash:

"manta.layerbank.finance/bank" (Put in quotes because might be causing spam issue?)

So far I'm at the stage of trying to get the full page HTML so that I can use XPath or CSS selectors to grab the relevant information. When I run my spider I only get the outer HTML, with none of the content in the body.

The HTML I was expecting can be seen by inspecting the table on the site (screenshot: HTML for body market table page).

The HTML I am getting from my Scrapy/Splash response is as follows:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta property="og:type" content="website">
    <meta property="og:url" content="https://layerbank.finance">
    <meta property="og:title" content="LayerBank">
    <meta property="og:image" content="https://cdn.layerbank.finance/open_graph.png">
    <meta property="og:image:width" content="1500">
    <meta property="og:image:height" content="788">
    <meta property="og:description" content="The Ultimate Money Market for All EVM-Layers">
    <meta property="og:site_name" content="LayerBank">
    <meta property="og:locale" content="en_US">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@100;200;300;400;500;600;700;800;900&amp;display=swap"
          rel="stylesheet" type="text/css">
    <title>LayerBank</title>
    <script defer="defer" src="/app.cf91ad4bfa347ec83220.bundle.js"></script>
</head>
<body>
<noscript>You need to enable JavaScript to run this app.</noscript>
<div id="root"></div>
</body>
</html>

Here you can see that the body contains only a noscript element with a message asking to enable JavaScript, and a root div that should contain all of the HTML I am looking for but is empty.

My current spider code in my application is the following:

import os
from scrapy_splash import SplashRequest

import scrapy
import datetime


class MarketsSpider(scrapy.Spider):
    name = "markets"
    allowed_domains = ["manta.layerbank.finance"]
    start_urls = ["https://manta.layerbank.finance/bank"]
    output_directory = "webpageRepository"

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url,
                                self.response_parser,
                                endpoint='render.html',
                                args={'wait': 2.0},
                                )

    def response_parser(self, response):
        # Save the rendered HTML obtained after JavaScript execution
        date_time = datetime.datetime.now().strftime('%m-%d-%YT%H:%M:%S')
        filename = f"bank-page_{date_time}.html"
        output_path = os.path.join(self.output_directory, filename)

        # Create the output directory if it doesn't exist
        os.makedirs(self.output_directory, exist_ok=True)

        # Save the HTML content to the specified directory
        with open(output_path, 'w', encoding='utf-8') as file:
            file.write(response.text)

        self.log(f"Saved file {output_path}")

Brief explanation: I make a request to the site through Splash, hoping it renders the page and returns the HTML, then save that HTML to a file whose name includes a timestamp.

To me it looks like the relevant HTML is generated by the JS file referenced in the script element of the response I'm currently getting ("/app.cf91ad4bfa347ec83220.bundle.js"). However, I thought Splash was capable of resolving these references and rendering the HTML, as stated in the Scrapy documentation.

So far I have tried writing various Lua functions to pass to my local Splash web interface, and changing the engine Scrapy uses to make the call, with no luck.
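
For illustration, here is a minimal sketch of what a Lua-based attempt could look like: instead of a fixed `wait`, the script polls until an element appears inside the root div and only then returns the HTML. The helper name `build_splash_args`, the `#root table` selector, and the timing values are assumptions for the example, not taken from the site or from the scripts already tried. Even with such a script, the page may stay empty if the bundle relies on JavaScript features that Splash's rendering engine does not support.

```python
# Lua script for Splash's 'execute' endpoint. It uses splash:go, splash:select,
# splash:wait and splash:html; the custom arguments (selector, max_wait,
# poll_interval) are names chosen for this sketch.
LUA_WAIT_FOR_SELECTOR = """
function main(splash, args)
  assert(splash:go(args.url))
  local waited = 0
  -- Poll until the element appears or the time budget runs out.
  while waited < args.max_wait do
    if splash:select(args.selector) then
      break
    end
    splash:wait(args.poll_interval)
    waited = waited + args.poll_interval
  end
  return splash:html()
end
"""


def build_splash_args(selector="#root table", max_wait=15.0, poll_interval=0.5):
    """Build the args dict for a SplashRequest (hypothetical helper).

    The default selector is a placeholder; it should be replaced with
    whatever element actually wraps the market table.
    """
    return {
        "lua_source": LUA_WAIT_FOR_SELECTOR,
        "selector": selector,
        "max_wait": max_wait,
        "poll_interval": poll_interval,
        # Give Splash some headroom beyond the polling budget.
        "timeout": max_wait + 15,
    }
```

In the spider this would replace the `render.html` call, roughly as `SplashRequest(url, self.response_parser, endpoint='execute', args=build_splash_args())`.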

Any help on getting this page to render, or an explanation of why it doesn't, would be appreciated. If this is not possible with Splash, I'm thinking the next step might be to try a headless browser like Playwright or Selenium, as mentioned in the Scrapy docs.
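
As a rough sketch of the Playwright route, assuming the separate `scrapy-playwright` package is installed: the configuration below is what that plugin expects in the project settings, plus the request meta flag that routes a request through the browser. The function names `playwright_settings` and `playwright_meta` are hypothetical wrappers used only to keep the example self-contained.

```python
def playwright_settings():
    """Settings that route HTTP(S) downloads through scrapy-playwright."""
    handler = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
    return {
        "DOWNLOAD_HANDLERS": {"http": handler, "https": handler},
        # scrapy-playwright requires the asyncio-based Twisted reactor.
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }


def playwright_meta():
    """Request meta that asks scrapy-playwright to render the page."""
    return {"playwright": True}
```

The spider could then set `custom_settings = playwright_settings()` and yield `scrapy.Request(url, callback=self.response_parser, meta=playwright_meta())`. Waiting for a specific element before the response is returned would additionally need a `PageMethod("wait_for_selector", ...)` entry under the `playwright_page_methods` meta key, with a selector that matches the rendered table.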

Edit: Sorry for not providing direct links to the site in question and the Scrapy docs but SO seemed to think my question was spam when my question contained them.
