I am trying to extract data from https://www.marinetraffic.com/en/ais/details/ships/imo:9829069/ using the following Scrapy spider, which saves the response to file.html:
# -*- coding: utf-8 -*-
import scrapy
from fake_useragent import UserAgent


class MarinetrafficSpider(scrapy.Spider):
    name = 'marinetraffic'
    allowed_domains = ['marinetraffic.com']
    ua = UserAgent()
    ua.update()

    def start_requests(self):
        urls = [
            'https://www.marinetraffic.com/en/ais/details/ships/imo:9829069/'
        ]
        headers = {'User-Agent': self.ua['google chrome']}
        for url in urls:
            yield scrapy.Request(url, callback=self.parse, headers=headers)

    def parse(self, response):
        with open('file.html', 'wb') as f:
            f.write(response.body)
        self.log('Saved file')
However, I do not get the expected response. The returned response is in file.html.
Please check the debug results.
What changes do I need to make to the code above so that the returned response is the same as the one I get in the browser?
I would appreciate your comments.
The reason you do not see anything is that the website is rendered via JavaScript. In other words, the MarineTraffic server sends you a very basic HTML page along with a JavaScript file that loads the content, then builds and displays the required HTML for you.
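One rough way to confirm this on the saved file is to check whether the HTML contains scripts but almost no visible text. The sketch below is only a heuristic: the function name `looks_js_rendered` and the 200-character threshold are my own assumptions for illustration, not anything MarineTraffic or Scrapy defines.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts <script> tags and visible text characters in an HTML document."""

    def __init__(self):
        super().__init__()
        self.visible_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see on the page.
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_js_rendered(html, min_visible_chars=200):
    """Heuristic: scripts are present but there is very little visible text.

    The threshold is an arbitrary assumption; tune it for the site at hand.
    """
    counter = _TextCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.visible_chars < min_visible_chars
```

To try it on the spider's output, read file.html as text (for example with `open('file.html', encoding='utf-8', errors='replace')`) and pass the contents to `looks_js_rendered`. A `True` result suggests you received the empty shell rather than the rendered page.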
To get the full HTML with the data you are looking for, you need to emulate a real browser. Since you are using Python, have a look at Selenium together with ChromeDriver.
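A minimal sketch of that approach could look like the following. It assumes Selenium 4 and a recent local Chrome install (for the `--headless=new` flag). The helper names `ship_details_url` and `fetch_rendered_html` are made up for illustration, and the `h1` selector used to decide when the page is "ready" is a placeholder, since I have not checked which element actually carries the ship data.

```python
def ship_details_url(imo):
    """Build the ship details URL from the question for a given IMO number."""
    return f"https://www.marinetraffic.com/en/ais/details/ships/imo:{imo}/"


def fetch_rendered_html(url, wait_css="h1", timeout=20):
    """Load `url` in headless Chrome and return the HTML after JS has run.

    `wait_css` is a placeholder selector (an assumption); replace it with an
    element that only appears once the data you need has been rendered.
    """
    # Imported here so this module can be imported without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the chosen element exists or the timeout expires.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `fetch_rendered_html(ship_details_url("9829069"))` and writing the returned string to file.html would then replace the `parse` step of the spider. If the wait times out, Selenium raises `TimeoutException`, which usually means the selector is wrong or the page was blocked.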
But beware: the last time I checked (3 years ago), MarineTraffic had very strong anti-crawler protection that would block you after a couple of page visits with the Selenium + ChromeDriver setup.