scrapy: how to include hashtag in URL


I have a URL of the form

url = "http://www.example.com/search.html#query=test"

When passing this to scrapy.Request as

yield scrapy.Request(url, self.parse_result)

and picking it up in parse_result like this

def parse_result(self, response):
    print(response.url)

the last bit in the string is always stripped, and is printed as follows

http://www.example.com/search.html

What do I need to do to be able to pick up the full string from response.url, including the #query=test part? I tried using the %23 escape code instead of the hash, but that is just passed along literally as %23 rather than being treated as a hash. And using

urllib.parse.quote(url)

raises a ValueError:

ValueError: Missing scheme in request
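
For reference, here is a minimal sketch that reproduces the error (it assumes only the example URL above and the default safe='/' behaviour of urllib.parse.quote, which percent-encodes the colon in the scheme):

from urllib.parse import quote

import scrapy

url = "http://www.example.com/search.html#query=test"

# quote() leaves '/' alone but encodes ':', '#' and '=', so the scheme is mangled:
# 'http%3A//www.example.com/search.html%23query%3Dtest'
quoted = quote(url)
print(quoted)

# Scrapy then rejects the quoted URL because it no longer starts with a scheme:
try:
    scrapy.Request(quoted)
except ValueError as e:
    print(e)  # Missing scheme in request url: http%3A//...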
1 Answer

Maksim Kviatkouski:

Peter, the thing is that servers never receive the hash (or fragment identifier, which is what that piece is called). Per https://en.wikipedia.org/wiki/Fragment_identifier, "its processing is exclusively client-side".

In your case that means there is some JavaScript on the page that picks up the hash after the page has loaded, processes it, and brings the page to its actual state. Out of the box, Scrapy is not capable of executing JavaScript. So you have a few options here:

  • Check the Network tab of your browser and see whether the browser makes any XHR/Ajax requests. If it does, those requests may contain the information you need to scrape, and you can call them directly (see the request sketch after this list).
  • If the browser doesn't make any Ajax/XHR requests, then all the required information is probably already in the HTML response you got from the server. It may be in data attributes of HTML tags, in hidden blocks, and so on. Try searching through the HTML response (do not use Inspect Element, which shows the HTML after it has been processed by JS; use View page source instead, which shows exactly what the server sent you).
  • There are ways to execute JS with Scrapy, e.g. https://github.com/scrapy-plugins/scrapy-splash, but it requires a more advanced setup and more work than simple server-side processing (see the Splash sketch after this list).
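
If the page does turn out to fetch its results over XHR, a sketch of the first option could look like the following. The /api/search endpoint, the query parameter and the "results" field are hypothetical placeholders; substitute whatever URL and payload the Network tab actually shows:

import json

import scrapy


class SearchSpider(scrapy.Spider):
    name = "search"

    def start_requests(self):
        # Hypothetical endpoint: use whatever request the Network tab shows
        # the page making when it processes "#query=test" on the client side.
        yield scrapy.Request(
            "http://www.example.com/api/search?query=test",
            callback=self.parse_result,
        )

    def parse_result(self, response):
        # Assuming the endpoint returns JSON; adjust to the real payload.
        data = json.loads(response.text)
        for item in data.get("results", []):
            yield item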
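
And if you do need the JavaScript to run, a minimal sketch of the scrapy-splash route might look like this. It assumes a Splash instance is already running at localhost:8050 and that settings.py is configured with the middlewares from the scrapy-splash README; the wait value is only a guess at how long the page needs to settle:

import scrapy
from scrapy_splash import SplashRequest


class FragmentSpider(scrapy.Spider):
    name = "fragment"

    # Assumes settings.py follows the scrapy-splash README, e.g.
    # SPLASH_URL = "http://localhost:8050" plus its downloader middlewares.

    def start_requests(self):
        # Splash loads the page in a real browser engine, so the client-side
        # JS gets a chance to read "#query=test" and render the results.
        yield SplashRequest(
            "http://www.example.com/search.html#query=test",
            callback=self.parse_result,
            args={"wait": 2.0},  # guessed settle time; tune for the real page
        )

    def parse_result(self, response):
        # response.text now holds the HTML after the JavaScript has run.
        self.logger.info("Rendered page length: %d", len(response.text))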