I am new to Scrapy framework & currently using it to extract articles from multiple 'Health & Wellness' websites. For some of the requests, scrapy is redirecting to homepage(this behavior is not observed in browser). Below is an example:
Command: scrapy shell "http://www.bornfitness.com/blog/page/10/" Result: 2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 2015-06-19 21:32:15+0530 [default] INFO: Spider opened 2015-06-19 21:32:15+0530 [default] DEBUG: Redirecting (301) to http://www.bornfitness.com/> from http://www.bornfitness.com/blog/page/10/> 2015-06-19 21:32:16+0530 [default] DEBUG: Crawled (200) http://www.bornfitness.com/> (referer: None)
Note that the page number in url(10) is a two-digit number. I don't see this issue with urls with single-sigit page number(8 for example). Result: 2015-06-19 21:43:15+0530 [default] INFO: Spider opened 2015-06-19 21:43:16+0530 [default] DEBUG: Crawled (200) http://www.bornfitness.com/blog/page/8/> (referer: None)
When you have trouble replicating browser behavior using scrapy, you generally want to look at what are those things which are being communicated differently when your browser is talking to the website compared with when your spider is talking to the website. Remember that a website is (almost always) not designed to be nice to webcrawlers, but to interact with web browsers.
For your situation, if you look at the headers being sent with your scrapy request, you should see something like:
If you examine the headers sent by a request for the same page by your web browser, you might see something like:
Try changing the
User-Agent
in your request. This should allow you to get around the redirect.