I am trying to download some content using Python's urllib.request. The following code raises an exception:
import urllib.request
print(urllib.request.urlopen("https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/").code)
result:
...
HTTPError: HTTP Error 403: Forbidden
If I use Firefox or links (a command-line browser), I get the content and a status code of 200. If I use lynx, strangely enough, I also get 403.
I expect all of these methods to work
- the same way
- successfully

Why is that not the case?
Most likely the site is blocking scripts from scraping it. You can often get past a basic block of this kind by sending browser-like headers (most importantly a User-Agent) along with the request. See the headers section of the urllib HOWTO for details: https://docs.python.org/3/howto/urllib2.html#headers
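As a minimal sketch (the User-Agent string below is just an example browser identifier, not something this particular site is known to require), you can build a Request with explicit headers and open that instead of the bare URL:

import urllib.request

# Sketch: send a browser-like User-Agent so the request is not identified
# as the default "Python-urllib/x.y" client, which some servers reject.
url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"}

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    print(response.code)  # should print 200 if the block was only on the User-Agent

If the server blocks on more than the User-Agent (cookies, referer, rate limits), this alone may not be enough.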
There are many reasons why people don't want scripts to scrape their websites. For one, it consumes their bandwidth. They may not want others to profit from a scrape bot, or to copy their site's content at all. Think of it like a book: authors want people to read their books, but some wouldn't want a robot to scan them to make an unauthorized copy or to summarize them.
The second part of your question in the comment is too vague and broad to answer here, as it would attract too many opinionated answers.