Why does urllib.request.urlopen sometimes not work, even though browsers do?


I am trying to download some content using Python's urllib.request. The following command yields an exception:

import urllib.request
print(urllib.request.urlopen("https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/").code)

result:

...
HTTPError: HTTP Error 403: Forbidden

If I use Firefox or links (a command-line browser), I get the content and a status code of 200. If I use lynx, strangely enough, I also get a 403.

I expect all methods to work

  1. the same way
  2. successfully

Why is that not the case?

There are 2 answers below.

BEST ANSWER

Most likely the site is blocking scrapers. You can often get around this at a basic level by sending browser-like header information, in particular a User-Agent, along with the request. The urllib HOWTO quoted below has more details.

Quoting from: https://docs.python.org/3/howto/urllib2.html#headers

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
   the_page = response.read()
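
That example also posts form data, which isn't needed here; for a plain GET request like the one in the question, sending just a browser-like User-Agent header is usually enough. A minimal sketch for the URL in question (the User-Agent string is only illustrative):

import urllib.request

url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    print(response.status)                   # should be 200 instead of HTTP Error 403
    html = response.read().decode('utf-8')   # assuming the page is UTF-8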

There are many reasons why people don't want scripts to scrape their websites. For one, it uses their bandwidth. They may not want people to profit from a scraping bot, or they may simply not want their content copied. You can also think of it like a book: authors want people to read their books, but some of them wouldn't want a robot scanning them to make a copy or a summary.

The second part of your question in the comment is too vague and broad to answer here, as there are too many opinionated answers.

SECOND ANSWER

I tried with this code and everything was okay.

I just added a browser-like User-Agent header to the request. See the example below:

from urllib.request import Request, urlopen
from urllib.error import HTTPError
from time import sleep

def get_url_data(url=""):
    try:
        # Send the request with a browser-like User-Agent header.
        request = Request(url, headers={
            'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36"})
        response = urlopen(request)
        data = response.read().decode("utf8")
        return data
    except HTTPError:
        # e.g. 403 Forbidden if the server still rejects the request
        return None

url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"

for i in range(50):
    d = get_url_data(url)
    if d is not None:
        print("Attempt %d was a Success" % i)
    else:
        print("Attempt %d was a Failure" % i)
    sleep(1)  # wait a second between requests

Output:

Attempt 0 was a Success
Attempt 1 was a Success
Attempt 2 was a Success
Attempt 3 was a Success
Attempt 4 was a Success
Attempt 5 was a Success
Attempt 6 was a Success
Attempt 7 was a Success
Attempt 8 was a Success
Attempt 9 was a Success
...
Attempt 42 was a Success
Attempt 43 was a Success
Attempt 44 was a Success
Attempt 45 was a Success
Attempt 46 was a Success
Attempt 47 was a Success
Attempt 48 was a Success
Attempt 49 was a Success
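
If a request does still fail, the HTTPError exception itself carries the status code, reason, and the server's response headers, which can help you see why you were blocked. A small sketch of that:

from urllib.request import Request, urlopen
from urllib.error import HTTPError

url = "https://fpgroup.foreignpolicy.com/foreign-policy-releases-mayjune-spy-issue/"

try:
    urlopen(Request(url))  # no User-Agent header, so the server may refuse it
except HTTPError as e:
    print(e.code)     # e.g. 403
    print(e.reason)   # e.g. Forbidden
    print(e.headers)  # the response headers sent with the error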