python3 urllib.request will block forever in gevent


I want to write a spider program that downloads web pages using gevent in Python 3. Here is my code:

import gevent
import gevent.pool
import gevent.monkey
import urllib.request

gevent.monkey.patch_all()

def download(url):
    return urllib.request.urlopen(url).read(10)

urls = ['http://www.google.com'] * 100
jobs = [gevent.spawn(download, url) for url in urls]
gevent.joinall(jobs)

But when I run it, there is an error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/gevent/greenlet.py", line 340, in run
    result = self._run(*self.args, **self.kwargs)
  File "e.py", line 8, in download
    return urllib.request.urlopen(url).read(10)
  File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  ...
    return greenlet.switch(self)
gevent.hub.LoopExit: This operation would block forever
<Greenlet at 0x7f4b33d2fdf0: download('http://www.google.com')> failed with LoopExit
...

It seems that urllib.request blocks, so the program cannot work. How can I solve it?

There are 2 answers below.


This is the same problem as in Python, gevent, urllib2.urlopen.read(), download accelerator.

To reiterate from that post:

The argument to read() is a number of bytes, not an offset.

Also:

You're trying to read a response to a single request from different greenlets.

If you'd like to download the same file over several concurrent connections, you could use the Range HTTP header if the server supports it (you get a 206 status instead of 200 for a request with a Range header). See HTTPRangeHandler.
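
For illustration, a minimal sketch of a ranged request with plain urllib.request (the URL and byte range here are arbitrary; a server that ignores Range simply answers 200 with the full body):

import urllib.request

# Ask for the first 10 bytes only; 'Range' is a standard HTTP request header.
req = urllib.request.Request('http://www.google.com',
                             headers={'Range': 'bytes=0-9'})
resp = urllib.request.urlopen(req)
# 206 Partial Content means the server honored the range;
# 200 means it ignored the header and sent the whole body.
print(resp.getcode(), resp.read(10))  # read(10) returns at most 10 bytes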


It could be due to proxy settings when the machine is inside a company network (a sketch of an explicit proxy setup follows after the code below). My personal recommendation is to use Selenium in combination with Beautiful Soup: Selenium uses the browser to open the URL, and you can then download the HTML content or control the browser directly. Hope it helps.

from selenium import webdriver
from bs4 import BeautifulSoup

# webdriver.Ie() drives Internet Explorer; Chrome() or Firefox()
# work the same way if their drivers are installed.
browser = webdriver.Ie()
url = "http://www.google.com"
browser.get(url)                   # let the browser load the page
html_source = browser.page_source  # rendered HTML of the loaded page
soup = BeautifulSoup(html_source, "lxml")
print(soup)
browser.close()                    # close the browser window
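
As for the proxy idea above: if a corporate proxy really is what blocks the connection, here is a minimal sketch of pointing urllib.request at an explicit proxy (the proxy address is a made-up placeholder, not a real host):

import urllib.request

# Hypothetical proxy address; replace it with your company's proxy.
proxy = urllib.request.ProxyHandler({
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080',
})
opener = urllib.request.build_opener(proxy)
print(opener.open('http://www.google.com').read(10))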