In Python, pull down a Wikipedia page


So I have a text file with a bunch of Wikipedia links:

http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama
http://en.wikipedia.org/wiki/List_of_cities_and_census-designated_places_in_Alaska
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Arizona
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Arkansas
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Colorado
... etc

and the following Python script, designed to pull down the HTML of each of those pages:

import urllib.request
for line in open("sites.txt", "r"):
  print("Pulling: " + line)
  urllib.request.urlretrieve(line, line.split('/'))

but when I run it I get the following error:

Traceback (most recent call last):
File "C:\Users\brandon\Desktop\site thing\miner.py", line 5, in <module>
  urllib.request.urlretrieve(line, line.split('/'))
File "C:\Python3\lib\urllib\request.py", line 188, in urlretrieve
  tfp = open(filename, 'wb')
TypeError: invalid file: ['http:', '', 'en.wikipedia.org', 'wiki', 'List_of_cities_and_towns_in_Alabama\n']

Any ideas how to fix this and do what I want?
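For the record, the TypeError happens because `str.split` returns a list, while `urlretrieve` expects its second argument to be a filename string; a quick check of what `split` actually produces:

```python
line = "http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama\n"
parts = line.split('/')
print(parts)
# ['http:', '', 'en.wikipedia.org', 'wiki', 'List_of_cities_and_towns_in_Alabama\n']

# Only the last component (with the newline stripped) is usable as a filename
filename = parts[-1].strip()
print(filename)  # List_of_cities_and_towns_in_Alabama
```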

--- EDIT ---

The solution:

import urllib.request
for line in open("sites.txt", "r"):
    article = line.replace('\n', '')  # strip the trailing newline from each URL
    print("Pulling: " + article)
    # save each page under its article name, e.g. List_of_cities_and_towns_in_Alabama.html
    urllib.request.urlretrieve(article, article.split('/')[-1] + ".html")
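For what it's worth, `urllib.parse` gives a slightly more robust way to derive the filename, since it ignores query strings and fragments (a sketch):

```python
from urllib.parse import urlparse
import posixpath

url = "http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama"
# urlparse(url).path is "/wiki/List_of_cities_and_towns_in_Alabama"
name = posixpath.basename(urlparse(url).path) + ".html"
print(name)  # List_of_cities_and_towns_in_Alabama.html
```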

There are 2 answers below.

--- ANSWER 1 ---

Try this (I prefer the requests library):

import requests

with open('sites.txt', 'r') as url_list:
    for url in url_list:
        url = url.strip()  # drop the trailing newline, as in the question's fix
        print("Getting: " + url)
        r = requests.get(url)
        # do whatever you want with the text,
        # using r.text to access it
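To get back to the original goal of saving each page to disk, `r.text` can be written to a file named after the last URL segment; a sketch of that saving step (`save_page` is a hypothetical helper, not part of requests):

```python
import os

def save_page(url, html, directory="."):
    # Hypothetical helper: name the file after the last URL path segment
    filename = url.rstrip('/').split('/')[-1] + ".html"
    path = os.path.join(directory, filename)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)
    return path

# Inside the loop above this would be: save_page(url, r.text)
print(save_page("http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama",
                "<html>stub</html>"))
```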
--- ANSWER 2 ---

Webpage link fetcher code (ported here to Python 3, since the original answer's htmllib and formatter modules no longer exist there):

import urllib.request
from html.parser import HTMLParser

# Collect the href of every anchor tag on the page
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.anchorlist.extend(v for k, v in attrs if k == 'href')

website = urllib.request.urlopen("http://en.wikipedia.org")
data = website.read().decode('utf-8')
website.close()
parser = LinkParser()
parser.feed(data)
for link in parser.anchorlist:
    print(link)
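The link-collection idea can be checked offline against a small HTML snippet, using Python 3's html.parser (a minimal sketch, no network needed):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag encountered
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

p = LinkCollector()
p.feed('<a href="/wiki/Alabama">Alabama</a> <a href="/wiki/Alaska">Alaska</a>')
print(p.links)  # ['/wiki/Alabama', '/wiki/Alaska']
```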

Full webpage content & response fetcher:

import urllib.request

response = urllib.request.urlopen('http://en.wikipedia.org')
print('RESPONSE:', response)
print('URL     :', response.geturl())

headers = response.info()
print('DATE    :', headers['date'])
print('HEADERS :')
print('---------')
print(headers)

data = response.read()
print('LENGTH  :', len(data))
print('DATA    :')
print('---------')
print(data)