I have a text file with a bunch of Wikipedia links:
    http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama
    http://en.wikipedia.org/wiki/List_of_cities_and_census-designated_places_in_Alaska
    http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Arizona
    http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Arkansas
    http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California
    http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Colorado
    ... etc
and the following Python script, designed to pull down the HTML of each of the pages:
    import urllib.request

    for line in open("sites.txt", "r"):
        print("Pulling: " + line)
        urllib.request.urlretrieve(line, line.split('/'))
but when I run it, I get the following error:
    Traceback (most recent call last):
      File "C:\Users\brandon\Desktop\site thing\miner.py", line 5, in <module>
        urllib.request.urlretrieve(line, line.split('/'))
      File "C:\Python3\lib\urllib\request.py", line 188, in urlretrieve
        tfp = open(filename, 'wb')
    TypeError: invalid file: ['http:', '', 'en.wikipedia.org', 'wiki', 'List_of_cities_and_towns_in_Alabama\n']
Any ideas on how to fix this and get the pages saved?
--- EDIT ---
The solution:
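    import urllib.request

    with open("sites.txt", "r") as sites:
        for line in sites:
            # Drop the trailing newline (strip() also handles "\r\n").
            article = line.strip()
            print("Pulling: " + article)
            # The second argument must be one filename string, not a list.
            urllib.request.urlretrieve(article, article.split('/')[-1] + ".html")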
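
For anyone hitting the same TypeError, here is a small sketch of what goes wrong, with no network access needed. The sample `line` is just the first URL from my list as it would come out of the file; the variable names are only for illustration. `split('/')` hands back a list, and every line read from the file still carries its `\n`, so the filename has to be built from the last piece after stripping:

```python
# Sample line as it would be read from sites.txt, trailing newline included.
line = "http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Alabama\n"

# str.split returns a list; open() (and therefore urlretrieve) rejects a list
# as a filename, which is what the TypeError is complaining about.
parts = line.split('/')
print(parts)

# Strip the newline first, then keep only the last path segment as the name.
article = line.strip()
filename = article.split('/')[-1] + ".html"
print(filename)
```

The first `print` shows the same list that appears in the traceback; the second shows a plain string that is safe to pass as the output filename.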
Try this (I prefer the `requests` library):
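
A minimal sketch of how a `requests`-based version might look, assuming the same `sites.txt` input and the same "last path segment plus `.html`" naming as in the question. The helper names `filename_for` and `download_all` are hypothetical, and the 30-second timeout is an arbitrary choice:

```python
def filename_for(url):
    """Hypothetical helper: last path segment of the URL plus '.html'."""
    return url.rstrip('/').split('/')[-1] + ".html"


def download_all(list_path="sites.txt"):
    """Hypothetical helper: fetch every URL listed in list_path."""
    # Third-party package (pip install requests). Imported here so that
    # filename_for can be used even where requests is not installed.
    import requests

    with open(list_path, "r") as sites:
        for line in sites:
            url = line.strip()  # remove the trailing newline
            if not url:
                continue  # skip blank lines
            print("Pulling: " + url)
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # raise on HTTP errors such as 404
            # Write the raw bytes so no text re-encoding takes place.
            with open(filename_for(url), "wb") as out:
                out.write(response.content)
```

Calling `download_all()` from the directory that holds `sites.txt` should write one `.html` file per link into that directory. Unlike `urlretrieve`, this makes HTTP failures explicit through `raise_for_status()`.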