I am trying to extract data from a website for personal use. I only want the precipitation at the top of the hour. I am nearly done, but I cannot sum the data. I think it's because it's returning null values, and/or because the data are not all integers. Maybe using a for loop is incorrect?
Here is the code:
    import urllib2
    from bs4 import BeautifulSoup
    import re

    url = 'http://www.saiawos2.com/K61/15MinuteReport.php'
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    table = soup.findAll('table')[0]
    rows = table.findAll('tr')
    second_columns = []
    thirteen_columns = []
    for row in rows[1:]:
        second_columns.append(row.findAll('td')[1]) #Column with times
        thirteen_columns.append(row.findAll('td')[12]) #Precipitation Column
    for second, thirteen in zip(second_columns, thirteen_columns):
        times = ['12:00','11:00','10:00','09:00','08:00','07:00','06:00',
                 '05:00','04:00','03:00','02:00','01:00','00:00','23:00',
                 '22:00','21:00','20:00','19:00','18:00','17:00','16:00',
                 '15:00','14:00','13:00',]
        time = '|'.join(times)
        if re.search(time, second.text):
            pcpn = re.sub('[^0-9]', '', thirteen.text) #Get rid of text
            print sum(pcpn[1:]) #Print sum and get rid of leading zero
Perhaps there is an easier way to do this, but this is what I have so far. When I call sum(pcpn), it gives the following error for the line with the print statement:
TypeError: unsupported operand type(s) for +: 'int' and 'unicode'
The problem is that sum expects an iterable of numbers, whereas you have passed it unicode characters, which it cannot add to its integer starting value of 0. All you need to do is convert each element to a number and pass the result to sum.

What does it do?

re.findall(r'[0-9.]+', thirteen.text)
Rather than using the re.sub function, we use re.findall(), which gives you a list of matches that can then be passed to the sum() function. Here each match is a run of digits and decimal points.

sum( float(x) for x in pcpn )
This converts each element to float and finds the sum. ( float(x) for x in pcpn ) is a generator expression, which creates the elements on the go.
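To see the whole idea end to end, here is a minimal sketch that runs without network access. The sample_rows list is hypothetical data standing in for the text of the scraped time and precipitation cells, and hourly_precip_total, TOP_OF_HOUR and NUMBER are names made up for illustration. It also assumes you want one running total across all top-of-the-hour rows rather than a sum printed per row, and it uses a slightly stricter number pattern than [0-9.]+ so that a lone "." is never handed to float().

```python
import re

# Hypothetical (time, precipitation) cell texts standing in for the
# second and thirteenth columns scraped from the real page.
sample_rows = [
    ("12:00", "0.02 in"),
    ("11:45", "0.10 in"),  # not at the top of the hour, so it is skipped
    ("11:00", "0.00 in"),
    ("10:00", ""),         # empty cell: findall returns [], which sums to 0
    ("09:00", "0.13 in"),
]

TOP_OF_HOUR = re.compile(r"\b\d{2}:00\b")  # matches times such as 09:00
NUMBER = re.compile(r"\d+(?:\.\d+)?")      # digits with an optional decimal part


def hourly_precip_total(rows):
    """Add up the precipitation values of the rows whose time ends in :00."""
    total = 0.0
    for time_text, precip_text in rows:
        if not TOP_OF_HOUR.search(time_text):
            continue
        # findall returns a list of strings; convert each one before summing
        total += sum(float(x) for x in NUMBER.findall(precip_text))
    return total


total = hourly_precip_total(sample_rows)
print(round(total, 2))
```

With the real page, you would build the rows from second.text and thirteen.text in your existing loop and pass them to the function in the same way.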