Extracting Numbers From a Table on a Website

I am trying to extract data from a website for personal use. I only want the precipitation readings at the top of each hour. I am nearly done, but I cannot sum the data. I think it's because the code is returning null values, and/or because the values are not all integers. Maybe using a for loop is the wrong approach?

Here is the code:

import urllib2
from bs4 import BeautifulSoup
import re

url = 'http://www.saiawos2.com/K61/15MinuteReport.php'
page = urllib2.urlopen(url) 
soup  = BeautifulSoup(page.read())

table = soup.findAll('table')[0]
rows = table.findAll('tr')

second_columns = []
thirteen_columns = []

for row in rows[1:]:
    second_columns.append(row.findAll('td')[1]) #Column with times
    thirteen_columns.append(row.findAll('td')[12]) #Precipitation Column

for second, thirteen in zip(second_columns, thirteen_columns):
    times = ['12:00','11:00','10:00','09:00','08:00','07:00','06:00',
         '05:00','04:00','03:00','02:00','01:00','00:00','23:00',
         '22:00','21:00','20:00','19:00','18:00','17:00','16:00',
         '15:00','14:00','13:00',]
    time = '|'.join(times) 
    if re.search(time, second.text):
        pcpn = re.sub('[^0-9]', '', thirteen.text) #Get rid of text
        print sum(pcpn[1:]) #Print sum and get rid of leading zero

Perhaps there is an easier way to do this, but this is what I have so far. When I call sum(pcpn), the line with the print statement raises the following error:

TypeError: unsupported operand type(s) for +: 'int' and 'unicode'

Best answer

The problem is that sum() expects an iterable of numbers, whereas you have passed it a unicode string. sum() iterates over the string's characters and tries to add each one to the integer 0, which fails.
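A minimal reproduction of the failure may make this clearer. The cell value below is made up for illustration (the live page's contents are not known here), and the snippet is written so it runs on both Python 2 and Python 3:

```python
import re

# Hypothetical cell text; the real precipitation column may look different.
cell_text = u'0.02 in'

# re.sub() strips every non-digit, leaving a *string* of digits.
digits = re.sub('[^0-9]', '', cell_text)

try:
    # sum() starts from the int 0 and tries 0 + '0', which is not allowed.
    sum(digits[1:])
    failed = False
except TypeError as exc:
    failed = True
    print(exc)
```

Note that stripping the decimal point also changes the value itself, so even a successful sum of these digits would not be the precipitation amount.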

All you need to do is convert each element to a number (float here, since the values can contain a decimal point) and pass the results to sum().

    if re.search(time, second.text):
        pcpn = re.findall(r'[0-9.]+', thirteen.text)
        print sum(float(x) for x in pcpn)

What does it do?

  • re.findall(r'[0-9.]+', thirteen.text): rather than stripping characters with re.sub(), we use re.findall(), which returns a list of matches that can then be converted and passed to sum(). Here each match is a run of digits and decimal points.

  • sum( float(x) for x in pcpn ): converts each match to a float and sums the results.

    • ( float(x) for x in pcpn ) is a generator expression, which produces the elements lazily, one at a time, instead of building a list first.
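As a rough sketch of how the pieces could fit together into one running total, the example below works on hard-coded sample cells instead of the live page. The helper name cell_total, the sample rows, and the assumption that each time cell is exactly HH:MM are all hypothetical. It also swaps in a slightly stricter pattern than [0-9.]+, because that pattern would match a lone "." and float(".") raises ValueError.

```python
import re

def cell_total(cell_text):
    """Sum every decimal number found in one table cell's text."""
    # Require at least one digit, so a stray "." is never matched.
    matches = re.findall(r'\d+(?:\.\d+)?', cell_text)
    return sum(float(m) for m in matches)

# Hypothetical (time, precipitation) cell texts standing in for scraped rows.
sample_rows = [
    (u'12:00', u'0.02'),
    (u'11:45', u'0.10'),  # not on the hour, so it is skipped
    (u'11:00', u'0.05'),
    (u'10:00', u''),      # an empty cell contributes 0
]

# Assumes the time cell holds exactly HH:MM; adjust if it also carries a date.
on_the_hour = re.compile(r'^\d{2}:00$')

total = sum(cell_total(precip)
            for time_text, precip in sample_rows
            if on_the_hour.match(time_text.strip()))
print(round(total, 2))
```

Matching on the ":00" suffix avoids maintaining the list of 24 hour strings, though it only holds if the time column really has that shape.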