I made a scraper that works well for collecting all the classes from my university (to filter them later), but it sometimes suddenly fails with strange errors like `AttributeError: 'NoneType' object has no attribute 'findAll'`. If I move on to another long page, it gives me a similar error.
My code:
from bs4 import BeautifulSoup
import urllib2
import datetime
import httplib
from math import floor
from random import randrange
import cPickle as pickle
[...irrelevant code...]
urls = ["http://locus.vub.ac.be/reporting/spreadsheet?identifier=DA&submit=toon%20de%20gegevens%20-%20show%20the%20teaching%20activities&idtype=name&template=Mod%2bSS&objectclass=module%2bgroup", "http://locus.vub.ac.be/reporting/spreadsheet?identifier=AL+tot+AP&submit=toon+de+gegevens+-+show+the+teaching+activities&idtype=name&template=Mod%2BSS&objectclass=module%2Bgroup"]
for url in urls:
    url = urllib2.urlopen(url).read()
    soup = BeautifulSoup(url)
    begins = soup.findAll("span", {"class" : "label-1-0-0"})
    for begin in begins:
        table = begin.findNext("table", {"class" : "spreadsheet"})
        #if table is not None:
        gegevens = table.findAll("tr")
        for i in range(1, len(gegevens)):
            naam = gegevens[i].td
            dag = naam.find_next_sibling("td")
            beginuur = dag.find_next_sibling("td")
            einduur = beginuur.find_next_sibling("td")
            duur = einduur.find_next_sibling("td")
            weken = duur.find_next_sibling("td")
            titularis = weken.find_next_sibling("td")
            lokaal = titularis.find_next_sibling("td")
            print naam.text + " " + dag.text + " " + beginuur.text + " " + einduur.text + " " + weken.text + " " + titularis.text + " " + lokaal.text
My output for link 1:
[...]
Discrete wiskunde (HOC) ma 18:00 21:00 4, 8, 11, 13 CARA PHILIPPE F.4.111
Discrete wiskunde (WPO2) ma 13:00 15:00 3-6, 8, 10-12, 14 Deneckere Tom E.0.12
Discrete wiskunde (HOC) wo 9:00 11:00 2-3, 6, 8-9, 11-14 CARA PHILIPPE E.0.07
Traceback (most recent call last):
  File "Untitled 7.py", line 24, in <module>
    titularis = weken.find_next_sibling("td")
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
My output for link 2:
[...]
Algemeen boekhouden - WPO - TEW - groep 5 (E-M) ma 9:00 11:00 5-6 VANDENHAUTE Marie-Laure D.3.04
Algemeen boekhouden - WPO - HI - groep 1 (A-D) di 14:00 16:00 3-14 VANDENHAUTE Marie-Laure D.2.09
Algemeen boekhouden - WPO - HI - groep 3 (Q-Z) ma 9:00 11:00 3-8, 10-14 CEUSTERMANS Stefanie D.2.10
Algemeen boekhouden - WPO - HI - groep 2 (E-P) di 9:00 11:00 3-8, 10-11, 13-14 VANDENHAUTE Marie-Laure D.3.05
Approaches to language teaching & learning for multilingual education HOC- wo 10:00 12:00 2-9, 11-14 VAN DE CRAEN PIERRE E.3.05
Traceback (most recent call last):
  File "Untitled 7.py", line 16, in <module>
    gegevens = table.findAll("tr")
AttributeError: 'NoneType' object has no attribute 'findAll'
EDIT: Replacing `soup = BeautifulSoup(url)` with `soup = BeautifulSoup(url, "xml")` (and importing the lxml library) resolved the issue. I have no idea why, though.
This looks like a problem with `urllib2.urlopen`. Make sure the server actually returns the page you are requesting, and handle exceptions properly when it does not.
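As a rough illustration of what "handle exceptions properly" could look like, here is a minimal sketch. It assumes Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`; the helper name `fetch_page` and the 10-second timeout are my own choices, not something from the question.

```python
import urllib.error
import urllib.request


def fetch_page(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    fetch_page is a hypothetical helper name; the timeout is an assumed default.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404, 500, ...).
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP error %s for %s" % (err.code, url))
    except urllib.error.URLError as err:
        # No usable answer at all (DNS failure, refused connection, bad scheme, ...).
        print("Could not open %s: %s" % (url, err.reason))
    return None
```

The loop over `urls` could then skip any URL for which `fetch_page` returns `None`, so a failed download is reported at the point where it happens rather than surfacing later as a confusing `NoneType` error. Note that this only covers failures raised by `urlopen` itself; it does not detect a page that downloads fine but parses differently than expected.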