Strange BeautifulSoup NoneType error

I made a well-functioning scraper to get all the classes from my university (to filter them later), but it sometimes suddenly fails with strange errors like `AttributeError: 'NoneType' object has no attribute 'findAll'`. If I move on to another long page, I get a similar error.

My code:

from bs4 import BeautifulSoup
import urllib2
import datetime
import httplib
from math import floor
from random import randrange
import cPickle as pickle
[...irrelevant code...]
urls = ["http://locus.vub.ac.be/reporting/spreadsheet?identifier=DA&submit=toon%20de%20gegevens%20-%20show%20the%20teaching%20activities&idtype=name&template=Mod%2bSS&objectclass=module%2bgroup", "http://locus.vub.ac.be/reporting/spreadsheet?identifier=AL+tot+AP&submit=toon+de+gegevens+-+show+the+teaching+activities&idtype=name&template=Mod%2BSS&objectclass=module%2Bgroup"]
for url in urls:
    url = urllib2.urlopen(url).read()
    soup = BeautifulSoup(url)
    begins = soup.findAll("span", {"class" : "label-1-0-0"})
    for begin in begins:
        table = begin.findNext("table", {"class" : "spreadsheet"})
        #if table is not None:
        gegevens = table.findAll("tr")
        for i in range (1, len(gegevens)):
            naam = gegevens[i].td
            dag = naam.find_next_sibling("td")
            beginuur = dag.find_next_sibling("td")
            einduur = beginuur.find_next_sibling("td")
            duur = einduur.find_next_sibling("td")
            weken = duur.find_next_sibling("td")
            titularis = weken.find_next_sibling("td")
            lokaal = titularis.find_next_sibling("td")
            print naam.text + " " + dag.text + " " + beginuur.text + " " + einduur.text + " " + weken.text + " " + titularis.text + " " + lokaal.text

My output for link 1:

[...]
Discrete wiskunde (HOC) ma 18:00 21:00 4, 8, 11, 13 CARA PHILIPPE F.4.111
Discrete wiskunde (WPO2) ma 13:00 15:00 3-6, 8, 10-12, 14 Deneckere Tom E.0.12
Discrete wiskunde (HOC) wo 9:00 11:00 2-3, 6, 8-9, 11-14 CARA PHILIPPE E.0.07
Traceback (most recent call last):
  File "Untitled 7.py", line 24, in <module>
    titularis = weken.find_next_sibling("td")
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'

My output for link 2:

[...]
Algemeen boekhouden - WPO - TEW - groep 5 (E-M) ma 9:00 11:00 5-6 VANDENHAUTE Marie-Laure D.3.04
Algemeen boekhouden - WPO - HI - groep 1 (A-D) di 14:00 16:00 3-14 VANDENHAUTE Marie-Laure D.2.09
Algemeen boekhouden - WPO - HI - groep 3 (Q-Z) ma 9:00 11:00 3-8, 10-14 CEUSTERMANS Stefanie D.2.10
Algemeen boekhouden - WPO - HI - groep 2 (E-P) di 9:00 11:00 3-8, 10-11, 13-14 VANDENHAUTE Marie-Laure D.3.05
Approaches to language teaching & learning for multilingual education HOC- wo 10:00 12:00 2-9, 11-14 VAN DE CRAEN PIERRE E.3.05
Traceback (most recent call last):
  File "Untitled 7.py", line 16, in <module>
    gegevens = table.findAll("tr")
AttributeError: 'NoneType' object has no attribute 'findAll'

EDIT: replacing soup = BeautifulSoup(url) with soup = BeautifulSoup(url, "xml") (and importing the lxml library) resolved the issue. I have no idea why though...
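
Separately from the parser change, the commented-out `if table is not None:` check can be turned into an explicit guard, so that a missing table or a short row is skipped instead of raising. This is only a minimal sketch, assuming every data row should contain the eight `td` cells the loop walks through. `row_texts`, `table_rows` and the `expected` parameter are hypothetical names, not part of BeautifulSoup; the only calls used on the parsed objects are `find_all` and `get_text`.

```python
def row_texts(row, expected=8):
    """Return the stripped text of the first `expected` <td> cells of `row`.

    Returns None when the row has fewer cells than expected, which is the
    situation where a chained find_next_sibling() call would hit None.
    """
    cells = row.find_all("td")
    if len(cells) < expected:
        return None
    return [cell.get_text().strip() for cell in cells[:expected]]


def table_rows(table, expected=8):
    """Yield the cell texts of each complete data row in `table`.

    Yields nothing when `table` is None, i.e. when findNext() found no
    matching table after the label.
    """
    if table is None:
        return
    for row in table.find_all("tr")[1:]:  # skip the header row, as the original loop does
        texts = row_texts(row, expected)
        if texts is not None:
            yield texts
```

In the loop above, `table_rows(begin.findNext("table", {"class" : "spreadsheet"}))` would then hand back one list of strings per complete row. This hides the symptom rather than explaining it, so it is worth logging the skipped rows while debugging.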

There is 1 best solution below

This looks like an error originating from urllib2.urlopen. Make sure the page you are requesting can actually be retrieved from your server, or handle exceptions properly.
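
As a rough illustration of handling a failed request, here is a minimal sketch that returns None instead of letting the exception propagate. The names `fetch`, `opener` and `timeout` are made up for this example, and the import fallback is only there so the same code runs on Python 3 as well as with `urllib2` on Python 2.

```python
try:
    from urllib.request import urlopen  # Python 3
    from urllib.error import URLError
except ImportError:
    from urllib2 import urlopen, URLError  # Python 2, as in the question


def fetch(url, opener=urlopen, timeout=10):
    """Return the response body, or None if the request fails.

    `opener` defaults to urlopen; it is a parameter only so that the
    function can be exercised without a network connection.
    """
    try:
        response = opener(url, timeout=timeout)
        return response.read()
    except (URLError, IOError) as exc:
        # HTTPError is a subclass of URLError; socket timeouts are IOErrors.
        print("could not fetch %s: %s" % (url, exc))
        return None
```

In the loop from the question you would call `fetch(url)` and skip to the next URL when it returns None, before passing the result to BeautifulSoup.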