import os, re, sys, urllib2
from bs4 import BeautifulSoup
import lxml
html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "lxml")
divs = soup.find_all("div", {"class":"block"})
print len(divs)
Output:
ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
Python 2.7.2 (default, Jun 24 2011, 12:21:10) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, re, sys, urllib2
>>> from bs4 import BeautifulSoup
>>> import lxml
>>>
>>> html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
>>> soup = BeautifulSoup(html, "lxml")
>>> divs = soup.find_all("div", {"class":"block"})
>>> print len(divs)
2
I also tried:
divs = soup.find_all(class_="block")
with the same result.
But there are 11 elements on the page that match this condition. So is there some limitation, such as a maximum element size, or how can I get all of the elements?
The easiest way is probably to use 'html.parser' instead of 'lxml'. A sketch of that change, reusing your original code with only the parser argument swapped:
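import os, re, sys, urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "html.parser")  # only the parser string changes
divs = soup.find_all("div", {"class": "block"})
print len(divs)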
With your original code (using lxml) it printed 1 for me, but this prints 11. lxml is lenient, but not as lenient as html.parser for this page. Please note that the page produces over one thousand warnings if you run it through tidy, including invalid character codes, unclosed <div>s, and characters like < and / at positions where they are not allowed.
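If you want to see the difference directly, here is a minimal sketch (assuming the same Python 2 / urllib2 environment as in the question; the page may have changed since) that fetches the page once and compares how many matching divs each parser finds:

import urllib2
from bs4 import BeautifulSoup

# Read the page once, then feed the same markup to both parsers.
html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/").read()
for parser in ("lxml", "html.parser"):
    soup = BeautifulSoup(html, parser)
    print parser, len(soup.find_all("div", {"class": "block"}))

On markup this broken, html.parser keeps more of the malformed elements in the tree, which is presumably why it finds all 11 blocks while lxml drops most of them.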