I am a newbie looking at HTML code for the first time. For my research, I need to know the number of tags and attributes in a webpage.
I looked at various parsers and found Beautiful Soup to be one of the most preferred. The following code (taken from Parsing HTML using Python) shows how to parse a file:
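import urllib.request
from bs4 import BeautifulSoup

# Fetch the page and parse it with the parser bundled with Python
page = urllib.request.urlopen('http://www.google.com/')
soup = BeautifulSoup(page, 'html.parser')
# Text of the first <div class="container"> inside <body>
x = soup.body.find('div', attrs={'class': 'container'}).text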
I found find_all quite useful, but it needs an argument to find something.
Can someone guide me on how to get the count of all tags and attributes in an HTML page?
Can the Google developer tools help in that regard?
If you call find_all() without any arguments, it finds all elements on a page recursively. Demo:

Padraic showed you how to count elements and attributes via BeautifulSoup. In addition to that, here is how to do the same with lxml.html:

As a bonus, I've made a simple benchmark demonstrating that the latter approach is much faster (on my machine, with my setup, and without specifying a parser, which would make BeautifulSoup use lxml under the hood; a lot of things can affect the results, but anyway):

where test.py contains:
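For illustration, here is a minimal sketch of the two counting approaches described above. It assumes the bs4 and lxml packages are installed. The sample document and the function names count_with_bs4 and count_with_lxml are hypothetical and are not the contents of the original test.py.

```python
from bs4 import BeautifulSoup
import lxml.html

# Hypothetical sample document, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Sample</title></head>
  <body>
    <div class="container" id="main">
      <p class="intro">Hello</p>
      <a href="/about" title="About">About</a>
    </div>
  </body>
</html>
"""


def count_with_bs4(html):
    """Return (tag_count, attribute_count) using BeautifulSoup."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() with no arguments returns every element, recursively.
    tags = soup.find_all()
    # tag.attrs is a dict of attribute name -> value for that element.
    return len(tags), sum(len(tag.attrs) for tag in tags)


def count_with_lxml(html):
    """Return (tag_count, attribute_count) using lxml.html."""
    root = lxml.html.fromstring(html)
    # iter() also yields comments and processing instructions;
    # real elements are the nodes whose .tag is a string.
    elements = [el for el in root.iter() if isinstance(el.tag, str)]
    return len(elements), sum(len(el.attrib) for el in elements)


if __name__ == "__main__":
    print("BeautifulSoup:", count_with_bs4(SAMPLE_HTML))
    print("lxml.html:", count_with_lxml(SAMPLE_HTML))
```

To compare speed on your own pages, one option is to wrap each call in timeit.timeit(...) from the standard library. The timings depend heavily on the parser you choose and on your setup.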