I am trying to crawl a GitHub commit page to do some analysis. The page is here
However, the page contains two div elements with the class "js-diff-progressive-container", and each one has many child elements. See below
When I fetch the HTML with urllib2.Request() and urllib2.urlopen() and parse it with BeautifulSoup, I only get the first "js-diff-progressive-container" element and its children. In place of the second one I get an element whose class is "js-diff-progressive-retry". The parsing code is here:
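    # soup is the BeautifulSoup object built from the fetched page;
    # changed_class is a list defined earlier that collects the changed file paths.
    for tag in soup.find_all('div', class_='js-diff-progressive-container'):
        print 1  # debug marker: printed once per container that was found
        for div in tag.find_all('div'):
            id = div.get('id')
            if id:
                id = id.split('-')
                print id
                # Only divs whose id starts with "diff-" describe a changed file.
                if id[0] == 'diff':
                    div2 = div.find_all('div')
                    class_div = div2[0]
                    if class_div.get('data-path'):
                        changed_class.append(class_div.get('data-path'))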
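To make the symptom reproducible without network access, here is a minimal sketch that runs over a small hand-written HTML sample. The sample markup, the file paths in it, and the names `SAMPLE_HTML`, `DiffScanner` and `scan` are all assumptions made up for illustration; only the class names and the `data-path` attribute come from the page described above. It is written for Python 3 and uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages, and it simplifies the loop above by collecting every div that carries a `data-path` attribute.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the fetched page. The structure is an assumption
# that only mirrors the class names and attributes mentioned in the question.
SAMPLE_HTML = """
<div class="js-diff-progressive-container">
  <div id="diff-0"><div data-path="src/A.java"></div></div>
  <div id="diff-1"><div data-path="src/B.java"></div></div>
</div>
<div class="js-diff-progressive-retry">Loading failed</div>
"""


class DiffScanner(HTMLParser):
    """Counts container and retry divs and collects data-path values."""

    def __init__(self):
        super().__init__()
        self.containers = 0      # divs with class js-diff-progressive-container
        self.retries = 0         # divs with class js-diff-progressive-retry
        self.changed_paths = []  # data-path values seen on any div

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if 'js-diff-progressive-container' in classes:
            self.containers += 1
        if 'js-diff-progressive-retry' in classes:
            self.retries += 1
        if attrs.get('data-path'):
            self.changed_paths.append(attrs['data-path'])


def scan(html):
    """Feed an HTML string to a fresh scanner and return it."""
    scanner = DiffScanner()
    scanner.feed(html)
    scanner.close()
    return scanner


result = scan(SAMPLE_HTML)
print(result.containers, result.retries, result.changed_paths)
```

A non-zero `retries` count next to a single container matches what I see on the real page: the second container is missing from the HTML that was downloaded, and a retry placeholder is there instead.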
Someone told me that I cannot get all of the HTML at once because this element is loaded dynamically. How can I get the HTML of the whole page?