I am trying to parse the HTML of Google's image search results and extract the original URLs of the images.
So far I have managed to write Python code that fetches the HTML of Google's search page using Mechanize and BeautifulSoup.
Looking at the HTML source of Google's search results in a browser, I found that Google stores a double-encoded copy of the original image's URL in a div with class rg_meta, but the HTML I receive from Mechanize does not contain any such class. In fact, Mechanize is being served an entirely different page.
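To make "double-encoded" concrete: the value has been percent-encoded twice, so it has to be unquoted twice to recover the original URL. Below is a minimal sketch of that decoding step only. It is written for Python 3 (`urllib.parse`); in Python 2 the equivalent function is `urllib.unquote`. The helper name `decode_double_encoded` and the example.com URL are made up for illustration, and the exact markup Google uses around the value is not assumed here.

```python
from urllib.parse import quote, unquote


def decode_double_encoded(value):
    """Undo two rounds of percent-encoding (hypothetical helper)."""
    return unquote(unquote(value))


# Build a double-encoded sample from a made-up URL so the round trip is visible.
original = "http://example.com/images/car 1.jpg?size=large"
double_encoded = quote(quote(original, safe=""), safe="")

print(double_encoded)                         # '%' itself shows up as '%25'
print(decode_double_encoded(double_encoded))  # back to the original URL
```

Unquoting only once would leave sequences such as `%3A` and `%2F` in the result, which is an easy way to spot a value that was encoded twice.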
I am aware of Google's image search APIs, but I need to parse the HTML this way. What am I doing wrong? Can I make Mechanize identify itself as Chrome or another browser?
Here is a snippet of what I tried. It returns nothing:
import urllib
import mechanize
from bs4 import BeautifulSoup
from urlparse import urlparse
search = "cars"
browser = mechanize.Browser()
browser.set_proxies({"https": "10.0.2.88:3128"})
browser.set_handle_robots(False)
browser.addheaders = [('User-agent','Mozilla')]
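# URL-encode the search term so multi-word queries produce a valid URL
query = urllib.quote_plus(search)
html = browser.open("https://www.google.co.in/search?source=lnms&tbm=isch&sa=X&q=" + query + "&oq=" + query)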
htmltext=html.read()
print htmltext
img_urls = []
formatted_images = []
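# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(htmltext, "html.parser")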
#results = soup.findAll("a")
results = soup.findAll("div", { "class" : "rg_meta" })
print results
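On the question of presenting the client as Chrome: the snippet above sends only `Mozilla` as the User-agent, and a server may treat that differently from a complete browser string. A minimal sketch of building a request with a full Chrome-style User-Agent follows. It uses Python 3's standard library (`urllib.request`, the successor to Python 2's `urllib2`) so it runs without extra packages. The User-Agent string is just an example value, `build_image_search_request` is a hypothetical helper, and whether Google then returns the rg_meta markup is an assumption that would need checking against the live response.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Example desktop Chrome User-Agent string. This exact value is an assumption;
# any current browser string could be substituted.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_image_search_request(query, user_agent=CHROME_UA):
    """Build (but do not send) an image search request with a browser-like header."""
    # Same query parameters as the snippet above, with the term URL-encoded.
    params = urlencode(
        {"source": "lnms", "tbm": "isch", "sa": "X", "q": query, "oq": query}
    )
    url = "https://www.google.co.in/search?" + params
    return Request(url, headers={"User-Agent": user_agent})


request = build_image_search_request("vintage cars")
print(request.full_url)
print(request.get_header("User-agent"))

# Sending it would look like this (left commented out, since it needs network access):
# from urllib.request import urlopen
# htmltext = urlopen(request).read()
```

With Mechanize, the same string could be placed in the existing `browser.addheaders = [('User-agent', ...)]` line instead of `'Mozilla'`.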
Thanks for trying, but I had to switch to urllib2 to solve this problem. The following code parses the Google search page for image links.