Scrape src attribute from Google with Beautiful Soup only

I'm trying to scrape Google Images. Beautiful Soup extracts the 'src' attribute, but the value it returns is a data URI such as data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw== rather than a link to the actual image. The script tag looks heavily encoded and doesn't seem to contain the actual URI. Can anybody suggest a solution?

This is in fact a tiny data URI which, when decoded, yields a 1x1 image. My question is how Google replaces the full image URI with this placeholder, and how I can access the full URI to get the actual image.

There are 3 best solutions below

Answer 1 (accepted)

That's the image in Base64 encoding. You can save it to an image file like this (Python 3):

import base64

# Base64 payload only: the part of src after "data:image/gif;base64,"
src = "BASE64 DATA"
with open("MyImage.gif", "wb") as img:
    img.write(base64.b64decode(src))
Answer 2

This is a data URL; see https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs

You can decode the Base64 string and then save it to an image file.
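
As a rough illustration of that decoding step, the sketch below splits the data URI from the question into its media type and payload, decodes the payload, and reads the dimensions from the GIF header. The helper names decode_data_uri and gif_size are made up for this example and only handle the simple "data:<type>;base64,<payload>" form.

```python
import base64
import struct

def decode_data_uri(uri):
    """Split a base64 data URI into (media type, decoded bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, base64.b64decode(payload)

def gif_size(data):
    """Read (width, height) from a GIF header: two little-endian 16-bit values at bytes 6-9."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    return struct.unpack("<HH", data[6:10])

# The src value quoted in the question.
src = "data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
media_type, data = decode_data_uri(src)
print(media_type, gif_size(data))
```

Running this on the src from the question shows why decoding alone does not help here: the payload is only a 1x1 placeholder GIF, not the search result image.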

Answer 3

Google Images results are inserted into the DOM by (thankfully) inline JavaScript. Open the page source of the search results for any query, copy an image's src attribute, and search for it in the page source.

To extract it with bs4 only, you can mimic the browser and extract data from inline JavaScript with regular expressions.

[Image: page source of Google Images results for the "stackoverflow" search query]
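
A minimal sketch of the bs4-plus-regex approach might look like the following. It assumes, purely for illustration, that each image entry in the inline JavaScript looks like ["<url>", <int>, <int>]; Google's markup is undocumented and changes, so the pattern ENTRY_RE, the helper names, and the sample fragment are all hypothetical and would need adapting to the real page source.

```python
import re

# Assumed (hypothetical) shape of an image entry in the inline JavaScript:
# ["<url>", <int>, <int>]. Adjust to whatever the real page source contains.
ENTRY_RE = re.compile(r'\["(https?://[^"]+)",\s*(\d+),\s*(\d+)\]')

def extract_image_urls(script_text):
    """Return the URLs of all entries matching ENTRY_RE, with JS unicode escapes decoded."""
    urls = []
    for match in ENTRY_RE.finditer(script_text):
        raw = match.group(1)
        # Inline JS often escapes characters, e.g. "=" written as \u003d.
        urls.append(re.sub(r"\\u([0-9a-fA-F]{4})",
                           lambda m: chr(int(m.group(1), 16)), raw))
    return urls

def inline_scripts(html):
    """Collect the text of every inline <script> tag using Beautiful Soup."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
    soup = BeautifulSoup(html, "html.parser")
    return [tag.string for tag in soup.find_all("script") if tag.string]

# Made-up fragment standing in for one inline script from the results page.
sample_script = r'cb({data:[["https://example.com/full/cat.jpg",800,600],["https://example.com/img?id\u003d42",183,275]]});'
print(extract_image_urls(sample_script))
```

On a real page you would fetch the HTML with a browser-like User-Agent header, pass it to inline_scripts, and run extract_image_urls over each script's text.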

Alternatively, you can use SerpApi to extract URIs of full images. It's a paid SaaS with a free trial.

Example usage with curl.

curl -s 'https://serpapi.com/search?q=coffee&tbm=isch'

Example usage with google-search-results Python package on Repl.it.

from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "coffee",
    "tbm": "isch",
    "api_key": os.getenv("API_KEY")
}

client = GoogleSearch(params)
data = client.get_dict()

print("Images results")

for result in data['images_results']:
    print(f"""
Position: {result['position']}
Original image: {result['original']}
""")

Example output

Images results

Position: 1
Original image: https://upload.wikimedia.org/wikipedia/commons/4/45/A_small_cup_of_coffee.JPG


Position: 2
Original image: https://media3.s-nbcnews.com/j/newscms/2019_33/2203981/171026-better-coffee-boost-se-329p_67dfb6820f7d3898b5486975903c2e51.fit-1240w.jpg

See the Google Images API documentation on the SerpApi website.

Disclaimer: I work at SerpApi.