Scraping multiple URLs using BeautifulSoup

I am trying to scrape a website; however, I have not been able to adapt the code so that it accepts several URLs at once. Currently the code works with one URL at a time.

The current code is:

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
try:
    html = urlopen("http://google.com")
except HTTPError as e:
    print(e)
except URLError:
    print("error")
else:
    res = BeautifulSoup(html.read(),"html5lib")
    tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
    title = res.title.text
    print(title)
    for tag in tags:
      print(tag)

Could someone help me modify the code so that I can pass in several URLs, something like this?

html = urlopen ("url1, url2, url3") 
There are 2 answers below.

Wrap the repeatable parts of your code in a function and use a list:

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

def urlhelper(x):
    # x is a list of URL strings; each one is fetched and scraped in turn
    for ele in x:
        try:
            html = urlopen(ele)
        except HTTPError as e:
            print(e)
        except URLError:
            print("error")
        else:
            res = BeautifulSoup(html.read(), "html5lib")
            tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
            title = res.title.text
            print(title)
            for tag in tags:
                print(tag)

Invoke this function with urlhelper(["url1","url2","etc"])

The key concept to understand here is the for loop, which tells Python to iterate over each element in the list.
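
If you would rather get the scraped data back from the function instead of only printing it, a small variation might look like the sketch below (the function name urlhelper_collect and the returned (title, tags) pairs are my own additions, not part of the answer above):

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def urlhelper_collect(urls):
    # Hypothetical variant of urlhelper that returns results instead of printing them
    results = []
    for url in urls:
        try:
            html = urlopen(url)
        except (HTTPError, URLError) as e:
            print("skipping", url, ":", e)
            continue
        res = BeautifulSoup(html.read(), "html5lib")
        tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
        results.append((res.title.text, tags))
    return results

Called as data = urlhelper_collect(["url1", "url2", "url3"]), this gives you a list you can process further rather than output that only goes to the console.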

I recommend reading up on iterators and lists for more info:

https://www.w3schools.com/python/python_lists.asp

https://www.w3schools.com/python/python_iterators.asp


Alternatively, you can create a list of URLs and loop through it with a for loop, like this:

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

urlList = ["url1", "url2", "url3", "url4"]

for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        res = BeautifulSoup(html.read(),"html5lib")
        tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
        title = res.title.text
        print(title)
        for tag in tags:
          print(tag)
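
Neither answer uses the requests and pandas imports from the question. If you eventually want the results in a table rather than printed, a rough sketch along those lines could look like this (the switch to requests.get(), the raise_for_status() check, and the column names are my assumptions, not part of either answer):

import requests
import pandas as pd
from bs4 import BeautifulSoup

urlList = ["url1", "url2", "url3", "url4"]

rows = []
for url in urlList:
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()  # treat HTTP error status codes as failures
    except requests.RequestException as e:
        print("error fetching", url, ":", e)
        continue
    soup = BeautifulSoup(resp.text, "html5lib")
    tags = soup.find_all("div", {"itemtype": "http://schema.org/LocalBusiness"})
    for tag in tags:
        # Hypothetical column names; adjust to whatever fields you actually need
        rows.append({"url": url, "page_title": soup.title.text, "business_html": str(tag)})

df = pd.DataFrame(rows)
print(df)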