Web Scraping with Python / BeautifulSoup (bs4): <h1>Error 200 OK</h1>


I am using Python 3 and I am trying to simply download the content of a website as follows:

# IMPORTS --------------------------------------------------------------------
import urllib.request
from bs4 import BeautifulSoup as bs

# CLASS DESC -----------------------------------------------------------------
class Parser:

    # CONSTRUCTOR
    def __init__(self, url):
        self.soup = bs(urllib.request.urlopen(url).read(), "lxml")

    # METHODS
    def getMetaData(self):

        print(self.soup.prettify()[0:1000])

# MAIN FUNCTION --------------------------------------------------------------
if __name__ == "__main__":

    webSite = Parser("http://www.donnamoderna.com")
    webSite.getMetaData()

for which I am getting the following output:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
     <head>
        <title>
            200 OK
        </title>
    </head>
    <body>
        <h1>
            Error 200 OK
        </h1>
        <p>
            OK
        </p>
        <h3>
            Guru Meditation:
        </h3>
        <p>
            XID: 1815743332
        </p>
        <hr/>
        <p>
            Varnish cache server
        </p>
    </body>
</html>

and I don't understand why this is happening. It is not a proxy issue; running:

curl "http://www.donnamoderna.com" 

works just fine. I also tried the same code on a different website, e.g. https://www.google.com, and it works as expected. Could the problem be that this site uses plain HTTP rather than HTTPS? Should I change something in my code? Thanks.
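For anyone debugging a similar symptom: one likely difference between `curl` and the script is the `User-Agent` header. By default, `urllib` announces itself with a Python-specific identifier, which some servers (and caches such as Varnish) refuse to serve normally. A quick, offline way to see what `urllib` sends by default (a diagnostic sketch, not part of the fix):

```python
import urllib.request

# An OpenerDirector carries the default headers urllib attaches to every
# request; out of the box this includes a 'Python-urllib/X.Y' User-Agent.
opener = urllib.request.build_opener()
print(opener.addheaders)
```

If the printed User-Agent is `Python-urllib/...`, that is what the server sees, and it explains why `curl` (which sends `curl/...`) can behave differently.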

Accepted answer:

So it turns out that the server was detecting that my request did not come from a browser and was refusing to serve the real content. I resolved the issue by switching to the requests library and setting the request's User-Agent header, "confusing" the server into treating the request as one coming from a browser:

import requests
from bs4 import BeautifulSoup as bs

class Parser:

    # CONSTRUCTOR
    def __init__(self, url):

        # Necessary to make the server think that we are a browser
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/41.0.2227.1 Safari/537.36'}

        # Make request
        r = requests.get(url, headers=headers)

        # Create soup object
        self.soup = bs(r.content, 'html.parser')
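If you would rather not add the requests dependency, the same fix works with the standard library alone: `urllib.request.Request` accepts a `headers` dict, and the resulting object can be passed to `urllib.request.urlopen` exactly as in the original code. A sketch (the URL is the one from the question; no request is actually sent here):

```python
import urllib.request

# Same browser-like User-Agent as in the requests-based fix
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/41.0.2227.1 Safari/537.36'}

# Build a Request carrying the custom header; pass it to
# urllib.request.urlopen(req) where the original code passed the bare URL
req = urllib.request.Request("http://www.donnamoderna.com", headers=headers)

# Request stores header keys capitalized, so the lookup key is 'User-agent'
print(req.get_header('User-agent'))
```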