Derive protocol from URL


I have a list of URLs in a format such as ["www.bol.com", "www.dopper.com"]. In order to use them as start URLs in Scrapy, I need to know the correct HTTP protocol.

For example:

["https://www.bol.com/nl/nl/", "https://dopper.com/nl"]

As you can see, the protocol may be https or http, and the domain may appear with or without www.

I'm not sure whether there are other variations.

  1. Is there any Python tool that can determine the right protocol?
  2. If not, and I have to build the logic myself, what cases should I take into account?

For option 2, this is what I have so far:

import requests

def identify_protocol(url):
    # Try https first, then plain http, then https without the www prefix.
    candidates = (
        "https://" + url + "/",
        "http://" + url + "/",
        "https://" + url.replace("www.", "") + "/",
    )
    for candidate in candidates:
        try:
            r = requests.get(candidate, timeout=10)
            return r.url, r.status_code
        except requests.RequestException:
            continue
    return None, None

Is there any other possibility I should take into account?

There are 2 best solutions below.

BEST ANSWER

As I understand the question, you need to retrieve the final URL after all possible redirections. This can be done with the built-in urllib.request. If the provided URL has no scheme, you can use http as the default. To parse the input URL, I used a combination of urlsplit() and urlunsplit().

Code:

import urllib.request as request
import urllib.parse as parse

def find_redirect_location(url, proxy=None):
    # Rebuild the URL with sensible defaults: "http" when no scheme is
    # given, and a bare domain (which urlsplit parses as a path) used
    # as the network location.
    parsed_url = parse.urlsplit(url.strip())
    url = parse.urlunsplit((
        parsed_url.scheme or "http",
        parsed_url.netloc or parsed_url.path,
        parsed_url.path.rstrip("/") + "/" if parsed_url.netloc else "/",
        parsed_url.query,
        parsed_url.fragment
    ))

    if proxy:
        # Route both http and https requests through the given proxy.
        handler = request.ProxyHandler(dict.fromkeys(("http", "https"), proxy))
        opener = request.build_opener(handler, request.ProxyBasicAuthHandler())
    else:
        opener = request.build_opener()

    # The opener follows redirects automatically; response.url is the
    # final location after all of them.
    with opener.open(url) as response:
        return response.url

Then you can just call this function on every URL in the list:

urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(map(find_redirect_location, urls)) 

You can also use proxies:

from itertools import cycle

urls = ["bol.com ","www.dopper.com", "https://google.com"]
proxies = ["http://localhost:8888"]
final_urls = list(map(find_redirect_location, urls, cycle(proxies)))

To make it a bit faster, you can run the checks in parallel threads using ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor

urls = ["bol.com ","www.dopper.com", "https://google.com"]
final_urls = list(ThreadPoolExecutor().map(find_redirect_location, urls))

SECOND ANSWER

There is no way to determine the protocol or full domain from the fragment directly; the information simply isn't there. To find it, you would need either:

  1. a database of correct protocols/domains in which you can look up your domain fragment, or
  2. to make the request and see what the server tells you.

If you do (2), you can of course gradually build your own database to avoid needing the request in the future.
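
A minimal sketch of that idea, assuming a plain in-memory dict as the "database" (the helper name resolve_once and the cache are hypothetical, not part of the answer):

import requests

known = {}  # domain fragment -> final URL, grown as requests are made

def resolve_once(fragment):
    # Hypothetical helper: consult the cache first and only make the
    # network request for fragments we have not resolved before.
    if fragment not in known:
        r = requests.get("http://" + fragment.strip(), timeout=10)
        known[fragment] = r.url
    return known[fragment]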

Many https servers will redirect you to https if you attempt an http connection. If you are not redirected, you can reliably use http. If the http request fails entirely, you can try again with https and see if that works.

The same applies to the domain: if the site usually redirects, you can perform the request using the original domain and see where you are redirected.
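
A minimal sketch of that probing logic, assuming requests (the function name probe is hypothetical):

import requests

def probe(domain):
    # Try plain http first and let the server redirect us; fall back
    # to https only if the http attempt fails outright.
    try:
        return requests.get("http://" + domain, timeout=10).url
    except requests.RequestException:
        return requests.get("https://" + domain, timeout=10).url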

An example using requests:

>>> import requests
>>> r = requests.get('http://bol.com')
>>> r
<Response [200]>
>>> r.url
'https://www.bol.com/nl/nl/'

As you can see, the response object's url attribute holds the final destination URL, including the protocol.
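
Tying this back to the original goal, a hedged sketch of feeding the resolved URLs to Scrapy (it assumes every fragment resolves without errors; add handling as needed):

import requests

fragments = ["www.bol.com", "www.dopper.com"]
start_urls = [requests.get("http://" + f, timeout=10).url for f in fragments]
# start_urls can now be assigned to a scrapy.Spider's start_urls attribute.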