I have a list of URLs such as ["www.bol.com", "www.dopper.com"].
To use them as start URLs in Scrapy, I need to know the correct HTTP protocol.
For example:
["https://www.bol.com/nl/nl/", "https://dopper.com/nl"]
As you can see, the protocol may be https or http, and the host may appear with or without www. I'm not sure if there are any other variations.
- Is there any Python tool that can determine the correct protocol?
- If not, and I have to build the logic myself, what cases should I take into account?
For option 2, this is what I have so far:
import requests

def identify_protocol(url):
    try:
        r = requests.get("https://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        pass
    try:
        r = requests.get("http://" + url + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        pass
    try:
        # last resort: try the domain without the www. prefix
        r = requests.get("https://" + url.replace("www.", "") + "/", timeout=10)
        return r.url, r.status_code
    except requests.RequestException:
        return None, None
Is there any other possibility I should take into account?
As I understood the question, you need to retrieve the final URL after all possible redirections. That can be done with the built-in urllib.request. If the provided URL has no scheme, you can use http as the default. To parse the input URL I used a combination of urlsplit() and urlunsplit(). Code:
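A minimal sketch of the approach described, assuming a default http scheme and letting urlopen follow redirects (function names here are illustrative):

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

def normalize_url(url, default_scheme="http"):
    """Add default_scheme to a bare host like "www.bol.com"."""
    parts = urlsplit(url.strip())
    if not parts.scheme:
        parts = urlsplit(default_scheme + "://" + url.strip())
    return urlunsplit(parts)

def get_final_url(url, timeout=10):
    """Open the URL and return the address after all redirects."""
    with urlopen(normalize_url(url), timeout=timeout) as response:
        return response.url  # same value as response.geturl()
```

The server itself then decides whether to send you to https, add or drop www, and so on; the redirect chain ends at the canonical address.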
Then you can just call this function on every url in list:
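For example, mapping a redirect-following helper over the whole list (the helper is re-defined inline here so the snippet is self-contained; unreachable hosts yield None):

```python
from urllib.request import urlopen

def get_final_url(url, timeout=10):
    """Return the post-redirect URL, or None if the host is unreachable."""
    if "://" not in url:
        url = "http://" + url  # http as the default scheme
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.url
    except OSError:  # urllib.error.URLError subclasses OSError
        return None

urls = ["www.bol.com", "www.dopper.com"]
final_urls = {u: get_final_url(u) for u in urls}
```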
You can also use proxies:
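A sketch with urllib's ProxyHandler; the proxy addresses below are placeholders, substitute your own:

```python
from urllib.request import ProxyHandler, build_opener, install_opener

# placeholder proxy endpoints -- not real servers
proxies = {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"}
opener = build_opener(ProxyHandler(proxies))
install_opener(opener)  # later urlopen() calls now go through the proxy
# urlopen("http://www.bol.com", timeout=10) would now be proxied
```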
To make it a bit faster, you can run the checks in parallel threads using ThreadPoolExecutor:
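A sketch of the parallel version, with the same redirect-following helper re-defined so the snippet runs on its own:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def get_final_url(url, timeout=10):
    """Return the post-redirect URL, or None on any network error."""
    if "://" not in url:
        url = "http://" + url
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.url
    except OSError:
        return None

urls = ["www.bol.com", "www.dopper.com"]
# executor.map() keeps results in the same order as the input list
with ThreadPoolExecutor(max_workers=8) as executor:
    final_urls = list(executor.map(get_final_url, urls))
```

Threads suit this job because each check spends almost all of its time waiting on the network, not on the CPU.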