I'm trying to write or use URL validation in Python by analyzing the string alone (no HTTP requests), but I keep hitting edge cases with the different solutions I tried.
After looking at Django's URLValidator, I still have some edge cases that are misclassified:
import re

def is_url_valid(url: str) -> bool:
    # from Django's URLValidator
    url_pattern = re.compile(
        r'^(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
    return bool(url_pattern.match(url))
We still get:
>>> is_url_valid('contact.html')
True
Other approaches we tried:
validators (Python package), recommended by this SO Q&A:
>>> import validators
>>> validators.url("firespecialties.com/bullseyetwo.html")  # this is a valid url
ValidationFailure(func=url, args={'value': 'firespecialties.com/bullseyetwo.html', 'public': False})
From this "validating URLs in Python" SO Q&A: while urllib.parse.urlparse('contact.html') correctly assesses it as a path, it fails with urllib.parse.urlparse('www.images.com/example.html'):
>>> from urllib.parse import urlparse
>>> urlparse('www.images.com/example.html')
ParseResult(scheme='', netloc='', path='www.images.com/example.html', params='', query='', fragment='')
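As far as I can tell, the result above happens because urlparse only recognises a network location when it is introduced by '//', so a scheme-less string ends up entirely in path. A minimal sketch of working around that, assuming scheme-less input is meant as a host-first URL; the helper name parse_with_default_scheme and the "http" default are hypothetical, not part of urllib:

```python
from urllib.parse import ParseResult, urlparse


def parse_with_default_scheme(url: str, default_scheme: str = "http") -> ParseResult:
    """Parse `url`, prepending a scheme when none is present.

    Hypothetical helper: treating scheme-less input as host-first is an
    assumption about intent, not something urlparse can decide for us.
    The '://' test is a simplification and is fooled by scheme-less input
    that carries a URL in its query string.
    """
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    return urlparse(url)


parsed = parse_with_default_scheme("www.images.com/example.html")
print(parsed.netloc)  # www.images.com
print(parsed.path)    # /example.html
```

Note that the same helper turns 'contact.html' into netloc='contact.html' with an empty path, so on its own it does not separate a bare filename from a bare domain.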
Adapting logic from this JavaScript SO Q&A