URL validation in Python - edge cases

I'm trying to write or use URL validation in Python by analyzing the string alone (no HTTP requests), but every solution I've tried misclassifies some edge cases.

After adapting the regex from Django's URLValidator, I still have some edge cases that are misclassified:

import re

def is_url_valid(url: str) -> bool:
    # Regex adapted from Django's URLValidator
    url_pattern = re.compile(
        r'^(?:http|ftp)s?://'  # scheme: http://, https://, ftp:// or ftps://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)  # optional path or query string

    return bool(url_pattern.match(url))

With this I still get:

>>> is_url_valid('contact.html')
True

Other approaches I tried:

  1. validators (Python package), recommended by this SO Q&A:

    >>> import validators
    >>> validators.url("firespecialties.com/bullseyetwo.html") # this is a valid url
    ValidationFailure(func=url, args={'value': 'firespecialties.com/bullseyetwo.html', 'public': False})
    
  2. urllib.parse.urlparse, from this "validating URLs in Python" SO Q&A. While urlparse('contact.html') correctly treats the string as a path, it fails on 'www.images.com/example.html', which it also parses as a path with an empty netloc:

    >>> from urllib.parse import urlparse
    >>> urlparse('www.images.com/example.html')
    ParseResult(scheme='', netloc='', path='www.images.com/example.html', params='', query='', fragment='')
    
  3. Adapting logic from this JavaScript SO Q&A
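
To make the edge cases above concrete, here is a minimal sketch of one possible string-only heuristic, not a definitive validator. It relies on the fact that urlparse only fills in netloc when the input contains "//", so scheme-less input gets a "//" prefix before parsing. The function name looks_like_url and the KNOWN_TLDS set are hypothetical and deliberately tiny; telling 'contact.html' apart from 'firespecialties.com/...' ultimately needs a real list of top-level domains.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately small set for illustration only.
# A real check would need a complete list of top-level domains.
KNOWN_TLDS = {"com", "org", "net", "io", "edu", "gov"}


def looks_like_url(text: str) -> bool:
    """Heuristic check on the string alone (no HTTP request)."""
    # Without "//", urlparse treats the whole string as a path,
    # so prefix scheme-less input to let it populate netloc.
    candidate = text if "://" in text else "//" + text
    try:
        parsed = urlparse(candidate)
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # for example an unbalanced "[" in the host part.
        return False

    host = parsed.hostname or ""
    labels = host.split(".")
    # Require at least two non-empty labels and a recognised last label.
    return len(labels) >= 2 and all(labels) and labels[-1] in KNOWN_TLDS
```

Under these assumptions, looks_like_url('firespecialties.com/bullseyetwo.html') and looks_like_url('www.images.com/example.html') return True, while looks_like_url('contact.html') returns False because 'html' is not in the TLD set. The result is only as good as that set, which is the core ambiguity in the question.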
