Python: URL parsing issue while adding a trailing slash


I was developing a small experiment in Python to normalize a URL. My main purpose is to add a slash / at the end of the URL if it is not already present. For example, http://www.example.com should be converted to http://www.example.com/

Here is a small snippet for the same:

if not url.endswith("/"):
    url = url + "/"

But this also converts file names. For example, http://www.example.com/image.png becomes http://www.example.com/image.png/, which is wrong. I just want to add a slash to directories and not to file names. How do I do this?

Thanks in advance!


There are 2 best solutions below

1
On BEST ANSWER

You could pattern match on the last substring to check for known domains vs file extensions. It's not too difficult to enumerate at least the basic top level domains like .com, .gov, .org, etc.

If you are familiar with regular expressions, you can match on a pattern like r'\.com$' (the dot is escaped so that it matches a literal '.').
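A minimal sketch of the regular-expression check, assuming a short hand-picked list of top-level domains. The names TLD_PATTERN and ends_with_known_tld are made up for illustration and the list is far from complete:

```python
import re

# Assumed, incomplete list of top-level domains; extend as needed.
TLD_PATTERN = re.compile(r"\.(com|org|gov)$")

def ends_with_known_tld(url):
    """Return True if the URL ends in one of the listed top-level domains."""
    return TLD_PATTERN.search(url) is not None

print(ends_with_known_tld("http://www.example.com"))            # True
print(ends_with_known_tld("http://www.example.com/image.png"))  # False
```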

Otherwise, you can split by '.' and check the last substring you get:

In [32]: url_png = 'http://www.example.com/image.png'

In [33]: url_com = 'http://www.example.com'

In [34]: domains = ['com', 'org', 'gov']

In [35]: for url in [url_png, url_com]:
   ....:     suffix = url.split('.')[-1]
   ....:     if suffix in domains:
   ....:         print(url)
   ....:
http://www.example.com

As a side note, and as the example above shows, you don't need url[len(url)-1] to get the last element of a sequence (here, the last character of a string); the Pythonic way is just url[-1].
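Applied to the original question, the suffix check might be wrapped up roughly like this. It is only a sketch: KNOWN_TLDS and add_slash_to_bare_domain are hypothetical names, and the TLD list is the same short, assumed one as above.

```python
# Assumed, incomplete list of top-level domains.
KNOWN_TLDS = {"com", "org", "gov"}

def add_slash_to_bare_domain(url):
    """Append '/' only when the text after the last '.' is a known TLD."""
    suffix = url.rsplit(".", 1)[-1]
    if suffix in KNOWN_TLDS and not url.endswith("/"):
        return url + "/"
    return url

print(add_slash_to_bare_domain("http://www.example.com"))
print(add_slash_to_bare_domain("http://www.example.com/image.png"))
```

Note that this only handles bare domains. A directory URL such as http://www.example.com/images is left unchanged, because the text after the last '.' is then 'com/images', which is not in the list.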

3
On

The idea is that, for a directory URL, every . should be in the hostname; if a . appears anywhere else, it belongs to a file name. So just do url.count('.') and check whether that is greater than the number of dots in your hostname (e.g., here it's 2):

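if url.count('.') > 2:
    # A dot outside the hostname: treat as a file name, so drop any trailing slash
    url = url[:-1] if url.endswith('/') else url
else:
    # Dots only in the hostname: treat as a directory, so ensure a trailing slash
    url = url if url.endswith('/') else url + '/'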
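The hard-coded 2 only works for hostnames like www.example.com. One way to avoid it, sketched here under the assumption that Python 3's urllib.parse is available, is to split the URL and look for a dot only in the last path segment rather than counting dots in the whole string. The function name normalize_trailing_slash is made up for this example, and unlike the snippet above it leaves file URLs untouched instead of stripping a trailing slash from them.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def normalize_trailing_slash(url):
    """Add a trailing slash to the path unless its last segment looks like a file."""
    parts = urlsplit(url)
    path = parts.path
    last_segment = posixpath.basename(path)
    if "." in last_segment:
        # Heuristic: a dot in the last path segment suggests a file name.
        return url
    if not path.endswith("/"):
        parts = parts._replace(path=path + "/")
    return urlunsplit(parts)

print(normalize_trailing_slash("http://www.example.com"))
print(normalize_trailing_slash("http://www.example.com/images"))
print(normalize_trailing_slash("http://www.example.com/image.png"))
```

This is still a heuristic: a directory whose name contains a dot (for example /v1.2) would be mistaken for a file, and an extensionless file would be treated as a directory.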