Python, Special case to extract hostnames?


I've seen similar problems but mine is quite different and I can't solve it alone.

Given a line of text, I want to extract only the hostname. Below are the input -> output pairs I expect:

Some Cool Text google.ru.ts -> google.ru.ts

Ignore google.com Ignore -> google.com

google.com/sign_in.htm -> google.com

13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/ -> 13.59.135.97

I found a regex that matches hostnames, but it has some problems:

hostname_pattern = re.compile(r'(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b')

For the third example, I get in.htm in the output (in addition to google.com).

For the fourth one, it returns fqhw5-6k88r-dgufy.view.

How can I fix this?


2
volkanncicek On

You can use this regex:

hostname_pattern = re.compile(r'(?:https?://)?(?:www\.)?((?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}|[0-9]{1,3}(?:\.[0-9]{1,3}){3})(?::[0-9]+)?(?=[/\s]|$)')

I tried the examples, and these are the results:

  1. Some Cool Text google.ru.ts -> google.ru.ts
  2. Ignore google.com Ignore -> google.com
  3. google.com/sign_in.htm -> google.com
  4. 13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/ -> 13.59.135.97
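The results above appear to rely on taking only the first match in each line. A minimal sketch of that usage follows; the helper name extract_hostname is hypothetical and not part of the answer. It uses re.search, which stops at the first match. re.findall with the same pattern would also pick up later tokens such as in.htm.

```python
import re

# Pattern copied from the answer above; group 1 holds the hostname or IPv4 address.
hostname_pattern = re.compile(
    r'(?:https?://)?(?:www\.)?'
    r'((?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}|[0-9]{1,3}(?:\.[0-9]{1,3}){3})'
    r'(?::[0-9]+)?(?=[/\s]|$)'
)


def extract_hostname(line):
    """Return the first hostname-like token in line, or None (hypothetical helper)."""
    match = hostname_pattern.search(line)
    return match.group(1) if match else None


# The four sample inputs from the question.
samples = [
    "Some Cool Text google.ru.ts",
    "Ignore google.com Ignore",
    "google.com/sign_in.htm",
    "13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/",
]

for sample in samples:
    print(sample, "->", extract_hostname(sample))
```

If a line can contain several hostnames and you need all of them, this first-match approach is not enough and the pattern would need a stricter left boundary.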
6
eternal_white On

I've prefixed the regex with a \b, so it'll look like this:

hostname_pattern = re.compile(r'\b(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b')

With the leading \b, it no longer matches in.htm, which solves the third sample. For the fourth sample, the final character class in your pattern does not include the digits 0-9, so an IP address cannot match. Adding the digits (and the dot) to that class gives the final regex:

r'\b(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z0-9.]{2,})\b'

Here's my test code:

import re

# Keep only the input half of each "input -> expected" sample line.
inputs = [line.split('->')[0].strip() for line in """    Some Cool Text google.ru.ts -> google.ru.ts
    Ignore google.com Ignore -> google.com
    google.com/sign_in.htm -> google.com
    13.1234.1.321/wp-includes/fqhw5-6k88r-dgufy.view/ -> 13.59.135.97""".split('\n')]

regex = r'\b(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z0-9.]{2,})\b'
for line in inputs:
    # findall returns the captured group for every match in the line
    print(re.findall(regex, line))

This is the output:

['google.ru.ts']
['google.com']
['google.com']
['13.1234.1.321', 'fqhw5-6k88r-dgufy.view']

It still returns a second, unwanted match for the fourth case, but the hostname is always the first element of the returned list, so you can take that one and ignore the rest.
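Taking the first element can be wrapped in a small function. This is only a sketch: the name first_hostname is hypothetical, and it assumes the hostname always appears before any path segment in the line, which holds for the four samples in the question.

```python
import re

# Final pattern from the answer above, compiled once.
regex = re.compile(r'\b(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z0-9.]{2,})\b')


def first_hostname(line):
    """Return the first captured match in line, or None if nothing matches."""
    matches = regex.findall(line)
    return matches[0] if matches else None


# The four sample inputs from the question.
for line in [
    "Some Cool Text google.ru.ts",
    "Ignore google.com Ignore",
    "google.com/sign_in.htm",
    "13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/",
]:
    print(first_hostname(line))
```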

0
vgenovpy On

I built a regex on regex101.com.

Here it is. It works at least for the cases you submitted:

\d{2,3}\.\d{2,3}\.\d{2,3}\.\d{2,3}| [a-zA-Z1-9\.]{1,}\.\w{2,3}|^[a-zA-Z1-9\.]{1,}\.\w{2,3}

You can check that here.

I've also written a short Python script that demonstrates this:

import re

urls = ["google.com/sign_in.html",
"Ignore google.com Ignore",
"Some Cool Text google.ru.ts",
"13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/"]

pattern = r"\d{2,3}\.\d{2,3}\.\d{2,3}\.\d{2,3}| [a-zA-Z1-9\.]{1,}\.\w{2,3}|^[a-zA-Z1-9\.]{1,}\.\w{2,3}"

for url in urls:
    host = re.findall(pattern, url)
    if host:
        print(host[0].strip())
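A quick check with two hypothetical inputs (not taken from the question) suggests where this pattern stops working: every IP octet must have two or three digits, and the part after the last dot is limited to three characters. The sketch below only probes those two limits and does not propose a replacement pattern.

```python
import re

# Pattern copied from the answer above, written as a raw string.
pattern = r"\d{2,3}\.\d{2,3}\.\d{2,3}\.\d{2,3}| [a-zA-Z1-9\.]{1,}\.\w{2,3}|^[a-zA-Z1-9\.]{1,}\.\w{2,3}"

# Hypothetical inputs chosen to probe the limits; they are not from the question.
single_digit_octets = re.findall(pattern, "8.8.8.8/index.html")
long_tld = re.findall(pattern, "example.info/page")

print(single_digit_octets)  # nothing is found: one-digit octets are rejected
print(long_tld)             # the part after the last dot is cut to three characters
```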