Python, Special case to extract hostnames?

88 Views Asked by Omer At 17 January 2024 at 10:52

I've seen similar problems but mine is quite different and I can't solve it alone.

Given a line of text, I want to extract only the hostname. Below are the input -> output pairs I expect:

Some Cool Text google.ru.ts -> google.ru.ts

Ignore google.com Ignore -> google.com

google.com/sign_in.htm -> google.com

13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/ -> 13.59.135.97

I found a regular expression that matches hostnames, but it has some problems:

hostname_pattern = re.compile(r'(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b')

For the third example, I get in.htm in the output (in addition to google.com).

For the fourth one, it returns fqhw5-6k88r-dgufy.view.
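A minimal script that reproduces both problems, assuming the pattern is applied with re.findall (the question does not say which call was used):

```python
import re

# The pattern from the question, unchanged.
hostname_pattern = re.compile(
    r'(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b'
)

samples = [
    "google.com/sign_in.htm",
    "13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/",
]

for line in samples:
    # With one capture group, findall returns that group for every match.
    print(line, "->", hostname_pattern.findall(line))
```

The first line yields both google.com and in.htm; the second yields only fqhw5-6k88r-dgufy.view, because the final part of the pattern accepts letters only and so never matches the IP address.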

How can I fix this?


There are 3 best solutions below

volkanncicek On 17 January 2024 at 11:08

You can use this regex:

hostname_pattern = re.compile(r'(?:https?://)?(?:www\.)?((?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}|[0-9]{1,3}(?:\.[0-9]{1,3}){3})(?::[0-9]+)?(?=[/\s]|$)')

I tried the examples and got these results:

Some Cool Text google.ru.ts -> google.ru.ts
Ignore google.com Ignore -> google.com
google.com/sign_in.htm -> google.com
13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/ -> 13.59.135.97
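A short sketch for checking these results, assuming only the first match in each line is wanted. first_hostname is a hypothetical helper name, not something from the question:

```python
import re

# Same pattern as above, split across lines for readability.
hostname_pattern = re.compile(
    r'(?:https?://)?(?:www\.)?'
    r'((?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}|[0-9]{1,3}(?:\.[0-9]{1,3}){3})'
    r'(?::[0-9]+)?(?=[/\s]|$)'
)

def first_hostname(line):
    """Return the first hostname or IPv4-like match in line, or None."""
    match = hostname_pattern.search(line)
    return match.group(1) if match else None

samples = [
    "Some Cool Text google.ru.ts",
    "Ignore google.com Ignore",
    "google.com/sign_in.htm",
    "13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/",
]

for line in samples:
    print(line, "->", first_hostname(line))
```

Using search (first match only) matters here: findall with the same pattern still reports in.htm for the third sample, because that token is also followed by the end of the line.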

eternal_white On 17 January 2024 at 11:16

I've prefixed the regex with a \b, so it'll look like this:

hostname_pattern = re.compile(r'\b(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b')

Now it's not going to match in.htm, which solves the third sample. For the fourth sample, the last character class only allows letters, so it can't match the numeric end of an IP address; adding the 0-9 range (and the dot) to it fixes that. So the final regex would look like this:

r'\b(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z0-9.]{2,})\b'

Here's my test code:

import re

# Keep only the input half of each "input -> output" pair.
inputs = [i.split('->')[0].strip() for i in """    Some Cool Text google.ru.ts -> google.ru.ts
    Ignore google.com Ignore -> google.com
    google.com/sign_in.htm -> google.com
    13.1234.1.321/wp-includes/fqhw5-6k88r-dgufy.view/ -> 13.59.135.97""".split('\n')]

regex = r'\b(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z0-9.]{2,})\b'
for i in inputs:
    print(re.findall(regex, i))

This is the output:

['google.ru.ts']
['google.com']
['google.com']
['13.1234.1.321', 'fqhw5-6k88r-dgufy.view']

Now it still returns the other 'match' for the fourth case, but you can just take the first element from the returned list and you're good.
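Taking the first element could be wrapped up like this. It's only a sketch, and first_match is a made-up helper name; it returns None when nothing matches so an empty list doesn't raise an IndexError:

```python
import re

regex = re.compile(r'\b(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z0-9.]{2,})\b')

def first_match(line):
    """Return the first captured hostname in line, or None if there is none."""
    matches = regex.findall(line)
    return matches[0] if matches else None

print(first_match("google.com/sign_in.htm"))
print(first_match("13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/"))
```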

vgenovpy On 17 January 2024 at 14:12

I built a regex on regex101.com.

Here it is; it works at least for the cases you submitted:

\d{2,3}\.\d{2,3}\.\d{2,3}\.\d{2,3}| [a-zA-Z1-9\.]{1,}\.\w{2,3}|^[a-zA-Z1-9\.]{1,}\.\w{2,3}

You can check that here.

I've also written a small Python script that demonstrates this:

import re

urls = ["google.com/sign_in.html",
"Ignore google.com Ignore",
"Some Cool Text google.ru.ts",
"13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/"]

# Raw string, so the backslashes reach the regex engine unchanged.
pattern = r"\d{2,3}\.\d{2,3}\.\d{2,3}\.\d{2,3}| [a-zA-Z1-9\.]{1,}\.\w{2,3}|^[a-zA-Z1-9\.]{1,}\.\w{2,3}"

for url in urls:
    host = re.findall(pattern, url)
    if host:
        print(host[0].strip())
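One caveat: the pattern above requires 2-3 digits in every IP octet and leaves 0 out of the name character class, so inputs outside the submitted samples can slip through. The sketch below compares it with a slightly widened variant; the widened pattern and the two extra inputs are my own assumptions, not part of the question, and neither pattern validates that octets are in the 0-255 range.

```python
import re

# Pattern from this answer, written as a raw string.
original = r"\d{2,3}\.\d{2,3}\.\d{2,3}\.\d{2,3}| [a-zA-Z1-9\.]{1,}\.\w{2,3}|^[a-zA-Z1-9\.]{1,}\.\w{2,3}"

# Hypothetical widened variant: octets of 1-3 digits, and 0 allowed in names.
widened = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}| [a-zA-Z0-9.]+\.\w{2,3}|^[a-zA-Z0-9.]+\.\w{2,3}"

extra_inputs = ["8.8.8.8/dns-query", "example10.com/index.html"]

for text in extra_inputs:
    # Show what each pattern finds for the same input.
    print(text, re.findall(original, text), re.findall(widened, text))
```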
