I've seen similar questions, but my case is different and I can't solve it on my own.
Given a line of text, I want to extract only the hostname. Below are examples of the input -> output pairs I expect:
Some Cool Text google.ru.ts -> google.ru.ts
Ignore google.com Ignore -> google.com
google.com/sign_in.htm -> google.com
13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/ -> 13.59.135.97
I found a regular expression that matches hostnames, but it has some problems:
hostname_pattern = re.compile(r'(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b')
For the third example, I get in.htm in the output (in addition to google.com).
For the fourth one, it returns fqhw5-6k88r-dgufy.view instead of 13.59.135.97.
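For reference, here is a minimal reproduction of both cases. It assumes the pattern is applied with re.findall, which is a guess at how the output above was produced:

```python
import re

hostname_pattern = re.compile(r'(?:https?://)?(?:www\.)?([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})\b')

# Third example: the file name in the path also looks like "name.tld".
print(hostname_pattern.findall('google.com/sign_in.htm'))
# ['google.com', 'in.htm']

# Fourth example: the IP address has no alphabetic TLD, so only the path segment matches.
print(hostname_pattern.findall('13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/'))
# ['fqhw5-6k88r-dgufy.view']
```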
How can I fix this?
You can use this regex:
I tried it on the examples, and the results are as follows:
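As a sketch of one pattern that handles the four examples from the question (not necessarily the only or best option): the idea is to refuse any match that starts right after a word character, "/", "." or "-", so nothing inside a path is picked up, and to add a separate alternative for IPv4-looking addresses. The names HOSTNAME_RE and extract_hostname are made up for this example, and the IPv4 branch does not validate octet ranges.

```python
import re

# Hypothetical pattern for illustration; tuned only to the examples in the question.
HOSTNAME_RE = re.compile(
    r'(?<![\w/.-])'              # do not start inside a word or a URL path
    r'(?:https?://)?(?:www\.)?'  # optional scheme and "www." prefix (not captured)
    r'('
    r'(?:\d{1,3}\.){3}\d{1,3}'   # IPv4-looking address (octets are not range-checked)
    r'|'
    r'[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,}'  # dotted name with a letters-only TLD
    r')\b'
)


def extract_hostname(line):
    """Return the first hostname-like token in the line, or None if there is none."""
    match = HOSTNAME_RE.search(line)
    return match.group(1) if match else None


examples = [
    'Some Cool Text google.ru.ts',
    'Ignore google.com Ignore',
    'google.com/sign_in.htm',
    '13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/',
]
for text in examples:
    print(text, '->', extract_hostname(text))
# Some Cool Text google.ru.ts -> google.ru.ts
# Ignore google.com Ignore -> google.com
# google.com/sign_in.htm -> google.com
# 13.59.135.97/wp-includes/fqhw5-6k88r-dgufy.view/ -> 13.59.135.97
```

The lookbehind is what fixes both problems: in.htm follows "_" and fqhw5-6k88r-dgufy.view follows "/", so neither can start a match. Keep in mind that any free text shaped like name.tld (for example a bare file name such as notes.txt) would still match, so this only fits input similar to the examples.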