In Google Sheets, I am trying to extract domain names from URLs, which may (or may not) have http://, https://, and a subdomain (whether www. or something else like mail.). For example:
- http://www.google.com -> google.com
- mail.yahoo.com -> yahoo.com
- www.my-site.org -> my-site.org
Google Sheets uses the re2 syntax for regular expressions. The best I've been able to do so far is:
http[s]?://www.([[:alpha:]]*.[[:alpha:]]*)
This regular expression doesn't work for three main scenarios I've encountered:
- when the "http(s)://" prefix is missing (I'm also having trouble figuring out how to make the subdomain, such as www. or mail., optional as well)
- when the domain name has a non-alphanumeric character, like a dash
- when the top-level domain has multiple levels, like .co.uk
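
To make the three scenarios above concrete, here is a minimal sketch of one possible pattern, written in Python only so that it can be run and checked. It sticks to constructs that RE2 also supports (non-capturing groups, lazy quantifiers, no lookarounds or backreferences). The names `TWO_LEVEL_SUFFIXES`, `DOMAIN_RE`, and `extract_domain` are made up for illustration, and the short list of two-level suffixes is an assumption: a regular expression alone cannot know every suffix like .co.uk, so a complete solution would need something like the Public Suffix List.

```python
import re

# Illustrative, incomplete list of two-level suffixes (an assumption).
# Extend it with whatever suffixes actually occur in your data.
TWO_LEVEL_SUFFIXES = r"co\.uk|org\.uk|com\.au|co\.jp"

DOMAIN_RE = re.compile(
    r"^(?:https?://)?"      # optional http:// or https://
    r"(?:[a-z0-9-]+\.)*?"   # optional subdomains, matched lazily
    r"([a-z0-9-]+\.(?:" + TWO_LEVEL_SUFFIXES + r"|[a-z]{2,}))"  # domain + suffix
    r"(?:[/:?#]|$)"         # host ends at a path, port, query, fragment, or end of string
)

def extract_domain(url):
    """Return the registrable-looking domain of url, or None if nothing matches."""
    match = DOMAIN_RE.match(url.strip().lower())
    return match.group(1) if match else None

if __name__ == "__main__":
    for sample in ["http://www.google.com", "mail.yahoo.com",
                   "www.my-site.org", "https://www.bbc.co.uk/news"]:
        print(sample, "->", extract_domain(sample))
```

The lazy `*?` on the subdomain group is what makes `www.` or `mail.` optional: the engine consumes as few leading labels as it can while still letting the captured part reach the end of the host. Since the pattern uses nothing outside RE2's syntax, it should carry over to a Sheets formula along the lines of `=REGEXEXTRACT(LOWER(A1), "...")`, though I have only checked the Python version.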
use: