re2 Regex for extracting domain


In Google Sheets, I am trying to extract domain names from URLs, which may (or may not) have http://, https://, and a subdomain (whether www. or something else like mail.). For example:

Google Sheets uses the re2 syntax for regular expressions. The best I've been able to do so far is: http[s]?://www.([[:alpha:]]*.[[:alpha:]]*)

This regular expression doesn't work for three main scenarios I've encountered:

  • when the "http(s)://" prefix is missing (I'm also having trouble figuring out how to make the subdomain, such as www. or mail., optional)
  • when the domain name has a non-alphanumeric character, like a dash
  • when the top-level domain has multiple levels, like .co.uk
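
The three scenarios can be reproduced outside Sheets. The sketch below is a rough Python stand-in, not the Sheets formula itself: Python's re module does not support the POSIX class [[:alpha:]], so it is swapped for [A-Za-z] here, and the sample URLs and the helper name attempt are made up for illustration. It uses re.search on the assumption that REGEXEXTRACT returns the first match found anywhere in the text.

```python
import re

# The pattern from the question, with [[:alpha:]] rewritten as [A-Za-z]
# (assumption: equivalent for ASCII input). The dots are left unescaped,
# exactly as in the original attempt, so each one matches any character.
ATTEMPT = re.compile(r"http[s]?://www.([A-Za-z]*.[A-Za-z]*)")


def attempt(url):
    """Return the captured group, or None when the pattern does not match."""
    match = ATTEMPT.search(url)
    return match.group(1) if match else None


# Hypothetical sample URLs, one per scenario described above.
samples = [
    "https://www.example.com/page",  # works: "example.com"
    "example.com",                   # no scheme: no match
    "https://mail.example.com",      # subdomain other than www: no match
    "https://www.my-site.com",       # dash: truncated to "my-site"
    "https://www.example.co.uk",     # multi-level TLD: truncated to "example.co"
]

for url in samples:
    print(url, "->", attempt(url))
```

Note that the dash case does not fail outright: the unescaped dot matches the dash, so the capture stops early and the top-level domain is lost.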

Best answer

Use:

=INDEX(IFERROR(REGEXEXTRACT(A1:A, "^(?:https?:\/\/)?(?:ftp:\/\/)?(?:www\.)?([^\/]+)")))
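
To see what this pattern captures, here is a minimal Python sketch that applies the same regular expression to a few made-up URLs. It is only an approximation of the Sheets behavior: Python's re module stands in for RE2, the helper name extract_host is hypothetical, and returning an empty string on no match is meant to mimic what IFERROR does in the formula.

```python
import re

# Same pattern as in the formula: optional http(s)://, optional ftp://,
# optional "www.", then everything up to the next slash is captured.
PATTERN = re.compile(r"^(?:https?:\/\/)?(?:ftp:\/\/)?(?:www\.)?([^\/]+)")


def extract_host(url):
    """Return the captured host, or "" when nothing matches (like IFERROR)."""
    match = PATTERN.search(url)
    return match.group(1) if match else ""


# Hypothetical inputs covering the cases raised in the question.
samples = [
    "https://www.example.com/page",   # "example.com"
    "example.com",                    # "example.com"
    "https://www.my-site.co.uk/",     # "my-site.co.uk"
    "http://mail.example.com/inbox",  # "mail.example.com"
    "ftp://www.example.org",          # "example.org"
    "",                               # ""
]

for url in samples:
    print(repr(url), "->", repr(extract_host(url)))
```

Under these assumptions the pattern copes with a missing scheme, dashes, and multi-part suffixes such as .co.uk, because it captures every character up to the first slash. It strips only a leading www., so other subdomains such as mail. remain in the result.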
