I am using Google's Guava library to get the TLD and suffix from domain names. My implementation looks like this:
    def getTopPrivateDomain(urlString: String): String = {
      try {
        val domain = InternetDomainName.from(urlString).topPrivateDomain().toString
        println("domain from url: ", urlString + " is: " + domain)
        domain
      } catch {
        case e: Exception =>
          println("Exception occured ", e)
          val domain = urlString.split("\\.").takeRight(2).mkString(".")
          println("after exception" + domain)
          domain
      }
    }

    val hostedExtractUDF = udf((urlString: String) => getTopPrivateDomain(urlString))
I call it from my project like this:
    filteredRecords = filtered
      .withColumn("suffix", hostedExtractUDF(col("fullyQualifiedDomainName")))
For the domain "remotedesktop-pa.googleapis.com", this is the output when I run it from IntelliJ:
    (domain from url: ,remotedesktop-pa.googleapis.com is: remotedesktop-pa.googleapis.com)
If I run the same function, getTopPrivateDomain, in spark-shell and pass it the same domain, I get a different answer:
    def getTopPrivateDomain(urlString: String): String = {
      try {
        val domain = InternetDomainName.from(urlString).topPrivateDomain().toString
        println("domain from url: ", urlString + " is: " + domain)
        domain
      } catch {
        case e: Exception =>
          println("Exception occured ", e)
          val domain = urlString.split("\\.").takeRight(2).mkString(".")
          println("after exception" + domain)
          domain
      }
    }

    // Exiting paste mode, now interpreting.

    getTopPrivateDomain: (urlString: String)String

    scala> println(getTopPrivateDomain("remotedesktop-pa.googleapis.com"))
    (domain from url: ,remotedesktop-pa.googleapis.com is: InternetDomainName{name=googleapis.com})
    InternetDomainName{name=googleapis.com}

    scala>
The two runs give different results for the same input. What could be the reason? I believe the output from spark-shell is the correct one.

EDIT: The Guava version I am using is:
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>33.0.0-jre</version>
    </dependency>
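
The spark-shell output is printed as `InternetDomainName{name=...}` while the IntelliJ output is the bare name, so the two environments may not be loading the same Guava jar. That is only my assumption. A minimal sketch like the one below could be run in both places to compare where the class comes from. `GuavaCheck` and `codeSourceOf` are hypothetical names I made up for this check; they are not part of Guava or Spark.

```scala
// Hypothetical diagnostic helper: reports the jar or directory a class was loaded from.
object GuavaCheck {
  def codeSourceOf(cls: Class[_]): String =
    Option(cls.getProtectionDomain.getCodeSource) // null for bootstrap classes
      .flatMap(cs => Option(cs.getLocation))      // location may also be null
      .map(_.toString)
      .getOrElse("<bootstrap or unknown>")
}

// Usage, assuming Guava is on the classpath in the environment being checked:
// println(GuavaCheck.codeSourceOf(classOf[com.google.common.net.InternetDomainName]))
```

If the two environments print different locations, the version declared in the pom is not the one in effect everywhere.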