I have a dataset with the following case class type:
case class AddressRawData(
  addressId: String,
  customerId: String,
  address: String
)
I want to convert it to:
case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int], // i.e. it is optional
  road: Option[String],
  city: Option[String],
  country: Option[String]
)
Using a parser function:
def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
  unparsedAddress.map(address => {
    val split = address.address.split(", ")
    address.copy(
      number = Some(split(0).toInt),
      road = Some(split(1)),
      city = Some(split(2)),
      country = Some(split(3))
    )
  })
}
I am new to Scala and Spark. Could anyone please let me know how this can be done?
You were on the right path! There are multiple ways of doing this, of course. But since you have already made some case classes and started on a parsing function, an elegant solution is to use the Dataset's `map` function.
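From the docs, the typed `map` signature is the following (quoting the Spark `Dataset` scaladoc, where `arg0` is the scaladoc's name for the implicit encoder parameter):

```scala
def map[U](func: (T) => U)(implicit arg0: Encoder[U]): Dataset[U]
```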
`T` is the starting type (`AddressRawData` in your case) and `U` is the type you want to get to (`AddressData` in your case). So the input of this `map` function is a function that transforms an `AddressRawData` into an `AddressData`. That could perfectly be the `addressParser` you've started making!

Now, your current `addressParser` has the signature `Seq[AddressData] => Seq[AddressData]`. In order to be able to feed it to that `map` function, we need to change its signature to `AddressRawData => AddressData`: it should parse one raw address at a time, and leave the mapping over the whole collection to Spark. Knowing all of this, we can work further! An example would be the following:
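A sketch of such a parser, reusing the two case classes from the question. The sample addresses are made up, and the `Dataset` wiring is shown in a comment so the snippet itself runs without a Spark session:

```scala
import scala.util.Try

case class AddressRawData(addressId: String, customerId: String, address: String)

case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)

// Parses one raw address; any piece that cannot be extracted becomes None
// instead of throwing (Try catches both the missing-index and the
// number-format failures).
def addressParser(raw: AddressRawData): AddressData = {
  val split = raw.address.split(", ")
  AddressData(
    addressId = raw.addressId,
    customerId = raw.customerId,
    address = raw.address,
    number = Try(split(0).toInt).toOption,
    road = Try(split(1)).toOption,
    city = Try(split(2)).toOption,
    country = Try(split(3)).toOption
  )
}

// With Spark, this function plugs straight into Dataset.map:
//   import spark.implicits._
//   val parsed: Dataset[AddressData] = rawDs.map(addressParser)

val good = addressParser(AddressRawData("1", "c1", "12, Main Street, London, UK"))
val bad  = addressParser(AddressRawData("2", "c2", "not a real address"))
println(good.number) // Some(12)
println(bad.city)    // None
```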
As you see, thanks to the fact that you had already foreseen that parsing can go wrong, it was easily possible to use `scala.util.Try` to try to extract the pieces of that raw address and add some robustness: where the `address` string could not be parsed, the optional fields simply hold `None` (which `show` displays as `null`). Hope this helps!