I have a Dataset with the following case class type:
  case class AddressRawData(
    addressId: String,
    customerId: String,
    address: String
  )
I want to convert it to:
  case class AddressData(
    addressId: String,
    customerId: String,
    address: String,
    number: Option[Int], // i.e. it is optional
    road: Option[String],
    city: Option[String],
    country: Option[String]
  )
Using a parser function:
  def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData] = {
    unparsedAddress.map(address => {
      val split = address.address.split(", ")
      address.copy(
        number = Some(split(0).toInt),
        road = Some(split(1)),
        city = Some(split(2)),
        country = Some(split(3))
      )
    }
    )
  }
I am new to Scala and Spark. Could anyone please let me know how this can be done?
 
                        
You were on the right path! There are of course multiple ways of doing this, but since you have already defined some case classes and started writing a parsing function, an elegant solution is to use the Dataset's map function. From the docs, the signature of this map function is the following:
  def map[U](func: (T) => U)(implicit arg0: Encoder[U]): Dataset[U]
where T is the starting type (AddressRawData in your case) and U is the type you want to get to (AddressData in your case). So the input of this map function is a function that transforms an AddressRawData into an AddressData. That could perfectly well be the addressParser you've started making!
Now, your current addressParser has the following signature:
  def addressParser(unparsedAddress: Seq[AddressData]): Seq[AddressData]
In order to be able to feed it to that map function, we need to give it this signature:
  def addressParser(unparsedAddress: AddressRawData): AddressData
Knowing all of this, we can work further! An example would be the following:
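A minimal sketch of such a per-record parser, assuming the address string has the form "number, road, city, country" as in the question's split logic. The wrapping object name AddressParsing is only illustrative, and the Spark call in the trailing comment assumes a SparkSession named spark and a Dataset[AddressRawData] named rawDs, neither of which appears in the question:

```scala
import scala.util.Try

case class AddressRawData(addressId: String, customerId: String, address: String)

case class AddressData(
  addressId: String,
  customerId: String,
  address: String,
  number: Option[Int],
  road: Option[String],
  city: Option[String],
  country: Option[String]
)

// Hypothetical wrapper object, used here only to hold the function.
object AddressParsing {

  // Converts ONE raw record into ONE parsed record, which is the
  // AddressRawData => AddressData shape that Dataset.map expects.
  def addressParser(unparsedAddress: AddressRawData): AddressData = {
    val split = unparsedAddress.address.split(", ")
    AddressData(
      addressId = unparsedAddress.addressId,
      customerId = unparsedAddress.customerId,
      address = unparsedAddress.address,
      // Try(...).toOption yields None when the piece is missing
      // (index out of bounds) or, for the number, is not an integer.
      number = Try(split(0).toInt).toOption,
      road = Try(split(1)).toOption,
      city = Try(split(2)).toOption,
      country = Try(split(3)).toOption
    )
  }
}

// With Spark on the classpath, the same function can be passed to map
// (assumed names: spark, rawDs):
//   import spark.implicits._   // supplies the Encoder[AddressData]
//   val parsedDs = rawDs.map(AddressParsing.addressParser)
```

Because each field is wrapped separately, a record whose address cannot be split still produces a row: the fields that could not be extracted are None, which Spark should display as null when the Dataset is shown.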
As you see, thanks to the fact that you had already foreseen that parsing can go wrong, it was easy to use scala.util.Try to try to get the pieces of that raw address and add some robustness in there (the second line contains some null values where it could not parse the address string).
Hope this helps!