I wrote a Scala script to load an Avro file and work with the resulting data (to retrieve the top contributors). The problem is that loading the file gives me a Dataset that I cannot convert to a DataFrame because it contains some complex types:
val history_src = "path_to_avro_files\\frwiki*.avro"
val revisions_dataset = spark.read.format("avro").load(history_src)
// gives a dataset; we can view the data and call take(1) without problems
val first_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2), row.get(3).asInstanceOf[mutable.WrappedArray[Revision]].array
.map(x=> (x.r_contributor.r_username, x.r_contributor.r_contributor_id, x.r_contributor.r_contributor_ip)))).take(1)
// fails with: GenericRowWithSchema cannot be cast to Revision
val second_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2), row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].toStream
.map(x=> (x.getLong(0),row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].map(c => (c.getLong(0))))))).take(1)
// fails with: WrappedArray$ofRef cannot be cast to scala.collection.mutable.ArrayBuffer
I also tried Encoders and Encoder with the case classes below, but that didn't work either:
case class History (title: String, namespace: Long, id: Long, revisions: Array[Revision])
case class Contributor (r_username: String, r_contributor_id: Long, r_contributor_ip: String)
case class Revision(r_id: Long, r_parent_id: Long, timestamp : Long, r_contributor: Contributor, sha: String)
Printing the schema of revisions_dataset gives this:
root
|-- p_title: string (nullable = true)
|-- p_namespace: long (nullable = true)
|-- p_id: long (nullable = true)
|-- p_revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- r_id: long (nullable = true)
| | |-- r_parent_id: long (nullable = true)
| | |-- r_timestamp: long (nullable = true)
| | |-- r_contributor: struct (nullable = true)
| | | |-- r_username: string (nullable = true)
| | | |-- r_contributor_id: long (nullable = true)
| | | |-- r_contributor_ip: string (nullable = true)
| | |-- r_sha1: string (nullable = true)
My goal is to get a DataFrame from which I can retrieve the contributors in the revisions list, flattened so that each contributor appears at the page level (the same level as the title).
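To make the target shape concrete, here is a rough sketch of the kind of flattening I am after, based on the column names in the schema above. It is untested against my data and rests on several assumptions: a local SparkSession, the spark-avro module on the classpath, and the `explode` function from `org.apache.spark.sql.functions` as the way to unnest the array. The object name `FlattenContributors` and the aliases `revision` and `contributors` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, explode}

// Hypothetical wrapper object, only here to make the sketch self-contained.
object FlattenContributors {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode; spark-avro must be available on the classpath.
    val spark = SparkSession.builder()
      .appName("flatten-contributors")
      .master("local[*]")
      .getOrCreate()

    // Same placeholder path as in the question.
    val history_src = "path_to_avro_files\\frwiki*.avro"
    val revisions_dataset = spark.read.format("avro").load(history_src)

    // explode emits one row per element of p_revisions.
    // Note: pages whose p_revisions is null or empty are dropped by explode.
    val contributors = revisions_dataset
      .select(col("p_title"), explode(col("p_revisions")).as("revision"))
      .select(
        col("p_title"),
        col("revision.r_contributor.r_username").as("r_username"),
        col("revision.r_contributor.r_contributor_id").as("r_contributor_id"),
        col("revision.r_contributor.r_contributor_ip").as("r_contributor_ip"))

    contributors.printSchema()

    // One possible reading of "top contributors": revision count per username.
    contributors
      .groupBy("r_username")
      .count()
      .orderBy(desc("count"))
      .show(10, truncate = false)

    spark.stop()
  }
}
```

This stays entirely in the column API, so it should not need the case classes or any cast of `Row` values, but I am not sure it is the right approach.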
Any help, please?
Output: