In Scala/Spark
application I created two different DataFrame. My task is to create one excel file with two sheet for each DataFrame.
I decided to use spark-excel library but I am little bit confused. As far as I understand the future excel file is saved in the hdfs file system, right? I need to set the path of the future excel file in .save()
method, right? Also I don't understand what format should be in dataAddress
option?
import org.apache.spark.sql.Dataset
import spark.implicits._
val df1 = Seq(
("2019-01-01 00:00:00", "7056589658"),
("2019-02-02 00:00:00", "7778965896")
).toDF("DATE_TIME", "PHONE_NUMBER")
df1.show()
val df2 = Seq(
("2019-01-01 01:00:00", "194.67.45.126"),
("2019-02-02 00:00:00", "102.85.62.100"),
("2019-03-03 03:00:00", "102.85.62.100")
).toDF("DATE_TIME", "IP")
df2.show()
df1.write
.format("com.crealytics.spark.excel")
.option("dataAddress", "'First'!A1:B1000")
.option("useHeader", "true")
.mode("append")
.save("/hdd/home/NNogerbek/data.xlsx")
df2.write
.format("com.crealytics.spark.excel")
.option("dataAddress", "'Second'!A1:B1000")
.option("useHeader", "true")
.mode("append")
.save("/hdd/home/NNogerbek/data.xlsx")
First thing is this is maven dependency I used
what is data addess ? from docs
so "My Sheet1'!B3:C35" means you are telling api that... My Sheet1 and B3:C35
column positions in the excel sheet..
The below is the complete listing through which I achieved desired..
Note : .coalesce(1) will create a single file not multiple part files...
Result : since i used local result will be saved in local if its yarn it will be in hdfs. if you want to use cloud storage like s3, its also possible with yarn as master. basically this is based on you requirements...
sheet 1 :
sheet 2 :
Also, 1) see my article How to do Simple reporting with Excel sheets using Apache Spark Scala ?
2) see my answer here.
Hope that helps!!