Data governance with scala/spark


I have an ETL pipeline that analyzes large data sets, and all my tables are DataFrames in Spark 2.2.x. Now I have to add data governance so that I can trace the origin of the data. For example:

Table A

| Col1 | Col2 |  
| ---- | ---- |  
| test | hello |  
| test3 | bye |

Table B

| Col1 | Col2 |  
| ---- | ---- |  
| test2 | hey |  
| test3 | bye |

Now that I have my two tables, I join them on Col1 and concatenate the two Col2 values. Resulting table:

Final Table

| Col1 | Col2 |  
| ---- | ---- |  
|test3 | byebye|  
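For concreteness, the join described above could be written as follows (a minimal sketch; the `SparkSession` settings and the name `finalTable` are illustrative, not from the original pipeline):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat

val spark = SparkSession.builder
  .appName("governance-example") // illustrative app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val tableA = Seq(("test", "hello"), ("test3", "bye")).toDF("Col1", "Col2")
val tableB = Seq(("test2", "hey"), ("test3", "bye")).toDF("Col1", "Col2")

// Join on Col1 and concatenate the two Col2 values
val finalTable = tableA.as("a")
  .join(tableB.as("b"), $"a.Col1" === $"b.Col1")
  .select($"a.Col1", concat($"a.Col2", $"b.Col2").as("Col2"))

finalTable.show()
// +-----+------+
// | Col1|  Col2|
// +-----+------+
// |test3|byebye|
// +-----+------+
```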

My question is: is there any function in the Spark DataFrame API, or elsewhere, that lets me show all the transformations applied to a DataFrame without having to change my code too much?

1 Answer


If you want a quick solution for this, have a look at `RDD#toDebugString`. You can call the `rdd` method on your DataFrame and then show its lineage through this method.

Here is an example from Jacek Laskowski's book "Mastering Apache Spark":

```scala
scala> val wordCount = sc.textFile("README.md").flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[21] at reduceByKey at <console>:24

scala> wordCount.toDebugString
res13: String =
(2) ShuffledRDD[21] at reduceByKey at <console>:24 []
 +-(2) MapPartitionsRDD[20] at map at <console>:24 []
    |  MapPartitionsRDD[19] at flatMap at <console>:24 []
    |  README.md MapPartitionsRDD[18] at textFile at <console>:24 []
    |  README.md HadoopRDD[17] at textFile at <console>:24 []
```

This snippet, along with a detailed explanation of RDD lineage and `toDebugString`, comes from that book.
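Applied to the question's DataFrames, a minimal sketch could look like the following: `explain(true)` prints the parsed, analyzed, optimized logical plans and the physical plan (which list the join, the projection, and the sources), while `rdd.toDebugString` shows the RDD-level lineage. The session setup and variable names here are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat

// Illustrative local session, not from the original code
val spark = SparkSession.builder
  .appName("lineage-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val tableA = Seq(("test", "hello"), ("test3", "bye")).toDF("Col1", "Col2")
val tableB = Seq(("test2", "hey"), ("test3", "bye")).toDF("Col1", "Col2")

val joined = tableA.as("a")
  .join(tableB.as("b"), $"a.Col1" === $"b.Col1")
  .select($"a.Col1", concat($"a.Col2", $"b.Col2").as("Col2"))

// Logical and physical plans: every transformation applied to the DataFrame
joined.explain(true)

// RDD lineage of the underlying RDD, as in the answer above
println(joined.rdd.toDebugString)
```

Note that `explain(true)` works at the DataFrame level and survives Catalyst optimizations, whereas `toDebugString` only shows the physical RDD chain, which may look quite different from the DataFrame operations you wrote.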