I have an ETL pipeline that analyzes large datasets, and all my tables are Spark 2.2.x DataFrames. Now I have to add data governance so I can trace the origin of the data. For example:
Table A
| Col1 | Col2 |
| ---- | ---- |
| test | hello |
| test3 | bye |
Table B
| Col1 | Col2 |
| ---- | ---- |
| test2 | hey |
| test3 | bye |
Now that I have my two tables, I join them by `Col1` and concatenate `Col2 + Col2`. Resulting table:
Final Table
| Col1 | Col2 |
| ---- | ---- |
| test3 | byebye |
My question is: is there any function or API in the Spark DataFrame world that can show all the transformations applied to a DataFrame, without me having to change my code too much?
If you want a quick solution for this, have a look at `RDD#toDebugString`. You can call the `rdd` method on your `DataFrame` and then show its lineage through this method. An example appears in Jacek Laskowski's book "Mastering Apache Spark".
This snippet, along with a detailed explanation of RDD lineage and `toDebugString`, is available here.