I have a bunch of columns; a sample of my data is shown below. I need to check the columns for errors and generate two output files. I'm using Apache Spark 2.0 and I would like to do this in an efficient way.
Schema Details
---------------
EMPID - (NUMBER)
ENAME - (STRING,SIZE(50))
GENDER - (STRING,SIZE(1))
Data
----
EMPID,ENAME,GENDER
1001,RIO,M
1010,RICK,MM
1015,123MYA,F
My expected output files should be as shown below:
1.
EMPID,ENAME,GENDER
1001,RIO,M
1010,RICK,NULL
1015,NULL,F
2.
EMPID,ERROR_COLUMN,ERROR_VALUE,ERROR_DESCRIPTION
1010,GENDER,"MM","OVERSIZED"
1010,GENDER,"MM","VALUE INVALID FOR GENDER"
1015,ENAME,"123MYA","NAME SHOULD BE A STRING"
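To make the rules implied by the two output files explicit, here is a rough sketch in plain Python with no Spark dependency. The function name `validate_row`, the constants, the set of valid genders (M and F), and the letters-and-spaces rule for names are all assumptions inferred from the sample data, not confirmed requirements. EMPID validation is left out because the sample shows no error for it.

```python
import re

# Assumed limits, taken from the schema details above.
NAME_MAX_LEN = 50
GENDER_MAX_LEN = 1
# Assumption: only these two codes are valid.
VALID_GENDERS = {"M", "F"}
# Assumption: a valid name contains only letters and spaces.
NAME_PATTERN = re.compile(r"[A-Za-z ]+")


def validate_row(empid, ename, gender):
    """Return (clean_row, errors) for one record.

    clean_row keeps valid values and uses None where a value failed a check
    (to be written out as NULL). errors holds tuples of
    (EMPID, ERROR_COLUMN, ERROR_VALUE, ERROR_DESCRIPTION).
    """
    errors = []

    # ENAME checks: size first, then content.
    if len(ename) > NAME_MAX_LEN:
        errors.append((empid, "ENAME", ename, "OVERSIZED"))
    if not NAME_PATTERN.fullmatch(ename):
        errors.append((empid, "ENAME", ename, "NAME SHOULD BE A STRING"))
    ename_ok = not any(e[1] == "ENAME" for e in errors)

    # GENDER checks: size first, then allowed values.
    if len(gender) > GENDER_MAX_LEN:
        errors.append((empid, "GENDER", gender, "OVERSIZED"))
    if gender not in VALID_GENDERS:
        errors.append((empid, "GENDER", gender, "VALUE INVALID FOR GENDER"))
    gender_ok = not any(e[1] == "GENDER" for e in errors)

    clean_row = (empid, ename if ename_ok else None, gender if gender_ok else None)
    return clean_row, errors


if __name__ == "__main__":
    sample = [("1001", "RIO", "M"), ("1010", "RICK", "MM"), ("1015", "123MYA", "F")]
    for record in sample:
        print(validate_row(*record))
```

In Spark, a function like this could presumably be applied once per record (for example with `rdd.map`), after which the clean rows and the error tuples are written to separate files, so each record is validated only once.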
Thanks
I have not really worked with Spark 2.0, so I'll try answering your question with a solution in Spark 1.6.
I have used this approach personally and it works for me. I hope it points you in the right direction.