How to pivot rows into columns in Athena when you dont have the entire list of values or if the list is too long

1.8k Views Asked by At

Like below in amazon Athena.

DATA

ID , CODE 
a  , 123
b  , 345
a  , 123
a  , 345
c  , 246
d  , 678

Grouped Data

I have grouped above data with ID and CODE to get the frequency/count of the combination as below.

ID , CODE ,FREQ
a  , 123  , 2
a  , 345  , 1
b  , 345  , 1
c  , 246  , 1
d  , 678  , 1

Required Transformation

Required Transformation of the data : I am trying to get pivot the rows in CODE column into header row and create a matrix/data frame.

ID , 123 ,345 ,246, 678
a  ,   2 ,  1 , 0 , 0
b  ,   0 ,  1 , 0 , 0
c  ,   0 ,  0 , 1 , 0
d  ,   0 ,  0 , 0 , 1

On smaller data set I have successfully used R "Table" command and it works okay. But Now I have 27M unique IDs with 300 unique columns, with 220 M rows. so my ultimate matrix will have 27Mrows X 350 columns

My questions:

  1. Is there a query in Athena that I can write to achieve this result.
  2. Can I use R ? But Table command wont support it need to find a library to parallelize the data.Then use reshape2 package. Even that I am not sure how to go about.
  3. Is spark a better solution.If so how do i do it. I have already set up spark on a EC2 instance and downloaded the grouped data from S3 to EC2 tmp folder as a CSV its 8GB data file.
  4. Should I operate on the original data set or use the grouped by data set.

Kindly give me pointers. I am a newbie in all these technologies and figuring this out.

0

There are 0 best solutions below