Like below in amazon Athena.

DATA

ID , CODE 
a  , 123
b  , 345
a  , 123
a  , 345
c  , 246
d  , 678

Grouped Data

I have grouped above data with ID and CODE to get the frequency/count of the combination as below.

ID , CODE ,FREQ
a  , 123  , 2
a  , 345  , 1
b  , 345  , 1
c  , 246  , 1
d  , 678  , 1

Required Transformation

Required Transformation of the data : I am trying to get pivot the rows in CODE column into header row and create a matrix/data frame.

ID , 123 ,345 ,246, 678
a  ,   2 ,  1 , 0 , 0
b  ,   0 ,  1 , 0 , 0
c  ,   0 ,  0 , 1 , 0
d  ,   0 ,  0 , 0 , 1

On smaller data set I have successfully used R "Table" command and it works okay. But Now I have 27M unique IDs with 300 unique columns, with 220 M rows. so my ultimate matrix will have 27Mrows X 350 columns

My questions:

  1. Is there a query in Athena that I can write to achieve this result.
  2. Can I use R ? But Table command wont support it need to find a library to parallelize the data.Then use reshape2 package. Even that I am not sure how to go about.
  3. Is spark a better solution.If so how do i do it. I have already set up spark on a EC2 instance and downloaded the grouped data from S3 to EC2 tmp folder as a CSV its 8GB data file.
  4. Should I operate on the original data set or use the grouped by data set.

Kindly give me pointers. I am a newbie in all these technologies and figuring this out.

0

There are 0 best solutions below