Like below in amazon Athena.
DATA
ID , CODE
a , 123
b , 345
a , 123
a , 345
c , 246
d , 678
Grouped Data
I have grouped above data with ID and CODE to get the frequency/count of the combination as below.
ID , CODE ,FREQ
a , 123 , 2
a , 345 , 1
b , 345 , 1
c , 246 , 1
d , 678 , 1
Required Transformation
Required Transformation of the data : I am trying to get pivot the rows in CODE column into header row and create a matrix/data frame.
ID , 123 ,345 ,246, 678
a , 2 , 1 , 0 , 0
b , 0 , 1 , 0 , 0
c , 0 , 0 , 1 , 0
d , 0 , 0 , 0 , 1
On smaller data set I have successfully used R "Table" command and it works okay. But Now I have 27M unique IDs with 300 unique columns, with 220 M rows. so my ultimate matrix will have 27Mrows X 350 columns
My questions:
- Is there a query in Athena that I can write to achieve this result.
- Can I use R ? But Table command wont support it need to find a library to parallelize the data.Then use reshape2 package. Even that I am not sure how to go about.
- Is spark a better solution.If so how do i do it. I have already set up spark on a EC2 instance and downloaded the grouped data from S3 to EC2 tmp folder as a CSV its 8GB data file.
- Should I operate on the original data set or use the grouped by data set.
Kindly give me pointers. I am a newbie in all these technologies and figuring this out.