Find Mean of all the numeric variables of a Spark dataframe in R


I have a Spark DataFrame in R with the structure below:

Var1     Var 2    Var 3       Var 4    Group
98.64    32.35    11906.91     8.65    A
94.83    29.36    17287.57     6.01    B
99.94    35.36    30411.85     8.82    C
99.45    34.58    18267.26    10.09    C
99.93    36.64    23560.04     7.34    A
99.66    48.81    42076.44     8.44    B
99.96    27.38    18474.01    11.39    A
97.49    25.28    14615.50     6.60    B
98.98    32.50    10282.90     7.71    C
99.57    31.54    12725.56     6.17    C
99.91    26.46    10990.13     6.17    C

This is a representative sample of my dataset. The actual number of records is very large, and there are more than 200 columns.

Can someone please help me produce the following result set? For a local data frame in R this is very easy with dplyr, but it seems harder on a Spark DataFrame.

Group    Average_Var1    Average_Var2    Average_Var3    Average_Var4
A        99.51           32.13           17980.34        9.13
B        97.32           34.42           24659.83        6.89
C        99.57           32.10           16535.54        7.78

There are 3 answers below.

Best answer

Using sparklyr try this:

df %>% group_by(Group) %>% summarize_all(.funs = mean)
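A slightly fuller sketch of the same pipeline, which also produces the Average_ prefixed column names asked for in the question. It runs on a small local data frame built from the sample rows so it can be tried without a cluster. The assumptions are that, with sparklyr, df would instead be a Spark table reference, that the same dplyr verbs would then be translated to Spark SQL, and that the columns are named Var1 to Var4 without spaces. It needs dplyr 1.0 or later for across().

```r
library(dplyr)

# Local stand-in for the Spark table, using the sample rows from the question.
# Column names are simplified to Var1..Var4 (an assumption).
df <- data.frame(
  Var1  = c(98.64, 94.83, 99.94, 99.45, 99.93, 99.66,
            99.96, 97.49, 98.98, 99.57, 99.91),
  Var2  = c(32.35, 29.36, 35.36, 34.58, 36.64, 48.81,
            27.38, 25.28, 32.50, 31.54, 26.46),
  Var3  = c(11906.91, 17287.57, 30411.85, 18267.26, 23560.04, 42076.44,
            18474.01, 14615.50, 10282.90, 12725.56, 10990.13),
  Var4  = c(8.65, 6.01, 8.82, 10.09, 7.34, 8.44,
            11.39, 6.60, 7.71, 6.17, 6.17),
  Group = c("A", "B", "C", "C", "A", "B", "A", "B", "C", "C", "C")
)

# Average every non-grouping column per group and prefix the result names.
# Inside a grouped summarise, across(everything()) skips the grouping column.
group_means <- df %>%
  group_by(Group) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE),
                   .names = "Average_{.col}"))

print(group_means)
```

Because no column is named explicitly apart from Group, the same call should scale to the 200+ columns mentioned in the question, provided all of the other columns are numeric.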
Answer

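The base function by can be used with colMeans as follows (this works on a local data.frame, so a Spark DataFrame would first need to be collected into R):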

by(df[, 1:4], df[,"Group"], colMeans)

output:

df[, "Group"]: A
        Var1         Var2         Var3         Var4 
   99.516118    32.130696 17980.341453     9.130542 
----------------------------------------------------------- 
df[, "Group"]: B
        Var1         Var2         Var3         Var4 
   97.328825    34.489235 24659.840630     6.874534 
----------------------------------------------------------- 
df[, "Group"]: C
        Var1         Var2         Var3         Var4 
   99.575422    32.109159 16535.543470     7.787882 
Answer
> aggregate(df[, 1:4], list(df$Group), mean)
  Group.1     Var1    Var.2    Var.3    Var.4
1       A 99.51612 32.13070 17980.34 9.130542
2       B 97.32882 34.48923 24659.84 6.874534
3       C 99.57542 32.10916 16535.54 7.787882
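The same aggregation can also be written with the formula interface of aggregate, which avoids hard-coding the column positions 1:4. This is a minimal sketch that assumes the data is already a local data.frame (for a Spark DataFrame it would first have to be brought into R memory, for example with collect(), which may not be practical for a very large table) and that the columns are named Var1 to Var4.

```r
# Local data frame standing in for collected Spark data,
# built from the sample rows in the question.
df <- data.frame(
  Var1  = c(98.64, 94.83, 99.94, 99.45, 99.93, 99.66,
            99.96, 97.49, 98.98, 99.57, 99.91),
  Var2  = c(32.35, 29.36, 35.36, 34.58, 36.64, 48.81,
            27.38, 25.28, 32.50, 31.54, 26.46),
  Var3  = c(11906.91, 17287.57, 30411.85, 18267.26, 23560.04, 42076.44,
            18474.01, 14615.50, 10282.90, 12725.56, 10990.13),
  Var4  = c(8.65, 6.01, 8.82, 10.09, 7.34, 8.44,
            11.39, 6.60, 7.71, 6.17, 6.17),
  Group = c("A", "B", "C", "C", "A", "B", "A", "B", "C", "C", "C")
)

# "." on the left-hand side means every column other than Group,
# so the call does not change when more numeric columns are added.
res <- aggregate(. ~ Group, data = df, FUN = mean)

print(res)
```

The result keeps the original column names and has one row per group, with the grouping column named Group rather than Group.1.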