I want to calculate the mean and standard deviation across all rows for the columns that start with Sum of EBIT [CY 2]
. I was able to calculate the mean, by adding the 10 columns and dividing it by 10.
As can be seen below:
pub fn industry_beta_f(raw_data:DataFrame, marginal_tax_rate:Expr) -> DataFrame{
let df = raw_data.clone().lazy()
.with_columns([
((col("Sum of EBIT [CY 2011] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2012] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2013] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2014] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2015] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2016] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2017] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2018] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2019] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2020] ($USDmm, Historical rate)")) / lit(10.0)).alias("moments_mean"),
(col("Sum of EBIT [CY 2011] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2012] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2013] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2014] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2015] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2016] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2017] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2018] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2019] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2020] ($USDmm, Historical rate)")).std(1).alias("moments_std"),
])
.with_columns([
when(col("moments_mean").gt(lit(0.0)))
.then(col("moments_std") / col("moments_mean"))
.otherwise(f64::NAN)
.alias("Standard deviation in operating income (last 10 years)")
])
.select([col("Industry Name"),
col("Number of firms"),
col("Standard deviation in operating income (last 10 years)")])
.collect()
.unwrap();
return df
}
I am having trouble calculating the standard deviation across all the rows for the columns that start with Sum of EBIT [CY 2]
. Because using the std()
formula, it calculates the std for each column instead of across the rows.
There is a huge gap between the two outputs. Since, for the current output the std
is calculated across the column, and for the expected output the std
is calculated across the rows.
You can calculate the mean using the built-in
mean_horizontal
method of a DataFrame.The standard deviation is not supported out-of-the-box, and therefore a bit trickier. You first calculate the mean, then the sum of squared errors, and then divide it by the number of columns - 1, as follows: