How can I calculate the standard deviation and mean across rows of various columns?

125 Views Asked by At

I want to calculate the mean and standard deviation across all rows for the columns that start with Sum of EBIT [CY 2]. I was able to calculate the mean, by adding the 10 columns and dividing it by 10.

As can be seen below:

pub fn industry_beta_f(raw_data:DataFrame, marginal_tax_rate:Expr) -> DataFrame{

    let df = raw_data.clone().lazy()
                                .with_columns([
                                    ((col("Sum of EBIT [CY 2011] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2012] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2013] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2014] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2015] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2016] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2017] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2018] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2019] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2020] ($USDmm, Historical rate)")) / lit(10.0)).alias("moments_mean"),
                                    (col("Sum of EBIT [CY 2011] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2012] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2013] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2014] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2015] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2016] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2017] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2018] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2019] ($USDmm, Historical rate)") + col("Sum of EBIT [CY 2020] ($USDmm, Historical rate)")).std(1).alias("moments_std"),

                                ])
                                .with_columns([
                                    when(col("moments_mean").gt(lit(0.0)))
                                    .then(col("moments_std") / col("moments_mean"))
                                    .otherwise(f64::NAN)
                                    .alias("Standard deviation in operating income (last 10 years)")
                                ])
                                .select([col("Industry Name"),
                                        col("Number of firms"),
                                        col("Standard deviation in operating income (last 10 years)")])
                                .collect()
                                .unwrap();
    return df
}

I am having trouble calculating the standard deviation across all the rows for the columns that start with Sum of EBIT [CY 2]. Because using the std() formula, it calculates the std for each column instead of across the rows.

Current Output

Expected Output

There is a huge gap between the two outputs. Since, for the current output the std is calculated across the column, and for the expected output the std is calculated across the rows.

1

There are 1 best solutions below

2
On

You can calculate the mean using the built-in mean_horizontal method of a DataFrame.

The standard deviation is not supported out-of-the-box, and therefore a bit trickier. You first calculate the mean, then the sum of squared errors, and then divide it by the number of columns - 1, as follows:

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let mut df = df! (
        "col_1" => &[1, 2, 3, 4, 5],
        "col_2" => &[2, 4, 6, 8, 5],
        "col_3" => &[10, 8, 6, 4, 5],
    )?;

    let n = df.get_column_names().len() as i32;

    let col_mean = df.mean_horizontal(polars::frame::NullStrategy::Ignore)?;

    let df_w_mean: &mut DataFrame;
    if let Some(mean) = col_mean {
        df_w_mean = df.with_column(mean.with_name("col_mean"))?;
    } else {
        return Err(PolarsError::ComputeError("No mean can be calculated".into()));
    }
    
    let sse = df_w_mean
        .clone()
        .lazy()
        .with_column((col("*") - col("col_mean")).pow(2))
        .collect()?
        .sum_horizontal(polars::frame::NullStrategy::Ignore)?;
    
    let df_w_sse: &mut DataFrame;
    if let Some(sse) = sse {
        df_w_sse = df_w_mean.with_column(sse.with_name("col_std"))?;
    } else {
        return Err(PolarsError::ComputeError("No sse can be caculated".into()));
    }

    let result = df_w_sse.clone().lazy().with_column((col("col_std") / lit(n - 1)).sqrt()).collect()?;
    println!("{:?}", result);

    Ok(())
}