Find the least value in between the columns using Spark DataFrame

77 Views Asked by At

I have a dataframe like below and need to find the least value except zeros and add it in a new column as 'Least'.

Column1 Column2 Column3
100.0 120.0 150.0
200.0 0.0 0.0
0.0 20.0 100.0

I tried with least() function but I didn't get the expected output.

expected output would be like below.

Column1 Column2 Column3 Least
100.0 120.0 150.0 100.0
200.0 0.0 0.0 200.0
0.0 20.0 100.0 20.0
2

There are 2 best solutions below

1
Ali BOUHLEL On BEST ANSWER

You can do something like this to get the least values

import sparkSession.implicits._

      val df = List(
        (100.0, 120.0, 150.0),
        (200.0, 0.0, 0.0),
        (0.0, 20.0, 100.0)
      ).toDF("column1", "column2", "column3")
      
      val columns = df.columns.toSeq
      
      
      val leastRow = least(
        columns map col: _*
      ).alias("min")

      df.select($"*", leastRow).show

Try to improve the leastRow method to ignore the zero values. think about replacing the zero values with the maximum possible float value in your use case, Double.PositiveInfinity in general ect.. Do not hesitate to post your work and be sure that you'll get help ! Good luck.

0
M_S On

I found two solutions:

  1. You can swap 0 to inf when calculating least to skip zeroes
  2. You can use udf which will skip zeroes

Option 1)

import org.apache.spark.sql.functions._

val data = Seq((100.0, 120.0, 150.0), (200.0, 0.0, 0.0), (0.0, 20.0, 100.0))
val columns = Seq("Column1", "Column2", "Column3")
val df: DataFrame = spark.createDataFrame(data).toDF(columns: _*)

// Calculate the least value, swaping all zeroes to inf
val leastValueExpr = least(
  df.columns.map(colName =>
    when(col(colName) =!= 0.0, col(colName)).otherwise(Double.PositiveInfinity)
  ): _*
)
val resultDF = df.withColumn("Least", leastValueExpr)

resultDF.show()

Option 2

// Define a UDF to calculate the least value, excluding zero values
val leastNonZero: UserDefinedFunction =
  udf((cols: Seq[Double]) => cols.filter(_ != 0.0).min)

// Apply the UDF to calculate the least value, excluding zero values
val dfWithLeast: DataFrame =
  df.withColumn("Least", leastNonZero(array(df.columns.map(col): _*)))

// Show the result
dfWithLeast.show()

Output:

+-------+-------+-------+-----+
|Column1|Column2|Column3|Least|
+-------+-------+-------+-----+
|  100.0|  120.0|  150.0|100.0|
|  200.0|    0.0|    0.0|200.0|
|    0.0|   20.0|  100.0| 20.0|
+-------+-------+-------+-----+