Why I get different results with dropmissing or skipmissing in Julia?

259 Views Asked by At

I am trying to emulate the following R code (with dplyr) in Julia with DataFrames and DataFramesMeta (this is the flights subdataset from nycflights2013):

flights %>%
  group_by(month, origin, dest) %>%
  summarise(retard_moyen = mean(dep_delay, na.rm = TRUE)) %>%
  group_by(month) %>%
  slice_max(retard_moyen)

#>    month origin dest  retard_moyen
#>    <int> <chr>  <chr>        <dbl>
#>  1     1 EWR    TUL           55.2
#>  2     2 EWR    DSM           48.6
#>  3     3 EWR    DSM           71.0
#>  4     4 EWR    OKC           47.0
#>  5     5 EWR    TYS           60.6
#>  6     6 EWR    TYS           68.2
#>  7     7 EWR    CAE           81.5
#>  8     8 LGA    GSO           50.1
#>  9     9 LGA    MSN           24.7
#> 10    10 EWR    CAE           50.1
#> 11    11 LGA    SBN           67.5
#> 12    12 EWR    BZN           75

But I get two different results depending on the use of skipmissing or dropmissing. With the dropmissing version I get exactly the same result as the R code above, while with skipmissing I get some NaN values when taking the mean and than the result lacks two months... Here are the two codes:

# using skipmissing

@chain flights begin
    groupby([:month, :origin, :dest])
    @combine :mean_delay = mean(skipmissing(:dep_delay))
    groupby(:month)
    @subset(:mean_delay .== maximum(:mean_delay))
    sort(:month)
end

10×4 DataFrame
 Row │ month  origin  dest    mean_delay
     │ Int64  String  String  Float64
─────┼───────────────────────────────────
   1 │     1  EWR     TUL        55.2222
   2 │     2  EWR     DSM        48.5652
   3 │     3  EWR     DSM        71.0385
   4 │     4  EWR     OKC        46.9643
   5 │     5  EWR     TYS        60.5556
   6 │     6  EWR     TYS        68.2414
   7 │     8  LGA     GSO        50.1111
   8 │    10  EWR     CAE        50.1111
   9 │    11  LGA     SBN        67.5
  10 │    12  EWR     BZN        75.0


# using dropmissing

@chain flights begin
    dropmissing(:dep_delay)
    groupby([:month, :origin, :dest])
    @combine :mean_delay = mean(:dep_delay)
    groupby(:month)
    @subset(:mean_delay .== maximum(:mean_delay))
    sort(:month)
end

12×4 DataFrame
 Row │ month  origin  dest    mean_delay
     │ Int64  String  String  Float64
─────┼───────────────────────────────────
   1 │     1  EWR     TUL        55.2222
   2 │     2  EWR     DSM        48.5652
   3 │     3  EWR     DSM        71.0385
   4 │     4  EWR     OKC        46.9643
   5 │     5  EWR     TYS        60.5556
   6 │     6  EWR     TYS        68.2414
   7 │     7  EWR     CAE        81.5
   8 │     8  LGA     GSO        50.1111
   9 │     9  LGA     MSN        24.6905
  10 │    10  EWR     CAE        50.1111
  11 │    11  LGA     SBN        67.5
  12 │    12  EWR     BZN        75.0

Does anyone have a clue for this behavior? Thanks.

1

There are 1 best solutions below

0
On BEST ANSWER

The first transformation operation should be:

julia> @chain flights begin
    groupby([:month, :origin, :dest])
    @combine :mean_delay = mean(skipmissing(:dep_delay))
    groupby(:month)
    @subset(:mean_delay .== maximum(filter(!isnan, :mean_delay)))
    sort(:month)
end

Note that I have added filter in @subset part.

What is the reason that it is needed? Some combinations of [:month, :origin, :dest] have only missing values in :dep_delay and in such a case you have:

julia> mean(skipmissing(missings(Float64, 3)))
NaN

So NaN is produced. And then you have:

julia> maximum([1, 2, NaN])
NaN

so if for some :month there were at least one NaN produced it is silently dropped (as NaN is not equal by == to anything). You could catch that NaNs are produced if you switched == to e.g. === (or isequal) in comparison:

@chain flights begin
    groupby([:month, :origin, :dest])
    @combine :mean_delay = mean(skipmissing(:dep_delay))
    groupby(:month)
    @subset(:mean_delay .=== maximum(:mean_delay))
    sort(:month)
end