I am trying to emulate the following R code (with dplyr) in Julia with DataFrames and DataFramesMeta (this is the flights subdataset from nycflights2013):
flights %>%
group_by(month, origin, dest) %>%
summarise(retard_moyen = mean(dep_delay, na.rm = TRUE)) %>%
group_by(month) %>%
slice_max(retard_moyen)
#> month origin dest retard_moyen
#> <int> <chr> <chr> <dbl>
#> 1 1 EWR TUL 55.2
#> 2 2 EWR DSM 48.6
#> 3 3 EWR DSM 71.0
#> 4 4 EWR OKC 47.0
#> 5 5 EWR TYS 60.6
#> 6 6 EWR TYS 68.2
#> 7 7 EWR CAE 81.5
#> 8 8 LGA GSO 50.1
#> 9 9 LGA MSN 24.7
#> 10 10 EWR CAE 50.1
#> 11 11 LGA SBN 67.5
#> 12 12 EWR BZN 75
But I get two different results depending on the use of skipmissing or dropmissing. With the dropmissing version I get exactly the same result as the R code above, while with skipmissing I get some NaN values when taking the mean and than the result lacks two months... Here are the two codes:
# using skipmissing
@chain flights begin
groupby([:month, :origin, :dest])
@combine :mean_delay = mean(skipmissing(:dep_delay))
groupby(:month)
@subset(:mean_delay .== maximum(:mean_delay))
sort(:month)
end
10×4 DataFrame
Row │ month origin dest mean_delay
│ Int64 String String Float64
─────┼───────────────────────────────────
1 │ 1 EWR TUL 55.2222
2 │ 2 EWR DSM 48.5652
3 │ 3 EWR DSM 71.0385
4 │ 4 EWR OKC 46.9643
5 │ 5 EWR TYS 60.5556
6 │ 6 EWR TYS 68.2414
7 │ 8 LGA GSO 50.1111
8 │ 10 EWR CAE 50.1111
9 │ 11 LGA SBN 67.5
10 │ 12 EWR BZN 75.0
# using dropmissing
@chain flights begin
dropmissing(:dep_delay)
groupby([:month, :origin, :dest])
@combine :mean_delay = mean(:dep_delay)
groupby(:month)
@subset(:mean_delay .== maximum(:mean_delay))
sort(:month)
end
12×4 DataFrame
Row │ month origin dest mean_delay
│ Int64 String String Float64
─────┼───────────────────────────────────
1 │ 1 EWR TUL 55.2222
2 │ 2 EWR DSM 48.5652
3 │ 3 EWR DSM 71.0385
4 │ 4 EWR OKC 46.9643
5 │ 5 EWR TYS 60.5556
6 │ 6 EWR TYS 68.2414
7 │ 7 EWR CAE 81.5
8 │ 8 LGA GSO 50.1111
9 │ 9 LGA MSN 24.6905
10 │ 10 EWR CAE 50.1111
11 │ 11 LGA SBN 67.5
12 │ 12 EWR BZN 75.0
Does anyone have a clue for this behavior? Thanks.
The first transformation operation should be:
Note that I have added
filter
in@subset
part.What is the reason that it is needed? Some combinations of
[:month, :origin, :dest]
have only missing values in:dep_delay
and in such a case you have:So
NaN
is produced. And then you have:so if for some
:month
there were at least oneNaN
produced it is silently dropped (asNaN
is not equal by==
to anything). You could catch thatNaN
s are produced if you switched==
to e.g.===
(orisequal
) in comparison: