Tidying data in Julia using @linq from the DataFramesMeta package

481 Views Asked by At

I would like to tidy my data using the @linq macro from the DataFramesMeta package.

The result should give the same result as the code below:

using CSV, DataFrames, Dates

url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
download(url, joinpath(pwd(), "confirmed.csv"))

df = CSV.read(joinpath(pwd(), "confirmed.csv"))
rename!(df, 1 => :Province, 2 => :Country)
select!(df, Not([:Province, :Lat, :Long]))
df = combine(groupby(df, :Country), names(df)[2:end] .=> sum .=> names(df)[2:end])
df = stack(df, Not(:Country), :Country, variable_name=:Date, value_name=:Confirmed)
df[:Date] = parse.(Dates.Date, String.(df[:Date]), Dates.DateFormat("m/d/Y")) .+ Dates.Year(2000)

println(last(df, 10))

Result:

10×3 DataFrame
│ Row │ Country            │ Date       │ Confirmed │
│     │ String             │ Date       │ Int64     │
├─────┼────────────────────┼────────────┼───────────┤
│ 1   │ United Kingdom     │ 2020-10-29 │ 968456    │
│ 2   │ Uruguay            │ 2020-10-29 │ 3044      │
│ 3   │ Uzbekistan         │ 2020-10-29 │ 66392     │
│ 4   │ Venezuela          │ 2020-10-29 │ 91280     │
│ 5   │ Vietnam            │ 2020-10-29 │ 1177      │
│ 6   │ West Bank and Gaza │ 2020-10-29 │ 52571     │
│ 7   │ Western Sahara     │ 2020-10-29 │ 10        │
│ 8   │ Yemen              │ 2020-10-29 │ 2062      │
│ 9   │ Zambia             │ 2020-10-29 │ 16325     │
│ 10  │ Zimbabwe           │ 2020-10-29 │ 8349      │

My code so far using @linq is:

using DataFramesMeta

df = @linq CSV.read(joinpath(pwd(), "data", "time_series_covid19_confirmed_global.csv")) |>
    rename(1 => :Province, 2 => :Country) |>
    select(vcat(names(df)[2], names(df)[5:end]))

However, I'm stuck at the point of grouping and summarizing the data. Any suggestions how to complete this linq query?

1

There are 1 best solutions below

4
On BEST ANSWER

DataFramesMeta.jl is undergoing a significant refactoring now. Maybe @pdeffebach can suggest an idiomatic use of it using the functionality on master.

Here is how I would rewrite your code using raw DataFrames.jl and Pipe.jl:

df = CSV.read(joinpath(pwd(), "confirmed.csv"), DataFrame)

@pipe df |>
      rename(_, 1 => :Province, 2 => :Country) |> # avoid ! to make sure we do not mutate the source
      select!(_, Not([:Province, :Lat, :Long])) |>
      groupby(_, :Country) |>
      combine(_, valuecols(_) .=> sum .=> valuecols(_)) |>
      stack(_, Not(:Country), :Country, variable_name=:Date, value_name=:Confirmed) |>
      transform!(_, :Date => ByRow(x -> parse(Date, string(x), DateFormat("m/d/Y")) + Year(2000)) => :Date) |>
      last(_, 10)

Note that in your code:

  • CSV.read(joinpath(pwd(), "confirmed.csv")) (add DataFrame as a second argument as in my code)
  • df[:Date] should be df.Date or df."Date" (the second style is in case the name containse e.g. spaces)

are deprecated.