I would like to tidy my data using the @linq macro from the DataFramesMeta package.
The output should match what the code below produces:
using CSV, DataFrames, Dates
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
download(url, joinpath(pwd(), "confirmed.csv"))
df = CSV.read(joinpath(pwd(), "confirmed.csv"))
rename!(df, 1 => :Province, 2 => :Country)
select!(df, Not([:Province, :Lat, :Long]))
df = combine(groupby(df, :Country), names(df)[2:end] .=> sum .=> names(df)[2:end])
df = stack(df, Not(:Country), :Country, variable_name=:Date, value_name=:Confirmed)
df[:Date] = parse.(Dates.Date, String.(df[:Date]), Dates.DateFormat("m/d/Y")) .+ Dates.Year(2000)
println(last(df, 10))
Result:
10×3 DataFrame
│ Row │ Country            │ Date       │ Confirmed │
│     │ String             │ Date       │ Int64     │
├─────┼────────────────────┼────────────┼───────────┤
│ 1   │ United Kingdom     │ 2020-10-29 │ 968456    │
│ 2   │ Uruguay            │ 2020-10-29 │ 3044      │
│ 3   │ Uzbekistan         │ 2020-10-29 │ 66392     │
│ 4   │ Venezuela          │ 2020-10-29 │ 91280     │
│ 5   │ Vietnam            │ 2020-10-29 │ 1177      │
│ 6   │ West Bank and Gaza │ 2020-10-29 │ 52571     │
│ 7   │ Western Sahara     │ 2020-10-29 │ 10        │
│ 8   │ Yemen              │ 2020-10-29 │ 2062      │
│ 9   │ Zambia             │ 2020-10-29 │ 16325     │
│ 10  │ Zimbabwe           │ 2020-10-29 │ 8349      │
My code so far using @linq is:
using DataFramesMeta
df = @linq CSV.read(joinpath(pwd(), "data", "time_series_covid19_confirmed_global.csv")) |>
rename(1 => :Province, 2 => :Country) |>
select(vcat(names(df)[2], names(df)[5:end]))
However, I'm stuck at the point of grouping and summarizing the data.
Any suggestions on how to complete this @linq query?
DataFramesMeta.jl is undergoing a significant refactoring now. Maybe @pdeffebach can suggest an idiomatic use of it using the functionality on master.
Here is how I would rewrite your code using raw DataFrames.jl and Pipe.jl:
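A minimal sketch of what such a pipeline could look like, assuming DataFrames.jl 1.x and Pipe.jl's @pipe macro (where _ stands for the result of the previous step). To keep it runnable without a download, it uses a small made-up table with the same column layout as the JHU file; the values and the names raw, tidy, and fmt are illustrative only. Swapping raw for CSV.read(path, DataFrame) should give the real thing.

```julia
using DataFrames, Dates, Pipe

# Toy stand-in for the wide JHU table (hypothetical values, same column layout).
raw = DataFrame(
    "Province/State" => [missing, "Hubei", "Beijing"],
    "Country/Region" => ["Albania", "China", "China"],
    "Lat" => [41.15, 30.97, 40.18],
    "Long" => [20.17, 112.27, 116.41],
    "1/22/20" => [0, 444, 14],
    "1/23/20" => [0, 444, 22],
)

# Two-digit years parse as year 20, so Year(2000) is added afterwards.
fmt = DateFormat("m/d/y")

tidy = @pipe raw |>
    rename(_, 1 => :Province, 2 => :Country) |>
    select(_, Not([:Province, :Lat, :Long])) |>
    groupby(_, :Country) |>
    # Sum every non-grouping column per country and keep the original names.
    combine(_, valuecols(_) .=> sum, renamecols=false) |>
    # Wide to long: one row per (Country, Date) pair.
    stack(_, Not(:Country), :Country, variable_name=:Date, value_name=:Confirmed) |>
    # Turn the date strings into Date values.
    transform(_, :Date => ByRow(d -> parse(Date, d, fmt) + Year(2000)) => :Date)

println(tidy)
```

Each step takes the previous result as its first argument, so the chain reads in the same order as the original sequence of in-place calls.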
Note that two constructs in your code are deprecated:
CSV.read(joinpath(pwd(), "confirmed.csv")) should take DataFrame as a second argument, as in my code.
df[:Date] should be df.Date or df."Date" (the second style is for names that contain, e.g., spaces).
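For illustration, here is a small sketch of the non-deprecated forms, assuming a recent CSV.jl and DataFrames.jl. The inline CSV text and the "Total Cases" column are made up so the example does not need a file on disk.

```julia
using CSV, DataFrames

# Hypothetical inline CSV; one column name deliberately contains a space.
text = """
Country,Date,Total Cases
Zambia,10/29/20,16325
Zimbabwe,10/29/20,8349
"""

# The sink type (DataFrame) is passed as the second argument.
df = CSV.read(IOBuffer(text), DataFrame)

dates = df.Date            # property access for an ordinary name
cases = df."Total Cases"   # string form for a name that contains a space
same  = df[!, :Date]       # indexing form: give both a row and a column selector
```

Both df.Date and df[!, :Date] return the stored column without copying; df[:, :Date] would return a copy instead.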