I am trying to use DataFrames.combine to chain multiple transformations. The desired final DataFrame is the one below.
using DataFrames, Statistics
df = DataFrame(x = repeat([1], 4))
df_2 = combine(df,
               :x => sum => :sum_x)
df_2.sqrt_sum_x .= sqrt.(df_2.sum_x)
println(df_2)
# 1×2 DataFrame
#  Row │ sum_x  sqrt_sum_x
#      │ Int64  Float64
# ─────┼───────────────────
#    1 │     4         2.0
I was wondering whether the previous result can be achieved with a single call to combine, e.g. by using the new target column :sum_x as a source column in a later argument (see code below). However, this throws an ArgumentError because it cannot find the newly computed :sum_x column.
combine(df,
:x => sum => :sum_x,
:sum_x => sqrt => :sqrt_sum_x)
# ERROR: ArgumentError: column name :sum_x not found in the data frame
Currently this is not allowed. The reason is that the order of execution of transformations in combine is undefined. In particular, in some situations these operations are executed in parallel using multi-threading (to improve performance).
Additionally, such an operation could be problematic to interpret. For example, if a single transformation took both :x and :sum_x as its source columns, then :x would come from the source data frame df (and have 4 elements), while :sum_x would come from the not-yet-existing target data frame (and have 1 element). Technically it would be possible to make it work, but we considered that this could be confusing.
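As a rough sketch of two possible workarounds (not necessarily the only or recommended ones), assuming a reasonably recent DataFrames.jl: either compute both values inside one function that returns a NamedTuple and use AsTable as the target, or chain combine with transform so that :sum_x already exists when the second step runs. The helper name sum_and_sqrt is made up for this illustration.

```julia
using DataFrames

df = DataFrame(x = repeat([1], 4))

# Option 1: a single combine call. The function returns a NamedTuple,
# and the AsTable target expands its fields into separate columns.
# `sum_and_sqrt` is a hypothetical helper name, not part of DataFrames.jl.
function sum_and_sqrt(x)
    s = sum(x)
    return (sum_x = s, sqrt_sum_x = sqrt(s))
end
df_single = combine(df, :x => sum_and_sqrt => AsTable)

# Option 2: two chained calls. combine produces :sum_x first, then
# transform reads it from the resulting data frame. ByRow applies sqrt
# to each element, since sqrt is not defined for a whole vector.
df_chained = transform(combine(df, :x => sum => :sum_x),
                       :sum_x => ByRow(sqrt) => :sqrt_sum_x)
```

Both variants should give a one-row data frame with the columns sum_x and sqrt_sum_x. The first keeps everything in one call at the cost of a custom function; the second stays closer to the original code and makes the order of the two steps explicit.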