Julia pandas - how to append dataframes together


Working with Julia 1.0, I have a large number of data frames that I read into Julia using Pandas (read_csv), and I am looking for a way to append them all into a single big data frame. For some reason the "append" function does not do the trick. A simplified example is below:

using Pandas 

df = Pandas.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])

df2 = Pandas.DataFrame([[5, 6], [7, 8]], columns=['A', 'B'])

df[:append](df2)  #fails

df.append(df2)    #fails

df[:concat](df2)  #fails

vcat(df,df2)       

The last step runs, but it produces a 2-element Array with each element being a DataFrame, rather than one combined DataFrame.

Any ideas on how to stack the two dataframes one under the other?


BEST ANSWER

This seems to work:

julia> df = Pandas.DataFrame([[1, 2], [3, 4]], columns=[:A, :B])
   A  B
0  1  2
1  3  4


julia> df2 = Pandas.DataFrame([[5, 6], [7, 8]], columns=[:A, :B])
   A  B
0  5  6
1  7  8


julia> df.pyo[:append](df2, ignore_index = true )
PyObject    A  B
0  1  2
1  3  4
2  5  6
3  7  8

Notes:

  • I don't know if this is a Pandas thing or a Julia 1.0 PyCall thing, but the object seems to need the .pyo field explicitly before calling a method. If you try df[:append], it is interpreted as an attempt to index the :append column. Try df[:col3] = 3 to see what I mean.
  • There is a Julia-native DataFrames package. There is no need to use Pandas unless you have some weird "I have ready-made code" issue, and even then you're probably just complicating things by going through a Python layer in Julia.

For reference, here's the equivalent in Julia DataFrames (note that this constructor takes columns rather than rows, so A holds 1:2 and B holds 3:4):

julia> df  = DataFrames.DataFrame( [1:2, 3:4], [:A, :B]);
julia> df2 = DataFrames.DataFrame( [5:6, 7:8], [:A, :B]);
julia> append!(df, df2)
4×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ 3 │
│ 2   │ 2 │ 4 │
│ 3   │ 5 │ 7 │
│ 4   │ 6 │ 8 │
0
On

Since you said you have a lot of dataframes, you can add them to a list and then pd.concat the list, taking the header of the first dataframe (assuming they all have the same header) as the header of the new dataframe. That way the result has a single header, and you don't end up with a bunch of header rows in the data. In Python pandas syntax:

dfs = [df, df2]

df3 = pd.DataFrame(pd.concat(dfs), columns=df.columns)
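
To make the list-then-concat idea concrete, here is a minimal self-contained sketch in Python pandas. The CSV contents are made up for illustration and stand in for files on disk; ignore_index=True is an addition not in the snippet above, used so the rows are renumbered rather than repeating each frame's own index.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for files on disk;
# each one carries its own header row.
csv_texts = [
    "A,B\n1,2\n3,4\n",
    "A,B\n5,6\n7,8\n",
]

# read_csv consumes each header line as column labels,
# so no header rows end up in the data itself.
dfs = [pd.read_csv(io.StringIO(text)) for text in csv_texts]

# Concatenate the whole list in one call.
# ignore_index=True renumbers the rows 0..n-1.
df3 = pd.concat(dfs, ignore_index=True)

print(df3)
```

Building the list first and calling pd.concat once avoids copying the growing frame on every step, which matters when there are many inputs. Also note that DataFrame.append has been removed in pandas 2.0 and later, so pd.concat is the route that keeps working on current releases.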