Writing Trees, number of baskets and compression (uproot)


I am trying to optimize the way trees are written in PyROOT and came across uproot. In the end, my application should write continuously incoming events (each consisting of arrays) to a tree.

The first approach is the classic way:

import array
import ROOT

event = [1., 2., 3.]

f = ROOT.TFile("my_tree.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")

pt = array.array('f', [0.] * 3)
tree.Branch("pt", pt, "pt[3]/F")

# loop to simulate incoming events
for _ in range(10000):
    for i, element in enumerate(event):
        pt[i] = element
    tree.Fill()

tree.Print()
tree.Write("", ROOT.TObject.kOverwrite)
f.Close()

This gives the following Tree and execution time:

[screenshot: tree.Print() output and execution time]

Trying to do the same with uproot, my code looks like this:

import numpy as np
import awkward as ak
import uproot

np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

with uproot.recreate("testing.root", compression=None) as fout:
    fout.mktree("tree", {"branch": ak_array.type})

    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})

which gives the following tree:

[screenshot: tree.Print() output]

So the uproot method takes much longer, the file size is much bigger, and each event gets a separate basket. I tried different compression settings, but that did not change anything. Any idea how to optimize this? Is this even a sensible use case for uproot, and can writing trees this way be sped up compared to the first approach?

Answer:
The extend method is supposed to write a new TBasket with each invocation. (See the documentation, especially the orange warning box. The purpose of that is so that you can control the TBasket sizes.) If you're calling it 10000 times to write 1 value (the value [1, 2, 3]) each, that's a maximally inefficient use.

Fundamentally, you're thinking about this problem in an entry-by-entry way, rather than in terms of columns, the way that scientific processing is normally done in Python. What you want to do instead is to collect a large dataset in memory and write it to the file in one chunk. If the data that you'll eventually be addressing is larger than the memory on your computer, you would do it in "large enough" chunks, which is probably on the order of hundreds of megabytes or gigabytes.

For instance, starting with your example,

import time
import uproot
import numpy as np
import awkward as ak

np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

starttime = time.time()

with uproot.recreate("bad.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})

print("Total time:", time.time() - starttime)

The total time (on my computer) is 1.9 seconds and the TTree characteristics are atrocious:

******************************************************************************
*Tree    :tree      :                                                        *
*Entries :    10000 : Total =         1170660 bytes  File  Size =    2970640 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=    1170323 bytes  File Size  =     970000 *
*Baskets :    10000 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

Instead, we want the data to be in a single array (or some loop that produces ~GB scale arrays):

np_array = np.array([[1, 2, 3]] * 10000)

(This isn't necessarily how you would get np_array, since * 10000 makes a large, intermediate Python list. Suffice to say, you get the data somehow.)
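
For instance, one way to build an equivalent array without the intermediate Python list is to let NumPy do the repetition (a minimal sketch; np.tile repeats the row in compiled code):

import numpy as np

np_array = np.tile(np.array([[1, 2, 3]]), (10000, 1))  # shape (10000, 3), no Python-level list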

Now we do the write with a single call to extend, which makes a single TBasket:

np_array = np.array([[1, 2, 3]] * 10000)
ak_array = ak.from_numpy(np_array)

starttime = time.time()

with uproot.recreate("good.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    fout["tree"].extend({"branch": ak_array})

print("Total time:", time.time() - starttime)

The total time (on my computer) is 0.0020 seconds and the TTree characteristics are much better:

******************************************************************************
*Tree    :tree      :                                                        *
*Entries :    10000 : Total =          240913 bytes  File  Size =       3069 *
*        :          : Tree compression factor = 107.70                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=     240576 bytes  File Size  =       2229 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression= 107.70     *
*............................................................................*

So, the writing is almost 1000× faster and the compression is 100× better. (With one entry per TBasket in the previous example, there was no compression because any compressed data would be bigger than the original!)
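
If the events really do arrive one at a time, as in the question, a middle ground is to buffer them in memory and call extend once per large chunk. The following is a minimal sketch, not uproot's prescribed pattern: event_source() is a hypothetical stand-in for the incoming stream, and CHUNK_SIZE should be tuned so that each TBasket holds at least hundreds of kilobytes.

import numpy as np
import awkward as ak
import uproot

CHUNK_SIZE = 100_000                      # entries per extend call, i.e. per TBasket

def event_source():
    # hypothetical stand-in for the real stream of incoming events
    for _ in range(10000):
        yield [1, 2, 3]

prototype = ak.from_numpy(np.array([[1, 2, 3]]))  # used only to declare the branch type

with uproot.recreate("buffered.root") as fout:
    fout.mktree("tree", {"branch": prototype.type})

    buffer = []
    for event in event_source():
        buffer.append(event)
        if len(buffer) >= CHUNK_SIZE:
            fout["tree"].extend({"branch": ak.from_numpy(np.asarray(buffer))})
            buffer.clear()

    if buffer:                            # flush the partially filled last chunk
        fout["tree"].extend({"branch": ak.from_numpy(np.asarray(buffer))})

This keeps memory use bounded while still writing few, large TBaskets.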

By comparison, if we do entry-by-entry writing with PyROOT,

import time
import array
import ROOT

data = [1, 2, 3]
holder = array.array("q", [0]*3)

file = ROOT.TFile("pyroot.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
tree.Branch("branch", holder, "branch[3]/L")

starttime = time.time()
for _ in range(10000):
    for i, x in enumerate(data):
        holder[i] = x

    tree.Fill()

tree.Write("", ROOT.TObject.kOverwrite)
file.Close()

print("Total time:", time.time() - starttime)

The total time (on my computer) is 0.062 seconds and the TTree characteristics are fine:

******************************************************************************
*Tree    :tree      : An Example Tree                                        *
*Entries :    10000 : Total =          241446 bytes  File  Size =       3521 *
*        :          : Tree compression factor =  78.01                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=     241087 bytes  File Size  =       3084 *
*Baskets :        8 : Basket Size=      32000 bytes  Compression=  78.01     *
*............................................................................*

So, PyROOT is 30× slower here, but the compression is almost as good. ROOT decided to make 8 TBaskets, which is configurable with AutoFlush parameters.
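
For instance, the flush interval can be set explicitly on the TTree (a minimal sketch; TTree::SetAutoFlush takes a number of entries after which baskets are written out, or a negative number meaning bytes, the default being -30000000, i.e. 30 MB):

import array
import ROOT

holder = array.array("q", [0] * 3)

file = ROOT.TFile("autoflush.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
tree.SetAutoFlush(10000)                 # write out baskets every 10000 entries
tree.Branch("branch", holder, "branch[3]/L")

for _ in range(10000):
    tree.Fill()

tree.Write("", ROOT.TObject.kOverwrite)
file.Close()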

Keep in mind, though, that this is a comparison of techniques, not libraries. If you wrap a NumPy array with RDataFrame and write that, then you can skip all of the overhead involved in the Python for loop and you get the advantages of columnar processing.
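
A minimal sketch of that route, assuming a ROOT version that provides ROOT.RDF.FromNumpy (older releases call it ROOT.RDF.MakeNumpyDataFrame); note that the columns must be flat, one-dimensional NumPy arrays:

import numpy as np
import ROOT

columns = {"pt": np.random.normal(size=10000)}  # each column is a flat 1D NumPy array
rdf = ROOT.RDF.FromNumpy(columns)               # wraps the arrays with RDataFrame
rdf.Snapshot("tree", "rdataframe.root")         # the event loop and writing run in compiled code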

But columnar processing only matters if you're working with big data. Much like compression, if you apply it to very small datasets (or a very small dataset many times), then it can hurt, rather than help.