Trying to improve Uproot4 Root tree file input deserialization of one branch

I am using Uproot to access a ROOT tree in Python and I am noticing a significant slowdown when I try to access one particular branch, wf, which contains an array of jagged arrays.

I am accessing the branches by using the Lazy/Awkward method and I am using the step_size option.

LazyFileWF = uproot.lazy('../Layers9_Xe_Phantom102_run1.root:dstree;111', filter_name= "wf",step_size=100)

I experience a 6 to 10 second slowdown when I access an entry in "LazyFileWF", but if I move on to the next consecutive entry, it only takes about 14 ms, and that holds up until the end of the step_size. However, my script needs to select entries randomly, not sequentially, which means every entry will take me about 8 seconds to access. I am able to access data from the other branches fairly quickly, with the exception of this one, and I wanted to find out why.

By using uproot.open() and then .show(), I noticed that the interpretation of the branch was labeled as AsObjects(AsVector(True, AsVector(False, dtype('>f4'))))

I did some digging in the Documentation and found this:

Uproot AsObjects Doc

It mentions I can use simplify to improve the slow deserialization.

So here's what I would like to know: based on the ROOT tree I have, can I use simplify to reduce the 8 second slowdown when accessing my branch? If so, how can I implement it? Is there a better way to read this branch?

I tried:

a = uproot.AsObjects.simplify(LazyFileWF.wf)
a

but I got an error telling me

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_147439/260639244.py in <module>
      5 LazyFileWF = uproot.lazy('../Layers9_Xe_Phantom102_run1.root:dstree;111', filter_name= "wf",step_size=100)
      6 events.show(typename_width=35, interpretation_width= 60)
----> 7 a = uproot.AsObjects.simplify(LazyFileWF.wf)
      8 a

~/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/uproot/interpretation/objects.py in simplify(self)
    245         ``self``.
    246         """
--> 247         if self._branch is not None:
    248             try:
    249                 return self._model.strided_interpretation(

~/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/awkward/highlevel.py in __getattr__(self, where)
   1129                 raise AttributeError(
   1130                     "no field named {0}".format(repr(where))
-> 1131                     + ak._util.exception_suffix(__file__)
   1132                 )
   1133 

AttributeError: no field named '_branch'

(https://github.com/scikit-hep/awkward-1.0/blob/1.7.0/src/awkward/highlevel.py#L1131)
There are 1 best solutions below

The AsObjects.simplify function is internally applied to produce the default TBranch.interpretation that you're using if you don't override the interpretation when loading a TBranch as an array. The only reason you'd pass a custom interpretation is if the default is wrong—it's a back-door to fixing cases in which Uproot auto-detected the interpretation incorrectly.

If the default TBranch.interpretation is

AsObjects(AsVector(True, AsVector(False, dtype('>f4'))))

then it did try to simplify—i.e. replace the AsObjects with an AsStridedObjects or AsJagged—but couldn't. This must be a C++ std::vector<std::vector<float>>, which has a variable number of bytes per object, so there aren't any simplified interpretations that will work. What's "simplified" about AsStridedObjects and AsJagged is that they have a fixed number of bytes per object and can therefore be interpreted in bulk, without a Python for loop over all the items in the TBasket.
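To make the fixed-versus-variable size point concrete, here is a minimal sketch using only the standard library. The byte layout is a toy assumption (a 4-byte big-endian count followed by big-endian floats), not ROOT's actual serialization, and the function names are made up for illustration. With a fixed size, the offset of any entry is simple arithmetic; with nested vectors, every count has to be read before the next entry can be located.

```python
import struct

def encode_nested(entry):
    """Toy layout (an assumption, not ROOT's format): outer count, then count + floats per inner vector."""
    out = struct.pack(">i", len(entry))
    for inner in entry:
        out += struct.pack(">i", len(inner))
        out += struct.pack(f">{len(inner)}f", *inner)
    return out

def decode_nested(buf, num_entries):
    """Variable-size entries: walk the buffer, reading each count to find where the next item starts."""
    pos = 0
    entries = []
    for _ in range(num_entries):
        (n_outer,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        entry = []
        for _ in range(n_outer):
            (n_inner,) = struct.unpack_from(">i", buf, pos)
            pos += 4
            entry.append(list(struct.unpack_from(f">{n_inner}f", buf, pos)))
            pos += 4 * n_inner
        entries.append(entry)
    return entries

def decode_fixed(buf, floats_per_entry, index):
    """Fixed-size entries: the offset of entry `index` is known without reading anything before it."""
    offset = index * 4 * floats_per_entry
    return list(struct.unpack_from(f">{floats_per_entry}f", buf, offset))
```

In the nested case there is no way to jump straight to entry i, which is why the interpretation ends up stepping through the items one by one.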

Incidentally, we studied this exact case in https://arxiv.org/abs/2102.13516, and the AwkwardForth solution described in that paper will be adapted into Uproot this summer. Unfortunately, that doesn't help you right now.

The slow-fast pattern you're seeing is because each time you ask for an entry from a different TBasket, Uproot interprets the whole TBasket. If you were running sequentially, you'd see a pause at the beginning of each TBasket. The lazy array is caching interpreted data, so when your random-access comes back to a previously read TBasket, it should be fast again: by only looking at the first few requests, you're getting an impression that each request will be slow, but that's just because early requests are more likely to hit unread TBaskets than late requests.

If you're only looking into this because the process as a whole is too slow (i.e. just letting it run and fill up its cache isn't good enough), then consider reading the whole TBranch into an array and randomly accessing the array. If your random access is in a Python loop (as opposed to Numba), then there's also nothing to be gained and some performance to be lost by calling __getitem__ on an Awkward Array as opposed to a NumPy array, so pass library="np".
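A rough sketch of that approach follows. The helper names load_branch_eagerly and sample_entries are hypothetical, and the file path, tree name, and branch name in the usage comment are simply the ones from the question. I'm assuming the whole branch fits in memory; the one-time read will still pay the full interpretation cost up front.

```python
import random

def load_branch_eagerly(path, tree_name, branch_name):
    """Read one whole TBranch into memory as a NumPy array (object dtype for nested vectors)."""
    import uproot  # imported lazily so the rest of the sketch works without uproot installed

    with uproot.open(path) as file:
        return file[tree_name][branch_name].array(library="np")

def sample_entries(array, how_many, seed=0):
    """Pick random entry indices and return (index, entry) pairs from an in-memory array."""
    rng = random.Random(seed)
    indices = [rng.randrange(len(array)) for _ in range(how_many)]
    return [(i, array[i]) for i in indices]

# Hypothetical usage with the names from the question:
# wf = load_branch_eagerly("../Layers9_Xe_Phantom102_run1.root", "dstree;111", "wf")
# for index, waveform in sample_entries(wf, 10):
#     ...  # each lookup is now a plain in-memory __getitem__
```

After the initial load, each random lookup is an ordinary array index rather than a trip through the lazy array's cache.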

If you don't have enough memory to load the entire TBranch into an array—which could explain why you're using a lazy array—then you're in a difficult position, because the lazy array's caching would work against you: it would evict from cache the TBaskets that haven't been hit in a while, so even a long-running process would end up repeatedly reading/interpreting. This is a fundamental issue in random-access problems of data that are too large for memory: there isn't a good way to cache it because new requests keep pushing old results out of cache. (The same problem applies to disk access, web-cached data, databases, etc.)
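The caching behaviour described above can be illustrated with a small simulation. This is only a model under stated assumptions: BasketCache and count_decodes are made-up names, each "decode" stands in for interpreting one TBasket, and capacity is counted in baskets for simplicity, whereas a real cache would more likely be limited by memory.

```python
import random
from collections import OrderedDict

class BasketCache:
    """Toy least-recently-used cache keyed by basket number (capacity=None means unbounded)."""

    def __init__(self, entries_per_basket, capacity=None):
        self.entries_per_basket = entries_per_basket
        self.capacity = capacity
        self.cache = OrderedDict()
        self.decodes = 0  # how many times a whole basket had to be interpreted

    def get(self, entry):
        basket = entry // self.entries_per_basket
        if basket in self.cache:
            self.cache.move_to_end(basket)  # cache hit: mark as recently used
        else:
            self.decodes += 1  # cache miss: stands in for the slow interpretation
            self.cache[basket] = True
            if self.capacity is not None and len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used basket
        return basket

def count_decodes(num_entries, entries_per_basket, capacity, requests, seed=0):
    """Count basket decodes for uniformly random entry requests."""
    rng = random.Random(seed)
    cache = BasketCache(entries_per_basket, capacity)
    for _ in range(requests):
        cache.get(rng.randrange(num_entries))
    return cache.decodes
```

With an unbounded cache, the number of decodes can never exceed the number of baskets, so the slow requests thin out over time. With a capacity much smaller than the number of baskets, most random requests miss and the decode count stays close to the request count.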

Hopefully, the array fits into memory and you can random-access it in memory. Awkward Arrays have slower __getitem__ than NumPy, but they're more compact in memory, so which one will work best for you depends on the details.

I hope these pointers help!