Uproot and Dask


I am experimenting with the Dask functionality of uproot, i.e. loading branches into Dask arrays.

Unfortunately, I don't understand the results I get when performing computations on these arrays, e.g.

import dask.array as da
import uproot

tree = uproot.dask("file.root:tree", library = 'np')
branch_data = tree["testbranch"]
mean = da.mean(branch_data).compute()

The branch data is a 2-dimensional array, and I would like to compute the mean along axis=1, i.e. for each row. Strangely, the output is the same as if I did:

np.mean(branch_data.T, axis = 1)

It somehow computes the mean of the columns instead. Adding axis=1 results in an error saying the axis is out of bounds for an array of dimension 1. But when I call compute() on the branch and print the actual data, it is clearly the expected 2-dimensional array. The same thing happens with other methods such as da.sum().

EDIT: Here I provide an example that reproduces the problem:

import numpy as np
import uproot

# creating sample file with tree
with uproot.recreate("test.root") as file:
    file["test_tree"] = {"test_branch": np.random.random((100,10))}

# Standard uproot (output I aim for with the dask option)
tree = uproot.open("./test.root:test_tree")
branch = tree["test_branch"].array(library = 'np')
mean = np.mean(branch, axis = 1)
print(mean)

# Uproot-Dask (Will compute mean columnwise but would expect a single scalar. Strange...)
tree = uproot.dask("./test.root:test_tree", library = 'np')
branch = tree["test_branch"]
mean = np.mean(branch).compute()
print(mean)

# This should correspond to the standard uproot output but does not work. Also strange
mean = np.mean(branch, axis = 1).compute()
print(mean)

joanis On

According to the uproot manual, the uproot.dask() function "returns an unevaluated Dask array from TTrees."

Playing with the branch object you created in your sample code, I was able to convert it to a plain NumPy array by calling compute() on it:

import numpy as np
import uproot

with uproot.recreate("test.root") as file:
    file["test_tree"] = {"test_branch": np.random.random((100,10))}

tree = uproot.dask("./test.root:test_tree", library = 'np')
branch = tree["test_branch"]
b_as_array = branch.compute()

Continuing interactively:

>>> branch
dask.array<test_branch-from-uproot, shape=(100,), dtype=float64, chunksize=(100,), chunktype=numpy.ndarray>
>>> b_as_array = branch.compute()
>>> b_as_array.shape
(100, 10)
>>> np.mean(b_as_array, axis=0)
array([0.54450986, 0.48361194, 0.52477069, 0.50902231, 0.52925032,
       0.47309532, 0.49022969, 0.48736406, 0.5027256 , 0.56298907])

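Note that I had to use axis=0 to get the per-column means shown above. np.mean(b_as_array) with no axis returned just one number, the average of all 1000 values, and axis=1 would give the per-row means (one value per row, 100 in total) that the question asks for.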
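For reference, here is a minimal NumPy-only sketch of the three axis choices on an array of the same shape as the test branch. The random array is just a stand-in for the data read from the file, and the variable names are made up for illustration:

```python
import numpy as np

# Stand-in for the (100, 10) branch data; not read from a ROOT file.
rng = np.random.default_rng(0)
a = rng.random((100, 10))

overall = np.mean(a)             # no axis: one scalar over all 1000 values
per_column = np.mean(a, axis=0)  # collapses the rows: one mean per column
per_row = np.mean(a, axis=1)     # collapses the columns: one mean per row

print(np.ndim(overall), per_column.shape, per_row.shape)
```

So on a genuinely 2-dimensional array, axis=1 is the call that yields one mean per row.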

Now, you probably want to keep the efficiency of Dask arrays, and Dask itself provides the same operations. It is generally better to use the Dask implementations than the equivalent NumPy ones.

E.g.

>>> branch.mean().compute()
array([0.54450986, 0.48361194, 0.52477069, 0.50902231, 0.52925032,
       0.47309532, 0.49022969, 0.48736406, 0.5027256 , 0.56298907])
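The repr printed earlier reports shape=(100,) for the lazy array, while the computed result has shape (100, 10). That mismatch would plausibly explain why axis=1 is rejected on the Dask side. One possible workaround, sketched below under the assumption that the branch fits in memory, is to materialize it once and wrap the result with da.from_array so that Dask sees the true 2-dimensional shape. The random array stands in for branch.compute(), and the chunk size is an arbitrary choice:

```python
import numpy as np
import dask.array as da

# Stand-in for `b_as_array = branch.compute()` from the code above.
b_as_array = np.random.default_rng(0).random((100, 10))

# Re-wrap the in-memory data so the Dask metadata says (100, 10).
branch_2d = da.from_array(b_as_array, chunks=(25, 10))

# With correct metadata, axis=1 is valid and gives one mean per row.
row_means = branch_2d.mean(axis=1).compute()
print(row_means.shape)
```

This gives up lazy reading from the file, so it is only a sketch of how to get the row-wise result, not a fix for the shape reported by uproot.dask() itself.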

Lots more details at https://docs.dask.org/en/stable/array.html