I want to extract data out of a root file and get it into shape to end with a numpy array/tensor to fill it into a neural-network. I am already able to get the track data I want in shape with a padding to convert it into a numpy array, but I want to extend my array with the data of the jet they are originated from. So I have the information of all the tracks, of each jet and the intervall of tracks they are corresponded too. My first instinc was to construct an array in the shape of the tracks and using something like np.dstack
to merge those two.
import uproot4 as uproot
import numpy as np
import awkward1 as ak
def ak_into_np(ak_array):
data=np.dstack([ak.to_numpy(x) for x in ak_array])
return data
def get_data(filename,padding_size):
f=uproot.open(filename)
events= f["btagana/ttree;1"]
track_data=events.arrays(filter_name=["Track_pt","Track_phi","Track_eta","Track_dxy","Track_dz","Track_charge"])
jet_interval=events.arrays(filter_name=["Jet_nFirstTrack","Jet_nLastTrack"])
jet_interval=jet_interval["Jet_nLastTrack"]-jet_interval["Jet_nFirstTrack"]
jet_data=events.arrays(filter_name=["Jet_pt","Jet_phi","Jet_eta"])
arrays_track=ak.unzip(ak.fill_none(ak.pad_none(track_data, padding_size), 0))
arrays_interval=ak.unzip(ak.fill_none(ak.pad_none(jet_interval,padding_size),0))
arrays_jet=ak.unzip(ak.fill_none(ak.pad_none(jet_data,padding_size),0))
track=ak_into_np(arrays_track)
jet=ak_into_np(arrays_jet)
interval=ak_into_np(arrays_interval)
return track,jet,interval
This is where I am so far. For efficiency reason I hope to be able to achieve this in awkward before going into numpy. I tried it in numpy with following:
def extend(track,jet,interval):
events,tracks,varstrack=(np.shape(track))
events,jets,varsjet=np.shape(jet)
jet_into_track_data=[]
for i in range(events):
dataloop=[]
for k in range(jets):
if interval[i][k][0]!=0 :
dataloop.append(np.broadcast_to(jet[i][k],(interval[i][k][0],varsjet)))
else
jet_into_track_data.append(dataloop)
return jet_into_track_data
but it already takes about 3 seconds without even achieving my goal for only 2000 events. The aim is basically [track_variables] ->[track_variables,jet_variables if track is in intervall]
and it shall be stored [(event1)[[track_1],...,[track_padding_size]],...,(eventn)[[track_1],...,[track_padding_size]]]
I don't get to see the structure of your original data and I don't have a clear notion of what the desired final state is, but I can give an example inspired by the above that you can adapt. Also, I'm going to ignore the padding because that only complicates things. You'll probably want to put off padding until you've finished the combinatorics.
The
tracks
andjets
below come from the same set of events (i.e. the arrays have the same lengths and they're jagged, with a different number of tracks in each list and a different number of jets in each list). Sincejets
are somehow derived fromtracks
, there are strictly fewer jets than tracks, and I take it that in your problem, the link between them is such that each jet corresponds to a contiguous, non-overlapping set oftracks
(the "intervals").Real tracks and jets would have a lot more properties than these—this is a bare minimum example. I've given the tracks an
"id"
so we can tell one from another, but these jets only have the inclusive"start"
index and exclusive"stop"
index.If your tracks don't come with identifiers, you can add them with ak.local_index.
If you had all combinations between
tracks
andjets
in each event, then you could use a slice to pick the ones that match. This is particularly useful when you have imprecise matches (that you have to match with ΔR or something). The ak.cartesian function produces lists of combinations, andnested=True
groups the results by the first argument:We can go from there, selecting
"id"
values that are between the"start"
and"stop"
values. I started writing up a solution, but the slicing gets kind of complicated, generating all Cartesian combinations is more computationally expensive than is strictly needed for this problem (though no where near as expensive as writing a for loop!), and the general view of the Cartesian product is more useful for approximate matches than the exact indexes you have.Instead, let's write a for loop in Numba, a just-in-time compiler for Python. Numba is limited in the Python that it can compile (all types must be known at compile-time), but it can recognize read-only Awkward Arrays and append-only ak.ArrayBuilder.
Here's a loop that considers only
tracks
andjets
in the same event, loops over tracks, and puts the firstjet
that matches eachtrack
into an output ArrayBuilder.Notice that these jet objects are duplicated to match the appropriate track. (This "duplication" is actually just a pointer, not really a copy.) To attach this to the tracks, you can assign it:
Here, I've assumed that you want to attach a jet to each track—perhaps you wanted to attach the set of all associated tracks to each jet:
(Since I did both, now there are links both ways.) Attaching jets to tracks, rather than tracks to jets, has the added complication that some tracks might not be associated with any jet, in which case you'd have to account for possibly-missing data. (Hint: make them lists of zero or one jet for "no match" and "yes match," then use ak.firsts to convert empty lists to Nones.)
Or you could make the output of the Numba-compiled function be plain NumPy arrays. Numba knows a lot of NumPy functions. Hint for building up a Numba-compiled function: start with a minimal loop over data that doesn't do anything, test-running it as you go add output. Numba complains with a lot of output when it can't recognize the types of something—which is allowed in non-compiled Python—so it's good to know which change caused it to complain so much.
Hopefully, these examples can get you started!