Simulate numpy.cumsum at the deepest level of a JaggedArray


I have a single-level nested array, and I'd like to calculate the running sum at the deepest level:

<JaggedArray [[0.8143442176354316 0.18565578236456845] [1.0] [0.8029232081440607 0.1970767918559393] ... [0.42036116755776154 0.5796388324422386] [0.18512572262194366 0.31914669745950724 0.13598232751162054 0.3597452524069286] [0.34350475143310905 0.19023361856972956 0.4662616299971615]] at 0x7f8969e32af0>

After doing something like numpy.cumsum(jagged_array), I'd like to have:

[[0.8143442176354316 1.0] [1.0] [0.8029232081440607 1.0] ...

In short: the running sum at the deepest level, restarting with each new "event".

I'm using awkward0, and the documentation says that broadcasting runs at the deepest level. However, when I tried handing a JaggedArray directly to numpy.cumsum, I got an error: operands could not be broadcast together with shapes (2,) (3,)

The dataset is large, so I'd like to stay within the awkward system and avoid Python loops when processing it.

There are 2 best solutions below

I think you're just trying to call np.cumsum on each of the lists in your larger list. Let me know if I'm misunderstanding your intention.

In that case, a list comprehension does it (assuming numpy is imported as np):

result = [np.cumsum(one_list) for one_list in jagged_array]
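
To make the behavior concrete, here is a minimal sketch of the same list comprehension on a plain list of lists. The name `jagged` is a hypothetical stand-in for the real JaggedArray, and the values are chosen only so the sums are easy to check by hand.

```python
import numpy as np

# Hypothetical stand-in for the jagged array: a plain list of lists.
jagged = [[1.0, 2.0, 3.0], [], [4.0, 5.0]]

# One np.cumsum call per inner list, so the running sum restarts in every list.
result = [np.cumsum(one_list).tolist() for one_list in jagged]
print(result)  # [[1.0, 3.0, 6.0], [], [4.0, 9.0]]
```

Note that this iterates over the outer list in Python, so it costs one np.cumsum call per list, which is the kind of loop the question wanted to avoid on a large dataset.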

There isn't a "high-level" way to do this, meaning one that is independent of the array's layout, but I can walk you through the low-level way.

Awkward 0.x (obsolete)

Assuming that you have a simple jagged array,

>>> import awkward0
>>> import numpy as np
>>> array = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> array.layout
 layout 
[    ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
[     0]   ndarray(shape=3, dtype=dtype('int64'))
[     1]   ndarray(shape=3, dtype=dtype('int64'))
[     2]   ndarray(shape=5, dtype=dtype('float64'))

You can apply the cumulative sum to the content:

>>> np.cumsum(array.content)
array([ 1.1,  3.3,  6.6, 11. , 16.5])

and wrap that up as a new jagged array:

>>> scan = awkward0.JaggedArray.fromoffsets(array.offsets, np.cumsum(array.content))
>>> scan
<JaggedArray [[1.1 3.3000000000000003 6.6] [] [11.0 16.5]] at 0x7f0621a826a0>
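
If it helps to see what the offsets and content are doing, the following is a rough NumPy-only sketch of the same step. The `content` and `offsets` arrays are typed in by hand here as an assumption about what the jagged array holds internally; they are not read from an actual JaggedArray.

```python
import numpy as np

# Hypothetical flat representation of [[1.1, 2.2, 3.3], [], [4.4, 5.5]].
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
offsets = np.array([0, 3, 3, 5])  # list i spans content[offsets[i]:offsets[i + 1]]

# A single cumulative sum over the flat content, ignoring list boundaries.
flat_scan = np.cumsum(content)

# Re-split the flat scan at the interior offsets to view it list by list.
# (This comprehension is only for display; the scan itself is vectorized.)
lists = [part.tolist() for part in np.split(flat_scan, offsets[1:-1])]
print(lists)
```

The third list comes out as roughly [11.0, 16.5] because the sum keeps accumulating across list boundaries.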

Awkward 1.x

The offsets and content structure that we directly manipulated in Awkward 0.x are now hidden in a "layout" to distinguish between high-level operations (which don't require knowledge of the exact layout) and low-level operations (which do). This problem doesn't have a high-level solution, and the low-level way is like the above, but it involves extra wrapping and unwrapping.

>>> import awkward as ak
>>> import numpy as np
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> array
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
>>> layout = array.layout
>>> layout
<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x55737ef6f880"/></offsets>
    <content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x55737ef71890"/></content>
</ListOffsetArray64>

As before, you can do a cumulative sum on the content:

>>> np.cumsum(layout.content)
array([ 1.1,  3.3,  6.6, 11. , 16.5])

Here's how it gets wrapped back up:

>>> scan = ak.Array(
...     ak.layout.ListOffsetArray64(
...         layout.offsets,
...         ak.layout.NumpyArray(
...             np.cumsum(layout.content)
...         )
...     )
... )
...
>>> scan
<Array [[1.1, 3.3, 6.6], [], [11, 16.5]] type='3 * var * float64'>

What if you want the scan per-list?

If you want a result like Frank Yellin's, in which the scan restarts in each list, the fact that we did one np.cumsum on the whole content is a problem. In concrete terms, the third list starts with 11 instead of 4.4.

A vectorized way to do that is to subtract the first scan element of each list from the whole list and add the first array element back in. In both Awkward 0.x and 1.x, this can be done with slices like array[:, 0] and broadcasting, but empty lists (if you have them) are going to be a problem because they have no first element to slice. Awkward 1.x has enough alternatives to work around that:

>>> ak.firsts(scan)
<Array [1.1, None, 11] type='3 * ?float64'>

>>> scan - ak.firsts(scan)
<Array [[0, 2.2, 5.5], None, [0, 5.5]] type='3 * option[var * float64]'>

>>> scan - ak.firsts(scan) + ak.firsts(array)
<Array [[1.1, 3.3, 6.6], None, [4.4, 9.9]] type='3 * option[var * float64]'>

>>> ak.fill_none(scan - ak.firsts(scan) + ak.firsts(array), [])
<Array [[1.1, 3.3, 6.6], [], [4.4, 9.9]] type='3 * var * float64'>

Most of these functions (such as ak.firsts and ak.fill_none) don't have equivalents in Awkward 0.x.
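
For Awkward 0.x, one possible workaround is to do the per-list correction on the flat content with plain NumPy. This is a different technique from the ak.firsts approach above: instead of subtracting each list's first scan element, it subtracts the total accumulated before each list starts, which also handles empty lists. The helper name `cumsum_per_list` is hypothetical, and the sketch assumes the offsets start at 0 and end at len(content).

```python
import numpy as np

def cumsum_per_list(content, offsets):
    """Running sum that restarts at every list boundary (hypothetical helper).

    Assumes offsets[0] == 0 and offsets[-1] == len(content).
    """
    content = np.asarray(content, dtype=np.float64)
    offsets = np.asarray(offsets, dtype=np.int64)
    scan = np.cumsum(content)
    # before[k] is the total of content[:k], for k = 0 .. len(content).
    before = np.concatenate(([0.0], scan))
    # For each list, the amount accumulated before its first element.
    base = before[offsets[:-1]]
    # Spread that amount over the list's elements; empty lists repeat 0 times.
    counts = np.diff(offsets)
    return scan - np.repeat(base, counts)

content = [1.1, 2.2, 3.3, 4.4, 5.5]
offsets = [0, 3, 3, 5]
per_list = cumsum_per_list(content, offsets)
print(per_list)  # approximately [1.1, 3.3, 6.6, 4.4, 9.9]
```

The flat result can then be wrapped up the same way as before, for example with awkward0.JaggedArray.fromoffsets(array.offsets, per_list). One caveat: subtracting a large running total from nearby values can lose some floating-point precision on very long arrays.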