Using `group_by_rolling` to turn long sequence data into many shorter sequences? Also with a group column


I have a large timeseries dataset which I would like to break up into many shorter sequences, in preparation for using it with an LSTM model. I just cannot quite figure out the semantics of it.

Here's a toy example, that I think shows what I am trying to do:

import numpy as np
import polars as pl

df_ex = pl.DataFrame({
    "person": ["Alice",]*5 + ["Bob",]*5 + ["Cap"]*5,
    "x1" : np.array( [1,2,3,4,5]*3 ),
    "x2" : np.sin( [10,20,30,40,50]*3 ),
})
df_ex

shape: (15, 3)
┌────────┬─────┬───────────┐
│ person ┆ x1  ┆ x2        │
│ ---    ┆ --- ┆ ---       │
│ str    ┆ i64 ┆ f64       │
╞════════╪═════╪═══════════╡
│ Alice  ┆ 1   ┆ -0.544021 │
│ Alice  ┆ 2   ┆ 0.912945  │
│ Alice  ┆ 3   ┆ -0.988032 │
│ Alice  ┆ 4   ┆ 0.745113  │
│ …      ┆ …   ┆ …         │
│ Cap    ┆ 2   ┆ 0.912945  │
│ Cap    ┆ 3   ┆ -0.988032 │
│ Cap    ┆ 4   ┆ 0.745113  │
│ Cap    ┆ 5   ┆ -0.262375 │
└────────┴─────┴───────────┘

then turning it into sequences:

df_lists = (df_ex
            .with_columns( pl.ones(pl.count(), dtype=pl.Int32).cumsum().over('person').alias('enum') )
            .group_by_rolling(by='person', index_column='enum', period='3i')
            .agg( pl.col(['x1','x2'])  )
            .filter( pl.col('x1').list.lengths() >=3 )
            )
df_lists

shape: (9, 4)
┌────────┬──────┬───────────┬──────────────────────────────────┐
│ person ┆ enum ┆ x1        ┆ x2                               │
│ ---    ┆ ---  ┆ ---       ┆ ---                              │
│ str    ┆ i32  ┆ list[i64] ┆ list[f64]                        │
╞════════╪══════╪═══════════╪══════════════════════════════════╡
│ Alice  ┆ 3    ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Alice  ┆ 4    ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113]  │
│ Alice  ┆ 5    ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
│ Bob    ┆ 3    ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Bob    ┆ 4    ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113]  │
│ Bob    ┆ 5    ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
│ Cap    ┆ 3    ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Cap    ┆ 4    ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113]  │
│ Cap    ┆ 5    ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
└────────┴──────┴───────────┴──────────────────────────────────┘

So now, df_lists has shape (9,4).

As you can see, there's a group for "person", here, which needs to be maintained.

I ultimately want to end up with a numpy array with shape (3, 3, 3, 2), where:

  • dim 0 : person (ordinally - I will need to track this)
  • dim 1 : sequence #
  • dim 2 : position along the sequence
  • dim 3 : features at each sequence point

Alternately, maybe dropping dim 0 in favor of a dictionary would be smarter. But I am not even sure how to start collapsing dimensions here, so either outcome would be fine.
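For reference, the collapse from windows to a dense array can be done entirely in numpy once the windows exist. A minimal numpy-only sketch (the data and variable names here just mimic the toy frame above; no polars involved):

```python
import numpy as np

# Mimic the toy frame above: 5 time steps, 2 features, identical for all 3 people.
x1 = np.array([1, 2, 3, 4, 5], dtype=float)
x2 = np.sin([10, 20, 30, 40, 50])
feats = np.stack([x1, x2], axis=1)        # (5, 2): one row per time step

# All length-3 windows along the time axis.
windows = np.stack([feats[i:i + 3] for i in range(len(feats) - 2)])  # (3, 3, 2)

# One copy per person gives the desired 4-d layout: (person, window, step, feature).
people = np.stack([windows] * 3)          # (3, 3, 3, 2)
print(people.shape)
```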

My questions are:

  1. Is there a straightforward way to get that numpy array out? Or maybe something close, that will require swapaxes to get there.

  2. The way of accomplishing the group_by_rolling seems fairly roundabout, so please let me know if I've done something silly.


EDIT:

OK, following the help from @jqurious , and to clarify the problem, let's add some more. Yes, the code below gives a dictionary, just so I can keep mental track of this as I implement. Actually, just using a dict might be fine in the long run, as I can still wrap a custom torch DataLoader around it, one that iterates over sequences of sequences from one person at a time (just how the RL game has to work). The initial attraction to a 4-d array was that I thought I could do it all via polars grammar, without any outer loop.

Notably, this now doesn't even use group_by_rolling.

Below is some code that does seem to accomplish what I want. Note that I do have to do the .swapaxes(1,0).swapaxes(2,1) swizzle at the end.

import collections
from itertools import islice

def sliding_window(iterable, n):
    # sliding_window('ABCDEFG', 4) --> ABCD BCDE CDEF DEFG
    it = iter(iterable)
    window = collections.deque(islice(it, n-1), maxlen=n)
    for x in it:
        window.append(x)
        yield tuple(window)

all_seqs = {}
pids = df_ex['person'].unique()
for pid in pids:
    df = df_ex.filter( pl.col('person') == pid )

    seqs = np.array(
        [
            list( sliding_window( df['x1'], 3 ) ),   # df, not df_ex: use the per-person filter
            list( sliding_window( df['x2'], 3 ) ),
        ]
    )
    print(f"{pid}: {seqs.shape}")
    if seqs.shape[1] == 0:
        continue

    seqs = seqs.swapaxes(1,0).swapaxes(2,1)
    all_seqs[pid] = seqs

Then all_seqs['Bob'].shape == (3, 3, 2), and all_seqs['Bob'] is:

array([[[ 1.        , -0.54402111],
        [ 2.        ,  0.91294525],
        [ 3.        , -0.98803162]],

       [[ 2.        ,  0.91294525],
        [ 3.        , -0.98803162],
        [ 4.        ,  0.74511316]],

       [[ 3.        , -0.98803162],
        [ 4.        ,  0.74511316],
        [ 5.        , -0.26237485]]])
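
As a standalone sanity check, the sliding_window recipe (repeated here verbatim so the snippet runs on its own) yields the expected overlapping tuples:

```python
import collections
from itertools import islice

def sliding_window(iterable, n):
    # sliding_window('ABCDEFG', 4) --> ABCD BCDE CDEF DEFG
    it = iter(iterable)
    window = collections.deque(islice(it, n - 1), maxlen=n)
    for x in it:
        window.append(x)
        yield tuple(window)

print(list(sliding_window([1, 2, 3, 4, 5], 3)))
# [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```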

In the "ideal" case, it would be a single 4-d array arr such that:
arr[0,:,:,:] --> "Bob"
arr[1,:,:,:] --> "Alice"
arr[2,:,:,:] --> "Cap"
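
If the dict is kept instead, collapsing it into that 4-d array while tracking person order is a short np.stack away. A sketch, using dummy (3, 3, 2) arrays in place of the real per-person sequences:

```python
import numpy as np

# Stand-in for the all_seqs dict built above: one (3, 3, 2) array per person.
all_seqs = {
    p: np.full((3, 3, 2), i, dtype=float)
    for i, p in enumerate(["Bob", "Alice", "Cap"])
}

order = list(all_seqs)                        # dim 0 index -> person name
arr = np.stack([all_seqs[p] for p in order])  # (3, 3, 3, 2)

print(arr.shape)                 # (3, 3, 3, 2)
print(order.index("Alice"))      # which slice of arr holds Alice: 1
```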


There are 2 answers below.

The first answer:

Just addressing point #2 until further clarification of the desired output:

The way of accomplishing the group_by_rolling seems fairly roundabout

The current approach can be simplified a bit:

  • .int_range() for the enumeration
  • if you are extracting a column "as-is", you can refer to it directly without pl.col()
(df.with_columns( pl.int_range(0, pl.count()).over('person').alias('enum') )
   .group_by_rolling(by='person', index_column='enum', period='3i')
   .agg( 'x1','x2' )
   .filter( pl.col('x1').list.len() >=3 )
)
shape: (9, 4)
┌────────┬──────┬───────────┬──────────────────────────────────┐
│ person ┆ enum ┆ x1        ┆ x2                               │
│ ---    ┆ ---  ┆ ---       ┆ ---                              │
│ str    ┆ i64  ┆ list[i64] ┆ list[f64]                        │
╞════════╪══════╪═══════════╪══════════════════════════════════╡
│ Alice  ┆ 2    ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Alice  ┆ 3    ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113]  │
│ Alice  ┆ 4    ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
│ Bob    ┆ 2    ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Bob    ┆ 3    ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113]  │
│ Bob    ┆ 4    ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
│ Cap    ┆ 2    ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Cap    ┆ 3    ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113]  │
│ Cap    ┆ 4    ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
└────────┴──────┴───────────┴──────────────────────────────────┘
  • Note that .list.lengths() is deprecated: "DeprecationWarning: lengths is deprecated. It has been renamed to len." https://github.com/pola-rs/polars/issues/11577

It is also possible to pass an expression directly to index_column

(
   df.group_by_rolling(
      by = "person",
      index_column = pl.int_range(0, pl.count()).over("person").alias("enum"),
      period = "3i"
   )
   .agg("x1", "x2")
   .filter(pl.col("x1").list.len() >= 3)
)

Perhaps there is room for a feature/enhancement request to simplify this particular task in Polars?

itertools calls this sliding_window() in its recipes, for example.

The second answer:

Perhaps a regular .group_by() and "manual slicing" with .slice() is a simpler approach.

all_seqs = (
   df.group_by("person")
     .agg(
        pl.concat_list(pl.exclude("person").slice(n, 3)).alias(f"{n}")
        for n in range(3)
     )
)
shape: (3, 4)
┌────────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────────────────┐
│ person ┆ 0                                 ┆ 1                                 ┆ 2                                 │
│ ---    ┆ ---                               ┆ ---                               ┆ ---                               │
│ str    ┆ list[list[f64]]                   ┆ list[list[f64]]                   ┆ list[list[f64]]                   │
╞════════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════╡
│ Alice  ┆ [[1.0, -0.544021], [2.0, 0.91294… ┆ [[2.0, 0.912945], [3.0, -0.98803… ┆ [[3.0, -0.988032], [4.0, 0.74511… │
│ Cap    ┆ [[1.0, -0.544021], [2.0, 0.91294… ┆ [[2.0, 0.912945], [3.0, -0.98803… ┆ [[3.0, -0.988032], [4.0, 0.74511… │
│ Bob    ┆ [[1.0, -0.544021], [2.0, 0.91294… ┆ [[2.0, 0.912945], [3.0, -0.98803… ┆ [[3.0, -0.988032], [4.0, 0.74511… │
└────────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────────────────┘

As for unpacking into a numpy array, perhaps .rows_by_key() may help here:

all_seqs = { k: np.array(v[0]) for k, v in all_seqs.rows_by_key("person").items() }
all_seqs["Bob"]
array([[[ 1.        , -0.54402111],
        [ 2.        ,  0.91294525],
        [ 3.        , -0.98803162]],

       [[ 2.        ,  0.91294525],
        [ 3.        , -0.98803162],
        [ 4.        ,  0.74511316]],

       [[ 3.        , -0.98803162],
        [ 4.        ,  0.74511316],
        [ 5.        , -0.26237485]]])