I have a large timeseries dataset which I would like to split into shorter, overlapping sequences, in preparation for using it with an LSTM model. I just cannot quite figure out the semantics of it.
Here's a toy example, that I think shows what I am trying to do:
import numpy as np
import polars as pl

df_ex = pl.DataFrame({
    "person": ["Alice"] * 5 + ["Bob"] * 5 + ["Cap"] * 5,
    "x1": np.array([1, 2, 3, 4, 5] * 3),
    "x2": np.sin([10, 20, 30, 40, 50] * 3),
})
df_ex
shape: (15, 3)
┌────────┬─────┬───────────┐
│ person ┆ x1 ┆ x2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞════════╪═════╪═══════════╡
│ Alice ┆ 1 ┆ -0.544021 │
│ Alice ┆ 2 ┆ 0.912945 │
│ Alice ┆ 3 ┆ -0.988032 │
│ Alice ┆ 4 ┆ 0.745113 │
│ … ┆ … ┆ … │
│ Cap ┆ 2 ┆ 0.912945 │
│ Cap ┆ 3 ┆ -0.988032 │
│ Cap ┆ 4 ┆ 0.745113 │
│ Cap ┆ 5 ┆ -0.262375 │
└────────┴─────┴───────────┘
then turning it into sequences:
df_lists = (
    df_ex
    .with_columns(pl.ones(pl.count(), dtype=pl.Int32).cumsum().over('person').alias('enum'))
    .group_by_rolling(by='person', index_column='enum', period='3i')
    .agg(pl.col(['x1', 'x2']))
    .filter(pl.col('x1').list.lengths() >= 3)
)
df_lists
shape: (9, 4)
┌────────┬──────┬───────────┬──────────────────────────────────┐
│ person ┆ enum ┆ x1 ┆ x2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ list[i64] ┆ list[f64] │
╞════════╪══════╪═══════════╪══════════════════════════════════╡
│ Alice ┆ 3 ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Alice ┆ 4 ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113] │
│ Alice ┆ 5 ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
│ Bob ┆ 3 ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Bob ┆ 4 ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113] │
│ Bob ┆ 5 ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
│ Cap ┆ 3 ┆ [1, 2, 3] ┆ [-0.544021, 0.912945, -0.988032] │
│ Cap ┆ 4 ┆ [2, 3, 4] ┆ [0.912945, -0.988032, 0.745113] │
│ Cap ┆ 5 ┆ [3, 4, 5] ┆ [-0.988032, 0.745113, -0.262375] │
└────────┴──────┴───────────┴──────────────────────────────────┘
So now, df_lists has shape (9, 4). As you can see, there's a group for "person" here, which needs to be maintained.
I ultimately want to end up with a numpy array with shape (3, 3, 3, 2), where:
dim 0 : person (ordinally - I will need to track this)
dim 1 : sequence #
dim 2 : along the sequence
dim 3 : features at each sequence point
Alternatively, maybe dropping dim 0 in favor of a dictionary would be smarter. But I am not even sure how to start collapsing dimensions here, so either outcome would be fine.
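For what it's worth, numpy can build the windows directly, without list columns. A minimal sketch of that idea, using a made-up 5x2 feature matrix standing in for one person's data (the three-way stack is just a placeholder for the real per-person arrays):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical per-person feature matrix: 5 time steps x 2 features (x1, x2)
x = np.column_stack([np.arange(1, 6), np.sin([10, 20, 30, 40, 50])])

# Window the time axis; output shape is (n_windows, n_features, window) = (3, 2, 3)
w = sliding_window_view(x, window_shape=3, axis=0)

# Reorder to (sequence #, step along sequence, feature) = (3, 3, 2)
w = w.transpose(0, 2, 1)

# Stacking one such array per person yields (person, sequence, step, feature)
arr = np.stack([w, w, w])  # placeholder: same data for all three people
assert arr.shape == (3, 3, 3, 2)
```

Since `sliding_window_view` appends the window dimension last, the single `transpose` replaces any chained `swapaxes` calls.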
My questions are:
1. Is there a straightforward way to get that numpy array out? Or maybe something close, that will require a swapaxes to get there.
2. The way of accomplishing the group_by_rolling seems fairly roundabout, so please let me know if I've done something silly.
EDIT:
OK, following the help from @jqurious, and to clarify the problem, let's add some more.
Yes, the code below gives a dictionary, just so I can keep mental track of this as I implement it ... and actually just using a dict might be fine in the long run, as I can still wrap a custom torch DataLoader around it, which iterates over sequences of sequences from one person at a time (just how the RL game has to work).
The initial attraction to a 4-d array was because I thought I could do it all via polars grammar, without any outer loop. Notably, this now doesn't even use group_by_rolling.
Below is some code that does seem to accomplish what I want. Note that I do have to do the .swapaxes(1,0).swapaxes(2,1) swizzle at the end.
import collections
from itertools import islice

def sliding_window(iterable, n):
    # sliding_window('ABCDEFG', 4) --> ABCD BCDE CDEF DEFG
    it = iter(iterable)
    window = collections.deque(islice(it, n - 1), maxlen=n)
    for x in it:
        window.append(x)
        yield tuple(window)

all_seqs = {}
pids = df_ex['person'].unique()
for pid in pids:
    df = df_ex.filter(pl.col('person') == pid)
    seqs = np.array(
        [
            list(sliding_window(df['x1'], 3)),  # use the filtered df, not df_ex, so each person only sees their own rows
            list(sliding_window(df['x2'], 3)),
        ]
    )
    print(f"{pid}: {seqs.shape}")
    if seqs.shape[1] == 0:
        continue
    seqs = seqs.swapaxes(1, 0).swapaxes(2, 1)
    all_seqs[pid] = seqs
then all_seqs['Bob'].shape == (3, 3, 2), and all_seqs['Bob'] is:
array([[[ 1. , -0.54402111],
[ 2. , 0.91294525],
[ 3. , -0.98803162]],
[[ 2. , 0.91294525],
[ 3. , -0.98803162],
[ 4. , 0.74511316]],
[[ 3. , -0.98803162],
[ 4. , 0.74511316],
[ 5. , -0.26237485]]])
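As an aside, the two chained swapaxes calls amount to a single axis permutation, which numpy can do in one transpose call. A small sketch on dummy data:

```python
import numpy as np

a = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# The two-step swizzle used above...
b = a.swapaxes(1, 0).swapaxes(2, 1)

# ...is the same permutation done in one call: old axis 0 moves to the end
c = a.transpose(1, 2, 0)
assert (b == c).all() and b.shape == (3, 4, 2)
```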
In the "ideal" case, it would be an array arr such that:
arr[0,:,:,:] --> "Bob"
arr[1,:,:,:] --> "Alice"
arr[2,:,:,:] --> "Cap"
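If the dict route wins, collapsing it back into that ideal 4-d array is just a stack over an ordered key list. A minimal sketch, with made-up (3, 3, 2) arrays standing in for the real per-person sequences:

```python
import numpy as np

# Made-up stand-ins for the real per-person (3, 3, 2) sequence arrays
all_seqs = {p: np.full((3, 3, 2), i, dtype=float)
            for i, p in enumerate(["Bob", "Alice", "Cap"])}

persons = list(all_seqs)                        # dim 0: ordinal -> person id
arr = np.stack([all_seqs[p] for p in persons])  # (person, sequence, step, feature)
assert arr.shape == (3, 3, 3, 2)
assert np.allclose(arr[persons.index("Bob")], all_seqs["Bob"])
```

Keeping the `persons` list alongside the array preserves the ordinal-to-person mapping that dim 0 needs.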
Just addressing point #2 until further clarification of the desired output:

The current approach can be simplified a small bit:

- .int_range() for the enumeration
- pl.col().list.lengths is deprecated. It has been renamed to len. https://github.com/pola-rs/polars/issues/11577
- It is also possible to pass an expression directly to index_column

Perhaps there is room for a feature/enhancement request to simplify this particular task in Polars? (itertools calls this sliding_window() in its recipe examples, for example.)