Python polars library: problem with the shift() function

68 Views Asked by At

I have this script on polars...

import polars as pl
from faker import Faker

fake = Faker(locale='fr_FR')

def my_func():
    operation = fake.unique.pystr_format(
        string_format='??###', 
        letters='abcdefghijklmnopqrstuvwxyz').upper()
    
    return operation

my_df = pl.DataFrame({
    "Date":[date(2023, 1, 2), date(2023, 1, 2), date(2023, 1, 2),
            date(2023, 3, 5), date(2023, 3, 17), date(2023, 3, 17), 
            date(2023, 3, 17), date(2023, 3, 17), date(2023, 5, 1),
            date(2023, 5, 17), date(2023, 5, 17), date(2023, 5, 17)],
    "Invoice":['001', '001', '001', '002', 
               '003', '003', '003', '003',
               '004', '005', '005', '005'],
    "Amount":[250, 250, 250, 300, 
              450, 450, 450, 450, 
              500, 650, 650, 650],
    'Operation':['', '', '', '', 
                 '', '', '', '', 
                 '', '', '', '',]
})

my_df.with_columns(
    pl.when(pl.col('Invoice') != pl.col('Invoice').shift())
    .then(pl.col('Operation').map_elements(lambda x : my_func()))
    .otherwise(pl.col('Operation').shift()))

It gives the following result:

Date Invoice Amount Operation
2023-01-02 00:00:00 001 250
2023-01-02 00:00:00 001 250
2023-01-02 00:00:00 001 250
2023-03-05 00:00:00 002 300 WV068
2023-03-17 00:00:00 003 450 KL429
2023-03-17 00:00:00 003 450
2023-03-17 00:00:00 003 450
2023-03-17 00:00:00 003 450
2023-05-01 00:00:00 004 500 VF254
2023-05-17 00:00:00 005 650 NT977
2023-05-17 00:00:00 005 650
2023-05-17 00:00:00 005 650

I have a problem with the shift() function -> I can't get this result:

Date Invoice Amount Operation
2023-01-02 00:00:00 001 250
2023-01-02 00:00:00 001 250
2023-01-02 00:00:00 001 250
2023-03-05 00:00:00 002 300 WV068
2023-03-17 00:00:00 003 450 KL429
2023-03-17 00:00:00 003 450 KL429
2023-03-17 00:00:00 003 450 KL429
2023-03-17 00:00:00 003 450 KL429
2023-05-01 00:00:00 004 500 VF254
2023-05-17 00:00:00 005 650 NT977
2023-05-17 00:00:00 005 650 NT977
2023-05-17 00:00:00 005 650 NT977

Can you please help me?

Thank you in advance

1

There are 1 best solutions below

0
jqurious On

It looks like you may want .forward_fill() instead of .shift()

pl.Config(tbl_rows=12)

df.with_columns(
    pl.when(pl.col('Invoice') != pl.col('Invoice').shift())
      .then(pl.col('Operation').map_elements(lambda x : my_func()))
      .forward_fill()
)
shape: (12, 4)
┌────────────┬─────────┬────────┬───────────┐
│ Date       ┆ Invoice ┆ Amount ┆ Operation │
│ ---        ┆ ---     ┆ ---    ┆ ---       │
│ date       ┆ str     ┆ i64    ┆ str       │
╞════════════╪═════════╪════════╪═══════════╡
│ 2023-01-02 ┆ 001     ┆ 250    ┆ null      │
│ 2023-01-02 ┆ 001     ┆ 250    ┆ null      │
│ 2023-01-02 ┆ 001     ┆ 250    ┆ null      │
│ 2023-03-05 ┆ 002     ┆ 300    ┆ RV887     │
│ 2023-03-17 ┆ 003     ┆ 450    ┆ SY703     │
│ 2023-03-17 ┆ 003     ┆ 450    ┆ SY703     │
│ 2023-03-17 ┆ 003     ┆ 450    ┆ SY703     │
│ 2023-03-17 ┆ 003     ┆ 450    ┆ SY703     │
│ 2023-05-01 ┆ 004     ┆ 500    ┆ AT019     │
│ 2023-05-17 ┆ 005     ┆ 650    ┆ QF270     │
│ 2023-05-17 ┆ 005     ┆ 650    ┆ QF270     │
│ 2023-05-17 ┆ 005     ┆ 650    ┆ QF270     │
└────────────┴─────────┴────────┴───────────┘

As for the actual "task", it looks like you're trying to group consecutive "runs" of the Invoice column.

Polars has a specific function for that: .rle_id()

df.with_columns(
   pl.col("Operation").map_elements(lambda x: my_func())
     .over(pl.col("Invoice").rle_id())
)
shape: (12, 4)
┌────────────┬─────────┬────────┬───────────┐
│ Date       ┆ Invoice ┆ Amount ┆ Operation │
│ ---        ┆ ---     ┆ ---    ┆ ---       │
│ date       ┆ str     ┆ i64    ┆ str       │
╞════════════╪═════════╪════════╪═══════════╡
│ 2023-01-02 ┆ 001     ┆ 250    ┆ IP153     │
│ 2023-01-02 ┆ 001     ┆ 250    ┆ IP153     │
│ 2023-01-02 ┆ 001     ┆ 250    ┆ IP153     │
│ 2023-03-05 ┆ 002     ┆ 300    ┆ IO697     │
│ 2023-03-17 ┆ 003     ┆ 450    ┆ WR538     │
│ 2023-03-17 ┆ 003     ┆ 450    ┆ WR538     │
│ 2023-03-17 ┆ 003     ┆ 450    ┆ WR538     │
│ 2023-03-17 ┆ 003     ┆ 450    ┆ WR538     │
│ 2023-05-01 ┆ 004     ┆ 500    ┆ OC371     │
│ 2023-05-17 ┆ 005     ┆ 650    ┆ EI222     │
│ 2023-05-17 ┆ 005     ┆ 650    ┆ EI222     │
│ 2023-05-17 ┆ 005     ┆ 650    ┆ EI222     │
└────────────┴─────────┴────────┴───────────┘