Polars asof join on next available date

81 Views Asked by At

I have a frame (events) which I want to join into another frame (fr), joining on Date and Symbol. There aren't necessarily any date overlaps. The date in events would match with the first occurrence only on the same or later date in fr, so if the event date is 2010-12-01, it would join on the same date or if not present then the next available date (2010-12-02).

I've tried to do this using search_sorted and join_asof but I'd like to group by the Symbol column and also this isn't a proper join. This somewhat works for a single Symbol only.

fr = pl.DataFrame(
    {
        'Symbol': ['A']*5,
        'Date': ['2010-08-29', '2010-09-01', '2010-09-05',
                 '2010-11-30', '2010-12-02'],
    }
).with_columns(pl.col('Date').str.strptime(pl.Date, '%Y-%m-%d')).with_row_index().set_sorted("Date")

events = pl.DataFrame(
    {
         'Symbol': ['A']*3,
         'Earnings_Date': ['2010-06-01', '2010-09-01', '2010-12-01'],
         'Event': [1, 4, 7],
     }
).with_columns(pl.col('Earnings_Date').str.strptime(pl.Date, '%Y-%m-%d')).set_sorted("Earnings_Date")

idx = fr["Date"].search_sorted(events["Earnings_Date"], "left")

fr = fr.with_columns(
    pl.when(
        pl.col("index").is_in(idx)
    )
    .then(True)
    .otherwise(False)
    .alias("Earnings")
)

fr = fr.join_asof(events, by="Symbol", left_on="Date", right_on="Earnings_Date")
fr = fr.with_columns(
    pl.when(
       pl.col("Earnings") == True
    )
    .then(pl.col("Event"))
    .otherwise(False)
    .alias("Event")
)
2

There are 2 best solutions below

3
Hericks On BEST ANSWER

It sounds like you are on the right track using pl.DataFrame.join_asof. To group by the symbol the by parameter can be used.

(
    fr
    .join_asof(
        events,
        left_on="Date", right_on="Earnings_Date",
        by="Symbol",
    )
)
shape: (5, 5)
┌───────┬────────┬────────────┬───────────────┬───────┐
│ index ┆ Symbol ┆ Date       ┆ Earnings_Date ┆ Event │
│ ---   ┆ ---    ┆ ---        ┆ ---           ┆ ---   │
│ u32   ┆ str    ┆ date       ┆ date          ┆ i64   │
╞═══════╪════════╪════════════╪═══════════════╪═══════╡
│ 0     ┆ A      ┆ 2010-08-29 ┆ 2010-06-01    ┆ 1     │
│ 1     ┆ A      ┆ 2010-09-01 ┆ 2010-09-01    ┆ 4     │
│ 2     ┆ A      ┆ 2010-09-05 ┆ 2010-09-01    ┆ 4     │
│ 3     ┆ A      ┆ 2010-11-30 ┆ 2010-09-01    ┆ 4     │
│ 4     ┆ A      ┆ 2010-12-02 ┆ 2010-12-01    ┆ 7     │
└───────┴────────┴────────────┴───────────────┴───────┘

Now, I understand that you'd like each event to be matched at most once. I don't believe this is possible with join_asof alone. However, we can set all event rows that equal the previous row to Null. For this, an pl.when().then() construct can be used.

(
    fr
    .join_asof(
        events,
        left_on="Date", right_on="Earnings_Date",
        by="Symbol",
    )
    .with_columns(
        pl.when(
            pl.col("Earnings_Date", "Event").is_first_distinct()
        ).then(
            pl.col("Earnings_Date", "Event")
        ).over("Symbol")
    )
)
shape: (5, 5)
┌───────┬────────┬────────────┬───────────────┬───────┐
│ index ┆ Symbol ┆ Date       ┆ Earnings_Date ┆ Event │
│ ---   ┆ ---    ┆ ---        ┆ ---           ┆ ---   │
│ u32   ┆ str    ┆ date       ┆ date          ┆ i64   │
╞═══════╪════════╪════════════╪═══════════════╪═══════╡
│ 0     ┆ A      ┆ 2010-08-29 ┆ 2010-06-01    ┆ 1     │
│ 1     ┆ A      ┆ 2010-09-01 ┆ 2010-09-01    ┆ 4     │
│ 2     ┆ A      ┆ 2010-09-05 ┆ null          ┆ null  │
│ 3     ┆ A      ┆ 2010-11-30 ┆ null          ┆ null  │
│ 4     ┆ A      ┆ 2010-12-02 ┆ 2010-12-01    ┆ 7     │
└───────┴────────┴────────────┴───────────────┴───────┘
1
ouroboros1 On

Problem

  • Merge fr and events based on a match between events['Earnings_Date'] and the exact or next date in fr['Date'] (so: 'looking backward' from fr's point of view), grouped by column 'Symbol'. Keep only the first match.

Setup

Let's add 2 symbols (one expecting a match, one expecting zero). Also, it is perhaps easier to use pl.Expr.str.to_date instead of your pl.Expr.str.strptime.

  • fr
import polars as pl

fr = (
    pl.DataFrame(
        {
            'Symbol': ['A', 'A', 'A', 'B', 'B', 'C'],
            'Date': ['2010-08-29', '2010-09-01', '2010-11-30', 
                     '2010-09-05', '2010-12-01', '2010-09-01'],
        }
    )
    .with_columns(pl.col('Date').str.to_date('%Y-%m-%d'))
    .set_sorted(('Symbol', 'Date'))
)

fr

shape: (6, 2)
┌────────┬────────────┐
│ Symbol ┆ Date       │
│ ---    ┆ ---        │
│ str    ┆ date       │
╞════════╪════════════╡
│ A      ┆ 2010-08-29 │
│ A      ┆ 2010-09-01 │
│ A      ┆ 2010-11-30 │
│ B      ┆ 2010-09-05 │
│ B      ┆ 2010-12-01 │
│ C      ┆ 2010-09-01 │
└────────┴────────────┘
  • events
events = (
    pl.DataFrame(
        {
            'Symbol': ['A', 'A', 'B'],
            'Earnings_Date': ['2010-06-01', '2010-09-01', '2010-12-01'],
            'Event': [1, 4, 7],
        }
    )
    .with_columns(pl.col('Earnings_Date').str.to_date('%Y-%m-%d'))
    .set_sorted(('Symbol', 'Earnings_Date'))
)

events

shape: (3, 3)
┌────────┬───────────────┬───────┐
│ Symbol ┆ Earnings_Date ┆ Event │
│ ---    ┆ ---           ┆ ---   │
│ str    ┆ date          ┆ i64   │
╞════════╪═══════════════╪═══════╡
│ A      ┆ 2010-06-01    ┆ 1     │
│ A      ┆ 2010-09-01    ┆ 4     │
│ B      ┆ 2010-12-01    ┆ 7     │
└────────┴───────────────┴───────┘

Code

fr = (
    fr.join_asof(
        events,
        left_on='Date',
        right_on='Earnings_Date',
        by='Symbol'
    )
    .with_columns(
        pl.when(pl.struct('Symbol', 'Earnings_Date').is_first_distinct())
        .then(pl.col('Earnings_Date', 'Event'))
    )
)

fr

shape: (6, 4)
┌────────┬────────────┬───────────────┬───────┐
│ Symbol ┆ Date       ┆ Earnings_Date ┆ Event │
│ ---    ┆ ---        ┆ ---           ┆ ---   │
│ str    ┆ date       ┆ date          ┆ i64   │
╞════════╪════════════╪═══════════════╪═══════╡
│ A      ┆ 2010-08-29 ┆ 2010-06-01    ┆ 1     │
│ A      ┆ 2010-09-01 ┆ 2010-09-01    ┆ 4     │
│ A      ┆ 2010-11-30 ┆ null          ┆ null  │
│ B      ┆ 2010-09-05 ┆ null          ┆ null  │
│ B      ┆ 2010-12-01 ┆ 2010-12-01    ┆ 7     │
│ C      ┆ 2010-09-01 ┆ null          ┆ null  │
└────────┴────────────┴───────────────┴───────┘

Explanation