I have a frame (events) which I want to join into another frame (fr), joining on Date and Symbol. There aren't necessarily any date overlaps. The date in events would match with the first occurrence only on the same or later date in fr, so if the event date is 2010-12-01, it would join on the same date or if not present then the next available date (2010-12-02).
I've tried to do this using search_sorted and join_asof but I'd like to group by the Symbol column and also this isn't a proper join. This somewhat works for a single Symbol only.
fr = pl.DataFrame(
{
'Symbol': ['A']*5,
'Date': ['2010-08-29', '2010-09-01', '2010-09-05',
'2010-11-30', '2010-12-02'],
}
).with_columns(pl.col('Date').str.strptime(pl.Date, '%Y-%m-%d')).with_row_index().set_sorted("Date")
events = pl.DataFrame(
{
'Symbol': ['A']*3,
'Earnings_Date': ['2010-06-01', '2010-09-01', '2010-12-01'],
'Event': [1, 4, 7],
}
).with_columns(pl.col('Earnings_Date').str.strptime(pl.Date, '%Y-%m-%d')).set_sorted("Earnings_Date")
idx = fr["Date"].search_sorted(events["Earnings_Date"], "left")
fr = fr.with_columns(
pl.when(
pl.col("index").is_in(idx)
)
.then(True)
.otherwise(False)
.alias("Earnings")
)
fr = fr.join_asof(events, by="Symbol", left_on="Date", right_on="Earnings_Date")
fr = fr.with_columns(
pl.when(
pl.col("Earnings") == True
)
.then(pl.col("Event"))
.otherwise(False)
.alias("Event")
)
It sounds like you are on the right track using
pl.DataFrame.join_asof. To group by the symbol thebyparameter can be used.Now, I understand that you'd like each event to be matched at most once. I don't believe this is possible with
join_asofalone. However, we can set all event rows that equal the previous row toNull. For this, anpl.when().then()construct can be used.