How does Polars auto-cache mechanism work on LazyFrames?

234 Views Asked by barak1412 At 29 July 2025 at 04:40

As stated here, Polars introduced an auto-cache mechanism for LazyFrames that occures multiple times in the logical plan, so the user will not have to actively perform the cache.
However, while trying to examine their new mechanism, I encountred scenarios that the auto-cache is not performed optimally:

Without explicit cache:

import polars as pl

df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()

df4 = df1.join(df2, on='id')
print(pl.concat([df4.join(df3, on='id'), df1,
                 df4]).explain())

We get the logical plan:

UNION
  PLAN 0:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      INNER JOIN:
      LEFT PLAN ON: [col("id")]
        CACHE[id: a4bcf9591fefc837, count: 3]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
      RIGHT PLAN ON: [col("id")]
        CACHE[id: 8cee8e3a6f454983, count: 1]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
      END INNER JOIN
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
    END INNER JOIN
  PLAN 1:
    CACHE[id: a4bcf9591fefc837, count: 3]
      DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
  PLAN 2:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      CACHE[id: a4bcf9591fefc837, count: 3]
        DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    RIGHT PLAN ON: [col("id")]
      CACHE[id: 8cee8e3a6f454983, count: 1]
        DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    END INNER JOIN
END UNION

With explicit cache:

import polars as pl

df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()

df4 = df1.join(df2, on='id').cache()
print(pl.concat([df4.join(df3, on='id'), df1,
                 df4]).explain())

We get the logical plan:

UNION
  PLAN 0:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      CACHE[id: 290661b0780, count: 18446744073709551615]
        FAST_PROJECT: [id]
          INNER JOIN:
          LEFT PLAN ON: [col("id")]
            DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
          RIGHT PLAN ON: [col("id")]
            DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
          END INNER JOIN
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
    END INNER JOIN
  PLAN 1:
    DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
  PLAN 2:
    CACHE[id: 290661b0780, count: 18446744073709551615]
      FAST_PROJECT: [id]
        INNER JOIN:
        LEFT PLAN ON: [col("id")]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
        RIGHT PLAN ON: [col("id")]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
        END INNER JOIN
END UNION

You can see, that with the explicit cache, we get more optimal plan because the join of df1 and df2 is performed only once.

Why doesn't Polars auto-cache mechanism detect the repeated usage of join, and apply the cache by itself? What am I missing?

Thanks.

Original Q&A

How does Polars auto-cache mechanism work on LazyFrames?

There are 0 best solutions below

Related Questions in PYTHON

Related Questions in PYTHON-POLARS

Related Questions in RUST-POLARS

Trending Questions

Popular # Hahtags

Popular Questions