How does Polars auto-cache mechanism work on LazyFrames?

240 Views Asked by At

As stated here, Polars introduced an auto-cache mechanism for LazyFrames that occures multiple times in the logical plan, so the user will not have to actively perform the cache.
However, while trying to examine their new mechanism, I encountred scenarios that the auto-cache is not performed optimally:

Without explicit cache:

import polars as pl

df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()

df4 = df1.join(df2, on='id')
print(pl.concat([df4.join(df3, on='id'), df1,
                 df4]).explain())

We get the logical plan:

UNION
  PLAN 0:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      INNER JOIN:
      LEFT PLAN ON: [col("id")]
        CACHE[id: a4bcf9591fefc837, count: 3]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
      RIGHT PLAN ON: [col("id")]
        CACHE[id: 8cee8e3a6f454983, count: 1]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
      END INNER JOIN
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
    END INNER JOIN
  PLAN 1:
    CACHE[id: a4bcf9591fefc837, count: 3]
      DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
  PLAN 2:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      CACHE[id: a4bcf9591fefc837, count: 3]
        DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    RIGHT PLAN ON: [col("id")]
      CACHE[id: 8cee8e3a6f454983, count: 1]
        DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
    END INNER JOIN
END UNION

With explicit cache:

import polars as pl

df1 = pl.DataFrame({'id': [0,5,6]}).lazy()
df2 = pl.DataFrame({'id': [0,8,6]}).lazy()
df3 = pl.DataFrame({'id': [7,8,6]}).lazy()

df4 = df1.join(df2, on='id').cache()
print(pl.concat([df4.join(df3, on='id'), df1,
                 df4]).explain())

We get the logical plan:

UNION
  PLAN 0:
    INNER JOIN:
    LEFT PLAN ON: [col("id")]
      CACHE[id: 290661b0780, count: 18446744073709551615]
        FAST_PROJECT: [id]
          INNER JOIN:
          LEFT PLAN ON: [col("id")]
            DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
          RIGHT PLAN ON: [col("id")]
            DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
          END INNER JOIN
    RIGHT PLAN ON: [col("id")]
      DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
    END INNER JOIN
  PLAN 1:
    DF ["id"]; PROJECT */1 COLUMNS; SELECTION: "None"
  PLAN 2:
    CACHE[id: 290661b0780, count: 18446744073709551615]
      FAST_PROJECT: [id]
        INNER JOIN:
        LEFT PLAN ON: [col("id")]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
        RIGHT PLAN ON: [col("id")]
          DF ["id"]; PROJECT 1/1 COLUMNS; SELECTION: "None"
        END INNER JOIN
END UNION

You can see, that with the explicit cache, we get more optimal plan because the join of df1 and df2 is performed only once.

Why doesn't Polars auto-cache mechanism detect the repeated usage of join, and apply the cache by itself? What am I missing?

Thanks.

0

There are 0 best solutions below