plotnine geom_density memory and performance issues


I am running into memory and performance issues when trying to draw a density plot with Python's plotnine.

Consider the dataset below, with 3 variables and 50,000 observations. This is not a large dataset. With n = 50000, the plotnine code below took 15 minutes to run. In contrast, the equivalent R code ran in 0.22 seconds.

With n = 100000, I get the following error in plotnine:

MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64

Again, R was able to execute this in circa 0.2 seconds.
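
For reference, the size in the error message is exactly what a dense n x n array of float64 values needs when n = 100000, which suggests that something is being computed for every pair of observations. A quick check of the arithmetic (the variable names are only illustrative):

```python
n = 100_000
bytes_per_float64 = 8

# A dense (n, n) float64 array stores n * n values of 8 bytes each.
total_bytes = n * n * bytes_per_float64
gib = total_bytes / 2**30  # 1 GiB = 2**30 bytes

print(f"{gib:.1f} GiB")  # 74.5 GiB
```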

Am I mis-specifying the plotnine code, or is this a known problem that will be fixed?

plotnine code:

import numpy as np
import pandas as pd
from plotnine import *

n = 100000

df = pd.DataFrame({
    'age': np.random.choice(range(20,66),n),
    'gender': np.random.choice(range(1,3),n),
    'variable': np.random.lognormal(0,0.5,n),
})

p = (ggplot(df, aes('variable'))
  + theme_light(7)
  + geom_density(alpha=0.5, size=0.35)
)
p
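
For comparison, a density curve can be evaluated on a fixed grid, so that memory grows with the grid size times a chunk size and not with n squared. The sketch below uses NumPy only and is not plotnine's implementation. The function name `grid_density`, the grid size, the chunk size, and the rule-of-thumb bandwidth are all assumptions made for illustration.

```python
import numpy as np

def grid_density(x, gridsize=512, cut=3.0, chunk=10_000):
    """Gaussian kernel density estimate on a fixed grid (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size

    # Rule-of-thumb bandwidth (Silverman): 0.9 * min(sd, IQR / 1.34) * n^(-1/5).
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    bw = 0.9 * spread * n ** (-0.2)

    # Evaluate on a fixed grid that extends `cut` bandwidths past the data range.
    grid = np.linspace(x.min() - cut * bw, x.max() + cut * bw, gridsize)

    dens = np.zeros(gridsize)
    for start in range(0, n, chunk):
        block = x[start:start + chunk]
        # Temporary array has shape (gridsize, len(block)), never (n, n).
        z = (grid[:, None] - block[None, :]) / bw
        dens += np.exp(-0.5 * z ** 2).sum(axis=1)

    # Normalise the summed Gaussian kernels into a density.
    dens /= n * bw * np.sqrt(2.0 * np.pi)
    return grid, dens

rng = np.random.default_rng(0)
x = rng.lognormal(0, 0.5, 100_000)
grid, dens = grid_density(x)
```

The resulting `grid` and `dens` arrays could be put in a DataFrame and drawn with `geom_line` or `geom_area` in place of `geom_density`.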

R code:

library(ggplot2)

n = 100000

df = data.frame(
        # 20:65 gives the same values as range(20,66) in the Python code; seq(20:66) would give 1..47
        age = sample(20:65, n, replace=TRUE),
        gender = sample(1:2, n, replace=TRUE),
        variable = rlnorm(n, meanlog=0, sdlog=0.5)
)

p = ggplot(df, aes(variable)) + 
      theme_light(7) +
      geom_density(alpha=0.5, size=0.35)
p