I am running into memory and performance issues when trying to draw a density plot with Python's plotnine.
Consider the dataset below, with 3 variables and 50,000 observations, which is not large. With n = 50000, the plotnine code below took 15 minutes to run. In contrast, the equivalent R code ran in 0.22 seconds.
With n = 100000, I get the following error in plotnine:
MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64
Again, R was able to execute this in circa 0.2 seconds.
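The size reported in the error is what a dense n-by-n float64 array would need, which suggests (this is an inference from the message, not something confirmed against plotnine's source) that the density is being evaluated pairwise over all observations. A quick check of the arithmetic, where `dense_matrix_gib` is a hypothetical helper name used only for illustration:

```python
def dense_matrix_gib(n, itemsize=8):
    """Memory in GiB for a dense n-by-n array (itemsize=8 bytes for float64)."""
    return n * n * itemsize / 2**30


for n in (50_000, 100_000):
    print(f"n = {n}: {dense_matrix_gib(n):.1f} GiB")
# n = 50000: 18.6 GiB
# n = 100000: 74.5 GiB
```

The n = 100000 figure matches the 74.5 GiB in the MemoryError, and the quadratic growth would also be consistent with the long runtime at n = 50000.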
Am I mis-specifying the plotnine code, or is this a known problem that will be fixed?
plotnine code:
import numpy as np
import pandas as pd
from plotnine import *
n = 100000
df = pd.DataFrame({
    'age': np.random.choice(range(20, 66), n),
    'gender': np.random.choice(range(1, 3), n),
    'variable': np.random.lognormal(0, 0.5, n),
})
p = (ggplot(df, aes('variable'))
     + theme_light(7)
     + geom_density(alpha=0.5, size=0.35)
)
p
R code:
library(ggplot2)
n = 100000
df = data.frame(
  # 20:65 matches Python's range(20, 66); seq(20:66) would give 1..47 instead
  age = sample(20:65, n, replace=TRUE),
  gender = sample(1:2, n, replace=TRUE),
  variable = rlnorm(n, meanlog=0, sdlog=0.5)
)
p = ggplot(df, aes(variable)) +
  theme_light(7) +
  geom_density(alpha=0.5, size=0.35)
p
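For comparison, a possible stopgap on the Python side is to evaluate the density on a fixed grid outside plotnine, so that memory grows with n rather than n squared. The following is a minimal sketch under stated assumptions, not plotnine's own method: `grid_kde` is a hypothetical helper, the kernel is Gaussian, the 512-point grid and the 3-bandwidth extension mirror the documented defaults of R's `density()`, and the bandwidth uses the rule-of-thumb formula documented for R's `bw.nrd0`. It is not guaranteed to reproduce R's curve exactly.

```python
import numpy as np


def grid_kde(values, n_grid=512, cut=3.0):
    """Gaussian KDE evaluated on a fixed grid (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    # Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5).
    # Assumes the data are not constant, otherwise the bandwidth is zero.
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    bw = 0.9 * spread * n ** (-0.2)
    # Extend the grid `cut` bandwidths beyond the data range.
    grid = np.linspace(x.min() - cut * bw, x.max() + cut * bw, n_grid)
    density = np.empty(n_grid)
    norm = n * bw * np.sqrt(2.0 * np.pi)
    # One length-n pass per grid point: O(n_grid * n) time, O(n) extra memory.
    for i, g in enumerate(grid):
        z = (x - g) / bw
        density[i] = np.exp(-0.5 * z * z).sum() / norm
    return grid, density


rng = np.random.default_rng(0)
sample = rng.lognormal(0, 0.5, 100_000)
grid, density = grid_kde(sample)
```

The two arrays can then be placed in a small data frame (for example with columns `x` and `density`) and drawn with `geom_area` or `geom_line` instead of `geom_density`, so plotnine only has to handle 512 rows.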