hvplot taking hours to render image


I'm working with Gaia astrometric data from Data Release 3 and saw hvplot/datashader recommended as the go-to tools for visualizing large datasets, thanks to their very fast render times and interactivity. In every example I've seen, rendering an image from hundreds of millions of data points takes a few seconds at most. However, when I run the same kind of code on my data, it takes hours for any image to render at all.

For context, I'm running this code on a very large research computing cluster with hundreds of gigabytes of RAM, a hundred or so cores, and terabytes of storage at my disposal, so computing power should not be the issue here. Additionally, I've converted the data I need into a series of parquet files that are read into a dask dataframe with glob. My code is as follows:

...

import dask.dataframe as dd
import hvplot.dask
import colorcet as cc  # needed for the cc.fire colormap
import glob

# read every parquet file under myfiles/ into a single dask dataframe
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')
df = df.astype('float32')
df = df[['col1', 'col2']]
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)

...

Does anybody have any ideas about what the issue could be here? Any help would be appreciated.

Edit: I've gotten rendering times below an hour by consolidating the data into a smaller number of larger files (3386 -> 175).
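For reference, a minimal sketch of one way to do that consolidation with dask itself (the output directory 'myfiles_consolidated' is just a placeholder, and 175 matches the partition count above):

import glob
import dask.dataframe as dd

# read the original small files, then repartition into fewer, larger partitions
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')
df = df.repartition(npartitions=175)
df.to_parquet('myfiles_consolidated', engine='fastparquet')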


1 Answer

SultanOrazbayev

Hard to debug without access to the data, but one quick optimization is to avoid loading all the data in the first place: select only the columns of interest at read time:

df=dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet', columns=['col1','col2'])

Unless it's crucial, I'd also avoid the .astype call. It shouldn't be a bottleneck, but the memory savings from float32 may not matter if memory isn't a constraint.
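Putting both suggestions together, a minimal sketch (assuming the same 'myfiles/*' layout and 'col1'/'col2' column names from the question) might look like:

import glob
import dask.dataframe as dd
import hvplot.dask
import colorcet as cc

# read only the two columns needed for the plot, skipping the astype step
df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet', columns=['col1', 'col2'])
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)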