I'm working with Gaia astrometric data from Data Release 3 and saw hvplot/datashader recommended as the go-to for visualizing large datasets, thanks to very fast render times and interactivity. Every example I've seen renders an image from hundreds of millions of data points in a few seconds on the slow end. However, when I apply the same code to my data, it takes hours for any image to render at all.
For context, I'm running this code on a very large research computing cluster with hundreds of gigabytes of RAM, a hundred or so cores, and terabytes of storage at my disposal, so computing power should not be an issue here. Additionally, I've converted the data I need into a series of parquet files that are read into a Dask dataframe with glob. My code is as follows:
```python
import dask.dataframe as dd
import hvplot.dask
import glob
import colorcet as cc  # needed for cc.fire below

df = dd.read_parquet(glob.glob(r'myfiles/*'), engine='fastparquet')
df = df.astype('float32')
df = df[['col1', 'col2']]
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)
```
Does anybody have any ideas about what the issue could be here? Any help would be appreciated.
Edit: I've gotten the rendering time below an hour by consolidating the data into a smaller number of larger files (3386 -> 175).
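For reference, a minimal sketch of that kind of consolidation with Dask itself (the paths and the target partition count here are illustrative, not my exact setup):

```python
import dask.dataframe as dd

# Read the many small parquet files and rewrite them as fewer,
# larger partitions; each output partition becomes one file.
df = dd.read_parquet('myfiles/*', engine='fastparquet')
df = df.repartition(npartitions=175)
df.to_parquet('myfiles_consolidated/', engine='fastparquet')
```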
Hard to debug without access to the data, but one quick optimization is to avoid loading all the data in the first place by selecting only the specific columns of interest:
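Something like this, reusing the column names from your snippet (a sketch, untested against your data):

```python
import dask.dataframe as dd
import hvplot.dask
import colorcet as cc

# Passing columns= means only 'col1' and 'col2' are read from disk,
# instead of loading every column and subsetting afterwards.
df = dd.read_parquet('myfiles/*', engine='fastparquet',
                     columns=['col1', 'col2'])
df.hvplot.scatter(x='col1', y='col2', rasterize=True, cmap=cc.fire)
```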
Unless it's crucial, I'd also avoid the `.astype('float32')` call. It shouldn't be a bottleneck, but the gains from `float32` might not be relevant if memory isn't a constraint.