difference between datashader and other plotting libraries

1.3k Views Asked by At

I want to understand the clear difference between Datashader and other graphing libraries eg plotly/matplotlib etc.

  1. I understand that in order to plot millions/billions of data points, we need datashader as other plotting libraries will hung up the browser.

  2. But what exactly is the reason which makes datashader fast and does not hung up the browser and how exactly the plotting is done which doesnt put any load on the browser ????

Also, datashader doesnt put any load on browser because in the backend datashader will create a graph on the basis of my dataframe and send only the image to the browser which is why its fast??

Plz explain i am unable to understand the in and out clearly.

2

There are 2 best solutions below

0
On BEST ANSWER

It may be helpful to first think of Datashader not in comparison to Matplotlib or Plotly, but in comparison to numpy.histogram2d. By default, Datashader will turn a long list of (x,y) points into a 2D histogram, just like histogram2d. Doing so only requires a simple increment of a grid cell for each new point, which is easily accellerated to machine-code speeds with Numba and is trivial to parallelize with Dask. The resulting array is then at most the size of your display screen, no matter how big your dataset is. So it's cheap to process in a separate program that adds axes, labels, etc., and it will never crash your browser.

By contrast, a plotting program like Plotly will need to convert each data point into a JSON or other serialized representation, pass that to JavaScript in the browser, have JavaScript draw a shape into a graphics buffer, and make each such shape support hover and other interactive features. Those interactive features are great, but it means Plotly is doing vastly more work per data point than Datashader is, and requires that the browser can hold all those data points. The only computation Datashader needs to do with your full data is to linearly scale the x and y locations of each point to fit the grid, then increment the grid value, which is much easier than what Plotly does.

The comparison to Matplotlib is slightly more complicated, because with an Agg backend, Matplotlib is also pre-rendering to a fixed-size graphics buffer before display (somewhat like Datashader). But Matplotlib was written before Numba and Dask (making it more difficult to speed up), it still has to draw shapes for each point (not just a simple increment), it can't fully parallelize the operations (because later points overwrite earlier ones in Matplotlib), and it provides anti-aliasing and other nice features not available in Datashader. So again Matplotlib is doing a lot more work than Datashader.

But if what you really want to do is see the faithful 2D distribution of billions of data points, Datashader is the way to go, because that's really all it is doing. :-)

0
On

From the datashader docs,

datashader is designed to "rasterize" or "aggregate" datasets into regular grids that can be viewed as images, making it simple and quick to see the properties and patterns of your data. Datashader can plot a billion points in a second or so on a 16GB laptop, and scales up easily to out-of-core or distributed processing for even larger datasets.

There aren't any tricks going on in any of these libraries - rendering a huge number of points takes a long time. What datashader does is to shift the burden of visualization from rendering to computing. There's a very good reason you have to create a canvas before plotting instructions in datashader. The first step in a datashader pipeline is to rasterize a dataset, in other words, it approximates the position of each piece of data and then uses aggregation functions to determine the intensity or color of each pixel. This allows datashader to plot enormous numbers of points; even more points than can be held in memory.

Matplotlib, on the other hand, renders every single point you instruct it to plot, making plotting large datasets time consuming or even impossible.