I want to understand the clear difference between Datashader and other graphing libraries, e.g. Plotly/Matplotlib.
I understand that to plot millions or billions of data points we need Datashader, because other plotting libraries will hang the browser.
But what exactly makes Datashader fast and keeps it from hanging the browser, and how exactly is the plotting done so that it puts no load on the browser?
Also, is it fast because Datashader builds the graph from my dataframe in the backend and sends only the image to the browser?
Please explain; I am unable to understand the ins and outs clearly.
It may be helpful to first think of Datashader not in comparison to Matplotlib or Plotly, but in comparison to `numpy.histogram2d`. By default, Datashader will turn a long list of (x,y) points into a 2D histogram, just like histogram2d. Doing so only requires a simple increment of a grid cell for each new point, which is easily accelerated to machine-code speeds with Numba and is trivial to parallelize with Dask. The resulting array is then at most the size of your display screen, no matter how big your dataset is. So it's cheap to process in a separate program that adds axes, labels, etc., and it will never crash your browser.
By contrast, a plotting program like Plotly will need to convert each data point into a JSON or other serialized representation, pass that to JavaScript in the browser, have JavaScript draw a shape into a graphics buffer, and make each such shape support hover and other interactive features. Those interactive features are great, but it means Plotly is doing vastly more work per data point than Datashader is, and requires that the browser can hold all those data points. The only computation Datashader needs to do with your full data is to linearly scale the x and y locations of each point to fit the grid, then increment the grid value, which is much easier than what Plotly does.
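
To make the "scale, then increment" step concrete, here is a minimal sketch in plain NumPy. It is not Datashader's actual implementation (which, as noted, relies on Numba and Dask); the function name `aggregate_counts`, the ranges, and the 400x300 grid size are assumptions chosen purely for illustration.

```python
import numpy as np

def aggregate_counts(x, y, x_range, y_range, width, height):
    """Count how many (x, y) points land in each cell of a height x width grid.

    Hypothetical helper for illustration only, not part of Datashader's API.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    (x0, x1), (y0, y1) = x_range, y_range

    # Ignore points that fall outside the visible range.
    inside = (x >= x0) & (x < x1) & (y >= y0) & (y < y1)

    # Linearly scale each coordinate to a column/row index.
    cols = ((x[inside] - x0) / (x1 - x0) * width).astype(np.int64)
    rows = ((y[inside] - y0) / (y1 - y0) * height).astype(np.int64)

    # Guard against floating-point rounding at the upper edge.
    cols = np.minimum(cols, width - 1)
    rows = np.minimum(rows, height - 1)

    # Increment one grid cell per point (np.add.at handles repeated indices).
    grid = np.zeros((height, width), dtype=np.int64)
    np.add.at(grid, (rows, cols), 1)
    return grid

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)

# One million points in, a fixed 300 x 400 array out.
grid = aggregate_counts(x, y, (-4, 4), (-4, 4), width=400, height=300)
print(grid.shape, grid.sum())
```

The output array has the same size whether the input holds a thousand points or a billion, which is why whatever displays it afterwards has so little work to do.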
The comparison to Matplotlib is slightly more complicated, because with an Agg backend, Matplotlib is also pre-rendering to a fixed-size graphics buffer before display (somewhat like Datashader). But Matplotlib was written before Numba and Dask (making it more difficult to speed up), it still has to draw shapes for each point (not just a simple increment), it can't fully parallelize the operations (because later points overwrite earlier ones in Matplotlib), and it provides anti-aliasing and other nice features not available in Datashader. So again Matplotlib is doing a lot more work than Datashader.
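
The parallelization point can be checked with a small sketch, again using `numpy.histogram2d` as a stand-in rather than Datashader itself. Because counting is just addition, the order of the points does not matter, and per-chunk grids can be computed independently and summed. The chunk count of 8, the 64x64 grid, and the helper name `count_grid` are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.uniform(0.0, 1.0, size=(100_000, 2))
edges = np.linspace(0.0, 1.0, 65)  # 64 bins along each axis

def count_grid(chunk):
    """2D histogram of one chunk of points (hypothetical helper)."""
    counts, _, _ = np.histogram2d(chunk[:, 0], chunk[:, 1], bins=[edges, edges])
    return counts

# All points at once.
whole = count_grid(points)

# The same points split into 8 chunks, each counted separately, then summed.
# Each chunk could be handled by a different worker.
partial_sum = sum(count_grid(chunk) for chunk in np.array_split(points, 8))

# The same points in a different order.
shuffled = count_grid(rng.permutation(points))

print(np.array_equal(whole, partial_sum), np.array_equal(whole, shuffled))
```

A renderer that paints overlapping shapes cannot be split up this freely, because the final pixel depends on which shape was drawn last.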
But if what you really want to do is see the faithful 2D distribution of billions of data points, Datashader is the way to go, because that's really all it is doing. :-)