Databricks display() function equivalent or alternative to Jupyter

35k Views

I'm in the process of migrating current Databricks Spark notebooks to Jupyter notebooks. Databricks provides a convenient and beautiful display(data_frame) function to visualize Spark DataFrames and RDDs, but there's no direct equivalent in Jupyter (I'm not sure, but I think it's a Databricks-specific function). I tried:

dataframe.show()

But that's a plain-text rendering, and it breaks when you have many columns, so I'm trying to find an alternative to display() that can render Spark DataFrames better than show() does. Is there any equivalent or alternative to it?

6

There are 6 best solutions below

1
On BEST ANSWER

When you use Jupyter, instead of df.show(), use myDF.limit(10).toPandas().head(). Also, when working with many columns, pandas truncates the view, so set the pandas display option for columns to the max.

# Alternative to the Databricks display function.
import pandas as pd
pd.set_option('display.max_columns', None)  # show all columns

myDF.limit(10).toPandas().head()
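To see why the option matters, here is a minimal pandas-only sketch (the frame and column names are invented for illustration): with a low column limit the repr elides columns with "...", while `display.max_columns = None` shows them all.

```python
import pandas as pd

# A deliberately wide frame: 30 columns named c0..c29.
df = pd.DataFrame([[0] * 30], columns=[f"c{i}" for i in range(30)])

pd.set_option("display.max_columns", 5)     # simulate a truncated view
truncated = repr(df)                        # columns elided with "..."

pd.set_option("display.max_columns", None)  # show every column
full = repr(df)                             # every column name appears
```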

0
On

In recent IPython, you can just use display(df); if df is a pandas DataFrame, it will just work. On older versions you might need a from IPython.display import display first. IPython will also automatically display the result of the last expression of a cell if it is a DataFrame (for example, this notebook). Of course, the representation depends on the library you use to make your DataFrame. If you are using PySpark and it does not define a nice representation by default, then you'll need to teach IPython how to display the Spark DataFrame. For example, here is a project that teaches IPython how to display Spark Contexts and Spark Sessions.
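The mechanism behind this is IPython's rich-display protocol: any object that defines a `_repr_html_` method is rendered as HTML by `display()` and by last-expression output. A minimal sketch (the `Greeting` class is invented for illustration):

```python
# IPython looks for _repr_html_ (among other _repr_*_ hooks) when asked
# to display an object; returning an HTML string is enough to opt in.
class Greeting:
    def __init__(self, name):
        self.name = name

    def _repr_html_(self):
        return f"<b>Hello, {self.name}!</b>"

html = Greeting("Spark")._repr_html_()  # what Jupyter would render
```

This is exactly the hook such projects implement for Spark objects.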

0
On

Try Apache Zeppelin (https://zeppelin.apache.org/). There are some nice standard visualizations of DataFrames, especially if you use the SQL interpreter. There is also support for other useful interpreters.

0
On

Without converting to a pandas DataFrame, use the snippet below. It stops the notebook from wrapping the preformatted output, so df.show() renders in a proper grid.

from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

df.show()
1
On

You can set the config spark.conf.set('spark.sql.repl.eagerEval.enabled', True), available since Spark 2.4. This displays a native PySpark DataFrame without explicitly calling df.show(), and there is no need to convert the DataFrame to pandas either; all you need is to evaluate df as the last expression of a cell.
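Putting it together, a cell might look like this (a sketch; it assumes a live PySpark session, and the `maxNumRows` setting is an optional related knob):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)   # render DataFrames as HTML
spark.conf.set("spark.sql.repl.eagerEval.maxNumRows", 10)  # cap the rows shown

df = spark.range(5)
df  # as the last expression of a cell, this renders as an HTML table
```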

0
On

First Recommendation: When you use Jupyter, don't use df.show(); instead use df.limit(10).toPandas().head(), which gives a clean display, arguably even better than Databricks display().

Second Recommendation: Zeppelin Notebook. Just use z.show(df.limit(10))

Additionally, in Zeppelin:

  1. Register your DataFrame as a SQL table: df.createOrReplaceTempView('tableName')
  2. Insert a new paragraph beginning with %sql, then query your table with a rich display.