I'm running Jupyter (v4.2.1) with the Apache Toree PySpark kernel. When I call plotly's init_notebook_mode function, I get the following error:
import numpy as np
import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
Error:
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Traceback (most recent call last):
File "/tmp/kernel-PySpark-6415c581-01c4-4c90-b4d9-81773c2bc03f/pyspark_runner.py", line 134, in <module>
eval(compiled_code)
File "<string>", line 7, in <module>
File "/usr/local/lib/python3.4/dist-packages/plotly/offline/offline.py", line 151, in init_notebook_mode
display(HTML(script_inject))
File "/usr/local/lib/python3.4/dist-packages/IPython/core/display.py", line 158, in display
format = InteractiveShell.instance().display_formatter.format
File "/usr/local/lib/python3.4/dist-packages/traitlets/config/configurable.py", line 412, in instance
inst = cls(*args, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 499, in __init__
self.init_io()
File "/usr/local/lib/python3.4/dist-packages/IPython/core/interactiveshell.py", line 658, in init_io
io.stdout = io.IOStream(sys.stdout)
File "/usr/local/lib/python3.4/dist-packages/IPython/utils/io.py", line 34, in __init__
raise ValueError("fallback required, but not specified")
ValueError: fallback required, but not specified
StackTrace: org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:140)
org.apache.toree.interpreter.broker.BrokerState$$anonfun$markFailure$1.apply(BrokerState.scala:140)
scala.Option.foreach(Option.scala:236)
org.apache.toree.interpreter.broker.BrokerState.markFailure(BrokerState.scala:139)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
py4j.Gateway.invoke(Gateway.java:259)
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
py4j.commands.CallCommand.execute(CallCommand.java:79)
py4j.GatewayConnection.run(GatewayConnection.java:209)
java.lang.Thread.run(Thread.java:745)
I'm unable to find any information about this on the web. When I dug into the code where this fails (io.py in IPython's utils package), I saw that the stream being passed in must have both a write and a flush attribute. For some reason, the stream passed in this case, sys.stdout, has only the write attribute and not flush.
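The check described above can be sketched roughly as follows. This is a simplified reconstruction based on the traceback and the behaviour described in the question, not a copy of IPython's source; the names pick_stream and WriteOnlyStream are made up for illustration.

```python
import io


class WriteOnlyStream:
    """Hypothetical stand-in for a sys.stdout replacement that has no flush()."""

    def write(self, text):
        return len(text)


def pick_stream(stream, fallback=None):
    """Return a usable stream, mimicking the check that fails in the traceback.

    A stream is accepted only if it has both write and flush. Otherwise the
    fallback is used, and if there is no fallback a ValueError is raised.
    """
    if hasattr(stream, "write") and hasattr(stream, "flush"):
        return stream
    if fallback is not None:
        return fallback
    raise ValueError("fallback required, but not specified")


if __name__ == "__main__":
    try:
        pick_stream(WriteOnlyStream())
    except ValueError as exc:
        print("rejected:", exc)
```

A stream with only write and no fallback reproduces the same message as in the traceback, which suggests that the sys.stdout installed by the kernel's Python runner is a custom object rather than a regular file object.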
I believe this happens because plotly's notebook mode assumes that it is running inside an IPython Jupyter kernel that handles the notebook communication; you can see in the stack trace that it is trying to call into IPython packages.
Toree, however, is a different Jupyter kernel and has its own protocol handling for communicating with the notebook server. Even when you use Toree to run a PySpark interpreter, you get a "plain" PySpark (just like when you start it from a shell), and Toree drives the input/output of that interpreter.
So the IPython machinery is not set up, and calling init_notebook_mode() in that environment fails, just as it would if you ran it in a PySpark started directly from the shell, which knows nothing about notebooks.
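One way to see this from inside the interpreter is to ask IPython whether a shell has been initialized before calling plotly. The sketch below is only a diagnostic, not a fix; the helper names ipython_shell_active and safe_init_notebook_mode are made up here, and the sketch assumes that get_ipython() returns None when no IPython shell has been set up.

```python
def ipython_shell_active():
    """Return True only if an IPython shell has already been initialized."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed at all.
        return False
    # get_ipython() returns None when no InteractiveShell instance exists,
    # which is the situation described above for a plain PySpark interpreter.
    return get_ipython() is not None


def safe_init_notebook_mode():
    """Call plotly's init_notebook_mode() only when IPython is driving the session."""
    if not ipython_shell_active():
        print("No IPython shell found; skipping init_notebook_mode()")
        return False
    from plotly.offline import init_notebook_mode
    init_notebook_mode()
    return True
```

In a regular IPython kernel, ipython_shell_active() should return True. In a session where the IPython machinery is not set up, it should return False, so the guarded call is skipped instead of raising the ValueError.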
To my knowledge, there is currently no way to get plotting output from a PySpark session run via Toree; we recently faced the same problem. Instead of running Python via Toree, you need to run an IPython kernel, import the PySpark libraries there, and connect to your Spark cluster. See https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook for a dockerized example of how to do that.
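A minimal sketch of that setup, to be run in a notebook backed by the regular IPython kernel, might look like the following. The master URL and application name are placeholders rather than values from the question, and the helper make_spark_context is hypothetical; it assumes the pyspark package is importable from the kernel's Python environment.

```python
def make_spark_context(master="local[*]", app_name="plotly-demo"):
    """Create (or reuse) a SparkContext from inside an IPython kernel.

    Both arguments are placeholders; replace master with the URL of your
    own cluster, for example a spark://host:port address.
    """
    # Imported here so that defining the helper does not require pyspark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master).setAppName(app_name)
    return SparkContext.getOrCreate(conf)


# Usage in a notebook cell (IPython kernel, not Toree):
#
#   from plotly.offline import init_notebook_mode
#   init_notebook_mode()
#   sc = make_spark_context()
```

Because the notebook is then driven by IPython, init_notebook_mode() finds the display machinery it expects, while Spark is used as an ordinary library from the same process.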