I'd like to give data scientists a friendly, exploratory environment for early iteration on Beam data processing pipelines.
Jupyter notebooks seem like a great environment to do this.
However, I'm running into the problem that SQL commands are very slow to execute from Jupyter.
E.g. when using beam_sql, even the simplest query, one that only returns a few constants, takes more than 60 seconds to execute:
%%beam_sql -o pcoll
SELECT CAST(1 AS INT) AS `id`, CAST('foo' AS VARCHAR) AS `str`, CAST(3.14 AS DOUBLE) AS `flt`
My understanding is that this is so slow because the beam_sql magic goes through Beam's cross-language SqlTransform, which launches a separate Java expansion service process in the background for each SQL statement. Is that right?
Is there maybe some way to speed things up?
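To make the question concrete, here's the direction I've been considering, a sketch only: I'm assuming the beam_sql magic is a wrapper around SqlTransform, and I haven't verified that constructing the expansion service handle myself actually avoids any repeated startup cost.

import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.transforms.external import BeamJarExpansionService
from apache_beam.transforms.sql import SqlTransform

# The Gradle target below names the SQL expansion service JAR that
# SqlTransform starts by default; constructing the handle once up front
# is my (unverified) attempt to pay the JAR-resolution cost only once.
expansion_service = BeamJarExpansionService(
    ':sdks:java:extensions:sql:expansion-service:shadowJar')

p = beam.Pipeline(InteractiveRunner())

# A schema-aware PCollection to query; SQL sees it as PCOLLECTION.
rows = p | beam.Create([beam.Row(id=1, str='foo', flt=3.14)])

pcoll = rows | SqlTransform(
    'SELECT `id`, `str`, `flt` FROM PCOLLECTION',
    expansion_service=expansion_service)

ib.collect(pcoll)

Even then, I don't know whether the JVM itself can be kept alive between notebook cells, which is really the crux of my question.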
Or should I explore Pandas-style DataFrames (Beam's DataFrame API) rather than SQL, given that my goal is to provide a friendlier user experience to Python users in Jupyter notebooks?
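For comparison, this is the kind of DataFrame flow I'd offer instead; it stays entirely in the Python SDK, so no cross-language JVM should be involved (again just a sketch, assuming to_dataframe and ib.collect behave as I expect under the interactive runner):

import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.dataframe.convert import to_dataframe
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner

p = beam.Pipeline(InteractiveRunner())

# beam.Row gives the PCollection named, typed fields (a schema),
# which is what to_dataframe needs.
rows = p | beam.Create([beam.Row(id=1, str='foo', flt=3.14)])

# A deferred, Pandas-like DataFrame backed by the PCollection.
df = to_dataframe(rows)
filtered = df[df.flt > 3]

# Materializes the result as a real Pandas DataFrame in the notebook.
ib.collect(filtered)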
P.S.: the timing comes from following the "Develop Apache Beam notebooks with the interactive runner" guide, so Jupyter runs on a GCP Dataflow Workbench.