Petastorm with Databricks Connect failing

90 Views Asked by At

Using Azure Databricks. I have petastorm==0.11.2 and databricks-connect==9.1.0

My databricks-connect session seems to be working I'm able to read in data into my remote workspace. But when I use petastorm to create a spark converter object it says unable to infer schema, even though if take the object I'm passing it and check its .schema attribute it shows me a schema just fine.

The exact same code works within the databricks workspace in the notebooks. But doesn't work when I'm on a separate VM using DBConnect to read in the data.

I think the issue is around setting this configuration: SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF. When in the local databricks workspace using the value 'file:///tmp/petastorm/cache/' works fine. When using databricks-connect it supposedly builds a spark context that's linked to the cluster and otherwise for read and write paths behaves fine.

Any ideas?

0

There are 0 best solutions below