How to create a make_batch_reader object from the petastorm library in Databricks?


I have data saved in parquet format. Petastorm is a library I am using to obtain batches of data for training.

Though I was able to do this on my local system, the same code is not working in Databricks.

Code I used in my local system

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# create an iterator object train_reader. num_epochs is the number of epochs
# for which we want to train our model
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)  # model is a Keras model built earlier

Code I used in Databricks

with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4,shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
  

    for ele in train_ds:
        tensor = tf.reshape(ele,(2,1,15))
        model.fit(tensor,tensor)

The error I am getting from the Databricks code is:

TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'

I have checked the documentation, but couldn't find any argument that goes by the name 'instance' or 'token'. However, for a similar petastorm method, make_reader, I see the following code for Azure Databricks:

# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)

with make_reader('{}/data_directory'.format(remote_url), storage_options = {'sas_token' : sas_token}) as reader:
    for row in reader:
        print(row)

Here I see a 'sas_token' being passed as input.

Please suggest how I can resolve this error.

I tried changing the path of the parquet file, but that did not work out for me.


There are 2 answers below.

Answer 1:

The SAS token used in the code can be generated for your container with the following steps:

  • Navigate to your container in the Azure portal, open its settings, and click Generate SAS.


  • Now select all the permissions you want to grant (the operations you need to perform).


  • When you click Generate, you will get a token that can be used in your code, as sketched below.

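Once generated, the token can be passed to petastorm much like the make_reader snippet quoted in the question. A minimal sketch, assuming your petastorm version supports the storage_options argument on make_batch_reader, and using placeholder container/account names and path:

from petastorm import make_batch_reader

# placeholders -- substitute your own container, account and generated token
remote_url = "abfs://container_name@storage_account_url"
sas_token = "<SAS token generated in the portal>"

# pass the token through storage_options, mirroring the make_reader example
with make_batch_reader('{}/output/scaled.parquet'.format(remote_url),
                       num_epochs=4,
                       shuffle_row_groups=False,
                       storage_options={'sas_token': sas_token}) as train_reader:
    for batch in train_reader:
        print(batch)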

Answer 2:

The problem is that on Databricks you have to provide the path in a different format; the following works for me. Add the file scheme and use three forward slashes (///), e.g. petastorm_dataset_url = "file://" + get_local_path(parquet_path), which for the question's file gives:

'file:///dbfs/output/scaled.parquet'
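Putting this together with the training loop from the question, a sketch of the Databricks version (assuming the file actually lives under /dbfs/output/ and that model is a Keras model built earlier):

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# note the file:/// scheme plus the /dbfs mount point
with make_batch_reader('file:///dbfs/output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)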