How to create a make_batch_reader object from the petastorm library in Databricks?


I have data saved in parquet format. Petastorm is a library I am using to obtain batches of data for training.

Though I was able to do this on my local system, the same code is not working in Databricks.

Code I used in my local system

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# create an iterator object train_reader. num_epochs is the number of epochs
# for which we want to train our model
with make_batch_reader('file:///config/workspace/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)  # model is a Keras model built earlier

Code I used in Databricks

with make_batch_reader('dbfs://output/scaled.parquet', num_epochs=4,shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: (tf.convert_to_tensor(x))).batch(2)
  

    for ele in train_ds:
        tensor = tf.reshape(ele,(2,1,15))
        model.fit(tensor,tensor)

The error I am getting from the Databricks code is:

TypeError: __init__() missing 2 required positional arguments: 'instance' and 'token'

I have checked the documentation, but couldn't find any argument that goes by the name 'instance' or 'token'. However, for a similar petastorm method, make_reader, I see the following code for Azure Databricks:

# create sas token for storage account access, use your own adls account info
remote_url = "abfs://container_name@storage_account_url"
account_name = "<<adls account name>>"
linked_service_name = '<<linked service name>>'
TokenLibrary = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
sas_token = TokenLibrary.getConnectionString(linked_service_name)

with make_reader('{}/data_directory'.format(remote_url), storage_options = {'sas_token' : sas_token}) as reader:
    for row in reader:
        print(row)

Here I see a 'sas_token' being passed as input.

Please suggest how I can resolve this error.

I tried changing the path of the parquet file, but that did not work out for me.


There are 2 answers below.

Answer 1:

The SAS token used in the code can be generated for your container with the following steps:

  • Navigate to your container in the Azure portal, open its settings, and click Generate SAS.


  • Now select all the permissions you want to grant (the operations you need to perform).


  • When you click Generate, you will get a token that can be used in your code, as sketched below.

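Once generated, the token can be passed to petastorm much like the make_reader snippet quoted in the question. A minimal sketch, assuming your petastorm version supports the storage_options argument on make_batch_reader, and using placeholder container/account names and path:

from petastorm import make_batch_reader

# placeholders -- substitute your own container, account and generated token
remote_url = "abfs://container_name@storage_account_url"
sas_token = "<SAS token generated in the portal>"

# pass the token through storage_options, mirroring the make_reader example
with make_batch_reader('{}/output/scaled.parquet'.format(remote_url),
                       num_epochs=4,
                       shuffle_row_groups=False,
                       storage_options={'sas_token': sas_token}) as train_reader:
    for batch in train_reader:
        print(batch)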

Answer 2:

The problem is that on Databricks you have to provide the path in a different format; the following works for me. Add the file scheme and use three forward slashes (///), e.g. petastorm_dataset_url = "file://" + get_local_path(parquet_path), which for the question's file gives:

'file:///dbfs/output/scaled.parquet'
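Putting this together with the training loop from the question, a sketch of the Databricks version (assuming the file actually lives under /dbfs/output/ and that model is a Keras model built earlier):

import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# note the file:/// scheme plus the /dbfs mount point
with make_batch_reader('file:///dbfs/output/scaled.parquet', num_epochs=4, shuffle_row_groups=False) as train_reader:
    train_ds = make_petastorm_dataset(train_reader).unbatch().map(lambda x: tf.convert_to_tensor(x)).batch(2)

    for ele in train_ds:
        tensor = tf.reshape(ele, (2, 1, 15))
        model.fit(tensor, tensor)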