Databricks - How to access Workspace Files in init scripts


Hope everyone is doing well...

We are exploring whether it is possible to keep a few of our jars in a folder in the Workspace and have them copied around as part of the init scripts.

For example, in the workspace we have the following structure.

/Workspace/<Folder_Name1>/jars/sample_name_01.jar

The init script would then attempt to copy it to a path on DBFS or the driver node's file system.

#!/bin/bash
# Ensure the target directories exist, then copy the jar from the Workspace path
mkdir -p /dbfs/jars/ /tmp/jars/
cp /Workspace/<Folder_Name1>/jars/sample_name_01.jar /dbfs/jars/
cp /Workspace/<Folder_Name1>/jars/sample_name_01.jar /tmp/jars/

Of course the init script is failing with the error message

cp: cannot stat '/Workspace/<Folder_Name1>/jars/sample_name_01.jar': No such file or directory

I have tried the path both with and without the /Workspace prefix. I have also tried accessing the file from the web terminal, and I am able to see the files there.

  1. Are workspace files accessible via init scripts?
  2. Is there a limitation for jar and whl/egg files?
  3. What is the right syntax to access them?
  4. Does it make sense to keep the jars (only a few) as workspace files, or should they go in DBFS?

Thanks for all the help... Cheers...

Update 01:

Tried some of the suggestions received via other means...

  1. Since init scripts in the Workspace are referenced without the /Workspace prefix, I have also tried the path without it, but the issue is the same.
  2. Have also tried listing the directory and printing its contents; the path itself does not seem to be recognized.
  3. Have also tried sleeping for up to 2 minutes to give the mounts some time, still nothing...

There are 3 best solutions below

JayashankarGS

First, check that you have permissions on the workspace and jar folders. If you do and cp still does not work, below are the possible reasons.

When admins upload jar files, there are two options.

  1. Upload the jars as a library.
  2. Upload the jars as plain files.

Option 1

Below is how it is done when the jar is uploaded as a library: you are prompted to upload the jar, and after clicking Create, the resulting library page gives you the option to install it on a cluster and shows its Source, which is what you need here.

When the jar is uploaded as a library, it ends up in a DBFS path by default, at the location below.

/dbfs/FileStore/jars/
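
If the jar was uploaded that way, a minimal init script sketch could copy it from this DBFS location instead of the Workspace path. Note the file name below is a placeholder: Databricks generates its own name (typically with a hash prefix) on upload, so list the directory first to find the actual name.

#!/bin/bash
# Sketch: copy a library-uploaded jar from its default DBFS location to the
# driver's local file system. The file name is a placeholder; check the real
# generated name first, e.g. with: ls /dbfs/FileStore/jars/
mkdir -p /tmp/jars
cp /dbfs/FileStore/jars/sample_name_01.jar /tmp/jars/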

Option 2

When the jar is uploaded as just a file, you are prompted to upload it and click Create, and the uploaded jars then appear in the workspace folder.

If you use your copy command on a jar that was uploaded as a file, it will work.

If you still get the same error, then it is a permissions issue. A possible workaround is to run the code below in a notebook after the cluster has been created.

%sh
cp '/Workspace/Users/xxxxxxx/jars/helloworld-2.0_tmp (1).jar' /dbfs/jars/
ls /dbfs/jars/


Note - this does not work if admins upload the jar as a library; as mentioned above, those jars are only available in DBFS.

rainingdistros

As per a related post in the Databricks community forums, it has been confirmed that, for now, this is not possible. When an init script is placed in the workspace, access is limited to that init script only, not to any other files in the workspace. The post also mentions that accessing the files is still possible through API calls or through the Databricks CLI, but personally I feel that makes it a slightly roundabout way of doing it. Thank you for all the help; I hope and look forward to better ways of doing this.

FRG96

Just copying the answer I found useful in this Databricks Community Forum thread: https://community.databricks.com/t5/data-engineering/accessing-workspace-files-within-cluster-init-script/m-p/3183

The init script runs on the cluster nodes before the notebook execution, and it does not have direct access to workspace files.

The documentation you mentioned refers to placing the init script inside a workspace file, which means you can store the script itself in a file within the Databricks workspace. However, it doesn't grant direct access to other workspace files from within the init script.

To access a workspace file within the init script, you can consider using the Databricks CLI or Databricks API to retrieve the file and then copy or read it on the cluster nodes during the init script execution.
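
As a rough illustration of that suggestion, the sketch below pulls the jar from the question through the Workspace export REST API inside an init script. It assumes DATABRICKS_HOST and DATABRICKS_TOKEN are made available to the cluster (for example via environment variables backed by a secret scope), and the exact export parameters should be checked against the Workspace API documentation for your workspace version.

#!/bin/bash
# Sketch, not a verified recipe: fetch a workspace file from within an init
# script using the Workspace export API. DATABRICKS_HOST and DATABRICKS_TOKEN
# are assumed to be provided to the cluster; the jar path mirrors the question.
set -euo pipefail

mkdir -p /tmp/jars

# direct_download=true asks the API to return the raw file instead of
# base64-encoded JSON (verify this parameter against your API version).
curl -sf -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  --get --data-urlencode "path=/<Folder_Name1>/jars/sample_name_01.jar" \
  --data-urlencode "direct_download=true" \
  "${DATABRICKS_HOST}/api/2.0/workspace/export" \
  -o /tmp/jars/sample_name_01.jar

Whether this beats simply keeping a handful of jars in DBFS (question 4 above) is debatable, since it adds token plumbing to every cluster that needs the files.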