I would like to use dbx execute to run a task/job on an Azure Databricks cluster. However, I cannot make it install my code.
More Details on the situation:
- Project A (with a setup.py) depends on Project B
- Project B is also Python-based and is released as an Azure DevOps artifact
- I can successfully install A via an init script on an Azure Databricks cluster: the script git clones both projects and then pip installs B and A in editable mode (pip install -e)
- It also works when the init script creates a pip.conf file that configures a token for my Azure Artifacts feed
- So dbx deploy/launch works fine, since my clusters use the init script
- However, dbx execute always fails, telling me that it cannot find and install Project B
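For reference, the working init script described above can be sketched roughly like this (a sketch only; the repo URLs, placeholder values in <...>, and target paths are made up, not my actual ones):

```shell
#!/bin/bash
# Hypothetical cluster init script. <user>, <PAT>, <org>, <project>
# are placeholders; a personal access token is embedded in the clone
# URLs for authentication.
set -e

# Clone both projects onto the cluster
git clone "https://<user>:<PAT>@dev.azure.com/<org>/<project>/_git/project-b" /tmp/project-b
git clone "https://<user>:<PAT>@dev.azure.com/<org>/<project>/_git/project-a" /tmp/project-a

# Install B first (A depends on it), both in editable mode
pip install -e /tmp/project-b
pip install -e /tmp/project-a
```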
Does anyone know how to configure the pip instance that is used during the dbx execute installation process? It seems to ignore any configuration that was set via init scripts.
I searched through a lot of documentation, such as https://docs.databricks.com/libraries/index.html and https://dbx.readthedocs.io/en/latest/reference/deployment/#advanced-package-dependency-management, but with no luck.
Looking into the dbx source, there does not seem to be an option to set a pip.conf either :( https://github.com/databrickslabs/dbx/blob/main/dbx/commands/execute.py
I also raised an issue in the dbx GitHub repo: https://github.com/databrickslabs/dbx/issues/669. They pointed me to this page, which explains how to do it:
https://dbx.readthedocs.io/en/latest/guides/general/dependency_management/?h=custom+rep#installing-python-packages-from-custom-pypi-repos
In short: overwrite the global pip.conf at /etc/pip.conf in your init.sh.
To make it work with Azure DevOps, I created an Azure DevOps personal access token (PAT) and adapted extra-index-url accordingly. Replace all values in <...> with your own; the username can have any value, as the token alone is enough for authentication.
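A minimal sketch of such an init script fragment, assuming the standard Azure Artifacts index URL format (the organization, project, feed names, and PAT in <...> are placeholders):

```shell
#!/bin/bash
# Overwrite the global pip config so every pip invocation on the cluster,
# including the one dbx execute triggers, can resolve packages from the
# private Azure Artifacts feed.
# <org>, <project>, <feed>, and <PAT> are placeholders; the username
# ("build" here) can be anything, since the PAT handles authentication.
cat > /etc/pip.conf <<'EOF'
[global]
extra-index-url=https://build:<PAT>@pkgs.dev.azure.com/<org>/<project>/_packaging/<feed>/pypi/simple/
EOF
```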