Update Databricks Workspace Repo by Connecting to Databricks CLI with GitHub Actions


I'm attempting to automatically pull the latest version of a GitHub repo into my Databricks workspace every time a new push is made to the repo. Everything works fine until the Databricks CLI requests the host URL, after which it fails with "Error: Process completed with exit code 1." I'm assuming it's an issue with my token and host credentials, stored as secrets, not properly loading into the environment. According to Databricks, "CLI 0.8.0 and above supports the following environment variables: DATABRICKS_HOST, DATABRICKS_USERNAME, DATABRICKS_PASSWORD, DATABRICKS_TOKEN". I've added both DATABRICKS_HOST and DATABRICKS_TOKEN as repository secrets, so I'm not sure what I'm doing wrong.

on:
  push:

jobs:
  build:
    runs-on: ubuntu-latest

    steps:

      - name: setup python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8 # install the python version needed

      - name: execute py
        env:
          DATABRICKS_HOST: $(DATABRICKS_HOST)
          DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)
        run: |
          python -m pip install --upgrade databricks-cli
          databricks configure --token
          databricks repos update --repo-id REPOID-ENTERED --branch "Development"

The error:

Successfully built databricks-cli
Installing collected packages: tabulate, certifi, urllib3, six, pyjwt, oauthlib, idna, click, charset-normalizer, requests, databricks-cli
Successfully installed certifi-2021.10.8 charset-normalizer-2.0.12 click-8.1.3 databricks-cli-0.16.6 idna-3.3 oauthlib-3.2.0 pyjwt-2.4.0 requests-2.27.1 six-1.16.0 tabulate-0.8.9 urllib3-1.26.9
WARNING: You are using pip version 22.0.4; however, version 22.1 is available.
You should consider upgrading via the '/opt/hostedtoolcache/Python/3.8.12/x64/bin/python -m pip install --upgrade pip' command.
Aborted!
Databricks Host (should begin with https://): 
Error: Process completed with exit code 1.
There are 2 answers below.

BEST ANSWER

Just remove `databricks configure --token` from your command; it's not required. The Databricks CLI will use the environment variables in this case. See the working pipeline for Azure DevOps here.
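As a sketch, the asker's step with `databricks configure --token` removed. One extra assumption here: in GitHub Actions, repository secrets must be referenced as `${{ secrets.NAME }}` rather than `$(NAME)`, so the `env` block below uses that syntax:

```yaml
    - name: execute py
      env:
        DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      run: |
        python -m pip install --upgrade databricks-cli
        databricks repos update --repo-id REPOID-ENTERED --branch "Development"
```

With the host and token in the environment, the CLI never prompts, so the step no longer aborts waiting for input.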

ANOTHER ANSWER

I think calling the API directly, without using the client, works best. Below is code that works from Azure DevOps; it should also work in a GitHub Action.

      from adal import AuthenticationContext

      # service principal details, supplied as pipeline variables
      user_parameters = {
          "tenant": "$(SP_TENANT_ID)",
          "client_id": "$(SP-CLIENT-ID)",
          "redirect_uri": "http://localhost",
          "client_secret": "$(SP-CLIENT-SECRET)"
      }

      authority_host_url = "https://login.microsoftonline.com/"
      azure_databricks_resource_id = "put_here"  # the AAD resource ID for Azure Databricks
      authority_url = authority_host_url + user_parameters['tenant']

      # supply the refresh_token (whose default lifetime is 90 days or longer)
      def refresh_access_token(refresh_token):
          context = AuthenticationContext(authority_url)
          token_response = context.acquire_token_with_refresh_token(
              refresh_token,
              user_parameters['client_id'],
              azure_databricks_resource_id,
              user_parameters['client_secret'])

          # a new 'refreshToken' and 'accessToken' are returned
          return (token_response['refreshToken'], token_response['accessToken'])

      (refresh_token, access_token) = refresh_access_token("$(AAD-REFRESH-TOKEN)")
      # expose the access token to later pipeline steps
      print('##vso[task.setvariable variable=ACCESS_TOKEN;]%s' % (access_token))
- bash: |
    # Update the repo to the given branch
    echo 'Patching repo $(DB_WORKSPACE_HOST)/$(REPO_ID)'
    echo 'https://$(DB_WORKSPACE_HOST)/api/2.0/repos/$(REPO_ID) $(Build.SourceBranchName)'

    curl -n -X PATCH -o "/tmp/db_patch-out.json" https://$(DB_WORKSPACE_HOST)/api/2.0/repos/$(REPO_ID) \
        -H 'Authorization: Bearer $(ACCESS_TOKEN)' \
        -d '{"branch": "$(Build.SourceBranchName)"}'
    cat "/tmp/db_patch-out.json"
    # fail the step if the response consists only of an error_code payload
    grep -v error_code "/tmp/db_patch-out.json"
  displayName: 'Update Databricks Repo'
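The curl call above can also be assembled in Python. A minimal sketch: the host, repo ID, and token below are hypothetical placeholders standing in for the pipeline variables `$(DB_WORKSPACE_HOST)`, `$(REPO_ID)`, and `$(ACCESS_TOKEN)`; sending the request (e.g. with `requests.patch`) is left commented out since it needs real credentials:

```python
import json

def build_repos_update_request(host, repo_id, branch, token):
    """Build the URL, headers, and JSON body for a Repos API branch update."""
    url = "https://%s/api/2.0/repos/%s" % (host, repo_id)
    headers = {"Authorization": "Bearer %s" % token}
    body = json.dumps({"branch": branch})
    return url, headers, body

url, headers, body = build_repos_update_request(
    "adb-123.azuredatabricks.net", "456", "Development", "dapi-example-token")
print(url)  # https://adb-123.azuredatabricks.net/api/2.0/repos/456
# requests.patch(url, headers=headers, data=body)  # uncomment to actually send
```

This keeps the request construction testable separately from the network call, which is the part that can only run inside the pipeline.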

This works if there is network connectivity to Databricks from your Git provider. If you have ADF (Azure Data Factory) on the same network and do not have network connectivity, you can 1) spin up an API gateway to secure and bridge your network calls, or 2) use an async trigger to ADF and have it call Databricks by dropping a file in Azure Storage (https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger?tabs=data-factory), or by sending an email or another event trigger.

While the above methods work if there is a true IP address restriction, the issue with the call may just be that the CA certificates are not verified correctly. You can override this locally using pip-system-certs, or by exporting the certificate from your browser and specifying the PEM file.
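As a sketch of those two workarounds, assuming the browser-exported certificate was saved to a hypothetical path `/path/to/corp-ca.pem` (the `REQUESTS_CA_BUNDLE` variable is honored by the `requests` library that the Databricks CLI uses):

```shell
# Option 1: have pip/requests trust the operating system's certificate store
pip install pip-system-certs

# Option 2: point requests at the certificate exported from your browser
export REQUESTS_CA_BUNDLE=/path/to/corp-ca.pem
databricks repos update --repo-id REPOID-ENTERED --branch "Development"
```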