Python Wheel Package Automation for Databricks Deployment
Complete CI/CD of Python Wheel using Azure Pipelines
A Python wheel is a standard way to package and distribute the files required to run a Python application. Using the Python wheel task, you can ensure fast and reliable installation of Python code in your Databricks jobs.
In this blog, we will automate the deployment of wheel packages onto a Databricks cluster.
Create Setup File
First, create the setup.py file. It contains the package metadata and lists the packages we want to wrap inside the wheel. This setup.py file is used during the CI stage of the pipeline to build the artifact.
from setuptools import setup, find_packages

setup(
    name='devops-practice',
    version="1.0.0",
    packages=find_packages(),
    install_requires=[
        'click>=7.0',
        'pyjwt>=1.7.0',
        'oauthlib>=3.1.0',
        'requests>=2.17.3',
        'tabulate>=0.7.7',
        'six>=1.10.0',
        'configparser>=0.3.5;python_version < "3.6"',
    ],
)
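Note that bdist_wheel normalizes the distribution name when it builds the filename: runs of hyphens, underscores, and dots become a single underscore, so name devops-practice with version 1.0.0 yields devops_practice-1.0.0-py3-none-any.whl. A small sketch of that filename rule (the helper function is ours, not part of setuptools):

```python
import re

def wheel_filename(name, version, python_tag="py3", abi="none", platform="any"):
    # Per the wheel filename convention, runs of -, _ and . in the
    # distribution name are replaced by a single underscore.
    dist = re.sub(r"[-_.]+", "_", name)
    return f"{dist}-{version}-{python_tag}-{abi}-{platform}.whl"

print(wheel_filename("devops-practice", "1.0.0"))
# devops_practice-1.0.0-py3-none-any.whl
```

Knowing the exact filename matters later, when the pipeline uninstalls and installs the wheel on the cluster by its full DBFS path.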
Creating the Wheel Package Deployment Pipeline
The Wheel Package Deployment pipeline consists of a Build stage and a Deploy stage. In the Build stage, the pipeline runs the setup.py file and generates the artifacts.
Build Stage
Assume that the setup.py file is stored in the root directory of the repo.
1. The first step runs a bash script that installs the wheel package with pip and then runs the setup.py file with the bdist_wheel command.
stages:
  - stage: "Generate_Wheel_Package"
    jobs:
      - job: "Build"
        steps:
          - task: Bash@3
            displayName: "Generate Wheel Package"
            inputs:
              targetType: 'inline'
              script: |
                # build the wheel into the dist/ folder
                pip install wheel
                python setup.py bdist_wheel
2. The step above generates a package with a .whl extension inside a transient dist folder. Take this package and publish it as a pipeline artifact.
- task: PublishPipelineArtifact@1
  displayName: "Publish Wheel Artifacts"
  inputs:
    targetPath: '$(Build.SourcesDirectory)/dist'
    artifact: 'devops-wheel-$(Build.BuildNumber)'
    publishLocation: 'pipeline'
Deploy Stage
In the Deploy stage, the pipeline executes multiple steps, from downloading the artifacts to deploying them. Let's go through the steps:
1. The Deploy stage starts by downloading the artifacts generated in the Build stage.
- stage: "Deploy_wheel_Package"
  condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
  jobs:
    - job: "Wheel_Deployment"
      steps:
        - task: DownloadPipelineArtifact@2
          displayName: "Download Wheel Artifacts"
          inputs:
            buildType: 'current'
            artifactName: 'devops-wheel-$(Build.BuildNumber)'
            targetPath: '$(Pipeline.Workspace)/devops-wheel-$(Build.BuildNumber)'
2. The next step is to configure the Databricks workspace that you are using. In this example, a Databricks personal access token (PAT) is used to authenticate to the workspace.
- task: configuredatabricks@0
  displayName: "Authenticate the Databricks workspace"
  inputs:
    url: '$(azure_databricks_workspace_url)'
    token: '$(databricks-token)'
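The configuredatabricks@0 task writes a CLI profile (AZDO here), which is why the later CLI steps can pass --profile AZDO. Under the hood, authentication is simply the PAT sent as a bearer token. A minimal sketch of calling the workspace REST API directly with that token (the helper names and placeholder values are ours):

```python
import json
import urllib.request
from urllib.parse import urlencode

def api_url(host, endpoint, params=None):
    # Databricks REST endpoints live under /api/2.0/ on the workspace URL.
    url = f"{host}/api/2.0/{endpoint}"
    return url + ("?" + urlencode(params) if params else "")

def databricks_get(host, token, endpoint, params=None):
    # The PAT is sent as a standard bearer token.
    req = urllib.request.Request(
        api_url(host, endpoint, params),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. databricks_get(host, token, "clusters/get", {"cluster_id": cid})["state"]
```

This is what the CLI does on your behalf; in the pipeline itself, the extension task plus --profile AZDO keeps the token out of the scripts.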
3. The next step uses the databricks libraries command to list the libraries currently installed on the cluster. This step is not mandatory, but it is a handy way to view the current libraries before installing new ones.
- task: Bash@3
  displayName: "List Cluster Libraries"
  inputs:
    targetType: 'inline'
    script: |
      pip install databricks-cli --upgrade
      databricks libraries cluster-status --cluster-id $(azure_databricks_cluster_id) --profile AZDO
4. After listing the libraries, copy the .whl package from the artifacts to the Databricks File System (DBFS), so that the latest wheel package is also stored on DBFS.
- task: Bash@3
  displayName: "Copy wheel package to dbfs"
  inputs:
    targetType: 'inline'
    script: |
      # copy the downloaded artifact folder to DBFS
      dbfs cp $(Pipeline.Workspace)/devops-wheel-$(Build.BuildNumber) dbfs:/whl_package --recursive --overwrite --profile AZDO
5. Once the .whl package is present in DBFS, uninstall the existing, older version of the wheel package from the cluster.
- task: Bash@3
  displayName: "Uninstall existing Wheel Package from Cluster"
  inputs:
    targetType: 'inline'
    script: |
      databricks libraries uninstall --cluster-id $(azure_databricks_cluster_id) --whl dbfs:/whl_package/devops_practice-0.1.87-py3-none-any.whl --profile AZDO
6. The step above requires a cluster restart before the latest wheel package can be installed. However, if the cluster is already in a terminated state, the restart step will fail because the cluster is not running. To handle this, first check the cluster's current status: if it is running, just proceed to the next step; if it is terminated, start it and wait until it is up and running. We use a Databricks extension task to do this, but you could also use the Databricks CLI to check the cluster's state.
- task: startcluster@0
  displayName: "Check Cluster status and start if in terminated state"
  inputs:
    clusterid: '$(azure_databricks_cluster_id)'
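The startcluster@0 extension task wraps this check-and-start logic for us. If you would rather script it yourself with the CLI or REST API, the control flow looks roughly like this sketch, where get_state and start stand in for databricks clusters get / databricks clusters start calls, and the state names come from the Databricks clusters API:

```python
import time

def ensure_running(get_state, start, poll_seconds=30, timeout=1800):
    # Start the cluster only if it is terminated; a running or
    # pending cluster just needs to be waited on.
    if get_state() == "TERMINATED":
        start()
    waited = 0
    while get_state() != "RUNNING":
        if waited >= timeout:
            raise TimeoutError("cluster did not reach RUNNING state")
        time.sleep(poll_seconds)
        waited += poll_seconds
```

Injecting get_state and start as callables keeps the polling logic testable without a live workspace.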
7. With the above step, the cluster is now in a running state. Now run the cluster restart command to complete the uninstallation of the wheel package.
- task: Bash@3
  displayName: "Restarting the Cluster"
  inputs:
    targetType: 'inline'
    script: |
      databricks clusters restart --cluster-id $(azure_databricks_cluster_id) --profile AZDO
8. Once the cluster has restarted, we are good to install the new .whl package stored on DBFS.
- task: Bash@3
  displayName: "Install Wheel Package on Cluster"
  inputs:
    targetType: 'inline'
    script: |
      databricks libraries install --cluster-id $(azure_databricks_cluster_id) --whl dbfs:/whl_package/devops_practice-1.0.0-py3-none-any.whl --profile AZDO
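To confirm the installation, you can run databricks libraries cluster-status again and check that the wheel's status has reached INSTALLED. A sketch of parsing that output (the JSON shape follows the Libraries API's library_statuses field; the helper name is ours):

```python
def wheel_installed(status_json, whl_path):
    # cluster-status returns a library_statuses list; each entry holds
    # the library spec and its current install status on the cluster.
    for entry in status_json.get("library_statuses", []):
        if entry.get("library", {}).get("whl") == whl_path:
            return entry.get("status") == "INSTALLED"
    return False
```

In a gating step, you could poll this until it returns True (or a FAILED status appears) before marking the deployment green.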
Assumptions
The following assumptions were made while writing this blog:
- The setup.py file is present at the root of the repo.
- All variables, such as azure_databricks_cluster_id, come from variable groups created in Azure DevOps.
Summary
The pipeline above completes the deployment process for Databricks wheel packages. With this Azure DevOps pipeline, we have automated both the generation and the deployment of wheel packages on a Databricks cluster.