Python Wheel Package Automation for Databricks Deployment
Complete CI/CD of Python Wheel using Azure Pipelines
A Python wheel is a standard way to package and distribute the files required to run a Python application. Using the Python wheel task, you can ensure fast and reliable installation of Python code in your Databricks jobs.
In this blog, we will automate the deployment of wheel packages onto a Databricks cluster.
Create Setup File
First, create the setup.py file. It contains the package metadata and lists the packages we want to wrap inside the wheel. This setup.py file is used during the CI stage of the pipeline to build the artifact.
from setuptools import setup, find_packages

setup(
    name='devops-practice',
    version="1.0.0",
    packages=find_packages(),
    install_requires=[
        'click>=7.0',
        'pyjwt>=1.7.0',
        'oauthlib>=3.1.0',
        'requests>=2.17.3',
        'tabulate>=0.7.7',
        'six>=1.10.0',
        'configparser>=0.3.5;python_version < "3.6"',
    ],
)
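Note that bdist_wheel normalizes the distribution name when it builds the filename: runs of hyphens, underscores, and dots become a single underscore, so name devops-practice with version 1.0.0 yields devops_practice-1.0.0-py3-none-any.whl. A small sketch of that filename rule (the helper function is ours, not part of setuptools):

```python
import re

def wheel_filename(name, version, python_tag="py3", abi="none", platform="any"):
    # Per the wheel filename convention, runs of -, _ and . in the
    # distribution name are replaced by a single underscore.
    dist = re.sub(r"[-_.]+", "_", name)
    return f"{dist}-{version}-{python_tag}-{abi}-{platform}.whl"

print(wheel_filename("devops-practice", "1.0.0"))
# devops_practice-1.0.0-py3-none-any.whl
```

Knowing the exact filename matters later, when the pipeline uninstalls and installs the wheel on the cluster by its full DBFS path.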
Creating the Wheel Package Deployment Pipeline
The Wheel Package Deployment pipeline consists of a Build stage and a Deploy stage. In the Build stage, the pipeline runs the setup.py file and generates the artifacts.
Build Stage
Assume that the setup.py file is stored in the root directory of the repo.
1. The first step runs a bash script that installs the wheel package with pip and then runs the setup.py file with the bdist_wheel command.
stages:
  - stage: "Generate_Wheel_Package"
    jobs:
      - job: "Build"
        steps:
          - task: Bash@3
            displayName: "Generate Wheel Package"
            inputs:
              targetType: 'inline'
              script: |
                # build the wheel into the dist/ folder
                pip install wheel
                python setup.py bdist_wheel
2. The step above generates a package with a .whl extension inside a transient dist folder. Take this package and publish it as a pipeline artifact.
- task: PublishPipelineArtifact@1
  displayName: "Publish Wheel Artifacts"
  inputs:
    targetPath: '$(Build.SourcesDirectory)/dist'
    artifact: 'devops-wheel-$(Build.BuildNumber)'
    publishLocation: 'pipeline'
Deploy Stage
In the Deploy stage, the pipeline executes multiple steps, from downloading the artifacts to deploying them. Let's go through the steps:
1. The Deploy stage starts by downloading the artifacts generated in the Build stage.
- stage: "Deploy_wheel_Package"
  condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
  jobs:
    - job: "Wheel_Deployment"
      steps:
        - task: DownloadPipelineArtifact@2
          displayName: "Download Wheel Artifacts"
          inputs:
            buildType: 'current'
            artifactName: 'devops-wheel-$(Build.BuildNumber)'
            targetPath: '$(Pipeline.Workspace)/devops-wheel-$(Build.BuildNumber)'
2. The next step is to configure the Databricks workspace that you are using. In this example, a Databricks personal access token (PAT) is used to authenticate to the workspace.
- task: configuredatabricks@0
  displayName: "Authenticate the Databricks workspace"
  inputs:
    url: '$(azure_databricks_workspace_url)'
    token: '$(databricks-token)'
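The configuredatabricks@0 task writes a CLI profile (AZDO here), which is why the later CLI steps can pass --profile AZDO. Under the hood, authentication is simply the PAT sent as a bearer token. A minimal sketch of calling the workspace REST API directly with that token (the helper names and placeholder values are ours):

```python
import json
import urllib.request
from urllib.parse import urlencode

def api_url(host, endpoint, params=None):
    # Databricks REST endpoints live under /api/2.0/ on the workspace URL.
    url = f"{host}/api/2.0/{endpoint}"
    return url + ("?" + urlencode(params) if params else "")

def databricks_get(host, token, endpoint, params=None):
    # The PAT is sent as a standard bearer token.
    req = urllib.request.Request(
        api_url(host, endpoint, params),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. databricks_get(host, token, "clusters/get", {"cluster_id": cid})["state"]
```

This is what the CLI does on your behalf; in the pipeline itself, the extension task plus --profile AZDO keeps the token out of the scripts.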
3. The next step uses the databricks libraries command to list the libraries currently installed on the cluster. This step is not mandatory, but it is a handy way to view the current libraries before installing new ones.
- task: Bash@3
  displayName: "List Cluster Libraries"
  inputs:
    targetType: 'inline'
    script: |
      pip install databricks-cli --upgrade
      databricks libraries cluster-status --cluster-id $(azure_databricks_cluster_id) --profile AZDO
4. After listing the libraries, copy the .whl package from the artifacts to the Databricks File System (DBFS), so that the latest wheel package is also stored on DBFS.
- task: Bash@3
  displayName: "Copy wheel package to dbfs"
  inputs:
    targetType: 'inline'
    script: |
      # copy the downloaded artifact folder to DBFS
      dbfs cp $(Pipeline.Workspace)/devops-wheel-$(Build.BuildNumber) dbfs:/whl_package --recursive --overwrite --profile AZDO
5. Once the .whl package is present in DBFS, uninstall the existing, older version of the wheel package from the cluster.
- task: Bash@3
  displayName: "Uninstall existing Wheel Package from Cluster"
  inputs:
    targetType: 'inline'
    script: |
      databricks libraries uninstall --cluster-id $(azure_databricks_cluster_id) --whl dbfs:/whl_package/devops_practice-0.1.87-py3-none-any.whl --profile AZDO
6. The step above requires a cluster restart before the latest wheel package can be installed. However, if the cluster is already in a terminated state, the restart step will fail because the cluster is not running. To handle this, first check the cluster's current status: if it is running, just proceed to the next step; if it is terminated, start it and wait until it is up and running. We use a Databricks extension task to do this, but you could also use the Databricks CLI to check the cluster's state.
- task: startcluster@0
  displayName: "Check Cluster status and start if in terminated state"
  inputs:
    clusterid: '$(azure_databricks_cluster_id)'
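The startcluster@0 extension task wraps this check-and-start logic for us. If you would rather script it yourself with the CLI or REST API, the control flow looks roughly like this sketch, where get_state and start stand in for databricks clusters get / databricks clusters start calls, and the state names come from the Databricks clusters API:

```python
import time

def ensure_running(get_state, start, poll_seconds=30, timeout=1800):
    # Start the cluster only if it is terminated; a running or
    # pending cluster just needs to be waited on.
    if get_state() == "TERMINATED":
        start()
    waited = 0
    while get_state() != "RUNNING":
        if waited >= timeout:
            raise TimeoutError("cluster did not reach RUNNING state")
        time.sleep(poll_seconds)
        waited += poll_seconds
```

Injecting get_state and start as callables keeps the polling logic testable without a live workspace.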
7. With the above step, the cluster is now in a running state. Now run the cluster restart command to complete the uninstallation of the wheel package.
- task: Bash@3
  displayName: "Restarting the Cluster"
  inputs:
    targetType: 'inline'
    script: |
      databricks clusters restart --cluster-id $(azure_databricks_cluster_id) --profile AZDO
8. Once the cluster has restarted, we are good to install the new .whl package stored on DBFS.
- task: Bash@3
  displayName: "Install Wheel Package on Cluster"
  inputs:
    targetType: 'inline'
    script: |
      databricks libraries install --cluster-id $(azure_databricks_cluster_id) --whl dbfs:/whl_package/devops_practice-1.0.0-py3-none-any.whl --profile AZDO
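To confirm the installation, you can run databricks libraries cluster-status again and check that the wheel's status has reached INSTALLED. A sketch of parsing that output (the JSON shape follows the Libraries API's library_statuses field; the helper name is ours):

```python
def wheel_installed(status_json, whl_path):
    # cluster-status returns a library_statuses list; each entry holds
    # the library spec and its current install status on the cluster.
    for entry in status_json.get("library_statuses", []):
        if entry.get("library", {}).get("whl") == whl_path:
            return entry.get("status") == "INSTALLED"
    return False
```

In a gating step, you could poll this until it returns True (or a FAILED status appears) before marking the deployment green.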
Assumptions
The following assumptions were made while writing this blog:
- The setup.py file is present at the root of the repo.
- All variables, such as azure_databricks_cluster_id, come from variable groups created in Azure DevOps.
Summary
The pipeline above completes the deployment process for Databricks wheel packages. With this Azure DevOps pipeline, we have automated both the generation and the deployment of wheel packages on a Databricks cluster.