Automated CI/CD For Azure Data Factory

E2E Deployment of ADF Using Automated Publish Method

Manjit Singh
9 min read · Dec 8, 2022
Azure Data Factory Deployment

Introduction

Azure Data Factory (ADF) is a Microsoft Azure service that allows developers to integrate and transform their data using various processes. ADF has good integrations with multiple internal and external services.

In ADF, CI/CD essentially means deploying the Data Factory components (Linked Services, Triggers, Pipelines, Datasets) to a new environment (another ADF instance) automatically using pipelines.

Azure Data Factory CI/CD Lifecycle

The Azure Data Factory CI/CD lifecycle consists of steps from development to deployment. Refer to this link to know more.

ADF Modes

  1. ADF has two modes: Live mode and Git mode.
  2. Git integration in ADF involves selecting two branches: a collaboration branch and a publish branch.
  3. The collaboration branch is where all feature branches are merged (mapped to the 'develop' branch in our case). The publish branch is where all changes, including the auto-generated ARM templates, get published (by default, ADF creates the 'adf_publish' branch for this).
  4. There will be a corresponding 'adf' folder in our repo where all the ADF resources are stored. A typical layout is sketched below.
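
For illustration, here is roughly what that folder can look like once a few resources exist. The subfolder names are what ADF generates; the root folder name (data_ops/adf here, matching the adf_code_path variable used later) is whatever you chose as "Root folder" during Git setup:

data_ops/adf/
├── pipeline/         # one JSON file per ADF pipeline
├── dataset/          # one JSON file per dataset
├── linkedService/    # one JSON file per linked service
├── trigger/          # one JSON file per trigger
└── ...               # other resource types (dataflow, integrationRuntime, etc.)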

Traditional Deployment Method: Manual Publish

  • Earlier, ADF deployment involved a manual publish from the collaboration branch. This publish generated the ARM templates and stored them in the adf_publish branch, and the deployment process then used the adf_publish branch.
  • The main issue with this approach is that a developer has to manually click the Publish button in the ADF portal before every deployment.

Better Deployment Approach: Automatic Publish

  • In this approach, the manual publish step is eliminated by publishing automatically: the Build pipeline generates the ARM templates with an NPM package and publishes them as pipeline artifacts.
  • Below is the architecture of the ADF CI/CD implementation.
Azure Data Factory- Development and Deployment Architecture

In the above diagram, there is one ADF instance in the dev environment that is linked to GitHub. Development is performed on the dev ADF via feature branches, and the changes are merged back to the develop branch using pull requests. Once merged, the changes are deployed to the dev ADF.

After the merge, the ADF CI/CD pipeline is triggered, which automatically deploys the ADF resources to the Dev and Prod environments.

Pre-requisites

1. An Azure DevOps account.
2. A Data Factory in a dev environment with Azure Repos Git integration.
3. Optional: a Data Factory in another environment (Pre-Prod).

Development flow for ADF

  1. In the ADF portal, create a feature branch (feature_1) from your collaboration branch (develop).
  2. Develop and manually test your changes in the feature branch via the ADF portal.
  3. Create a PR from the feature branch to the collaboration branch in GitHub.
  4. Once the PR is merged, the changes will be deployed to the ADF dev environment.

Adding the package.json

Before you start creating the pipeline, you will have to create a package.json file. This file contains the details needed to obtain the ADFUtilities package. The content of the file is given below:

  1. In the repository, create a 'build' folder (the folder name can be anything).
  2. Inside the folder, create a package.json file.
  3. Paste the code block below into the package.json file.
  4. NPM will use this JSON file to find the ADFUtilities package.
{
  "scripts": {
    "build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
  },
  "dependencies": {
    "@microsoft/azure-data-factory-utilities": "^0.1.5"
  }
}
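
Before wiring this into a pipeline, you can sanity-check the setup locally. A sketch (the resource ID values are placeholders, and the command mirrors the customCommand the pipeline runs later):

# from inside the 'build' folder that holds package.json
npm install

# validate the ADF code, exactly as the Build stage will
npm run build validate <path-to-adf-code> /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.DataFactory/factories/<adf-name>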

Creating the Azure YAML Pipeline

The Azure YAML pipeline file contains stages for CI and CD, with the required tasks in each stage.

  • The pipeline file starts with the variables block. In this pipeline, the variables are stored in Azure DevOps variable groups.
  • In our example, there are two variable groups, corresponding to the dev and pre-prod environments.
  • The variables needed in the variable groups are azure_subscription_id, azure_service_connection_name, resource_group_name, and azure_data_factory_name.
  • There are two more variables defined in the pipeline itself. adf_code_path is the path in the repo where the ADF code is stored; this is the same path you chose as "Root folder" while linking Git with ADF.
  • adf_package_file_path is the path in the repo where the package.json file is present. This variable is used when generating the ARM templates.
variables:
- ${{ if eq(variables['build.SourceBranchName'], 'develop') }}:
  - group: adf-dev
- ${{ if eq(variables['build.SourceBranchName'], 'main') }}:
  - group: adf-pre-prod
- name: adf_code_path
  value: "$(Build.SourcesDirectory)/data_ops/adf"
- name: adf_package_file_path
  value: "$(Build.SourcesDirectory)/build/"

Validate ADF (Optional Stage)

  • This is an optional stage where the ADF resources (Linked Services, Pipelines, Datasets) in the repository are validated.
  • This stage consists of a single step that uses an Azure DevOps marketplace extension to validate the ADF resources.
stages:
- stage: Validate_ADF_Code
  pool:
    vmImage: 'windows-latest'
  jobs:
  - job: Build_ADF_Code
    displayName: 'Validating the ADF Code'
    steps:
    - task: BuildADFTask@1
      inputs:
        DataFactoryCodePath: '$(adf_code_path)'
        Action: 'Build'

Output of ADF Validation Step

CI Process (Build Stage)

In the Build stage, our goal is to validate the ADF code, retrieve the files from the 'develop' branch of the Git repository, and automatically generate the ARM templates for the Deployment stage.

The Build stage consists of five steps:

  1. Declare a stage “Build_And_Publish_ADF_Artifacts” which will contain the Build steps.
- stage: Build_And_Publish_ADF_Artifacts
  jobs:
  - job: Build_Adf_Arm_Template
    displayName: 'ADF - ARM template'
    steps:

2. Next, you need to install the dependencies. Azure provides the ADFUtilities NPM package, which is used to validate the ADF code and create the deployment templates. To install this package, we first need Node.js and the NPM package manager.

- task: NodeTool@0
  inputs:
    versionSpec: '10.x'
  displayName: 'Install Node.js'

- task: Npm@1
  inputs:
    command: 'install'
    workingDir: '$(adf_package_file_path)' # replace with the folder containing package.json
    verbose: true
  displayName: 'Install NPM Package'

3. Next, validate all of the Data Factory resource code in the repository. This calls the 'validate' function with the path where the ADF code is stored in the repo. The working directory is where ADFUtilities is installed.

- task: Npm@1
  displayName: 'Validate ADF Code'
  inputs:
    command: 'custom'
    workingDir: '$(adf_package_file_path)' # replace with the folder containing package.json
    customCommand: 'run build validate $(adf_code_path) /subscriptions/$(azure_subscription_id)/resourceGroups/$(resource_group_name)/providers/Microsoft.DataFactory/factories/$(azure_data_factory_name)'

4. The next step is to generate the ARM templates from the Data Factory source code. The 'export' function outputs the ARM templates into the 'ArmTemplate' folder inside the working directory.

- task: Npm@1
  displayName: 'Validate and Generate ARM template'
  inputs:
    command: 'custom'
    workingDir: '$(adf_package_file_path)' # replace with the folder containing package.json
    customCommand: 'run build export $(adf_code_path) /subscriptions/$(azure_subscription_id)/resourceGroups/$(resource_group_name)/providers/Microsoft.DataFactory/factories/$(azure_data_factory_name) "ArmTemplate"' # 'ArmTemplate' is the output folder name for the generated templates
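
For reference, after this step the ArmTemplate folder inside the working directory typically contains the two files the deployment stage consumes, plus linked templates for very large factories:

ArmTemplate/
├── ARMTemplateForFactory.json            # used as csmFile in the deploy step
├── ARMTemplateParametersForFactory.json  # used as csmParametersFile in the deploy step
└── linkedTemplates/                      # split templates for factories exceeding ARM size limits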

5. Finally, the generated ARM templates are published as a pipeline artifact named 'adf-artifact-$(Build.BuildNumber)'.

- task: PublishPipelineArtifact@1
  displayName: 'Publish Build Artifacts - ADF ARM templates'
  inputs:
    targetPath: '$(adf_package_file_path)ArmTemplate' # the ArmTemplate folder generated inside the package.json folder
    artifact: 'adf-artifact-$(Build.BuildNumber)'
    publishLocation: 'pipeline'

CD Process (Deployment Stage)

The main goal of the Deployment stage is to deploy the ADF resources, which is achieved by deploying the ARM templates generated in the Build stage.

The Deployment stage consists of the following steps:

  1. Create a new stage 'Deploy_to_Dev'. Ensure this stage only runs if:

a) the CI Build stage succeeded;
b) the run was not triggered by a pull request (i.e., the code has already been merged);
c) the pipeline was triggered from the develop branch.

- stage: Deploy_to_Dev
  condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'), eq(variables['build.SourceBranchName'], 'develop'))
  displayName: Deploy To Development Environment
  dependsOn: Build_And_Publish_ADF_Artifacts
  jobs:
  - job: Deploy_Dev
    displayName: 'Deployment - Dev'
    steps:

2. Download the ARM template artifacts that were published in the Build stage. By default, pipeline artifacts are downloaded to the $(Pipeline.Workspace) folder.

- task: DownloadPipelineArtifact@2
  displayName: 'Download Build Artifacts - ADF ARM templates'
  inputs:
    artifactName: 'adf-artifact-$(Build.BuildNumber)'
    targetPath: '$(Pipeline.Workspace)/adf-artifact-$(Build.BuildNumber)'

3. ADF can contain triggers that run the ADF pipelines on a condition or schedule. It's a best practice to STOP the triggers during deployment so that no ADF pipeline fires while the deployment is running. This is an optional but important step.

- task: toggle-adf-trigger@2
  displayName: 'STOP ADF Triggers before Deployment'
  inputs:
    azureSubscription: '$(azure_service_connection_name)'
    ResourceGroupName: '$(resource_group_name)'
    DatafactoryName: '$(azure_data_factory_name)'
    TriggerFilter: 'Dataload_Trigger' # name of the trigger; leave empty to stop all triggers
    TriggerStatus: 'stop'
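
As an aside, if you would rather not depend on the marketplace extension, the same stop/start can be done with the Azure CLI (a sketch; it requires the 'datafactory' CLI extension, and all names are placeholders):

# one-time: install the Data Factory CLI extension
az extension add --name datafactory

# stop a trigger before deployment (use 'start' after deployment)
az datafactory trigger stop \
  --resource-group rg-adf-dev \
  --factory-name adf-dataops-dev \
  --name Dataload_Trigger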

4. This is the step where the deployment happens: the ARM templates are deployed from the artifacts. overrideParameters contains the environment-specific values, such as linked service properties, which override the defaults from the parameters file.

- task: AzureResourceManagerTemplateDeployment@3
  displayName: 'Deploying to Dev'
  inputs:
    deploymentScope: 'Resource Group'
    azureResourceManagerConnection: '$(azure_service_connection_name)'
    subscriptionId: '$(azure_subscription_id)'
    action: 'Create Or Update Resource Group'
    resourceGroupName: '$(resource_group_name)'
    location: '$(location)'
    templateLocation: 'Linked artifact'
    csmFile: '$(Pipeline.Workspace)/adf-artifact-$(Build.BuildNumber)/ARMTemplateForFactory.json'
    csmParametersFile: '$(Pipeline.Workspace)/adf-artifact-$(Build.BuildNumber)/ARMTemplateParametersForFactory.json'
    overrideParameters: '-factoryName $(azure_data_factory_name) -adls_connection_properties_typeProperties_url "https://$(azure_storage_account_name).dfs.core.windows.net/" -databricks_connection_properties_typeProperties_existingClusterId $(azure_databricks_cluster_id) -keyvault_connection_properties_typeProperties_baseUrl "https://$(azure_keyvault_name).vault.azure.net/"'
    deploymentMode: 'Incremental'

Note: In order to deploy Linked Services successfully, make sure that the service principal or managed identity has sufficient access to the Azure resources referenced from ADF, for example Key Vault, Azure Databricks, and the Storage Account.
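
For reference, the same deployment can be reproduced with the Azure CLI, which is handy for debugging template or parameter issues locally (a sketch with placeholder names; the two template files come from the downloaded artifact):

az deployment group create \
  --resource-group rg-adf-dev \
  --template-file ARMTemplateForFactory.json \
  --parameters @ARMTemplateParametersForFactory.json \
  --parameters factoryName=adf-dataops-dev \
  --mode Incremental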

5. After the deployment is complete, start the ADF triggers that were stopped in step 3, so that the ADF pipelines can continue to run on their configured schedule.

- task: toggle-adf-trigger@2
  displayName: 'START ADF Triggers after Deployment'
  inputs:
    azureSubscription: '$(azure_service_connection_name)'
    ResourceGroupName: '$(resource_group_name)'
    DatafactoryName: '$(azure_data_factory_name)'
    TriggerFilter: 'Dataload_Trigger' # name of the trigger; leave empty to start all triggers
    TriggerStatus: 'start'

6. The final step (optional but useful) is to validate the ADF Linked Services. ADF connects to other Azure services via Linked Services, so it's good practice to check whether they are configured properly.

- task: TestAdfLinkedServiceTask@1
  displayName: 'TEST connection of ADF Linked Services'
  inputs:
    azureSubscription: '$(azure_service_connection_name)'
    ResourceGroupName: '$(resource_group_name)'
    DataFactoryName: '$(azure_data_factory_name)'
    LinkedServiceName: 'keyvault_connection,adls_connection,databricks_connection' # comma-separated names of the linked services whose connections you want to check
    ClientID: '$(service_principal_client_id)'
    ClientSecret: '$(service_principal_client_secret)'

ADF Deployment to DEV Environment

Deployment to New Environment

  • In the sections above, we saw the development flow and the deployment happening on the same ADF, i.e., the ADF in the Dev environment.
  • Development happens in Git mode and deployment happens in Live mode.
  • In this section, we will see how to deploy ADF to a completely new environment (Pre-Prod). We reuse the same artifacts generated in the Build stage, and overrideParameters is used to override the values for the new environment.
- stage: Deploy_to_PreProd
  condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'), eq(variables['build.SourceBranchName'], 'main'))
  displayName: Deploy Pre Prod Stage
  dependsOn: Build_And_Publish_ADF_Artifacts
  jobs:
  - job: Deploy_PreProd
    displayName: 'Deployment - PreProd'
    steps:
    - task: DownloadPipelineArtifact@2
      displayName: 'Download Build Artifacts - ADF ARM templates'
      inputs:
        artifactName: 'adf-artifact-$(Build.BuildNumber)'
        targetPath: '$(Pipeline.Workspace)/adf-artifact-$(Build.BuildNumber)'

    - task: toggle-adf-trigger@2
      inputs:
        azureSubscription: '$(azure_service_connection_name)'
        ResourceGroupName: '$(resource_group_name)'
        DatafactoryName: '$(azure_data_factory_name)'
        TriggerFilter: 'Dataload_Trigger'
        TriggerStatus: 'stop'

    - task: AzureResourceManagerTemplateDeployment@3
      inputs:
        deploymentScope: 'Resource Group'
        azureResourceManagerConnection: '$(azure_service_connection_name)'
        subscriptionId: '$(azure_subscription_id)'
        action: 'Create Or Update Resource Group'
        resourceGroupName: '$(resource_group_name)'
        location: '$(location)'
        templateLocation: 'Linked artifact'
        csmFile: '$(Pipeline.Workspace)/adf-artifact-$(Build.BuildNumber)/ARMTemplateForFactory.json'
        csmParametersFile: '$(Pipeline.Workspace)/adf-artifact-$(Build.BuildNumber)/ARMTemplateParametersForFactory.json'
        overrideParameters: '-factoryName "$(azure_data_factory_name)" -adls_connection_properties_typeProperties_url "https://$(azure_storage_account_name).dfs.core.windows.net/" -databricks_connection_properties_typeProperties_existingClusterId $(azure_databricks_cluster_id) -keyvault_connection_properties_typeProperties_baseUrl "https://$(azure_keyvault_name).vault.azure.net/"'
        deploymentMode: 'Incremental'

    - task: toggle-adf-trigger@2
      inputs:
        azureSubscription: '$(azure_service_connection_name)'
        ResourceGroupName: '$(resource_group_name)'
        DatafactoryName: '$(azure_data_factory_name)'
        TriggerFilter: 'Dataload_Trigger'
        TriggerStatus: 'start'

Here is the link to the full YAML pipeline.

ADF Deployment to Pre-Prod Environment

Adding Custom(New) Parameters in ARM Templates

Sometimes the parameter you want to override is not present in the ARM template parameters file. In that case, you will not be able to override it when deploying to a new environment. To get the parameter into the ARM template parameters file, follow these steps:

  1. Navigate to the ADF portal and go to the Manage tab.
  2. Under the ARM template section, click "Edit parameter configuration" to load the JSON file.
  3. Go to the required section. For example, if your parameter is in a Linked Service, go to the "Microsoft.DataFactory/factories/linkedServices" section.
  4. Under typeProperties, you will see many properties listed. Add the parameter that you want to appear in the ARM template parameters file.
  5. Click OK. This generates a file called "arm-template-parameters-definition.json" in the repo folder where the ADF code is present (an example fragment is shown below).
  6. Run the pipeline again, and you will see the new parameter in the template parameters file.
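
For illustration, a fragment of such a definition might look like the following, where "=" means "parameterize this property, keeping its current value as the default" (the property names are placeholders matching the linked services used earlier):

{
  "Microsoft.DataFactory/factories/linkedServices": {
    "*": {
      "properties": {
        "typeProperties": {
          "url": "=",
          "existingClusterId": "=",
          "baseUrl": "="
        }
      }
    }
  }
}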

Access Considerations

  • By default, ADF creates a system-assigned managed identity with the same name as the data factory.
  • It's best practice to use this managed identity to connect ADF to other Azure services when creating the Linked Services.
  • Make sure that the managed identity has the required access on those resources; usually Contributor access works well. A CLI sketch for granting access follows below.
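
For example, granting the factory's managed identity access to a resource group could look like this (a sketch using the Azure CLI; the 'az datafactory' command requires the datafactory extension, and all names are placeholders):

# one-time: install the Data Factory CLI extension
az extension add --name datafactory

# look up the factory's managed identity (principal id)
principalId=$(az datafactory show \
  --factory-name adf-dataops-dev --resource-group rg-adf-dev \
  --query identity.principalId --output tsv)

# grant it access on the target resource group (scope down further where possible)
az role assignment create \
  --assignee "$principalId" \
  --role "Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/rg-adf-dev"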

Summary

The above pipeline completes the CI/CD process for ADF. By creating an Azure DevOps pipeline, we have successfully automated our data pipeline deployment process.

This approach eliminates the need to manually click the 'Publish' button, and, as shown in the section above, the same build artifact can be deployed to a new (Pre-Prod) environment.

I hope this article on Azure Data Factory CI/CD is helpful. Please let me know if you face any issues or have any queries related to Data Factory Deployments.
