Tableflow on Azure with Databricks

Goal: Enable Tableflow, store the data in your Azure Blob Storage, and analyse it in Databricks.

General Resources

Prerequisites

  • Confluent Cloud
  • Azure Databricks

Connect Azure and Confluent

Configure Azure Storage Account
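
As a rough sketch with the Azure CLI (all names below are placeholders, not the ones used in this repository):

# Resource group and storage account
az group create --name tableflow-rg --location westeurope
# Hierarchical namespace is enabled here on the assumption that Databricks will later access the account as ADLS Gen2 (abfss)
az storage account create \
  --name tableflowdemo \
  --resource-group tableflow-rg \
  --location westeurope \
  --sku Standard_LRS \
  --kind StorageV2 \
  --enable-hierarchical-namespace true
# Container that Tableflow will write the table data into
az storage container create --name tableflow --account-name tableflowdemo --auth-mode login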

Set up Azure Integration in Confluent

Create the service principal with the application ID shown in the Confluent provider integration window.

# Sign in to Azure
az login
# Create the service principal for the application ID provided by Confluent
az ad sp create --id d4da7410-16c9-4bd0-b35b-f449a6d2947a

We need the appDisplayName of the service principal for the role assignments. Assign both roles mentioned by Confluent to the storage account, as sketched below.
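
A sketch with the Azure CLI; the role name below is only an example, so assign exactly the two roles Confluent lists, and replace the subscription, resource group, and storage account placeholders:

# Look up the appDisplayName of the service principal
az ad sp show --id d4da7410-16c9-4bd0-b35b-f449a6d2947a --query appDisplayName -o tsv

# Assign a role on the storage account (repeat for the second role)
az role assignment create \
  --assignee d4da7410-16c9-4bd0-b35b-f449a6d2947a \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>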

Enter the Microsoft Entra ID tenant ID and validate the connection.
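
If in doubt, the tenant ID can be read from the CLI:

az account show --query tenantId -o tsv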

Set up Tableflow

We create a cluster and a Datagen Connector.
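
Both can also be created with the Confluent CLI; a rough sketch (cluster name, region, topic, and API credentials are placeholders, and the connector options may differ from the ones used here):

# Basic cluster on Azure
confluent kafka cluster create tableflow-demo --cloud azure --region westeurope --type basic
confluent kafka cluster use <cluster-id>

# Topic the Datagen connector writes to
confluent kafka topic create orders

# Managed Datagen source connector, configured in datagen.json, e.g.:
# { "name": "DatagenSource_orders", "connector.class": "DatagenSource",
#   "kafka.auth.mode": "KAFKA_API_KEY", "kafka.api.key": "<api-key>", "kafka.api.secret": "<api-secret>",
#   "kafka.topic": "orders", "output.data.format": "AVRO", "quickstart": "ORDERS", "tasks.max": "1" }
confluent connect cluster create --config-file datagen.json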

After some time, we see the data in Azure Blob Storage.
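
To verify from the CLI (account and container names are placeholders):

az storage blob list --account-name tableflowdemo --container-name tableflow --auth-mode login --output table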

Integrate with Unity Catalog

Set up Databricks

Note

I am not a Databricks expert. The following steps are based on the documentation; however, some steps may be unnecessary, and certain role bindings or permissions may not follow the principle of least privilege.

For the managed identity unity-catalog-access-connector (created automatically in Azure), we grant the following roles on the storage account in Azure (see the sketch after this list):

  • Storage Blob Data Contributor
  • Storage Queue Data Contributor
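
A sketch of the assignments with the Azure CLI (the object ID of the access connector's managed identity and the scope are placeholders):

# Repeat with --role "Storage Queue Data Contributor"
az role assignment create \
  --assignee-object-id <access-connector-principal-id> \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>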

Create an external location and test the connection (every check succeeds except for "File Events Read").
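
I created the storage credential and external location in the Databricks UI; with the newer Databricks CLI the external location should look roughly like this sketch (credential, container, and account names are placeholders; check databricks external-locations create --help for the exact syntax):

databricks external-locations create tableflow-ext-loc \
  abfss://tableflow@tableflowdemo.dfs.core.windows.net/ \
  <storage-credential-name>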

Enable external data access

Note

Since I am using a personal Azure tenant, the setup was complex because my primary user has a non-Microsoft email address. I resolved this by logging into the Databricks account console with an Azure user assigned the Global Administrator role in Azure. From there, I enabled Delta Sharing and assigned the Account Admin role to my personal identity. This finally allowed me to enable External Data Access.

Create a managed service principal in Databricks and generate a secret for it. Copy the secret for later use.
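
A sketch with the Databricks CLI (the display name is a placeholder; I generated the OAuth secret in the UI, which is where it gets copied from):

databricks service-principals create --display-name tableflow-catalog-sp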

Provide the following permissions to the service principal on the external location and the catalog. Note that these differ from what is stated in the documentation (framed in red).

Set up Tableflow Catalog Integration

Use the client ID and secret of the service principal created in Databricks.

Finally, we see the data in Databricks 🥳
