Databricks Delta Lake

Step-by-step guide on setting up Databricks Delta Lake as a destination in RudderStack.

Delta Lake is a popular data lake used for both streaming and batch operations. It lets you store structured, unstructured, and semi-structured data securely and reliably. With features such as support for ACID transactions, scalable metadata management, and schema enforcement, Delta Lake enables you to scale and deliver real-time data insights and analytics directly via your data lake.

RudderStack lets you configure Delta Lake as a destination to which you can send your event data seamlessly.

Find the open-source transformer code for this destination in the GitHub repository.

Configuring Delta Lake destination in RudderStack

Before configuring Delta Lake as a destination in RudderStack, it is highly recommended to go through the following sections to obtain the necessary configuration settings. These sections also cover the steps for granting RudderStack and Databricks the required permissions to your preferred storage bucket.

To send event data to Delta Lake, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to Delta Lake via RudderStack.
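For context, once the source and destination are connected, any event sent to the source is delivered to Delta Lake on the next sync. Below is a minimal sketch assuming the RudderStack Python SDK (rudder-sdk-python); the write key and data plane URL are placeholders for your own workspace values:

import rudder_analytics

# Placeholders; use your source write key and data plane URL.
rudder_analytics.write_key = '<SOURCE_WRITE_KEY>'
rudder_analytics.data_plane_url = 'https://<DATA_PLANE_URL>'

# With the Delta Lake destination enabled, this event is staged to your
# object storage and synced to Delta Lake on the configured schedule.
rudder_analytics.track('user-123', 'Order Completed', {
    'revenue': 25,
    'currency': 'USD'
})
rudder_analytics.flush()  # force-send any queued events before exiting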

To configure Delta Lake as a destination in RudderStack, follow these steps:

  1. In your RudderStack dashboard, set up the data source. Then, select Databricks Delta Lake from the list of destinations.

  2. Assign a name to your destination and then click on Next.

Connection settings

Enter the following credentials on the Connection Credentials page:

  • Host: Enter your server hostname from the Databricks dashboard.

For more information on where to find the server hostname, refer to the Obtaining the JDBC/ODBC configuration section below.

  • Port: Enter the port number.

For more information on obtaining the port number, refer to the Obtaining the JDBC/ODBC configuration section below.

  • HTTP Path: Enter the cluster's HTTP path.

For more information on obtaining the HTTP path, refer to the Obtaining the JDBC/ODBC configuration section below.

  • Personal Access Token: Enter your Databricks access token.

For more information on generating the access token, refer to the Generating the Databricks access token section below.

  • Namespace: Enter the name of the schema where RudderStack creates the tables. If you don't specify a namespace in the dashboard settings, RudderStack sets it to the source name by default.

  • Sync Frequency: Specify how often RudderStack should sync the data to your Delta Lake instance.

  • Sync Starting At: This optional setting lets you specify the time of day (in UTC) when you want RudderStack to sync the data to the Delta Lake instance.

  • Exclude Window: This optional setting lets you specify the time window (in UTC) when RudderStack will skip the data sync.

  • Object Storage Configuration: RudderStack currently supports the following platforms for storing the staging files:

    • Amazon S3

    • Google Cloud Storage

    • Azure Blob Storage

If you select S3 as your storage provider, RudderStack gives you the option to specify the AWS access key and secret access key in the dashboard itself, to grant Databricks access to your staging bucket. To do so, enable the Use STS Tokens to copy staging files setting in the dashboard. For more information, refer to the Amazon S3 storage bucket settings section below.

Granting RudderStack access to your storage bucket

This section contains the steps to edit your bucket policy to grant RudderStack the necessary permissions, depending on your preferred cloud platform.

Amazon S3

Follow these steps to grant RudderStack access to your S3 bucket based on the following two cases:

Case 1: Use STS Token to copy staging files is disabled in the dashboard

Follow the steps listed in this section if the Use STS Token to copy staging files option is disabled, i.e. you don't want to specify the AWS access key and secret access key while configuring your Delta Lake destination.

For RudderStack Cloud

If you are using RudderStack Cloud, edit your bucket policy using the following JSON:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::422074288268:user/s3-copy"
    },
    "Action": [
      "s3:GetObject",
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::YOUR_BUCKET_NAME/*",
      "arn:aws:s3:::YOUR_BUCKET_NAME"
    ]
  }]
}

Make sure you replace YOUR_BUCKET_NAME with the name of your S3 bucket.
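If you manage your bucket programmatically, the same policy can be applied with boto3. This is a sketch assuming locally configured AWS credentials; note that put_bucket_policy overwrites any existing bucket policy, so merge statements first if you already have one:

import json
import boto3

BUCKET = 'YOUR_BUCKET_NAME'  # replace with your S3 bucket name

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::422074288268:user/s3-copy"},
        "Action": ["s3:GetObject", "s3:PutObject", "s3:PutObjectAcl", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{BUCKET}/*", f"arn:aws:s3:::{BUCKET}"]
    }]
}

boto3.client('s3').put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))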

For self-hosted RudderStack

If you are self-hosting RudderStack, follow these steps:

  1. Create an IAM policy with the following JSON:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "*",
    "Resource": "arn:aws:s3:::*"
  }]
}
  2. Then, create an IAM user with programmatic access and attach the above IAM policy to this user. (These steps are also scripted in the sketch after this list.)

Copy the ARN of this newly created user. It is required in the next step.

  3. Next, edit your bucket policy with the following JSON to allow RudderStack to write to your S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::ACCOUNT_ID:user/USER_ARN"
    },
    "Action": [
      "s3:GetObject",
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::YOUR_BUCKET_NAME/*",
      "arn:aws:s3:::YOUR_BUCKET_NAME"
    ]
  }]
}

Make sure you replace USER_ARN with the ARN copied in the previous step. Also, replace ACCOUNT_ID with your AWS account ID and YOUR_BUCKET_NAME with the name of your S3 bucket.

  4. Finally, add the programmatic access credentials to the env file present in your RudderStack installation, as shown:

RUDDER_AWS_S3_COPY_USER_ACCESS_KEY_ID=<user_access_key>
RUDDER_AWS_S3_COPY_USER_ACCESS_KEY=<user_access_key_secret>
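For reference, steps 2 to 4 can also be scripted. The following boto3 sketch is not an official setup script; the user and policy names are hypothetical, and the printed ARN and keys feed into the bucket policy and env file above:

import json
import boto3

iam = boto3.client('iam')

# Hypothetical names; choose your own.
user = iam.create_user(UserName='rudderstack-s3-copy')['User']
policy = iam.create_policy(
    PolicyName='rudderstack-s3-copy-policy',
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "arn:aws:s3:::*"}]
    })
)['Policy']
iam.attach_user_policy(UserName=user['UserName'], PolicyArn=policy['Arn'])

# Programmatic access credentials for the env file, plus the ARN
# to substitute into the bucket policy in step 3.
keys = iam.create_access_key(UserName=user['UserName'])['AccessKey']
print('User ARN:', user['Arn'])
print('RUDDER_AWS_S3_COPY_USER_ACCESS_KEY_ID=' + keys['AccessKeyId'])
print('RUDDER_AWS_S3_COPY_USER_ACCESS_KEY=' + keys['SecretAccessKey'])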

Case 2: Use STS Token to copy staging files is enabled in the dashboard

Follow the steps listed in this section if the Use STS Token to copy staging files option is enabled in the dashboard.

You can provide the configuration directly while setting up the Delta Lake destination in RudderStack.

Google Cloud Storage

You can provide the necessary GCS bucket configuration while setting up the Delta Lake destination in RudderStack. For more information, refer to the Google Cloud Storage bucket settings.

Azure Blob Storage

You can provide the necessary Blob Storage container configuration while setting up the Delta Lake destination in RudderStack. For more information, refer to the Azure Blob Storage settings.

Granting Databricks access to your staging bucket

This section contains the steps to grant Databricks the necessary permissions to access your staging bucket, depending on your preferred cloud platform.

Amazon S3

Follow these steps to grant Databricks access to your S3 bucket depending on your case:

Case 1: Use STS Token to copy staging files is disabled in the dashboard

Follow the steps listed in this section if the Use STS Token to copy staging files option is disabled, i.e. you don't want to specify the AWS access key and secret access key while configuring your Delta Lake destination.

In this case, you will need to configure your AWS account to create an instance profile, which is then attached to your Databricks cluster.

Follow the steps for creating the instance profile and attaching it to your cluster in the exact order described in the Databricks documentation on instance profiles.

Case 2: Use STS Token to copy staging files is enabled in the dashboard

Follow the steps listed in this section if the Use STS Token to copy staging files option is enabled, i.e. you are specifying the AWS access key and secret access key in the dashboard while configuring your Delta Lake destination.

Add the following Spark configuration to your Databricks cluster:

spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl.disable.cache true
spark.hadoop.fs.s3a.impl.disable.cache true
spark.hadoop.fs.s3n.impl.disable.cache true

For more information on adding custom Spark configuration properties in a Databricks cluster, refer to the Spark configuration guide.
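To sanity-check that the properties are in place, one option is the Databricks Clusters REST API. This is a sketch; the workspace URL, token, and cluster ID are placeholders, and the endpoint only reflects configuration set at the cluster level:

import requests

HOST = 'https://<your-workspace>.cloud.databricks.com'  # placeholder
TOKEN = '<personal_access_token>'                       # placeholder
CLUSTER_ID = '<cluster_id>'                             # placeholder

resp = requests.get(
    f'{HOST}/api/2.0/clusters/get',
    headers={'Authorization': f'Bearer {TOKEN}'},
    params={'cluster_id': CLUSTER_ID},
)
resp.raise_for_status()
spark_conf = resp.json().get('spark_conf', {})

# Each of the fs.s3* keys above should come back with the shaded S3A value.
for key in ('spark.hadoop.fs.s3.impl',
            'spark.hadoop.fs.s3a.impl',
            'spark.hadoop.fs.s3n.impl'):
    print(key, '->', spark_conf.get(key, 'MISSING'))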

Google Cloud Storage

To grant Databricks access to your GCS bucket, follow these steps:

  1. Follow the steps listed in this user permissions section to set up the required role and user permissions.

  2. Then, add the following Spark configuration to your Databricks cluster:

spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>

For more information on adding custom Spark configuration properties in a Databricks cluster, refer to the Spark configuration guide.

  3. Finally, replace the following fields with the values obtained from the JSON file downloaded in the previous step: <project_id>, <private_key>, <private_key_id>, <client_email>.
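As a convenience, a small helper along these lines can fill in the values for you; the key file name is a placeholder for the JSON you downloaded:

import json

# Placeholder path to the downloaded service account key file.
with open('service-account.json') as f:
    key = json.load(f)

# Prints the four Spark configuration lines with the values substituted.
# Note that the private key spans multiple lines.
print('spark.hadoop.fs.gs.auth.service.account.email', key['client_email'])
print('spark.hadoop.fs.gs.project.id', key['project_id'])
print('spark.hadoop.fs.gs.auth.service.account.private.key', key['private_key'])
print('spark.hadoop.fs.gs.auth.service.account.private.key.id', key['private_key_id'])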

Azure Blob Storage

To grant Databricks access to your Azure Blob Storage container, follow these steps:

  1. Add the following Spark configuration to your Databricks cluster.

spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>

For more information on adding custom Spark configuration properties in a Databricks cluster, refer to the Spark configuration guide.

  2. Replace the following fields with the relevant values from your Blob Storage account settings: <storage-account-name>, <storage-account-access-key>.
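Similarly, a short sketch can render the configuration line with your values substituted (both values are placeholders):

# Placeholders; copy both values from your Blob Storage account settings.
account = '<storage-account-name>'
access_key = '<storage-account-access-key>'
print(f'spark.hadoop.fs.azure.account.key.{account}.blob.core.windows.net {access_key}')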

Creating a new Databricks cluster

To create a new Databricks cluster, follow these steps:

  1. Sign in to your Databricks account and click on the Compute option on the dashboard.

  2. Click on the Create Cluster option.

  3. Next, enter the cluster details and fill in the Cluster Name.

  4. Select the Cluster Mode depending on your use case.

  5. Then, select the Databricks Runtime Version as 7.1 or higher.

  6. Configure the rest of the settings as per your requirement.

  7. In the Advanced Options section, configure the Instances field.

  8. In the Instance Profile dropdown menu, select the Databricks instance profile that you added to your account in the previous step.

  9. Finally, click on the Create Cluster button to complete the configuration and create the Databricks cluster.

Obtaining the JDBC/ODBC configuration

Follow these steps to get the JDBC/ODBC configuration:

  1. In your Databricks dashboard, click on the Compute option.

  2. Then, select the cluster you created in the previous section.

  3. In the Advanced Options section, select the JDBC/ODBC field and copy the Server Hostname, Port, and HTTP Path values.

The Server Hostname, Port, and HTTP Path values are required to configure Delta Lake as a destination in RudderStack.
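Before saving the destination, you can optionally verify these values by opening a connection from Python. This sketch assumes the databricks-sql-connector package; the three values are placeholders for what you copied above:

from databricks import sql

conn = sql.connect(
    server_hostname='<server_hostname>',     # placeholder
    http_path='<http_path>',                 # placeholder
    access_token='<personal_access_token>',  # placeholder; see the next section
)
with conn.cursor() as cursor:
    cursor.execute('SELECT 1')
    print(cursor.fetchall())  # [(1,)] confirms the cluster is reachable
conn.close()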

Generating the Databricks access token

To generate the Databricks access token, follow these steps:

  1. In your Databricks dashboard, go to Settings and click on User Settings.

  2. Then, go to the Access Tokens section and click on Generate New Token.

  3. Enter a comment in the Comment field and click on Generate.

Keep the Lifetime (days) field blank. If you enter a number, your access token will expire after that number of days.

  4. Finally, copy the access token; it is required when configuring the Delta Lake destination in RudderStack.
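If you prefer to script token creation, the Databricks Token REST API provides an equivalent of these steps. This sketch assumes you already hold a valid token to authenticate the request; the workspace URL and existing token are placeholders:

import requests

HOST = 'https://<your-workspace>.cloud.databricks.com'  # placeholder
EXISTING_TOKEN = '<existing_access_token>'              # placeholder

resp = requests.post(
    f'{HOST}/api/2.0/token/create',
    headers={'Authorization': f'Bearer {EXISTING_TOKEN}'},
    # Omitting lifetime_seconds mirrors leaving the Lifetime field blank.
    json={'comment': 'RudderStack Delta Lake destination'},
)
resp.raise_for_status()
print(resp.json()['token_value'])  # paste this into the destination settings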

IPs to be whitelisted

You will need to whitelist the following RudderStack IPs to enable network access:

  • 3.216.35.97

  • 34.198.90.241

  • 54.147.40.62

  • 23.20.96.9

  • 18.214.35.254

  • 35.83.226.133

  • 52.41.61.208

  • 44.227.140.138

  • 54.245.141.180

  • 3.66.99.198

  • 3.64.201.167

If you have your deployment in the EU region, you can whitelist only the following two IPs:

  • 3.66.99.198

  • 3.64.201.167

All the outbound traffic is routed through these RudderStack IPs.

FAQs

What are the reserved keys for Delta Lake?

Refer to this documentation for a complete list of the reserved keywords.

How does RudderStack handle the reserved words in a column, table, or schema?

There are some limitations around using reserved words as a schema, table, or column name. If such words are used in event names, traits, or properties, RudderStack prefixes them with a _ when creating tables or columns for them in your schema.

Also, a schema, table, or column name cannot start with an integer. Such names are prefixed with a _. For example, '25dollarpurchase' becomes '_25dollarpurchase'.
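As a rough illustration of these renaming rules (a toy sketch, not RudderStack's actual implementation, with only an illustrative subset of reserved words):

import re

RESERVED = {'select', 'table', 'from', 'where'}  # illustrative subset only

def sanitize(name: str) -> str:
    # Prefix reserved words and names starting with a digit with an underscore.
    if name.lower() in RESERVED or re.match(r'^\d', name):
        return '_' + name
    return name

print(sanitize('25dollarpurchase'))  # _25dollarpurchase
print(sanitize('table'))             # _table
print(sanitize('revenue'))           # revenue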

Contact us

For queries on any of the sections covered in this guide, you can contact us or start a conversation in our Slack community.
