Databricks Delta Lake

Step-by-step guide on setting up Databricks Delta Lake as a destination in RudderStack.


Delta Lake is a popular data lake used for both streaming and batch operations. It lets you store structured, unstructured, and semi-structured data securely and reliably. With features such as support for ACID transactions, scalable metadata management, and schema enforcement, Delta Lake enables you to scale and deliver real-time data insights and analytics directly via your data lake.

RudderStack lets you configure Delta Lake as a destination to which you can send your event data seamlessly.
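For reference, the following is a minimal sketch of sending a test event to RudderStack with the Python SDK; once this destination is connected to your source, such events are synced to Delta Lake on the configured schedule. The write key and data plane URL shown are placeholders.

# Minimal sketch: sending a test event with the RudderStack Python SDK
# (pip install rudder-sdk-python). The write key and data plane URL are placeholders.
import rudder_analytics

rudder_analytics.write_key = "YOUR_SOURCE_WRITE_KEY"             # placeholder
rudder_analytics.data_plane_url = "https://your-data-plane-url"  # placeholder

# This event is delivered to every destination (including Delta Lake) connected
# to the source identified by the write key.
rudder_analytics.track(
    user_id="test-user-1",
    event="Product Viewed",
    properties={"sku": "SKU-123", "price": 25.0},
)

rudder_analytics.flush()  # send any queued events before the script exits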

Find the open-source transformer code for this destination in the RudderStack GitHub repository.

Configuring Delta Lake destination in RudderStack

Before configuring Delta Lake as a destination in RudderStack, it is highly recommended to go through the following sections to obtain the necessary configuration settings. These sections also contain the steps to grant RudderStack and Databricks the necessary permissions to your preferred storage bucket.

To send event data to Delta Lake, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to Delta Lake via RudderStack.

To configure Delta Lake as a destination in RudderStack, follow these steps:

  1. In your RudderStack dashboard, set up the data source. Then, select Databricks Delta Lake from the list of destinations.

  2. Assign a name to your destination and then click on Next.

Connection settings

Enter the following credentials in the Connection Credentials page:

  • Host: Enter your server hostname from the Databricks dashboard. For more information on where to find the server hostname, refer to the Obtaining the JDBC/ODBC configuration section below.

  • Port: Enter the port number. For more information on obtaining the port number, refer to the Obtaining the JDBC/ODBC configuration section below.

  • HTTP Path: Enter the cluster's HTTP path. For more information on obtaining the HTTP path, refer to the Obtaining the JDBC/ODBC configuration section below.

  • Personal Access Token: Enter your Databricks access token. For more information on generating the access token, refer to the Generating the Databricks access token section below.

  • Namespace: Enter the name of the schema where RudderStack will create the tables. If you don't specify a namespace in the dashboard settings, RudderStack sets it to the source name by default.

  • Sync Frequency: Specify how often RudderStack should sync the data to your Delta Lake instance.

  • Sync Starting At: This optional setting lets you specify the particular time of the day (in UTC) when you want RudderStack to sync the data to the Delta Lake instance.

  • Exclude Window: This optional setting lets you specify the time window (in UTC) when RudderStack will skip the data sync.

  • Object Storage Configuration: RudderStack currently supports the following platforms for storing the staging files:

    • Amazon S3

    • Google Cloud Storage

    • Azure Blob Storage

If you select S3 as your storage provider, RudderStack gives you the option to specify the AWS access key and secret access key in the dashboard itself to grant Databricks access to your staging bucket. To do so, enable the Use STS Tokens to copy staging files setting in the dashboard. For more information, refer to the Granting Databricks access to your staging bucket section below.

Granting RudderStack access to your storage bucket

This section contains the steps to edit your bucket policy to grant RudderStack the necessary permissions, depending on your preferred cloud platform.

Amazon S3

Follow these steps to grant RudderStack access to your S3 bucket, depending on which of the following two cases applies to you:

Case 1: Use STS Token to copy staging files is disabled in the dashboard

Follow the steps listed in this section if the Use STS Token to copy staging files option is disabled, i.e. you don't want to specify the AWS access key and secret access key while configuring your Delta Lake destination.

For RudderStack Cloud

If you are using RudderStack Cloud, edit your bucket policy using the following JSON:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::422074288268:user/s3-copy"
    },
    "Action": [
      "s3:GetObject",
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::YOUR_BUCKET_NAME/*",
      "arn:aws:s3:::YOUR_BUCKET_NAME"
    ]
  }]
}

Make sure you replace YOUR_BUCKET_NAME with the name of your S3 bucket.
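If you prefer to apply the policy programmatically rather than through the S3 console, the following is a minimal sketch using boto3; the bucket name is a placeholder and your AWS credentials must allow s3:PutBucketPolicy.

# Sketch: applying the bucket policy above with boto3 instead of the S3 console.
# "YOUR_BUCKET_NAME" is a placeholder for your own S3 bucket.
import json
import boto3

bucket = "YOUR_BUCKET_NAME"
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::422074288268:user/s3-copy"},
        "Action": ["s3:GetObject", "s3:PutObject", "s3:PutObjectAcl", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}/*", f"arn:aws:s3:::{bucket}"],
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))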

For self-hosted RudderStack

If you are self-hosting RudderStack, follow these steps:

  1. Create an IAM policy with the following JSON:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "*",
    "Resource": "arn:aws:s3:::*"
  }]
}

  2. Then, create an IAM user with programmatic access and attach the above IAM policy to this user. Copy the ARN of this newly-created user; it is required in the next step.

  3. Next, edit your bucket policy with the following JSON to allow RudderStack to write to your S3 bucket:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::ACCOUNT_ID:user/USER_ARN"
    },
    "Action": [
      "s3:GetObject",
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::YOUR_BUCKET_NAME/*",
      "arn:aws:s3:::YOUR_BUCKET_NAME"
    ]
  }]
}

Make sure you replace USER_ARN with the ARN copied in the previous step. Also, replace ACCOUNT_ID with your AWS account ID and YOUR_BUCKET_NAME with the name of your S3 bucket.

  4. Finally, add the programmatic access credentials to the env file present in your RudderStack installation, as shown:

RUDDER_AWS_S3_COPY_USER_ACCESS_KEY_ID=<user_access_key>
RUDDER_AWS_S3_COPY_USER_ACCESS_KEY=<user_access_key_secret>
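The following is a minimal boto3 sketch of steps 1 and 2 above: it creates the IAM policy and an IAM user with programmatic access, attaches the policy, and prints the user ARN and the access keys that go into the env file. The policy and user names are placeholders.

# Sketch of steps 1 and 2 using boto3: create the IAM policy, create an IAM user
# with programmatic access, attach the policy, and print the values needed later.
# The policy and user names are placeholders; run with IAM admin credentials.
import json
import boto3

iam = boto3.client("iam")

policy = iam.create_policy(
    PolicyName="rudderstack-s3-copy",  # placeholder name
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "arn:aws:s3:::*"}],
    }),
)

user = iam.create_user(UserName="rudderstack-s3-copy-user")  # placeholder name
iam.attach_user_policy(
    UserName="rudderstack-s3-copy-user",
    PolicyArn=policy["Policy"]["Arn"],
)

keys = iam.create_access_key(UserName="rudderstack-s3-copy-user")["AccessKey"]
print("User ARN (for the bucket policy):", user["User"]["Arn"])
print("RUDDER_AWS_S3_COPY_USER_ACCESS_KEY_ID =", keys["AccessKeyId"])
print("RUDDER_AWS_S3_COPY_USER_ACCESS_KEY =", keys["SecretAccessKey"])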

Case 2: Use STS Token to copy staging files is enabled in the dashboard

Follow the steps listed in this section if the Use STS Token to copy staging files option is enabled in the dashboard.

You can provide the configuration directly while setting up the Delta Lake destination in RudderStack, as shown:

Google Cloud Storage

You can provide the necessary GCS bucket configuration while setting up the Delta Lake destination in RudderStack. For more information, refer to the Google Cloud Storage bucket settings.

Azure Blob Storage

You can provide the necessary Blob Storage container configuration while setting up the Delta Lake destination in RudderStack. For more information, refer to the Azure Blob Storage settings.

Granting Databricks access to your staging bucket

This section contains the steps to grant Databricks the necessary permissions to access your staging bucket, depending on your preferred cloud platform.

Amazon S3

Follow these steps to grant Databricks access to your S3 bucket depending on your case:

Case 1: Use STS Token to copy staging files is disabled in the dashboard

Follow the steps listed in this section if the Use STS Token to copy staging files option is disabled, i.e. you don't want to specify the AWS access key and secret access key while configuring your Delta Lake destination.

In this case, you will need to configure your AWS account to create an instance profile, which is then attached to your Databricks cluster.

Follow these steps in the exact order:

  1. Create an instance profile to access the S3 bucket.

  2. Create a bucket policy for the target S3 bucket.

  3. Note the IAM role used to create the Databricks deployment.

  4. Add the S3 IAM role to the EC2 policy.

  5. Add the instance profile to Databricks.

Case 2: Use STS Token to copy staging files is enabled in the dashboard

Follow the steps listed in this section if the Use STS Token to copy staging files option is enabled, i.e. you are specifying the AWS access key and secret access key in the dashboard while configuring your Delta Lake destination.

Add the following Spark configuration to your Databricks cluster:

spark.hadoop.fs.s3.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl.disable.cache true
spark.hadoop.fs.s3a.impl.disable.cache true
spark.hadoop.fs.s3n.impl.disable.cache true

For more information on adding custom Spark configuration properties in a Databricks cluster, refer to the Databricks Spark configuration guide.

Google Cloud Storage

To grant Databricks access to your GCS bucket, follow these steps:

  1. Follow the steps listed in the Databricks documentation to set up the required role and user permissions for your GCS bucket.

  2. Then, add the following Spark configuration to your Databricks cluster:

spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>

  3. Finally, replace the following fields with the values from the service account JSON downloaded in the previous step: <project_id>, <private_key>, <private_key_id>, <client_email>.

For more information on adding custom Spark configuration properties in a Databricks cluster, refer to the Databricks Spark configuration guide.
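If you want to avoid copy-paste errors, the following sketch reads the downloaded service account JSON (the file path is a placeholder) and prints the four Spark properties with their values.

# Sketch: read the downloaded service account key JSON and print the four Spark
# properties above with their values. The file path is a placeholder.
import json

with open("service-account-key.json") as f:  # placeholder path
    key = json.load(f)

print("spark.hadoop.fs.gs.auth.service.account.email", key["client_email"])
print("spark.hadoop.fs.gs.project.id", key["project_id"])
print("spark.hadoop.fs.gs.auth.service.account.private.key", key["private_key"])
print("spark.hadoop.fs.gs.auth.service.account.private.key.id", key["private_key_id"])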

Azure Blob Storage

To grant Databricks access to your Azure Blob Storage container, follow these steps:

  1. Add the following Spark configuration to your Databricks cluster:

spark.hadoop.fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>

  2. Replace the following fields with the relevant values from your Blob Storage account settings: <storage-account-name>, <storage-account-access-key>.

For more information on adding custom Spark configuration properties in a Databricks cluster, refer to the Databricks Spark configuration guide.
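Before adding the configuration, you can optionally sanity-check the storage account name and access key with a short script. The sketch below assumes the azure-storage-blob Python package; the account name and key are placeholders.

# Sketch: verify the storage account name and access key before adding them to the
# Spark configuration. Assumes the azure-storage-blob package; values are placeholders.
from azure.storage.blob import BlobServiceClient

account_name = "yourstorageaccount"              # placeholder
account_key = "YOUR_STORAGE_ACCOUNT_ACCESS_KEY"  # placeholder

client = BlobServiceClient(
    account_url=f"https://{account_name}.blob.core.windows.net",
    credential=account_key,
)

# Listing containers fails immediately if the account name or key is wrong.
for container in client.list_containers():
    print(container.name)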

Creating a new Databricks cluster

To create a new Databricks cluster, follow these steps:

  1. Sign into your Databricks account. Then, click on the Compute option on the dashboard, as shown:

  2. Click on the Create Cluster option.

  3. Next, enter the cluster details. Fill in the Cluster Name, as shown:

  4. Select the Cluster Mode depending on your use-case. The following image highlights the three cluster modes:

  5. Then, select the Databricks Runtime Version as 7.1 or higher, as shown:

  6. Configure the rest of the settings as per your requirement.

  7. In the Advanced Options section, configure the Instances field as shown in the following image:

  8. In the Instance Profile dropdown menu, select the Databricks instance profile that you added to your account while granting Databricks access to your staging bucket.

  9. Finally, click on the Create Cluster button to complete the configuration and create the Databricks cluster.
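Alternatively, if you manage clusters programmatically, the following sketch creates an equivalent cluster through the Databricks Clusters API (POST /api/2.0/clusters/create). The workspace URL, token, cluster name, node type, runtime version, and instance profile ARN are placeholders; adjust them to your setup.

# Sketch: creating an equivalent cluster through the Databricks Clusters API
# instead of the UI. All values below are placeholders for your own setup.
import requests

DATABRICKS_HOST = "https://dbc-xxxxxxxx-xxxx.cloud.databricks.com"  # placeholder
DATABRICKS_TOKEN = "dapiXXXXXXXXXXXXXXXX"                           # placeholder

cluster_spec = {
    "cluster_name": "rudderstack-deltalake",  # placeholder name
    "spark_version": "9.1.x-scala2.12",       # any Databricks runtime 7.1 or higher
    "node_type_id": "i3.xlarge",              # placeholder node type
    "num_workers": 1,
    "aws_attributes": {
        # Instance profile created while granting Databricks access to the S3 bucket.
        "instance_profile_arn": "arn:aws:iam::ACCOUNT_ID:instance-profile/PROFILE_NAME"
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])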

Obtaining the JDBC/ODBC configuration

Follow these steps to get the JDBC/ODBC configuration:

  1. In your Databricks dashboard, click on the Compute option, as shown:

  2. Then, select the cluster you created in the previous section.

  3. In the Advanced Options section, select the JDBC/ODBC field and copy the Server Hostname, Port, and HTTP Path values, as shown:

The Server Hostname, Port, and HTTP Path values are required to configure Delta Lake as a destination in RudderStack.
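Optionally, you can verify these values before entering them in the RudderStack dashboard. The sketch below assumes the databricks-sql-connector Python package; the hostname, HTTP path, and token are placeholders.

# Sketch: test the connection values with the databricks-sql-connector package
# (pip install databricks-sql-connector). All three values are placeholders.
from databricks import sql

connection = sql.connect(
    server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",            # Server Hostname
    http_path="sql/protocolv1/o/1234567890123456/0123-456789-abcde123",  # HTTP Path
    access_token="dapiXXXXXXXXXXXXXXXX",                                 # Personal Access Token
)

cursor = connection.cursor()
cursor.execute("SELECT 1")  # a trivial query confirms the cluster is reachable
print(cursor.fetchone())

cursor.close()
connection.close()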

Generating the Databricks access token

To generate the Databricks access token, follow these steps:

  1. In your Databricks dashboard, go to Settings and click on User Settings, as shown:

  2. Then, go to the Access Tokens section and click on Generate New Token, as shown:

  3. Enter your comment in the Comment field and click on Generate, as shown:

Keep the Lifetime (days) field blank. If you enter a number, your access token will expire after that number of days.

  4. Finally, copy the access token as it will be used during the Delta Lake destination setup in RudderStack.
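If you prefer to automate this step, the following sketch generates the token through the Databricks Token API (POST /api/2.0/token/create); the workspace URL and the existing token used for authentication are placeholders. Omitting lifetime_seconds creates a non-expiring token, consistent with leaving the Lifetime (days) field blank.

# Sketch: generate the access token via the Databricks Token API instead of the UI.
# The workspace URL and the token used for authentication are placeholders.
import requests

DATABRICKS_HOST = "https://dbc-xxxxxxxx-xxxx.cloud.databricks.com"  # placeholder
EXISTING_TOKEN = "dapiXXXXXXXXXXXXXXXX"                             # placeholder

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/token/create",
    headers={"Authorization": f"Bearer {EXISTING_TOKEN}"},
    # lifetime_seconds is omitted on purpose so the token does not expire,
    # mirroring the advice to keep the Lifetime (days) field blank.
    json={"comment": "RudderStack Delta Lake destination"},
)
resp.raise_for_status()
print(resp.json()["token_value"])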

IPs to be whitelisted

You will need to whitelist the following RudderStack IPs to enable network access:

  • 3.216.35.97

  • 34.198.90.241

  • 54.147.40.62

  • 23.20.96.9

  • 18.214.35.254

  • 35.83.226.133

  • 52.41.61.208

  • 44.227.140.138

  • 54.245.141.180

  • 3.66.99.198

  • 3.64.201.167

If you have your deployment in the EU region, you can whitelist only the following two IPs:

  • 3.66.99.198

  • 3.64.201.167

All the outbound traffic is routed through these RudderStack IPs.
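If your Databricks workspace restricts inbound traffic with IP access lists, the following sketch adds the above IPs to an allow list via the Databricks IP Access Lists API. The workspace URL and token are placeholders, and the call assumes IP access lists are enabled for your workspace.

# Sketch: add the RudderStack IPs to a Databricks IP access list
# (POST /api/2.0/ip-access-lists). Assumes IP access lists are enabled for your
# workspace; the workspace URL and token are placeholders.
import requests

DATABRICKS_HOST = "https://dbc-xxxxxxxx-xxxx.cloud.databricks.com"  # placeholder
DATABRICKS_TOKEN = "dapiXXXXXXXXXXXXXXXX"                           # placeholder

RUDDERSTACK_IPS = [
    "3.216.35.97", "34.198.90.241", "54.147.40.62", "23.20.96.9",
    "18.214.35.254", "35.83.226.133", "52.41.61.208", "44.227.140.138",
    "54.245.141.180", "3.66.99.198", "3.64.201.167",
]

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/ip-access-lists",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json={
        "label": "rudderstack",
        "list_type": "ALLOW",
        "ip_addresses": [f"{ip}/32" for ip in RUDDERSTACK_IPS],
    },
)
resp.raise_for_status()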

FAQs

What are the reserved keys for Delta Lake?

Refer to the Databricks documentation for a complete list of the reserved keywords.

How does RudderStack handle the reserved words in a column, table, or schema?

There are some limitations when it comes to using reserved words as a schema, table, or column name. If such words are used in event names, traits, or properties, they will be prefixed with a _ when RudderStack creates tables or columns for them in your schema.

Also, integers are not allowed at the start of a schema or table name. Hence, such schema, column, or table names will be prefixed with a _. For example, '25dollarpurchase' will be changed to '_25dollarpurchase'.
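The following is an illustrative sketch of this naming rule (not RudderStack's actual implementation), using a placeholder subset of reserved keywords:

# Illustrative sketch of the naming rule described above, not RudderStack's actual
# implementation. RESERVED is a placeholder subset of the reserved keywords.
RESERVED = {"select", "from", "where", "table"}

def safe_name(name: str) -> str:
    if name.lower() in RESERVED or name[:1].isdigit():
        return "_" + name
    return name

print(safe_name("25dollarpurchase"))  # -> _25dollarpurchase
print(safe_name("from"))              # -> _from
print(safe_name("order_value"))       # -> order_value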

Contact us

For queries on any of the sections covered in this guide, you can contact us or start a conversation in our Slack community.
