Amazon S3 Data Lake

Step-by-step guide on setting up S3 Data Lake as a destination in RudderStack.

Last updated 3 years ago

Amazon S3 is a popular object storage service used to store both structured and unstructured data. You can leverage S3 to securely and cost-effectively build a data lake of any size or scale. With an S3-powered data lake, you can easily use the native AWS services for data processing, analytics, machine learning, and more.

RudderStack lets you configure S3 data lake as a destination to which you can send your event data seamlessly.

Refer to the Warehouse Schemas guide for more information on how the events are mapped to the tables in S3 data lake.

Find the open source transformer code for this destination in the GitHub repository.

Configuring S3 Data Lake destination in RudderStack

To send event data to S3 Data Lake, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to S3 Data Lake via RudderStack.

To configure S3 Data Lake as a destination in RudderStack, follow these steps:

  1. In your RudderStack dashboard, set up the data source. Then, select S3 Data Lake from the list of destinations.

  2. Assign a name to your destination and then click Next.

Connection settings

Enter the following credentials on the Connection Credentials page:

  • S3 Storage Bucket Name: The name of the S3 bucket that will be used to store the data before loading it into the S3 data lake.

  • Register schema on AWS Glue: If you enable this option, RudderStack registers the schema of your incoming data on AWS Glue's data catalog.

For more information on registering your schema in AWS Glue, refer to the AWS Glue documentation.

  • AWS Glue Region: Your AWS Glue region. For example, for N. Virginia, it would be us-east-1.

  • S3 Prefix: If specified, RudderStack creates a folder in the bucket with this prefix and pushes all the data within that folder.

  • Namespace: If specified, all the data for the destination will be pushed to s3://<bucketName>/<prefix>/rudder-datalake/<namespace>. If you don't specify a namespace in the settings, it is set to the source name, by default.

If AWS Glue is enabled, all the table definitions are created in a database with the name set to this namespace.

  • AWS Access Key ID: Your AWS access key ID.

  • AWS Secret Access Key: Your AWS secret access key.

Make sure the above credentials (Access Key ID and Secret Access Key) have permission to read from and write to the configured bucket.

If AWS Glue is enabled, make sure that the following permissions are granted to it:

  • glue:CreateTable

  • glue:UpdateTable

  • glue:CreateDatabase

  • glue:GetTables
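The permissions above could be expressed as an IAM policy document along the following lines. This is a minimal sketch, not RudderStack's official policy: the Glue actions are the four listed in this guide, while the S3 actions (`s3:GetObject`, `s3:PutObject`, `s3:ListBucket`), the `"*"` Glue resource, and the `build_policy` helper are assumptions chosen for illustration.

```python
import json

# Glue actions listed in this guide.
GLUE_ACTIONS = [
    "glue:CreateTable",
    "glue:UpdateTable",
    "glue:CreateDatabase",
    "glue:GetTables",
]

# Assumption: the guide only says "read and write"; these are commonly used
# S3 actions for that purpose and may need adjusting for your setup.
S3_OBJECT_ACTIONS = ["s3:GetObject", "s3:PutObject"]
S3_BUCKET_ACTIONS = ["s3:ListBucket"]


def build_policy(bucket_name: str) -> dict:
    """Return an IAM policy document (as a dict) for the given bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": S3_OBJECT_ACTIONS,
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                "Effect": "Allow",
                "Action": S3_BUCKET_ACTIONS,
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                "Effect": "Allow",
                "Action": GLUE_ACTIONS,
                # Assumption: "*" keeps the sketch short; scope it down in practice.
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy("testBucket"), indent=2))
```

The printed JSON can be pasted into the IAM console as a starting point and then tightened to match your own security requirements.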

Finding your data in S3 data lake

RudderStack converts your events into Parquet files and dumps them to the configured S3 bucket. Before dumping the events, RudderStack groups them by event name and by the hour (UTC) in which they were received.

The folder structure is shown below:

s3://<bucketName>/<prefix>/rudder-datalake/<namespace>/<tableName>/YYYY/MM/DD/HH
  • prefix: This is the S3 prefix in the destination settings. If not specified, RudderStack will omit this value.

  • namespace: The namespace specified in the destination settings. If not specified, RudderStack sets this field to the source name by default.

  • tableName: RudderStack sets this to the event name.

YYYY, MM, DD, and HH are replaced by actual time values. A combination of these values represents the UTC time.

Suppose that RudderStack tracks the following two events:

| Event name | Timestamp |
| --- | --- |
| Product Purchased | "2019-10-12T08:40:50.52Z" UTC |
| Cart Viewed | "2019-11-12T09:34:50.52Z" UTC |

RudderStack will convert these events into Parquet files and dump them into the following folders:

| Event name | Folder location |
| --- | --- |
| Product Purchased | s3://<bucketName>/<prefix>/rudder-datalake/<namespace>/product_purchased/2019/10/12/08 |
| Cart Viewed | s3://<bucketName>/<prefix>/rudder-datalake/<namespace>/cart_viewed/2019/11/12/09 |

If AWS Glue is enabled, all the table definitions are created in a database with the name set to the namespace specified in the destination settings.
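To make the folder layout in this section concrete, the following sketch computes the folder for a given event and timestamp. The helper names (`to_table_name`, `folder_for_event`) are hypothetical, and the naming rule (lowercase, with runs of non-alphanumeric characters replaced by an underscore) is an assumption inferred from the two examples in this section, not RudderStack's actual implementation.

```python
import re
from datetime import datetime, timezone


def to_table_name(event_name: str) -> str:
    """Assumed rule: lowercase, non-alphanumeric runs become a single "_"."""
    return re.sub(r"[^a-z0-9]+", "_", event_name.lower()).strip("_")


def folder_for_event(bucket: str, namespace: str, event_name: str,
                     received_at: str, prefix: str = "") -> str:
    """Build s3://<bucket>/<prefix>/rudder-datalake/<namespace>/<table>/YYYY/MM/DD/HH."""
    # Timestamps in this guide look like "2019-10-12T08:40:50.52Z" (UTC).
    ts = datetime.strptime(received_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    ts = ts.replace(tzinfo=timezone.utc)

    parts = [bucket]
    if prefix:  # the prefix is omitted when it is not configured
        parts.append(prefix.strip("/"))
    parts += [
        "rudder-datalake",
        namespace,
        to_table_name(event_name),
        ts.strftime("%Y/%m/%d/%H"),
    ]
    return "s3://" + "/".join(parts)


if __name__ == "__main__":
    print(folder_for_event("testBucket", "testNameSpace",
                           "Product Purchased", "2019-10-12T08:40:50.52Z",
                           prefix="testPrefix"))
```

Calling the function with the Cart Viewed event and its timestamp yields the corresponding `cart_viewed/2019/11/12/09` folder.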

Creating a crawler

Refer to this section only if you haven't enabled the Register Schema on AWS Glue setting while configuring the S3 data lake destination in RudderStack.

If RudderStack is not registering your schema on AWS Glue, you can create a crawler to go through your data and create table definitions from it.

Follow these steps to create a crawler:

  1. Go to your AWS Glue console and select Crawler from the left pane.

  2. Select Add Crawler.

  3. Specify a name for your crawler and click Next.

  4. Under the Crawler source type section, choose Data stores.

  5. Configure the Repeat crawls of S3 data stores setting based on your requirements.

  6. Under the Data store section, select S3 from the dropdown for the Choose a data store setting.

  7. For the Crawl data in setting, choose the option Specified path in my account.

  8. In the Include path setting, enter the S3 URI of your configured bucket followed by the suffix /<prefix>/rudder-datalake/<namespace>/.

If your S3 bucket name is testBucket and the configured prefix and namespace are testPrefix and testNameSpace respectively, then your path should be: s3://testBucket/testPrefix/rudder-datalake/testNameSpace/

If you have not configured any prefix while setting up the S3 data lake destination in RudderStack, omit the prefix. The path would then be: s3://testBucket/rudder-datalake/testNameSpace/.

  9. Under the Add another data store setting, select No.

  10. In the IAM Role section, configure a suitable IAM role.

  11. In the Schedule section, select the frequency of your crawler from the dropdown options.

  12. In the Output section, configure the database that stores all the tables. Under the Grouping behavior for S3 data section, enable the Create a single schema for each S3 path option.

  13. Specify the Table level as 5 or 4 (refer to the tips below). This value indicates the absolute level of the table location in the bucket.

The level for the top-level folder is 1. For example, for the path mydataset/a/b, if the level is set to 3, the table will be created at the location mydataset/a/b. Similarly, if the level is set to 2, the table will be created at the location mydataset/a.

Since all tables are created in the path s3://testBucket/<prefix>/rudder-datalake/<namespace>/, make sure the table level is set to:

  • 5: If a prefix is configured.

  • 4: If a prefix is not configured.

  14. Review your crawler configuration.

  15. Click Finish to confirm the configuration.

  16. Finally, click on your crawler and run it. Wait for the process to finish. You should then see some tables created in your configured database.
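The Include path and Table level values from the steps above can be derived from the destination settings. The sketch below shows one way to compute them; the function names are hypothetical. The guide only covers a single-segment prefix, so treating a prefix that contains slashes as adding one level per segment is an assumption.

```python
def _prefix_segments(prefix: str) -> list:
    """Split an S3 prefix into its non-empty path segments."""
    return [segment for segment in prefix.split("/") if segment]


def crawler_include_path(bucket: str, namespace: str, prefix: str = "") -> str:
    """Return the S3 URI to enter in the crawler's Include path setting."""
    segments = [bucket, *_prefix_segments(prefix), "rudder-datalake", namespace]
    return "s3://" + "/".join(segments) + "/"


def crawler_table_level(prefix: str = "") -> int:
    """Return the absolute level of the <tableName> folder.

    Levels are counted from the bucket (level 1):
    bucket / [prefix...] / rudder-datalake / <namespace> / <tableName>
    """
    # Assumption: each prefix segment adds exactly one level.
    return 1 + len(_prefix_segments(prefix)) + 3


if __name__ == "__main__":
    print(crawler_include_path("testBucket", "testNameSpace", "testPrefix"))
    print(crawler_table_level("testPrefix"))  # prefix configured
    print(crawler_table_level())              # no prefix configured
```

With the example values from this guide, the helpers return the same paths and the same table levels (5 with a prefix, 4 without) as described in the steps.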

Querying data using AWS Athena

Before querying your data on S3, make sure that you have sent some data to S3 and that the sync has been completed.

Follow these steps to start querying your data on S3:

  1. Open your AWS Athena console. Then, go to the same AWS region that you used to configure AWS Glue.

  2. In the left pane, select AwsDataCatalog as your data source.

  3. Select your configured namespace (or the database you specified while configuring the crawler) from the database dropdown menu.

By default, the namespace is set to your source name if you did not specify it in the destination settings.

  4. You should see some tables already created under the Tables section in the left pane.

  5. You can preview the data by clicking on the three dots next to the table and selecting the Preview Data option. Alternatively, you can run your own SQL queries in the workspace on the right.
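If you prefer to run queries programmatically rather than through the console, the sketch below builds the request for a simple preview query. The helper names and the results location are hypothetical. The keyword names (`QueryString`, `QueryExecutionContext`, `ResultConfiguration`) match the parameters of boto3's Athena `start_query_execution` call, which is left as a comment so that the snippet runs without AWS credentials.

```python
def preview_query(table_name: str, limit: int = 10) -> str:
    """Build a basic preview query for one table."""
    return f'SELECT * FROM "{table_name}" LIMIT {int(limit)};'


def athena_request(namespace: str, table_name: str,
                   output_location: str, limit: int = 10) -> dict:
    """Build keyword arguments for an Athena query against the namespace database."""
    return {
        "QueryString": preview_query(table_name, limit),
        # The Glue database is named after the namespace (or your crawler's database).
        "QueryExecutionContext": {"Database": namespace},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


if __name__ == "__main__":
    request = athena_request(
        namespace="testNameSpace",
        table_name="product_purchased",
        output_location="s3://testBucket/athena-results/",  # hypothetical location
    )
    print(request["QueryString"])
    # With boto3 installed and credentials configured, the request could be sent as:
    #   import boto3
    #   athena = boto3.client("athena", region_name="us-east-1")
    #   athena.start_query_execution(**request)
```

Use the same region here as the one configured for AWS Glue; us-east-1 in the comment is only an example.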

IPs to be whitelisted

To enable network access to RudderStack, you will need to whitelist the following RudderStack IPs:

  • 3.216.35.97

  • 34.198.90.241

  • 54.147.40.62

  • 23.20.96.9

  • 18.214.35.254

  • 35.83.226.133

  • 52.41.61.208

  • 44.227.140.138

  • 54.245.141.180

  • 3.66.99.198

  • 3.64.201.167

If you have your deployment in the EU region, you can whitelist only the following two IPs:

  • 3.66.99.198

  • 3.64.201.167

All the outbound traffic is routed through these RudderStack IPs.
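Firewall and security-group rules usually expect CIDR ranges rather than bare addresses. As a small illustration, the sketch below converts the IPs listed above into /32 entries and checks whether a given address is on the list. The helper names are hypothetical, and how the resulting entries are applied depends on your own network setup.

```python
import ipaddress

# IPs listed in this guide. The last two also form the EU-only list.
ALL_IPS = [
    "3.216.35.97", "34.198.90.241", "54.147.40.62", "23.20.96.9",
    "18.214.35.254", "35.83.226.133", "52.41.61.208", "44.227.140.138",
    "54.245.141.180", "3.66.99.198", "3.64.201.167",
]
EU_IPS = ["3.66.99.198", "3.64.201.167"]


def to_cidrs(ips):
    """Convert single IPv4 addresses to /32 CIDR strings."""
    return [str(ipaddress.ip_network(f"{ip}/32")) for ip in ips]


def is_rudderstack_ip(ip: str, eu_only: bool = False) -> bool:
    """Check whether an address is in the relevant list from this guide."""
    allowed = EU_IPS if eu_only else ALL_IPS
    return ipaddress.ip_address(ip) in {ipaddress.ip_address(a) for a in allowed}


if __name__ == "__main__":
    for cidr in to_cidrs(EU_IPS):
        print(cidr)
```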

Contact us

As mentioned in the Connection Settings section:

You can query your S3 data using a tool like AWS Athena, which lets you run SQL queries on S3.

For queries on any of the sections covered in this guide, you can contact us or start a conversation in our Slack community.
