Google Cloud Storage Data Lake
Step-by-step guide on setting up GCS Data Lake as a destination in RudderStack.
The Google Cloud Storage (GCS) data lake uses Google Cloud Storage to store and access your data within the GCP infrastructure. It is built for performance and scalability while keeping your data secure and private.
RudderStack lets you configure GCS data lake as a destination to which you can send your event data seamlessly.
To set up GCS data lake as a destination in RudderStack, you will need to create a new user role and grant the required permissions to create schemas and temporary tables.
Go to the Google Cloud IAM Admin console and click on CREATE ROLE.
Then, fill in the role details and click on ADD PERMISSIONS.
Under Filter permissions by role, select Storage Object Admin and add the permissions required to use the GCS data lake destination.
Then, click on CREATE. This will successfully create the role.
Go to the Service Accounts option in the Google Cloud IAM Admin console.
Then, select the project containing the dataset that you want to use.
Next, click on CREATE SERVICE ACCOUNT.
Fill in the service account details and click on CREATE.
Then, select the previously created role and click on CONTINUE.
Finally, click on DONE.
Click on the three dots under Actions in the service account that you just created and select Manage keys.
Click on ADD KEY, followed by Create new key.
In the resulting pop-up, select JSON and click on CREATE.
Finally, download this JSON file. This file is required while setting up the GCS data lake destination in RudderStack.
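Because the contents of this file are later pasted into the destination settings, it can help to sanity-check the file first. The following is a minimal sketch and not part of RudderStack. It assumes the standard service account key layout (fields such as type, project_id, private_key, and client_email), and the function name check_service_account_key is hypothetical.

```python
import json
from typing import List

# Fields found in a standard Google Cloud service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key_id", "private_key", "client_email")


def check_service_account_key(raw: str) -> List[str]:
    """Return a list of problems found in the key JSON (empty if none)."""
    try:
        key = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: " + exc.msg]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]
    # Report every required field that is absent or empty.
    problems = ["missing field: " + name for name in REQUIRED_FIELDS if not key.get(name)]
    # Keys created from the Service Accounts page carry this type value.
    if key.get("type") and key["type"] != "service_account":
        problems.append("unexpected type: " + str(key["type"]))
    return problems
```

Passing the text of the downloaded file to check_service_account_key should return an empty list if the file is intact. A non-empty list names the fields to investigate before pasting the JSON into RudderStack.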
To send event data to GCS data lake, you first need to add it as a destination in RudderStack and connect it to your data source. Once the destination is enabled, events will automatically start flowing to GCS data lake via RudderStack.
To configure GCS data lake as a destination in RudderStack, follow these steps:
In your RudderStack dashboard, set up the data source. Then, select Google Cloud Storage Data Lake from the list of destinations.
Assign a name to your destination and then click on Next.
Enter the following credentials in the Connection Credentials page:
- GCS Storage Bucket Name: The name of the GCS bucket used to store data before loading it into the GCS data lake.
- Prefix: If specified, RudderStack will create a folder in the bucket with this prefix and push all the data within that folder. For example, https://storage.googleapis.com/<bucketName>/<prefix>/.
- Namespace: If specified, all the data for the destination will be pushed to https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>. If you don't specify a namespace in the settings, RudderStack sets it to the source name, by default.
- Credentials: Enter the content of the downloaded credentials JSON file in this field.
- Sync Frequency: Specify how often RudderStack should sync the data to your GCS data lake.
- Sync Starting At: This optional setting lets you specify the particular time of the day (in UTC) when you want RudderStack to sync the data to the data lake.
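To make the interaction between the bucket, prefix, and namespace settings concrete, here is a small illustrative sketch of how the destination root could be derived from them. It is not RudderStack's implementation. The function name datalake_root is hypothetical, and it assumes the source name is used verbatim when no namespace is set.

```python
from typing import Optional

GCS_BASE_URL = "https://storage.googleapis.com"


def datalake_root(bucket: str, source_name: str,
                  prefix: Optional[str] = None,
                  namespace: Optional[str] = None) -> str:
    """Build <base>/<bucketName>/<prefix>/rudder-datalake/<namespace>."""
    parts = [GCS_BASE_URL, bucket]
    if prefix:
        # The prefix is optional; it is omitted from the path when not set.
        parts.append(prefix.strip("/"))
    parts.append("rudder-datalake")
    # Assumption: fall back to the source name as-is when no namespace is given.
    parts.append(namespace or source_name)
    return "/".join(parts)
```

For instance, datalake_root("my-bucket", "web_app", prefix="events", namespace="prod") yields https://storage.googleapis.com/my-bucket/events/rudder-datalake/prod, where the bucket, prefix, and names are made-up values.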
RudderStack converts your events into Parquet files and dumps them into the configured GCS bucket. Before dumping the events, RudderStack groups the files by the event name based on the time (in UTC) they were received.
The folder structure follows this format:
https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/<tableName>/YYYY/MM/DD/HH
As specified in the Connection settings section above:
- <prefix> is the GCS prefix used while configuring the GCS data lake destination in RudderStack. If not specified, RudderStack will omit this value.
- <namespace> is the namespace specified in the destination settings. If not specified, RudderStack sets it to the source name.
- <tableName> is set to the event name.
- YYYY, MM, DD, and HH are replaced by the actual time values.
A combination of the YYYY, MM, DD, and HH values represents the UTC time.
Suppose that RudderStack tracks the following two events:
- Product Purchased: "2019-10-12T08:40:50.52Z" UTC
- Cart Viewed: "2019-11-12T09:34:50.52Z" UTC
RudderStack then converts these events into Parquet files and dumps them into the following folders:
- Product Purchased: https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/product_purchased/2019/10/12/08
- Cart Viewed: https://storage.googleapis.com/<bucketName>/<prefix>/rudder-datalake/<namespace>/cart_viewed/2019/11/12/09
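The mapping from an event to its folder can be sketched in a few lines of Python, which may be handy when locating files for a given event and hour. This is an illustration only. The helper names are hypothetical, the snake_case conversion is an assumption inferred from the two examples above, and the parser assumes timestamps with fractional seconds and a trailing Z, as in those examples.

```python
import re
from datetime import datetime, timezone


def table_name(event_name: str) -> str:
    # Assumption: lowercase the name and turn runs of other characters into "_".
    return re.sub(r"[^a-z0-9]+", "_", event_name.lower()).strip("_")


def hour_partition(timestamp: str) -> str:
    # Parse a UTC timestamp such as "2019-10-12T08:40:50.52Z".
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    # YYYY/MM/DD/HH, all taken from the UTC time.
    return ts.strftime("%Y/%m/%d/%H")


def event_folder(root: str, event_name: str, timestamp: str) -> str:
    # root is the path up to and including <namespace>.
    return "/".join([root.rstrip("/"), table_name(event_name), hour_partition(timestamp)])
```

With root set to the path ending in <namespace>, event_folder(root, "Product Purchased", "2019-10-12T08:40:50.52Z") ends in product_purchased/2019/10/12/08, matching the first folder listed above.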
To enable network access to RudderStack, you will need to whitelist the following RudderStack IPs:
3.216.35.97
34.198.90.241
54.147.40.62
23.20.96.9
18.214.35.254
35.83.226.133
52.41.61.208
44.227.140.138
54.245.141.180
3.66.99.198
3.64.201.167
If you have your deployment in the EU region, you can whitelist only the following two IPs:
3.66.99.198
3.64.201.167
All the outbound traffic is routed through these RudderStack IPs.
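When auditing firewall rules or access logs, the lists above can be encoded directly. The sketch below uses Python's standard ipaddress module. The function names and the "eu" region label are hypothetical, and the addresses are copied from this guide.

```python
import ipaddress

# IPs listed in this guide; the last two are the EU region IPs.
ALL_IPS = [
    "3.216.35.97", "34.198.90.241", "54.147.40.62", "23.20.96.9",
    "18.214.35.254", "35.83.226.133", "52.41.61.208", "44.227.140.138",
    "54.245.141.180", "3.66.99.198", "3.64.201.167",
]
EU_IPS = ["3.66.99.198", "3.64.201.167"]


def allowlist(region: str = "all") -> set:
    """Return the set of RudderStack IPs to allow for the given region."""
    ips = EU_IPS if region == "eu" else ALL_IPS
    return {ipaddress.ip_address(ip) for ip in ips}


def is_rudderstack_ip(addr: str, region: str = "all") -> bool:
    """Check whether addr is one of the RudderStack IPs for the region."""
    return ipaddress.ip_address(addr) in allowlist(region)
```

For example, is_rudderstack_ip("3.66.99.198", "eu") is true, while is_rudderstack_ip("3.216.35.97", "eu") is false because that address is not one of the two EU IPs.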
For queries on any of the sections covered in this guide, you can contact us or start a conversation in our Slack community.