Date: 2023-11-17

Amazon Redshift is a cloud data warehousing service that provides high-performance analytical processing based on a massively parallel processing (MPP) architecture. Building and maintaining data pipelines is a common challenge for all enterprises. Managing SQL files, integrating cross-team work, incorporating software engineering principles, and importing external utilities can be a time-consuming task that requires complex design and a lot of preparation.

dbt (data build tool) addresses these challenges with a well-structured framework for data analysis, transformation, and orchestration. It also applies common software engineering principles such as integrating with Git repositories, writing DRYer code, adding functional test cases, and including external libraries. This allows developers to focus on preparing SQL files according to business logic, and dbt takes care of the rest.

In this post, we look at an optimal and cost-effective way to incorporate dbt within Amazon Redshift. We use Amazon Elastic Container Registry (Amazon ECR) to store our dbt Docker images and AWS Fargate as an Amazon Elastic Container Service (Amazon ECS) task to run the job.

How does the dbt framework work with Amazon Redshift?

dbt has an Amazon Redshift adapter module named dbt-redshift that enables it to connect to and work with Amazon Redshift. All connection profiles are configured within the dbt profiles.yml file. In an optimal environment, we store the credentials in AWS Secrets Manager and retrieve them at run time.

The following code shows the contents of profiles.yml:

SampleProject:
  target: dev
  outputs:
    dev:
      type: redshift
      host: "{{ env_var('DBT_HOST') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5439
      dbname: "{{ env_var('DBT_DB_NAME') }}"
      schema: dev
      threads: 4
      keepalives_idle: 240 # default 240 seconds
      connect_timeout: 10 # default 10 seconds
      sslmode: require
      ra3_node: true
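Because the profile reads every connection value through env_var, a missing variable only surfaces when dbt parses the profile. The following is a minimal Python sketch of a pre-flight check, not part of the reference project. The function name missing_env_vars is hypothetical; the variable names are the ones used in the profile.

```python
import os

# Environment variables referenced by env_var() in the profile.
REQUIRED_VARS = ("DBT_HOST", "DBT_USER", "DBT_PASSWORD", "DBT_DB_NAME")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in required if not environ.get(name)]


# Example with a partial, hypothetical environment.
example_env = {"DBT_HOST": "example-host", "DBT_USER": "dbt_user"}
print(missing_env_vars(example_env))
```

Calling missing_env_vars() with no arguments checks the real process environment, which could be done before the first dbt command runs.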

The following diagram shows the major components of the dbt framework:

The primary components are as follows:

models – These are written as a SELECT statement and saved as a .sql file. All transformation queries can be written here and materialized as a table or view. The table refresh can be full or incremental depending on the configuration. For more information, see SQL models.

snapshots – These implement type-2 slowly changing dimensions (SCDs) over mutable source tables. These SCDs identify how a row in a table changes over time.

seeds – These are CSV files in your dbt project (usually in your seeds directory), which dbt can load into your data warehouse using the dbt seed command.

tests – These are assertions you make about your models and other resources (such as sources, seeds, and snapshots) in your dbt project. When you run dbt test, dbt tells you whether each test in your project passed or failed.

macros – These are pieces of code that can be reused multiple times. They are analogous to “functions” in other programming languages, and are extremely useful if you find yourself repeating code across multiple models.

These components are stored as .sql files and run by dbt CLI commands. At run time, dbt creates a directed acyclic graph (DAG) based on the internal references between dbt components. It uses the DAG to orchestrate the run sequence accordingly.
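To illustrate how a run order can be derived from references between models, the following Python sketch extracts ref() calls with a regular expression and sorts the models topologically. It is a simplified illustration, not dbt's actual implementation: it only handles the single-argument form of ref, and the model names and SQL are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_run_order(models):
    """Return model names ordered so each runs after the models it references.

    `models` maps a model name to its SQL text.
    """
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-model project.
models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "fct_orders": (
        "select o.*, c.name from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
}
order = build_run_order(models)
print(order)
```

Both staging models appear before fct_orders in the result. If two models referenced each other, TopologicalSorter would raise graphlib.CycleError, which mirrors the requirement that the graph stay acyclic.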

Multiple profiles can be created within the profiles.yml file, which dbt can use to target different Redshift environments when running. For more information, see Redshift setup.

Solution overview

The following diagram shows our solution architecture.

The workflow includes the following steps:

The open source dbt-redshift connector is used to create our dbt project, including all necessary models, snapshots, tests, macros, and profiles.

A Docker image is created and pushed to the ECR repository.

The Docker image is run by Fargate as an ECS task triggered via AWS Step Functions.

All Amazon Redshift credentials are stored in Secrets Manager, which the ECS task uses to connect to Amazon Redshift.

During the run, dbt converts all models, snapshots, tests, and macros into Amazon Redshift compliant SQL statements and orchestrates the run based on the internal data lineage graph it maintains.

These SQL commands are run directly on the Redshift cluster, so the workload is pushed down to Amazon Redshift.

When the run is complete, dbt creates a set of HTML and JSON files to host the dbt documentation, which describes the data catalog, compiled SQL statements, data lineage graph, and more.

Prerequisites

You should have the following prerequisites:

A good understanding of dbt principles and implementation steps.

An AWS account with user role permission to access the AWS services used in this solution.

Security groups for Fargate to access the Redshift cluster and Secrets Manager from Amazon ECS.

A Redshift cluster. For creation instructions, see Create a cluster.

An ECR repository. For instructions, see Creating a private repository.

A Secrets Manager secret that contains all the credentials for connecting to Amazon Redshift. This includes the host, port, database name, user name, and password. For more information, see Create an AWS Secrets Manager database secret.

An Amazon Simple Storage Service (Amazon S3) bucket to host documentation files.

Create a dbt project

We are using the dbt CLI, so all commands are run on the command line. Install pip if it is not already installed. Refer to installation for more information.

To create a dbt project, complete the following steps:

Install dependent dbt packages:

pip install dbt-redshift

Initiate a dbt project using the dbt init command, which creates all the template folders automatically.

Add all the required dbt artifacts.

Refer to the dbt-redshift-etlpattern repo, which contains a reference dbt project. For more information about building projects, see About dbt projects.

In the reference project, we have implemented the following features:

SCD type 1 using the incremental model

SCD type 2 using snapshots

Seed look-up files

Macros for adding reusable code to the project

Tests for analyzing inbound data
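The difference between the two SCD types in the list above can be sketched in a few lines of Python. This is a conceptual illustration only, not how the reference project or dbt implements them: the row layout (key, value, valid_from, valid_to) and the function names are hypothetical. In dbt itself, snapshots record validity in the dbt_valid_from and dbt_valid_to columns.

```python
from datetime import date


def apply_scd1(dimension, source):
    """Type 1: overwrite attributes in place, keeping no history."""
    updated = dict(dimension)
    updated.update(source)
    return updated


def apply_scd2(history, source, as_of):
    """Type 2: close the current version of changed keys and append a new one.

    `history` is a list of dicts with key, value, valid_from, valid_to;
    valid_to is None for the current version.
    """
    result = [dict(row) for row in history]  # copy so the input is not mutated
    current = {row["key"]: row for row in result if row["valid_to"] is None}
    for key, value in source.items():
        row = current.get(key)
        if row is not None and row["value"] == value:
            continue  # unchanged, keep the current version open
        if row is not None:
            row["valid_to"] = as_of  # close the superseded version
        result.append({"key": key, "value": value, "valid_from": as_of, "valid_to": None})
    return result


history = [{"key": 1, "value": "London", "valid_from": date(2023, 1, 1), "valid_to": None}]
source = {1: "Paris", 2: "Berlin"}
new_history = apply_scd2(history, source, date(2023, 6, 1))
for row in new_history:
    print(row)
```

With type 1, key 1 would simply become "Paris". With type 2, the "London" row is kept and closed on the change date, so the table still shows what the value was before the change.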

A Python script has been created to obtain the required credentials from Secrets Manager for accessing Amazon Redshift. Refer to the export_redshift_connection.py file.
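The export_redshift_connection.py file itself is not reproduced in this post. As a rough sketch of what such a script could look like, the following code reads a secret with boto3 and turns it into shell export lines for the variables used in the dbt profile. The secret key names (host, username, password, dbname) and the mapping are assumptions, so check them against your own secret.

```python
import json
import shlex

# Assumed mapping from secret keys to the variables read by the dbt profile.
SECRET_TO_ENV = {
    "host": "DBT_HOST",
    "username": "DBT_USER",
    "password": "DBT_PASSWORD",
    "dbname": "DBT_DB_NAME",
}


def fetch_secret_string(secret_id, region_name):
    """Read the raw secret JSON from AWS Secrets Manager."""
    import boto3  # imported lazily so the helper below works without AWS access

    client = boto3.client("secretsmanager", region_name=region_name)
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def secret_to_exports(secret_string):
    """Turn the secret JSON into shell `export` lines for the mapped keys."""
    secret = json.loads(secret_string)
    return [
        f"export {env_name}={shlex.quote(str(secret[key]))}"
        for key, env_name in SECRET_TO_ENV.items()
    ]


# Hypothetical secret payload, shown instead of a live Secrets Manager call.
sample = json.dumps(
    {"host": "example-host", "username": "dbt_user", "password": "p@ss word", "dbname": "dev"}
)
exports = secret_to_exports(sample)
print("\n".join(exports))
```

The values are passed through shlex.quote so that passwords containing spaces or shell metacharacters survive being evaluated by the shell.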

Create a run_dbt.sh script to run the dbt pipeline sequentially. This script is placed in the root folder of the dbt project, as shown in the sample repo.

# Import the dependent external libraries
dbt deps --profiles-dir . --project-dir .

# Create tables based on the seed files
dbt seed --profiles-dir . --project-dir .

# Run all the model files
dbt run --profiles-dir . --project-dir .

# Run all the snapshot files
dbt snapshot --profiles-dir . --project-dir .

# Run all the inbuilt and custom test cases
dbt test --profiles-dir . --project-dir .

# Generate the dbt documentation files
dbt docs generate --profiles-dir . --project-dir .

# Copy the dbt outputs to the S3 bucket for hosting
aws s3 cp --recursive --exclude="*" --include="*.json" --include="*.html" dbt/target/ s3://<bucketname>/REDSHIFT_POC/
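A shell script continues past a failed command unless it opts in to stopping (for example with set -e), so a failed dbt run could still be followed by snapshots and tests. As one possible alternative, the same sequence can be sketched in Python so that the pipeline halts at the first failure. This runner is hypothetical and not part of the sample repo; it covers only the dbt steps, not the copy to Amazon S3.

```python
import subprocess

# dbt subcommands in the order used by run_dbt.sh.
DBT_STEPS = (
    ("deps",),
    ("seed",),
    ("run",),
    ("snapshot",),
    ("test",),
    ("docs", "generate"),
)


def build_commands(profiles_dir=".", project_dir="."):
    """Return the full argument list for each dbt step."""
    common = ["--profiles-dir", profiles_dir, "--project-dir", project_dir]
    return [["dbt", *step, *common] for step in DBT_STEPS]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)


commands = build_commands()
for command in commands:
    print(" ".join(command))
```

Calling run_pipeline(commands) inside the container would run the steps for real; the example only prints them so it can be inspected without dbt installed.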

Create a Dockerfile in the parent directory of the dbt project folder. This step builds the image of the dbt project to be pushed to the ECR repository.

FROM python:3
ADD dbt_src /dbt_src
RUN pip install -U pip
# Install the dbt libraries
RUN pip install --no-cache-dir dbt-core
RUN pip install --no-cache-dir dbt-redshift
RUN pip install --no-cache-dir boto3
RUN pip install --no-cache-dir awscli
WORKDIR /dbt_src
RUN chmod -R 755 .
ENTRYPOINT [ "/bin/sh", "-c" ]
CMD ["./run_dbt.sh"]

Upload the image to Amazon ECR and run it as an ECS task

To push the image to the ECR repository, complete the following steps:

Retrieve an authentication token and authenticate your Docker client to your registry:

aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <repository URI>

Build your Docker image using the following command:

docker build -t <image tag> .

After the build is complete, tag your image so you can push it to the repository:

docker tag <image tag>:latest <repository URI>:latest

Run the following command to push the image to your newly created AWS repository:

docker push <repository URI>:latest
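The repository address used in the docker tag and docker push commands follows a fixed pattern for private ECR registries in the standard AWS Regions. The following Python sketch composes it; the account ID, Region, and repository name are hypothetical placeholders.

```python
def ecr_registry(account_id, region):
    """Return the registry host for a private ECR registry."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Return the full image reference used by docker tag and docker push."""
    return f"{ecr_registry(account_id, region)}/{repository}:{tag}"


# Hypothetical account ID, Region, and repository name.
uri = ecr_image_uri("123456789012", "eu-west-2", "dbt-redshift")
print(uri)
```

The registry host on its own is the value docker login expects, while the full reference including the repository and tag is what gets tagged and pushed.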

On the Amazon ECS console, create a cluster with Fargate as the infrastructure option. Provide your VPC and subnets as required. After you create the cluster, create an ECS task and assign the created dbt image as the task definition family. In the networking section, choose your VPC, subnets, and security group to connect to Amazon Redshift, Amazon S3, and Secrets Manager.

This task will trigger the run_dbt.sh pipeline script and run all dbt commands sequentially. When the script completes, we can see the results in Amazon Redshift and the documentation files sent to Amazon S3.

You can host the documentation through Amazon S3 static website hosting. For more information, see Hosting a static website using Amazon S3. Finally, you can run this task in Step Functions as an ECS task to schedule your jobs as required. For more information, see Manage Amazon ECS or Fargate tasks with Step Functions.
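As a rough sketch of what the Step Functions side could look like, the following Python code builds a one-state Amazon States Language definition that runs the Fargate task and waits for it to finish. The state name, the ARNs, the subnet and security group IDs, and the choice to disable the public IP are all assumptions for illustration; with no public IP, the task needs another route to Amazon ECR, Amazon S3, and Secrets Manager, such as VPC endpoints or a NAT gateway.

```python
import json


def build_state_machine(cluster_arn, task_definition_arn, subnets, security_groups):
    """Return an Amazon States Language definition that runs one Fargate task."""
    return {
        "Comment": "Run the dbt pipeline as a Fargate task",
        "StartAt": "RunDbt",
        "States": {
            "RunDbt": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the task to finish.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    "TaskDefinition": task_definition_arn,
                    "NetworkConfiguration": {
                        "AwsvpcConfiguration": {
                            "Subnets": subnets,
                            "SecurityGroups": security_groups,
                            "AssignPublicIp": "DISABLED",
                        }
                    },
                },
                "End": True,
            }
        },
    }


# Hypothetical identifiers; replace them with values from your own account.
definition = build_state_machine(
    cluster_arn="arn:aws:ecs:eu-west-2:123456789012:cluster/dbt-cluster",
    task_definition_arn="arn:aws:ecs:eu-west-2:123456789012:task-definition/dbt-task:1",
    subnets=["subnet-0123456789abcdef0"],
    security_groups=["sg-0123456789abcdef0"],
)
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the state machine definition when creating it, and the state machine can then be started on whatever schedule the pipeline requires.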

The dbt-redshift-etlpattern repo contains all the necessary code samples.

Running dbt jobs in AWS Fargate as an Amazon ECS task with minimal operational requirements costs approximately $1.5 (cost_link) per month.

Clean up

Complete the following steps to clean up your resources:

Delete the ECS cluster that you created.

Delete the ECR repository that you created to store the image files.

Delete the Redshift cluster that you created.

Delete the Redshift secrets stored in Secrets Manager.

Conclusion

This post covers the basic implementation of using dbt with Amazon Redshift in a cost-efficient way using Fargate in Amazon ECS. We described the key infrastructure and configuration setup with a sample project. This architecture can help you take advantage of the dbt framework to manage your data warehouse platform in Amazon Redshift.

For more information about dbt macros and models for Amazon Redshift internal operations and maintenance, see the following GitHub repo. In the next post, we’ll explore traditional extract, transform, and load (ETL) patterns that you can implement using the dbt framework in Amazon Redshift. Test this solution in your account and share your feedback or suggestions in the comments.

About the Authors

Seshadri Senthamaraikannan is a Data Architect with the AWS Professional Services team based in London, UK. He is experienced in data analytics and works with customers to build innovative and scalable solutions in the AWS Cloud that meet their business goals. In his free time, he enjoys spending time with his family and playing sports.

Mohammed Hamdi is a Senior Big Data Architect with AWS Professional Services, based in London, UK. He has over 15 years of experience architecting, leading, and building data warehouses and big data platforms. He helps customers develop big data and analytics solutions to accelerate their business outcomes through their cloud adoption journey. Outside of work, Mohammed enjoys travelling, running, swimming, and playing squash.

Source: aws.amazon.com