---
preview_image:
hero:
    image: /icons/containers.svg
    title: "Build an ETL pipeline with Amazon Redshift and AWS Glue | Modern Infrastructure"
title: "Build an ETL pipeline with Amazon Redshift and AWS Glue"
meta_desc: |
    Learn how to combine AWS Glue and Amazon Redshift to build a fully-automated ETL pipeline with Pulumi. We'll use three components to complete our example.
url_slug: build-etl-pipeline-amazon-redshift-aws-glue-modern-infrastructure
featured: false
pre_recorded: true
pulumi_tv: false
unlisted: false
gated: false
type: webinars
external: false
no_getting_started: true
block_external_search_index: false
main:
    title: "Build an ETL pipeline with Amazon Redshift and AWS Glue | Modern Infrastructure"
    description: |
        Learn how to combine AWS Glue and Amazon Redshift to build a fully-automated ETL pipeline with Pulumi. We'll use three components to complete our ETL pipeline-to-be:

        ▪️ A Glue crawler. The crawler is responsible for fetching data from some external source (for us, an S3 bucket) and importing it into a Glue catalog.
        ▪️ A Glue job. The job is responsible for running an ETL script (e.g., on a schedule) to process the data imported by the crawler into the catalog.
        ▪️ A Glue script. The script is where the ETL magic happens. Ours will be written in Python and be responsible for extracting data from the catalog, applying some slight transformations, and loading the transformed data into Redshift.

        ► Get the code to follow along at https://www.pulumi.com/blog/redshift-etl-with-pulumi-and-aws-glue/

        ✅ Get Started with Pulumi: https://pulumip.us/Get-Started
        ✅ Create a Pulumi account. It's free: https://pulumip.us/Sign-Up-OpenSource

        00:00 Introduction
        00:10 What we did in "Deploying a Data Warehouse with Pulumi and Amazon Redshift."
        00:30 Common question when setting up an ETL pipeline
        00:58 What is AWS Glue?
        01:23 Watch part 1, if you haven't: https://youtu.be/2v_53eWGrqE
        01:49 Recap of where we left off on part 1 of the demo
        02:10 Three components to complete our ETL pipeline
        02:20 1. Glue crawler
        02:47 2. Glue job
        03:01 3. Glue script
        03:27 Adding the Glue crawler
        05:30 Adding the Glue job
        07:23 Adding the Glue script
        08:40 Deploy!
        09:31 Load some data
        10:36 Take a look at the AWS console
        10:59 Go to the Glue console
        12:04 Go to the Glue Job section
        12:29 Go to the Redshift console to verify the data
        13:15 Let's put in more data to verify the Glue job
        15:53 Summary of what we did (part 1 + part 2)
        17:10 Check out the Metabase package (AWS)
        17:38 Goodbyes
    sortable_date: 2023-02-01T20:40:52Z
    youtube_url: https://www.youtube.com/embed/cbAzk9ovR9s
transcript: |
    Welcome to another episode of Modern Infrastructure Wednesday. We're going to build an ETL pipeline with Amazon Redshift and AWS Glue. If you recall from our last episode, we learned how to deploy a data warehouse with Pulumi and Amazon Redshift. We covered using Pulumi to load unstructured data from Amazon S3 into an Amazon Redshift cluster. At the end of that episode, there were a few unanswered questions. For example, how do we avoid importing and processing the same data twice? How can we transform the data during the ingestion process? And what are our options for loading data automatically, for example, on a regular schedule?

    When your platform of choice is Amazon Redshift, those questions will often be answered by pointing you to another Amazon service, for example, AWS Glue. With Glue, you can define processes that monitor external data sources like S3, keep track of data that's already been processed, and write code in general-purpose programming languages like Python to process and transform the data on its way into Redshift. There's a lot more you can do with Glue, but for this project, it's just what we need. Watch the previous video to get up to speed on what we're building and why.

    When we left off, we'd gotten Redshift up and running, and we were able to pull the data from S3 into Redshift directly by running a manual query in the Redshift console. But that's as far as we got: there was no automation, no protection from duplicate records, just the absolute basics. So we're going to pick up from there. Just to quickly recap, we deployed a VPC with a private subnet, a Redshift cluster deployed into the subnet, an S3 bucket that we use to hold some raw data, and a VPC endpoint allowing direct access to the bucket over the private network.

    So now it's time to add in Glue. At a high level, we'll need three components to complete our ETL pipeline-to-be. One, a Glue crawler. The crawler is responsible for fetching data from some external source (for us, an S3 bucket) and importing it into a Glue catalog. A catalog is a kind of staging database that tracks your data through the ingestion process. Our particular crawler will poll our S3 bucket for new data and import it into a catalog table. Two, a Glue job. The job is responsible for running an ETL script on a schedule to process the data imported by the crawler into the catalog. And finally, three, a Glue script. The script is where all the ETL magic happens. Ours will be written in Python and be responsible for extracting data from the catalog, applying some slight transformations to it, and loading the transformed data into Redshift.

    Next, we'll translate these high-level components into concrete Pulumi resources. Step one: adding a Glue crawler. The first thing we'll need is a unique name for the Glue catalog, so we'll set it using Pulumi config here. All right, now back in the editor, we'll copy in some code for the Glue crawler. So this pulls in the Glue database name. This piece of code defines an AWS cron expression, so it runs every 15 minutes. This is specifically creating the Glue catalog, and this is defining an IAM role granting Glue access to the S3 bucket that we created up here. And then finally, this is the Glue crawler that processes the contents of the data bucket on a schedule, feeding it the name, the role, and the schedule, and we're also pulling in the S3 bucket from above.

    So let's save this and run a quick pulumi up. OK, that's now done. Now let's add the Glue job. This is step two: adding the Glue job. We're going to paste in some code again. This job is fairly simple, but it still requires a few different AWS resources. Here we create a Glue connection to the Redshift cluster; it's just pulling in the username and password that we defined in Pulumi config. Then we'll need an S3 bucket. This is where Glue is going to look for the Glue script that we'll cover in step three. We will up
---
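The walkthrough in the transcript adds the Glue crawler as step one. The snippet below is a minimal sketch of that step in Python with `pulumi_aws`, not the exact program from the video (that's linked in the description): the `glueDbName` config key, the resource names, and the stand-in `raw-data` bucket (created in part 1 of the series) are all illustrative assumptions.

```python
import json

import pulumi
import pulumi_aws as aws

config = pulumi.Config()
# Hypothetical config key for the Glue catalog database name (set with `pulumi config set`).
glue_db_name = config.require("glueDbName")

# Stand-in for the raw-data bucket created in part 1 of the series.
data_bucket = aws.s3.Bucket("raw-data")

# Glue catalog database that stages crawled data before it lands in Redshift.
catalog_db = aws.glue.CatalogDatabase("glue-catalog-db", name=glue_db_name)

# IAM role that Glue assumes, with read access to the raw-data bucket.
glue_role = aws.iam.Role(
    "glue-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)

aws.iam.RolePolicyAttachment(
    "glue-service-policy",
    role=glue_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

aws.iam.RolePolicyAttachment(
    "glue-s3-read",
    role=glue_role.name,
    policy_arn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# Crawler that polls the bucket every 15 minutes and imports new data into the catalog.
crawler = aws.glue.Crawler(
    "glue-crawler",
    database_name=catalog_db.name,
    role=glue_role.arn,
    schedule="cron(0/15 * * * ? *)",
    s3_targets=[aws.glue.CrawlerS3TargetArgs(
        path=data_bucket.bucket.apply(lambda name: f"s3://{name}"),
    )],
)
```

The cron expression mirrors the every-15-minutes schedule mentioned in the video; a single `pulumi up` creates the catalog database, the role, and the crawler together.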
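Step two in the transcript adds the Glue connection and job. The sketch below builds on the role from the previous snippet and is likewise hedged: the Redshift cluster stand-in, the `glue-job.py` script path, and the Glue version are assumptions for illustration, not the code shown in the video.

```python
import pulumi
import pulumi_aws as aws

config = pulumi.Config()

# Stand-in for the Redshift cluster deployed in part 1 of the series.
redshift_cluster = aws.redshift.Cluster(
    "redshift-cluster",
    cluster_identifier="etl-demo",
    database_name="dev",
    master_username=config.require("redshiftUser"),
    master_password=config.require_secret("redshiftPassword"),
    node_type="ra3.xlplus",
    skip_final_snapshot=True,
)

# Glue connection to Redshift, using the username and password from Pulumi config.
redshift_connection = aws.glue.Connection(
    "redshift-connection",
    connection_properties={
        "JDBC_CONNECTION_URL": pulumi.Output.concat(
            "jdbc:redshift://", redshift_cluster.endpoint, "/", redshift_cluster.database_name,
        ),
        "USERNAME": config.require("redshiftUser"),
        "PASSWORD": config.require_secret("redshiftPassword"),
    },
)

# Bucket that holds the Python ETL script Glue will run.
scripts_bucket = aws.s3.Bucket("glue-scripts")
aws.s3.BucketObject(
    "glue-script",
    bucket=scripts_bucket.id,
    key="glue-job.py",
    source=pulumi.FileAsset("./glue-job.py"),  # assumed local path to the ETL script
)

# The Glue job that runs the ETL script against the catalog and loads Redshift.
glue_job = aws.glue.Job(
    "glue-job",
    role_arn=glue_role.arn,  # the role defined in the crawler sketch above
    glue_version="3.0",
    connections=[redshift_connection.name],
    command=aws.glue.JobCommandArgs(
        script_location=pulumi.Output.concat("s3://", scripts_bucket.bucket, "/glue-job.py"),
        python_version="3",
    ),
)
```

With the connection in place, the job picks up `glue-job.py` from the scripts bucket and can be run on demand or on a schedule; step three, the ETL script itself, is covered later in the video.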