

Introduction

By Farid Neshat, Cloud Engineer, HeleCloud The Netherlands

Amazon Web Services (AWS) has introduced Redshift Data APIs, which can run arbitrary SQL statements on Redshift clusters asynchronously. These calls run without any additional cost and can be added as an EventBridge target, giving significant potential for integration across your AWS ecosystem. To use the Redshift Data APIs for multi-step Extract-Transform-Load (ETL) jobs, we have created "redshift-query-step-function", a CloudFormation template that deploys a Step Functions state machine to your account which can run, track, and retry Redshift queries using the Redshift Data APIs, enabling far greater reliability and value creation. In this blog article, we will explore the various ways of running queries on Amazon Redshift based on events and schedules in a reliable and cost-effective way.
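
To illustrate the asynchronous model, here is a minimal sketch using boto3's redshift-data client; the cluster identifier, database, user, and SQL are placeholders. It submits a statement and then polls until the statement reaches a terminal state:

```python
import time
import boto3

client = boto3.client("redshift-data")

# Submit the statement; the call returns immediately with a statement Id.
response = client.execute_statement(
    ClusterIdentifier="my-cluster",       # placeholder cluster name
    Database="analytics",                 # placeholder database
    DbUser="etl_user",                    # temporary credentials are generated for this user
    Sql="SELECT COUNT(*) FROM events;",   # placeholder query
)
statement_id = response["Id"]

# Poll until the statement finishes, fails, or is aborted.
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(2)

if desc["Status"] == "FINISHED" and desc.get("HasResultSet"):
    result = client.get_statement_result(Id=statement_id)
    print(result["Records"])
else:
    print(desc["Status"], desc.get("Error"))
```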

Background

At HeleCloud, we utilise many AWS services for customer projects, including Amazon Redshift, and we are often looking for ways to improve the performance and reliability of the queries we run. Amazon Redshift is a managed data warehouse service that allows analysing terabytes of data using standard SQL.

When running queries on Redshift, the ability to automate them based on events or schedules is a requirement that comes up often, for example:

- Loading data into Redshift upon creation of S3 objects.
- Inserting the results of specific queries into a table or S3.

In previous years, the standard and easy solution would have been to deploy cron jobs to EC2 instances. For such a simple thing, you had to ensure the EC2 instance was always patched and continuously running, and you had to rerun failed queries manually whenever something went wrong (instance failures, networking disruptions, database failures).
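
To give a taste of how the scheduled case can now be handled without any servers, below is a minimal sketch of the EventBridge integration mentioned in the introduction: a scheduled rule that uses the Redshift cluster as a target via the Data API. The rule name, cluster and role ARNs, database details, and SQL are all placeholders:

```python
import boto3

events = boto3.client("events")

# A rule that fires every night at 02:00 UTC.
events.put_rule(
    Name="nightly-redshift-load",              # placeholder rule name
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# Point the rule at the Redshift cluster; EventBridge submits the SQL
# through the Data API using the given role.
events.put_targets(
    Rule="nightly-redshift-load",
    Targets=[{
        "Id": "redshift-nightly-load",
        "Arn": "arn:aws:redshift:eu-west-1:123456789012:cluster:my-cluster",  # placeholder
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-redshift",     # placeholder
        "RedshiftDataParameters": {
            "Database": "analytics",
            "DbUser": "etl_user",
            "Sql": "CALL load_nightly_events();",  # placeholder statement
            "StatementName": "nightly-load",
            "WithEvent": True,
        },
    }],
)
```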

What were the alternatives?

Data Pipeline

AWS Data Pipeline was an option that achieved the same goal in much the same manner: it would launch an EC2 instance from a given AMI and run the given queries in the configured steps. This was an improvement over running the EC2 instance ourselves, but it was still clunky, and we had to make sure failed queries were retried manually. AWS was no longer adding features to it, and it supported neither newer EC2 instance types nor passwordless Redshift authentication via the Redshift GetClusterCredentials API. GetClusterCredentials generates temporary credentials for a Redshift user that can be used to authenticate and run SQL statements. Using this API means we do not have to store Redshift credentials anywhere or worry about rotating them; we can simply use IAM roles for the compute environments that run the queries, which is far more straightforward.
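
For reference, passwordless authentication with GetClusterCredentials looks roughly like the sketch below. The cluster, database, and user names are placeholders, and it assumes a PostgreSQL driver such as pg8000 is available to open the connection:

```python
import boto3
import pg8000  # any PostgreSQL driver works; pg8000 is assumed here

redshift = boto3.client("redshift")

# Exchange the caller's IAM identity for short-lived database credentials.
creds = redshift.get_cluster_credentials(
    ClusterIdentifier="my-cluster",   # placeholder cluster name
    DbUser="etl_user",                # placeholder database user
    DbName="analytics",               # placeholder database
    DurationSeconds=900,
    AutoCreate=False,
)

# Look up the cluster endpoint and connect with the temporary credentials;
# nothing is stored or rotated by us.
endpoint = redshift.describe_clusters(ClusterIdentifier="my-cluster")["Clusters"][0]["Endpoint"]
conn = pg8000.connect(
    host=endpoint["Address"],
    port=endpoint["Port"],
    database="analytics",
    user=creds["DbUser"],
    password=creds["DbPassword"],
)
cursor = conn.cursor()
cursor.execute("SELECT 1;")
print(cursor.fetchall())
conn.close()
```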

AWS Lambda

AWS Lambda paired with Step Functions running the queries directly was a good option; however, the Lambda function had to stay alive for as long as the query was running, which limited us to 15 minutes of run time per query. This meant long-running analysis queries and large inserts were not possible without breaking the work down into smaller pieces, which increased both complexity and cost.
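
A minimal sketch of such a handler is shown below, assuming the query and connection details arrive in the Step Functions input and that a PostgreSQL driver is packaged with the function. The entire invocation is blocked on execute(), which is exactly what runs into the 15-minute cap:

```python
import pg8000  # assumed to be packaged with the function


def handler(event, context):
    """Run one SQL statement passed in by the Step Functions state machine."""
    conn = pg8000.connect(
        host=event["host"],          # hypothetical input fields
        port=5439,
        database=event["database"],
        user=event["user"],
        password=event["password"],
    )
    try:
        cursor = conn.cursor()
        # The function is billed, and limited to 15 minutes, for as long
        # as this call blocks while Redshift executes the statement.
        cursor.execute(event["sql"])
        conn.commit()
        return {"status": "FINISHED"}
    finally:
        conn.close()
```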

AWS Glue

Then AWS Glue arrived, a managed Apache Spark environment that could ingest data from S3, transform it, and then load it into Redshift through S3. Using it for our simple queries, which loaded data into Redshift and then transformed it there (ELT rather than ETL), was like using a spaceship to commute from one city to another (instead of between planets) when a car would have done. The AWS Glue team saw these kinds of workloads running and introduced AWS Glue Python Shell Jobs: similar to AWS Lambda, except that only Python can be used. However, there was no time limit and the environment was completely managed, so the complexity compared with the other solutions was lower. Python was also our team's primary language, which was an advantage. AWS Glue jobs support retrying on failures, and a job can be started based on other jobs. We created a simple library called redshift-query that could run directly on AWS Glue Python Shell jobs, and wrapped it into a CloudFormation template that takes SQL statements as arguments and runs them based on events or schedules with passwordless authentication.
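
As an illustration of the pattern (not the redshift-query library's actual interface), a Python Shell job along these lines could pick up the SQL from its job arguments, fetch temporary credentials via the job's IAM role, and run the statements. The argument names, cluster details, and driver choice are assumptions:

```python
import sys

import boto3
import pg8000  # assumed to be bundled with the job as a wheel
from awsglue.utils import getResolvedOptions

# Hypothetical job arguments: --cluster_id, --db_name, --db_user, --sql_statements
args = getResolvedOptions(sys.argv, ["cluster_id", "db_name", "db_user", "sql_statements"])

redshift = boto3.client("redshift")

# Passwordless authentication using the job's IAM role.
creds = redshift.get_cluster_credentials(
    ClusterIdentifier=args["cluster_id"],
    DbName=args["db_name"],
    DbUser=args["db_user"],
)
endpoint = redshift.describe_clusters(ClusterIdentifier=args["cluster_id"])["Clusters"][0]["Endpoint"]

conn = pg8000.connect(
    host=endpoint["Address"],
    port=endpoint["Port"],
    database=args["db_name"],
    user=creds["DbUser"],
    password=creds["DbPassword"],
)
cursor = conn.cursor()

# Run each semicolon-separated statement in order; Glue retries the whole job on failure.
for statement in filter(None, (s.strip() for s in args["sql_statements"].split(";"))):
    cursor.execute(statement)
conn.commit()
conn.close()
```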
