Amazon Redshift: What It Is and When to Use It
Definition
Amazon Redshift is AWS's fully managed, petabyte-scale cloud data warehouse designed for analytical (OLAP) workloads. It uses columnar storage, Massively Parallel Processing (MPP), and aggressive compression to run SQL queries against terabytes to petabytes of structured and semi-structured data with interactive response times. Launched in 2012, Redshift powers enterprise BI dashboards, financial reporting, product analytics, and marketing attribution across thousands of AWS customers.
How It Works
A Redshift cluster consists of a leader node and one or more compute nodes:
- The leader node parses SQL, builds the query plan, and coordinates execution across the compute nodes.
- Compute nodes store data in columnar format, execute query fragments in parallel on slices, and return results to the leader.
Current-generation RA3 nodes separate compute from storage: hot data is cached on fast local SSDs, and the full dataset lives in Redshift Managed Storage (RMS) on S3-class infrastructure. RA3 lets you size compute and storage independently and is the recommended default for new provisioned clusters.
Redshift Serverless skips cluster management entirely — you specify a base Redshift Processing Unit (RPU) capacity, and Redshift scales automatically as queries arrive, billing per RPU-second. Ideal for unpredictable or intermittent analytics.
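Because Serverless bills per RPU-second, a quick back-of-envelope calculation helps when comparing it to a provisioned cluster. The sketch below uses an illustrative rate (the $0.375/RPU-hour figure often cited for us-east-1 is an assumption here; check current Regional pricing) and ignores the per-query minimum billing granularity:

```python
def serverless_cost(rpu_hours: float, price_per_rpu_hour: float = 0.375) -> float:
    """Estimate Redshift Serverless compute cost.

    price_per_rpu_hour is an illustrative placeholder -- verify the
    current rate for your Region. Real billing is per RPU-second with
    a minimum charge per billing period; this sketch ignores rounding.
    """
    return rpu_hours * price_per_rpu_hour

# A workload averaging 16 RPUs for 2 hours a day over 30 days:
monthly = serverless_cost(rpu_hours=16 * 2 * 30)  # 960 RPU-hours -> $360.00
```

The point of the model: with Serverless you pay for active query time only, so a warehouse that is idle 22 hours a day pays for 2 hours, not 24.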
Redshift stores data in columns rather than rows, which dramatically improves compression and allows queries that touch a few columns out of many to skip most of the I/O. Distribution styles (KEY, EVEN, ALL, AUTO) and sort keys govern how data is laid out across slices to optimize joins and range scans.
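A toy model makes the KEY distribution idea concrete. Redshift's internal hash function is not public, so `md5` stands in here purely for illustration; what matters is that equal distribution-key values always map to the same slice, which is what lets joins on that key run without shuffling rows across nodes:

```python
import hashlib

def slice_for(dist_key_value: str, num_slices: int) -> int:
    """Toy model of KEY distribution: hash the distribution-key value
    to choose a slice. md5 is a stand-in for Redshift's internal hash;
    the invariant being illustrated is that equal keys always land on
    the same slice, enabling co-located (shuffle-free) joins."""
    digest = hashlib.md5(dist_key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_slices

# Rows from an orders table and a customers table, both distributed
# on customer_id, collide on the same slice:
orders_slice = slice_for("customer_42", 8)
customers_slice = slice_for("customer_42", 8)
assert orders_slice == customers_slice
```

EVEN distribution would instead deal rows round-robin (balanced, but joins must shuffle), and ALL copies small dimension tables to every node so any join is local.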
Key Features and Limits
Node types
- RA3 (ra3.xlplus, ra3.4xlarge, ra3.16xlarge) — managed storage on RMS, recommended default. Clusters scale up to 128 nodes, though the maximum node count depends on node size (ra3.16xlarge supports the largest clusters), supporting petabytes of data.
- DC2 — legacy dense-compute nodes with local SSD only, so storage is capped by local disk. AWS now recommends RA3 or Serverless even for small workloads.
- Previous-generation DS2 and DC1 nodes are deprecated.
Redshift Spectrum
Query data directly in Amazon S3 without loading it into Redshift. External tables are defined in the AWS Glue Data Catalog (or Hive metastore), and Spectrum workers scan Parquet/ORC/JSON/CSV files in parallel. Great for querying a large data lake alongside warehouse data in the same SQL statement (a lakehouse pattern).
Concurrency Scaling
Automatically adds transient compute capacity when query queues back up, then removes it when load drops. One hour of free Concurrency Scaling is earned for every 24 hours a cluster runs, with additional usage billed per second.
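The credit accrual is simple arithmetic, sketched below. The one-free-hour-per-24-hours rule is from the text above; the 30-hour accrual cap is a commonly cited AWS limit treated here as an assumption to verify against current documentation:

```python
def free_cs_hours(cluster_runtime_hours: float, cap_hours: float = 30.0) -> float:
    """Concurrency Scaling credit accrual: one free hour of scaling
    capacity per 24 hours of main-cluster runtime, capped at an
    accrual limit (30 hours is the commonly cited cap; verify)."""
    return min(cluster_runtime_hours / 24.0, cap_hours)

# A cluster running 24x7 for a 30-day month (720 hours) accrues the
# full 30 free hours; two days of runtime accrues 2 hours.
month_credits = free_cs_hours(24 * 30)  # 30.0
two_day_credits = free_cs_hours(48)     # 2.0
```

For most steady BI workloads the free credits cover occasional queue bursts; sustained bursting beyond the credits is what incurs the per-second charge.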
Zero-ETL integrations
- Aurora Zero-ETL to Redshift — transactional data from Aurora MySQL/PostgreSQL is replicated to Redshift within seconds, no ETL pipeline required.
- RDS for MySQL Zero-ETL, DynamoDB Zero-ETL, and S3 Zero-ETL are generally available, extending the pattern to other sources.
Other notable features
- Materialized views with automatic refresh and query rewrite.
- AQUA (Advanced Query Accelerator) — hardware-accelerated caching for certain scan-heavy queries on RA3; AWS now applies it automatically where beneficial (region/generation dependent).
- Data sharing — query live data across clusters and accounts without copying.
- Federated query — query RDS/Aurora PostgreSQL and MySQL directly from Redshift SQL.
- Redshift ML — train and deploy SageMaker models from SQL with CREATE MODEL.
- Automated snapshots with cross-Region copy for DR; RPO of hours, RTO of minutes.
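The materialized-view feature above is easiest to understand as a precomputed aggregate that a refresh updates using only new base-table rows. The toy model below is not Redshift's implementation, just a sketch of the incremental-refresh idea that automatic refresh applies to eligible views:

```python
from collections import defaultdict

class MaterializedSum:
    """Toy incrementally refreshed materialized view: a running
    per-key sum that folds in only the base-table delta on refresh.
    Illustrative only -- Redshift's refresh machinery is internal."""

    def __init__(self):
        self.totals = defaultdict(float)  # group key -> running sum
        self.rows_seen = 0                # base rows already folded in

    def refresh(self, base_table):
        for key, amount in base_table[self.rows_seen:]:  # delta only
            self.totals[key] += amount
        self.rows_seen = len(base_table)

sales = [("eu", 10.0), ("us", 5.0)]
mv = MaterializedSum()
mv.refresh(sales)              # full initial build
sales.append(("eu", 2.5))
mv.refresh(sales)              # folds in just the one new row
# mv.totals -> {"eu": 12.5, "us": 5.0}
```

Query rewrite then lets Redshift answer a matching aggregate query from the view instead of rescanning the base table.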
Common Use Cases
- Enterprise BI / dashboards — Tableau, QuickSight, Looker, Power BI connect via JDBC/ODBC for interactive reporting.
- Customer 360 / marketing analytics — joining CRM, clickstream, ad spend, and transactional data at scale.
- Financial reporting and planning — period-end close, consolidations, forecasting.
- Log and event analytics — querying CloudTrail/VPC Flow Logs/application logs piped through Firehose.
- Lakehouse analytics — Spectrum on S3 Parquet data merged with curated facts in RMS.
- Zero-ETL OLTP-to-OLAP — Aurora transactional tables replicated to Redshift for near-real-time analytics.
Pricing Model
- Provisioned RA3 — per node-hour based on node type, plus per-GB-month of Redshift Managed Storage actually used.
- Redshift Serverless — per RPU-second, with a configurable base capacity (minimum 8 RPU) and auto scaling. Storage billed separately.
- Redshift Spectrum — $5 per TB scanned in S3 (Parquet/ORC and columnar formats dramatically reduce this).
- Concurrency Scaling — free for the first hour per 24 hours of runtime; beyond that, per-second cluster rate.
- Data transfer — free in same Region/AZ; standard rates cross-Region.
- Reserved Instances — 1- or 3-year commitments for provisioned clusters, up to ~75% off.
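Putting the provisioned pricing components together, a rough monthly estimate combines node-hours, RMS storage, and any Reserved discount. Every rate below is a placeholder (the $3.26/node-hour figure sometimes cited for ra3.4xlarge on-demand in us-east-1 is an assumption); look up current Regional pricing before budgeting:

```python
def provisioned_monthly_cost(
    nodes: int,
    node_hour_rate: float,             # assumed on-demand $/node-hour
    rms_gb: float,                     # RMS actually used, in GB
    rms_gb_month_rate: float = 0.024,  # assumed $/GB-month for RMS
    hours: float = 730.0,              # ~hours in a month, always-on
    ri_discount: float = 0.0,          # e.g. 0.6 for a ~60% RI discount
) -> float:
    """Back-of-envelope monthly cost for a provisioned RA3 cluster.
    All rates are illustrative assumptions, not current AWS prices."""
    compute = nodes * node_hour_rate * hours * (1.0 - ri_discount)
    storage = rms_gb * rms_gb_month_rate
    return compute + storage

# 4 nodes at an assumed $3.26/node-hour with 10 TB in RMS:
estimate = provisioned_monthly_cost(4, 3.26, 10_000)  # ~9759.20
```

Note how storage is a small fraction of the bill at this scale; with RA3 the dominant lever is compute, which is exactly what Reserved Instances discount.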
Pros and Cons
Pros
- Petabyte-scale SQL performance on columnar MPP.
- RA3 decouples storage and compute; Serverless removes cluster management.
- Deep AWS integration: S3 via Spectrum, Glue, Kinesis, SageMaker, QuickSight.
- Zero-ETL from Aurora/RDS/DynamoDB/S3 slashes pipeline complexity.
- Strong ecosystem of BI tools and compatibility with PostgreSQL client protocol.
Cons
- Not designed for OLTP; high-concurrency single-row writes are a poor fit.
- Provisioned clusters still need tuning (WLM, VACUUM, ANALYZE) unless you rely on the automatic versions of those features.
- Serverless can add cold-start latency to the first queries after an idle period.
- Larger scale comes at a price; small warehouses may find Athena or Aurora cheaper.
- Cross-Region DR requires snapshot copy + rehydrate, not a hot standby by default.
Comparison with Alternatives
| | Redshift | Athena | Snowflake on AWS | BigQuery (cross-cloud) |
| --- | --- | --- | --- | --- |
| Engine | MPP columnar | Serverless Presto/Trino on S3 | Virtual warehouses on S3-class storage | Serverless on Google storage |
| Cluster mgmt | Provisioned or Serverless | None | Fully managed | Fully managed |
| Storage | RMS + S3 via Spectrum | S3 only | Proprietary storage | Proprietary storage |
| Pricing | Node-hour or RPU-second | Per-TB scanned | Per-credit-second | Per-TB scanned or slots |
| Best for | Dedicated warehouse at scale | Ad-hoc SQL on data lake | Cross-cloud warehousing | Google-native analytics |
Pick Athena for ad-hoc queries on S3 without a warehouse. Pick Redshift when workloads are large, recurring, latency-sensitive, or need materialized views, data sharing, or Zero-ETL sources.
Exam Relevance
- Solutions Architect Associate (SAA-C03) — recognize Redshift as the data-warehouse answer; Redshift vs Athena vs RDS for analytics.
- Data Engineer Associate (DEA-C01) — distribution keys, sort keys, VACUUM, COPY from S3, Redshift Spectrum, materialized views.
- Database Specialty (DBS-C01) — RA3 managed storage, Concurrency Scaling, data sharing, cross-Region snapshots, migration from Teradata/Oracle.
- Data Analytics Specialty (DAS-C01, retired but still referenced) — classic source of Redshift internals questions.
Exam trap: questions mentioning "query S3 without loading" point to Athena or Redshift Spectrum; "petabyte-scale recurring BI workload" points to Redshift.
Frequently Asked Questions
Q: When should I pick Redshift Serverless vs a provisioned RA3 cluster?
A: Pick Redshift Serverless for unpredictable, bursty, or intermittent analytics — it scales automatically and you pay only for RPU-seconds consumed, with no cluster to manage. Pick provisioned RA3 for steady, high-utilization warehouses where Reserved Instances bring the per-hour cost below Serverless, or when you need specific features like fine-grained WLM tuning. Many teams start on Serverless for speed of adoption, then migrate hot workloads to provisioned RA3 as usage patterns stabilize.
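One way to frame the decision is a break-even point: a provisioned cluster bills 24 hours a day whether or not queries run, while Serverless bills only during activity. The sketch below uses assumed hourly rates (the $0.375/RPU-hour figure is illustrative, not a quoted price):

```python
def breakeven_hours_per_day(
    provisioned_per_hour: float,  # assumed total cluster $/hour
    serverless_per_hour: float,   # assumed RPUs consumed x $/RPU-hour
) -> float:
    """Hours of daily query activity above which an always-on
    provisioned cluster is cheaper than Serverless. Rough model:
    provisioned pays 24 h/day regardless of use; Serverless pays
    only while queries run. All rates are assumptions."""
    return 24.0 * provisioned_per_hour / serverless_per_hour

# If a provisioned cluster costs $13/h and the equivalent Serverless
# capacity costs $48/h while active (e.g. 128 RPUs at an assumed
# $0.375/RPU-hour), break-even is 24 * 13 / 48 = 6.5 hours/day.
threshold = breakeven_hours_per_day(13.0, 48.0)  # 6.5
```

Below the threshold Serverless wins; above it, provisioned (especially with Reserved Instances, which lower the numerator further) wins.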
Q: How does Redshift Spectrum work and when should I use it?
A: Spectrum lets you query data stored as Parquet, ORC, JSON, or CSV in S3 via external tables registered in the Glue Data Catalog. The leader node pushes filtering, aggregation, and projection to a fleet of Spectrum workers that scan S3 in parallel. Use Spectrum to (1) query cold data that doesn't justify loading into RMS, (2) run lakehouse queries joining S3 data with warehouse tables, and (3) keep open file formats in S3 as the source of truth. Always prefer columnar formats like Parquet to minimize bytes scanned — Spectrum bills $5 per TB scanned.
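The columnar-format advice is worth quantifying. Since Spectrum bills per byte scanned, Parquet's column pruning translates directly into cost: reading 2 of 20 similar-width columns scans roughly a tenth of the bytes (a simplifying assumption; real columns vary in width and Parquet also compresses, which helps further):

```python
def spectrum_cost(tb_scanned: float, rate_per_tb: float = 5.0) -> float:
    """Redshift Spectrum charges per TB of S3 data scanned
    ($5/TB at the commonly cited rate; verify current pricing)."""
    return tb_scanned * rate_per_tb

# A 10 TB CSV dataset must be scanned in full; the same data in
# Parquet with only 2 of 20 similar-width columns read scans ~1 TB.
full_scan = spectrum_cost(10.0)            # 50.0
pruned = spectrum_cost(10.0 * 2 / 20)      # 5.0
```

Partitioning the S3 layout (e.g. by date) compounds the savings, since partition pruning skips whole files before any bytes are scanned.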
Q: What is Zero-ETL from Aurora to Redshift?
A: Aurora Zero-ETL replicates committed transactions from an Aurora MySQL or PostgreSQL cluster directly into Redshift within seconds, with no pipeline to build and no Kinesis/DMS configuration. You enable the integration on both sides, pick the databases or tables, and Redshift maintains a near-real-time analytical copy you can query alongside your existing warehouse tables. It is the recommended pattern for operational analytics that used to be built with CDC + Firehose + Lambda + Glue pipelines.
This article reflects AWS features and pricing as of 2026. AWS services evolve rapidly — always verify against the official Amazon Redshift documentation before making production decisions.