This module covers Star and Snowflake Schemas, Slowly Changing Dimensions (SCD), Change Data Capture (CDC), Event-driven Design, ETL vs ELT.
**Star Schema** and **Snowflake Schema** are two of the most common data warehouse design patterns used for organizing relational data in a way that supports efficient business intelligence queries.
The **Star Schema** is characterized by its central fact table with multiple dimension tables connected to it. This design is also known as a 'fact constellation'.
// Example Star Schema structure
CREATE TABLE Fact_Sales (
SaleID INT PRIMARY KEY,
DateID INT,
ProductID INT,
CustomerID INT,
Amount DECIMAL,
FOREIGN KEY (DateID) REFERENCES Dim_Date(DateID),
FOREIGN KEY (ProductID) REFERENCES Dim_Product(ProductID),
FOREIGN KEY (CustomerID) REFERENCES Dim_Customer(CustomerID)
);
CREATE TABLE Dim_Date (
DateID INT PRIMARY KEY,
Date DATE,
Month VARCHAR(20),
Year INT
);
The **Snowflake Schema** is an extension of the Star Schema where dimension tables are further normalized into multiple related tables, creating a structure that resembles a snowflake.
// Example Snowflake Schema structure
CREATE TABLE Fact_Sales (
SaleID INT PRIMARY KEY,
DateID INT,
ProductID INT,
CustomerID INT,
Amount DECIMAL,
FOREIGN KEY (DateID) REFERENCES Dim_Date(DateID),
FOREIGN KEY (ProductID) REFERENCES Dim_Product(ProductID),
FOREIGN KEY (CustomerID) REFERENCES Dim_Customer(CustomerID)
);
CREATE TABLE Dim_Date (
DateID INT PRIMARY KEY,
MonthID INT,
YearID INT,
FOREIGN KEY (MonthID) REFERENCES Dim_Month(MonthID),
FOREIGN KEY (YearID) REFERENCES Dim_Year(YearID)
);
In data warehousing, **Slowly Changing Dimensions (SCD)** refer to dimensions that change over time but retain their historical values. This concept is crucial for maintaining accurate records of how data evolves. There are three main types of SCD implementations: Type 1, Type 2, and Type 3. In this chapter, we'll explore each type in depth, including implementation strategies, best practices, and practical examples.
Type 1 SCD occurs when an attribute's value is updated without maintaining a history of the previous values. This type is typically used for non-key attributes that don't require historical tracking.
UPDATE dim_customer
SET phone_number = '555-1234'
WHERE customer_id = 1;
Type 2 SCD occurs when a new row is added to the dimension table to represent a change in an attribute, while the old row remains unchanged. This type maintains a complete history of all changes.
-- Insert new customer record
INSERT INTO dim_customer (customer_id, name, email, phone_number)
VALUES (1, 'John Doe', 'john.doe@example.com', '555-4321');
Type 3 SCD occurs when a new column is added to the dimension table to track the effective period of an attribute change. This type combines both attribute override and row addition approaches.
ALTER TABLE dim_customer
ADD COLUMN valid_from DATE,
ADD COLUMN valid_to DATE;
When updating an attribute, the old value is marked as invalid and a new row with the updated value is added.
-- Update customer record
UPDATE dim_customer
SET valid_to = CURRENT_TIMESTAMP - INTERVAL '1 second'
WHERE customer_id = 1 AND valid_to IS NULL;
INSERT INTO dim_customer (customer_id, name, email, phone_number, valid_from)
VALUES (1, 'John Doe', 'john.doe@example.com', '555-4321', CURRENT_TIMESTAMP);
Let's walk through a practical example using Type 2 SCD. Imagine we have a customer dimension table and want to track changes in the customer's email address.
-- Initial customer record
INSERT INTO dim_customer (customer_id, name, email)
VALUES (1, 'John Doe', 'john.doe@example.com');
After some time, the customer updates their email address.
-- Update customer record using SCD Type 2
INSERT INTO dim_customer (customer_id, name, email)
VALUES (1, 'John Doe', 'john.doe@newexample.com');
Now we can query the dimension table to see both versions of the customer record.
-- Querying historical data
SELECT * FROM dim_customer
WHERE customer_id = 1
ORDER BY effective_date DESC;
Event-driven design is a modern approach to building scalable and responsive systems using SQL. It leverages streaming data and real-time events to enable efficient processing of information across distributed systems.
CREATE TABLE kafka_topic (
id INT,
event_type VARCHAR(255),
payload JSON
) WITH (
'connector' = 'kafka',
'topic' = 'my-topic'
);
-- Example Debezium connector configuration
{
"name": "my-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "localhost",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz",
"database.server.id": "12345"
}
}
ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two modern design patterns used in data pipelines. Understanding the differences between them is crucial for building efficient and scalable data architectures using SQL.
ETL stands for Extract, Transform, Load. It's a traditional data integration pattern where raw data is extracted from source systems, transformed into a structured format, and then loaded into a target system (like a warehouse).
WITH transformed_data AS (
SELECT
user_id,
order_date,
CASE WHEN status = 'COMPLETED' THEN 1 ELSE 0 END as completed
FROM orders
)
INSERT INTO analytics.order_metrics
SELECT * FROM transformed_data;
ELT stands for Extract, Load, Transform. It's a modern approach where raw data is extracted and loaded into the target system before applying transformations. This approach leverages the power of SQL in cloud warehouses.
// Extract raw orders
COPY orders FROM 's3://bucket/orders.csv';
// Transform using SQL
CREATE MATERIALIZED VIEW analytics.order_metrics AS (
SELECT
user_id,
order_date,
CASE WHEN status = 'COMPLETED' THEN 1 ELSE 0 END as completed
FROM orders
);
Question 1 of 21