🧰 Modern Design Patterns in SQL

This module covers Star and Snowflake Schemas, Slowly Changing Dimensions (SCD), Change Data Capture (CDC), Event-driven Design, and ETL vs ELT.

📘 Star and Snowflake Schemas

**Star Schema** and **Snowflake Schema** are two of the most common data warehouse design patterns used for organizing relational data in a way that supports efficient business intelligence queries.

💡 Understanding Star Schema

The **Star Schema** is characterized by a central fact table with multiple dimension tables connected directly to it, forming a star-like shape. (A design with multiple fact tables sharing dimensions is a different pattern, known as a 'fact constellation' or galaxy schema.)

  • `Fact Table`: Stores quantitative, measurable data (e.g., sales amounts, quantities) and foreign keys pointing to dimension tables.
  • `Dimension Tables`: Contain descriptive attributes about the facts (e.g., date dimensions, product categories).
-- Example Star Schema structure
CREATE TABLE Fact_Sales (
  SaleID INT PRIMARY KEY,
  DateID INT,
  ProductID INT,
  CustomerID INT,
  Amount DECIMAL,
  FOREIGN KEY (DateID) REFERENCES Dim_Date(DateID),
  FOREIGN KEY (ProductID) REFERENCES Dim_Product(ProductID),
  FOREIGN KEY (CustomerID) REFERENCES Dim_Customer(CustomerID)
);

CREATE TABLE Dim_Date (
  DateID INT PRIMARY KEY,
  Date DATE,
  Month VARCHAR(20),
  Year INT
);

💡 Advantages of Star Schema

  • Simplifies query writing with fewer joins needed.
  • Fast query performance due to denormalized structure.
  • Easy to understand and maintain.

💡 Understanding Snowflake Schema

The **Snowflake Schema** is an extension of the Star Schema where dimension tables are further normalized into multiple related tables, creating a structure that resembles a snowflake.

-- Example Snowflake Schema structure
CREATE TABLE Fact_Sales (
  SaleID INT PRIMARY KEY,
  DateID INT,
  ProductID INT,
  CustomerID INT,
  Amount DECIMAL,
  FOREIGN KEY (DateID) REFERENCES Dim_Date(DateID),
  FOREIGN KEY (ProductID) REFERENCES Dim_Product(ProductID),
  FOREIGN KEY (CustomerID) REFERENCES Dim_Customer(CustomerID)
);

CREATE TABLE Dim_Date (
  DateID INT PRIMARY KEY,
  Date DATE,
  MonthID INT,
  YearID INT,
  FOREIGN KEY (MonthID) REFERENCES Dim_Month(MonthID),
  FOREIGN KEY (YearID) REFERENCES Dim_Year(YearID)
);

💡 Advantages of Snowflake Schema

  • Reduces redundancy by fully normalizing dimension tables.
  • Improves data integrity with proper referential constraints.
  • Saves storage space by eliminating repeated information.
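The trade-off against the star schema shows up in queries: reaching a normalized attribute takes an extra join hop. A runnable sketch (Python's built-in sqlite3; table names and sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Month (MonthID INTEGER PRIMARY KEY, Month TEXT);
CREATE TABLE Dim_Date  (DateID INTEGER PRIMARY KEY, MonthID INTEGER);
CREATE TABLE Fact_Sales (SaleID INTEGER PRIMARY KEY, DateID INTEGER, Amount REAL);
INSERT INTO Dim_Month VALUES (1, 'January');
INSERT INTO Dim_Date VALUES (1, 1);
INSERT INTO Fact_Sales VALUES (10, 1, 99.5);
""")

# Snowflake schema: Month lives two hops away (Fact -> Dim_Date -> Dim_Month)
rows = conn.execute("""
  SELECT m.Month, SUM(f.Amount) AS total
  FROM Fact_Sales f
  JOIN Dim_Date d  ON d.DateID = f.DateID
  JOIN Dim_Month m ON m.MonthID = d.MonthID
  GROUP BY m.Month
""").fetchall()
print(rows)  # [('January', 99.5)]
```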

💡 When to Choose Star vs Snowflake Schema?

  • Use **Star Schema** when query performance is critical and storage space is not a primary concern.
  • Use **Snowflake Schema** when data integrity, storage efficiency, and full normalization are important considerations.

📘 Slowly Changing Dimensions (SCD)

In data warehousing, **Slowly Changing Dimensions (SCD)** refer to dimensions that change over time but retain their historical values. This concept is crucial for maintaining accurate records of how data evolves. There are three main types of SCD implementations: Type 1, Type 2, and Type 3. In this chapter, we'll explore each type in depth, including implementation strategies, best practices, and practical examples.

💡 What is an SCD?

  • A dimension that changes over time but retains historical data.
  • Used to track changes in attributes of entities over time.
  • Commonly used in data warehousing for maintaining accurate analytics.

SCD Type 1: Attribute Override

Type 1 SCD occurs when an attribute's value is updated without maintaining a history of the previous values. This type is typically used for non-key attributes that don't require historical tracking.

-- Type 1: overwrite in place; the previous phone number is lost
UPDATE dim_customer
SET phone_number = '555-1234'
WHERE customer_id = 1;

💡 When to Use Type 1?

  • Non-key attributes (e.g., phone number, email).
  • Attributes that don't require historical tracking.
  • Simple updates where maintaining history is unnecessary.

SCD Type 2: Add New Row

Type 2 SCD occurs when a new row is added to the dimension table to represent a change in an attribute, while the old row remains unchanged. This type maintains a complete history of all changes.

-- Expire the current row, then insert the new version
-- (assumes a surrogate primary key and an is_current flag on dim_customer)
UPDATE dim_customer SET is_current = 0 WHERE customer_id = 1 AND is_current = 1;

INSERT INTO dim_customer (customer_id, name, email, phone_number, is_current)
VALUES (1, 'John Doe', 'john.doe@example.com', '555-4321', 1);

💡 When to Use Type 2?

  • Key attributes that require historical tracking.
  • Attributes whose change history drives analysis (e.g., customer address, sales territory).
  • Scenarios where maintaining a complete history is critical.

SCD Type 3: Add New Column

Type 3 SCD occurs when a new column is added to the dimension table to store the previous value of an attribute alongside the current one. It keeps limited history (typically only the immediately prior value) without adding rows.

ALTER TABLE dim_customer
ADD COLUMN previous_email VARCHAR(255);

When updating the attribute, the old value is copied into the history column before being overwritten.

-- Update customer record, preserving the prior value
UPDATE dim_customer
SET previous_email = email,
    email = 'john.doe@newexample.com'
WHERE customer_id = 1;

💡 When to Use Type 3?

  • Attributes where both the current and the immediately prior value are needed.
  • Scenarios where space efficiency is a concern compared to Type 2.
  • Middle ground between Type 1 and Type 2 implementations.
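Classic Type 3 keeps the prior value in a dedicated column. A runnable sketch (Python's built-in sqlite3; table and column names are illustrative) shows the old value being preserved during the overwrite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
  customer_id INTEGER PRIMARY KEY, email TEXT, previous_email TEXT
);
INSERT INTO dim_customer (customer_id, email) VALUES (1, 'old@example.com');
""")

# Type 3: copy the current value into the history column, then overwrite.
# In standard SQL, the right-hand sides see the row's values BEFORE the update.
conn.execute("""
  UPDATE dim_customer
  SET previous_email = email, email = 'new@example.com'
  WHERE customer_id = 1
""")
row = conn.execute(
    "SELECT email, previous_email FROM dim_customer WHERE customer_id = 1"
).fetchone()
print(row)  # ('new@example.com', 'old@example.com')
```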

💡 Best Practices for Implementing SCDs

  • Identify the business requirements before choosing an SCD type.
  • Use Type 2 for most scenarios where historical tracking is needed.
  • Implement Type 3 when you need to balance history and space efficiency.
  • Avoid using Type 1 for key attributes that require historical tracking.
  • Regularly audit your dimension tables to ensure data integrity.

Common Mistakes to Avoid

  • Don't use Type 1 for attributes that require historical tracking.
  • Avoid mixing SCD types within the same dimension table.
  • Don't forget to expire the old row (e.g., set its is_current flag or valid_to date) when using Type 2.
  • Don't ignore data quality and consistency in your dimension tables.

💡 Practical Examples of SCD Implementation

Let's walk through a practical example using Type 2 SCD. Imagine we have a customer dimension table and want to track changes in the customer's email address.

-- Initial customer record
INSERT INTO dim_customer (customer_id, name, email, effective_date, is_current)
VALUES (1, 'John Doe', 'john.doe@example.com', '2024-01-01', 1);

After some time, the customer updates their email address.

-- Update customer record using SCD Type 2: expire the old row, add a new one
UPDATE dim_customer SET is_current = 0
WHERE customer_id = 1 AND is_current = 1;

INSERT INTO dim_customer (customer_id, name, email, effective_date, is_current)
VALUES (1, 'John Doe', 'john.doe@newexample.com', '2024-06-15', 1);

Now we can query the dimension table to see both versions of the customer record.

-- Querying historical data
SELECT * FROM dim_customer
WHERE customer_id = 1
ORDER BY effective_date DESC;
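The full Type 2 flow can be run end to end with Python's built-in sqlite3 (the surrogate key, effective_date, and is_current columns are the assumed housekeeping columns; dates are sample values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
  customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
  customer_id INTEGER, name TEXT, email TEXT,
  effective_date TEXT, is_current INTEGER
)""")
conn.execute("""INSERT INTO dim_customer
  (customer_id, name, email, effective_date, is_current)
  VALUES (1, 'John Doe', 'john.doe@example.com', '2024-01-01', 1)""")

# SCD Type 2: expire the current row, then insert the new version
conn.execute("UPDATE dim_customer SET is_current = 0 "
             "WHERE customer_id = 1 AND is_current = 1")
conn.execute("""INSERT INTO dim_customer
  (customer_id, name, email, effective_date, is_current)
  VALUES (1, 'John Doe', 'john.doe@newexample.com', '2024-06-15', 1)""")

# Both versions remain queryable; is_current marks the active one
rows = conn.execute("""SELECT email, is_current FROM dim_customer
  WHERE customer_id = 1 ORDER BY effective_date""").fetchall()
print(rows)  # [('john.doe@example.com', 0), ('john.doe@newexample.com', 1)]
```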

📘 Change Data Capture (CDC)

**Change Data Capture (CDC)** is a pattern for detecting and recording row-level changes (inserts, updates, and deletes) in a source database so they can be propagated to downstream systems in near real time. Common approaches include timestamp or version columns, trigger-populated change-log tables, and log-based tools such as Debezium that read the database's transaction log.
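As a minimal, runnable sketch of trigger-based CDC (using Python's built-in sqlite3 module; all table and column names are illustrative), an AFTER UPDATE trigger appends each change to a change-log table that downstream consumers can poll:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE customers_changelog (
  change_id INTEGER PRIMARY KEY AUTOINCREMENT,
  op TEXT, customer_id INTEGER, old_email TEXT, new_email TEXT,
  changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Record every update as an immutable change event
CREATE TRIGGER trg_customers_update AFTER UPDATE ON customers
BEGIN
  INSERT INTO customers_changelog (op, customer_id, old_email, new_email)
  VALUES ('UPDATE', NEW.id, OLD.email, NEW.email);
END;
""")
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("UPDATE customers SET email = 'b@example.com' WHERE id = 1")

# Downstream consumers read the change log instead of diffing tables
rows = conn.execute(
    "SELECT op, old_email, new_email FROM customers_changelog").fetchall()
print(rows)  # [('UPDATE', 'a@example.com', 'b@example.com')]
```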

📘 Event-driven Design

Event-driven design is a modern approach to building scalable and responsive systems using SQL. It leverages streaming data and real-time events to enable efficient processing of information across distributed systems.

  • `Event-driven architecture` allows decoupling of components through asynchronous communication.
  • `Streaming platforms` like Kafka are essential for handling high-throughput event data.
  • `Change Data Capture (CDC)` with tools like Debezium enables real-time tracking of database changes.

💡 Key Components of Event-Driven Design

  • `Event producers`: Generate events from data sources.
  • `Event consumers`: Process and react to incoming events.
  • `Event streams`: Sequences of events that can be queried and analyzed.

Integrating Kafka with SQL

-- Example: Apache Flink SQL table backed by a Kafka topic
CREATE TABLE kafka_topic (
  id INT,
  event_type VARCHAR(255),
  payload STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'my-topic',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

Using Debezium for CDC

-- Example Debezium MySQL source connector configuration (Kafka Connect JSON)
{
  "name": "my-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "12345",
    "topic.prefix": "dbserver1"
  }
}

💡 Best Practices for Event-Driven Design

  • Always design events as `immutable records`.
  • Ensure events are `self-contained` with all necessary data.
  • Implement proper `idempotency` to handle event retries.
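The idempotency point can be sketched in a few lines of plain Python (the event shape is hypothetical): the consumer remembers processed event IDs so a redelivered event is not applied twice.

```python
# Idempotent event consumer sketch (hypothetical event shape: {"id", "amount"}).
# Tracking processed event IDs makes retries and duplicate deliveries safe.
processed = set()
balance = 0

def handle(event):
    """Apply the event exactly once, even if it is delivered again."""
    global balance
    if event["id"] in processed:  # duplicate delivery: ignore
        return
    processed.add(event["id"])
    balance += event["amount"]

# "e1" is delivered twice (a retry), but is applied only once
for e in [{"id": "e1", "amount": 10},
          {"id": "e2", "amount": 5},
          {"id": "e1", "amount": 10}]:
    handle(e)
print(balance)  # 15
```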

Common Mistakes to Avoid

  • Avoid overloading event streams with unnecessary data.
  • Do not ignore error handling and retry mechanisms.
  • Do not mix unrelated business domains in the same event stream.

Advanced Patterns for Event-Driven Systems

  • `Event Storming`: Collaborative design session to identify domain events.
  • `Sagas`: Long-running transactions that span multiple services.
  • `Materialized Views`: Precomputed results for efficient querying of event streams.
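The materialized-view idea can be sketched without any warehouse at all (plain Python; the event shape is hypothetical): consume the stream once and keep a precomputed aggregate that queries read directly instead of re-scanning the events.

```python
# "Materialized view" over an event stream: a precomputed aggregate kept
# up to date as events arrive, so reads never re-scan the stream.
events = [
    {"type": "order_completed", "user_id": 1, "amount": 20},
    {"type": "order_completed", "user_id": 2, "amount": 15},
    {"type": "order_completed", "user_id": 1, "amount": 5},
]
totals = {}  # user_id -> running total of completed order amounts
for e in events:
    if e["type"] == "order_completed":
        totals[e["user_id"]] = totals.get(e["user_id"], 0) + e["amount"]
print(totals)  # {1: 25, 2: 15}
```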

📘 ETL vs ELT

ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are two modern design patterns used in data pipelines. Understanding the differences between them is crucial for building efficient and scalable data architectures using SQL.

💡 What is ETL?

ETL stands for Extract, Transform, Load. It's a traditional data integration pattern where raw data is extracted from source systems, transformed into a structured format, and then loaded into a target system (like a warehouse).

  • `Extract`: Pulls data from various sources like databases, APIs, or files.
  • `Transform`: Applies cleaning, validation, and enrichment operations using SQL or ETL tools.
  • `Load`: Loads processed data into the target system (e.g., Snowflake, BigQuery).

💡 ETL Use Cases

  • Data warehousing and analytics
  • ETL pipelines using tools like Apache NiFi, Talend, or Fivetran
  • When you need strict data governance and transformation before loading

-- Example: transform in a staging query, then load the result
WITH transformed_data AS (
  SELECT 
    user_id,
    order_date,
    CASE WHEN status = 'COMPLETED' THEN 1 ELSE 0 END as completed
  FROM orders
)
INSERT INTO analytics.order_metrics
SELECT * FROM transformed_data;

💡 What is ELT?

ELT stands for Extract, Load, Transform. It's a modern approach where raw data is extracted and loaded into the target system before applying transformations. This approach leverages the power of SQL in cloud warehouses.

  • `Extract`: Pulls raw data from sources.
  • `Load`: Loads raw data directly into the target warehouse (e.g., Snowflake, Redshift).
  • `Transform`: Applies transformations using SQL within the warehouse.

💡 ELT Use Cases

  • Cloud data warehouses (Snowflake, BigQuery)
  • Modern ELT stacks: dbt for in-warehouse transformation, Airflow for orchestration, Fivetran for extract and load
  • When you need flexibility and scalability for complex transformations

-- Extract and load raw orders
COPY orders FROM 's3://bucket/orders.csv';

-- Transform using SQL inside the warehouse
CREATE MATERIALIZED VIEW analytics.order_metrics AS (
  SELECT 
    user_id,
    order_date,
    CASE WHEN status = 'COMPLETED' THEN 1 ELSE 0 END as completed
  FROM orders
);

💡 ETL vs ELT: Key Differences

  • **Data Transformation Location**: ETL transforms data in a staging layer before loading, while ELT transforms inside the warehouse after loading.
  • **Scalability**: ELT is more scalable for large datasets as it leverages cloud warehouses' power.
  • **Latency**: ETL can have higher latency due to multiple stages, while ELT offers faster access to raw data.
  • **Flexibility**: ELT provides greater flexibility for iterative transformations and analytics.

💡 When to Use ETL vs ELT

  • Use **ETL** when:
    - You need strict data governance and validation before loading
    - You work with legacy systems or on-premise infrastructure
    - Transformations are easier to handle in dedicated ETL tools
  • Use **ELT** when:
    - You are building modern cloud-based architectures
    - You need flexibility for iterative analytics and transformations
    - You want to leverage the SQL capabilities of your warehouse

Quiz

Question 1 of 21

What is the central table in a Star Schema called?

  • Dimension Table
  • Fact Table
  • Join Table
  • Normalization Table