📐Aggregations and Grouping

This module covers GROUP BY and HAVING, ROLLUP, CUBE, and GROUPING SETS.

📘 GROUP BY and HAVING

The `GROUP BY` clause is used in SQL to group rows that have the same values in specified columns. This allows you to perform aggregate functions like counting, summing, averaging, etc., on these groups of rows.

Basic GROUP BY Syntax

SELECT column1, COUNT(*) 
FROM table_name
GROUP BY column1;

The `HAVING` clause is used to filter groups based on a condition. It's similar to the `WHERE` clause but applies to aggregated data.

SELECT column1, COUNT(*) 
FROM table_name
GROUP BY column1
HAVING COUNT(*) > 10;

💡 Key Concepts

  • `GROUP BY` groups rows based on specified columns.
  • Only grouped columns and aggregate functions can be in the `SELECT` statement.
  • Use `HAVING` to filter groups after aggregation.

Best Practices

  • Always include columns used in the `GROUP BY` clause in your `SELECT` statement.
  • Use aggregate functions wisely to avoid performance issues.
  • Test queries with smaller datasets before applying them to large tables.

Common Mistakes

  • Using non-grouped columns in the `SELECT` statement.
  • Forgetting to use `HAVING` instead of `WHERE` for aggregate conditions.
  • Not ordering results when it makes interpretation easier.

💡 Advanced Grouping Techniques

You can group by multiple columns or even expressions. For example, grouping by a substring of a column.

SELECT SUBSTRING(column1, 1, 3), COUNT(*) 
FROM table_name
GROUP BY SUBSTRING(column1, 1, 3);

💡 Aggregate Functions

  • `COUNT(*)` - Returns the number of rows in each group.
  • `SUM(column)` - Sums all values in a column for each group.
  • `AVG(column)` - Calculates the average value in a column for each group.
  • `MIN(column)` - Finds the smallest value in a column for each group.
  • `MAX(column)` - Finds the largest value in a column for each group.

Practical Example

Let's say you have an `orders` table and want to find the total sales per customer.

SELECT customer_id, SUM(amount) AS total_sales
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 1000;

📘 ROLLUP, CUBE, and GROUPING SETS

In advanced SQL aggregations, particularly in OLAP (Online Analytical Processing) scenarios, we often need to generate subtotals and totals across multiple dimensions. This is where **ROLLUP**, **CUBE**, and **GROUPING SETS** come into play. These powerful tools allow you to create complex hierarchical groupings and summarize data efficiently.

Understanding ROLLUP

**ROLLUP** generates subtotals for the specified grouping columns, as well as a grand total. It is particularly useful when you need to create hierarchical groupings and see totals at each level.

SELECT region, year, SUM(sales) AS total_sales
FROM sales_data
GROUP BY ROLLUP(region, year);

Understanding CUBE

**CUBE** generates all possible combinations of the specified grouping columns. This includes individual groupings, pairwise combinations, and the grand total. Use **CUBE** when you need to analyze data across multiple dimensions simultaneously.

SELECT region, product_category, SUM(sales) AS total_sales
FROM sales_data
GROUP BY CUBE(region, product_category);

Understanding GROUPING SETS

**GROUPING SETS** allow you to explicitly define which groupings should be included in the result. This provides precise control over the output, making it easier to customize your aggregations.

SELECT region, product_category, SUM(sales) AS total_sales
FROM sales_data
GROUP BY GROUPING SETS(
  (region),
  (product_category)
);

💡 Key Differences Between ROLLUP, CUBE, and GROUPING SETS

  • `ROLLUP` generates subtotals hierarchically - it creates totals for each level of grouping plus a grand total.
  • `CUBE` generates all possible combinations of the specified groupings, which can result in more rows than `ROLLUP` or `GROUPING SETS`.
  • `GROUPING SETS` provides precise control over exactly which groupings to include, making it the most flexible option.

💡 Best Practices for Using these Functions

  • Always include `GROUPING_ID()` when using these functions to better understand the grouping structure.
  • Use `ROLLUP` when you need hierarchical totals.
  • Use `CUBE` when you need all possible combinations of groupings.
  • Use `GROUPING SETS` for precise control over the aggregation structure.

Common Mistakes to Avoid

  • Don't use `CUBE` when you only need hierarchical totals - it will generate more data than needed.
  • Don't forget to handle null values in grouping columns, as they can affect the results.
  • Avoid using overly complex grouping sets that are hard to interpret.

💡 Practical Applications and Examples

SELECT region, year, product_category,
       SUM(sales) AS total_sales
FROM sales_data
GROUP BY CUBE(region, year, product_category)
ORDER BY GROUPING_ID();

This query will generate all possible combinations of the three dimensions and include a `GROUPING_ID()` for easier interpretation. The results will show detailed sales data along with various subtotal and total aggregations.

Quiz

Question 1 of 11

When should you use the `GROUP BY` clause?

  • To filter rows before aggregation
  • To group rows that have the same values in specified columns
  • To sort the final results of a query
  • To join multiple tables together