Unlock Massive Savings in Your Analytics Pipeline!

Kyle Dufrane
4 min readOct 22, 2023

--

When creating analytical pipelines, performance and cost should always be at the forefront of an engineer’s brain. Generally, there is a tradeoff between the two, but what if I proposed a solution where both are positively impacted?

I give you advanced group by functions. In this article, we will cover the impact and use cases of Grouping Sets and Cube. These powerful functions are extremely valuable when creating datasets that will be ingested into dashboards. In short, the end product allows your dashboarding solution to query the dataset only using WHERE clauses and never needing to apply aggregations on the fly, which can significantly impact dashboard performance and reduce cost.

So, for the meat and potatoes, what do Grouping Sets and Cube accomplish? These functions allow for one or more aggregations to be applied across all possible combinations of the supplied dimensions within a dataset in a single table scan. At a high level, think of creating a few aggregations across a dataset, storing them in multiple CTE’s, and finally unioning the CTE’s to create a single dataset. That is exactly how the Grouping Sets and Cube functions output their datasets. The difference between these two use cases is that multiple CTE’s equal multiple table scans, increasing costs and degrading performance.

Before jumping into SQL code, let’s explore the differences between these two arguments and their potential drawback. Grouping Sets allow an engineer to specify the dimensions they want aggregations for, while Cube takes all possible combinations of the dimensions supplied. In short, Grouping Sets offer more flexibility. I’ll outline this difference specifically in the below examples.

I mentioned that there might be a specific drawback, but that will be on the dashboarding solution. In my use case, I use Apache Superset and will outline a mild hiccup. Let us dive back into the example of multiple CTE’s with various aggregations unioned to form a single dataset. For example purposes, here are the datasets:

Dataset A: col_a, col_b, col_c, agg_1

Dataset B: col_a, col_c, agg_2

Dataset C: col_b, col_c, agg_3

In this example, we can see that Dataset A contains all dimensions while Dataset B doesn’t include col_b and Dataset C doesn’t include col_a. When applying a union, you must add the missing dimensions to the appropriate CTE’s which will be filled with null values. Once the union has been completed, these null values are categorical variables and must be filterable from the dashboard application (An easy fix here is to COALESCE the null values to something readable like ‘ALL’).

Now, how does this impact Apache Superset? In Superset, you cannot redact the ‘Select All’ selection from a multi-select drop-down filter. This throws in an additional step to train end users when using your dashboard and allows for human error. Keep this in the back of your mind; I’ll dive back into this after some code snippets, which I think will help you wrap your head around what is happening.

That was a lot of verbiage, so don’t feel discouraged if you’re not fully grasping these concepts; the examples below show use cases with expected outputs.

One final note: Facebook, Netflix, and AirBnB use these functions in their production workflows and have some data ingested via Apache Superset. From my point of view, the above drawback is just a need to make end users aware and talk with the Superset folks to allow the ‘Select All’ to be deselected from multi-select dropdowns.

Ok, now for some SQL code. Let’s build out an example dataset and start with the basics.

Example Dataset Columns: timestamp_a, product, department, quantity, gross_revenue

For the first example, let’s create a query that will aggregate the sum of total products sold (quantity) by timestamp_a by product and timestamp_a by department. That query using Grouping Sets would look like:

SELECT 
timestamp_a,
COALESCE(department, 'ALL') AS department,
COALESCE(product, 'ALL') AS product,
sum(quantity)
FROM
my_dataset
GROUP BY
GROUPING SETS (
(timestamp_a, product),
(timestamp_a, department)
)

Another powerful aspect of this query is that you can feed this data into multiple graphs in your dashboarding tool (product/department visualization) using one dataset, one query, and one table scan.

Next, Cube:

SELECT
timestamp_a,
department,
product,
sum(quantity)
FROM
my_dataset
GROUP BY
timestamp_a,
CUBE(
product,
department
)

Ahhh the POWER! Cube here will return three unique aggregations being

  1. timestamp_a, product, sum
  2. timestamp_a, department, sum
  3. timestmap_a, department, product, sum

As seen here, CUBE returns all possible combinations of dimensions supplied within the function.

I hope as you see these use cases, these functions are becoming clearer. Cube returns all possible combinations of the supplied dimensions. Grouping Sets accomplishes the same task (with a lot more code) but allows for flexibility of aggregations based on specific dimensions.

In summary, these datasets allow for performant dashboards due to only needing a WHERE clause wrapper on the query and not needing to perform aggregations on the fly. Depending on the scope of your company analytics, you should see a significant cost reduction in queries, allowing for more money in the bank!

If you have any questions on this article, please don’t hesitate to connect!

--

--

Kyle Dufrane
Kyle Dufrane

Written by Kyle Dufrane

Data Science | Machine Learning | Big Data

No responses yet