Blog‎ > ‎

Data Grouping by Month and Quarter

posted May 30, 2017, 4:47 PM by Matt Himrod   [ updated Sep 21, 2017, 10:52 AM ]
I'm not very good at this. I haven't posted since my "first post" last November. I suppose that's okay because it's my blog... and not yours.

One of the things that I often do in my job is creating reports. I often want to create reports with data broken down by both month and quarter. In the past, I've accomplished this by doing something like this:

SELECT DATE_FORMAT(date_field, '%Y-%m')                    AS yearmonth, 
       CONCAT(YEAR(date_field), '-Q', QUARTER(date_field)) AS yearquarter,
       COUNT(DISTINCT IF(thing_one   = criteria_one,   id, NULL)) AS statistic_one,
       COUNT(DISTINCT IF(thing_two   = criteria_two,   id, NULL)) AS statistic_two,
       COUNT(DISTINCT IF(thing_three = criteria_three, id, NULL)) AS statistic_three
FROM table
GROUP BY yearquarter DESC, yearmonth DESC WITH ROLLUP;

Since I'm not grouping by anything else, I can just assume that any value in yearmonth field is for a month, and when there is a NULL value, I should look at yearquarter. It works rather well and it's terribly difficult to follow in the query logic. However, what if I needed to add other layers of grouping? What if I had group_field_one and group_field_two that I needed to group on both monthly and quarterly? 

SELECT DATE_FORMAT(date_field, '%Y-%m')                    AS yearmonth, 
       CONCAT(YEAR(date_field), '-Q', QUARTER(date_field)) AS yearquarter,
       group_field_one,
       group_field_two,
       COUNT(DISTINCT IF(thing_one   = criteria_one,   id, NULL)) AS statistic_one,
       COUNT(DISTINCT IF(thing_two   = criteria_two,   id, NULL)) AS statistic_two,
       COUNT(DISTINCT IF(thing_three = criteria_three, id, NULL)) AS statistic_three
FROM table
GROUP BY yearquarter DESC, yearmonth DESC, group_field_one, group_field_two WITH ROLLUP;

It looks like it might work the same way, right? Well, no. Instead of giving the results of the grouped fields for both the months and quarters, there's just rollup numbers for the quarters. 

I needed another solution. The most obvious is to run the query once for months and once for quarters, but then I'd have to maintain two versions of the same query and two versions of the underlying code that handles the data as well. (I'll have to make another post entirely about how I'm able to manipulate the month and quarter data using the same code in the underlying program, so perhaps I won't be waiting 6 months for my next post.)

Another option was to wrap the query in some code that inserted MONTH or QUARTER as appropriate. I don't like that because I prefer to be able to copy my query into my query browser application, make changes, and paste it back into my program. I find that I introduce bugs when I have to manipulate placeholders like that. Plus, it's just a pain. 

I needed a way to essentially overlay the month and quarter data within the same query so that I would only have one date field grouping. In order to get both month and quarter grouping in one field, I needed a way to duplicate the data and match it against the months and quarters of interest. The solution feels a bit dirty, but sometimes the right answer feels wrong. Here's what I ended up writing:

SELECT years_quarters.time_period,
       group_field_one,
       group_field_two,
       COUNT(DISTINCT IF(thing_one   = criteria_one,   id, NULL)) AS statistic_one,
       COUNT(DISTINCT IF(thing_two   = criteria_two,   id, NULL)) AS statistic_two,
       COUNT(DISTINCT IF(thing_three = criteria_three, id, NULL)) AS statistic_three
FROM table
INNER JOIN (SELECT DATE_FORMAT(months.m, '%Y-%m') AS time_period
            FROM (SELECT CURDATE() - INTERVAL (months.m) MONTH AS m
                  FROM (SELECT 0 AS m UNION ALL SELECT  1 UNION ALL SELECT  2 UNION ALL SELECT  3 
                                      UNION ALL SELECT  4 UNION ALL SELECT  5 UNION ALL SELECT  6 
                                      UNION ALL SELECT  7 UNION ALL SELECT  8 UNION ALL SELECT  9 
                                      UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12) AS months
                  ) AS months
            UNION ALL 
            SELECT CONCAT(YEAR(quarters.q), '-Q', QUARTER(quarters.q))
            FROM (SELECT CURDATE() - INTERVAL (quarters.q) QUARTER AS q
                  FROM (SELECT 0 AS q UNION ALL SELECT  1 UNION ALL SELECT  2 
                                      UNION ALL SELECT  3 UNION ALL SELECT  4) AS quarters
                  ) AS quarters
            ) AS years_quarters ON CAST(DATE_FORMAT(table.date_field, '%Y-%m') AS CHAR) = years_quarters.time_period OR 
                                   CAST(CONCAT(YEAR(table.date_field), '-Q', QUARTER(table.date_field)) AS CHAR) = years_quarters.time_period
GROUP BY time_period DESC, group_field_one, group_field_two WITH ROLLUP;

I'm either a genius or a monster. Maybe both.

via GIPHY


Let's break this apart into a few components. First, the sub-query. It creates a static list of both the months and quarters of interest:

SELECT DATE_FORMAT(months.m, '%Y-%m') AS time_period
FROM (SELECT CURDATE() - INTERVAL (months.m) MONTH AS m
      FROM (SELECT 0 AS m UNION ALL SELECT  1 UNION ALL SELECT  2 UNION ALL SELECT  3 
                          UNION ALL SELECT  4 UNION ALL SELECT  5 UNION ALL SELECT  6 
                          UNION ALL SELECT  7 UNION ALL SELECT  8 UNION ALL SELECT  9 
                          UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12) AS months
      ) AS months
UNION ALL 
SELECT CONCAT(YEAR(quarters.q), '-Q', QUARTER(quarters.q))
FROM (SELECT CURDATE() - INTERVAL (quarters.q) QUARTER AS q
      FROM (SELECT 0 AS q UNION ALL SELECT  1 UNION ALL SELECT  2 
                          UNION ALL SELECT  3 UNION ALL SELECT  4) AS quarters
      ) AS quarters

The first half handles the months. The sub-query's sub-query starts with a static list of the numbers 0 to 12. We just take the current date and subtract each number in months, then format that into YYYY-MM. The quarters in the second half work similarly except that DATE_FORMAT() doesn't have an option for "quarter", so I have to use CONCAT() and the YEAR() and QUARTER() functions. If I need more months or quarters, I add more static numbers. (or if I need a lot more, there's a CROSS JOIN trick I can use that should get its own post.) 

In the ON clause that ties the sub-query to the rest of the data, I run the original date_field through both the year-month and year-quarter formats. Since I have both in the sub-query's static list, date_field will mach once for each. In my GROUP BY clause, the first field I'm grouping by is time_period. This gives me a grouping by both month and quarter in one field, but more importantly, I get the grouped statistics for the other two fields in the GROUP BY for both the months and the quarters. Win!

Note: If you're wondering why I have the CAST() in there, it's because I got bitten by MySQL's implicit type casting. In one of the queries where I used this technique, I was grouping by years and quarters, not years and months as I'm doing in this example. In that query, instead of using DATE_FORMAT(), I just used YEAR(), which returns an integer. In the ON clause, there was an implicit cast of YYYY-QQ as an INTEGER. This cut off the quarter, and I ended up with just years... no quarters. It took me an embarrassingly long time to figure that one out. 

In the end, I'm not sure that this technique is particularly efficient because the data is essentially doubled in memory before the grouping happens. On the other hand, it does perform better than running the query twice. I have just one query and one set of code that handles the presentation. This is definitely worth keeping in my arsenal, but like I said... genius, monster, or maybe both.
Comments