Blog

Data Grouping by Month and Quarter

posted May 30, 2017, 4:47 PM by Matt Himrod   [ updated Sep 21, 2017, 10:52 AM ]

I'm not very good at this. I haven't posted since my "first post" last November. I suppose that's okay because it's my blog... and not yours.

One of the things that I often do in my job is creating reports. I often want to create reports with data broken down by both month and quarter. In the past, I've accomplished this by doing something like this:

SELECT DATE_FORMAT(date_field, '%Y-%m')                    AS yearmonth, 
       CONCAT(YEAR(date_field), '-Q', QUARTER(date_field)) AS yearquarter,
       COUNT(DISTINCT IF(thing_one   = criteria_one,   id, NULL)) AS statistic_one,
       COUNT(DISTINCT IF(thing_two   = criteria_two,   id, NULL)) AS statistic_two,
       COUNT(DISTINCT IF(thing_three = criteria_three, id, NULL)) AS statistic_three
FROM table
GROUP BY yearquarter DESC, yearmonth DESC WITH ROLLUP;

Since I'm not grouping by anything else, I can just assume that any value in yearmonth field is for a month, and when there is a NULL value, I should look at yearquarter. It works rather well and it's terribly difficult to follow in the query logic. However, what if I needed to add other layers of grouping? What if I had group_field_one and group_field_two that I needed to group on both monthly and quarterly? 

SELECT DATE_FORMAT(date_field, '%Y-%m')                    AS yearmonth, 
       CONCAT(YEAR(date_field), '-Q', QUARTER(date_field)) AS yearquarter,
       group_field_one,
       group_field_two,
       COUNT(DISTINCT IF(thing_one   = criteria_one,   id, NULL)) AS statistic_one,
       COUNT(DISTINCT IF(thing_two   = criteria_two,   id, NULL)) AS statistic_two,
       COUNT(DISTINCT IF(thing_three = criteria_three, id, NULL)) AS statistic_three
FROM table
GROUP BY yearquarter DESC, yearmonth DESC, group_field_one, group_field_two WITH ROLLUP;

It looks like it might work the same way, right? Well, no. Instead of giving the results of the grouped fields for both the months and quarters, there's just rollup numbers for the quarters. 

I needed another solution. The most obvious is to run the query once for months and once for quarters, but then I'd have to maintain two versions of the same query and two versions of the underlying code that handles the data as well. (I'll have to make another post entirely about how I'm able to manipulate the month and quarter data using the same code in the underlying program, so perhaps I won't be waiting 6 months for my next post.)

Another option was to wrap the query in some code that inserted MONTH or QUARTER as appropriate. I don't like that because I prefer to be able to copy my query into my query browser application, make changes, and paste it back into my program. I find that I introduce bugs when I have to manipulate placeholders like that. Plus, it's just a pain. 

I needed a way to essentially overlay the month and quarter data within the same query so that I would only have one date field grouping. In order to get both month and quarter grouping in one field, I needed a way to duplicate the data and match it against the months and quarters of interest. The solution feels a bit dirty, but sometimes the right answer feels wrong. Here's what I ended up writing:

SELECT years_quarters.time_period,
       group_field_one,
       group_field_two,
       COUNT(DISTINCT IF(thing_one   = criteria_one,   id, NULL)) AS statistic_one,
       COUNT(DISTINCT IF(thing_two   = criteria_two,   id, NULL)) AS statistic_two,
       COUNT(DISTINCT IF(thing_three = criteria_three, id, NULL)) AS statistic_three
FROM table
INNER JOIN (SELECT DATE_FORMAT(months.m, '%Y-%m') AS time_period
            FROM (SELECT CURDATE() - INTERVAL (months.m) MONTH AS m
                  FROM (SELECT 0 AS m UNION ALL SELECT  1 UNION ALL SELECT  2 UNION ALL SELECT  3 
                                      UNION ALL SELECT  4 UNION ALL SELECT  5 UNION ALL SELECT  6 
                                      UNION ALL SELECT  7 UNION ALL SELECT  8 UNION ALL SELECT  9 
                                      UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12) AS months
                  ) AS months
            UNION ALL 
            SELECT CONCAT(YEAR(quarters.q), '-Q', QUARTER(quarters.q))
            FROM (SELECT CURDATE() - INTERVAL (quarters.q) QUARTER AS q
                  FROM (SELECT 0 AS q UNION ALL SELECT  1 UNION ALL SELECT  2 
                                      UNION ALL SELECT  3 UNION ALL SELECT  4) AS quarters
                  ) AS quarters
            ) AS years_quarters ON CAST(DATE_FORMAT(table.date_field, '%Y-%m') AS CHAR) = years_quarters.time_period OR 
                                   CAST(CONCAT(YEAR(table.date_field), '-Q', QUARTER(table.date_field)) AS CHAR) = years_quarters.time_period
GROUP BY time_period DESC, group_field_one, group_field_two WITH ROLLUP;

I'm either a genius or a monster. Maybe both.

via GIPHY


Let's break this apart into a few components. First, the sub-query. It creates a static list of both the months and quarters of interest:

SELECT DATE_FORMAT(months.m, '%Y-%m') AS time_period
FROM (SELECT CURDATE() - INTERVAL (months.m) MONTH AS m
      FROM (SELECT 0 AS m UNION ALL SELECT  1 UNION ALL SELECT  2 UNION ALL SELECT  3 
                          UNION ALL SELECT  4 UNION ALL SELECT  5 UNION ALL SELECT  6 
                          UNION ALL SELECT  7 UNION ALL SELECT  8 UNION ALL SELECT  9 
                          UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12) AS months
      ) AS months
UNION ALL 
SELECT CONCAT(YEAR(quarters.q), '-Q', QUARTER(quarters.q))
FROM (SELECT CURDATE() - INTERVAL (quarters.q) QUARTER AS q
      FROM (SELECT 0 AS q UNION ALL SELECT  1 UNION ALL SELECT  2 
                          UNION ALL SELECT  3 UNION ALL SELECT  4) AS quarters
      ) AS quarters

The first half handles the months. The sub-query's sub-query starts with a static list of the numbers 0 to 12. We just take the current date and subtract each number in months, then format that into YYYY-MM. The quarters in the second half work similarly except that DATE_FORMAT() doesn't have an option for "quarter", so I have to use CONCAT() and the YEAR() and QUARTER() functions. If I need more months or quarters, I add more static numbers. (or if I need a lot more, there's a CROSS JOIN trick I can use that should get its own post.) 

In the ON clause that ties the sub-query to the rest of the data, I run the original date_field through both the year-month and year-quarter formats. Since I have both in the sub-query's static list, date_field will mach once for each. In my GROUP BY clause, the first field I'm grouping by is time_period. This gives me a grouping by both month and quarter in one field, but more importantly, I get the grouped statistics for the other two fields in the GROUP BY for both the months and the quarters. Win!

Note: If you're wondering why I have the CAST() in there, it's because I got bitten by MySQL's implicit type casting. In one of the queries where I used this technique, I was grouping by years and quarters, not years and months as I'm doing in this example. In that query, instead of using DATE_FORMAT(), I just used YEAR(), which returns an integer. In the ON clause, there was an implicit cast of YYYY-QQ as an INTEGER. This cut off the quarter, and I ended up with just years... no quarters. It took me an embarrassingly long time to figure that one out. 

In the end, I'm not sure that this technique is particularly efficient because the data is essentially doubled in memory before the grouping happens. On the other hand, it does perform better than running the query twice. I have just one query and one set of code that handles the presentation. This is definitely worth keeping in my arsenal, but like I said... genius, monster, or maybe both.

Query Optimization

posted Nov 2, 2016, 6:15 PM by Matt Himrod   [ updated Nov 2, 2016, 6:21 PM ]

So... who makes a blog and leaves it empty for months... apparently this guy.

I struggle with blogging because I don't often think that noteworthy events occur, but last week I had something that I wanted to write about. Query optimization. Specifically optimizing a query in a way that surprised me.

I've worked with MySQL for nearly 9 years. I've gotten into the habit of trying to write queries with efficiency in the back of my mind. We have some rather large customer databases, and I'm impatient, so if I can write a query that I don't have to wait for, I win. 

Unfortunately with this example, I can't really give specifics. I was writing a query against our largest customer's database in the database's largest table. It seemed rather simple:

SELECT MAX(indexed_date_field)
FROM very_big_table
WHERE indexed_foreign_key > 100 AND 
      unindexed_field = 'enumerated_value';

This is a fast query in the average customer's database, but in this one, I waited for over 10 minutes before giving up. Running it with EXPLAIN yielded the culprit: the query plan included inspecting over 10 million records! I couldn't leave anything out of this query. I added a constraint limiting the indexed_date_field to more recent dates, but the query plan was still examining millions of records. On a whim, I said to myself, "I could do this by hand faster by sorting by the indexed_date_field and finding the first one where unindexed_field equals the enumerated_value! So, just for kicks, I rewrote it like this:

SELECT indexed_date_field
FROM very_big_table
WHERE indexed_foreign_key > 100 AND 
      unindexed_field = 'enumerated_value'
ORDER BY indexed_date_field DESC
LIMIT 1

It ran in .2 seconds. The query plan examined 2 records. It make sense because it sorts with the index and then finds the first record meeting the criteria in the WHERE clause. I've vaguely recalled reading about this sort of query optimization but it's something I always forget about. Still...

MIND = BLOWN

via GIPHY




I don't think I'll forget this one anytime soon.

1-2 of 2