This blog has moved to Medium

Subscribe via email


The order of columns in GROUP BY statements matters on mysql

Take a look at this SQL query:

SELECT COUNT(*) AS cnt
FROM products
WHERE ExternalProductId IS NOT NULL
GROUP BY SourceId, ExternalProductId
HAVING cnt > 1

This query scans the products table and find all duplicates, where a dup is defined as having the same (SourceId, ExternalProductId) as another row. Counter to intuition, it turns out that in mysql, the order of the columns in the GROUP BY statement can make a huge deal, performance wise. Let’s assume we have an index on the pair (ExternalProductId, SourceId). The query should be fast, right?

Wrong. It takes 30 minutes on our sample data set (about 30 million rows). An EXPLAIN query, and SHOW PROCESSLIST revealed that mysql was copying the table or index data to a temporary location before starting to process the actual query. This was taking up most of the execution time.

A quick question to Stack Overflow…

It appears the order of the columns makes all the difference in the world. Switching the GROUP BY columns to (ExternalProductId, SourceID) made the query run in place and not copy any temp data whatsoever, resulting in execution time of 30 seconds instead of 30 minutes! I don’t fully understand why mysql takes the column order under consideration – semantically, the order of the GROUP BY columns doesn’t matter, so it’s a matter of a simple optimization to choose the optimal order.

4 Comments

  1. Ken Egozi:

    Perhaps it is a matter of the indices defined for that table. MySQL is not known for its Query Analyser. A proper RDBMS will supposedly be smarted about this.

  2. ripper234:

    @Ken – tell that to Facebook 🙂

    But yeah, this looks like a bug in the query analyzer – Shlomo plans to file it as a bug report.

  3. Ofer Egozi:

    Isn’t it the same as making the same order change for a simple WHERE clause? in your case:
    SELECT COUNT(*) AS cnt FROM products WHERE SourceId=1 AND ExternalProductId IS NOT NULL
    versus
    SELECT COUNT(*) AS cnt FROM products WHERE ExternalProductId IS NOT NULL and SourceId=1
    that will also make all of the difference between using the index and not using it, since the index is used in the order it was defined.
    http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html

  4. ripper234:

    No, when used in a where clause, the order of the items does not matter.

    Try running EXPLAIN on these queries.