The order of columns in GROUP BY statements matters on mysql
Take a look at this SQL query:
SELECT COUNT(*) AS cnt FROM products WHERE ExternalProductId IS NOT NULL GROUP BY SourceId, ExternalProductId HAVING cnt > 1 |
This query scans the products table and find all duplicates, where a dup is defined as having the same (SourceId, ExternalProductId) as another row. Counter to intuition, it turns out that in mysql, the order of the columns in the GROUP BY statement can make a huge deal, performance wise. Let’s assume we have an index on the pair (ExternalProductId, SourceId). The query should be fast, right?
Wrong. It takes 30 minutes on our sample data set (about 30 million rows). An EXPLAIN query, and SHOW PROCESSLIST revealed that mysql was copying the table or index data to a temporary location before starting to process the actual query. This was taking up most of the execution time.
A quick question to Stack Overflow…
It appears the order of the columns makes all the difference in the world. Switching the GROUP BY columns to (ExternalProductId, SourceID) made the query run in place and not copy any temp data whatsoever, resulting in execution time of 30 seconds instead of 30 minutes! I don’t fully understand why mysql takes the column order under consideration – semantically, the order of the GROUP BY columns doesn’t matter, so it’s a matter of a simple optimization to choose the optimal order.