This blog has moved to Medium

Subscribe via email


Posts tagged ‘Databases’

The order of columns in GROUP BY statements matters on mysql

Take a look at this SQL query:

SELECT COUNT(*) AS cnt
FROM products
WHERE ExternalProductId IS NOT NULL
GROUP BY SourceId, ExternalProductId
HAVING cnt > 1

This query scans the products table and find all duplicates, where a dup is defined as having the same (SourceId, ExternalProductId) as another row. Counter to intuition, it turns out that in mysql, the order of the columns in the GROUP BY statement can make a huge deal, performance wise. Let’s assume we have an index on the pair (ExternalProductId, SourceId). The query should be fast, right?

Wrong. It takes 30 minutes on our sample data set (about 30 million rows). An EXPLAIN query, and SHOW PROCESSLIST revealed that mysql was copying the table or index data to a temporary location before starting to process the actual query. This was taking up most of the execution time.

A quick question to Stack Overflow…

It appears the order of the columns makes all the difference in the world. Switching the GROUP BY columns to (ExternalProductId, SourceID) made the query run in place and not copy any temp data whatsoever, resulting in execution time of 30 seconds instead of 30 minutes! I don’t fully understand why mysql takes the column order under consideration – semantically, the order of the GROUP BY columns doesn’t matter, so it’s a matter of a simple optimization to choose the optimal order.