This blog has moved to Medium

Subscribe via email

Archive for December 2010

The order of columns in GROUP BY statements matters on mysql

Take a look at this SQL query:

FROM products
WHERE ExternalProductId IS NOT NULL
GROUP BY SourceId, ExternalProductId
HAVING cnt > 1

This query scans the products table and find all duplicates, where a dup is defined as having the same (SourceId, ExternalProductId) as another row. Counter to intuition, it turns out that in mysql, the order of the columns in the GROUP BY statement can make a huge deal, performance wise. Let’s assume we have an index on the pair (ExternalProductId, SourceId). The query should be fast, right?

Wrong. It takes 30 minutes on our sample data set (about 30 million rows). An EXPLAIN query, and SHOW PROCESSLIST revealed that mysql was copying the table or index data to a temporary location before starting to process the actual query. This was taking up most of the execution time.

A quick question to Stack Overflow…

It appears the order of the columns makes all the difference in the world. Switching the GROUP BY columns to (ExternalProductId, SourceID) made the query run in place and not copy any temp data whatsoever, resulting in execution time of 30 seconds instead of 30 minutes! I don’t fully understand why mysql takes the column order under consideration – semantically, the order of the GROUP BY columns doesn’t matter, so it’s a matter of a simple optimization to choose the optimal order.

Baraza – a failed Google experiment at a Q&A site

Learning from failures is important, so I invite you to take a look at Google Baraza. It is “a Q&A site designed for Africans“, whatever that means.

From the looks of it, it seems to be a clone of the Stack Overflow / OSQA / Quora models. What’s unique about it? Nothing, as far as I can tell. My question about this got little informative responses (the most pathetic response was “Because it’s Google, it’s cool.”).

The site seems plagued with questions such as where can we find girls to sex, and How can i handle a nagging wife.

Every startup or project should be treated as an experiment. At the start, you think you have “The Spark”, that undefinable quality that will make your project a success. Most startups are copies or combinations of existing ideas, there’s nothing wrong with that. Before Google, there was Alta Vista, and before Facebook, MySpace. But as we know, 90% of all startups are destined to fail. Better to fail early than lumber around stuck in a limbo forever. If I were leading the Google Baraza team now, I would decide now is the correct time to fail.

Java Puzzle – spot the bad code

What’s the most important fault in the following java code (thanks to Roman for both writing and finding the bug :))?

public class Worker extends Runnable {
    public void run() {
            synchronized (emailMessages){
                try {
                }catch (InterruptedException e) {
                    throw new RuntimeException(e);
                } finally{