This blog has moved to Medium

Subscribe via email


Posts tagged ‘Optimization’

On the evils of yield

I absolutely love yield. Don’t we all? It simplifies writing enumerations to the point of absurdity. Simplicity, however, is a double-edged sword – I spent the better part of this day debugging a most evil bug, that resulted from over-yielding.

At Delver (now Sears) we have a file-based repository containing millions of items. We try to make things as efficient as possible, and sometime we overdo it. Our sin for today is using IEnumerable a bit too much. This repository was designed to be:

  1. Scalable (within our constraints) – should able to hold several million tasks
  2. Fast
  3. Relatively convenient to use – the user should be able to iterate on it using foreach, for one.

To accomplish 1 and 2, we avoided allocating large in-memory structures because they wouldn’t be able to hold the amount of items we’re talking about. To provide a convenient interface, we used IEnumerable.

Here is a mock-up of the code (for simplicity, it doesn’t use the disk but an in-memory serialized dictionary):

public class PeopleRepository
{
    private readonly Dictionary _serializedPeople = new Dictionary();
 
    public void Save(Person person)
    {
        // this method, as innocent as it looks, make it more difficult to discover the bug. See ahead.
        Save(new[]{person});
    }
 
    public void Save(IEnumerable<Person> people)
    {
        var serializedPeople = from p in people select new {p.ID, p.Serialized};
        foreach (var p in serializedPeople)
            _serializedPeople[p.ID] = p.Serialized;
    }
 
    public IEnumerable<Person> Read(Predicate<Person> predicate)
    {
        foreach (int id in _serializedPeople.Keys)
        {
            var person = new Person(_serializedPeople[id]);
            if (predicate(person))
                yield return person;
        }
    }
}

The bug I tracked was that updates to the repository were not taking place, but instead were simply ignored. The first thing I tried, was writing a simple test:

// Setup
var repository = new PeopleRepository();
repository.Save(new Person(1, " John")); // oops, I put an extra space here
 
// Find &amp; Fix John
var john = repository.Read(p => p.ID == 1).First();
john.Name = john.Name.Trim();
 
// Fix poor John back to the repository
repository.Save(john);
 
// Make sure john is saved properly
john = repository.Read(p => p.ID == 1).First();
if (john.Name != "John")
    throw new Exception();

Sadly, this test passed with flying colors. More debugging revealed the problem happened because both our Read and Save methods returned IEnumerables. It appears that Read() read the items and made the required updates … but … when writing the items back to the repository, it iterated on them.

Let me repeat – we read some items, iterated on them and modified some, saved and thus reiterated. Bam!

Bam

The second iteration didn’t iterate on the modified items – because the internal implementation of Read used a yield statement, there was no actual collection returned. So the second iteration just caused the repository to re-read all the items from disk, and ignore the modified items.

Conclusion: whenever you see methods that return IEnumerables, be suspicious. Odds are it should return a List or Collection. And whatever you do, watch out from feeding that IEnumerable back to the same repository.

Here is a final test that almost reproduces the problem. It crashes with a CollectionWasModified exception, while my actual test & code just silently failed (because the repository I mocked up here doesn’t save to the disk, but rather keep everything in-memory).

// Setup
var repository = new PeopleRepository();
repository.Save(new Person(1, " John")); // oops, I saved a space in front of John
 
// Let's read and fix all people starting with a space
var people = repository.Read(p => p.Name.StartsWith(" "));
foreach (var person in people)
    person.Name = person.Name.Trim();
 
// store the modified points back
repository.Save(people);

Don’t be silly, turn Debug off

The number 2 rule of performance testing is “never test in Debug mode” (rule number 1 is “don’t optimize what you don’t need“).

Well, I was reminded today this is not limited to compiling your application in Release mode. A process I’m developing showed about 10 times worse performance than our initial testing showed. I began to think about optimizations and introducing parallelism, when a I stumbled on the real answer:

log4net logging level was on “Debug”. D’oh!

Never rely on tail recursion elimination

Corollary: Never write a recursively method with an unbounded stack depth.

I previously posted a recursive implementation of Factorial, and asked the readers “not to bother saying it’s inefficient because of recursion, since it will probably be optimized away anyway”. Not true.

I don’t really understand why (just posted a question to StackOverflow), but C# does not generally perform tail recursion elimination (at least as my measly test show).

The following method can be easily optimized, both to work faster, consume less space and not to cause a StackOverflowException … but it is not optimized by the compiler (verified by Reflector and the fact a StackOverflowException was indeed thrown).

private static void Foo(int i)
{
    if (i == 1000000)
        return;
 
    if (i % 100 == 0)
        Console.WriteLine(i);
 
    Foo(i+1);
}

So … if for some reason you prefer writing recursive methods over using loops, do it only where performance is not a consideration and you have a good understanding of the maximal stack depth.

Avoid expensive ToString() calls in log4net

premature optimization is the root of all evil.” (Donald Knuth)

And, if I might add: “ugly optimization is not so great either”.

We have some classes with rather expensive ToString() methods in our code base. When used in logs (specifically, log4net), we incured the performance penality even when the logging level (Debug/Info/Warn/…) was too low, and the log wasn’t even printed. In other words, ToString() was called before the log method, in order to build the single string that is the input parameter to Logger.Log().

Our ugly optimization was to add checks before every log line, as in:

if (logger.IsDebugEnabled)
   logger.Debug("Calling " + someObjectWithExpensiveToString);

A better solution, which I just tested to work, is to use the …Format() family of methods. This way, you pass objects and not strings, which are then only converted to strings if there is a need.

logger.DebugFormat("Calling {0}", someObjectWithExpensiveToString);

The following test assures me it actually works:

ILog logger = LogManager.GetLogger("Foo");
ToStringCounter counter = new ToStringCounter();
Assert.AreEqual(0, counter.ToStrings);
Assert.IsFalse(logger.IsDebugEnabled);
 
logger.DebugFormat("Expecting zero ToString() calls: {0}", counter);
Assert.AreEqual(0, counter.ToStrings);

Where ToStringCounter is simply a class that counts the number of calls to its ToString() method.