Archive for March 2009

A survey of open APIs

I did some research this past week on a few “Open APIs” and wanted to share my findings here. This is just a summary of other, more comprehensive sources, so if you have any comments or corrections, I’d love to hear them. I chose to present my findings as a list of concepts:

The Open Stack

This is an emerging stack of open protocols that will help build socially-connected websites. I will explore the key elements (I take XRDS-Simple to be rather low-level and uninteresting).

OpenID

  • A single sign-on protocol (helps users avoid creating yet another username/password pair).
  • Essential workflow (here in more detail; a redirect-URL sketch follows this list):
    1. You want to log on to StackOverflow, which is an OpenID Consumer.
    2. Instead of opening yet another account, you are given an alternative (almost no site relies solely on OpenID).
    3. You either enter a URL (far less user-friendly) or select from a fixed subset of Providers.
    4. You are redirected to that URL, enter your credentials there (only if you are not already logged in), and are asked to allow StackOverflow access to your OpenID identity.
    5. Depending on your OpenID provider, you can set for how long this access is granted.
    6. Then you are redirected back to StackOverflow with a token (encoded in the URL), which is used to grant you access.
  • OpenID is still mostly just a login method today (it doesn’t convey extra information beyond a logon grant) – although I did see some evidence to the contrary when I opened an OpenID account at VeriSign: it seems websites can request more information from an OpenID provider, such as email, nickname and full name.
  • Microsoft, Yahoo and Google are now OpenID providers (in addition to more veteran providers). This is significant because it doesn’t force users to go to yet another place to open an OpenID account – they can just use an existing webmail account.
  • Facebook just joined the OpenID Foundation board (eBay is already there). It looks like it will become an OpenID provider, and maybe a Consumer as well.
  • Users are still not comfortable with it, or are unable to figure it out.
  • However, here is an interesting post about using email addresses as OpenIDs. When this happens, it might help bring in the users.
  • Here is a recipe for enabling OpenID on a website (as a Consumer). A sample Provider implementation for .NET is here.
  • Of course, for your WordPress blog the process is much easier – just install an OpenID plugin.
    • Although, I heeded Tomer’s advice and instead used OpenID delegation – this means I use my blog’s URL as my OpenID, but it just redirects to a more serious OpenID provider. OpenID is (or will be) the keys to your world – better guard them safely.
  • Right now, the value of OpenID to a new website is still limited:
    1. It will not eliminate the need for us to implement user logon.
    2. There is not much value in being an OpenID Provider – it’s a nice-to-have feature, but in many cases not worth the cost (at least until you get a large user base).
    3. It is not sufficient, on its own, to get access to complete profile information about users (and use that data to help users interact with your website) – but it can be complemented with other technologies.
  • Has a nice usage graph.
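To make step 4 of the workflow above concrete, here is a minimal C# sketch that builds an OpenID 2.0 checkid_setup redirect URL. The parameter names come from the OpenID 2.0 spec, but the endpoints here are made-up placeholders – a real Consumer discovers the provider endpoint from the user’s identifier, and should use a library (such as the .NET implementations linked above) rather than hand-rolling the protocol.

using System;

class OpenIdRedirectSketch
{
    static void Main()
    {
        // Hypothetical endpoints – in practice the provider URL is
        // discovered from the identifier the user typed in.
        string providerEndpoint = "https://openid.example.com/auth";
        string claimedId = "https://myblog.example.com/";
        string returnTo = "https://stackoverflow.com/openid/callback";

        // Standard OpenID 2.0 checkid_setup parameters.
        string redirectUrl = providerEndpoint
            + "?openid.ns=" + Uri.EscapeDataString("http://specs.openid.net/auth/2.0")
            + "&openid.mode=checkid_setup"
            + "&openid.claimed_id=" + Uri.EscapeDataString(claimedId)
            + "&openid.identity=" + Uri.EscapeDataString(claimedId)
            + "&openid.return_to=" + Uri.EscapeDataString(returnTo)
            + "&openid.realm=" + Uri.EscapeDataString("https://stackoverflow.com/");

        // Step 4: send the user's browser here; the provider later
        // redirects back to openid.return_to with signed response fields.
        Console.WriteLine(redirectUrl);
    }
}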

Google OpenSocial

  • A set of social APIs developed by Google (get a list of friends, publish stories).
  • Implemented by hi5, LinkedIn, MySpace, orkut, Yahoo!, Friendster (full list at the OpenSocial wiki)… but not Facebook (yet). All of these are OpenSocial Containers – data warehouses that answer OpenSocial queries.
  • OpenSocial is Open. The data is not stored on Google – the providers just conform to the OpenSocial API which reflects the data stored at each social network.
  • Had a theoretical reach (not usage!) of 700M users as of Nov ’08.
  • To serve OpenSocial, a site implements the Container API on top of its own data.
  • Client implementations are available in JavaScript, Java, .NET, PHP and more… (a hedged REST sketch follows this list)
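As a taste of the client side, here is a hedged C# sketch that fetches a friends list over the OpenSocial RESTful protocol (/people/{id}/@friends). The container URL and user ID are made-up placeholders, and real containers require signed (OAuth) requests, which are omitted here.

using System;
using System.IO;
using System.Net;

class OpenSocialRestSketch
{
    static void Main()
    {
        // Hypothetical container endpoint and user id – real containers
        // publish their own base URLs and require OAuth-signed requests.
        string baseUrl = "http://opensocial.example.com/rest";
        string userId = "12345";

        // The RESTful protocol models a friends list as /people/{id}/@friends.
        var request = (HttpWebRequest)WebRequest.Create(
            baseUrl + "/people/" + userId + "/@friends");
        request.Accept = "application/json";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // A JSON collection describing the user's friends.
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}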

Google FriendConnect

  • An application that uses OpenSocial to enhance websites with social widgets (Comments, Reviews, …).
  • Is still not widely used (this directory is rather sparse at the moment).
  • Doesn’t seem to be targeted at big sites, rather at small sites/blogs (my impression at least).
  • No programming required to add social features to your site – but you have limited control over the effect.
  • Cool flow I tested: log in to FriendConnect on a website, and you already see in your friends list a buddy from Gmail who’s using FriendConnect on that site.

OAuth

Portable Contacts

Facebook Connect

  • A competitor of the open stack (OpenID + OpenSocial) – gives single sign-on and connects you to your Facebook friends & feed.
  • Uses a popup instead of redirecting to FB (less intimidating for users).
  • Has already been witnessed to boost new user signup.
  • Main Flow:
    1. The user clicks Connect.
    2. A popup (hosted on facebook.com) asks the user to confirm.
    3. The user is shown a Presence Indicator on the target site.
    4. The website can pull the user’s profile & connections.
    5. The website can publish stories to Facebook.
    6. The website can send Facebook hashes of email addresses, and Facebook replies whether it has a user with an identical hash. This can be used to show a user a count of his Facebook buddies that are using the target site (“10 of your friends are using this site, connect to Facebook now”). A hedged hashing sketch follows this list.
  • Example of a FB connect-enabled site – TheRunAround.
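Here is a minimal sketch of the hashing in step 6 above. I’m showing a plain MD5 of the normalized address for illustration; the exact format Facebook expects (reportedly a CRC32 and an MD5 joined by an underscore) is an assumption you should verify against their documentation.

using System.Security.Cryptography;
using System.Text;

static class EmailHashSketch
{
    // Illustration only: MD5 of a normalized email address. The exact
    // wire format Facebook expects is an assumption – check their docs.
    public static string HashEmail(string email)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(email.Trim().ToLowerInvariant());
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(bytes);
            var sb = new StringBuilder();
            foreach (byte b in hash)
                sb.Append(b.ToString("x2")); // lowercase hex
            return sb.ToString();
        }
    }
}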

RPX / JanRain

  • JanRain – An early OpenID player (in the market since 2005, one of the founders of OpenID).
  • RPXNow – abstracts away Single Sign-on, supports both Facebook Connect and major OpenID players.
  • Here is a blog post about why not to use it (vendor lock-in, single point of failure, too little benefit).
    • However, check out the counter arguments in the comments.
  • RPX gets social profile data from Google, MySpace and Facebook (a hedged auth_info sketch follows this list).
    • This includes interesting profile fields like email/gender/birthday/photo.
  • The API also hints at getting an array of friends from relevant services.
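To show what “abstracting away single sign-on” looks like in practice, here is a hedged C# sketch of calling RPX’s auth_info endpoint after a user signs in. The endpoint URL and parameter names follow RPX’s documentation as I recall it, and the API key and token are placeholders – treat the details as assumptions to verify.

using System;
using System.IO;
using System.Net;
using System.Text;

class RpxAuthInfoSketch
{
    static void Main()
    {
        // Placeholders: RPX posts a one-time token to your site after login;
        // you exchange it (plus your API key) for the user's profile.
        string apiKey = "YOUR_RPX_API_KEY";
        string token = "TOKEN_POSTED_BY_RPX";

        var request = (HttpWebRequest)WebRequest.Create(
            "https://rpxnow.com/api/v2/auth_info");
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";

        byte[] body = Encoding.UTF8.GetBytes(
            "apiKey=" + Uri.EscapeDataString(apiKey) +
            "&token=" + Uri.EscapeDataString(token));
        request.ContentLength = body.Length;
        using (Stream s = request.GetRequestStream())
            s.Write(body, 0, body.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // JSON with profile fields such as email, gender and photo.
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}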

That’s it for now – I hope you enjoyed this review. Remember that most of these APIs are very new or just being adopted, so expect changes to most of them. I expect a large API convergence to happen in the next year or two, which will simplify life for those of us building social applications.

Keep fighting the voodoo

Good developers dislike voodoo, but sometimes all the goodwill in the world can’t stop a dark spawn of evil. For example, one of our build machines always bluescreens when running a certain system test. Why? Probably some hardware/driver failure; we’ll look into it.

In the meantime, let’s just ensure the build agent doesn’t get these nasty builds, shall we?

  1. Build Configuration –> Agent Requirements –> Add a requirement for an environment variable –> SHITTY_AGENT (condition = “does not exist”)
  2. Go to the shitty agent and add a system variable named SHITTY_AGENT
  3. Restart the build service:
    1. net stop “TeamCity Build Agent Service”
    2. net start “TeamCity Build Agent Service”

Fewer projects make a lean solution

Bogen just optimized our Visual Studio solution by cutting the number of projects in it from about 60 down to twenty-something. Compile time is blazing fast now – thanks, dude (though this means less time for me to check Google Reader 🙁 )

Cats life cycle

Here is a small flow chart I created to explain the life cycle of our cats.

[Flow chart: Cats Life Cycle]

If you have cats, you may need to make a few adjustments to this diagram (all cats are different).

On the evils of yield

I absolutely love yield. Don’t we all? It simplifies writing enumerations to the point of absurdity. Simplicity, however, is a double-edged sword – I spent the better part of this day debugging a most evil bug that resulted from over-yielding.

At Delver (now Sears) we have a file-based repository containing millions of items. We try to make things as efficient as possible, and sometimes we overdo it. Our sin for today is using IEnumerable a bit too much. This repository was designed to be:

  1. Scalable (within our constraints) – it should be able to hold several million items
  2. Fast
  3. Relatively convenient to use – the user should be able to iterate over it using foreach, for one.

To accomplish 1 and 2, we avoided allocating large in-memory structures, because they wouldn’t be able to hold the number of items we’re talking about. To provide a convenient interface, we used IEnumerable.

Here is a mock-up of the code (for simplicity, it doesn’t use the disk but an in-memory serialized dictionary):

public class PeopleRepository
{
    private readonly Dictionary<int, string> _serializedPeople = new Dictionary<int, string>();
 
    public void Save(Person person)
    {
        // this method, as innocent as it looks, makes it more difficult to discover the bug. See ahead.
        Save(new[]{person});
    }
 
    public void Save(IEnumerable<Person> people)
    {
        var serializedPeople = from p in people select new {p.ID, p.Serialized};
        foreach (var p in serializedPeople)
            _serializedPeople[p.ID] = p.Serialized;
    }
 
    public IEnumerable<Person> Read(Predicate<Person> predicate)
    {
        foreach (int id in _serializedPeople.Keys)
        {
            var person = new Person(_serializedPeople[id]);
            if (predicate(person))
                yield return person;
        }
    }
}

The bug I tracked down was that updates to the repository were not taking place – they were simply being ignored. The first thing I tried was writing a simple test:

// Setup
var repository = new PeopleRepository();
repository.Save(new Person(1, " John")); // oops, I put an extra space here
 
// Find & fix John
var john = repository.Read(p => p.ID == 1).First();
john.Name = john.Name.Trim();
 
// Save poor John back to the repository
repository.Save(john);
 
// Make sure john is saved properly
john = repository.Read(p => p.ID == 1).First();
if (john.Name != "John")
    throw new Exception();

Sadly, this test passed with flying colors. More debugging revealed that the problem happened because both our Read and Save methods work with IEnumerables. Read() read the items and we made the required updates … but … when writing the items back to the repository, Save() iterated over them again.

Let me repeat – we read some items, iterated over them and modified some, then saved – and thus re-iterated. Bam!


The second iteration didn’t iterate over the modified items: because the internal implementation of Read uses a yield statement, no actual collection is ever returned. So the second iteration just caused the repository to re-read all the items from disk and ignore the modifications.

Conclusion: whenever you see a method that returns an IEnumerable, be suspicious. Odds are it should return a List or Collection. And whatever you do, watch out for feeding that IEnumerable back into the same repository.

Here is a final test that almost reproduces the problem. It crashes with an InvalidOperationException (“Collection was modified”), while my actual test & code just failed silently (because the repository I mocked up here doesn’t save to disk, but rather keeps everything in memory).

// Setup
var repository = new PeopleRepository();
repository.Save(new Person(1, " John")); // oops, I saved a space in front of John
 
// Let's read and fix all people starting with a space
var people = repository.Read(p => p.Name.StartsWith(" "));
foreach (var person in people)
    person.Name = person.Name.Trim();
 
// store the modified people back
repository.Save(people);
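The fix, per the conclusion above, is to materialize the enumeration before feeding it back to the repository – for example with LINQ’s ToList():

// ToList() (System.Linq) snapshots the items into a real collection,
// so Save() iterates the snapshot instead of re-running the Read() query.
var people = repository.Read(p => p.Name.StartsWith(" ")).ToList();
foreach (var person in people)
    person.Name = person.Name.Trim();
 
repository.Save(people); // safe: no re-enumeration of the repository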

Efficient multiple totient calculation in C#

For a certain Project Euler problem, I needed a function to calculate the totient of every number in [2..10^7]. Going over them one at a time and factorizing each would take too long, so I used a variant of the prime sieve to speed up the calculation. The code is not heavily documented, but it should be relatively easy to grok once you understand how to calculate the totient of a single number.

The time complexity is O(n log n).

/// <summary>
/// Return an array of all totients in [0..n-1]
/// </summary>
/// <param name="n"></param>
/// <returns></returns>
public static long[] CalcTotients(long n)
{
    long[] divisors = GetDivisors(n);
    long i;
    var phi = new long[n];
    phi[1] = 1;
    for (i = 1; i < n; ++i)
        CalcTotient(i, phi, divisors);
 
    return phi;
}
 
/// <summary>
/// For every integer i in [2..n-1], the result will contain its largest prime factor
/// (later primes in the sieve overwrite earlier ones).
/// </summary>
/// <param name="n"></param>
/// <returns></returns>
private static long[] GetDivisors(long n)
{
    var divisors = new long[n];
    divisors[1] = 1;
    long i;
    for (i = 2; i < n; ++i)
    {
        if (divisors[i] != 0)
            continue;
 
        for (long j = i; j < n; j += i)
            divisors[j] = i;
    }
    return divisors;
}
 
private static long CalcTotient(long i, long[] phi, long[] divisors)
{
    if (phi[i] != 0)
        return phi[i];
 
    long div = divisors[i];
    if (div == i)
    {
        phi[i] = i - 1;
        return phi[i];
    }
 
    long lower = 1;
    int exp = 0;
    while ((i > 1) && (i % div == 0))
    {
        i /= div;
        lower *= div;
        exp++;
    }
    if (i == 1)
    {
        phi[lower] = ((long)Math.Pow(div, exp - 1)) * (div-1);
        return phi[lower];
    }
    phi[i*lower] = CalcTotient(i, phi, divisors) *
                         CalcTotient(lower, phi, divisors);
    return phi[i*lower];
}
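A quick usage sketch (the bound here is an arbitrary small number for illustration) – spot-checking a composite and a prime:

long n = 100; // small illustrative bound
long[] phi = CalcTotients(n);
 
Console.WriteLine(phi[10]); // 4 – the residues 1, 3, 7, 9 are coprime to 10
Console.WriteLine(phi[97]); // 96 – 97 is prime, so phi(97) = 97 - 1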

Euler problem # 206

I just solved my 78th problem from Project Euler. The solution was not complicated – just a little mathematical observation and then brute-force CPU.

I was working on Aya‘s new laptop, and I noticed the C# code took 9 seconds to run in “power saver” mode, and only 4 seconds in “high performance” mode. It turns out that Vista caps the CPU speed at 50% for power saver mode. Sweet.

(This might seem very obvious to some, but I haven’t operated a laptop in several years, so this is new to me – if Windows XP does such tricks, it is far less explicit about them than Vista.)

Don’t be silly, turn Debug off

The number 2 rule of performance testing is “never test in Debug mode” (rule number 1 is “don’t optimize what you don’t need“).

Well, I was reminded today that this is not limited to compiling your application in Release mode. A process I’m developing showed performance about 10 times worse than our initial testing indicated. I began to think about optimizations and introducing parallelism, when I stumbled on the real answer:

The log4net logging level was set to “Debug”. D’oh!
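Besides fixing the configuration, a cheap safeguard is to guard expensive log calls with log4net’s IsDebugEnabled, so the message isn’t even built when the level is Info or above. A minimal sketch (the class and the expensive dump call are hypothetical):

using log4net;

class ExpensiveService
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(ExpensiveService));

    public void DoWork()
    {
        // With the level at Info, this skips building the message entirely.
        if (Log.IsDebugEnabled)
            Log.Debug("State dump: " + BuildHugeStateDump());
    }

    // Hypothetical expensive call, for illustration only.
    private string BuildHugeStateDump() { return "..."; }
}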