This blog has moved to Medium


Posts tagged ‘Programming’

Cancelor – a Java task cancellation service

Lately I needed to support task cancellation in a Java process I’m working on. The straightforward ways I know to implement this are:

  1. Thread.interrupt() – the caller interrupts the worker thread (either directly or using Future.cancel()). Some say this is an erroneous approach, but I still haven’t figured out why. However, it is buggy on some recent versions of the JDK, and it is a bit fragile (what if the worker threads create worker threads that also need to be canceled?).
  2. Passing some object (AtomicBoolean?) down to every object you would like to support cancellation. These objects will check the value of this boolean, and should stop if it is false. They can pass the boolean to other objects / tasks. While this works, this boolean cannot be injected, and so must be manually passed along the call stack.
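The second option can be sketched roughly as follows. This is only an illustration: the class names (Downloader, PageParser) are hypothetical, and the flag follows the convention described above, where false means "stop".

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical top-level worker; it receives the flag from its caller.
class Downloader {
    private final AtomicBoolean keepRunning;

    Downloader(AtomicBoolean keepRunning) {
        this.keepRunning = keepRunning;
    }

    // Handles up to `pages` pages and returns how many were completed.
    int downloadAll(int pages) {
        // The flag has to be handed to every collaborator by hand.
        PageParser parser = new PageParser(keepRunning);
        int done = 0;
        for (int i = 0; i < pages && keepRunning.get(); i++) {
            if (parser.parse(i)) {
                done++;
            }
        }
        return done;
    }
}

// Hypothetical collaborator further down the call stack.
class PageParser {
    private final AtomicBoolean keepRunning;

    PageParser(AtomicBoolean keepRunning) {
        this.keepRunning = keepRunning;
    }

    // Returns false without doing any work once the flag has been cleared.
    boolean parse(int page) {
        if (!keepRunning.get()) {
            return false;
        }
        return true; // stand-in for the real parsing work
    }
}
```

Every constructor in the chain grows an extra parameter, which is exactly the manual plumbing that an injected service avoids.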

If you want the advantages of the second method, but don’t want to break IOC, here’s how:

First, the usage:

The listener object adds a dependency on ICancelor

public class Foo {
  // injected by the IOC container
  private final ICancelor cancelor;

  public Foo(ICancelor cancelor) {
    this.cancelor = cancelor;
    ...
  }
}

It then checks the cancellation state every now and then:

if (cancelor.wasTaskCanceled("TakeOverTheWorld"))
   return;

The top-level thread that wishes to cancel a task simply calls

cancelor.cancelTask("TakeOverTheWorld");

And whenever a task is started, you should call

cancelor.resetTask("TakeOverTheWorld");

I’ll admit using strings for task names is a bit ugly, but this is not a terrible price to pay, assuming you have a few core tasks you intend to support. All that remains is the cancellation service itself:

/**
 * A cancellation service.
 */
public interface ICancelor {
    /**
     * Resets a task to "Not canceled" state
     */
    void resetTask(String name);
 
    /**
     * Returns true iff cancelTask was called, and no resetTask was called afterwards.
     */
    boolean wasTaskCanceled(String name);
 
    /**
     * Cancel a task
     */
    void cancelTask(String name);
}
 
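import java.util.concurrent.ConcurrentHashMap;

public class Cancelor implements ICancelor {
    // task name -> true if the task is currently canceled
    private final ConcurrentHashMap<String, Boolean> tasks = new ConcurrentHashMap<String, Boolean>();

    public void resetTask(String name) {
        tasks.put(name, false);
    }

    public boolean wasTaskCanceled(String name) {
        Boolean value = tasks.get(name);
        // short-circuit && avoids unboxing null for a task that was never registered
        return value != null && value;
    }

    public void cancelTask(String name) {
        tasks.put(name, true);
    }
}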
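To see how the three calls fit together, here is a minimal sketch of a worker that polls the cancellation state while another thread cancels it. CancelDemo and runAndCancel are hypothetical names, and the static map is just a compact stand-in for the service with the semantics from the interface's Javadoc (canceled iff cancelTask was called with no later resetTask).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

class CancelDemo {
    // Stand-in for the cancellation service: true means "canceled".
    static final ConcurrentHashMap<String, Boolean> canceled = new ConcurrentHashMap<>();

    static void resetTask(String name) { canceled.put(name, false); }

    static void cancelTask(String name) { canceled.put(name, true); }

    static boolean wasTaskCanceled(String name) {
        return canceled.getOrDefault(name, false);
    }

    // Starts a worker that spins until the task is canceled, cancels the task,
    // and reports whether the worker stopped within five seconds.
    static boolean runAndCancel(String task) {
        resetTask(task); // a task starts in the "not canceled" state
        CountDownLatch started = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            started.countDown();
            while (!wasTaskCanceled(task)) {
                Thread.yield(); // stand-in for one slice of real work
            }
        });
        worker.setDaemon(true);
        worker.start();
        try {
            started.await();
            cancelTask(task);
            worker.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + runAndCancel("TakeOverTheWorld"));
    }
}
```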

Because we rely on task names, there is an assumption here that all classes that play in the cancellation game belong to the same task semantically. If a class is a common class that doesn’t belong to a single task or flow, this approach does not work – in fact, I cannot think of an approach that will work in this case with dependency injection. The common class has to accept the cancellation signal somehow: it must either get a boolean explicitly, and not from the IOC container, or check its interrupted state (or some other thread-local state) itself. Any smart ideas on how to solve this problem?

ALT.NET Israel Tools #1

Come hear about .Net tools in a “no bullshit” evening (more details here).

Playing around with PLINQ and IO-bound tasks


I recently downloaded Visual Studio 2010 beta, and took the chance to play with PLINQ. PLINQ, for those of you in the dark ages of .Net Framework 2.0, is parallel LINQ – an extension to the famous query language that makes it easy to write parallel code (essential to programming in the 21st century, in the age of the many-core).

A code sample, as usual, is the best demonstration:

public static int CountPrimes(IEnumerable<int> input)
{
    return input.AsParallel().Where(IsPrime).Count();
}
 
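private static bool IsPrime(int n)
{
    // 0, 1 and negative numbers are not prime
    if (n < 2)
        return false;
    // use <= so that perfect squares such as 9 are not reported as prime
    for (int i = 2; i*i <= n; ++i)
        if (n % i == 0)
            return false;
    return true;
}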

This code sample, regardless of using an inefficient primality test, is fully parallel. PLINQ will utilize all your cores when running the above code, and I didn’t have to use any locks, queues, threadpools or any of the more complex tools of the trade. Just tell PLINQ “AsParallel()”, and it works.
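For readers more at home in Java, a rough analogue can be sketched with parallel streams. This is not PLINQ, only a comparable idiom; PrimeCounter and its method names are made up for the illustration.

```java
import java.util.stream.IntStream;

class PrimeCounter {
    // Same trial-division test as above.
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        // the cast to long avoids overflow of i * i for large n
        for (int i = 2; (long) i * i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }

    // Counts the primes in [from, toInclusive]; parallel() plays the role of AsParallel().
    static long countPrimes(int from, int toInclusive) {
        return IntStream.rangeClosed(from, toInclusive)
                .parallel()
                .filter(PrimeCounter::isPrime)
                .count();
    }
}
```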

I hit a gotcha when I tried to compare the parallel performance with the sequential one. Do you spot the problem in the following code?

public static void CountPrimesTest(IEnumerable<int> input)
{
    // parallel benchmark 
    var timer = new Stopwatch();
    timer.Start();
    CountPrimes(input.AsParallel());
    timer.Stop();
    Console.WriteLine("Counted primes in parallel took " + timer.Elapsed);
 
    // sequential benchmark
    timer = new Stopwatch();
    timer.Start();
    CountPrimes(input);
    timer.Stop();
    Console.WriteLine("Counted primes sequentially took " + timer.Elapsed);
}

This is all fine and dandy when the task at hand is CPU bound, but works pretty miserably when your task is IO bound, like downloading a bunch of web pages. Next, I simulated some IO-bound tasks (I used Sleep() to emulate IO – basically not using a lot of CPU for every task):

[ThreadStatic]
private static Random _random;
 
public static List<string> FindInterestingDomains(IEnumerable<string> urls)
{
    // select all the domains of the interesting URLs
    return urls.AsParallel().Where(SexFilter).
                Select(url => new Uri(url).Host).ToList();
}
 
public static bool SexFilter(string url)
{
    if (_random == null)
        _random = new Random();
 
    // simulate a download
    Thread.Sleep(1000);
    var html = "<html>" + _random.Next() + "</html>";
    return html.Contains("69");
}

Testing this with a list of 10 URLs took 5 seconds, meaning PLINQ again spun up only two threads, one for each core on my machine. This really sucks for IO bound tasks, because most of the time the threads are idle, waiting on IO. Let’s see if we can speed this up:

// Use WithDegreeOfParallelism to specify the number of threads to run
return urls.AsParallel().WithDegreeOfParallelism(10).Where(SexFilter).
              Select(url => new Uri(url).Host).ToList();

This appeared not to work at first, because WithDegreeOfParallelism is just a recommendation or upper bound. You can ask PLINQ nicely to run with ten threads, but it will only allocate two if it so chooses. This is yet another example of C# being more magical than Java – compared to Java’s rich ExecutorService, PLINQ offers less fine-grained control.
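For comparison, this is roughly what the explicit Java route looks like: a fixed pool of exactly as many threads as you ask for. It is a sketch only; FixedPoolDemo and runSleepTasks are hypothetical names, and Thread.sleep stands in for the IO wait just like in the C# version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FixedPoolDemo {
    // Runs `tasks` sleeping jobs on a pool of exactly `threads` threads
    // and returns the elapsed wall-clock time in milliseconds.
    static long runSleepTasks(int tasks, int threads, long sleepMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> jobs = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                jobs.add(() -> {
                    Thread.sleep(sleepMillis); // simulate a download
                    return true;
                });
            }
            long start = System.nanoTime();
            // invokeAll blocks until every job has finished
            for (Future<Boolean> result : pool.invokeAll(jobs)) {
                result.get();
            }
            return (System.nanoTime() - start) / 1_000_000;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("10 tasks on 10 threads: " + runSleepTasks(10, 10, 1000) + " ms");
        System.out.println("10 tasks on 2 threads:  " + runSleepTasks(10, 2, 1000) + " ms");
    }
}
```

With ten threads the ten one-second sleeps overlap, while with two threads they run in five rounds. The pool never second-guesses the thread count, which is the trade-off against PLINQ's adaptive behavior.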

However, further testing revealed the damage is not so horrible. This is what happened when I put the above code in a while(true):

Tested 10 URLs in 00:00:05.0576333
Tested 10 URLs in 00:00:03.0018617
Tested 10 URLs in 00:00:03.0013939
Tested 10 URLs in 00:00:03.0013175
Tested 10 URLs in 00:00:04.0018983
Tested 10 URLs in 00:00:03.0024044
Tested 10 URLs in 00:00:01.0004407
Tested 10 URLs in 00:00:01.0007645
Tested 10 URLs in 00:00:01.0007280
Tested 10 URLs in 00:00:01.0003358
Tested 10 URLs in 00:00:01.0003347
Tested 10 URLs in 00:00:01.0002470

After some trial and error, PLINQ found that the optimal number of threads needed to run this task under its concurrency guidelines is ten. I imagine that if at some point in the future the optimal number of threads changes, it will adapt.

P.S.
If you found this interesting, wait till you read about DryadLINQ – it’s LINQ taken to the extreme, run over a cluster of computers.

Java is less magical than C#

I have been programming in C# for several years now, and recently made the switch to Java (at least for now). I noticed that Java, as a language, is “less magical” than C#.

What I mean by that is that in C# things are usually done for you, behind the scenes, magically, while Java is much more explicit in the toolset it provides. For example, take thread-local storage. The concept is identical in both languages – there is often a need for a copy of a member variable that’s unique to the current thread, so it can be used without any locks or fear of concurrency problems.

The implementation in C# is based on attributes. You basically take a static field, annotate it with [ThreadStatic], and that’s it:

[ThreadStatic]
private static ThreadUnsafeClass foo = null;
 
private ThreadUnsafeClass Foo
{
  get
  {
    // lazily create this thread's copy on first access
    if (foo == null)
      foo = new ThreadUnsafeClass(...);
 
    // no other thread will have access to this copy of foo
    // note - foo is still static, so it will be shared between instances of this class.
    return foo;
  }
}

How does it work? Magic. Sure, one can find the implementation if he digs deep enough, but the first time I encountered it I just had to try it to make sure it actually works, because it seemed too mysterious.

Let’s take a look at Java’s equivalent, ThreadLocal. This is how it works (amusingly enough, from a documentation bug report):

public class SerialNum {
     // The next serial number to be assigned
     private static int nextSerialNum = 0;
 
     private static ThreadLocal<Integer> serialNum = new ThreadLocal<Integer>() {
         protected synchronized Integer initialValue() {
             return new Integer(nextSerialNum++);
         }
     };
 
     public static int get() {
         return serialNum.get();
     }
 }

No magic is involved here – get() gets the value from a map, stored on the calling Thread object (source code here, but the real beauty is that it’s available from inside your IDE without any special effort to install it).
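A small sketch makes the per-thread behavior visible. The class and method names are hypothetical, and ThreadLocal.withInitial (Java 8 and later) is used instead of overriding initialValue().

```java
class ThreadLocalDemo {
    // Each thread lazily gets its own one-element counter.
    private static final ThreadLocal<int[]> counter =
            ThreadLocal.withInitial(() -> new int[1]);

    // Increments the calling thread's counter and returns the new value.
    static int incrementAndGet() {
        return ++counter.get()[0];
    }

    // Calls incrementAndGet() `times` times on a brand-new thread
    // and returns the last value that thread saw.
    static int countOnNewThread(int times) {
        int[] last = new int[1];
        Thread t = new Thread(() -> {
            for (int i = 0; i < times; i++) {
                last[0] = incrementAndGet();
            }
        });
        t.start();
        try {
            t.join(); // join() makes the worker's write to last[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return last[0];
    }

    public static void main(String[] args) {
        // Every new thread starts counting from zero, whatever other threads did.
        System.out.println(countOnNewThread(3));
        System.out.println(countOnNewThread(5));
    }
}
```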

Let’s look at another example – closures.

In C#, you can write this useful piece of code:

var list = new List<int>();
...
// find an element larger than 10
list.Find(x => x > 10);

You can also make this mistake:

var printers = new List<Action>();
...
foreach (var item in list)
{
  printers.Add(() => Console.WriteLine(item));
}
Parallel.ForEach(printers, p => p());

An innocent reader might think this prints all the items in list, but actually it prints only the last item, list.Count times (C# 5.0 later changed foreach so that each iteration gets a fresh variable). This is how closures work: the item referred to in the closure is not a new copy of item, it’s the same variable that’s being modified by the loop. A workaround is to add a new temporary variable like this:

foreach (var item in list)
{
  int tempItem = item;
  printers.Add(() => Console.WriteLine(tempItem));
}

And in Java? Instead of closures, one uses anonymous classes. In fact, this is how they are implemented under the hood in C#. Here is the same example, in Java:

for (Integer item : list)
{
  final int tempItem = item;
  printers.add(new Action() {
    public void doAction()
    {
      // can't reference item here because it's not final.
      // this would have been a compilation error
      // System.out.println(item);
      System.out.println(tempItem);
    }
  });
}
...

Notice it’s impossible to make the mistake and capture the loop variable instead of a copy of it, because Java requires it to be final. So … less powerful perhaps than C#, but more predictable. As a side note, Resharper catches the ill-advised capturing of local variables and warns about it.
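Here is a self-contained version of the Java example that can actually be run. It is a sketch: Runnable stands in for the Action interface above, and the output is collected into a list so it can be inspected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CaptureDemo {
    // Builds one Runnable per item, runs them all afterwards,
    // and returns what each of them reported.
    static List<String> captureEach(List<Integer> items) {
        final List<Runnable> printers = new ArrayList<>();
        final List<String> output = new ArrayList<>();
        for (Integer item : items) {
            final int tempItem = item; // each iteration gets its own final copy
            printers.add(new Runnable() {
                @Override
                public void run() {
                    output.add("item " + tempItem);
                }
            });
        }
        for (Runnable printer : printers) {
            printer.run();
        }
        return output;
    }

    public static void main(String[] args) {
        // Each Runnable reports its own item, not the last one.
        System.out.println(captureEach(Arrays.asList(1, 2, 3)));
    }
}
```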

I myself rather prefer the magic of C#, because it does save a lot of the trouble. Lambdas, properties, auto-typing variables… all these are so convenient it’s addictive. But I have to give Java a bit of credit, as the explicit way of doing stuff sometimes teaches you things that you just wouldn’t have learned cruising away in C# land.

Israeli Developers Community Conference 2009

Check out the idcc, register (free), vote on the topics, and attend.

Q.E.D.

P.S.

Actually, registration costs 100 NIS.

Never use synchronized methods or lock on this

Especially when extending a 3rd party base class.

This is a known best practice, but when I read about it I naturally assumed I was smarter than the author of the best practice. The reason not to use synchronized methods (or lock(this)) is that other code might lock on your object too, thus causing nasty deadlocks.

I figured this wouldn’t happen because ‘who would just lock on my object, there’s no chance of that’. Well, this is obviously not safe, but especially so when extending a 3rd-party base class. In my case, I was extending log4j’s AppenderSkeleton, and found out the hard way that log4j obtains locks on the appenders.

The solution:

  1. Use a private lock object (duh), separating your intended lock semantics from whatever evil outside code will use
  2. Stop assuming that I know best and ‘it will never happen’
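A minimal sketch of the private lock object, with hypothetical names throughout. The helper at the bottom plays the part of the "evil outside code": it holds the monitor of the object itself, the way a framework might, while another thread calls into it.

```java
class SafeCounter {
    // Nobody outside this class can reach this object, so nobody else can lock on it.
    private final Object lock = new Object();
    private int count;

    void increment() {
        synchronized (lock) {
            count++;
        }
    }

    int get() {
        synchronized (lock) {
            return count;
        }
    }

    // Holds the counter's own monitor while a second thread increments it,
    // then returns the resulting count.
    static int incrementWhileMonitorIsHeld() {
        SafeCounter counter = new SafeCounter();
        Thread worker = new Thread(counter::increment);
        worker.setDaemon(true);
        synchronized (counter) { // outside code locking on our object
            worker.start();
            try {
                // If increment() were a synchronized method, the worker would be
                // blocked for as long as we sit here holding the monitor.
                worker.join(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }
}
```

Because increment() only ever takes the private lock, the worker finishes even though the outer code is holding the object's monitor.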

This comes BEFORE your business logic!

I thought it might be worthwhile to formulate a technical checklist for a software project – gather all the questions you need answered before you should begin coding the business logic itself. To most of these questions there are only one or two possible answers, and StackOverflow can help you choose between them if you don’t know already which is the best solution for you.

It’s a long list, but I really believe all or most of these will bite you in the ass if you delay them.

  1. Programming language / framework – this is the first choice because it influences everything else. All of us have our favorite languages, and our degree of proficiency with them varies. Besides this factor (which may turn out to be huge), consider:
    1. Performance characteristics. This is probably not relevant today to over 95% of software projects, as most languages will have a reference implementation or two that will be fast enough – but don’t go writing embedded code in Ruby (I’d take this chance to refute once and for all the illusion that some people still maintain – C/C++ is not faster than .Net or Java, it’s just more predictable).
    2. Important 3rd party libraries. If your business application just has to have Lucene 2.4, and ports of Lucene to other languages are lacking in functionality, this pretty much limits you to Java. If you’re writing a GUI-rich application targeted at Windows, then .Net is probably your safest bet.
  2. Unit testing
    1. This should include a mocking framework, though I usually tend to write integration tests more than unit tests.
    2. Think about integration with your test runner – for example, Resharper didn’t support running MSTest tests two years ago (when we were using it, mainly because we didn’t know any better).
    3. Unit tests are not enough – integration tests are the only thing that gives confidence in the actual end-to-end product. For strong  TDD advocates: System Tests are also very valuable as they are the only thing that tests flows on the entire system, and not on a single component-path.
  3. Dependency Injection / IOC framework
    1. I’ve only recently started applying this technique heavily, and it’s a beauty. It allows writing isolated unit tests and easy mocking, and helps with lifetime management (e.g. no need to hand-code the Singleton pattern). A good framework will ease your life, not complicate it.
    2. When implementing your choice of IOC framework, remember that wiring it up for integration tests is not necessarily the same as the wire-up for actual production code.
  4. Presentation Tier – Most projects need one, whether it’s a web or desktop application.
  5. Data Tier
    1. How do you plan to store and access your data? Can your site be database-driven? Or do you need to go the NO-SQL path? You might want to combine both – use a database to store most of your data, but use another place to store performance-critical data.
    2. Object Relation Mapping – You usually will not want to hand-craft your SQL, but rather use an ORM (we’re all object oriented here, right?).
  6. Communication Layer – if your project has several independent components or services, you should choose a communication model. Do they communicate over the database, using direct RPC invocations, or message passing via a Message Bus?
  7. Logging
    1. Logging everything to a central database is, in my experience, the best solution. Use a logging framework (and preferably pick one that has a simple error level scheme and allows built-in log formatting).
    2. Make sure your logging doesn’t hurt your performance (opening a new connection to the DB for every log is a bad idea), but don’t prematurely optimize it.
    3. Do not scatter your logs – a unified logging scheme is crucial in analyzing application errors, so don’t log some events to the DB and others to a file.
    4. I find it useful to automatically fail unit tests that raise errors. Even if the class under test behaved as expected, it might have reported an error condition internally, in which case you want to know about it. Use an in-memory appender, collect all the error/fatal logs and assert there are none – except for tests in which you specifically feed your code erroneous input.
  8. Configuration
    1. Decide on a sensible API to access configuration from code. You can use your language’s native configuration mechanism or home grow your own.
    2. Decide on how to maintain configuration files. Who is responsible for updating them? Which configurations are mandatory, which are optional? Is the configuration itself (not the schema) version controlled?
  9. Release Documentation
    1. Release notes / changelog – A simple (source controlled) text file is usually enough, as it gives crucial information on what a build contains. Should include new features, bug fixes, new known issues, and of course how-to deployment instructions.
    2. Configuration documentation – especially on large teams, you should maintain a single place where all mandatory configurations are documented. This makes it easy for anyone to configure and run the system.
  10. Packaging and Deployment
    1. Have your build process package an entire release into a self-contained package.
    2. Have your builds versioned – every commit to source control should increase the revision number, and the version should be integrated into the assemblies automatically – this ensures you always know what version you’re running.
    3. Depending on your IT staff and how complicated manual deployment is, you might want to invest in a deployment script – a script that gets a release package and does everything (or almost everything) needed in order to deploy it. This is a prerequisite for writing effective system tests.
  11. Tooling
    1. Source control
      1. SVN, TFS, Git, whatever. Choose one (according to your needs and budget) and put everything you develop under it.
      2. One painful issue is your branch management. Some prefer to work constantly on a single always-stable trunk, others prefer feature branches. The choice is affected by your chosen SCM tool, the size of your team, and the level of experience you have with the tool.
    2. Build System – Unit tests that are never or seldom run are hardly effective. Use TeamCity to make sure your code is always well tested (at least as well tested as you thought).
    3. IDE – Some programming languages have only one significant IDE, others have a few.
      (Note – I don’t really consider Visual Studio to be an IDE without Resharper)
    4. Bug tracking – have a simple place to collect and process bugs.
    5. Feature and backlog management – have an easy-to-access place that shows you and the entire team:
      1. What features are you currently working on
      2. What tasks are left to do in order to complete features
      3. What prioritized features are on the backlog – this is crucial to help you choose what to do next (I prefer the sprint-based approach)
    6. Documentation standard. It can be a wiki, shared folders, Google Docs, or (ugh) SharePoint, but you should decide on a single organizational scheme. I strongly suggest that you not send documents as attachments, because then you can’t tell when they change.
    7. Basic IT – Backup, shared storage, VPN, email, …
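The in-memory appender idea from the Logging item can be sketched with java.util.logging, whose Handler plays the role of an appender. Treat this as one possible shape and not a recipe: ErrorCollectingHandler and errorsLoggedBy are hypothetical names, and other logging frameworks have their own equivalents.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Keeps every SEVERE record in memory so a test can assert on them.
class ErrorCollectingHandler extends Handler {
    private final List<String> errors =
            Collections.synchronizedList(new ArrayList<String>());

    @Override
    public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.SEVERE.intValue()) {
            errors.add(record.getMessage());
        }
    }

    @Override
    public void flush() { }

    @Override
    public void close() { }

    List<String> errors() {
        return new ArrayList<>(errors);
    }

    // Runs the code under test with the handler attached to the named logger
    // and returns the error messages that were logged meanwhile.
    static List<String> errorsLoggedBy(String loggerName, Runnable codeUnderTest) {
        Logger logger = Logger.getLogger(loggerName);
        ErrorCollectingHandler handler = new ErrorCollectingHandler();
        logger.addHandler(handler);
        try {
            codeUnderTest.run();
        } finally {
            logger.removeHandler(handler);
        }
        return handler.errors();
    }
}
```

A test would wrap the code under test in errorsLoggedBy(...) and assert that the returned list is empty, except in tests that deliberately feed in erroneous input.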

Do you agree with this list? What did I miss? What can be delayed for later stages in the project?

Your comments (sadly, on Facebook and not just in here)

Omri reminded me (as always) not to leave QA out of the picture. I’m not sure this is as essential as the list above, but from his perspective as a QA lead, a test management tool is essential.

Thrift win32 binary

I’ve been playing around with Apache Thrift (originally from Facebook) recently. I had a tough time finding working Thrift binaries for win32, and the compilation process was not trivial, so I’ve decided to put them online.

So here they are, compiled with Cygwin from Thrift 0.1.0. This will probably require Cygwin to run (remember to add the Cygwin binary directory to your path).

Note, the compilation process created two different files named thrift.exe – one small (16 KB) file and the larger 10 MB file I’ve put online (this one actually works).