

Javascript refactoring is hard

I’ve been refactoring code all my professional career. I started out in C/C++, and when I moved to C# (ReSharper) and Java (IntelliJ), my refactoring productivity was boosted severalfold by the wonderful IDEs and refactoring tools these languages enjoy.

I am rather confident when I write “dirty code” in Java or C#, because I know that I can swiftly refactor it into beautiful code without too much trouble. Both aforementioned IDEs are so great at this that it’s painless. You can take an ugly 300-line method and break it into several methods, break a long class into several classes, inline, move, and otherwise shape your code.

Enter JavaScript and web development.

Last year I made the plunge and officially started working on front-end web development, first at Google, and recently at Commerce Sciences. And I’ve discovered that … refactoring JavaScript and frontend code is hard.

The language is dynamically typed

Compilers and IDEs can help you less when the language is dynamic. They can make fewer assurances about your code, which makes reasoning and refactoring harder. Safely moving a method from class Foo to class Baz is an easy feat in a statically typed language, and a difficult or impossible one in a dynamic language.

Classes are not first-class citizens

While you can do OOP in JavaScript, classes are not first-class citizens; they’re implemented using functions, objects and prototypes. Automatically reasoning about “classes” without an equivalent of a class keyword is difficult.

Your code is not just Javascript, it’s HTML & CSS as well

Web development is not just about javascript, it’s a combination of JS, HTML & CSS (not to mention other potential technologies such as LESS/SCSS, HAML, JSON, and whatever language your backend is written in). Unless you’re already a web development ace and perfectly design your codebase … you will get these mixed up. Refactoring is about changing the design of your program, and when that design is split up between three or four different technology domains, design mistakes are harder to rectify.

You can’t (at least not yet) apply an automatic refactoring to “move inline CSS into an external CSS file”, or to “convert this static HTML snippet into a JavaScript DOM-manipulation method”. The makers of IDEs and refactoring tools don’t have JavaScript completely figured out yet, so it’s no wonder they haven’t gotten around to building cross-domain refactorings!

Again, if you “know what you’re doing”, you can structure your program perfectly in the first place, and won’t ever have to do this kind of “cross domain refactoring”. Sadly, we’re not all born with this kind of experience.

It’s harder to know when you’ve broken something

Backend code is so much easier to test than frontend code. While frontend testing is important, and growing, it is by no means as well understood or practiced as classical “backend TDD, this is a calculator, assertEquals(30, calc.plus(10, 20))”.
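
To make the contrast concrete, here is the sort of backend test that is trivial to write and keep green – a minimal JUnit sketch, where Calculator is a hypothetical class used purely for illustration:

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class CalculatorTest {
    // Hypothetical calculator, standing in for "real" backend logic.
    static class Calculator {
        int plus(int a, int b) { return a + b; }
    }

    @Test
    public void plusAddsTwoNumbers() {
        assertEquals(30, new Calculator().plus(10, 20));
    }
}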

So, since backend code has much wider test coverage than frontend code, and for Java &amp; C# is bloody well compiled, when you refactor frontend code it has a much greater chance of breaking down, often in subtle ways you’ll only notice on IE 8, or when the network is slow, or whatever edge case surfaces your particular bug.

 

 

So … how do we manage?

  • Refactor less – Since we know refactoring is hard, we try to do less refactoring and more pre-factoring. Think a little more than usual before coding. While I’ve grown the habit of “code first, think about design later” over the years thanks to the power of refactoring, it’s less useful in FE dev, so I need to take the time before coding and try harder to get the design right on my first attempt.
  • Still … refactor – IntelliJ and ReSharper both offer some refactoring capabilities. I’m most comfortable with “limited-scope refactorings” – those that affect a single function, like Extract Variable. Use the tools you do have, instead of whining about the tools you don’t.
  • Try to think harder about the different problem domains (JS/HTML/CSS), and to develop a better understanding of how to structure your program in a way that won’t force you to refactor across problem domains.

Please do share your own experience with refactoring in web dev!

KISS Project management with Trello

Update – I advise you to take a look at how the Trello team manages Trello.

When I initially read about Trello, I wasn’t overly excited (much as my first thought upon hearing of Stack Overflow was “oh no, yet another Q&amp;A site”). Trello, if you haven’t had the pleasure, is a simple List Management App. No more, no less.

Then, we happened to want a little bit more order in our task management back at <Unnamed Hot New Startup I Recently Joined>. After considering several tools (somehow, always, Excel rears its ugly head in such discussions), we agreed to give Trello a shot.

So far, we’re about three weeks into the process, and while I can’t yet speak for everyone, personally I’m liking the experience a lot.

Trello is a member of the “let’s keep everything simple” family of tools. It is certainly not fully featured (at least not as a project management app), but as a simple tool to manage 1-6 people, it’s really a no-overhead, no-bullshit tool that gets the job done (we’re currently only 3 people, so I can’t testify to how it scales yet). After a bit of tweaking, we arrived at the following scheme of working with Trello:

  1. Keep a “current week/sprint” list
  2. Keep three lists for Small, Medium and Large features
  3. Maintain a “deployed, not yet reviewed” list, and another “done, not yet deployed” list

That’s it. This is what it looks like:

Our “sprints” are 1 week long, mostly because of the stage we’re at – we’re only three people at this point, and our priorities are very dynamic (remember, a sprint is just a unit of planning – it doesn’t correlate to how many deployments we do).

At least once a week, we review the board together and see what we’ve done in the last sprint. Any tasks that we haven’t completed, we move to the next sprint (or to the appropriate backlog if their priority has decreased). Completed tasks usually just get archived – or rather, the entire “Sprint X” list gets archived. Sometimes, when there are features that are especially relevant for review, we move them to the “Code Ready” list, and when deploying, to the “Deployed, ready to review” list. When those features are reviewed, we archive them individually.

We pick features for the next sprint by looking at the three buckets or backlogs. First we see if there are any “Large” features we want to accomplish or make progress on this week (usually there are). Those get picked first. After that, we might fit in a few Medium or Small features. Small/Medium features are also useful to fill in gaps in planning – sometimes I have an hour or less free, and I don’t want to start working on a Large or Medium feature … I know that just getting into the right state of mind will often take half an hour, so I pick one of the Small features from the current sprint or from the backlog, drag it to the current sprint, and execute it quickly.

We also have little icons on the cards that show who they’re assigned to (not shown in this image). A feature can be just a headline, or can be very detailed with a description, checklist of sub-tasks, and other shiny items. Most features are simpler, but sometimes you just need to conduct some kind of conversation about the feature, and the best place to keep it is on the card itself.

What I like about our system is that it’s really ultra simple, gives us the ability to focus on what’s important right now, and lets us plan a bit for the future. It will not scale to long plans or huge teams, and it won’t give us any “smart conclusions”, like Evidence Based Scheduling in FogBugz. But it’s simple, it’s free, and it works (for now).

What are you using to manage your projects? (And please, “Excel” is not a good answer)

WhateverOrigin – Combat the Same Origin Policy with Heroku and Play! Framework

A little while ago, while coding Bitcoin Pie, I found the need to overcome the notorious Same Origin Policy, which limits the domains that JavaScript running in a client’s browser can access. Via Stack Overflow I found a site called Any Origin, which is basically the easiest way to defeat the Same Origin Policy without setting up a dedicated server.

All was well until, about a week ago, Any Origin stopped working for some (but not all) HTTPS requests. It just so happened that in that time I had gained some experience with Play! and Heroku, which enabled me to quickly build an open source clone of Any Origin called Whatever Origin (.org!) (on GitHub). For those unfamiliar with Play! and Heroku, let me give a short introduction:

Heroku is one of the leading PaaS providers. PaaS is just a fancy way of saying “Let us manage your servers, scalability, and security … you just focus on writing the application.” Heroku started as a Ruby shop, but they now support a variety of programming languages and platforms, including Python, Java, Scala and JavaScript/Node.js. What’s extra cool about them is that they offer a huge set of add-ons, ranging from simple stuff like Custom Domains and Logging, through scheduling, email and SMS, up to more powerful add-ons like Redis, Neo4j and Memcached.

Now for the application part: I had recently found the Play! Framework. Play is a Java/Scala framework for writing web applications that borrows from the Ruby on Rails / Django idea of providing you with a complete pre-built solution, letting you focus on writing your actual business logic, while allowing you to customize everything later if needed. I encourage you to watch the 12-minute video on Play!’s homepage; it shows how to achieve powerful capabilities literally from scratch. Play! is natively supported on Heroku, so really all you need to do to get a production app running is:

  • play new
  • Write some business logic (Controllers/Views/whatnot)
  • git init … git commit
  • “heroku apps add” to create a new app (don’t forget to add “--stack cedar” to use the latest generation Cedar stack)
  • “git push heroku master” to upload a new version of your app … it’s automatically built and deployed.

Armed with these tools (which really took me only a few days to learn), I set out to build Whatever Origin. Handling JSONP requests is an IO-bound task – your server basically does an HTTP request, and when it completes, it sends the response to your client wrapped in some javascript/JSON magic. Luckily Play!’s support for Async IO is really sweet and simple. Just look at my single get method:

public static void get(final String url, final String callback) {
    F.Promise<WS.HttpResponse> remoteCall = WS.url(url).getAsync();
 
    await(remoteCall, new F.Action<WS.HttpResponse>() {
        public void invoke(WS.HttpResponse result) {
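            // This callback runs once the remote HTTP call completes; until then,
            // await() has released the request thread so it can serve other requests.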
            String responseStr = getResponseStr(result, url);   // code for getResponseStr() not included in this snippet to hide some ugly irrelevant details
 
            // http://blog.altosresearch.com/supporting-the-jsonp-callback-protocol-with-jquery-and-java/
            if ( callback != null ) {
                response.contentType = "application/x-javascript";
                responseStr = callback + "(" + responseStr + ")";
            } else {
                response.contentType = "application/json";
            }
 
            renderJSON(responseStr);
        }
    });
}

The first line initiates an async fetch of the requested URL, then registers for the completion event and releases the thread. You could almost think this is Node.js!

What actually took me the longest to develop and debug was JSONP itself. Information about it, and about jQuery’s client-side support, was a little tricky to find, and I spent a few hours struggling with overly escaped JSON and other fun stuff. After that was done, I simply pushed the code to GitHub, registered the whateverorigin.org domain for a measly $7 a year, replaced anyorigin.com with whateverorigin.org in Bitcoin Pie’s code, and voila – the site was back online.

I really like developing websites in 2011 – there are entire industries out there that have set out to make it easy for individuals / small startups to build amazing products.

MapBinder basics in Guice

Another lesser-known feature of Google Guice is map binders. I haven’t seen a good basic map binder tutorial, so I wanted to share the little I learned about it (it is covered in the documentation, but that’s not necessarily the best intro material). Basically, a map binder is a way to collect bindings from several modules into one central map. The idiom is having each module add bindings from its “knowledge domain”, and providing the entire collection as one unified map.

import java.util.Map;
 
import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.multibindings.MapBinder;
 
class SomeModule extends AbstractModule {
  protected void configure() {
     // Bind the value "Eve" to the key "Adam"
     MapBinder.newMapBinder(binder(), String.class, String.class)
       .addBinding("Adam").toInstance("Eve");
  }
}
 
class AnotherModule extends AbstractModule {
  protected void configure() {
     // Bind the value "Abel" to the key "Kane"
     MapBinder.newMapBinder(binder(), String.class, String.class)
       .addBinding("Kane").toInstance("Abel");
  }
}
 
class NeedsMap {
  @Inject
  NeedsMap(Map<String, String> biblicalNames) {
    // gets a map of all values bound in the relevant modules
  }
}
 
public class Main {
  public static void main(String[] args) {
    NeedsMap needsMap = Guice.createInjector(new SomeModule(), new AnotherModule())
        .getInstance(NeedsMap.class);
  }
}

Of course, the map can use any types for its keys and values, not just String, and a key doesn’t have to bind to a specific instance – it can bind to a class, as in the sketch below. To use MapBinder, remember to depend on the proper Guice extension (the guice-multibindings Maven artifact). The code is available on GitHub. Check out this question for “when is this actually useful?”
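
Here is a minimal sketch of what binding to classes looks like – Greeter, EnglishGreeter and HebrewGreeter are hypothetical types, used purely for illustration; Guice will instantiate them (and inject their dependencies) when the map is requested:

import com.google.inject.AbstractModule;
import com.google.inject.multibindings.MapBinder;
 
class GreeterModule extends AbstractModule {
  @Override
  protected void configure() {
    MapBinder<String, Greeter> greeters =
        MapBinder.newMapBinder(binder(), String.class, Greeter.class);
    // Each value is bound to a class rather than an instance; Guice creates
    // the instances when the Map<String, Greeter> is injected somewhere.
    greeters.addBinding("en").to(EnglishGreeter.class);
    greeters.addBinding("he").to(HebrewGreeter.class);
  }
}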

A summary of my history with source control

  • Folder-based, copy-paste VCS. Where was that stable copy again? Ah damn, we deleted it.
  • SourceSafe. Will eat up your files™
  • ClearCase. How do I merge that branch? The ClearCase admin will be here in two weeks, he’ll help you. Hopefully.
  • TFS. Only works within Visual Studio? That’s ok, we’ll just use A SEPARATE VCS to store our java files.
  • SVN. Ah, you mean I don’t need to mark a file I’m editing, only when I actually commit it? Nice. But merges are still hell, and svn update takes forever.
  • git. You mean I can switch to another branch in about 20 milliseconds, and merge it back in a second? Sweet.

The End?

My walk through the Git book

I’ve been experimenting with git for about a year, but most of the work I did with it so far was in the “single developer, hack some stuff, push to github” mode of operation, which is very superficial. Now that I’ll be working with it full time (git is one of the “semi-widely adopted” SCMs at Google), I thought it was time to take a closer look at some of the wisdom accumulated by other folks, so I finally cracked open the Git book and did a pass over it.

The book is great and usually very fluid. It begins by showcasing the simple use cases you’ll encounter with git, and is filled with short code snippets you can try (even on a train with no WiFi – this is a distributed source control system after all). Some of the examples weren’t crystal clear straight out of the box, and relied on some previous knowledge the authors had (after all, much of the book was pulled together from different sources, so I imagine it was relatively easy to accidentally assume a bit of knowledge that readers don’t necessarily have at that point).

Here is a summary of questions I had while reading the book, followed by some cool stuff I found at the end. I recommend at least some knowledge of git for the rest of this article, best accompanied with a reading of the Git book itself. As usual, if you find a mistake, please let me know. Some more related recommended reading is the Git for beginners SO question.

What happens on double git add?

git add is used not just to add new files, but also to ‘add’ changes in existing files.

When I do:

echo v1 > foo
git add foo
echo v2 > foo
git add foo
git commit -m bar

Are both versions of foo added to the commit log, or just the latest?

The answer is that just the latest version is actually committed.

After I git merge without conflicts, is a git commit needed?

Coming from svn, my expectation was that after I merged changes into my local branch, I would have to commit them. A quick experiment showed that in git this is not the case at all – if a merge is resolved without manual intervention (including concurrent edits to different parts of the same file), then no commit is needed. If there are any conflicts that are resolved manually (by git adding the file after fixing the merge), then a git commit is required.

How does gitk work? Sometimes I see branches, sometimes I don’t … it’s very confusing

This one has been puzzling me for quite a long time. I found that I couldn’t trust gitk, the graphical tool for visualizing commits, branches and merges, because it kept giving me inconsistent results, and for the life of me I couldn’t understand why.

Now I did a few experiments and some digging, and found that by default gitk will only show you the current branch and the commits that are its ancestors in the version graph. If you create a branch, switch back to master, and run gitk, you will not see the new branch. What confused me is that upon refreshing, gitk rescans the current branch and adds any new nodes to its display, while retaining anything already shown – meaning that if you run gitk, switch to a new branch, and refresh gitk, the new branch and its relation to the previous one will now be displayed in gitk.

Of course, like all things Linux, gitk can be told to behave the way you want. Just follow the gitk command with the names of the branches you want shown, or simply add “--all” to see all the branches in your repository.

How can you see the ‘branch structure’ of a repository?

In svn, there is a well-defined directed graph between branches. When a branch is created off its parent, this parent-child relation is recorded and maintained, and the tools readily show you this branch graph.

I could have guessed this, but sources on Stack Overflow confirmed that there is no direct equivalent in git. Instead of branches having parent-child relations, there is a parent-child relation between objects, so individual files and directories can have multiple parents in the version graph, while other files on the same branch might have completely linear histories. The model is more complex, but more powerful, and it seems to be the core reason why merges in git are supposed to be easier than in svn.

What does ‘fast forward’ really mean?

Using git, I often saw messages with the words “fast forward”, but never really understood what they meant. This bit is explained rather nicely in the Git book – a fast forward happens when you merge branch b1 into b2, resolve any possible conflicts, and then merge the result back into b1. b2 already contains a version that is a descendant of the “heads” of both b1 and b2, meaning all the “merge work” was already done in it. So, when this structure is merged back into b1, what actually happens is that all the revisions and merge work that happened on b2 are copied to b1. After this copying, the b1 branch (a pointer into the revision DAG) is “fast forwarded” to a descendant node that is the head of b2. In effect, the merge’s result becomes the head of b1 in a clean and simple manner.

This is radically different from svn – I still have horror flashbacks about trying to merge a branch back to trunk. I always first merged trunk into the branch, had to work my ass off to resolve all the conflicts and make the build green, and then sometimes had to do double the work when merging back to trunk. With git, you’re assured that the conflict resolution work you do on your branch is preserved, and makes merging back to master (the git equivalent of trunk) as easy as cake.

git pull, fetch, and what’s in between

It is said that “git pull” is equivalent to “git fetch”, followed by “git merge”.
The ability to immediately fetch all the content of any remote repository without being forced to merge it right away is great – you’re free to do the actual merge work and conflict resolution separately, and you only need connectivity to the remote repository for the fetch phase. But when I tried this using two local folders, git merge complained, and at first I didn’t understand what arguments I should pass to “git merge” in this case.

This turned out to be a simple technical issue. To merge the changes manually after fetching from an arbitrary remote, simply run git merge FETCH_HEAD (sometimes you just have to know the magic words). Normally, you would fetch from origin (usually the repository you cloned from) or from another remote-tracking branch, so you would just specify its name as the parameter to “git merge”.

How does pushing actually work?

Let’s say I set up a local “common” repo (it has to be bare, for reasons explained in the Git book):

mkdir bare
cd bare
git init --bare
cd ..
git clone bare alice
cd alice
touch a && git add a && git commit -m "Added a"
git push # This fails


Why does the push fail?

It turns out that the problem was that I tried to push to an empty repository. If I first do “git push origin master”, then subsequent “git push” commands with no arguments succeed.

And now, for some cool stuff:

git bisect ftw

Suppose you just found a critical bug, and have no idea when it was introduced. You write a simple (manual/automated) test for it, and reproduce it, but you’re not sure what’s causing it. git bisect to the rescue!

git bisect allows you to do a binary search on your repository to find the exact commit that introduced the bug. While this is possible with other VCSs, it is so natural in git that it’s beautiful. You simply do “git bisect start”, followed by “git bisect good” to indicate the current version works, and “git bisect bad” to indicate it doesn’t, and git will direct you towards the correct half of the version graph until you find the exact version when things turned bad.

Configure your defaults for fun and profit

Here are some tweaks I found in the book that you might want to do (if you have any other tweaks you’d like to recommend, please comment!)

oneline log messages

If, like me, you find the “one liner” log messages easier to read, you can make it the default with

git config --global format.pretty oneline

Life is colorful

Make git status and other messages much easier to read with

git config --global color.ui true

git-svn made easy

For the last year, my main usage of git has been for my own personal projects – rather basic stuff, consisting of simple commit/push/pull operations. Recently, I wanted to edit some code on the OSQA project, which is unfortunately hosted on SVN. I am not a committer (yet), so if I wanted my work to be under source control, I had no clear option except using git-svn.

It took me some time to get started; there are still some gotchas that can surprise you if you’re new to git or git-svn. Luckily, I stumbled across this lovely series of screencasts. Thomas walks you through the basics, and showcases some more advanced use cases as well. I highly recommend it! (I subscribed to his blog as well.)

The order of columns in GROUP BY statements matters in MySQL

Take a look at this SQL query:

SELECT COUNT(*) AS cnt
FROM products
WHERE ExternalProductId IS NOT NULL
GROUP BY SourceId, ExternalProductId
HAVING cnt > 1

This query scans the products table and finds all duplicates, where a dup is defined as a row having the same (SourceId, ExternalProductId) as another row. Counter to intuition, it turns out that in MySQL, the order of the columns in the GROUP BY statement can make a huge difference, performance-wise. Let’s assume we have an index on the pair (ExternalProductId, SourceId). The query should be fast, right?

Wrong. It takes 30 minutes on our sample data set (about 30 million rows). An EXPLAIN query and SHOW PROCESSLIST revealed that MySQL was copying the table or index data to a temporary location before starting to process the actual query, and this was taking up most of the execution time.

A quick question to Stack Overflow…

It appears the order of the columns makes all the difference in the world. Switching the GROUP BY columns to (ExternalProductId, SourceId) – matching the index order – made the query run in place and not copy any temp data whatsoever, resulting in an execution time of 30 seconds instead of 30 minutes! I don’t fully understand why MySQL takes the column order into consideration – semantically, the order of the GROUP BY columns doesn’t matter, so choosing the optimal order should be a simple optimization for it to make.