In Search of Software Quality
Note: Our friends at TMX wrote this piece about building high-quality software, and with their permission, we’re reposting it here.
I started writing this article about 6 months after our launch. Things had quieted down a bit by then, and I had the opportunity to think through some of the lessons learned and bounce my ideas off the rest of the team. The time was right to start writing lessons for posterity.
I’ve been creating software professionally for over a decade with many different organizations, and never before have I worked with a codebase built to such a high engineering standard. I’m amazed at how many things we got right, and every day we reap the rewards. It is my hope that these ideas will prove useful to others as they strive to build high-quality software.
But before we can talk about how to achieve software quality, there are three issues we must address first:
The first is: just what do I mean by “Quality”? To me, it’s something that is well designed, well engineered, and well built, be it software or anything else. While we may quibble over some details, I believe certain universal virtues underlie high-quality software:
- Correctness: Above and beyond anything else, it must do what it's designed to do.
- Robustness: Good software is robust. It handles bad input well, fails gracefully, and is resistant to partial failure.
- Simplicity: Build as simple as you can, but no simpler.
- Extensibility: In the real world, ever-changing requirements are a fact of life.
- Scalability: It should be built with provisions for growth, but beware of premature optimization.
- Transparency: Being able to glance into a running program is a godsend, and will save you much sleep (a minimal sketch follows this list).
- Elegance: As Eric Raymond wisely noted, the ability to comprehend complexity is intimately tied to a sense of aesthetics. Ugly code is plain hard to read, and therefore hard to understand. That makes it a constant source of bugs.
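To make the transparency point concrete, here is a minimal Python sketch of what “glancing into a running program” can look like. The post does not prescribe any particular mechanism; this just uses structured logging plus a simple counter, and the names (`process_order`, `orders_processed`) are illustrative, not from the original system:

```python
# A minimal sketch of "transparency": instrument the code so the state
# of a running program is visible without attaching a debugger.
# process_order and orders_processed are hypothetical examples.
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("orders")

orders_processed = 0  # a counter you could also expose via a health endpoint

def process_order(order_id: str) -> None:
    global orders_processed
    start = time.monotonic()
    log.info("processing order %s", order_id)
    # ... real work would happen here ...
    orders_processed += 1
    log.info("order %s done in %.3fs (total=%d)",
             order_id, time.monotonic() - start, orders_processed)

if __name__ == "__main__":
    process_order("demo-1")
```

The same idea scales up to metrics endpoints and tracing; the point is simply that a program’s state should be observable while it runs.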
The second question concerns our motivations. Why do we need quality anyway? The simplistic idea that quality is an end in and of itself is incorrect. The real goal is to create value, of which quality is just one aspect. High quality code is easy to debug, easy to refactor, and easy to build on top of, which makes it easier to add new functionality. We want quality because it makes it cheaper to add value in the long run.
Unfortunately, as with all things, there are tradeoffs to building high-quality software, and this brings us to our third question: just how much quality do I need? There is no escaping the fact that quality has costs. It therefore behooves you to decide just how much quality you need, and live with the consequences. Among other concerns, you must think through the consequences of potential bugs, and you must consider the expected lifetime of your codebase. The answers depend on your particular problem domain. What’s needed for a bank would be wrong for a smartphone game, and vice versa. At the end of the day, teams with perfect code never ship.
With that out of the way, you can think of the following list as a distillation of the ideas behind the development process we have evolved. By no means am I implying that we have discovered the only true path. That would be hubristic. But these practices have worked for us, and worked really well.
- Fast iterations: Boyd's Law of Iteration states that the speed of iteration is more important than the quality of each iteration. This is doubly so in software, because marginal deployment costs (rolling out a new release) are in most cases negligible.
- Automated tests: If you're not writing unit tests, you can stop reading right now and download whatever unit test framework is currently in vogue for your language of choice. I have just saved your sanity. If you already are, congratulations: you're on the right path. But to really get to the next level, unit tests alone are not enough.
- You want integration tests, and ideally you want a Continuous Integration (CI) environment.
- You want empirical measurements of your tests. You need to know for a fact what code is covered and what isn't, and you want to know how long your tests take to run. Now, what level of coverage is acceptable depends on what you're building. Since in our case it's financial software, we decided early on to strive for 100%; last time I checked, we were averaging a little above 99%. In your case, it may make sense to settle for less. (A minimal test-plus-coverage sketch appears after this list.)
- Documentation: No matter how well written and logical your code is, you still want it documented and commented. Your code tells you what it's doing; your comments tell you what it should be doing. If the two are out of congruence, you want to detect that as early as possible. Using automated tools to measure documentation coverage is highly recommended (a short sketch appears after this list).
- Peer review: I cannot emphasize enough the value of code reviews. From day one, we have had mandatory review of any code by at least one person before it was permitted upstream (in fact, as part of our process, the reviewer merges in the code), and I cannot count the number of times potentially catastrophic lossage was averted. Peer review is our primary mechanism for making sure code is up to par. As an added benefit, it familiarizes engineers with different parts of the code base, and trains them to read code.
- Metrics and automated code analysis: Static analysis is an invaluable tool. Not only will it report potential bugs, it alerts you to various code smells. Is there too much cyclomatic complexity in one part of the system? Too much churn somewhere? Are the levels of test coverage and documentation falling? No coverage in a critical area? It pays to have answers to these questions. Taking this a step further, you want to track this information over time; that way, if you detect a negative trend, you can take action before it becomes a problem. (A sketch of a simple CI quality gate appears after this list.)
- Iterative, analytical design: A bad design with good test coverage is still a bad design. Conceptual flaws are hard to fix on a live system, especially when you have to deal with persistent data. Thus, if you model your data correctly up front, you will save yourself much grief. Comparatively speaking, inefficient algorithms are much easier to tackle. Now, there are two points worth noting. First, proper data modeling depends on having accurate requirements, both technical and business. In fact, I will argue that 90% of design is figuring out the requirements. Second, remember to keep your iterations fast. Iterative refinement is key every step of the way, especially when defining requirements with the business stakeholders.
- Sober assessment of technical debt: Do not take on technical debt without explicitly slating and prioritizing its paydown. Like any other debt, technical debt carries interest, and it behooves you to track how much you're carrying on your books. Mounting technical debt means you're building on a substandard foundation, and the longer you wait, the more expensive it will be to excise. Now, keep in mind that not all technical debt is the same. An undocumented constant is mildly annoying but trivial to fix. A bad internal API or antipattern is very annoying, will negatively impact development time, and is much harder to fix. A bad external API has been known to reduce strong men to tears and is very hard to fix. Evaluate and prioritize accordingly.
- Discipline: This is by far the most important point, so I saved it for last. For these ideas to do you any good, your organization needs to practice them consistently, and ideally from day one. The goal is to develop within the organization a culture of discipline, where doing the right thing is the natural, automatic choice.
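As promised above, here is a minimal sketch of the automated-testing idea, using Python's standard unittest module. The function under test, `apply_fee`, is a hypothetical stand-in, not something from the original codebase; the point is the shape of the tests, including the bad-input case:

```python
# A minimal unit test sketch with Python's standard unittest module.
# apply_fee is a hypothetical function, chosen because the post's
# domain is financial software (hence Decimal, never float, for money).
import unittest
from decimal import Decimal

def apply_fee(amount: Decimal, fee_rate: Decimal) -> Decimal:
    """Return amount plus a proportional fee; reject negative inputs."""
    if amount < 0 or fee_rate < 0:
        raise ValueError("amount and fee_rate must be non-negative")
    return amount + amount * fee_rate

class ApplyFeeTest(unittest.TestCase):
    def test_typical_fee(self):
        self.assertEqual(apply_fee(Decimal("100"), Decimal("0.02")),
                         Decimal("102.00"))

    def test_zero_fee(self):
        self.assertEqual(apply_fee(Decimal("100"), Decimal("0")),
                         Decimal("100"))

    def test_rejects_bad_input(self):
        # Robustness: bad input should fail loudly, not corrupt state.
        with self.assertRaises(ValueError):
            apply_fee(Decimal("-1"), Decimal("0.02"))

if __name__ == "__main__":
    unittest.main()
```

The empirical coverage measurement the post calls for can then come from a tool like coverage.py, e.g. `coverage run -m unittest` followed by `coverage report --fail-under=99`, which exits non-zero when total coverage drops below the threshold.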
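For the documentation point, here is a small sketch of the "code says what it's doing, comments say what it should be doing" idea; `settle_trade` is again a made-up illustration:

```python
# A sketch of documenting intent alongside implementation, so that
# drift between the two is easy to spot in review. settle_trade is
# a hypothetical example, not from the original post.
from decimal import Decimal

def settle_trade(quantity: int, price: Decimal) -> Decimal:
    """Return the cash value of a trade.

    Should always equal quantity * price. Quantities are whole units;
    prices are exact decimals, never binary floats.
    """
    # Decimal (not float) keeps the arithmetic exact for money.
    return Decimal(quantity) * price
```

For the automated measurement side, tools exist for this in most ecosystems; in Python, for example, interrogate reports docstring coverage and pydocstyle enforces docstring conventions, giving you a documentation number to track alongside test coverage.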
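Finally, the metrics point lends itself to a simple CI quality gate. The sketch below assumes the real, widely used Python tools coverage.py and pylint, plus a hypothetical package name `mypackage`; the thresholds are illustrative. Running it on every commit, and recording the numbers it produces, is one way to get the over-time trend tracking described above:

```python
# A sketch of a CI quality gate: fail the build when coverage or
# static analysis regresses. Tool names (coverage, pylint) are real;
# "mypackage" and the 99% threshold are illustrative assumptions.
import subprocess
import sys

CHECKS = [
    # Run the test suite under coverage measurement.
    ["coverage", "run", "-m", "unittest", "discover"],
    # Fail if total coverage drops below the team's threshold.
    ["coverage", "report", "--fail-under=99"],
    # Static analysis: pylint exits non-zero when it finds problems.
    ["pylint", "mypackage"],
]

for cmd in CHECKS:
    print("running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"quality gate failed: {' '.join(cmd)}")
```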
At first glance, this list may seem daunting. Fear not. There is a powerful advantage working in our favor: these practices feed off each other, and almost paradoxically, the whole really is greater than the sum of its parts. For example, when I’m reviewing code for a new feature, I know that a) because we have fast iterations, the amount of code I’m looking at is small and comprehensible; b) because it’s covered by automated tests and we have nearly complete test coverage, it is likely correct and I can be reasonably sure there are no regressions; c) because it’s documented, it’s easier to understand what the author was trying to do; and d) because the static analysis tools have already flagged areas of concern, I know where to take a closer look. As a result, not only is the reviewer’s job easier, there is actually more value in the review: no time is wasted on truly egregious flaws, and the reviewer can focus on the sort of issues only a second pair of eyes can detect. There is a genuine synergy here, one that can help take the development organization to the next level.