How can we design software to handle complexity?

You’ve probably heard about the so-called “blame game” unleashed in the UK after news surfaced that the authorities failed to report nearly 16,000 positive COVID-19 cases. You’ve probably also heard that this was “caused by an Excel spreadsheet containing lab results reaching its maximum size”.

When I first heard that cases were missing I thought “can’t be an Excel problem, the rows run out at 1,048,576 which is bigger than even total cases”.

I never thought they’d have a case per COLUMN. Unbelievable. And yes: Excel columns end at 16,384 aka “XFD”. https://t.co/To9b4qIuzD
— Matt Parker (@standupmaths) October 5, 2020

There are a couple of things (at least) to be said about this.

First, it is interesting to notice that the episode was referred to as “an Excel glitch”. No mention of things that can easily be read between the lines: no one updated the spreadsheet, no one bothered to check if it needed to be updated, and no one said anything about the way the data was being logged, or even if Excel was the right tool to begin with. As if the authorities were talking about rain, the whole thing was described as something that just happened. The issue of human responsibility went completely unmentioned.

A case like this one is interesting precisely because Excel is so clearly not to blame that it lets us look past the tool. It lets us pose the question:

Why do tools and systems get blamed for human miscalculation, for a failure to apply those tools as they were intended, or for the choice of the wrong system to address a problem?

Designing systems to handle complexity

While most innovations over the course of history multiplied human capabilities (harder, better, faster, stronger, as Daft Punk put it), software introduces the groundbreaking possibility of creating things that would be impossible to build without it.

Software is not a scalar upgrade of capabilities, but an exponential one.

When capabilities can grow exponentially, however, so can complexity. And that introduces a design problem: we need to learn how to design a whole new category of things, unlike any other. And that requires that we think differently, that we approach problems differently.

Here, it’s useful to distinguish between the design of systems and the design of software. Systems make up a larger class that includes software but extends beyond it. This can sound trivial, but it’s actually key: while problems that are thought of as ‘solvable with software’ are usually approached with a design thinking mentality, problems that are not thought of as involving software (but are very much systemic problems) tend to be approached unsystematically and without much regard for design. That is likely how we end up with Excel being used to count COVID-19 cases.

John Ousterhout proposes that the main idea underlying all of computer science is Problem Decomposition: the capacity to take a complex problem and break it down into pieces so that each piece can be built more or less independently.
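As a minimal sketch of what decomposition can look like in practice, the snippet below splits a toy "count the positive results" task into three pieces that can each be written and tested on their own. The `case_id,result` line format and the function names are assumptions made up for illustration; they do not describe how the actual lab data was structured.

```python
from typing import Iterable, Tuple


def parse(line: str) -> Tuple[str, str]:
    """Split one 'case_id,result' line into its two fields (hypothetical format)."""
    case_id, result = line.strip().split(",")
    return case_id, result


def is_positive(record: Tuple[str, str]) -> bool:
    """Decide whether a parsed record is a positive result."""
    return record[1].strip().lower() == "positive"


def count_positives(lines: Iterable[str]) -> int:
    """Combine the two pieces above; this function knows nothing about parsing rules."""
    return sum(1 for line in lines if is_positive(parse(line)))
```

Each function can change without touching the others: a new input format only affects `parse`, and a new definition of a positive result only affects `is_positive`.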

The idea behind software design, Ousterhout says, is that we are doing things for the future. That, in turn, comes with an inherent problem, which is that humans can’t visualize the future very well.

Explaining complexity

In the hours after the news about this case broke, two main hypotheses were presented to explain the loss of data:

The first one said that the positive cases data was stored in Excel using columns for the data points instead of rows.
The second one said that the data was stored in rows, but using an old version of Excel, which uses the .xls file format and supports far fewer rows (65,536 per sheet), instead of the newer version, which uses the .xlsx file format and supports 1,048,576 rows.
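Both hypotheses come down to data exceeding a fixed sheet size. The sketch below shows the kind of guard that fails loudly before that happens. The row and column limits are the documented ones for the two formats; the `check_fits` helper, its parameters, and the numbers in the usage note are assumptions for illustration, not a reconstruction of the real pipeline. It also ignores details such as header rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SheetLimits:
    max_rows: int
    max_cols: int


# Worksheet size limits of the two Excel file formats.
LIMITS = {
    ".xls": SheetLimits(max_rows=65_536, max_cols=256),
    ".xlsx": SheetLimits(max_rows=1_048_576, max_cols=16_384),
}


def check_fits(n_records: int, n_fields: int, fmt: str, records_as: str = "rows") -> None:
    """Raise OverflowError if the data cannot fit in one worksheet of format `fmt`."""
    limits = LIMITS[fmt]
    if records_as == "rows":
        rows, cols = n_records, n_fields
    elif records_as == "columns":
        rows, cols = n_fields, n_records
    else:
        raise ValueError(f"records_as must be 'rows' or 'columns', got {records_as!r}")
    if rows > limits.max_rows or cols > limits.max_cols:
        raise OverflowError(
            f"{rows} rows x {cols} columns exceeds the {fmt} limit of "
            f"{limits.max_rows} rows x {limits.max_cols} columns"
        )
```

With made-up sizes, `check_fits(70_000, 10, ".xls")` and `check_fits(20_000, 10, ".xlsx", records_as="columns")` both raise, while `check_fits(70_000, 10, ".xlsx")` passes. The point is not the helper itself but that the limit is checked explicitly instead of being discovered when records go missing.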

The first explanation points to a misuse of the tool. As such, it is misleading. By focusing on the rows-columns dichotomy, it implies that, had the cases been logged using rows, things would’ve been fine. The second one focuses on a failure to update the tool, and some sources suggest that the decision to stay on an older version (pre-2007) was driven by the need to use macros to automate part of the process, which the default .xlsx format cannot store (newer versions keep macros in the separate, macro-enabled .xlsm format). This explanation is a bit more elaborate, but it still implicitly endorses the use of Excel as a database for positive COVID-19 cases. In both cases, there is a deeper issue of myopia: a system design that failed to prepare for the future.

Tools to help us think

The COVID-19 pandemic made it clear that our inability to predict the future can have serious consequences. This is only made worse by the fact that, the longer into the future we go, the more inaccurate our vision becomes.

This is where the difference between tactical and strategic programming matters: the former prioritizes getting the system to work and leaves fixing it for later, while the latter focuses on figuring out how to reduce complexity (and avoid it entirely when possible) before building.

Inherent complexity vs. accidental complexity

J. B. Rainsberger reminds us that inherent complexity comes from a problem being hard in essence; it can’t be removed, only dealt with. Accidental complexity comes from people approaching a problem badly, whether or not that problem is intrinsically hard, and because it is man-made, it can be eliminated or fixed. Since we’re doing things for the future, this distinction is paramount:

Inherent complexity can’t be avoided, only managed; accidental complexity is a mistake that replicates itself.

To approach inherently complex problems, we need to experiment, evaluate, and gather more knowledge. That doesn’t mean that we can’t do anything until we gain clarity; but it does mean that whatever decisions we make with high uncertainty will need to be revisited once we clear some of that uncertainty away.

Doing things for the future is a complex problem. The practice of software design consists, then, in building the scaffolding that our future selves can lean on to make better decisions, reduce bias and avoid falling for the trap of urgency.

In his book The Design of Design: Essays from a Computer Scientist, Frederick P. Brooks Jr. mentions that, “As tool complexity grows, the need for explicit use models increases. Even for a shovel, it is important to be explicit as to whether it is for coal, dirt, grain, snow, or some mix; whether for child, woman, or man; whether for the casual user or the manual laborer. How much greater is the need for explicit use models for a truck, a spreadsheet, an academic building!”

Getting out of our own way

So we get to the heart of the problem: the goal of software design is to remove us from the accidental complexity of a problem. To do so, system designers need to create an abstraction for the solution at a higher level. The main goal of abstraction is to handle complexity by hiding unnecessary details from the user.
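To make the idea of hiding unnecessary details concrete, here is a minimal sketch of a narrow interface for recording cases. Everything in it (`CaseRecord`, `CaseRegistry`, the field names) is hypothetical. It only illustrates that when callers can do nothing but add and count, questions such as "rows or columns?" never reach them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict


@dataclass(frozen=True)
class CaseRecord:
    case_id: str
    reported_on: date


class CaseRegistry:
    """Hypothetical registry: callers can add and count cases, nothing else."""

    def __init__(self) -> None:
        # Storage layout is private; it could change without affecting callers.
        self._records: Dict[str, CaseRecord] = {}

    def add(self, record: CaseRecord) -> None:
        """Store a record, rejecting duplicates instead of silently overwriting."""
        if record.case_id in self._records:
            raise ValueError(f"duplicate case id: {record.case_id}")
        self._records[record.case_id] = record

    def count_on(self, day: date) -> int:
        """Number of cases reported on the given day."""
        return sum(1 for r in self._records.values() if r.reported_on == day)

    def total(self) -> int:
        """Number of cases stored so far."""
        return len(self._records)
```

The trade-off is the one discussed next: this interface is far less flexible than a spreadsheet, and that is exactly what leaves fewer ways to misuse it.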

The more flexible a solution needs to be, however, the less it will be able to abstract, because it needs to give users some manipulation access, and so the higher the risk of introducing accidental complexity. More accidental complexity means that it gets increasingly hard to verify correctness, that applications are more prone to errors, and that their usage will be more likely to allow for (you guessed it) accidents.

Excel is a super flexible tool. The spreadsheet is arguably one of the most powerful and, at the same time, most accessible inventions in all of computing. The very same spreadsheet used in the UK to count COVID cases can be easily modified to run Mario Bros.

A global-scale pandemic is definitely a hard problem, inherently complex. Certainly something one does not want to take any chances with. But high-end, efficient and tested systems that are specially designed to tackle that particular problem take time to develop and implement, and viral outbreaks don’t regularly give people a heads-up.

A COVID-19 case-counter built with Excel bears much more accidental complexity than a custom tool, and therefore is a lot more prone to the types of errors we’ve seen. But in emergency situations, humans default to the domains and tools that are familiar to them, and that is another thing that can be said about MS Excel: it is a very familiar tool. This is the window through which accidental complexity can get in and add to the inherent complexity of the problem.

Our best approach, then, is to design our systems, not to help us do a particular task better or faster, but to help us make better decisions.
