Paul's Pontifications: Managing Risks in Company IT

Another month, another story of corporate IT going spectacularly wrong. First we had the RBS debacle, now it seems that Knight Capital Group have lost hundreds of millions to a rogue algorithm. Similar events have happened before. And it's not just banks that suffer from such events.

There are lessons to be learned here. Some of them I draw directly from the events linked above. Others are derived from watching large software-intensive organizations from the inside.

Lesson 1: It happens without warning

Your business is ticking over nicely. You have just approved the annual IT budget, and your CIO has assured you that everything is green. Then at 6am you get a phone call telling you that your annual profit for this year, and maybe your entire company, has just vanished into thin air courtesy of a computer that you own but quite possibly have never heard of. How can this happen?

The answer is that computers are non-linear; small changes can have huge consequences. Change a plus to a minus somewhere in a program with 100,000 lines of code (which is fairly typical) and if you are lucky you will get no output, and if you are unlucky you will get the wrong output.
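To make the plus-to-minus point concrete, here is a minimal sketch. The function names and figures are invented for illustration and are not taken from any real system. The two functions differ by a single character, and nothing crashes or warns when the wrong one runs.

```python
def settle(balance: int, payment: int) -> int:
    """Intended behaviour: take a payment off an account balance."""
    return balance - payment


def settle_buggy(balance: int, payment: int) -> int:
    """The same line with the minus mistyped as a plus."""
    return balance + payment


# Hypothetical figures: a balance of 1000 and a payment of 250.
correct = settle(1_000, 250)
wrong = settle_buggy(1_000, 250)

# Both calls succeed; only someone who knows the right answer can tell
# which result is the mistake.
print(correct, wrong)
```

This is the unlucky case: the program keeps running and hands back a plausible-looking number.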

Mistakes like that happen all the time. As a rough rule of thumb, when programmers type code they make a mistake every 10 lines or so. Everything after that is about finding and removing those mistakes. On top of that you have the mistakes that were baked in at the specification stage (assuming that your software even has a written specification; if you are relying on a programmer having an informal chat with the person who wants the program then you are in even worse shape).

The only way to prevent this stuff happening is to treat the technology as important.

Lesson 2: Do a Risk Assessment

Do an inventory of every single program used by the company in regular business. If anyone has a spreadsheet file that they regularly use, treat that as a separate program. You may have to do some digging to find these things, but that spreadsheet that some bright intern in Operations invented last year to schedule the truck drivers could be the one that paralyses your entire operation next February 29th.
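As an illustration of the February 29th problem, here is a sketch of the kind of date logic that might sit behind such a spreadsheet. The function `same_day_next_year` is hypothetical; the point is only that the obvious way of writing it works on every day of the calendar except one.

```python
from datetime import date


def same_day_next_year(d: date) -> date:
    """Naive yearly rollover: keep the month and day, add one to the year."""
    return d.replace(year=d.year + 1)


def try_rollover(d: date) -> str:
    """Run the rollover and report either the new date or the failure."""
    try:
        return same_day_next_year(d).isoformat()
    except ValueError as exc:
        # date.replace() refuses to build 29 February in a non-leap year.
        return f"FAILED: {exc}"


ordinary = try_rollover(date(2012, 2, 28))
leap_day = try_rollover(date(2012, 2, 29))
print(ordinary)
print(leap_day)
```

A test run on any ordinary day would pass, which is exactly why this sort of thing survives until the one day it matters.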

When you have done this you will have a depressingly long list that fits into roughly three categories, listed here in ascending order of risk:

1. Commercial Off The Shelf (COTS) software. For the most part you can treat this as low priority; it tends to be reasonably well tested before it leaves the supplier. Not always, but you can be sure that you have bigger problems elsewhere. HOWEVER take a long look at any configuration or other input files supplied to this software; they tend to be less well controlled and hence more prone to error. Consider putting anything like this into category 3.
2. Ancient dragons. Twenty or thirty years ago your company commissioned a big piece of software which has since become a key part of your operation. It's poorly documented and maintained by a few aging programmers, but at least they do understand it (until you offer them early retirement in a round of cost-cutting).
3. Ad-hoc bits of stuff. These hang off the side of categories 1 and 2 like remora fish around sharks. Typically they convert between obsolete protocols, massage data formats, generate specialist reports, generate input tables and configuration files, and do other odd jobs. All those spreadsheets will fit in here too, as will any particularly complicated configuration files from item 1. This stuff is risky because it was generally written on the cheap; poorly specified, documented and tested. As a result it tends to be fragile, and nobody quite knows how it all works or how to fix it when it goes wrong.

I'm going to call all this stuff "software" for simplicity, even though some of it is not normally considered to be a "computer program". From the risk point of view it's all the same.

Start with the stuff in item 3, and then work upwards. For each piece of software, figure out where the output goes and how important it is. Make a short list of the biggest risks based on the initial analysis, and then produce a Value At Risk figure for each bit of software, covering both the case where it fails to work and the case where it produces the wrong answer (the Knight Capital Group algorithm, for instance, would have had zero impact if it had simply done nothing, but the RBS account update was a problem precisely because it did nothing). Don't forget to include reputation, regulatory and compensation costs in the analysis.
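The short list can be kept as a simple table. The sketch below is one possible way to rank it, under loud assumptions: the item names and loss figures are invented, and taking the larger of the two failure modes as the Value At Risk is my simplification, not a standard definition.

```python
from dataclasses import dataclass


@dataclass
class SoftwareItem:
    name: str
    category: int           # 1 = COTS, 2 = ancient dragon, 3 = ad-hoc
    loss_if_silent: float   # estimated loss if it simply does nothing
    loss_if_wrong: float    # estimated loss if it produces the wrong answer

    @property
    def value_at_risk(self) -> float:
        # Assumption: rank by the worse of the two failure modes.
        return max(self.loss_if_silent, self.loss_if_wrong)


# Hypothetical inventory; every figure here is made up for illustration.
inventory = [
    SoftwareItem("order routing algorithm", 2, 0, 300_000_000),
    SoftwareItem("nightly account update", 2, 50_000_000, 80_000_000),
    SoftwareItem("truck scheduling spreadsheet", 3, 2_000_000, 500_000),
    SoftwareItem("office suite", 1, 100_000, 10_000),
]

ranked = sorted(inventory, key=lambda item: item.value_at_risk, reverse=True)
for item in ranked:
    print(f"{item.name}: {item.value_at_risk:,.0f} (category {item.category})")
```

Keeping the two failure modes as separate columns matters, because an item that is harmless when it stops can still be ruinous when it runs wrongly, and vice versa.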

Lesson 3: Risk cannot be outsourced

You've probably got some outsource contracts already. If any of your high-risk software is outsourced, either as software maintenance or operation, then compare the penalty clauses in your contract with your VAR. You will probably find at least one order of magnitude difference, if not two or three.
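The comparison itself is one line of arithmetic. Here is a sketch with invented figures: a hypothetical contract whose penalties are capped at 500,000 against a system with a VAR of 50 million.

```python
import math


def orders_of_magnitude_gap(value_at_risk: float, penalty_cap: float) -> float:
    """How many powers of ten separate your exposure from the supplier's."""
    return math.log10(value_at_risk / penalty_cap)


# Hypothetical numbers, not taken from any real contract.
gap = orders_of_magnitude_gap(value_at_risk=50_000_000, penalty_cap=500_000)
uncovered = 50_000_000 - 500_000
print(round(gap, 1), uncovered)
```

Whatever the real numbers turn out to be, the uncovered remainder is the part of the risk that stays with you.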

In general you cannot pay a supplier enough to take on your risk. If you are not careful you will find that your supplier has all the power to control the service while you have all the responsibility for their failures.

Consider running your own acceptance tests on external software development. Yes, the supplier has already run a bunch of tests, but you still need to check that it works in your context before it goes live. If that means you need a whole duplicate IT set-up for testing then so be it. I haven't seen an analysis of the Knight Capital Group failure, but I'll bet dollars to doughnuts that lack of test infrastructure was an important element.

Take a look at Section 2.2.4 in Nancy Leveson's book Engineering A Safer World. On one level, the Bhopal disaster was about engineering and procedural failure; an essential safety step was omitted during a routine procedure. But behind that was a long history of cost reduction and outsourcing; skilled staff were replaced by external contractors and training programs were cut back over several years until a serious accident became inevitable.

(Aside: the system safety world has faced similar issues, and tends to have better documented case histories because loss of life generally triggers public scrutiny: you could learn a lot from reading about system safety).

IT has many of the same properties as a chemical plant; it is complicated, it requires ongoing maintenance and operation by skilled personnel, and a small mistake can cause a disaster. Companies manage IT as a cost centre, and cost centres exist to be minimised. Hence there is always pressure to replace expensive staff with cheaper, lower skilled replacements, or to outsource the whole thing to someone who promises to do it cheaper, maybe in another country. Over time an accident becomes more and more likely, and eventually inevitable.

Lesson 4: Pay attention to process

Process, meaning the steps your people go through to carry out part of the operation, needs to be considered as a component of the overall system, and treated almost as if it were a piece of software. The difference is that people rather than computers carry it out, so any complicated process will, sooner or later, include a mistake. This was probably a significant component in the RBS debacle.

Complicated manual processes can be automated. This often means creating another bit of remora software, but that is actually preferable to a manual process.

Systematise your processes; make sure they are written down and followed. Keep them up to date.

Pay attention to change control and configuration management. Lots of mistakes stem from the wrong version of some file being used. You should know the version of every piece of software you are using, but equally any kind of configuration file should also be under the same kind of control.
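One cheap way to put configuration files under that kind of control is to record a checksum of each approved version and check the live files against the record. The sketch below assumes a hypothetical manifest mapping file names to SHA-256 digests; the file names and contents are invented, and a real set-up would more likely lean on a proper version control system.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Digest of a file's exact contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(manifest: dict[str, str], root: Path) -> list[str]:
    """Names of files that are missing or differ from the approved digest."""
    drifted = []
    for name, approved in sorted(manifest.items()):
        target = root / name
        if not target.is_file() or sha256_of(target) != approved:
            drifted.append(name)
    return drifted


# Demonstration in a throwaway directory with made-up config files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "routing.cfg").write_text("mode=live\n")
    (root / "limits.cfg").write_text("max_order=1000\n")

    # Record the approved versions.
    manifest = {name: sha256_of(root / name) for name in ("routing.cfg", "limits.cfg")}
    clean = find_drift(manifest, root)

    # Someone edits a limit by hand without going through change control.
    (root / "limits.cfg").write_text("max_order=1000000\n")
    drifted = find_drift(manifest, root)

print(clean, drifted)
```

Run before each deployment or overnight batch, a check like this turns "the wrong version of some file" from a post-mortem finding into a pre-flight warning.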

Lesson 5: Listen to your engineers

Neither managers nor engineers can see the whole story; if you listen only to the managers then you will sleepwalk into disaster. If you listen only to engineers then you will wind up commissioning another Concorde or Advanced Gas Cooled Reactor (both of which were created by senior engineers who wouldn't listen to the accountants). The trick is to listen to both.

The managers will tell you about how to trim costs. The engineers will tell you why that is a bad idea. They will also tell you about the current problems and hazards.

Do not trust your management reporting chain to tell you this stuff. No manager wants to take a problem to his boss, so the instinctive response to an engineer or lower level manager with a problem is to solve it quickly, or failing that, put a lid on it. An engineer who says "this is fragile" will usually be told "We haven't got time for that now, but we'll fix it when the current urgent project is finished". Of course there is always another urgent project, and so the fix is postponed indefinitely, and no information about the problem percolates upwards.

A related issue is "technical debt". This is incurred whenever a project makes an expedient short-term decision (usually to meet a project deadline) that has long term costs. Examples include engineering kludges (such as adding one of those remora boxes I talked about earlier) or skimping on documentation. The analogy is with a financial loan; you get a short term productivity boost (the loan), but a long term productivity cost (the interest) until you go back and fix the original short-cut (paying off the loan). Imagine the financial chaos if every software project in your company was allowed to borrow money off the books. Then imagine the technical chaos caused by uncontrolled accumulation of technical debt across all your different systems.

Conclusion

You can stop your corporate IT blowing up in your face, but it takes attention to the details. If you treat IT as a dumb cost centre (like the staff canteen or building maintenance) then you won't just have a slightly shabby IT service, you will have an unstable foundation for your company that could collapse at any moment.

4 comments:

Jason W said...: Hi,

Saw this from the Planet Haskell blogroll. Was wondering if you have any sort of backing for the claim quoted below.

"As a rough rule of thumb, when programmers type code they make a mistake every 10 lines or so."

Thanks!; August 4, 2012 at 10:53 PM
Unknown said...: What Jason said.; August 6, 2012 at 6:06 AM
Paul Johnson said...: Regrettably, nothing published. I once (many years ago) tried a private experiment where I kept track of my own errors when writing a few hundred lines of code, and as I recall the number of errors was about 10% of the number of SLOC. Most of the errors were things I had forgotten (e.g. loop increments).

I'm extrapolating wildly of course, but I think I'm probably about typical.; August 6, 2012 at 6:27 AM
Anonymous said...: Steve McConnell, in Code Complete, talks about 1 to 25 errors per 1000 lines of code in delivered software. Considering that delivered software has already passed testing and QA phases, 10% is a very believable figure for the first raw code.; August 18, 2012 at 9:02 AM

Saturday, August 4, 2012
