Saturday, September 1, 2007

Composability and Productivity

This was posted to my original blog on January 7th 2007. It was also discussed on Reddit, so please do not repost it.

----------------------------------------------------

My earlier post about increased productivity through functional programming stirred up a good deal of comment. A number of people replied that the libraries are more important than the language. More recently a similar point has been made by Karsten Wagner, who argues that code reuse is what makes languages productive.

I remember the early days of OO, when it was argued by many people (me amongst them) that OO languages would finally let us write reusable software. There was optimistic talk of “software factories”, and Brad Cox gave lots of talks about how software could now finally move from craft to engineering discipline, built on the back of libraries of reusable code. So the entire industry went for C++, but the easy creation of reusable software remained elusive. That is not to say you can’t ever produce reusable software in the language, but it is not significantly easier than in C.

Karsten accurately pins the reason why C++ failed to deliver: memory management. Back in those old days I was trying to explain to people why Eiffel would be so much more productive than C++, and garbage collection was a big part of it.

The reason why GC is so important lies in a more general principle called composability. Composability means that you can put two bits of code together and important correctness properties will be preserved automatically. This does not mean, of course, that the composition is automatically correct in a wider sense, but it does mean that you don’t introduce new bugs merely by sticking two things together.

Imagine two modules of code in a non-GC language like C or C++. Module X creates an object and hands a reference to that object over to Module Y. At some point in the future X will delete that object. However it is only safe to do so once Y has finished with it. So if X and Y were written independently then it is quite possible that Y will hang on to the reference longer than X expects. In short, manual memory management is not composable, because the composition of X and Y can introduce stale pointer bugs that were not present in either X or Y.

In a language with GC this is a non-problem: the collector will reap the object once it sees that both X and Y have finished with it. X can therefore forget about the object in its own time without having to know anything about Y. But programmers in non-GC languages have to resort to a number of workarounds. Either objects have to be copied unnecessarily (which leads to stale data bugs instead of stale pointer bugs), or else some kind of reference counting or similar scheme must be employed. Reference counting is of course merely an ad hoc, informally-specified, bug-ridden, slow implementation of GC, and therefore stands as a classic example of Greenspun’s Tenth Rule.

But programming languages contain other examples of non-composable constructs. The current biggest offender is shared-memory concurrency using locks. If X takes locks Foo and Bar, and Y takes locks Bar and Foo (in those orders), then sooner or later they are going to deadlock with X holding Foo and Y holding Bar. Ironically, Java, which builds a lock into every object, has the biggest problems here.

More surprisingly, mutable program state in general is a source of non-composability. Mutable state is actually another form of manual memory management: every time you over-write a value you are making a decision that the old value is now garbage, regardless of what other part of the program might have been using it. So suppose that X stores some data in Y, and then Z also stores some other data in Y, overwriting what X did. If X assumes that its old data is still there then it is going to be in trouble. Either X needs to check defensively for new data, or Y needs to tell X about the change (the Observer pattern). Either way, the addition of Z to the system can introduce bugs that stem from the composition of modules rather than from the modules themselves.

Another example is the Command pattern, which includes a method to “undo” a previous command. Why undo it? Because the global state has been changed. Get rid of the concept of a single unique state for the entire application and you get rid of the problem.

On a more prosaic level, many library modules have some internal state, and so require an initialisation call to “set things up”. Who exactly is responsible for this initialisation? If you compose two modules together then they may both imagine themselves solely responsible for making this initialisation call, and hence it will be done twice.

It is a truism of programming that “integration” is the most difficult and risky (from a scheduling point of view) part of developing a large system. Integration basically means composing together all the units to try to get a working system. And the biggest integration problems are due to the non-composability of the units. A defect within a unit is fairly easy to identify, but composition bugs are down to the interaction of two separate units in some unforeseen way, which makes them correspondingly more difficult to pin down.

So what would a language look like if it got rid of all of these non-composable constructs? The answer, basically, is Haskell. The underlying goal of functional language designers is to make everything completely composable. Haskell is the language that has advanced furthest towards this goal.

Erlang is pretty good as well. It doesn’t eliminate state, but it does keep it confined to individual processes: there are no shared-memory locks. Processes have to communicate according to shared protocols, but this is a manageable dependency on a common abstraction. Even within a process the shared state can be reduced by careful use of functional rather than imperative programming.

10 comments:

Eelis said...

In your explanation of why two independently written modules of code in a non-GC language cannot be combined, you seem to assume that the modules do not make explicit assumptions and guarantees about the validity duration of the references they receive and hand out. This assumption does not hold for properly designed C++ libraries (I'm only considering C++ in my response), for which validity duration and ownership guarantees for conveyed references and raw pointers (and similar things like iterators) are explicitly documented. For example, for all containers in the C++ standard library, it is explicitly documented which operations on the containers invalidate which iterators and references to the elements.

If the two modules' assumptions and guarantees are compatible, they can be combined just fine.

Paul Johnson said...

If the two modules' assumptions and guarantees are compatible, they can be combined just fine.

That, of course, is the big "if".

Paul.

Eelis said...

"That, of course, is the big "if"."

(I'm not sure why I'm even bothering to give such a meagre cop-out a dignified response, but here goes.)

The point is that this "if" can simply be verified by the programmer who intends to combine the modules. This is really no different from verifying that the types of the modules' interfaces are compatible (except that in the latter case there is of course more help from the compiler, but that is beside the point). In both cases, if the modules' interfaces are not compatible, some glue code may be needed.

If your criterion for languages that support composable modules is that /any/ two modules must be composable without any "if"s -- that is, without any concern for compatibility of interfaces -- then by that standard your Haskell cannot be considered to feature module composability either.

JP Moresmau said...

I think the "more help from the compiler" is NOT beside the point. There is a world of difference between a compatibility that is enforced by a tool and a compatibility that requires a developer to read documentation, understand it, and code perfectly according to it (assuming the documentation is 100% correct all the time). And Haskell does enforce more than just interface type checking (or, more precisely, the type checker also checks for side effects thanks to monadic types, etc.)

Eelis said...

JP Moresmau:

You must be arguing a different point than Paul, who did not include a "must be enforceable by a tool" clause in his notion of composability.

That said, it's a good thing he didn't add such a clause, because with it, Haskell wouldn't qualify either. After all, in Haskell too the developer is absolutely required to "read documentation, understand it, and code perfectly according to it", because the compiler-checked types of functions generally barely scratch the surface of their semantics (including interface pre- and postconditions).

Just like in C++ you have to read the documentation for a function returning an int* to find out how long the pointer will remain valid, in Haskell you have to read the documentation for a function taking a [Int] to find out if it requires the list to be finite, non-empty, or something else. Both are examples of interface/compatibility aspects that have to be checked by programmers.

Paul Johnson said...

Sorry eelis, I was a bit rushed for time yesterday. Here is a fuller reply.

It's not just a matter of putting two modules together, it's a matter of putting lots of them together.

Let's say I have a module that allocates some memory for data and exposes the resulting pointer (along with a lot of other data) as part of its API. Client modules use all this data to compute derived data that includes the original pointer (e.g. lists sorted by different criteria). Then the original module does something that removes the original data from view. At this point you have a problem: when is it safe to free this memory? Any one of the client modules may still be holding on to a copy of the pointer, so it is only safe to free the memory when all the clients have dropped their copies of the pointer. But the original module has no way of knowing when this is unless detailed knowledge of all those clients has been programmed into it.

There are various workarounds for this, most of which involve reference counting. But reference counting is not an alternative to GC; it's a slow, buggy, ad-hoc implementation of GC, which is why this perfectly illustrates Greenspun's Tenth Rule.

Of course you could just have the clients take copies of the data, so each has exactly one copy to delete. But that just replaces the GC problem with a precisely equivalent data consistency problem.

Anonymous said...

Correction about the semantics of GC: The GC doesn't free an object once it sees module X and Y are done with it... that's way too hard (impossible because it's equivalent to the halting problem). Most GCs instead delete an object at some arbitrary point when it is unreachable, which is guaranteed to be after the last time the object is ever used.

johnbender said...

+1 for mentioning erlang. Loving FP these days. OO got boring kinda fast, when I got stuck working with someone else's ObjectXFactory class.

Conal said...

"Mutable state is actually another form of manual memory management: every time you over-write a value you are making a decision that the old value is now garbage, regardless of what other part of the program might have been using it."

I like this insight very much. Thanks!