Friday, January 18, 2013

When Haskell is faster than C

Conventional wisdom says that no programming language is faster than C, and all higher level languages (such as Haskell) are doomed to be much slower because of their distance from the real machine.

TL;DR: Conventional wisdom is wrong. Nothing can beat highly micro-optimised C, but real everyday C code is not like that, and is often several times slower than the micro-optimised version would be. Meanwhile the high level of Haskell means that the compiler has lots of scope for doing micro-optimisation of its own. As a result it is quite common for everyday Haskell code to run faster than everyday C. Not always, of course, but enough to make the speed difference moot unless you actually plan on doing lots of micro-optimisation.

This view seems to be borne out by the Computer Language Benchmarks Game (aka the Shootout), which collects implementations of various programs written in different languages and measures their execution speed and memory usage. The winningest programs are always written in highly optimised C. The Haskell versions run anything from slightly to several times slower than the C versions. Furthermore, if you read the Haskell it is also highly optimised, to the point where it no longer looks like Haskell. Instead it reads like C translated into Haskell.  Based on this, you would be justified in concluding that everyday Haskell (that is, Haskell that plays to the strengths of the language in brevity, correctness and composability) must be irredeemably slow.

But then I read this presentation by Tony Albrecht, in which he talks about how some seemingly innocent everyday C++ code is actually very inefficient. Two things in particular caught my eye:
  • A fetch from the main memory when there is a cache miss costs 400 cycles. Pointer indirection in particular tends to fill up cache lines, and hence increase the number of cache misses overall.
  • A wrong branch prediction costs over 20 cycles. A branch that skips a simple calculation can actually run slower than the calculation.
To put it another way, C is no longer close to the real machine. The real machine has 3 levels of CPU cache (some of them shared between multiple cores), long instruction pipelines, branch prediction, multiple ALUs and FPUs, and hardware data flow analysis done while the program is being executed in order to schedule all this in a way that makes it look like a simple processor executing one instruction at a time. C doesn't expose all that to the programmer, so it seems to me that the only way to write highly optimized C is to have a sophisticated mental model of the processor and its memory architecture, decide what you want this machine to do, and then reverse-engineer the C which is going to make that happen. The result is  difficult to understand and hence hard to maintain. Look at the C implementations of the Shootout problems for examples of what I mean.

But most code isn't written like that. Tony Albrecht is a games programmer, an expert at squeezing cycles out of the rendering loop. Most developers do not live in that world. For them the objective is to produce code that meets the requirements, which includes being fast enough. This is not laziness or waste, but practicality. First design the optimal algorithm, then implement it in idiomatic code. Only if that does not run fast enough should you start detailed profiling and micro-optimisation. Not only is the micro-optimisation process itself expensive, but it makes the code hard to understand for future maintenance.

So I wondered: the high level of Haskell gives the compiler many more opportunities for micro-optimisation than C. Rather than comparing micro-optimised programs therefore, it seemed sensible to compare everyday programs of the sort that might be produced when readability is more important than raw speed. I wanted to compare programs that solved a problem large enough to have a useful mix of work, but small enough that I could write Haskell and C versions fairly quickly. After poking around the Shootout website I settled on the reverse-complement problem.

A potential issue was that one of my programs might inadvertently use something highly non-optimal, so I decided I would profile the code and remove anything that turned out to be pointlessly slow, but not change the structure of the program or add complexity merely for the sake of speed. With that in mind I wrote Haskell and C versions. I also downloaded the Shootout winner to get some feel for how my programs compared. You can see my code at the bottom of this post.

The first version of the Haskell took 30 seconds (compared with the Shootout time of about 0.6 seconds). As I had feared, profiling did indeed reveal something pointlessly slow in it. In order to filter out carriage returns from the input I used "isLetter", but unlike the C char type the Haskell Char covers the whole of Unicode, and determining if one of those is a letter is not trivial. I put the filter after the complement operation and compared the result with zero, which in addition to being faster is also the Right Thing if the input contains invalid characters. Once I had this fixed it dropped down to a much more respectable 4.3 seconds.  Interestingly, profiling suggests that about half the time is being spent writing out the 60 character lines; merely printing out the result with no line breaks cut execution down to around 2 seconds.

The C version, meanwhile, took 8.2 seconds. Profiling did not directly reveal the cause, but it seemed to imply that the processor was spending most of its time somewhere other than my code. I strongly suspect that this time is being spent in putc(3) and getc(3). The obvious optimisation would use fread(3) and fwrite(3) to read and write characters in blocks instead of one at a time, but that would require significant changes to the code; extra bookkeeping to deal with the start of the next sequence (signalled by a ">" character) when it is found half way through a block, and to insert newlines similarly. Unlike the replacement of isLetter in Haskell this would require new variables and control structures driven by the cost of the solution rather than a simple switch to a less expensive expression

It might be argued that I have tilted the playing field against C by not making these changes, and that any halfway competent C programmer would do so when faced with code that runs too slowly. If isLetter is pointlessly slow, isn't the same thing true of putc(3) and getc(3)? But I think there is a clear difference.  Both programs are written in a character-oriented way because the problem is described in terms of characters. I wrote the inner loop of the C to operate on a linked list of blocks because it looked like a faster and simpler choice than copying the whole string into a new buffer twice the size every time it overflowed (on average this algorithm copies each character once or twice; see Knuth for details). I might have considered reading or writing the characters in reverse order rather than doing the in-memory reverse in a separate function, but profiling didn't show that as a significant time sink.  Overall, getting decent performance out of the C is going to take about the same amount of work as writing the code in the first place.

On the other hand the Haskell has decent performance out of the box because the compiler automates a lot of the micro-optimisation that C forces you to do manually. It may not do as good a job as a human with plenty of time might do, but it does it automatically and reasonably well.

This isn't principally about code size, but for the record the Haskell has 42 SLOC, of which 21 are executable. The C has 115 SLOC, of which 63 are executable. The Shootout C winner has 70 SLOC, of which 46 are executable (counting comma separated statements as one line per statement).

So here is the bottom line. If you really need your code to run as fast as possible, and you are planning to spend half your development time doing profiling and optimisation, then go ahead and write it in C because no other language (except possibly Assembler) will give you the level of control you need. But if you aren't planning to do a significant amount of micro-optimisation then you don't need C. Haskell will give you the same performance, or better, and cut your development and maintenance costs by 75 to 90 percent as well.

Update: here are the compilation and run commands:

ghc -main-is HaskellText -O2  -fllvm -funbox-strict-fields HaskellText.hs

gcc -O2 naive-c.c -o naive

time ./HaskellText < big-fasta.txt > /dev/null

time ./naive < big-fasta.txt > /dev/null

For what its worth, here is the code for the two programs.

First the Haskell:

module HaskellText (main) where

import Control.Applicative
import Control.Monad
import Data.ByteString.Lazy.Char8 (ByteString)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Char
import Data.Int
import Data.Vector.Unboxed (Vector, (!), (//))
import qualified Data.Vector.Unboxed as V
import Data.Word
import System.IO

-- | A block requiring the reverse complement.  The first element is the
-- header string. The second element is the actual data.
data Block = Block ByteString ByteString

main :: IO ()
main = getBlocks >>= mapM_ (writeBlock . reverseComplement)

-- | Write a block to stdio with line breaks.
writeBlock :: Block -> IO ()
writeBlock (Block name txt) = do
      putChar '>'
      B.putStrLn name
      mapM_ B.putStrLn $ splitEvery 60 txt

-- | A utility function that ought to be part of the bytestring library.
splitEvery :: Int64 -> ByteString -> [ByteString]
splitEvery n t = go t
    go t = if (not $ B.null t2) then t1 : go t2 else [t1]
      where (t1, t2) = B.splitAt n t

-- | Read a series of blocks, each of which starts with a '>' character.
getBlocks :: IO [Block]
getBlocks = map mkBlock . drop 1 . B.split '>' <$> B.getContents
      mkBlock bs1 = Block h t
        where (h, t) = B.break (== '\n') bs1

reverseComplement :: Block -> Block
reverseComplement (Block name codes) =
   Block name
   $ B.filter (/= '\0')
   $ (V.unsafeIndex tbl . ord)
   $ B.reverse codes
      tbl = V.replicate 256 '\0' //
            map (first ord) (pairs ++ map (first toLower) pairs)
      pairs = [('A', 'T'),('C', 'G'),('G', 'C'),('T', 'A'),
               ('M', 'K'),('R', 'Y'),('W', 'W'),('S', 'S'),
               ('Y', 'R'),('K', 'M'),('V', 'B'),('H', 'D'),
               ('D', 'H'),('B', 'V'),('N', 'N')]
      first f (x,y) = (f x, y)

Now the C. Sorry about the missing brockets around the #include files; this blog host won't render them, even when I try writing the HTML directly.

#include stdio.h
#include stdlib.h
#include strings.h

#define BLOCKLEN 1024
#define TITLELEN 132
#define LINELEN 60

typedef struct block_struct {
  int len;
  char text [BLOCKLEN];
  struct block_struct *next;
} Block;

/* Here to get them profiled */
int my_getc (FILE *stream) {
  return (getc(stream));

int my_putc (int c, FILE *stream) {
  return (putc(c,stream));

/* Initialised by calls to setComplement. Holds the complement codes. */
static char complement[256];

void setComplement(char c1, char c2) {
  complement [c1] = c2;
  complement [c2] = c1;
  complement [c1 | 0x20] = c2;
  complement [c2 | 0x20] = c1;

/* Reverse-complement block in-place */
void reverseComplementBlock (Block *block) {
  char *ptr1;
  char *ptr2;
  char c;

  ptr1 = &(block->text[0]);
  ptr2 = &(block->text[block->len - 1]);
  while (ptr1 <= ptr2) {
    c = complement [*ptr1];
    *ptr1 = complement [*ptr2];
    *ptr2 = c;

  Read blocks up to either the next ">" or EOF. Create a chain of
  blocks in reverse order of reading.
Block *readBlocks () {
  char c;
  int i;
  Block *result;
  Block *old;

  old = NULL;
  result = NULL;
  c = ' ';
  while (c != '>' && !feof (stdin)) {
    result = malloc (sizeof (Block));
    i = 0;
    while (i < BLOCKLEN && !feof (stdin)) {
      c = (char)my_getc(stdin);
      if (c == '>') {
    ungetc (c, stdin);
      if (c >= ' ') { // Drop line breaks.
    result->text[i++] = c;
    result->len = i;
    result->next = old;
    old = result;
  return result;

int process_sequence () {

  char title[TITLELEN];
  Block *ptr;
  Block *old_ptr;
  int i;
  int cpos;

  if (!feof(stdin)) {
    fgets (title, TITLELEN, stdin);
  } else {
    return 0;

  if (index (title, '\n') == NULL) {
    while (my_getc (stdin) != '\n') {}

  ptr = readBlocks ();

  printf("%s", title);

  cpos = 0;
  while (ptr != NULL) {
    reverseComplementBlock (ptr);
    for (i = 0; i < ptr->len; i++) {
      if (cpos++ == 60) {
    my_putc ('\n',stdout);
    cpos = 1;
      my_putc (ptr->text[i], stdout);
    old_ptr = ptr;
    ptr = ptr->next;
    free (old_ptr);
    /* Beware; the obvious "for" loop would free the block and then
       dereference it to move to the next one. */

  my_putc ('\n',stdout);
  return 1;

void main () {

  int i;
  char title[TITLELEN];

  for (i = 0; i < 256; i++) {
    complement [i] = 0;
  setComplement ('A', 'T');
  setComplement ('C', 'G');
  setComplement ('M', 'K');
  setComplement ('R', 'Y');
  setComplement ('W', 'W');
  setComplement ('S', 'S');
  setComplement ('V', 'B');
  setComplement ('H', 'D');
  setComplement ('N', 'N');

  while (process_sequence ()) {}


Carl said...

You could probably make the C version about twice as fast just by using getc_unlocked and putc_unlocked. (Which is of course not quite portable.)

It seems impossible to post comments without cookies, that's annoying. Maybe it'll work this time?

Mos Fring said...

Instead of "TL;DR" where you mean "Summary" or "Abstract", why not consider "Summary" or "Abstract" in future? It will enhance the impression of competence and make your readers feel less like you're talking down to them.

Isaac Gouy said...
This comment has been removed by the author.
Anonymous said...

Excellent way of looking at it. As we tackle problems of increasing complexity it is good to focus on solutions that give more correctness and maintainability while not losing much performance.

It looks Haskell compilers are going to make C go the way of assembly.

Anonymous said...

Just a quick nitpick (because I'm always bugged by overuse of $...): if (not $ B.null t2) then ... can just be if not (B.null t2) then ...

Isaac Gouy said...

"(aka the Shootout) ... Shootout ... Shootout ... Shootout ... Shootout ... Shootout"

No, No, No, No, No, No! Shootout.

"There was no wish to be associated with or to trivialise the slaughter behind the phrase shootout so the project was renamed back on 20th April 2007."

5 years on, please use the correct name -- the benchmarks game.

Andy said...

I can't reproduce your results.

I took the input file from the shootout page:

and fixed the C code so it compiled by adding the brackets for the includes. I then compiled it with "gcc -O3" and ran it on the input. It was nearly instantaneous, so I made a new input file by repeating the input file lots of times (it is now 127 MB instead of 11 KB).

Now when I run it as:

time ./a.out < in.txt > /dev/null

I get:
real 0m2.903s
user 0m2.864s
sys 0m0.032s

For the haskell side, because I don't really know what I'm doing, I commented out the first line of the file and compile with "ghc -O" (without commenting out that line, it wouldn't link). I then run it:

time ./main < in4.txt > /dev/null

and get:

real 0m14.554s
user 0m14.433s
sys 0m0.088s

Even if I don't turn on the optimizer for the C code, I still get much faster times (a bit over 4 seconds).

Compiler versions:
gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2
The Glorious Glasgow Haskell Compilation System, version 7.4.2

Jon said...

That you can get great performance from C doesn't mean you can't get bad performance if you work for it. Code using getc(), putc() and so on is never going to be fast. It's just poor C.

High-level languages will end up eating C's cake, but the above is not a good example of why.

Andy said...

as a follow up to my previous comment. Since everyone is saying that it is getc/putc which is causing C to be slow, I thought I should also try redirecting to a file:

C version:

time ./a.out < in.txt > cresult

real 0m4.250s
user 0m2.888s
sys 0m1.352s

Haskell Version:

time ./main < in.txt > hsresult

real 0m16.507s
user 0m14.581s
sys 0m1.888s

As you would expect, the sys time goes up but the time spent in userspace doesn't really change.

k0t4 said...

I can not reproduce your results either. Perhaps you made a mistake when you were reading the results.
Also jumping to the conclusion that Haskell will be as close to as efficient as c everytime based on a single test with a single program is a big issue. In order to really be able to say that with any level of confidence you should really test several programs in both languages and do that for several different types of problems.

Timmy said...

I find it hard to take this post seriously, since failing to realize the impact I/O will have on the benchmark results seems like a pretty big methodological flaw. In fact microbenchmarks, pretty much by definition, try to avoid interacting with any system that is not part of the core concern of the benchmark (as much as possible)..

In fact the benchmark does explicitly say "read line-by-line", so your program probably even fails to qualify as a solution.

The motivation for the original benchmark may have been measuring the performance of data structures, but that's certainly not what you are measuring. Which brings me to the question of what exactly is being measured here?

The I/O throughput of the standard libraries of language implementations? The ease by which this implementations can be understood and utilized? I think we can both agree that on both counts Haskell doesn't have a realistic chance of beating C, regardless of the amounts of magic that the compiler may be capable of utilizing..

This brings me to my final point, which is that I fail to see how you can claim in your conclusion that "conventional wisdom is wrong", given that
1. everyday C code will most certainly not be doing character-based I/O
2. out of a lineup of n programmers picked at random a sizable chunk wouldn't even be capable of writing I/O code in Haskell, while almost certainly all of them would still manage an "everyday" C solution that beats yours.

Anonymous said...

>2. out of a lineup of n programmers picked at random a sizable chunk
>wouldn't even be capable of writing I/O code in Haskell, while almost
>certainly all of them would still manage an "everyday" C solution that
>beats yours.

I don't think that this argument holds. The fact that few programmers are able to comprehend and use Haskell only means that they don't have yet that skill. It is not something that may never change. 40 years ago people capable of coding in C weren't legion either. Out of the same random chunk of programmers, how many more of them would be able to code up that programme in PHP rather than C? Does it make PHP more suitable?

Sphires said...

< less than < <
> greater than > >

" would have been a good alternative to LT GT symbols

Sphires said...

< less than & lt; & #60;
> greater than & gt; & #62;

remove space after &

Anonymous said...

Its nice to see that you stacked this test on Haskells side by choosing the benchmark test that it is best at. I think you should face the facts that Haskell will almost never be anywhere near as fast or efficient as well written c

Timmy said...


>Does it make PHP more suitable?

Maybe not "suitable", but it makes PHP certainly more usable than C. And that seems to be what this post is mostly concerned about.

Btw, fun fact: the PHP version in the benchmarks game for this benchmark does manage to beat this blogger's C version by 1.6 seconds, and it's only 30 lines long.

Andy said...

Using the extra options to ghc, I still can't reproduce the results:

Haskell (with -O2 -fllvm and -funbox-strict-fields):

$ time ./main < in.txt > /dev/null

real 0m15.509s
user 0m14.065s
sys 0m1.316s

C (with -O2):

$ time ./a.out < in.txt > /dev/null

real 0m2.895s
user 0m2.864s
sys 0m0.024s

Anonymous said...

In short, you're wrong. C is just as close to the machine now as it was before. The only difference is that "the machine" is now actually rather complex. The machine does not offer some kind of "faster_dereference" which is faster than C's de-reference but with slightly different semantics. C's de-reference is all the machine offers. The fact that loading from main memory can now be very expensive is irrelevant.

Also, a linked list of characters? hahahahahaha. The buffer algorithm will be much faster. No wonder your performance was rubbish.

IOW, your entire article is "I don't understand CPUs, and I'm going to write some terrible code that uses the CPU very badly, and this proves my point.".

Readability *does* matter. And being more readable *is* a big advantage. Performance *isn't* everything. But you have definitely failed to demonstrate that Haskell can be equally as performant as C, and secondly, it's just a pity that Haskell is *less* readable than C, in general.

Forhad Chowdhury said...
This comment has been removed by a blog administrator.
treeowl said...

The Haskell code here could be made more efficient. The most obvious thing is that break should be replaced with breakByte. The second most obvious thing is that you scan the header text twice, first for the greater-than symbol and then for the newline. The third most obvious thing (I think...) is that the "filter out nulls" operation doesn't seem to serve any obvious purpose. If you need it, you should probably indicate why.

David Feuer said...

My last post had a couple errors. I see now why the filter is required. I also see now that GHC has a rewrite rule to replace break with breakByte when appropriate. However, the double scanning is still unfortunate (it might even be good to use readLn after setting the handle buffer size explicitly—not sure), and I think I found a bigger problem: as written, I believe the table will be rebuilt for every block, unless full laziness is enabled (a bad idea, generally). Making the table a top-level variable seems the most sensible approach.

Sylko Olzscher said...
This comment has been removed by the author.
UNIX administrator said...

Guys, just to remind everyone that, as fast as C is, if you really need speed, you take the latest-and-greatest shiny new intel Fortran, Sun Studio, or PGI Group's Fortran compiler and code away. Yes, Fortran. And yes, there are ultramodern Fotran compilers still produced, and then some. And yes, they are used because short of coding in assembler, the Fortran compilers produce code closest to the hand coded assembler.

And yes, I know, this is just anecdotal evidence from someone on the Internet. Yes, I understand. But if I didn't think this important enough to mention, I wouldn't have bothered. The reason I am mentioning this is just so that everyone gets the full picture, that there are languages and compilers even faster than C.

In finite element analysis, for the most intensive computing work (solvers), we used Fortran. Matrix computations on the order of 100 million by 100 million. intel AVX optimizations to the hilt. Why wasn't C used? The assembler code generated wasn't fast enough. Last thing we were looking at before I left that field was running PGI Fortran CUDA compilers, which enabled us to compile the code into Nvidia GPU assembler. Again, not C, Fortran. Why? Speed.