Shared Mutable Memory Must Die
On Friday I had the opportunity to meet Joe Armstrong who created the Erlang programming language. He was visiting London and popped into LShift to give a talk. It was only meant to be a 15 minuted talk but ended up at a couple of hours; all the better for us! Now I'm pretty big fan of Erlang, though I prefer statically strongly typed languages like Haskell. It is true that you can do things in dynamically strongly typed languages like Erlang that you can't do in statically strongly typed languages but personally I don't think these are necessarily good things and I think the advantages of statically typed languages vastly outweigh any loss of expression. But that's not what I really want to write about now.
I have heard tales of the strangeness of the Grand Old Men of Computing. I guess Joe's probably still a bit young to fit in to that category, but the eccentricities are certainly there for all to see, making for an amusing if at times bemusing talk and general discussion. What I was most happy about is that he reinforced many of my own beliefs about where we're heading, beliefs for which I've received sod all support from those around me at Imperial. Because Joe's out there actually working on these problems day in, day out, he speaks from an experienced and authoritative position which again, makes me feel much better about my own convictions.
The fundamental problem is that most programmers believe in a model of a computer which is just not accurate any more. Languages like C and C++ reinforce this belief in a broken model of computing and the hardware manufacturers have hardly helped as they've really just been throwing more transistors and silicon at the problem like a bandage, trying to patch up the problems and maintain the illusion.
The illusion comes from the idea that multiple CPUs or Cores on a CPU can share data simply by reading and writing values to the same address. This requires a single unified address space (i.e. address x means the same location in memory from the perspective of all CPU cores (yes, ignore page tables and virtual address spaces for the moment please)). Originally, this may not have seemed such a crazy idea: memory used to be very fast with respect to CPU speeds, whereas now it's much much slower. Because it's been getting slower and slower, manufacturers have added caches all over the place. We have caches on hard disks, several on CPUs themselves, various caches that are shared between cores, and some which are private to each core. We have separate caches for instructions and for data. In some cases we even have caches for instructions after they've been decoded by the CPU. There are many caches. And therein lies part of the problem: if you only have one copy of a given piece of data, it might be possible to come to some kind of global consensus about its current value. But as soon as you start duplicating the data into several caches, working out who owns a particular value and how modified values should be combined in order to update the main memory becomes much much harder. So we have Cache Coherency Protocols which try to intercept memory accesses and make sure that multiple caches can't contain updated values for any given address. Sort of.
By not exposing the cache to the programmer, and by trying to maintain this shared mutable memory concept, programmers are forced to resort to locking and similar techniques in order to try to ensure that concurrent modifications to shared data simply don't happen. Modifications to the same set of addresses must not happen concurrently, but sequentially so that the data is mutated in a predictable, safe and intended way. Using locks is, as we all know, inherently dangerous: deadlocks can occur very easily and it's difficult to get the balance right between coarse grained locking which is easier to debug and reason about but may reduce performance by unnecessary locking, and fine grained locking which may allow better performance, but the number of locks makes it much harder to reason about and have and kind of confidence in the correctness of the locking and freedom of deadlocks. In fact, it's simply unwise to ever believe that any non-trivial multi-threaded application is deadlock free. It's really quite unlikely. Furthermore, deadlocks are only one issue arising from the general race conditions inherent in the shared-mutable-state world. Thread scheduling, timing and other issues can all combine with or without locks to result in unintended consequences which are often difficult to track down, repeat, debug and fix.
Perhaps the more significant problem is scalability. If you designed a computer system for a million CPUs and then scaled that design down to one CPU you have the knowledge that as your needs grow, your design will scale up and will work correctly. If you design for one CPU and then try to scale up, you're going to need a lot of bandages and a big sewing kit to sew up all the gaps. The reason why this is perhaps the more significant problem is the direction in which CPU manufacturers are heading. Now AMD and Intel are both thinking that somewhere between 2 and 8 cores would be fine for most desktop users. But for servers, it still seems to be a case of the more, the merrier. So actually considering how you would work with a million CPUs or cores isn't the far-fetched an idea. It's certainly unbelievably short-sighted to think that you can get away with only considering the case for a single core or two. If you design for a million CPUs, you also come to some significant conclusions early on in the process. For example, you realise that it's a very silly idea to pretend that all memory should be equally available to all CPUs at the same time. If you try to do that, then you'll end up with a memory system that is phenomenally slow for all CPUs and fast for none because the memory system will have to have enormous bandwidth to process all the requests from a million CPUs and will potentially suffer horrible performance problems when trying to regulate access to the shared mutable memory.
To expand on this a bit further, Joe pointed out what actually has to happen when you "take a lock". A lock obviously has to be safe by which I mean that if two threads on two different CPUs try to take the same lock at the same time then only one can succeed. Furthermore, a lock is usually just a particular value at a particular location in memory. So if two threads wish to take the same lock, they both have to try to read the value of the lock and update it atomically. CPU instruction sets normally provide a dedicated instruction to do this, but it affects the memory system and often causes lots of work for the cache coherency protocol to deal with. If you have a million CPUs all trying to access the same value, you have something of a problem as all million CPUs will try to issue the instruction to test and set the lock and that will cause the memory system to send the value of the lock to all CPUs and track which CPU got there first (agh, first. That means we've got to place these requests in order somehow. So they actually become sequential, not parallel, and that means that the millionth CPU may be waiting a long time before its request gets serviced) and thus has the right to write back to the lock whilst the others cannot. And when that value is written, the lock is taken, the new updated value will then have to be sent out to the remaining million minus one CPUs so that they can then see that they cannot take the lock. So all but one of the CPUs have stalled waiting for the memory system. If there's much contention for the same addresses, this can get pretty expensive.
If you instead build a memory system where the memory is distributed, local to each CPU, then you can get very very fast access to the local memory. At this point you might consider whether it's worth your while trying to maintain the illusion of a single address space. Whilst there's lots of pressure from software vendors to ensure the single address space remains, there are lots of reasons why I think it will eventually die. Firstly, memory access now becomes irregular: if your thread is running on a given CPU and the addresses it is accessing are local to that CPU then memory access will be very fast. But if it's accessing addresses which are not local to that CPU then memory performance is going to be very slow. When you have only a few hops to get to the desired memory banks, it might be manageable: aggressive caching and pre-fetching might just hold you together. But with a million CPUs, the average number of hops is going to go up a long way. And that's really going to hurt performance.
So what happens if you throw out the single address space? Well firstly, you can make memory accesses go even faster. You can throw out all the cache coherency silicon and you can throw out regulating memory access. And what do you have? Well, if you squint a bit, you have a million CPUs, each with their own, very big addressable cache. And then if you squint a bit more, you suddenly realise this is quite similar to what we already have with, e.g. the Cell CPU and Tilera.
So people then turn around and ask how you can communicate between threads? - it's really quite an essential feature. Well, you have networks, just like in large distributed systems that we have today. A network is sort of in the direction of shared mutable state, but it's vastly safer and more predictable. And so this is why Erlang, and its message-passing concurrency is so obviously the correct solution. Two CPUs can actually send a message to the same destination at the same time and have the messages arrive in some order that does not result in one message being lost. Not true with shared mutable state. Whilst you can use shared mutable state to implement message passing and whilst if you do, you will still need locks (in fact, the normal implementation of Erlang does this on typical consumer computers), firstly, the sections of code which need to be analysed and carefully thought out from a deadlocking-avoidance perspective can be made very small, and secondly, the use of shared mutable state is really only a stop-gap measure. Also, please note that I'm not pretending you can't deadlock in message-passing systems. You can. But it tends to be harder.
If you think about how you'd implement message passing in such a million CPU system, you'd send messages from CPU to CPU directly. You wouldn't go out to some shared mutable memory bank as that would be dog slow. If you look at the design of the Cell or Tilera64, it's easy to see how targeting the various cores directly as the destination of the message would be much much faster. Now at this point you might raise a hand and say, well, if this is all in hardware, you're going to have finite message buffers and other annoying limitations like that. True, you will. There are a couple of things I'd like to point out in return though. Firstly, ethernet has dealt with this issue for a long time. CSMA/CD (ensuring multiple hosts don't try and send on the same wire at the same time), and TCP/IP ensures that messages don't get lost, get buffered and are resent when necessary. And sure, there are still limitations in buffer size - hey, this is the Real Physical World after all, limitations of size will always exist. But the buffers just have to be big enough so that they don't cause issues most of the time. It's an engineering tradeoff. The other point is that you could, if you still have some general shared mutable memory, use that as a larger area to send messages to if the intended target is otherwise swamped with work to do.
And this really puts the final part of the picture in place. Currently, if your computer runs out of memory, it will start using the hard disc to store areas of memory that it thinks haven't been accessed recently - swapping or paging out to disk. This works very well. So if you consider something like the Tilera64, you could imagine that you do the same, but one level up: when your own core-local memory fills up, you evict parts of it out to the general memory. Of course, since the values you've sent out to general memory will only be recalled by you, you still don't need to provide for shared mutable memory: the memory is certainly shared between many CPU cores, but any given address is only ever accessed by a single core.
As ever, the problem here is legacy code. The fact that more interesting CPU designs are starting to appear should be an indication that in the not too distant future, things are going to change. What's likely is that eventually we get a hybrid design where you can put your CPU into different modes: fundamentally whether the CPU-local cache appears as a separate addressable memory space or whether it's the same old transparent caching layer we're used to. Networks on a chip are already here and won't be going. Hopefully, if this trend continues, we shall find that certain warts of the computing industry, such as C, disappear from general use, appearing only in the most specific and suitable corners. Maybe even C++ and Object Orientation in general all but disappear as the assumptions of computing on which they are built become ever slower and unwieldy to maintain.