Doing it better
1999-06-10

At the end of my talk, I sketched a graph reduction engine. That's what this e-mail is about.

I've been trying to build persistence into CoVirt. The aim is that if you unexpectedly switch off the computer, or if it crashes, then on re-booting the computer is as similar as possible to its state just before the disaster, and certainly hasn't become inconsistent or corrupt. This is remarkably difficult, as the computer may be running while you save it to disk. For example, imagine saving an arcade game to disk while playing it... but don't imagine too hard, because it is a painful thought. As far as I can see, the only solution is to take an instantaneous snapshot of the whole computer and serialise that. This is entirely out of step with the rest of Tau, which is decentralised wherever possible. It also falls flat for distributed programs, and it is not very real-time.

The problem is solved completely by using a graph rewriting system. We regard the collective hard disks of everybody in the universe as a single large graph. We regard our own hard disk as a local cache of part of that graph (we lock that part to ensure that only we modify it). If we crash, the network detects that we are not responding and unlocks our graph, leaving it in the state it was in before we ever touched it. If we don't crash, then we occasionally write bits of the graph back out to the network, which get patched in on top in a log-structured kind of way that only forgets old data when it is completely superseded.

Just as our hard disk is a cache of the network, our memory is a cache of our hard disk. We can even have a less-compact and smaller area of memory to cache our memory, and the algorithm I presented on the whiteboard on May 12th can be seen as caching part of that less-compact area in the processor registers.

This sounds like horrible organisational stuff, and indeed it is, but I think I've got a good plan. Every level of the hierarchy is broken into pages, whose size is chosen to be roughly proportional to the capacity of that level. Registers hold a single node, as do the less-compact memory cells. Main memory cells (which coincide with cache lines) hold about eight nodes. Hard disk pages hold about 128 nodes. Network cells hold more.

You have to have a routine which shifts a cell across a level boundary. Loading into a smaller level is easy - you just break a cell into many smaller cells. Saving back again involves choosing which cells to combine. This is best achieved by walking the graph, as knowing that the cell holds a connected subgraph allows a much more compact representation in memory.

Neighbouring cells which are left behind in the big level when a cell is loaded are not modified, so that if we die then the connections to the (old version of) the cell we loaded are preserved. However, the old version is locked to prevent modification by anyone else. Saving a cell back again still doesn't modify the neighbours, in case the save operation gets lost, but you can tell that the neighbours are wrong and the new cell is right because of the time-stamps (you have time-stamps, by the way). Conversely, whenever you load a cell, you have to check the time-stamps of the neighbouring cells and modify the interface if a neighbour is newer. You have to keep a boundary layer one cell thick just inside the region (many cells) which has been loaded, and not modify any cells in that layer, so that when you save back again the neighbours won't get confused.
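To make the last two paragraphs concrete, here is a minimal sketch in C of the load path. It is an illustration, not CoVirt code: the structures and names (cell, load_cell, patch_interface, copy_into_smaller_level) are made up, and only the invariants come from the description above: the old copy stays behind, locked but unmodified, and on load a neighbour with a newer time-stamp wins at the interface.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_NEIGHBOURS 8

    struct cell {
        uint64_t     stamp;                      /* time-stamp of the last save  */
        int          locked;                     /* locked by a loader?          */
        struct cell *neighbour[MAX_NEIGHBOURS];  /* links across cell boundaries */
        size_t       n_neighbours;
        /* ... the graph nodes packed into this cell would live here ... */
    };

    /* Redirect the boundary edges of 'c' to the newer version of a neighbour.
     * Placeholder: the real routine would rewrite the interface pointers. */
    static void patch_interface(struct cell *c, struct cell *newer_neighbour)
    {
        (void)c;
        (void)newer_neighbour;
    }

    /* Stand-in for breaking a big cell into many smaller ones; here it just
     * copies the cell so that the sketch is self-contained. */
    static struct cell *copy_into_smaller_level(struct cell *src)
    {
        static struct cell copy;
        copy = *src;
        return &copy;
    }

    /* Load a cell from a bigger level (say, disk) into a smaller one (memory).
     * The old copy stays behind, locked but unmodified, so a crash leaves the
     * bigger level exactly as it was before we touched it. */
    struct cell *load_cell(struct cell *on_disk)
    {
        if (on_disk->locked)
            return NULL;                /* locked: fall back, work elsewhere */
        on_disk->locked = 1;

        struct cell *in_memory = copy_into_smaller_level(on_disk);

        /* Check each neighbour's time-stamp: a neighbour saved more recently
         * than our copy has the right view of the interface, so patch our
         * edges to match it. */
        for (size_t i = 0; i < on_disk->n_neighbours; i++) {
            struct cell *nb = on_disk->neighbour[i];
            if (nb != NULL && nb->stamp > on_disk->stamp)
                patch_interface(in_memory, nb);
        }
        return in_memory;
    }

Note that load_cell never writes to the copy on the bigger level, so if we die half-way through, that level is exactly as it was before the load.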
You also have to save back your whole cache in a single operation, for the same reason. You want to save back often. However, it is acceptable to save back an old version of your whole cache - you don't have to flush the smaller caches in order to flush the larger ones. That is the key to keeping things running while you autosave, and the unique selling point of this idea.

The other selling point is the fall-back option, which solves all deadlock problems. If you want to load something and you can't because it is locked, then you simply dump your whole cache and start processing somewhere else in the graph. Because of the mathematical underpinnings, you can be confident that if the other processor starts munching through your incomplete computation, it will merely do precisely what you would have done anyway, had you had the necessary information.

You also get very good interrupt latency - you don't need to save state, because the system has been designed to cope with unexpected failure. You simply unlock pages at an appropriate level (probably main memory), which reverts them to an earlier state.

In short, there are some very desirable properties which come with graph reduction which are damn nearly impossible to achieve any other way. All we need now are the 'right' graph reduction rules...

Alistair
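P.S. A minimal sketch in C of the fall-back rule above. All of the names (cell, cache, try_lock, discard_cache, pick_other_work) are made up for the example; the point is only the shape of the control flow: never block waiting for a lock, just throw the unsaved cache away and go and reduce somewhere else.

    #include <stdbool.h>
    #include <stddef.h>

    struct cell  { bool   locked;      };   /* one page at some level          */
    struct cache { size_t dirty_cells; };   /* everything loaded and unsaved   */

    /* Take the (network) lock on a cell if nobody else holds it.  Stub. */
    static bool try_lock(struct cell *c)
    {
        if (c->locked)
            return false;
        c->locked = true;
        return true;
    }

    /* Forget every unsaved modification. */
    static void discard_cache(struct cache *local)
    {
        local->dirty_cells = 0;
    }

    /* Choose a different subgraph to reduce.  Stub. */
    static struct cell *pick_other_work(void)
    {
        static struct cell somewhere_else;
        return &somewhere_else;
    }

    /* The deadlock-free acquire: either we get the cell, or we walk away and
     * let whoever holds the lock finish the reduction we would have done. */
    struct cell *acquire_or_move_on(struct cache *local, struct cell *wanted)
    {
        if (try_lock(wanted))
            return wanted;

        discard_cache(local);            /* revert to the last saved state   */
        struct cell *other = pick_other_work();
        try_lock(other);                 /* assumed to be free in this sketch */
        return other;
    }

Dropping the cache is always safe, because the last whole-cache save is still a consistent snapshot on the level above, and whoever holds the lock will do exactly the reduction we would have done.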