Doing it better
1999-06-10

At the end of my talk, I sketched a graph reduction engine. That's what this e-mail is about.

I've been trying to build persistence into CoVirt. The aim is that if you unexpectedly switch off the computer, or if it crashes, then on re-booting the computer is as similar as possible to its state just before the disaster, and certainly hasn't become inconsistent or corrupt. This is remarkably difficult, as the computer may be running while you save it to disk. For example, imagine saving an arcade game to disk while playing it... but don't imagine too hard, because it is a painful thought. As far as I can see, the only solution is to take an instantaneous snapshot of the whole computer and serialise that. This is entirely out of step with the rest of Tau, which is decentralised wherever possible. It also falls flat for distributed programs, and it is not very real-time.

The problem is solved completely by using a graph rewriting system. We regard the collective hard disks of everybody in the universe as a single large graph. We regard our own hard disk as a local cache of part of that graph (we lock that part to ensure that only we modify it). If we crash, the network detects that we are not responding and unlocks our graph, leaving it in the state it was in before we ever touched it. If we don't crash, then we occasionally write bits of the graph back out to the network, which get patched in on top in a log-structured kind of way that only forgets old data when it is completely superseded.

Just as our hard disk is a cache of the network, our memory is a cache of our hard disk. We can even have a less-compact and smaller area of memory to cache our memory, and the algorithm I presented on the whiteboard on May 12th can be seen as caching part of that less-compact area in the processor registers.

This sounds like horrible organisational stuff, and indeed it is, but I think I've got a good plan. Every level of the hierarchy is broken into pages, whose size is chosen to be roughly proportional to the capacity of that level. Registers hold a single node, as do the less-compact memory cells. Main memory cells (which coincide with cache lines) hold about eight nodes. Hard disk pages hold about 128 nodes. Network cells hold more.

You have to have a routine which shifts a cell across a level boundary. Loading into a smaller level is easy - you just break a cell into many smaller cells. Saving back again involves choosing which cells to combine. This is best achieved by walking the graph, as knowing that the cell holds a connected subgraph allows a much more compact representation in memory.

Neighbouring cells which are left behind in the big level when a cell is loaded are not modified, so that if we die then the connections to the (old version of) the cell we loaded are preserved. However, the old version is locked to prevent modification by anyone else. Saving a cell back again still doesn't modify the neighbours, in case the save operation gets lost, but you can tell that the neighbours are wrong and the new cell is right because of the time-stamps (you have time-stamps, by the way). Conversely, whenever you load a cell, you have to check the time-stamps of the neighbouring cells and modify the interface if a neighbour is newer. You have to keep a boundary layer one cell thick just inside the region (many cells) which has been loaded, and not modify any cells in that layer, so that when you save back again the neighbours won't get confused.
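To make the last two paragraphs concrete, here is a minimal sketch in C of the load path. It is an illustration, not CoVirt code: the structures and names (cell, load_cell, patch_interface, copy_into_smaller_level) are made up, and only the invariants come from the description above: the old copy stays behind, locked but unmodified, and on load a neighbour with a newer time-stamp wins at the interface.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_NEIGHBOURS 8

    struct cell {
        uint64_t     stamp;                      /* time-stamp of the last save  */
        int          locked;                     /* locked by a loader?          */
        struct cell *neighbour[MAX_NEIGHBOURS];  /* links across cell boundaries */
        size_t       n_neighbours;
        /* ... the graph nodes packed into this cell would live here ... */
    };

    /* Redirect the boundary edges of 'c' to the newer version of a neighbour.
     * Placeholder: the real routine would rewrite the interface pointers. */
    static void patch_interface(struct cell *c, struct cell *newer_neighbour)
    {
        (void)c;
        (void)newer_neighbour;
    }

    /* Stand-in for breaking a big cell into many smaller ones; here it just
     * copies the cell so that the sketch is self-contained. */
    static struct cell *copy_into_smaller_level(struct cell *src)
    {
        static struct cell copy;
        copy = *src;
        return &copy;
    }

    /* Load a cell from a bigger level (say, disk) into a smaller one (memory).
     * The old copy stays behind, locked but unmodified, so a crash leaves the
     * bigger level exactly as it was before we touched it. */
    struct cell *load_cell(struct cell *on_disk)
    {
        if (on_disk->locked)
            return NULL;                /* locked: fall back, work elsewhere */
        on_disk->locked = 1;

        struct cell *in_memory = copy_into_smaller_level(on_disk);

        /* Check each neighbour's time-stamp: a neighbour saved more recently
         * than our copy has the right view of the interface, so patch our
         * edges to match it. */
        for (size_t i = 0; i < on_disk->n_neighbours; i++) {
            struct cell *nb = on_disk->neighbour[i];
            if (nb != NULL && nb->stamp > on_disk->stamp)
                patch_interface(in_memory, nb);
        }
        return in_memory;
    }

Note that load_cell never writes to the copy on the bigger level, so if we die half-way through, that level is exactly as it was before the load.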
You also have to save back your whole cache in a single operation, for the same reason. You want to save back often. However, it is acceptable to save back an old version of your whole cache - you don't have to flush the smaller caches in order to flush the larger ones. That is the key to keeping things running while you autosave, and the unique selling point of this idea.

The other selling point is the fall-back option, which solves all deadlock problems. If you want to load something and you can't because it is locked, then you simply dump your whole cache and start processing somewhere else in the graph. Because of the mathematical underpinnings, you can be confident that if the other processor starts munching through your incomplete computation, it will merely do precisely what you would have done anyway, had you had the necessary information.

You also get very good interrupt latency - you don't need to save state, because the system has been designed to cope with unexpected failure. You simply unlock pages at an appropriate level (probably main memory), which reverts them to an earlier state.

In short, there are some very desirable properties which come with graph reduction which are damn nearly impossible to achieve any other way. All we need now are the 'right' graph reduction rules...

Alistair
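P.S. A minimal sketch in C of the fall-back rule above. All of the names (cell, cache, try_lock, discard_cache, pick_other_work) are made up for the example; the point is only the shape of the control flow: never block waiting for a lock, just throw the unsaved cache away and go and reduce somewhere else.

    #include <stdbool.h>
    #include <stddef.h>

    struct cell  { bool   locked;      };   /* one page at some level          */
    struct cache { size_t dirty_cells; };   /* everything loaded and unsaved   */

    /* Take the (network) lock on a cell if nobody else holds it.  Stub. */
    static bool try_lock(struct cell *c)
    {
        if (c->locked)
            return false;
        c->locked = true;
        return true;
    }

    /* Forget every unsaved modification. */
    static void discard_cache(struct cache *local)
    {
        local->dirty_cells = 0;
    }

    /* Choose a different subgraph to reduce.  Stub. */
    static struct cell *pick_other_work(void)
    {
        static struct cell somewhere_else;
        return &somewhere_else;
    }

    /* The deadlock-free acquire: either we get the cell, or we walk away and
     * let whoever holds the lock finish the reduction we would have done. */
    struct cell *acquire_or_move_on(struct cache *local, struct cell *wanted)
    {
        if (try_lock(wanted))
            return wanted;

        discard_cache(local);            /* revert to the last saved state   */
        struct cell *other = pick_other_work();
        try_lock(other);                 /* assumed to be free in this sketch */
        return other;
    }

Dropping the cache is always safe, because the last whole-cache save is still a consistent snapshot on the level above, and whoever holds the lock will do exactly the reduction we would have done.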