Doing it better
2001-01-09

### \[snip\]

Project tau-ish stuff:

I think I've worked out how to run MIN on something resembling my array of 8K computers. The obvious part is that you have to choose one computer to be the master, preferably one with a direct connection to main memory, and let all the others be slaves. The slaves need some infrastructure so that they can each talk to the master, possibly via other slaves. The clever part is what they say.

Recall that in MIN, redexes consist of a cut: two nodes joined at their principal ports. In the deterministic subset, nodes have only one principal port. The complete graph, including all the device drivers, has no free ports, so if we start at any node and follow the path indicated by the principal ports, we will eventually either get to a cut, or end up going round and round a vicious circle. Vicious circles are inert, and their creation is a bug, so let's suppose there are none. So principal paths always lead to a cut.

We can partition the nodes into catchment areas, according to which cut is reached by their principal paths. Intuitively, such a catchment area is a thread. The watersheds are the wires which do not have a principal port at either end. If a reduct contains a watershed, or is disconnected, then any thread which executes it forks. Conversely, if a reaction occurs right next to a watershed, and some principal paths in the reduct point to it, then the threads join.

The reduction order I showed you a year ago was a lazy one, which performed a rewrite only if there was a device driver in its catchment area. Abandon that idea. Instead, keep a table of all the threads in the graph. Each iteration, choose one according to some scheduling policy, and rewrite it. This replaces the thread with some new threads. It is easy to see that these new threads, if any, must be at the periphery of the reduct. So we just look at each of the candidates, and add it to the thread table if necessary. A nice local operation.

To do this on multiple computers, we have to decide which bits of the graph to keep on each computer, and how and when to send bits of graph over the network. My idea is that the only bits of graph you can send are:

1) A cut.
2) A node, provided its principal port is connected to something on the receiving computer.
3) A wire, provided both its ends are joined to something on the receiving computer.

In addition, a computer can ask another computer for a cut, or for 'whatever is joined to this wire', and the other computer may reply 'sorry, that would break the rules'. Only allow messages between a slave and the master, not between one slave and another. And another thing: while all this is going on, the master will JIT any rewrite that a slave requests, and slaves maintain an instruction cache, so we can assume that slaves know how to perform any rewrite.

The sequence of events is as follows. First, each slave asks the master for a cut. For each of these requests, the master looks in the thread table, chooses a thread according to some scheduling policy, and sends that cut to the slave. On receiving it, the slave starts its own little thread table containing just that cut.

Each slave then starts processing. Each iteration, it chooses a thread from its table and performs a rewrite. It then scans the periphery of the reduct for new threads, to put back in its table. In some cases, the periphery will be connected to the master (this will certainly be true for the first iteration), in which case the slave asks the master for whatever it is joined to. If the master returns a bit of graph, everything is hunky-dory. If not, the slave sends as much graph as it can to the master, starting at that point.

Eventually, a slave will exhaust its tiny memory. When this happens, or after a timeslice of about 1ms, whichever is the sooner, it picks a cut from its thread table and sends it to the master. It then sends as much graph as it can to the master, starting at that cut. Upon receiving the cut, the master adds it to the main thread table. If ever a slave empties its thread table, it returns to the beginning and asks the master for a new cut.
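To pin the pictures down, here is a minimal sketch in C of one possible representation, and of the walk that finds a thread's cut. All of it (the Node/End layout, MAX_AUX, the function names) is my own guesswork for illustration, not MIN's actual encoding:

```c
#include <stdbool.h>

#define MAX_AUX 4   /* auxiliary ports per node; really this depends on the agent */

typedef struct Node Node;

/* One end of a wire: the node it is plugged into, and at which port.
 * Port 0 is the principal port; ports 1..MAX_AUX are auxiliary. */
typedef struct {
    Node *node;
    int   port;
} End;

struct Node {
    int  agent;          /* which kind of node; rewrite rules key on this */
    End  principal;      /* what the single principal port is joined to   */
    End  aux[MAX_AUX];   /* what the auxiliary ports are joined to        */
};

/* A cut, i.e. a redex: two nodes joined principal-to-principal. */
typedef struct {
    Node *left;
    Node *right;
} Cut;

/* Follow the principal path from 'start' to the cut at the bottom of its
 * catchment area.  The complete graph has no free ports, so principal.node
 * is never NULL, and we have assumed no vicious circles, so this stops. */
Cut find_cut(Node *start)
{
    Node *n = start;
    /* n's principal port lands on port 0 of its neighbour iff this is a
     * cut; otherwise it lands on an auxiliary port, and the path carries
     * on from that neighbour's own principal port. */
    while (n->principal.port != 0)
        n = n->principal.node;
    Cut c = { n, n->principal.node };
    return c;
}

/* The wire leaving auxiliary port x->aux[i] is a watershed iff its far
 * end is also an auxiliary port. */
bool aux_wire_is_watershed(const Node *x, int i)
{
    return x->aux[i].port != 0;
}
```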
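And here, in the same spirit, is a sketch of the slave's outer loop, with message tags named after the messages above. Again, everything (the names, TABLE_SIZE, the LIFO scheduling) is a placeholder rather than a design, not least because working out the real protocol is point 3 of the to-do list below:

```c
#include <stdbool.h>
#include <stddef.h>

/* Slave <-> master messages, one tag per kind of message described above. */
enum msg {
    MSG_REQUEST_CUT,        /* slave: 'please give me a cut'                   */
    MSG_REQUEST_NEIGHBOUR,  /* slave: 'whatever is joined to this wire'        */
    MSG_CUT,                /* a cut                                           */
    MSG_NODE,               /* a node whose principal port lands on receiver   */
    MSG_WIRE,               /* a wire with both ends already on the receiver   */
    MSG_REFUSE              /* master: 'sorry, that would break the rules'     */
};

typedef struct Node Node;                  /* as in the previous sketch */
typedef struct { Node *left, *right; } Cut;

#define TABLE_SIZE 64                      /* tiny thread table: a guess */

/* Stubs for the pieces this sketch leaves undefined.  The
 * MSG_REQUEST_NEIGHBOUR traffic, and the 'send graph back if refused'
 * rule, live inside rewrite_and_scan(), where the periphery is examined. */
Cut    request_cut_from_master(void);
void   send_cut_to_master(Cut c);
void   send_graph_to_master(Cut from);     /* 'as much graph as it can' */
bool   memory_exhausted(void);
bool   timeslice_expired(void);            /* about 1ms */
size_t rewrite_and_scan(Cut c, Cut *out, size_t room);

void slave_main(void)
{
    Cut table[TABLE_SIZE];                 /* the slave's little thread table */
    size_t n = 0;

    for (;;) {
        if (n == 0)                        /* empty table: start over */
            table[n++] = request_cut_from_master();

        /* Choose a thread (here simply the most recent one), rewrite it,
         * and put any new threads found at the periphery back in the table. */
        Cut c = table[--n];
        n += rewrite_and_scan(c, &table[n], TABLE_SIZE - n);

        if (n > 0 && (memory_exhausted() || timeslice_expired())) {
            /* Spill one thread to the master, then as much of the graph
             * around it as possible (that part can run in the background). */
            Cut victim = table[--n];
            send_cut_to_master(victim);
            send_graph_to_master(victim);
        }
    }
}
```

Note that this is roughly the same loop the master runs over the main thread table, which is what point 4 below exploits.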
This idea is nice because:

1) If a small bit of graph needs a large number of rewrites, then it will find its way onto a slave and stay there until all the rewrites are done, during which time it won't bother the master at all. Except every 1ms.

2) If a slightly larger bit of graph needs a large number of rewrites, then it will find its way onto several slaves, and whenever a bit of graph is sent back to the master it will almost immediately be sent on to another slave. A clever master would have a bit of cache, and might not access main memory at all. So in a way, the slaves cooperate to achieve the effect of a single bigger computer when the working set requires it.

3) Whenever I say 'sends as much graph to the master as it can, starting at X', it is possible to do this rather lengthy and non-atomic operation in the background, for example while waiting for the master to reply to another message. The only time it must be done synchronously is when the slave's memory is full, so it can't perform any more rewrites until it has sent something to the master; even then it will probably become free after sending three or four nodes.

4) While I have described a two-level arrangement (master and slaves), the slaves are implementing roughly the same algorithm as the whole network. One could therefore replace each slave with a sub-master and several sub-slaves. Whether that is worth doing depends on the degree to which the master is a bottleneck, which is something to measure empirically.

5) It begins to look like a conventional imperative language on a cluster of processors (thread table, scheduling policy, instruction cache, data cache, cache replacement policy, store buffer, etc.), which we sort of understand. But with all the horrible problems (cache consistency, large arrays, load balancing, mutual exclusion, etc.) ironed out, and a nice mathematical foundation to boot.

6) It avoids the problem of recognising when a network route returns to its starting point, because all the routes used are confined to a tree.

What remains to be done is:

1) Get my arbitration primitive into the design. Shouldn't be hard, but I haven't thought about it.

2) Design a JIT that will fit on the master while it simultaneously buffers the thread table and caches bits of graph. Or put these functional units onto separate processors.

3) Work out the details of the protocol between a slave and the master, to get some idea of the bandwidth, latency, and synchronisation that it requires. This is related to the representation of the graph used by the slaves (in particular, the way that a connection to the master is represented), which also has to be worked out.

What do you think of this?

Alistair