Doing it better
2001-01-09

### \[snip\]

Project tau-ish stuff:

I think I've worked out how to run MIN on something resembling my array of 8K computers. The obvious part is that you have to choose one computer to be the master, preferably one with a direct connection to main memory, and let all the others be slaves. The slaves need some infrastructure so that they can each talk to the master, possibly via other slaves. The clever part is what they say.

Recall that in MIN, redexes consist of a cut: two nodes joined at their principal ports. In the deterministic subset, nodes have only one principal port. The complete graph, including all the device drivers, has no free ports, so if we start at any node and follow the path indicated by the principal ports, we will eventually either get to a cut, or end up going round and round a vicious circle. Vicious circles are inert, and their creation is a bug, so let's suppose there are none. So principal paths always lead to a cut.

We can partition the nodes into catchment areas, according to which cut is reached by their principal paths. Intuitively, such a catchment area is a thread. The watersheds are the wires which do not have a principal port at either end. If a reduct contains a watershed, or is disconnected, then any thread which executes it forks. Conversely, if a reaction occurs right next to a watershed, and some principal paths in the reduct point to it, then the threads join.

The reduction order I showed you a year ago was a lazy one, which performed a rewrite only if there was a device driver in its catchment area. Abandon that idea. Instead, keep a table of all the threads in the graph. Each iteration, choose one according to some scheduling policy, and rewrite it. This replaces the thread with some new threads. It is easy to see that these new threads, if any, must be at the periphery of the reduct. So we just look at each of the candidates, and add it to the thread table if necessary. A nice local operation.

To do this on multiple computers, we have to decide which bits of the graph to keep on each computer, and how and when to send bits of graph over the network. My idea is that the only bits of graph you can send are:

1) A cut.
2) A node, provided its principal port is connected to something on the receiving computer.
3) A wire, provided both its ends are joined to something on the receiving computer.

In addition, a computer can ask another computer for a cut, or for 'whatever is joined to this wire', and the other computer may reply 'sorry, that would break the rules'. Only allow messages between a slave and the master, not between one slave and another. And another thing: while all this is going on, the master will JIT any rewrite that a slave requests, and slaves maintain an instruction cache, so we can assume that slaves know how to perform any rewrite.

The sequence of events is as follows. First, each slave asks the master for a cut. For each of these requests, the master looks in the thread table, chooses a thread according to some scheduling policy, and sends that cut to the slave. On receiving it, the slave starts its own little thread table containing just that cut.

Each slave then starts processing. Each iteration, it chooses a thread from its table and performs a rewrite. It then scans the periphery of the reduct for new threads, to put back in its table. In some cases, the periphery will be connected to the master (this will certainly be true for the first iteration), in which case the slave asks the master for whatever it is joined to. If the master returns a bit of graph, everything is hunky-dory. If not, the slave sends as much graph as it can to the master, starting at that point.

Eventually, a slave will exhaust its tiny memory. When this happens, or after a timeslice of about 1ms, whichever is the sooner, it picks a cut from its thread table and sends it to the master. It then sends as much graph as it can to the master, starting at that cut. Upon receiving the cut, the master adds it to the main thread table. If ever a slave empties its thread table, it returns to the beginning and asks the master for a new cut.
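To pin the pictures down, here is a minimal sketch in C of one possible representation, and of the walk that finds a thread's cut. All of it (the Node/End layout, MAX_AUX, the function names) is my own guesswork for illustration, not MIN's actual encoding:

```c
#include <stdbool.h>

#define MAX_AUX 4   /* auxiliary ports per node; really this depends on the agent */

typedef struct Node Node;

/* One end of a wire: the node it is plugged into, and at which port.
 * Port 0 is the principal port; ports 1..MAX_AUX are auxiliary. */
typedef struct {
    Node *node;
    int   port;
} End;

struct Node {
    int  agent;          /* which kind of node; rewrite rules key on this */
    End  principal;      /* what the single principal port is joined to   */
    End  aux[MAX_AUX];   /* what the auxiliary ports are joined to        */
};

/* A cut, i.e. a redex: two nodes joined principal-to-principal. */
typedef struct {
    Node *left;
    Node *right;
} Cut;

/* Follow the principal path from 'start' to the cut at the bottom of its
 * catchment area.  The complete graph has no free ports, so principal.node
 * is never NULL, and we have assumed no vicious circles, so this stops. */
Cut find_cut(Node *start)
{
    Node *n = start;
    /* n's principal port lands on port 0 of its neighbour iff this is a
     * cut; otherwise it lands on an auxiliary port, and the path carries
     * on from that neighbour's own principal port. */
    while (n->principal.port != 0)
        n = n->principal.node;
    Cut c = { n, n->principal.node };
    return c;
}

/* The wire leaving auxiliary port x->aux[i] is a watershed iff its far
 * end is also an auxiliary port. */
bool aux_wire_is_watershed(const Node *x, int i)
{
    return x->aux[i].port != 0;
}
```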
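And here, in the same spirit, is a sketch of the slave's outer loop, with message tags named after the messages above. Again, everything (the names, TABLE_SIZE, the LIFO scheduling) is a placeholder rather than a design, not least because working out the real protocol is point 3 of the to-do list below:

```c
#include <stdbool.h>
#include <stddef.h>

/* Slave <-> master messages, one tag per kind of message described above. */
enum msg {
    MSG_REQUEST_CUT,        /* slave: 'please give me a cut'                   */
    MSG_REQUEST_NEIGHBOUR,  /* slave: 'whatever is joined to this wire'        */
    MSG_CUT,                /* a cut                                           */
    MSG_NODE,               /* a node whose principal port lands on receiver   */
    MSG_WIRE,               /* a wire with both ends already on the receiver   */
    MSG_REFUSE              /* master: 'sorry, that would break the rules'     */
};

typedef struct Node Node;                  /* as in the previous sketch */
typedef struct { Node *left, *right; } Cut;

#define TABLE_SIZE 64                      /* tiny thread table: a guess */

/* Stubs for the pieces this sketch leaves undefined.  The
 * MSG_REQUEST_NEIGHBOUR traffic, and the 'send graph back if refused'
 * rule, live inside rewrite_and_scan(), where the periphery is examined. */
Cut    request_cut_from_master(void);
void   send_cut_to_master(Cut c);
void   send_graph_to_master(Cut from);     /* 'as much graph as it can' */
bool   memory_exhausted(void);
bool   timeslice_expired(void);            /* about 1ms */
size_t rewrite_and_scan(Cut c, Cut *out, size_t room);

void slave_main(void)
{
    Cut table[TABLE_SIZE];                 /* the slave's little thread table */
    size_t n = 0;

    for (;;) {
        if (n == 0)                        /* empty table: start over */
            table[n++] = request_cut_from_master();

        /* Choose a thread (here simply the most recent one), rewrite it,
         * and put any new threads found at the periphery back in the table. */
        Cut c = table[--n];
        n += rewrite_and_scan(c, &table[n], TABLE_SIZE - n);

        if (n > 0 && (memory_exhausted() || timeslice_expired())) {
            /* Spill one thread to the master, then as much of the graph
             * around it as possible (that part can run in the background). */
            Cut victim = table[--n];
            send_cut_to_master(victim);
            send_graph_to_master(victim);
        }
    }
}
```

Note that this is roughly the same loop the master runs over the main thread table, which is what point 4 below exploits.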
This idea is nice because:

1) If a small bit of graph needs a large number of rewrites, then it will find its way onto a slave and stay there until all the rewrites are done, during which time it won't bother the master at all. Except every 1ms.

2) If a slightly larger bit of graph needs a large number of rewrites, then it will find its way onto several slaves, and whenever a bit of graph is sent back to the master it will almost immediately be sent on to another slave. A clever master would have a bit of cache, and might not access main memory at all. So in a way, the slaves cooperate to achieve the effect of a single bigger computer when the working set requires it.

3) Whenever I say 'sends as much graph to the master as it can, starting at X', it is possible to do this rather lengthy and non-atomic operation in the background, for example while waiting for the master to reply to another message. The only time it must be done synchronously is when the slave's memory is full, so it can't perform any more rewrites until it has sent something to the master; even then it will probably become free after sending three or four nodes.

4) While I have described a two-level arrangement (master and slaves), the slaves are implementing roughly the same algorithm as the whole network. One could therefore replace each slave with a sub-master and several sub-slaves. Whether that is worth doing depends on the degree to which the master is a bottleneck, which is something to measure empirically.

5) It begins to look like a conventional imperative language on a cluster of processors (thread table, scheduling policy, instruction cache, data cache, cache replacement policy, store buffer, etc.), which we sort of understand. But with all the horrible problems (cache consistency, large arrays, load balancing, mutual exclusion, etc.) ironed out, and a nice mathematical foundation to boot.

6) It avoids the problem of recognising when a network route returns to its starting point, because all the routes used are confined to a tree.

What remains to be done is:

1) Get my arbitration primitive into the design. Shouldn't be hard, but I haven't thought about it.

2) Design a JIT that will fit on the master while it simultaneously buffers the thread table and caches bits of graph. Or put these functional units onto separate processors.

3) Work out the details of the protocol between a slave and the master, to get some idea of the bandwidth, latency, and synchronisation that it requires. This is related to the representation of the graph used by the slaves (in particular, the way that a connection to the master is represented), which also has to be worked out.

What do you think of this?

Alistair