Monday, November 1, 2010

A virtual machine for ParaSail with picothread scheduling

As we have worked toward having an initial compiler/interpreter for ParaSail, we have decided to define a ParaSail Virtual Machine (PSVM) which will support the kind of very-light-weight threading structure (picothreading) needed to be able to evaluate small constructs, like parameters in parallel.  We have decided to turn each ParaSail operation into a single PSVM Routine, even if its execution will involve multiple threads executing bits and pieces of the code for the operation. Each PSVM routine is identified by a unique index, the routine index, and is represented by a routine header and a sequence of PSVM instructions.

The ParaSail Virtual Machine instructions use relatively conventional addressing modes to refer to operands in memory. Memory is presumed to be organized into areas, each area being a contiguous sequence of 64-bit words. While executing, a PSVM routine has access to the following areas:
An area for input and output parameters. Output parameters, if any, come first, followed by input parameters. Parameters may be a single 64-bit value, or may be a 64-bit pointer to a larger object.
An area for storing local variables, and parameter lists being built up to pass as a Parameter_Area to a called routine. The first word of a local area is a link to the local area of the enclosing block or operation, in the presence of nesting. This sort of link is generally called a static link. The second word of a local area is a link to the associated parameter area.
Each type, which is an instance of a module, is represented by an area containing the actual module parameters, and a table of operations. Each operation in the operation table is represented by a pair: routine index and type area. In many cases the type area for an operation is the same as the enclosing Type_Area, but for inherited operations, the type area for the operation would refer to the type area for the super-class from which the operation was inherited. The actual module parameters are most frequently other types (represented by a pointer to their type area), but can be values, object references, or operations.
Instructions identify a memory location of interest by a pair: memory area and offset. This pair is called an object locator. In addition to the Parameter_Area, Local_Area, and Type_Area, an object locator can specify a memory area pointed-to by a local variable, and the locator offset is the offset within the pointed-to area. The local pointers are called local base registers, and may reside in any of the first 64 words of the local area. Finally, in the presence of nesting, the chain of static links may be followed to find an enclosing local area or enclosing parameter area.

Here is a sampling of ParaSail Virtual Machine instructions:

Move_Word, Store_Int_Literal, Store_String_Literal, Store_Real_Literal, Jump, Conditional_Jump, Call, Return, Parallel_Call, Parallel_Block, Parallel_Wait.

Operands are generally identified with object locators, except for literal operands which are identified either by their actual value, or by an index into a table of literals.

Note that there are no instructions for doing normal arithmetic or logical operations. These are all implemented by calling routines. There are a set of built-in routines for doing the typical set of arithmetic and logical operations, on operands of various sizes.

The more interesting instructions are the Parallel_Call, Parallel_Block, and Parallel_Wait. Parallel_Call is essentially equivalent to Call, where the calling routine computes the input parameters and places them in a parameter area, and then calls the routine. The difference with a Parallel_Call is that the caller also identifies a picothread master and allocates a small area for a picothread control block (pTCB), and the instruction completes without waiting for the call to complete. Instead, the Parallel_Call instruction adds the pTCB to a queue waiting to be serviced by one of the virtual processors (vCPUs) executing the ParaSail program. When the caller thread has finished its own work and wants to wait for the Parallel_Call to complete, it uses the Parallel_Wait instruction, identifying the same picothread master as was specified in the Parallel_Call instruction. This suspends the calling thread until all of the parallel picothreads associated with the picothread master complete.

The Parallel_Block instruction is very similar to the Parallel_Call instruction, except that it identifies instructions that are part of the current routine, rather than calling a separate routine. The execution of these instructions is performed by a separate picothread, which has its own pTCB, and local area. The static link of the local area for the Parallel_Block picothread refers to the local area of the thread invoking the Parallel_Block instruction, allowing the nested picothread to use up-level references to reach the local variables of the enclosing picothread.

The Return instruction is used to complete processing of both a Parallel_Call and a Parallel_Block, and the Parallel_Wait is used to wait for either kind of parallel activity.

We recently completed a prototype implementation of the ParaSail Virtual Machine, including the picothread scheduling. We learned some interesting lessons along the way. Initially, a vCPU that was executing a picothread that performed a Parallel_Wait was itself suspended. That quickly exhausted the number of vCPUs, and led us to start dynamically creating new vCPUs. That caused the overall stack space to grow dramatically, since each vCPU needed its own heavy-weight threading context in the underlying operating system, along with a stack.

At this point, we concluded that a vCPU that executed a Parallel_Wait instruction should service the queue of waiting pTCBs if the picothread master it was waiting for was not yet complete. That significantly reduced the number of vCPUs needed. However, it still didn't seem to be as efficient as it could be. As originally implemented, the queue of waiting pTCBs was first-in, first-out (FIFO). However, after looking at various traces of execution, we realized that it was the last pTCB that was created which was always the first pTCB to be awaited. Hence, we concluded that the pTCB queue should be last-in, first-out (LIFO). That is, a vCPU should preferentially service the most recently queued pTCB when it had cycles to spare, since that would more likely be associated with a picothread master that is being awaited, and by servicing that pTCB first, it will reduce the number of pTCBs that were suspended in a Parallel_Wait instruction. After making this final change, even a heavily recursive algorithm throwing off lots of parallel picothreads was handled efficiently.

1 comment:

  1. Using LIFO rather than FIFO turns out to be a fairly well known trick in the scheduling of highly-parallel MIMD computations. The CILK language which came out of MIT and is now supported by Intel uses LIFO scheduling within a single vCPU. However, they have an added twist that when one vCPU is "stealing" work from another, it actually chooses the "oldest" thread rather than the youngest. This is explained in a nice paper available at: