Monday, November 1, 2010

Type conversion in ParaSail

Allowing for user-defined type conversion is complex, as it is an NxN problem, where you want to allow conversion of N different types to each other, and each pair might involve a different set of operations. ParaSail addresses this problem in the following ways:
  • Allows use of the [[ ... ]] operation, which converts the operand to a universal type.  
    • The user defines which universal type(s) a given type converts to by defining the "to_univ" operator(s).  Once the value has been converted to a universal type, it will be implicitly converted to any other type which defines a "from_univ" operator from that same universal type.  
      • Using a prefix can disambiguate if necessary:  T2::[[T1_Obj]] will convert T1_Obj first to a universal type using T1's "from_univ" operator, and then to T2 using T2's "to_univ" operator.
  •  Allows use of the target type's name as a function name.
    • This will convert between two types that are structurally equivalent (that is, the same module parameters were used in their defining instantiation), but which were distinguished by one or both being defined as new.  For example:
      • type T1 is new Vector<Integer<>> and type T2 is new Vector<Integer<>> define two distinct types because of the use of new, but T2(T1_Obj) and T1(T2_Obj) will convert T1_Obj to type T2, and T2_Obj to type T1, respectively.
    • The target type's name can also convert an object from some source type if the user defines a "convert" operator for either of the two types, and this operator's input and output parameter types match the source and target of the conversion.  Note that this matching might be thanks to the input or output formal being itself parameterized.  For example, the interface Int_Vector below provides a "convert" operator that converts to any other instance of Int_Vector:
            interface Int_Vector<Element is Integer<>> is
               operator "convert"(X : Int_Vector) 
                 -> Result_Type is
                      Int_Vector<Target_Element is Integer<>>;
            end interface Int_Vector;
These capabilities seem to provide enough flexibility for the user to define the desired explicit conversion functions. Note that the only implicit conversion in ParaSail is from a universal type to a type with an appropriate "from_univ" operator.

A virtual machine for ParaSail with picothread scheduling

As we have worked toward having an initial compiler/interpreter for ParaSail, we have decided to define a ParaSail Virtual Machine (PSVM) which will support the kind of very-light-weight threading structure (picothreading) needed to be able to evaluate small constructs, like parameters in parallel.  We have decided to turn each ParaSail operation into a single PSVM Routine, even if its execution will involve multiple threads executing bits and pieces of the code for the operation. Each PSVM routine is identified by a unique index, the routine index, and is represented by a routine header and a sequence of PSVM instructions.

The ParaSail Virtual Machine instructions use relatively conventional addressing modes to refer to operands in memory. Memory is presumed to be organized into areas, each area being a contiguous sequence of 64-bit words. While executing, a PSVM routine has access to the following areas:
An area for input and output parameters. Output parameters, if any, come first, followed by input parameters. Parameters may be a single 64-bit value, or may be a 64-bit pointer to a larger object.
An area for storing local variables, and parameter lists being built up to pass as a Parameter_Area to a called routine. The first word of a local area is a link to the local area of the enclosing block or operation, in the presence of nesting. This sort of link is generally called a static link. The second word of a local area is a link to the associated parameter area.
Each type, which is an instance of a module, is represented by an area containing the actual module parameters, and a table of operations. Each operation in the operation table is represented by a pair: routine index and type area. In many cases the type area for an operation is the same as the enclosing Type_Area, but for inherited operations, the type area for the operation would refer to the type area for the super-class from which the operation was inherited. The actual module parameters are most frequently other types (represented by a pointer to their type area), but can be values, object references, or operations.
Instructions identify a memory location of interest by a pair: memory area and offset. This pair is called an object locator. In addition to the Parameter_Area, Local_Area, and Type_Area, an object locator can specify a memory area pointed-to by a local variable, and the locator offset is the offset within the pointed-to area. The local pointers are called local base registers, and may reside in any of the first 64 words of the local area. Finally, in the presence of nesting, the chain of static links may be followed to find an enclosing local area or enclosing parameter area.

Here is a sampling of ParaSail Virtual Machine instructions:

Move_Word, Store_Int_Literal, Store_String_Literal, Store_Real_Literal, Jump, Conditional_Jump, Call, Return, Parallel_Call, Parallel_Block, Parallel_Wait.

Operands are generally identified with object locators, except for literal operands which are identified either by their actual value, or by an index into a table of literals.

Note that there are no instructions for doing normal arithmetic or logical operations. These are all implemented by calling routines. There are a set of built-in routines for doing the typical set of arithmetic and logical operations, on operands of various sizes.

The more interesting instructions are the Parallel_Call, Parallel_Block, and Parallel_Wait. Parallel_Call is essentially equivalent to Call, where the calling routine computes the input parameters and places them in a parameter area, and then calls the routine. The difference with a Parallel_Call is that the caller also identifies a picothread master and allocates a small area for a picothread control block (pTCB), and the instruction completes without waiting for the call to complete. Instead, the Parallel_Call instruction adds the pTCB to a queue waiting to be serviced by one of the virtual processors (vCPUs) executing the ParaSail program. When the caller thread has finished its own work and wants to wait for the Parallel_Call to complete, it uses the Parallel_Wait instruction, identifying the same picothread master as was specified in the Parallel_Call instruction. This suspends the calling thread until all of the parallel picothreads associated with the picothread master complete.

The Parallel_Block instruction is very similar to the Parallel_Call instruction, except that it identifies instructions that are part of the current routine, rather than calling a separate routine. The execution of these instructions is performed by a separate picothread, which has its own pTCB, and local area. The static link of the local area for the Parallel_Block picothread refers to the local area of the thread invoking the Parallel_Block instruction, allowing the nested picothread to use up-level references to reach the local variables of the enclosing picothread.

The Return instruction is used to complete processing of both a Parallel_Call and a Parallel_Block, and the Parallel_Wait is used to wait for either kind of parallel activity.

We recently completed a prototype implementation of the ParaSail Virtual Machine, including the picothread scheduling. We learned some interesting lessons along the way. Initially, a vCPU that was executing a picothread that performed a Parallel_Wait was itself suspended. That quickly exhausted the number of vCPUs, and led us to start dynamically creating new vCPUs. That caused the overall stack space to grow dramatically, since each vCPU needed its own heavy-weight threading context in the underlying operating system, along with a stack.

At this point, we concluded that a vCPU that executed a Parallel_Wait instruction should service the queue of waiting pTCBs if the picothread master it was waiting for was not yet complete. That significantly reduced the number of vCPUs needed. However, it still didn't seem to be as efficient as it could be. As originally implemented, the queue of waiting pTCBs was first-in, first-out (FIFO). However, after looking at various traces of execution, we realized that it was the last pTCB that was created which was always the first pTCB to be awaited. Hence, we concluded that the pTCB queue should be last-in, first-out (LIFO). That is, a vCPU should preferentially service the most recently queued pTCB when it had cycles to spare, since that would more likely be associated with a picothread master that is being awaited, and by servicing that pTCB first, it will reduce the number of pTCBs that were suspended in a Parallel_Wait instruction. After making this final change, even a heavily recursive algorithm throwing off lots of parallel picothreads was handled efficiently.