Wednesday, August 1, 2012

A Pointer-Free Path to Object-Oriented Parallel Programming

The following is a first draft of a submission for the forthcoming "Foundations of Object-Oriented Languages" (FOOL 2012) workshop accompanying this year's SPLASH/OOPSLA conference in Tuscon.  Any comments would be welcome.

-----------

ParaSail: Pointer-Free Path to Object-Oriented Parallel Programming

Pointers are ubiquitous in modern object-oriented programming languages, and many data structures such as trees, lists, graphs, hash tables, etc. depend on them heavily.  Unfortunately, pointers can add significant complexity to programming.  Pointers can make storage management more complex, pointers can make assignment and equality semantics more complex, pointers can increase the ways two different names (access paths) can designate the same object, pointers can make program analysis and proof more complex, and pointers can make it harder to "divide and conquer" a data structure for parallel processing.

Is there an alternative to using pointers?  ParaSail, a new parallel object-oriented programming language, adopts a different paradigm for defining data structures.  Rather than using pointers, ParaSail supports flexible data structuring using "expandable" (and shrinkable) objects, along with generalized indexing.  By eliminating pointers, ParaSail significantly reduces the complexity for the programmer, while also allowing ParaSail to provide pervasive, safe, object-oriented parallel programming.

Expandable and Optional Objects


An "expandable" object is one which can grow without using pointers, much as a house can grow through additions.  Where once there was a door to the back yard, a new screened-in porch can be added.  Where once there was only one floor, a new floor can be added.  The basic mechanism for expansion in ParaSail is that every type has one additional value, called "null."  A component can initially be null, and then be replaced by a non-null value, thereby expanding the enclosing object.  At some later point the enclosing object could shrink, by replacing a non-null component with null. 

Not every component of an object is allowed to be null.  The component must be declared as "optional" if it is allowed to take on a null value.  For example, a Tree structure might have a (non-optional) "Payload" component, and then two additional components, "Left" and "Right," which are each declared as "optional Tree."  Similarly, a stand-alone object may be declared to be of a type "T," or of a type "optional T."  Only if it is declared "optional" may it take on the null value.  The value of an object X declared as "optional" may be tested for nullness using "X is null" or "X not null."

Another example of a data structure using optional components would be a linked list, with each node having two components, one "Payload" component, and a "Tail" component of type "optional List."  There is also a built-in parameterized type, "Basic_Array<Component_Type>" which allows the Component_Type to be specified as "optional."  This allows the construction of a hash table with buckets represented as linked-lists, by declaring the "backbone" of the hash table as a "Basic_Array<optional List<Hash_Table_Item>>."  The components of the hash table would start out as null, but as items are added to the hash table, one or more of the component lists would begin to grow.

Assignment, Move, and Swap Operations

Because there are no pointers, the semantics of assignment in ParaSail are very straightforward, namely the entire right-hand-side object is copied and assigned into the left-hand side, replacing whatever prior value was there.  However, there are times when it is desirable to "move" a component from one object to another, or "swap" two components.  Because implementing these on top of an assignment that uses copying would be painful, in ParaSail, "move" and "swap" are separate operations.  The semantics of "move" is that the value of the left-hand-side is replaced with the value of the right-hand-side, and the right-hand-side ends up null.  For "swap," the values of the left- and right-hand-side are swapped.  Syntactically, ParaSail uses ":=" for (copying) assignment, "<==" for move, and "<=>" for swap.  The ParaSail compiler is smart enough to automatically use "move" semantics when the right-hand-side is the result of a computation, rather than an object or component that persists after the assignment.

As an example of where "move" might be used, if our hash table grows to the point that it would be wise to lengthen the backbone, we could create a new Basic_Array twice as large (for example), and then "move" each list node from the old array into the new array in an appropriate spot, rebuilding each linked list, and then finally "move" the new array into the original hash-table object, replacing the old array.  The "swap" operation is also useful in many contexts, for example when balancing a tree structure, or when sorting an array.

Cyclic Data Structures and Generalized Indexing

Expandable objects allow the construction of many kinds of data structures, but a general, possibly cyclic graph is not one of them.  For this, ParaSail provides generalized indexing.  The array-indexing syntax, "A[I]," is generalized in ParaSail to be usable with any container-like data structure, where A is the container and I is the key into that data structure.  A directed graph in ParaSail could be represented as an indexed set of Nodes, where the index is a unique node "Id" of some sort, with edges represented as Predecessor and Successor  indice sets that are components of each Node.   There is no harm in following a "dangling" unique node-id in ParaSail, as that is easy to detect because the node-id would be missing from the indexed set.  In a language with pointers, either a dangling pointer will cause a storage leak, because the target node cannot be reclaimed, or it will lead to a potentially destructive reference to reclaimed storage.

Region-Based Storage Management

Storage management without pointers is significantly simplified.  All of the objects declared in a given scope are associated with a storage "region," essentially a local heap.  As an object grows, all new storage for it is allocated out of this region.  As an object shrinks, the old storage can be immediately released back to this region.  When a scope is exited, the entire region is reclaimed.  There is no need for asynchronous garbage collection, as garbage never accumulates.

Every object identifies its region, and in addition, when a function is called, the region in which the result object should be allocated is passed as an implicit parameter.  This "target" region is determined by how the function result is used.  If it is a temporary, then it will be allocated out of a temporary region associated with the point of call.  If it is assigned into a longer-lived object, then the function will be directed to allocated the result object out of the region associated with this longer-lived object.  The net effect is that there is no copying at the call site upon function return, since the result object is already sitting in the correct region.

Note that pointers are likely still used "behind the scenes" in any implementation of ParaSail, but eliminating them from the surface syntax and semantics can eliminate essentially all of the complexity associated with pointers.  That is, a semantic model of expandable and shrinkable objects, operating under "value semantics," rather than a semantic model of nodes connected with pointers, operating under "reference semantics," provides a number of benefits, such as simpler storage management, simpler assignment semantics, easier analyzability, etc. 

Parallel and Distributed Programming

In addition to removing pointers, certain other simplifications are made in ParaSail to ease parallel and distributed programming.  In particular, there are no global variables; functions may only update objects passed to them as "var" (in-out) parameters.  Furthermore, as part of passing an object as a "var" parameter, it is effectively "handed off" to the receiving function, and no further reference may be made to the object, until the function completes.  In particular, no part of the "var" parameter may be passed to any other function, or even to this same function as a separate parameter.  This eliminates the possibility of aliasing between a "var" parameter and any other object visible to the function.  These two additional rules, coupled with the lack of pointers, means that all parameter evaluation may happen in parallel (e.g. in "F(G(X), H(Y))", G(X) and H(Y) may be evaluated in parallel), and function calls may easily cross address-space boundaries, since the objects are self-contained (with no incoming or outgoing references), and only one function at a time can update a given object.

Concurrent Objects

All of the above rules apply to objects that are not designed for concurrent access.  ParaSail also supports the construction of concurrent objects, which allow lock-free, locked, and queued simultaneous access.  These objects are *not* "handed off" as part of parameter passing, but instead provide operations which synchronize any attempts at concurrent access.  Three kinds of synchronization are supported.  "Lock-free" synchronization relies on low-level hardware-supported operations such as atomic load and store, and compare-and-swap.  "Locked" synchronization relies on automatic locking as part of calling a "locked" operation of a concurrent object, and automatic unlocking as part of returning from the operation.  Finally, "queued" synchronization is provided, which evaluates a "dequeue condition" upon call (under a lock), and only if the condition is satisfied is the call allowed to proceed, still under the lock.  A typical "dequeue condition" might be that a buffer is not full, or that a mailbox has at least one element in it.  If the dequeue condition is not satisfied, then the caller is added to a queue.  At the end of any operation on the concurrent object that might change the result of the dequeue condition for a queued caller, the dequeue condition is evaluated and if true, the operation requested by the queued caller is performed before the lock is released.  If there are multiple queued callers, then they are serviced in turn until there are none with satisfied dequeue conditions.

Pointer-Free Object-Oriented Parallel Programming


The pointer-free nature of ParaSail is not just an interesting quirk.  Rather, we believe it represents a significantly simpler way to build large object-oriented systems.  By itself it simplifies storage management, assignment semantics, and analyzability, and when combined with the elimination of global variables and parameter aliasing, it allows for the easy parallelization of all expression evaluation, and the easy distribution of computations across address spaces, without having to make the jump to a pure side-effect-free functional language.

15 comments:

  1. Call me dumb, but those no-pointers look for me like pointers. And assignments with copy semantics is nothing more like dereference on the fly when using regular pointers.

    Anyway, a little example would not harm -- for example a single linked list and adding and removing object.

    ReplyDelete
    Replies
    1. The big difference is that there is no pointer assignment. Also, there is no global heap. All objects are essentially "stack" objects, but they can expand and shrink, which provides the necessary flexibility to define interesting data structures.

      As far as an example, the next blog entry is an example of a balanced tree, which supports both insertion and deletion, implemented in ParaSail in a pointer-free fashion.

      Delete
  2. I like this summary; I have followed this blog, but had forgotten the assign/move/swap distinction.

    typo in Region-Based Storage Management: "function will be directed to allocated the result" should be "allocate".

    Now if I could just write an Android app in Parasail ...

    ReplyDelete
    Replies
    1. Thanks for noticing the typo. I do hope to have a mobile version of ParaSail at some point. It seems like it might be a very nice and safe language for writing apps, especially as the number of cores start growing on your favorite phone/tablet.

      Delete
  3. Seems to be a standard OO language, but without the possibility of sharing an object that is not designed to be concurrent.

    Looks like a step further for saving the OO model in concurrent environments.

    Just a question, do you plan adding lazy/parallel-free/fork-join objects?

    Let me try to explain... I have an object encapsulating a huge computation, in order to trigger that computation and store the result, I would need to handle the instance off to a separate function. But I do not want to trigger that computation if none of the other objects needs it, but I do not want to trigger the same computation more than one time. Like a Future of Java.

    Objects A, B and C may need the value of a huge computation encapsulated in an object D.
    I want to fire a function in all the three objects (A, B and C) that may or may not use the value of D. So I will only fire off the D computation if one of the objects need it, and then if any of the others also need it, they should block and wait the first computation without trigger a new one.

    By the way, great post.

    ReplyDelete
    Replies
    1. On Wed, Aug 8, 2012 at 8:57 PM, Thiago Negri wrote:
      > Thiago Negri has left a new comment on your post "A Pointer-Free Path to
      > Object-Oriented Parallel Pr...":
      >
      > ...
      > Just a question, do you plan adding lazy/parallel-free/fork-join objects?
      >
      > Let me try to explain... I have an object encapsulating a huge computation,
      > in order to trigger that computation and store the result, I would need to
      > handle the instance off to a separate function. But I do not want to trigger
      > that computation if none of the other objects needs it, but I do not want to
      > trigger the same computation more than one time. Like a Future of Java.
      >
      > Objects A, B and C may need the value of a huge computation encapsulated in
      > an object D.
      > I want to fire a function in all the three objects (A, B and C) that may or
      > may not use the value of D. So I will only fire off the D computation if one
      > of the objects need it, and then if any of the others also need it, they
      > should block and wait the first computation without trigger a new one.

      This looks relatively straightforward to code, so it doesn't seem necessary to build a special feature for it. See below.
      >
      > By the way, great post.

      Thanks!

      Here is a possible implementation of your compute-on-need abstraction:

      concurrent interface Lazy_Computation<> is
          func Create_Cache() -> Lazy_Computation;
          func Get_Value
            (locked var Cache : Lazy_Computation)
              return Result_Type;
      end interface Lazy_Computation;

      concurrent class Lazy_Computation is
          var Result : optional Result_Type;
        exports
          func Create_Cache() -> Lazy_Computation is
              return (Result => null);
          end func Create_Cache;

          func Get_Value
            (locked var Cache : Lazy_Computation)
              return Result_Type is
              if Cache.Result is null then
                  // Compute now, and save for future users
                  Cache.Result := Big_Computation(...);
              end if;
              return Cache.Result;
          end func Get_Value;
      end class Lazy_Computation;

      Delete
  4. "Furthermore, as part of passing an object as a "var" parameter, it is effectively "handed off" to the receiving function, and no further reference may be made to the object, until the function completes."
    You should take a look at the Clean programming language, http://wiki.clean.cs.ru.nl/ and its uniqueness types, which seem to capture the same meaning.

    ReplyDelete
    Replies
    1. Interesting. "Clean" is using uniqueness to allow destructive updates without breaking referential transparency. The semantics are a bit different because it is saying that the object is "dead" after the call. In ParaSail, a var parameter is not dead after the call, but during the call, it is guaranteed there is only one reference to it.

      "Hand-off" semantics were first described, I believe, in the language Hermes (originally called "NIL"), by Strom et al:
      http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.9034

      It is also related to "linear types":

      http://en.wikipedia.org/wiki/Linear_types

      Delete
  5. 'Syntactically, ParaSail uses ":=" for (copying) assignment, "<==" for move, and "<=>" for swap.'

    I can't help but think of C++11's Move Semantics, where the equivalent of "x <== y" is spelled "x = std::move(y)". (And " x <=> y" is "swap(x, y)", the implementation of which is just three moves.)

    If you're removing pointers entirely, perhaps you should go all the way to a Linear Type System?

    ReplyDelete
    Replies
    1. C++11 has various kinds of pointers, some of which approximate the pointer-free, value semantics provided by ParaSail. Of course one of the challenges of using C++ is that it supports many, many ways to solve the same problem, some safe, some not so safe, some completely unsafe. One of the main goals of ParaSail was to simplify the language, provide only safe alternatives, yet still provide flexible support for the construction of elegant, efficient programs.

      Going all the way to a Linear Type system might be throwing out major parts of the "baby" with the "bathwater," and seems like it isn't any safer than what ParaSail provides, while feeling much more difficult to use. ParaSail has the "hand off" semantics of Linear Type systems when doing parameter passing, but then the value is "passed back" at the end of the call. In addition, ParaSail unifies simple types like integers with more complex types like trees, by providing a default copying assignment, and a move assignment where desired. With a Linear Type system, simple types and pointer types have very different semantics, which seems to unnecessarily complicate the model for users.

      Delete
  6. This reminds me of Rust, which does allow shared objects but also takes pains to do as much as possible with "unique pointers"--objects referenced from only one place. Of course, that's to simplify memory management first and for parallelism second.

    ParaSail sounds promising as a very simple and intuitive way to write highly parallel programs (though I am mostly ignorant about highly parallel programming). The trouble with being able to run all functions in parallel, though, is "what kind of processor allows running all functions in parallel"? I don't know of a system that makes forks and joins trivially cheap.

    But assuming suitable hardware, it seems like ParaSail would be a neat addition to a conventional programming language, where, if you simply limit parts of your code to certain operations, those parts suddenly become massively parallel. I don't think I'd like to program under these restrictions all the time, though. Using indexes instead of actual pointers sounds terribly cumbersome, because the code would always have to keep track of which array goes with which index, manually, and you cannot, of course, simulate a pointer by bundling the array and index together ... unless you've developed a new abstraction to "fake it".

    ReplyDelete
    Replies
    1. Yes, Rust and ParaSail have a lot in common. ParaSail attempts to provide a simpler model overall, but our goals are very much in common -- provide a systems programming language that is safe and productive in the context of a multicore world.

      As far as the lack of "actual pointers," surprisingly I have not found it particularly cumbersome. A large number of the uses for pointers are handled by optional and expandable objects, and for structures like directed graphs, it turns out indices rather than pointers are often used to represent edges anyway, as the existence of a node is often independent of the number of edges to it. More generally, given that pointers are not generally part of the semantic model for relational databases, nor for distributed algorithms, it seems that pointers are not really that fundamental.

      I would very much welcome examples of the kinds of situations where you think pointers would be particularly valuable. These would be a good challenge for seeing how easily the problem could be solved in a pointer-free language like ParaSail.

      Delete
    2. Well, the graph is a good example. I would normally represent a graph node as an array of edges where each edge is a weight and a pointer to the next node. If there are no pointers then you (1) can't pass around a pointer to a node and (2) the node can't contain pointers to other nodes. So instead you use indexes, but this means that you always have to pass two arguments to a method instead of just one: the entire graph plus an index, instead of just a single node pointer. Every time you want to access a node or edge, you must explicitly mention both the graph and the index, instead of just the single pointer. It sounds pretty inconvenient. You mention relational databases, and I feel the same way about them; SQL is quite a pain compared to languages that offer containment and pointers.

      Oh, and since you don't want to make a copy of the graph, you'll pass it as a "var" parameter, right? So isn't parallelism lost?

      Delete
    3. You can pass nodes around directly as well, if you wish. The indexing operation can provide a short-lived "view" of an element of a larger object, and for a graph, that would typically be a node. Here is an example implementation of a directed graph in ParaSail, with an "indexing" operation:

      http://parasail-programming-language.blogspot.com/2012/07/implementation-of-directed-graph-in.html

      As far as parameter passing in general, it is "by value" but because of the "hand-off" semantics, it can accomplish by-value parameter semantics using a by-reference mechanism. Even function return is by-reference because functions are told in which region to build the object to return, so you get efficient parameter passing and return in spite of the nice "value-based" semantic model.

      Delete