Sunday, October 7, 2012

Work Stealing Statistics; Qsort vs. N-queens

We recently added more statistics to the ParaSail interpreter that relate to work stealing (see recent blog entry about rev 3.5 release).  ParaSail uses work stealing to map the very light weight picothreads generated by the compiler, to the heavier-weight server threads which are provided by the underlying operating system, typically about one per physical core.  For more general information on work stealing, see:

Work stealing is a scheduling approach where each server thread has its own (double-ended) queue of picothreads.  When a server executing code spawns off another picothread, it is added to the end of that server's queue.  When a server finishes its current picothread and needs another one to execute, it removes the most-recently added picothread from its queue.  In other words, it uses a last-in, first-out (LIFO) discipline with its own picothread queue.  If at some point, the server finds its own queue to be empty, it looks around at other servers' queues, and if it finds one with some picothreads, it will steal a picothread from that queue.  However, in this case it will choose the oldest picothread, that is, it uses a first-in, first-out (FIFO) discipline when stealing from another server's queue.

So an interesting question is: When a server goes about picking a new picothread to execute, what proportion of the time does the server have to steal a picothread rather than simply pick one off its own queue?  We have added statistics to the ParaSail interpreter to determine this.  So the next interesting question is: Do all algorithms produce the same proportion of stealing, or does the stealing proportion vary significantly from one algorithm to another?

So here are some answers, based on having four server threads:

Qsort : Sort array of 500 randomly selected values between 0 and 99, twice:

  43139 thread steals out of 67096 total thread initiations;
 64.3% stealing proportion
N-queens: Generate all solutions to the N queens problem for 8x8 chess board, 6x6, and 4x4:

  1690 thread steals out of 25015 total thread initiations;
 6.8% stealing proportion

Quite a dramatic difference.

The classic (inefficient) recursive fibonacci produces an even smaller proportion of steals, for Fib(15):

  17 thread steals out of 986 total thread initiations;
 1.7% stealing proportion

We also collect statistics on how many of the servers are doing useful work on average, and how many picothreads are waiting on any server's queue, on average.  For these three cases, the averages are:
 Qsort:   3.45 servers active,  0.92 threads waiting for service
 N-Queens:3.89 servers active, 13.94 threads waiting for service
 Fib(15): 3.99 servers active, 17.99 threads waiting for service

We have also started collecting statistics about the proportion of allocations from regions that are performed by the server that created the region initially, versus by some other server (which presumably stole some work from the "owning" server).  This clearly reflects something about the work stealing, but also provides some additional information about what proportion of typical work is manipulating objects coming from an enclosing picothread, and how much is manipulating objects local to the scope of the picothread doing the work.   Here are the results for the above three test cases:
 Qsort:   64,578 non-owner allocs out of 445,479 = 14.5%
 N-Queens:29,489 non-owner allocs out of 115,586 = 25.5%
 Fib(15): [no allocations -- all objects were small]

So what about the algorithm determines the thread stealing proportion and/or the owner vs. non-owner allocations from a region?  Sounds like an interesting research project...

Rev 3.5 alpha 0.5 release of ParaSail

We just released a new version (3.5 alpha 0.5) of the ParaSail compiler and virtual machine, available at the same URL as before:

We have added some improved statistics to the virtual machine interpreter related to work stealing and region usage, and a command "stats" to view the statistics without exiting the interpreter, and "stats clear" to view and zero-out the statistics.  We will shortly be posting another blog entry about some of the more interesting results of these new statistics.

Additional updates in this release:
  • The conditional expression "(if X then Y)" is now equivalent to "(if X then Y else null)"
  • We now use the "<|=" operator, if available, when building up an aggregate with an iterator inside, such as "[for I in 1..10 => I*5]".  Before we used the "var_indexing" operator instead if both "<|=" and "var_indexing" were available, using the value of "I" as the index.  This didn't make much difference when building up a vector, but was somewhat limiting in other cases.
  • We now allow "reverse" when defining an aggregate with an iterator, such as "[for I in 1..10 reverse => I * 5]," allowing the creation of a reverse-order vector, for example.
  • We have fixed a bug which meant that certain annotations were ignored if they immediately preceded or followed a call on an operation.