help with (bordeaux-)threads for simple parallelization

Hi,

I am running Monte Carlo calculations which are inherently
parallelizable.  I would like to run multiple threads to speed things
up, but I have no experience with threads.

A mock-up of the sequential version looks like this:

(defun run-chains (n)
  (let ((result (make-array n)))
    (dotimes (i n result)
      (setf (aref result i) (chain-generator)))))

For the parallel version, I would like to 

1) start CHAIN-GENERATOR's in separate threads for each index,
2) have the result of each end up in the appropriate place in the vector, and
3) have RUN-CHAINS wait for each of the threads to terminate before returning.

I would prefer to use bordeaux-threads if possible, in order to make
my code portable.  I have figured out how to do 1-2 with MAKE-THREAD
and closures, but I don't know how to deal with 3.  I am assuming that
I need to use locks/mutexes, but I am not familiar with these
concepts.  Hints or example code would be appreciated.

Best,

Tamas

tkpapp (998)
8/2/2010 10:54:16 AM
comp.lang.lisp

On 02/08/2010 12:54, Tamas K Papp wrote:
> 1) start CHAIN-GENERATOR's in separate threads for each index,
> 2) have the result of each end up in the appropriate place in the vector, and
> 3) have RUN-CHAINS wait for each of the threads to terminate before returning.
>
> I would prefer to use bordeaux-threads if possible, in order to make
> my code portable.  I have figured out how to do 1-2 with MAKE-THREAD
> and closures, but I don't know how to deal with 3.  I am assuming that
> I need to use locks/mutexes, but I am not familiar with these
> concepts.  Hints or example code would be appreciated.

1) Don't start a thread for each index. Rather, start as many threads as 
you have processing units available, and have each of them work 
sequentially on a subrange of the vector. Otherwise, the runtime system 
spends too much time coordinating threads that have to fight for 
processing time, and you want to avoid that.

2) Don't have each thread perform an assignment on the shared vector. 
Rather report the results back to a coordinating thread, and let it 
perform the assignment. Otherwise the vector has to be copied to the 
caches of the processing units, and then back to main memory, which is 
unnecessary.

3) You can use condition variables for coordinating the threads. Create 
a single condition variable, and make the thread that starts all the 
worker threads wait on the condition variable. Each worker thread can 
notify that condition variable when it's done. If there are n worker 
threads, you have to do n waits on the single condition variable.
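A minimal C/POSIX sketch of this single-condition-variable scheme (later replies point at pthreads anyway). NCHAINS, chain_generator, and run_chains are illustrative stand-ins for the original Lisp names, and the waiter uses the standard predicate loop rather than literally counting n waits, because condition variables can wake spuriously:

```c
#include <pthread.h>

#define NCHAINS 4   /* stand-in for the number of chains */

double results[NCHAINS];
int done_count = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t finished = PTHREAD_COND_INITIALIZER;

/* placeholder for the real chain-generator */
static double chain_generator(int i) { return i * 1.5; }

static void *worker(void *arg)
{
    int i = (int)(long)arg;
    results[i] = chain_generator(i);      /* each thread owns one slot */
    pthread_mutex_lock(&lock);
    done_count++;
    pthread_cond_signal(&finished);       /* notify the waiting starter */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int run_chains(void)
{
    pthread_t threads[NCHAINS];
    for (long i = 0; i < NCHAINS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* wait until every worker has signalled; the predicate loop
       guards against spurious wakeups */
    pthread_mutex_lock(&lock);
    while (done_count < NCHAINS)
        pthread_cond_wait(&finished, &lock);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NCHAINS; i++)
        pthread_join(threads[i], NULL);   /* reclaim thread resources */
    return done_count;
}
```

For this exact shape, joining the threads alone would already suffice; the condition variable mirrors the description above and generalizes to cases where the waiter wants to react before all workers exit.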

I find the abstractions bordeaux-threads provides a bit too low-level 
and too thin. What LispWorks 6.0 provides is much more convenient. There 
I would use mailboxes for the coordination, or maybe barriers. One 
option is to implement a mailbox abstraction on top of condition 
variables yourself.

Information about the synchronization constructs in LispWorks can be 
found here: 
http://www.lispworks.com/documentation/lw60/LW/html/lw-275.htm#pgfId-893780

Final note: Parallelization only pays off if what you parallelize is 
computationally intensive. If chain-generator doesn't perform a lot of 
computation, the overhead of parallelizing and coordinating is probably 
higher than just performing the sequential version.



Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
Pascal
8/2/2010 11:30:59 AM
On 08/02/2010 07:30 AM, Pascal Costanza wrote:
> 2) Don't have each thread perform an assignment on the shared vector.
> Rather report the results back to a coordinating thread, and let it
> perform the assignment. Otherwise the vector has to be copied to the
> caches of the processing units, and then back to main memory, which is
> unnecessary.

As long as he isn't interleaving the ranges, this shouldn't be 
necessary; a direct write is just as efficient and avoids an extra 
copy (I don't see why the whole vector would be copied into each 
thread's cache).  The main thread allocates the problem and answer 
space, and each worker thread is responsible for filling its subrange.

You do want to avoid multiple CPUs writing to the same cache lines. 
Ensuring consistency requires extra memory bus synchronization, and 
can easily slow things down 10x.

i.e., with T threads and K indices per thread, thread N should process 
the contiguous block (N*K N*K+1 ... N*K+K-1), not the interleaved set 
(N N+T N+2T ...).


> I find the abstractions bordeaux-threads provides a bit too low-level
> and too thin. What LispWorks 6.0 provides is much more convenient.
> There I would use mailboxes for the coordination, or maybe barriers.
> One option is to implement a mailbox abstraction on top of condition
> variables yourself.

Funny -- there are many missing features; but I miss the low-level 
ones!  (e.g. memory barriers, atomic ops, ...)  For this problem, a 
single condition variable/counting semaphore is quite sufficient.

The above algorithm may be described as "hand me your paper when 
you've finished."  This inter-person synchronization may be 
unnecessary.  An alternative is one of the simplest lock-free 
algorithms:  "last one out, turn off the lights".  It slightly inverts 
the control structure to avoid locking synchronization on thread exit. 
It is employed by the popular C++ shared_ptr.

The main thread initializes the result space with a count of the 
threads and a continuation.  As each thread exits, it performs an 
atomic decrement on the thread count.  The thread that decrements the 
count to 0 calls the continuation.  [The shared_ptr actually 
initializes the count to 1; new references increment the count; scope 
exits decrement the count.]
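A hedged C sketch of the "last one out" idea, using a GCC atomic builtin that comes up later in the thread; NWORKERS, turn_off_lights, and the other names are invented for illustration:

```c
#include <pthread.h>

#define NWORKERS 4   /* illustrative */

int remaining = NWORKERS;   /* shared count, modified only atomically */
int lights_off = 0;

/* the "continuation": runs exactly once, in whichever thread is last */
static void turn_off_lights(void) { lights_off = 1; }

static void *worker(void *arg)
{
    (void)arg;
    /* ... a real worker would fill its slice of the result here ... */
    if (__sync_sub_and_fetch(&remaining, 1) == 0)
        turn_off_lights();              /* last one out */
    return NULL;
}

int last_one_out_demo(void)
{
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return lights_off;
}
```

Note that no lock protects `remaining`; the atomic decrement is the entire synchronization, which is the point of the trick.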

- Daniel
D
8/2/2010 2:13:19 PM
On 02/08/2010 16:13, D Herring wrote:
> As long as he isn't interleaving the ranges, this shouldn't be
> necessary; a direct write is just as efficient and avoids an extra copy
> (I don't see why the whole vector would be copied into each thread's
> cache). The main thread allocates the problem and answer space, and each
> worker thread is responsible for filling its subrange.
>
> You do want to avoid multiple CPUs writing to the same cache lines.
> Ensuring consistency requires extra memory bus synchronization, and can
> easily slow things down 10x.
>
> i.e., with T threads and K indices per thread, thread N should process
> the contiguous block (N*K N*K+1 ... N*K+K-1), not the interleaved set
> (N N+T N+2T ...).

OK

>> I find the abstractions bordeaux-threads provides a bit too low-level
>> and too thin. What LispWorks 6.0 provides is much more convenient.
>> There I would use mailboxes for the coordination, or maybe barriers.
>> One option is to implement a mailbox abstraction on top of condition
>> variables yourself.
>
> Funny -- there are many missing features; but I miss the low-level ones!
> (e.g. memory barriers, atomic ops, ...) For this problem, a single
> condition variable/counting semaphore is quite sufficient.

OK, OK. ;) Both more high-level and low-level constructs are missing. (I 
find condition variables to be inconvenient to use, because they are too 
low-level, compared to mailboxes.)

> The above algorithm may be described as "hand me your paper when you've
> finished." This inter-person synchronization may be unnecessary. An
> alternative is one of the simplest lock-free algorithms: "last one out,
> turn off the lights". It slightly inverts the control structure to avoid
> locking synchronization on thread exit. It is employed by the popular
> C++ shared_ptr.
>
> The main thread initializes the result space with a count of the threads
> and a continuation. As each thread exits, it performs an atomic
> decrement on the thread count. The thread that decrements the count to 0
> calls the continuation. [The shared_ptr actually initializes the count
> to 1; new references increment the count; scope exits decrement the count.]

Interesting...

Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
Pascal
8/3/2010 9:24:14 AM
On Mon, 02 Aug 2010 10:13:19 -0400, D Herring wrote:

> The main thread initializes the result space with a count of the threads
> and a continuation.  As each thread exits, it performs an atomic
> decrement on the thread count.  The thread that decrements the count to
> 0 calls the continuation.

Thanks for both answers.  I have to admit that this is still over my
head: I am beginning to grasp the idea behind the two algorithms, but
don't know how to implement them.  Eg in the "last one out" algorithm,
I think I have to use some kind of counters, but if multiple threads
access these, how do I make sure that they don't mess up?  A simple
example would be extremely helpful (preferably using a portable thread
library so I can replicate it on SBCL).

By the way, the total number of independent chains is small (3-7),
each returning a large (10000x100) array of double-floats (which,
AFAIK, is passed as a pointer in CL so it shouldn't be a big deal), so
I am not worried about the cost of assignment.

Thanks,

Tamas
Tamas
8/3/2010 10:50:54 AM
On 03/08/2010 12:50, Tamas K Papp wrote:

> Thanks for both answers.  I have to admit that this is still over my
> head: I am beginning to grasp the idea behind the two algorithms, but
> don't know how to implement them.  Eg in the "last one out" algorithm,
> I think I have to use some kind of counters, but if multiple threads
> access these, how do I make sure that they don't mess up?  A simple
> example would be extremely helpful (preferably using a portable thread
> library so I can replicate it on SBCL).
>
> By the way, the total number of independent chains is small (3-7),
> each returning a large (10000x100) array of double-floats (which,
> AFAIK, is passed as a pointer in CL so it shouldn't be a big deal), so
> I am not worried about the cost of assignment.

Here is a sketch of the "hand me your paper" solution in LispWorks:

(defmacro parallel-dotimes ((var iterations &optional result) &body body)
   `(parallel-dotimes-f
     ,iterations
     (lambda (,var)
       (declare (ignorable ,var))
       ,@body)
     (lambda (,var)
       (declare (ignorable ,var))
       ,result)))

(defconstant +nof-processors+ 4)

(defun parallel-dotimes-f (iterations loopf resultf)
   (if (< iterations +nof-processors+)
     (dotimes (i iterations (funcall resultf i))
       (funcall loopf i))
     (loop with range-size = (ceiling iterations +nof-processors+)
           with mailbox = (mp:make-mailbox)
           for start by range-size
           for end = (+ start range-size)
           for count from 0 ; counts the worker processes actually started
           while (< start iterations) do
           (mp:process-run-function
            "parallel-dotimes" '()
            (lambda (start end)
              (loop for i from start below end do (funcall loopf i))
              (mp:mailbox-send mailbox t))
            start (min end iterations))
           finally
           (loop repeat count do (mp:mailbox-read mailbox))
           (return (funcall resultf iterations)))))

(defun run-chains (n)
   (let ((result (make-array n)))
     (parallel-dotimes (i n result)
       (setf (aref result i) (random 100000)))))

It's a good idea to abstract away the parallel version of dotimes, so 
you can later change it to a different parallelization strategy. If 
there are more places in your code where you want to use parallel 
execution, it's probably very worthwhile to factor out the creation of 
the processes and have them available as worker processes to which you 
can "send" work as needed.

Something like http://common-lisp.net/project/eager-future/ or 
http://userweb.cs.utexas.edu/users/ragerdl/parallelism/ already does 
this for you, for example.


Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
0
pc56 (3929)
8/4/2010 10:58:01 AM
On 08/03/2010 06:50 AM, Tamas K Papp wrote:

> I think I have to use some kind of counters, but if multiple threads
> access these, how do I make sure that they don't mess up?  A simple
> example would be extremely helpful (preferably using a portable thread
> library so I can replicate it on SBCL).

Unfortunately, the required operations are not in any portable library 
I'm aware of.  Many CL implementations lack them entirely.  That said, 
here's how it's done elsewhere.


Most CPUs provide a "compare and swap" operation.  It is beautiful 
once you understand how to use it.

Here are the GCC builtins:
http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html
bool __sync_bool_compare_and_swap (type *ptr, type oldval, type newval)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval)

The function takes three arguments:  a place, the value you think it 
contains, and a new value to put there.  If the oldval is in fact the 
current value when the function is called, then it will be replaced by 
newval.

The bool form returns a flag indicating whether the assignment was 
made; the other form returns the value observed by the function.


The take home message is:  You can atomically make a change, 
conditional on your thread having an accurate view of the world.

So an atomic addition may be implemented as something like the 
following unoptimized code.

int atomic_add(int *place, int increment)
{
  int x = *place;                 /* snapshot of the current value */
  for (;;)
  {
    int sum = x + increment;
    int y = __sync_val_compare_and_swap(place, x, sum);
    if (y == x) return y;         /* nobody raced us; y is the old value */
    x = y;                        /* someone else won; retry from their value */
  }
}


Many CPUs also provide an atomic increment and/or decrement operator. 
Here you don't care what value currently exists; you just want it to 
be changed correctly.  Again, GCC has builtins for this.
type __sync_fetch_and_add (type *ptr, type value)
type __sync_fetch_and_sub (type *ptr, type value)

These are roughly equivalent to the above atomic_add: they return the 
previous value of the place and modify it atomically.


These two operations are the key building blocks in "lock-free" 
algorithms.  Since they have direct hardware support, they are 
generally 10x faster than constructs like mutexes and semaphores 
(which generally involve kernel context switching on top of internal 
atomic ops).  However, a CAS is generally 10x slower than simply 
setting the memory location.
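For illustration, a small demo of __sync_fetch_and_add in action (the function names are invented for this sketch): four threads hammer one counter, and because each increment is atomic the result is still exact:

```c
#include <pthread.h>

int counter = 0;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++)
        __sync_fetch_and_add(&counter, 1);   /* atomic increment, no lock */
    return NULL;
}

int fetch_add_demo(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, bump, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return counter;   /* 4000 only because every increment was atomic */
}
```

Replace the builtin with a plain `counter++` and the demo loses updates nondeterministically, which is exactly the "mess up" the original question worries about.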

Later,
Daniel
dherring1 (562)
8/5/2010 4:21:08 AM
On 05/08/2010 06:21, D Herring wrote:
> On 08/03/2010 06:50 AM, Tamas K Papp wrote:
>
>> I think I have to use some kind of counters, but if multiple threads
>> access these, how do I make sure that they don't mess up? A simple
>> example would be extremely helpful (preferably using a portable thread
>> library so I can replicate it on SBCL).
>
> Unfortunately, the required operations are not in any portable library
> I'm aware of. Many CL implementations lack them entirely.

Here is an attempt at doing this in LispWorks 6.0:

(defmacro parallel-dotimes
   ((var iterations &optional result) &body body)
   `(parallel-dotimes-f
     ,iterations
     (lambda (,var)
       (declare (ignorable ,var))
       ,@body)
     (lambda (,var)
       (declare (ignorable ,var))
       ,result)))

(defconstant +nof-processors+ 4)

(defun parallel-dotimes-f (iterations loopf resultf)
   (if (< iterations +nof-processors+)
     (dotimes (i iterations (funcall resultf i))
       (funcall loopf i))
     (loop with range-size = (ceiling iterations +nof-processors+)
           with mailbox = (mp:make-mailbox)
           with count = (vector (ceiling iterations range-size)) ; one per worker actually started
           for start by range-size
           for end = (+ start range-size)
           while (< start iterations) do
           (mp:process-run-function
            "parallel-dotimes" '()
            (lambda (start end)
              (loop for i from start below end do (funcall loopf i))
              (when (zerop (sys:atomic-decf (svref count 0)))
                (mp:mailbox-send mailbox t)))
            start (min end iterations))
           finally
           (mp:mailbox-read mailbox)
           (return (funcall resultf iterations)))))

(defun run-chains (n)
   (let ((result (make-array n)))
     (parallel-dotimes (i n result)
       (setf (aref result i) (random 100000)))))

This uses atomic-decf, but compare-and-swap is also available in 
LispWorks. At the moment, these operations only work on global 
variables, and components of vectors, structs, etc., but not on lexical 
variables. That's why the count variable is boxed in a vector.


Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
pc56 (3929)
8/5/2010 11:40:32 AM
On 08/05/2010 07:40 AM, Pascal Costanza wrote:

> Here is an attempt at doing this in LispWorks 6.0:

Looks pretty good.  It could be a drop-in upgrade for existing code 
since it maintains a "single-threaded flavor".  One thread starts 
parallel-dotimes and waits for the result.  Is that thread special, or 
can it store a continuation to be executed by whatever thread happens 
to finish last?


> This uses atomic-decf, but compare-and-swap is also available in
> LispWorks.

That's good.  Atomic-decf should be more efficient than the 
compare-and-swap loop; I just wrote that as an illustration.


> At the moment, these operations only work on global
> variables, and components of vectors, structs, etc., but not on
> lexical variables. That's why the count variable is boxed in a vector.

I think that will be true for any CL implementation; memory used by 
multiple threads should be on the heap, thus it can't be stored in a 
single thread's stack.


BTW, there are other tricks to play.  Thread creation can be 
expensive; it can be cheaper to populate a "thread pool" and feed it 
tasks.  Also, if one PARALLEL-DOTIMES calls code containing another, 
then too many threads may be spawned.  Finding efficient ways to slice 
a problem can be difficult.

One trick I heard (IIRC, it was explained in the Sun Fortress 
presentation at the Boston Lisp Meeting) was called something like 
"task stealing".  The idea was to have a thread pool milling about a 
central task list.  When a thread starts to do something which may 
take a long time, it keeps a reasonable chunk for itself (e.g. by 
bisection) and pushes the rest onto the task list.  An idle thread may 
then take that remaining task, possibly subdividing it further.
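This is not task stealing proper, but the "central task list" half of it can be sketched in C with an atomic cursor; every name here is invented for the sketch:

```c
#include <pthread.h>

#define NTASKS 100
#define NTHREADS 4

int next_task = 0;                 /* shared cursor into the task list */
int results[NTASKS];

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int i = __sync_fetch_and_add(&next_task, 1);  /* claim one task */
        if (i >= NTASKS) return NULL;                 /* list exhausted */
        results[i] = i * i;        /* stand-in for real work */
    }
}

int task_pool_demo(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    int sum = 0;
    for (int i = 0; i < NTASKS; i++) sum += results[i];
    return sum;
}
```

Real work stealing adds per-thread deques on top of this so idle threads steal from busy ones instead of all contending on a single cursor.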


Later,
Daniel
D
8/6/2010 2:01:19 AM
On 06/08/2010 04:01, D Herring wrote:

> Looks pretty good. It could be a drop-in upgrade for existing code since
> it maintains a "single-threaded flavor". One thread starts
> parallel-dotimes and waits for the result. Is that thread special, or
> can it store a continuation to be executed by whatever thread happens to
> finish last?

I guess the continuation could be executed by whatever thread finishes 
last, but I'm not sure if that buys us anything. We have to wait for 
that final result anyway.

We also always have to watch out for the dynamic environment in Common 
Lisp, so I prefer to have what you call a "single-threaded flavor."

>> At the moment, these operations only work on global
>> variables, and components of vectors, structs, etc., but not on
>> lexical variables. That's why the count variable is boxed in a vector.
>
> I think that will be true for any CL implementation; memory used by
> multiple threads should be on the heap, thus it can't be stored in a
> single thread's stack.

If a closure is passed around, the lexical variables are also stored on 
the heap. Two or more closures closing over the same environments could 
be executed in different threads, and then compare-and-swap and friends 
become relevant for lexical variables.

> BTW, there are other tricks to play. Thread creation can be expensive;
> it can be cheaper to populate a "thread pool" and feed it tasks. Also,
> if one PARALLEL-DOTIMES calls code containing another, then too many
> threads may be spawned. Finding efficient ways to slice a problem can be
> difficult.

Indeed.

> One trick I heard (IIRC, it was explained in the Sun Fortress
> presentation at the Boston Lisp Meeting) was called something like "task
> stealing". The idea was to have a thread pool milling about a central
> task list. When a thread starts to do something which may take a long
> time, it keeps a reasonable chunk for itself (e.g. by bisection) and
> pushes the rest onto the task list. An idle thread may then take that
> remaining task, possibly subdividing it further.

The actual term is "work stealing". It was originally used in MultiLisp 
and QLisp, and then picked up by Cilk and Cilk++, whose implementations 
seem to be the most mature. They have backed their approach with a 
series of excellent papers that show why work stealing is indeed an 
extremely good approach for parallelization.

I'm currently experimenting with implementations of that idea in 
LispWorks, and hope to be able to publish a library for doing this...


Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
pc56 (3929)
8/6/2010 8:58:59 AM
On 08/06/2010 04:58 AM, Pascal Costanza wrote:

> The actual term is "work stealing". It was originally used in
> MultiLisp and QLisp, and then picked up by Cilk and Cilk++. The Cilk
> and Cilk++ versions have what seems to be the most mature versions of
> work stealing, and they have backed their approach with a series of
> excellent papers that show why work stealing is indeed an extremely
> good approach for parallelization.

Yes.  That's the term and the proper heritage.

- Daniel
D
8/7/2010 1:59:21 AM
On Fri, 06 Aug 2010 21:59:21 -0400, D Herring wrote:

>> The actual term is "work stealing". It was originally used in MultiLisp
>> and QLisp, and then picked up by Cilk and Cilk++.
>
> Yes.  That's the term and the proper heritage.

It seems that a lot of the tools/concepts I am looking for have been
invented a long time ago, it's just that I don't know about them (I am
a self-taught programmer, never had formal CS education).

Can either of you (or anyone else) recommend a good introduction /
tutorial to the relevant concepts and algorithms?  Ideally, I am
looking for a book that would cover threads & related concepts the
same way PCL covers Lisp, for the total newbie.  Doesn't have to be
Lisp-based though.

Thanks,

Tamas
Tamas
8/7/2010 7:54:47 AM
On 08/07/2010 03:54 AM, Tamas K Papp wrote:

> Can either of you (or anyone else) recommend a good introduction /
> tutorial to the relevant concepts and algorithms?  Ideally, I am
> looking for a book that would cover threads & related concepts the
> same way PCL covers Lisp, for the total newbie.  Doesn't have to be
> Lisp-based though.

I've never seen such a beast, but I didn't look much either, being 
satisfied with a hodge-podge of small articles, chapters, and papers. 
AFAICT, multithreading is still something of a black art.  There are 
some patterns that work, and some pitfalls to watch out for.  The Java 
language was one of the first with good threading support (including 
an actual memory model); thus there are good documentation and 
guidelines for Java threading.  The Linux kernel has some good 
documentation on all the ways memory can get out of sync between 
threads (e.g. Documentation/memory-barriers.txt).  POSIX also has a 
good set of the classic primitives (e.g. `apropos pthread`, `man 
pthreads`, and google pthreads).

For basic stuff, you want to learn about the mutex (mutual exclusion - 
lock protected areas), (counting) semaphore (locking and event 
notification), barrier, condition variable, and thread-local/specific 
storage.
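As a small illustration of the counting-semaphore item in that list, here is a POSIX sketch; the names are invented, and unnamed semaphores via sem_init assume a platform that supports them (e.g. Linux):

```c
#include <pthread.h>
#include <semaphore.h>

#define NWORKERS 3   /* illustrative */

static sem_t done;   /* counting semaphore used as a completion counter */

static void *worker(void *arg)
{
    (void)arg;
    /* ... real work would happen here ... */
    sem_post(&done);          /* "I'm finished": increment the count */
    return NULL;
}

int semaphore_demo(void)
{
    sem_init(&done, 0, 0);    /* process-local, initial count 0 */
    pthread_t t[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        sem_wait(&done);      /* N waits: returns once per completed worker */
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&done);
    return 1;
}
```

Unlike a condition variable, a semaphore remembers posts that happen before the waiter arrives, so no extra predicate variable is needed.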

For advanced stuff, I strongly recommend investigating lock-free 
algorithms and nonblocking synchronization.  The classic locking 
algorithms have a major problem:  they don't compose.  The program 
must maintain a strict (partial) ordering on the locks it uses; 
otherwise deadlock will occur (generally after delivery).  This is 
very hard when libraries introduce or modify their own internal locks...

Maybe Pascal has a good reference.

- Daniel
D
8/8/2010 3:37:40 AM
On Aug 7, 8:54 am, Tamas K Papp <tkp...@gmail.com> wrote:

> Can either of you (or anyone else) recommend a good introduction /
> tutorial to the relevant concepts and algorithms?  Ideally, I am
> looking for a book that would cover threads & related concepts the
> same way PCL covers Lisp, for the total newbie.  Doesn't have to be
> Lisp-based though.

I've been really enjoying 'The Art of Multiprocessor Programming',
published in 2008, by Herlihy and Shavit.  (It has a section on work
stealing.)

http://www.amazon.com/Art-Multiprocessor-Programming-Maurice-Herlihy/dp/0123705916
Thomas
8/8/2010 8:09:43 AM
On 08/08/2010 05:37, D Herring wrote:
> Maybe Pascal has a good reference.

Unfortunately not.


Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
Pascal
8/10/2010 11:09:04 AM
On 10/08/2010 13:09, Pascal Costanza wrote:
> On 08/08/2010 05:37, D Herring wrote:
>> Maybe Pascal has a good reference.
>
> Unfortunately not.


...however, the LispWorks 6.0 documentation provides some hints, with 
some examples here and there. It's a bit dense, though.

See especially 
http://www.lispworks.com/documentation/lw60/RNIG/html/readme-292.htm and 
http://www.lispworks.com/documentation/lw60/LW/html/lw-228.htm


Pascal

-- 
My website: http://p-cos.net
Common Lisp Document Repository: http://cdr.eurolisp.org
Closer to MOP & ContextL: http://common-lisp.net/project/closer/
0
pc56 (3929)
8/10/2010 11:18:43 AM
Tamas K Papp <tkpapp@gmail.com> writes:

> I am assuming that
> I need to use locks/mutexes, but I am not familiar with these
> concepts.  Hints or example code would be appreciated.

You can read any of the literature about unix threads (written for C).
google for: POSIX threads

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
pjb
8/10/2010 1:18:33 PM