


Would this work on even one platform?

Here is a simple example:




static char string[] = "XXXXX XXXXX";
static bool volatile done = false;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;


void thread_1() {
    strcpy(string, "Hello World");
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    done = true;
}


void other_threads() {
    do {
        pthread_mutex_lock(&mutex);
        pthread_mutex_unlock(&mutex);
        sched_yield();
    } while (! done);

    assert(! strcmp(string, "Hello World"));
}




I assume that on at least one platform, the lock/unlock pair should execute 
the memory barriers I need. I also assume that the compiler will not reorder 
the store to done above the lock/unlock pair in thread_1, and not reorder 
the load to done above the lock/unlock pair in the other_threads.


Where am I going wrong? I MUST be missing something! 

James
11/20/2009 6:36:11 PM
comp.programming.threads

On 11/20/2009 12:36 PM, James wrote:
> Here is simple example:

> static char string[] = "XXXXX XXXXX";
> static bool volatile done = false;
> static pthread_mutex_t mutex;
> 
> void thread_1() {
>     strcpy(string, "Hello World");
>     pthread_mutex_lock(&mutex);
>     pthread_mutex_unlock(&mutex);
>     done = true;
> }
> 
> void other_threads() {
>     do {
>         pthread_mutex_lock(&mutex);
>         pthread_mutex_unlock(&mutex);
>         sched_yield();
>     } while (! done);
> 
>     assert(! strcmp(string, "Hello World"));
> }

> I assume that on at least one platform, the lock/unlock pair should execute 
> the memory barriers I need. I also assume that the compiler will not reorder 
> the store to done above the lock/unlock pair in thread_1, and not reorder 
> the load to done above the lock/unlock pair in the other_threads.
> 
> Where am I going wrong? I MUST be missing something! 

In fact this actually does work on my x86 linux desktop.  But it's not
guaranteed by the spec.

First, the behaviour of sched_yield() is not defined by the spec for
non-realtime tasks.

Second, I'm not sure your assumption about the compiler not reordering
stores is correct.  Both the write to "string" and the write to "done"
happen outside any locks so they could theoretically be reordered with
respect to each other resulting in "done" being true on other cpus
before the write to "string".

Third, I'm not sure your assumption about the compiler not reordering
loads is correct. The read of "string" is outside any locks, and so
nothing prevents the compiler from hoisting it earlier in the code, to
before the loop.

In practice though, even with optimization enabled gcc generates pretty
much the code that you would expect.
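For contrast, here is a sketch (mine, not code from any post in this thread) of the variant the spec does guarantee: both the writes and the reads happen inside the critical section, so no ordering question arises. Names like `run_demo` are made up; error checking is omitted.

```cpp
#include <cassert>
#include <cstring>
#include <pthread.h>
#include <sched.h>

static char string_buf[16] = "XXXXX XXXXX";
static bool done = false;                 // protected by mutex
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void* writer(void*) {
    pthread_mutex_lock(&mutex);
    std::strcpy(string_buf, "Hello World");   // publish under the lock
    done = true;
    pthread_mutex_unlock(&mutex);
    return 0;
}

static void* reader(void*) {
    bool d = false;
    while (!d) {
        pthread_mutex_lock(&mutex);
        d = done;                             // read under the same lock
        if (d) assert(!std::strcmp(string_buf, "Hello World"));
        pthread_mutex_unlock(&mutex);
        if (!d) sched_yield();
    }
    return 0;
}

// Runs one writer and one reader; returns 0 on success.
int run_demo() {
    pthread_t w, r;
    if (pthread_create(&w, 0, writer, 0)) return -1;
    if (pthread_create(&r, 0, reader, 0)) return -1;
    pthread_join(w, 0);
    pthread_join(r, 0);
    return 0;
}
```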

Chris
Chris
11/20/2009 8:01:57 PM
On Nov 20, 10:36 am, "James" <n...@spam.invalid> wrote:
> Here is simple example:

Your code is guaranteed valid. This is perfectly normal mutex use,
just with some code removed. It's similar to this paradigm:

A1) Construct an object.
A2) Acquire a mutex.
A3) Put a pointer to that object where other threads can find it.
A4) Release the mutex.

and

B1) Acquire a mutex.
B2) Get a pointer to an object that was initialized by another thread
but whose contents are known stable once the pointer is made visible.
B3) Release the mutex.
B4) Access the pointer.

You just omitted steps A3 and B2. But this is typical mutex use with
those steps, and you know the pointer you could have gotten in B2
would be the one you could have stored in A3.
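The A1..A4 / B1..B4 steps above can be spelled out as compilable code. This is an illustrative sketch I am reading into the post, not code from it; error checking is omitted for brevity.

```cpp
#include <cassert>
#include <cstring>
#include <pthread.h>
#include <sched.h>

struct object { char msg[16]; };

static object* g_ptr = 0;     // protected by g_lock
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

static void* producer(void*) {
    object* obj = new object;              // A1: construct the object
    std::strcpy(obj->msg, "Hello World");
    pthread_mutex_lock(&g_lock);           // A2: acquire the mutex
    g_ptr = obj;                           // A3: publish the pointer
    pthread_mutex_unlock(&g_lock);         // A4: release the mutex
    return 0;
}

static void* consumer(void*) {
    object* obj = 0;
    while (!obj) {
        pthread_mutex_lock(&g_lock);       // B1: acquire the mutex
        obj = g_ptr;                       // B2: get the pointer
        pthread_mutex_unlock(&g_lock);     // B3: release the mutex
        if (!obj) sched_yield();
    }
    assert(!std::strcmp(obj->msg, "Hello World"));  // B4: access it
    delete obj;
    return 0;
}

// Runs one producer and one consumer; returns 0 on success.
int run_publication() {
    pthread_t p, c;
    if (pthread_create(&p, 0, producer, 0)) return -1;
    if (pthread_create(&c, 0, consumer, 0)) return -1;
    pthread_join(p, 0);
    pthread_join(c, 0);
    return 0;
}
```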

DS
David
11/20/2009 9:35:48 PM
"James" <no@spam.invalid> wrote in message news:he6nhm$m0a$1@aioe.org...
> Here is simple example:
[...]
> void other_threads() {
>    do {
>        pthread_mutex_lock(&mutex);
>        pthread_mutex_unlock(&mutex);
>        sched_yield();
>    }
>
>    while (! done);
>
>    assert(! strcmp(string, "Hello World"));
> }

You have a race-condition here. You need to observe `done' as `true' 
__BEFORE__ you perform the lock/unlock pair. Try something like this:
______________________________________________________________
// compile this separately, without optimizations.
extern void membar()
{
    static pthread_mutex_t g_mutex
        = PTHREAD_MUTEX_INITIALIZER;

    pthread_mutex_lock(&g_mutex);
    pthread_mutex_unlock(&g_mutex);
}




// link with object file that contains `membar()';

// turn link-time optimizations OFF!

static char string[] = "XXXXX XXXXX";
static atomic_word volatile done = 0;


static void thread_1() {
    strcpy(string, "Hello World");
    membar();
    ATOMIC_STORE(&done, 1);
}


static void other_threads() {
    while (! ATOMIC_LOAD(&done)) sched_yield();
    membar();
    assert(! strcmp(string, "Hello World"));
}
______________________________________________________________




In your example as-is, `other_threads()' can lock/unlock the mutex; then 
`thread_1()' stores `true' into `done'; then `other_threads()' loads `done' 
and reads `string' without ever having synchronized after the store... BOOM!!!


The fix is to observe `done' first, then lock/unlock the mutex. This is 
because if a consumer thread observes `done' as being true, it must mean 
that the producer thread has already produced the data and locked/unlocked 
the mutex. The only thing left to do is synchronize on that mutex.




> I assume that on at least one platform, the lock/unlock pair should 
> execute the memory barriers I need.

Agreed.

;^D




> I also assume that the compiler will not reorder the store to done above 
> the lock/unlock pair in thread_1, and not reorder the load to done above 
> the lock/unlock pair in the other_threads.

Make sure to turn link-time optimizations off! Basically, a compiler usually 
treats a call to an externally compiled unknown function as a so-called 
compiler barrier. FWIW, MSVC has compiler barrier intrinsics to get the job 
done.
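The intrinsics alluded to can be sketched like this (an illustration I've added, not from the post): `_ReadWriteBarrier()` is MSVC's compiler-barrier intrinsic, and the empty asm with a "memory" clobber is the conventional GCC equivalent. Both constrain only the compiler, not the CPU.

```cpp
#include <cassert>

// Compiler-only barrier: prevents the compiler from moving memory accesses
// across this point. It emits no CPU fence instruction, so it does NOT
// order the accesses as seen by other processors.
#if defined(_MSC_VER)
#  include <intrin.h>
#  define COMPILER_BARRIER() _ReadWriteBarrier()
#elif defined(__GNUC__)
#  define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#else
   // Fallback: a call to an externally compiled unknown function, which a
   // compiler (without link-time optimization) must treat as a barrier.
extern void external_opaque_call();
#  define COMPILER_BARRIER() external_opaque_call()
#endif

// Usage mirroring thread_1() from the thread: the compiler may not move the
// store to `flag` above the store to `data`.
static int data = 0;
static volatile int flag = 0;

void publish() {
    data = 666;
    COMPILER_BARRIER();
    flag = 1;
}
```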




> Where am I going wrong? I MUST be missing something!

You got the placement of the "artificial" membar wrong. 

Chris
11/20/2009 10:33:39 PM
"Chris M. Thomasson" <no@spam.invalid> wrote in message 
news:4PENm.11779$tz6.9151@newsfe02.iad...
> "James" <no@spam.invalid> wrote in message news:he6nhm$m0a$1@aioe.org...
>> Here is simple example:
[...]
>> I also assume that the compiler will not reorder the store to done above 
>> the lock/unlock pair in thread_1, and not reorder the load to done above 
>> the lock/unlock pair in the other_threads.
>
> Make sure to turn link-time optimizations off! Basically, a compiler 
> usually treats a call to an externally compiled unknown function as a 
> so-called compiler barrier. FWIW, MSVC has compiler barrier intrinsic's to 
> get the job done.

Read this entire thread:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/29ea516c5581240e

http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6

Chris
11/21/2009 3:07:43 AM
On 11/20/2009 02:01 PM, Chris Friesen wrote:

> Second, I'm not sure your assumption about the compiler not reordering
> stores is correct.  Both the write to "string" and the write to "done"
> happen outside any locks so they could theoretically be reordered with
> respect to each other resulting in "done" being true on other cpus
> before the write to "string".
> 
> Third, I'm not sure your assumption about the compiler not reordering
> loads is correct. The read of "string" is outside the any locks and so
> nothing prevents the compiler from hoisting it earlier in the code to
> before the loop.

On further reflection, I think that these are not valid concerns and
that the loads/stores would not be able to be moved across the
pthread_mutex calls.

However, I think that the other Chris has a valid point that the
lock/unlock pair needs to come after the read of "done" but before the
read of "string".

Chris
Chris
11/21/2009 5:25:52 AM
On 20 Nov, 21:36, "James" <n...@spam.invalid> wrote:
> Here is simple example:
>
> static char string[] = "XXXXX XXXXX";
> static bool volatile done = false;
> static pthread_mutex_t mutex;
>
> void thread_1() {
>     strcpy(string, "Hello World");
>     pthread_mutex_lock(&mutex);
>     pthread_mutex_unlock(&mutex);
>     done = true;
> }

If pthread_mutex_lock() is considered only as acquire and
pthread_mutex_unlock() is considered only as release, then
pthread_mutex_lock() can hoist above and pthread_mutex_unlock() can
sink below:

    pthread_mutex_lock(&mutex);
    strcpy(string, "Hello World");
    done = true;
    pthread_mutex_unlock(&mutex);

And then there are 2 normal memory accesses that can be reordered:

    pthread_mutex_lock(&mutex);
    done = true;
    strcpy(string, "Hello World");
    pthread_mutex_unlock(&mutex);

--
Dmitriy V'jukov
Dmitriy
11/21/2009 10:36:54 AM
"Dmitriy Vyukov" <dvyukov@gmail.com> wrote in message 
news:7510692e-190f-4aa7-bfb5-3b53943fb6f0@b2g2000yqi.googlegroups.com...
> On 20 Nov, 21:36, "James" <n...@spam.invalid> wrote:
> > Here is simple example:
> >
> > static char string[] = "XXXXX XXXXX";
> > static bool volatile done = false;
> > static pthread_mutex_t mutex;
> >
> > void thread_1() {
> > strcpy(string, "Hello World");
> > pthread_mutex_lock(&mutex);
> > pthread_mutex_unlock(&mutex);
> > done = true;
> > }

> If pthread_mutex_lock() is considered only as acquire and
> pthread_mutex_unlock() is considered only as release, then
> pthread_mutex_lock() can hoist above and pthread_mutex_unlock() can
> sink below:

>     pthread_mutex_lock(&mutex);
>     strcpy(string, "Hello World");
>     done = true;
>     pthread_mutex_unlock(&mutex);

> And then there are 2 normal memory accesses that can be reordered:

>     pthread_mutex_lock(&mutex);
>     done = true;
>     strcpy(string, "Hello World");
>     pthread_mutex_unlock(&mutex);

Okay. So, assuming that I changed the code to what Chris M. Thomasson 
suggested:

http://groups.google.com/group/comp.programming.threads/msg/d52c83e1d755091d


void thread_1() {
    strcpy(string, "Hello World");
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    done = true;
}


void other_threads() {
    while (! done) sched_yield();
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    assert(! strcmp(string, "Hello World"));
}


And assuming all the hoist/sink reorderings that can occur, the "worst" 
possible outcome could look like:


void thread_1() {
    pthread_mutex_lock(&mutex);
    done = true;
    strcpy(string, "Hello World");
    pthread_mutex_unlock(&mutex);
}


void other_threads() {
    pthread_mutex_lock(&mutex);
    assert(! strcmp(string, "Hello World"));
    while (! done) sched_yield();
    pthread_mutex_unlock(&mutex);
}



AFAICT, that will still work. Dmitriy, can a particular reordering break the 
code? It seems like if the lock can hoist above and the unlock can sink 
below, all that does is "widen" the critical section, and it does not seem 
like it would be able to break the code in the sense of the assertion 
failing.


Also, I notice that Chris Thomasson wrapped up the lock/unlock pair in a 
single function and said to externally compile it. I think he was trying to 
reduce compiler-level reordering. It seems like you are talking about 
hardware reordering? Also, if the pthread_mutex_lock has a storeload, then 
is it still considered to have acquire semantics? Or is it something 
stronger?

Sorry for all the questions.

Thanks!

:^) 

James
11/21/2009 2:28:54 PM
"James" <no@spam.invalid> wrote in message news:he8tea$5s0$1@aioe.org...
[...]
> void other_threads() {
>    pthread_mutex_lock(&mutex);
>    assert(! strcmp(string, "Hello World"));
>    while (! done) sched_yield();
>    pthread_mutex_unlock(&mutex);
> }
[...]

Ummm, hopefully the compiler would never do this, because it could cause a 
deadlock. Forget about the actual while loop; I was basically referring to 
the load from done in the predicate of the loop. 

James
11/21/2009 5:14:37 PM
On Nov 21, 5:28 pm, "James" <n...@spam.invalid> wrote:
> And assuming all the hoist/sink reorderings that can occur, the "worst"
> possible outcome could look like:
>
> void thread_1() {
>     pthread_mutex_lock(&mutex);
>     done = true;
>     strcpy(string, "Hello World");
>     pthread_mutex_unlock(&mutex);
>
> }
>
> void other_threads() {
>     pthread_mutex_lock(&mutex);
>     assert(! strcmp(string, "Hello World"));
>     while (! done) sched_yield();
>     pthread_mutex_unlock(&mutex);
>
> }
>
> AFAICT, that will still work. Dmitriy, can a particular reordering break the
> code? It seems like if the lock can hoist above and the unlock can sink
> below, all's that does is "widen" the critical section, and does not seem
> like it would be able to break the code in the sense of the assertion
> failing.


Humm... I think you are right, I do not see any reordering that breaks
the code. I just mixed up your variant with the following variant
(that I tried to use myself):

// thread 1
g_data[index] = produce_data();
// here we are using "mutex barrier"
pthread_mutex_lock();
pthread_mutex_unlock();
g_index = index;

// thread 2
while (*(int volatile*)&g_index == 0) {/*spin*/}
// here we are relying on the fact that compiler and hardware respect
// control dependencies
consume_data(g_data[g_index]);

And this code can be broken by the reordering.
So it seems that "mutex barrier" can be used iff it is used on both
sides.


> Also, I notice that Chris Thomasson wrapped up the lock/unlock pair in a
> single function and said to externally compile it. I think he was trying to
> reduce compiler level reordering. It seems like you are talking about
> hardware reordering? Also, if the pthread_mutex_lock has a storeload, then
> is it still considered to have acquire semantics? Or is it something
> stronger?

Yes, many implementations use a full barrier in lock(), and sometimes in
unlock() too.

--
Dmitriy V'jukov
Dmitriy
11/21/2009 5:57:06 PM
"Dmitriy Vyukov" <dvyukov@gmail.com> wrote in message 
news:3792d789-ec85-4c03-8b52-34b8a30ce85b@u7g2000yqm.googlegroups.com...
On Nov 21, 5:28 pm, "James" <n...@spam.invalid> wrote:
> > And assuming all the hoist/sink reorderings that can occur, the "worst"
> > possible outcome could look like:
> >
> > void thread_1() {
> > pthread_mutex_lock(&mutex);
> > done = true;
> > strcpy(string, "Hello World");
> > pthread_mutex_unlock(&mutex);
> >
> > }
> >
> > void other_threads() {
> > pthread_mutex_lock(&mutex);
> > assert(! strcmp(string, "Hello World"));
> > while (! done) sched_yield();
> > pthread_mutex_unlock(&mutex);
> >
> > }
> >
> > AFAICT, that will still work. Dmitriy, can a particular reordering break 
> > the
> > code? It seems like if the lock can hoist above and the unlock can sink
> > below, all's that does is "widen" the critical section, and does not 
> > seem
> > like it would be able to break the code in the sense of the assertion
> > failing.


> Humm... I think you are right, I do not see any reordering that breaks
> the code.


> I just mixed up your variant with the following variant
> (that I tried to use by myself):

[...]

> And this code can be broken by the reordering.
> So it seems that "mutex barrier" can be used iff it is used on both
> sides.

Assuming that the mutex lock can rise and the unlock can sink, can you 
"force" a certain ordering by using more than one mutex? Think of this:


static int g_state = 0;
static bool volatile g_done = 0;
static pthread_mutex_t g_mutex;


struct thread
{
    pthread_mutex_t m_mutex;

    void lock() { pthread_mutex_lock(&m_mutex); }
};


void thread_1(thread& t)
{



    g_state = 666;

}










> > Also, I notice that Chris Thomasson wrapped up the lock/unlock pair in a
> > single function and said to externally compile it. It think he was 
> > trying to
> > reduce compiler level reordering. It seems like you are talking about
> > hardware reordering? Also, if the pthread_mutex_lock has a storeload, 
> > then
> > is it still considered to have acquire semantics? Or is it something
> > stronger?

> Yes, many implementations use full barrier in lock(), and sometimes in
> unlock() too. 

James
11/21/2009 8:21:04 PM
ARGH! I accidentally sent the previous response prematurely!




Assuming that the mutex lock can rise and the unlock can sink, can you
"force" a certain ordering by using more than one mutex? Think of this:




struct thread
{
    pthread_mutex_t m_mutex1;
    pthread_mutex_t m_mutex2;


    thread()
    {
        lock1();
    }

    ~thread()
    {
        unlock1();
    }


    void lock1() { pthread_mutex_lock(&m_mutex1); }
    void unlock1() { pthread_mutex_unlock(&m_mutex1); }

    void lock2() { pthread_mutex_lock(&m_mutex2); }
    void unlock2() { pthread_mutex_unlock(&m_mutex2); }


    void membar() volatile
    {
        lock2();
        unlock1();
        lock1();
        unlock2();
    }
};




static int g_state = 0;
static bool volatile g_done = 0;


void thread_1(thread& t)
{
    g_state = 666;
    t.membar();
    g_done = true;
}


void other_threads(thread& t)
{
    while (! g_done) spin();
    t.membar();
    assert(g_state == 666);
}




Would that work? AFAICT, the store to g_state should be performed before the 
store to g_done in thread_1 because of the funky mutex usage. Also, I don't 
see how anything can be reordered that would break the assertion.



What am I missing?




>> > Also, I notice that Chris Thomasson wrapped up the lock/unlock pair in 
>> > a
>> > single function and said to externally compile it. It think he was 
>> > trying to
>> > reduce compiler level reordering. It seems like you are talking about
>> > hardware reordering? Also, if the pthread_mutex_lock has a storeload, 
>> > then
>> > is it still considered to have acquire semantics? Or is it something
>> > stronger?
>
>> Yes, many implementations use full barrier in lock(), and sometimes in
>> unlock() too.

So, it seems possible to emulate memory barriers using per-thread mutexes...


? 
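The per-thread scheme above can be restated as a runnable sketch (my restatement, in C++11 style with made-up names, not the original code). Caveat: nothing in POSIX or the C++ memory model guarantees that an unlock/lock pair on a privately held mutex acts as a fence for unrelated locations; this merely exploits the hardware barriers real mutex implementations execute. The flag is a relaxed atomic so the sketch itself has no data race on it.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

static int g_state = 0;
static std::atomic<bool> g_done{false};

// Each thread holds its own mutex and executes an unlock/lock pair where a
// memory barrier is wanted. The mutex is never contended; only the barriers
// inside the mutex implementation matter.
struct per_thread_membar {
    std::mutex m;
    per_thread_membar() { m.lock(); }
    ~per_thread_membar() { m.unlock(); }
    void membar() { m.unlock(); m.lock(); }   // the unlock/lock pair
};

void thread_1() {
    per_thread_membar t;
    g_state = 666;
    t.membar();                               // "release" side
    g_done.store(true, std::memory_order_relaxed);
}

void other_thread() {
    per_thread_membar t;
    while (!g_done.load(std::memory_order_relaxed))
        std::this_thread::yield();
    t.membar();                               // "acquire" side
    assert(g_state == 666);
}

int run_membar_demo() {
    std::thread a(thread_1), b(other_thread);
    a.join();
    b.join();
    return g_state;
}
```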

James
11/21/2009 8:45:23 PM
"James" <no@spam.invalid> wrote in message news:he9jfs$1kl$1@aioe.org...
> ARGH! I accidentally sent the previous response prematurely!
>
>
>
>
> Assuming that the mutex lock can rise and the unlock can sink, can you
> "force" a certain ordering by using more than one mutex? Think of this:

> static int g_state = 0;
> static bool volatile g_done = 0;
>
>
> void thread_1(thread& t)
> {
>    g_state = 666;
>    t.membar();
>    g_done = true;
> }


Let's examine that, assuming mutex lock is acquire and unlock is release:
__________________________________________________________
void thread_1(thread& t)
{

S1:    g_state = 666;


    t.membar()
    {
M1:    lock2();
M2:    unlock1();
M3:    lock1();
M4:    unlock2();
    };


S2:    g_done = true;

}
__________________________________________________________


Okay. I see something important here:


1. M2 and M3 should never hoist above M1.
2. M2 and M3 should never sink below M4.
3. M3 should never hoist above M2
4. M2 should never sink below M3




This allows for another important observation:

1. S1 should never sink below M2
2. S2 should never hoist above M3




AFAICT, that gives you something that is analogous to a
`std::memory_order_acq_rel' barrier.




> Would that work?

Yes. BTW, I use the same method for the lock-based version of vZOOM:

http://groups.google.ru/group/comp.programming.threads/msg/59e9b6e427b4a144

except that I allow the polling thread to synchronize with reader thread by
actually acquiring the per-thread owner_lock's.


One more thing, don't forget about asymmetric synchronization. If a mutex
uses that type of sync, then you basically need to have a "foreign" thread
try to acquire the mutex in order to force a memory barrier on the
"preferred" thread. This is one of the reasons why I allow the polling logic
to sync on the owner_locks.


[...]

Chris
11/21/2009 11:23:07 PM
"Chris M. Thomasson" <no@spam.invalid> wrote in message 
news:Wd%Nm.66335$Wf2.51005@newsfe23.iad...
> "James" <no@spam.invalid> wrote in message news:he9jfs$1kl$1@aioe.org...
>> ARGH! I accidentally sent the previous response prematurely!
>>
>>
>>
>>
>> Assuming that the mutex lock can rise and the unlock can sink, can you
>> "force" a certain ordering by using more than one mutex? Think of this:
>
>> static int g_state = 0;
>> static bool volatile g_done = 0;
>>
>>
>> void thread_1(thread& t)
>> {
>>    g_state = 666;
>>    t.membar();
>>    g_done = true;
>> }
>
>
> Let's examine that, assuming mutex lock is acquire and unlock is release:
> __________________________________________________________
[...]
> __________________________________________________________

[...]


Actually, AFAICT you don't actually need the extra mutex for this example. 
The reason why I use it is to be in conformance to POSIX standard. You can 
code `thread_1' like:
____________________________________________________________
static int g_state = 0;
static bool volatile g_done = 0;


void thread_1()
{
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; // per-thread mutex

    pthread_mutex_lock(&mutex);




S1:    g_state = 666;


M1:    pthread_mutex_unlock(&mutex);
M2:    pthread_mutex_lock(&mutex);


S2:    g_done = 1;




    pthread_mutex_unlock(&mutex);
}
____________________________________________________________


- M1 should never hoist above S1

- M2 should never sink below S2

- M2 should never hoist above M1

- M1 should never sink below M2

- S1 should never sink below M1 and M2

- S2 should never rise above M1 and M2




>> Would that work?
>
> Yes. BTW, I use the same method for the lock-based version of vZOOM:
[...]


The synchronization aspect of the particular embodiment in pseudo-code looks 
like:
___________________________________________________________
struct thread
{
    mutex m_memory_mutex;
    mutex m_owner_mutex;
    unsigned long m_version;
};


#define VZSYNC 512


void vzthread(thread& t)
{
    t.m_memory_mutex.lock();

    for (size_t i = 1 ;; ++i)
    {
        // read shared data-structure

       if (! (i % VZSYNC))
       {
           t.m_owner_mutex.lock();
           t.m_memory_mutex.unlock();
           ++t.m_version;
           t.m_memory_mutex.lock();
           t.m_owner_mutex.unlock();
       }
    }

    t.m_memory_mutex.unlock();
}


void vzpoll()
{
    for (;;)
    {
        // snapshot
        for each thread as t
        {
            t.m_owner_mutex.lock();
            unsigned long version = t.m_version;
            t.m_owner_mutex.unlock();
        }

        // blah
    }
}
___________________________________________________________


When the version count of a snapshot differs, it means that the reader 
thread's actions prior to the version update are fully visible to the polling 
thread. This is due to the fact that the memory mutex is unlocked and locked 
within the critical section guarded by the owner mutex. 
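The version-snapshot idea can be restated as a compilable sketch (mine, not vZOOM itself; all names are made up). The reader bumps its version counter inside the owner-mutex critical section, so a poller that observes a changed version also observes everything the reader did before the bump.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct reader_state {
    std::mutex memory_mutex;
    std::mutex owner_mutex;
    unsigned long version = 0;
    int observed = 0;          // stands in for work done before the bump
};

void reader(reader_state& t) {
    t.memory_mutex.lock();
    t.observed = 666;                  // actions prior to the version update
    t.owner_mutex.lock();
    t.memory_mutex.unlock();           // quiescent point for the poller
    ++t.version;
    t.memory_mutex.lock();
    t.owner_mutex.unlock();
    t.memory_mutex.unlock();
}

void poller(reader_state& t) {
    for (;;) {
        t.owner_mutex.lock();
        unsigned long v = t.version;
        if (v) assert(t.observed == 666);   // visible once version changed
        t.owner_mutex.unlock();
        if (v) break;
        std::this_thread::yield();
    }
}

int run_snapshot_demo() {
    reader_state t;
    std::thread a(reader, std::ref(t)), b(poller, std::ref(t));
    a.join();
    b.join();
    return t.observed;
}
```

Note the lock order (memory mutex first, then owner mutex) never inverts, so there is no deadlock; the poller only ever takes the owner mutex.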

Chris
11/22/2009 6:59:41 PM
"Dmitriy Vyukov" <dvyukov@gmail.com> wrote in message 
news:3792d789-ec85-4c03-8b52-34b8a30ce85b@u7g2000yqm.googlegroups.com...
> On Nov 21, 5:28 pm, "James" <n...@spam.invalid> wrote:
> > And assuming all the hoist/sink reorderings that can occur, the "worst"
> > possible outcome could look like:
> >
[...]
> >
> > AFAICT, that will still work. Dmitriy, can a particular reordering break 
> > the
> > code? It seems like if the lock can hoist above and the unlock can sink
> > below, all's that does is "widen" the critical section, and does not 
> > seem
> > like it would be able to break the code in the sense of the assertion
> > failing.


> Humm... I think you are right, I do not see any reordering that breaks
> the code.

I modeled this scenario in Relacy:


http://relacy.pastebin.com/f4b36dd98


It cannot find a way to break it either.

;^)




> I just mixed up your variant with the following variant
> (that I tried to use by myself):

> // thread 1
> g_data[index] = produce_data();
> // here we are using "mutex barrier"
> pthread_mutex_lock();
> pthread_mutex_unlock();
> g_index = index;

> // thread 2
> while (*(int volatile*)&g_index == 0) {/*spin*/}
> // here we are relying on the fact that compiler and hardware respect
> control dependencies
> consume_data(g_data[g_index]);

> And this code can be broken by the reordering.
> So it seems that "mutex barrier" can be used iff it is used on both
> sides.


Yes. I was wondering how to model this in Relacy:

http://groups.google.com/group/comp.programming.threads/msg/3f8ef89bff9bff02


Humm... Would Relacy treat the per-thread mutex actions like:
_______________________________________________________________
void thread_1()
{
    std::mutex m;
    m.lock($);


T1-M1:    m.unlock($);
T1-M2:    m.lock($);


    m.unlock($);
}


void thread_2()
{
    std::mutex m;
    m.lock($);


T2-M1:    m.unlock($);
T2-M2:    m.lock($);


    m.unlock($);
}
_______________________________________________________________




as being equivalent to:
_______________________________________________________________
void thread_1()
{
T1-M1:    std::atomic_thread_fence(std::memory_order_release);
T1-M2:    std::atomic_thread_fence(std::memory_order_acquire);
}


void thread_2()
{
T2-M1:    std::atomic_thread_fence(std::memory_order_release);
T2-M2:    std::atomic_thread_fence(std::memory_order_acquire);
}
_______________________________________________________________



?


I don't think it will. Even though it would in practice, asymmetric 
synchronization schemes aside for a moment of course...

[...] 

Chris
11/22/2009 7:31:04 PM
On Nov 22, 10:31 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> I modeled this senerio in Relacy:
>
> http://relacy.pastebin.com/f4b36dd98
>
> It cannot find a way to break it either.
>
> ;^)

Note that one still has to use std::atomic for g_flag, even if actual
"synchronization" is provided by mutexes. And if one has std::atomic,
then there is little sense in using mutexes for synchronization in
this example.
However, I think there is still benefit in coding with Relacy not
exactly what you have in production. Because if Relacy finds some
issue in a test, and then you manually verify that the same issue
applies to the original production code, well, it's a plus. Some
issues may be masked, though.


> > I just mixed up your variant with the following variant
> > (that I tried to use by myself):
> > // thread 1
> > g_data[index] = produce_data();
> > // here we are using "mutex barrier"
> > pthread_mutex_lock();
> > pthread_mutex_unlock();
> > g_index = index;
> > // thread 2
> > while (*(int volatile*)&g_index == 0) {/*spin*/}
> > // here we are relying on the fact that compiler and hardware respect
> > control dependencies
> > consume_data(g_data[g_index]);
> > And this code can be broken by the reordering.
> > So it seems that "mutex barrier" can be used iff it is used on both
> > sides.
>
> Yes. I was wondering how to model this in Relacy:
>
> http://groups.google.com/group/comp.programming.threads/msg/3f8ef89bf...
>
> Humm... Would Relacy treat the per-thread mutex actions like:
> _______________________________________________________________
> void thread_1()
> {
>     std::mutex m;
>     m.lock($);
>
> T1-M1:    m.unlock($);
> T1-M2:    m.lock($);
>
>     m.unlock($);
>
> }
>
> void thread_2()
> {
>     std::mutex m;
>     m.lock($);
>
> T2-M1:    m.unlock($);
> T2-M2:    m.lock($);
>
>     m.unlock($);}
>
> _______________________________________________________________
>
> as being equivalent to:
> _______________________________________________________________
> void thread_1()
> {
> T1-M1:    std::atomic_thread_fence(std::memory_order_release);
> T1-M2:    std::atomic_thread_fence(std::memory_order_acquire);
>
> }
>
> void thread_2()
> {
> T2-M1:    std::atomic_thread_fence(std::memory_order_release);
> T2-M2:    std::atomic_thread_fence(std::memory_order_acquire);
>
> _______________________________________________________________
>
> ?
>
> I don't think it will. Even though it would in practice, asymmetric
> synchronization schemes aside for a moment of course...

Relacy will treat them differently.
Mutex lock/unlock is basically the same as acquire/release operation
on a variable. Stand-alone fences are more powerful in some sense,
because they do not bind to a particular variable. And at the same
time stand-alone fences (except seq_cst) are weaker, because they do
not establish total order over operations (as store/RMW on a
particular atomic).

--
Dmitriy V'jukov
Dmitriy
11/23/2009 6:47:19 AM
"Dmitriy Vyukov" <dvyukov@gmail.com> wrote in message 
news:8cf1c9ff-fa83-4de1-8095-e27c64434e07@p19g2000vbq.googlegroups.com...
On Nov 22, 10:31 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > I modeled this senerio in Relacy:
> >
> > http://relacy.pastebin.com/f4b36dd98
> >
> > It cannot find a way to break it either.
> >
> > ;^)

> Note that one still have to use std::atomic for g_flag, even if actual
> "synchronization" is provided by mutexes. And if one have std::atomic,
> then there is a little sense in using mutexes for synchronization in
> this example.
> However, I think there is still benefit in coding with Relacy not
> exactly what you have in production. Because if Relacy finds some
> issue in a test, and then you manually verifies that the same issue
> applies to the original production code, well, it's a plus. Some
> issues may be masked. though.

Agreed.




> > I just mixed up your variant with the following variant
> > (that I tried to use by myself):
[...]
>
> Yes. I was wondering how to model this in Relacy:
>
> http://groups.google.com/group/comp.programming.threads/msg/3f8ef89bf...
>
> Humm... Would Relacy treat the per-thread mutex actions like:
> _______________________________________________________________
[...]
> _______________________________________________________________
>
> as being equivalent to:
> _______________________________________________________________
[...]
> _______________________________________________________________
>
> ?
>
> I don't think it will. Even though it would in practice, asymmetric
> synchronization schemes aside for a moment of course...


> Relacy will treat them differently.
> Mutex lock/unlock is basically the same as acquire/release operation
> on a variable.

I think using mutexes this way would be undefined behavior wrt C++0x.




> Stand-alone fences are more powerful in some sense,
> because they do not bind to a particular variable. And at the same
> time stand-alone fences (except seq_cst) are weaker, because they do
> not establish total order over operations (as store/RMW on a
> particular atomic).

I personally like stand-alone fences because IMHO they are more flexible. I 
can put them exactly where I need them; e.g.:
________________________________________________________________
#define N 10


static int g_values[N] = { NULL };
static std::atomic<bool> g_flags[N] = { NULL };


void thread_1()
{
    for (size_t i = 0; i < N; ++i)
    {
        g_values[i] = 666;
    }

    std::atomic_thread_fence(std::memory_order_release);

    for (size_t i = 0; i < N; ++i)
    {
        g_flags[i].store(true, std::memory_order_relaxed);
    }
}


void other_threads()
{
retry:
    for (size_t i = 0; i < N; ++i)
    {
        if (! g_flags[i].load(std::memory_order_relaxed))
        {
            spin();
            goto retry;
        }
    }

    std::atomic_thread_fence(std::memory_order_acquire);

    for (size_t i = 0; i < N; ++i)
    {
        assert(g_values[i] == 666);
    }
}
________________________________________________________________




As far as a total order over operations, I am not sure I understand what you 
mean. I expect the following to be out of order:
________________________________________________________________
std::atomic<int> g_value1;
std::atomic<int> g_value2;


void foo()
{
    g_value1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    g_value2.load(std::memory_order_relaxed);
}
________________________________________________________________




but not this:
________________________________________________________________
std::atomic<int> g_value1;
std::atomic<int> g_value2;


void foo()
{
    g_value1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    g_value2.load(std::memory_order_relaxed);
}
________________________________________________________________ 

Chris
11/25/2009 7:25:18 PM
On 25 Nov, 22:25, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > Stand-alone fences are more powerful in some sense,
> > because they do not bind to a particular variable. And at the same
> > time stand-alone fences (except seq_cst) are weaker, because they do
> > not establish total order over operations (as store/RMW on a
> > particular atomic).
>
> I personally like stand-alone fences because IMHO they are more flexible. I
> can put them exactly where I need them; e.g.:
> ________________________________________________________________
> #define N 10
>
> static int g_values[N] = { NULL };
> static std::atomic<bool> g_flags[N] = { NULL };
>
> void thread_1()
> {
>     for (size_t i = 0; i < N; ++i)
>     {
>         g_values[i] = 666;
>     }
>
>     std::atomic_thread_fence(std::memory_order_release);
>
>     for (size_t i = 0; i < N; ++i)
>     {
>         g_flags[i].store(true, std::memory_order_relaxed);
>     }
> }
>
> void other_threads()
> {
> retry:
>     for (size_t i = 0; i < N; ++i)
>     {
>         if (! g_flags[i].load(std::memory_order_relaxed))
>         {
>             spin();
>             goto retry;
>         }
>     }
>
>     std::atomic_thread_fence(std::memory_order_acquire);
>
>     for (size_t i = 0; i < N; ++i)
>     {
>         assert(g_values[i] == 666);
>     }
> }
> ________________________________________________________________


AFAIR, Paul McKenney presented something along the lines of your example
as a motivating example for stand-alone fences.


> As far as a total order over operations, I am not sure I understand what you
> mean. I expect the following to be out of order:
> ________________________________________________________________
> std::atomic<int> g_value1;
> std::atomic<int> g_value2;
>
> void foo()
> {
>     g_value1.store(1, std::memory_order_relaxed);
>     std::atomic_thread_fence(std::memory_order_release);
>     g_value2.load(std::memory_order_relaxed);
> }
> ________________________________________________________________

Regarding total order, I mean that your above code won't work (even if
release is replaced with acq_rel).
However, if we replace the stand-alone acq_rel fence with an atomic RMW
with acq_rel ordering:

     g_value1.store(1, std::memory_order_relaxed);
     g_sync.exchange(0, std::memory_order_acq_rel);
     g_value2.load(std::memory_order_relaxed);

it must work, because an RMW on a particular variable establishes a total
order (over all RMWs on that variable), while a stand-alone fence does not.

--
Dmitriy V'jukov
Dmitriy
11/28/2009 9:03:16 AM
"Dmitriy Vyukov" <dvyukov@gmail.com> wrote in message 
news:f9e5c87c-5280-4a32-886e-b2ac9d30dd3d@a32g2000yqm.googlegroups.com...
On 25 Nov, 22:25, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>
> > > Stand-alone fences are more powerful in some sense,
> > > because they do not bind to a particular variable. And at the same
> > > time stand-alone fences (except seq_cst) are weaker, because they do
> > > not establish total order over operations (as store/RMW on a
> > > particular atomic).
[...]
> > As far as a total order over operations, I am not sure I understand what 
> > you
> > mean. I expect the following to be out of order:
> > ________________________________________________________________
> > std::atomic<int> g_value1;
> > std::atomic<int> g_value2;
> >
> > void foo()
> > {
> >     g_value1.store(1, std::memory_order_relaxed);
> >     std::atomic_thread_fence(std::memory_order_release);
> >     g_value2.load(std::memory_order_relaxed);
> > }
> >
> > ________________________________________________________________

> Regarding total order I mean that your above code won't work (even if
> release is replaced with acq_rel).

Right. When I wrote:

"I expect the following to be out of order"

I meant that I expect the load from `g_value2' to be able to rise above the 
store to `g_value1'.




> However if we replace stand-alone acq_rel fence with atomic RMW with
> acq_rel ordering:

>      g_value1.store(1, std::memory_order_relaxed);
>      g_sync.exchange(0, std::memory_order_acq_rel);
>      g_value2.load(std::memory_order_relaxed);

> It must work. Because RMW on a particular variable establishes total
> order (over all RMW on the variable), while stand-alone fence not.

I would also expect this to work:
___________________________________________________
    g_value1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    g_value2.load(std::memory_order_relaxed);
___________________________________________________


Chris
11/28/2009 7:59:56 PM