
Use of lwsync instead of isync as a mutex acquire memory barrier

Posted by peeterjoot on July 23, 2010

In a previous post on an AIX dbx quirk with the stepi command, the following mutex acquisition fragment was used as an illustration:

[10] stopped in main at line 187 in file "ossPowerPcVACPP.h" ($t1)
  187      return __fetch_and_or( (volatile unsigned int *)pAtomic, value ) ;
(dbx) listi
0x100000948 (main+0x108) 7ca01828       lwarx   r5,r0,r3
0x10000094c (main+0x10c) 7ca40378          or   r4,r5,r0
0x100000950 (main+0x110) 7c80192d      stwcx.   r4,r0,r3
0x100000954 (main+0x114) 40a2fff4         bne   0x100000948 (main+0x108)
0x100000958 (main+0x118) 74a01000      andis.   r0,r5,0x1000
0x10000095c (main+0x11c) 40820034         bne   0x100000990 (main+0x150)
0x100000960 (main+0x120) 4c00012c       isync
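
For reference, here is roughly what that sequence does (the annotations and labels are mine, not dbx output). The first four instructions are the __fetch_and_or retry loop, and the last three are the caller testing the fetched value:

retry:
   lwarx   r5,r0,r3      # load *pAtomic into r5 and reserve the location
   or      r4,r5,r0      # r4 = fetched value OR'd with the new bits (in r0)
   stwcx.  r4,r0,r3      # store r4 back, but only if the reservation held
   bne     retry         # reservation lost to another writer: retry
   andis.  r0,r5,0x1000  # test the mutex-held bit in the fetched value
   bne     contended     # bit was already set: mutex held, take the slow path
   isync                 # acquire barrier, reached only on the success path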

The isync marks the end of the successful mutex-acquisition codepath, which will typically be followed by some number of loads and stores, as in the following fragment that uses the pthreads library:

pthread_mutex_lock( &m ) ;

// pSharedMem->v++, written out as an explicit load, modify, store:
int tmp = pSharedMem->v ;
tmp++ ;
pSharedMem->v = tmp ;

pthread_mutex_unlock( &m ) ;

Here we have a load of pSharedMem->v, an increment of some register, and a store back from that register into the same address. What you don’t want is for that load to occur before the pthread_mutex_lock() is done. What gives you this guarantee may be a memory barrier instruction such as the powerpc isync (as can the heavyweight sync instruction). If you were to instruction step through an uncontended pthread_mutex_lock call on a powerpc platform, you would likely see something like the assembly listing above.

Now, a compiler guru told me today that isync is actually stronger than we need to supply the required ordering. This came as a bit of a surprise. A careful read of the lwsync semantics shows that it will (when placed between the memory accesses) order a pair of loads, a pair of stores, or a load followed by a store, but NOT a store followed by a load. The isync, on the other hand, can order the store-load pair. It does this by imposing a total ordering on all instructions that follow it: rather like a wall, no instruction past an isync can be speculatively executed in the pipeline until the isync completes. Consider the following busted examples of mutex protected reads and writes:
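
As a cheat sheet, here is my reading of which access pairs each instruction orders when placed between the two accesses (counting isync in its lock-acquire role, sitting after the conditional branch on the lwarx result):

   pair           lwsync   isync   sync
   LOAD,LOAD       yes      yes     yes
   LOAD,STORE      yes      yes     yes
   STORE,STORE     yes      yes     yes
   STORE,LOAD      no       yes     yes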

volatile unsigned int g_mutex = 0 ;

int g_someGlobalCounter = 0 ;

int incorrect_mutex_read()
{
   while ( 1 == __fetch_and_or( &g_mutex, 1 ) )   // spin until we set the mutex bit
   {
   }

   int v = g_someGlobalCounter ;   // BUSTED: this load may be reordered before the mutex store

   __lwsync() ;                    // order the load above before the release below
   g_mutex = 0 ;

   return v ;
}

void incorrect_mutex_write( int v )
{
   while ( 1 == __fetch_and_or( &g_mutex, 1 ) )   // spin until we set the mutex bit
   {
   }

   g_someGlobalCounter = v ;   // BUSTED: this store may be reordered before the mutex store

   __lwsync() ;                // order the store above before the release below
   g_mutex = 0 ;
}

These are busted because the accesses (a load and a store, respectively) of g_someGlobalCounter may be executed by the hardware before the atomic fetch-and-or operation writes the “acquired” value into the mutex memory location. There is no data dependence that prohibits such a hardware reordering, and it is not enough to have verified that the compiler generated the naive “ordered” instruction sequence.

The end effect of the hardware reordering these instructions would be as if you had the following “mutex protected” code:

pthread_mutex_t g_mutex = ... ; // assume properly initialized

int g_someGlobalCounter = 0 ;

int incorrect_mutex_read()
{
   int v = g_someGlobalCounter ;

   pthread_mutex_lock( &g_mutex ) ;
   pthread_mutex_unlock( &g_mutex ) ;

   return v ;
}

void incorrect_mutex_write( int v )
{
   g_someGlobalCounter = v ;

   pthread_mutex_lock( &g_mutex ) ;
   pthread_mutex_unlock( &g_mutex ) ;
}

Obviously, neither use of the mutex here is of any particular value. The code may even appear to work sometimes, if you have minimal contention or nothing that provides a way to check the correctness of the results.

To make these correct, it was previously my impression that an isync was required, like so:

volatile unsigned int g_mutex = 0 ;

int g_someGlobalCounter = 0 ;

int correct_mutex_read()
{
   while ( 1 == __fetch_and_or( &g_mutex, 1 ) )
   {
   }

   __isync() ; // THIS IS THE CORRECTION.

   int v = g_someGlobalCounter ;

   __lwsync() ;
   g_mutex = 0 ;

   return v ;
}

void correct_mutex_write( int v )
{
   while ( 1 == __fetch_and_or( &g_mutex, 1 ) )
   {
   }

   __isync() ; // THIS IS THE CORRECTION.

   g_someGlobalCounter = v ;   

   __lwsync() ;
   g_mutex = 0 ;
}

(Never actually write a mutex with a tight spin like this and no breather; a slightly more polite sketch follows.)
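
As an aside, a more polite spin might yield the cpu after a burst of failed attempts, something like the following sketch. This is illustration only (the function name and burst size are made up), and a real implementation would eventually fall back to a blocking primitive such as pthread_mutex_lock:

#include <sched.h>

void acquire_with_backoff( volatile unsigned int * pMutex )
{
   int attempts = 0 ;

   while ( 1 == __fetch_and_or( pMutex, 1 ) )
   {
      if ( ++attempts >= 100 )   // arbitrary burst size, for illustration
      {
         sched_yield() ;         // let the holder run so it can release
         attempts = 0 ;
      }
   }

   __isync() ;   // the acquire barrier, as discussed above
}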

Now all of the data accesses that follow the mutex acquisition must necessarily occur after the memory operation that acquired the mutex. Even in the read case, where we have a STORE of 1 to g_mutex followed by a LOAD from g_someGlobalCounter, we have the required ordering. One might not think that lwsync would be able to provide that ordering, since it does not provide a STORE,LOAD barrier, yet apparently that intuition is wrong. How does one reconcile this? Let’s start by observing that an atomic operation like fetch-and-or is a composite operation on powerpc. Let’s break this down, using the nicely expressive xlC compiler builtins. Here’s an equivalent, but more explicit, version of the code:

volatile unsigned int g_mutex = 0 ;

int g_someGlobalCounter = 0 ;

int mutex_read_with_isync()
{
   int oldValue ;
   int bStoreSuccessful ;

   do {
      oldValue = __lwarx( (volatile int *)&g_mutex ) ;
      bStoreSuccessful = __stwcx( (volatile int *)&g_mutex, 1 ) ;
   } while ( !bStoreSuccessful || oldValue ) ;

   __isync() ;

   int v = g_someGlobalCounter ;

   __lwsync() ;
   g_mutex = 0 ;

   return v ;
}

int mutex_read_with_lwsync()
{
   int oldValue ;
   int bStoreSuccessful ;

   do {
      oldValue = __lwarx( (volatile int *)&g_mutex ) ;
      bStoreSuccessful = __stwcx( (volatile int *)&g_mutex, 1 ) ;
   } while ( !bStoreSuccessful || oldValue ) ;

   __lwsync() ;

   int v = g_someGlobalCounter ;

   __lwsync() ;
   g_mutex = 0 ;

   return v ;
}

This isn’t entirely equivalent to the old code, since it does a direct store of the “locked” value (1) rather than an actual fetch-and-or, but it is close enough. Notice now that in the lwsync variation a successful acquire will execute a LOAD(and reserve),STORE,LWSYNC,LOAD sequence of instructions. This appears vulnerable to being reordered by the hardware into LOAD(and reserve),LOAD,STORE,LWSYNC, since lwsync doesn’t enforce a STORE,LOAD ordering.
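
Incidentally, the write side is the easy case. There a successful acquire executes a LOAD(and reserve),STORE,LWSYNC,STORE sequence, and lwsync does order a STORE,STORE pair, so a write variant like the following sketch (my naming, mirroring the read variants above) needs no further subtlety:

void mutex_write_with_lwsync( int newValue )
{
   int oldValue ;
   int bStoreSuccessful ;

   do {
      oldValue = __lwarx( (volatile int *)&g_mutex ) ;
      bStoreSuccessful = __stwcx( (volatile int *)&g_mutex, 1 ) ;
   } while ( !bStoreSuccessful || oldValue ) ;

   __lwsync() ;   // STORE,STORE: the mutex store is ordered before the counter store

   g_someGlobalCounter = newValue ;

   __lwsync() ;
   g_mutex = 0 ;
}

The tricky case is the read variant, and the architecture book addresses it directly.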

Now, in PowerPC Book II, Appendix B.2.1.1 Acquire Lock and Import Shared Storage, we have the following note after a very similar example:

If the shared data structure is in storage that is
neither Write Through Required nor Caching Inhibited,
an lwsync instruction can be used instead of the isync
instruction. If lwsync is used, the load from "data1"
may be performed before the stwcx.. But if the
stwcx. fails, the second branch is taken and the lwarx
is reexecuted. If the stwcx. succeeds, the value
returned by the load from "data1" is valid even if the
load is performed before the stwcx., because the
lwsync ensures that the load is performed after the
instance of the lwarx that created the reservation
used by the successful stwcx..

This raises a few questions. First of all, can we assume that we never have “Write Through Required” nor “Caching Inhibited” memory accesses in our (presumed) userspace code that we are protecting with this mutex? The friendly neighborhood compiler guru believes that is the case, unless one is interacting with really low level stuff like kernel programming and device drivers and so forth.

The gotcha here is that the lwarx and stwcx loads and stores are coupled, so even if the regular load can precede the store that acquires the mutex, this will never be a load that occurs when the mutex could not also be acquired. So if this is reordered in the successful mutex acquisition case, it will really be as if the instructions occurred in a LOAD(and reserve),LWSYNC,LOAD,STORE sequence at runtime. Very tricky.

I’ve not yet tried to implement this change in our code, and am a little bit scared to try it. Will our functional test be sufficient to verify that the weaker barrier is enough? I’m also curious to measure the cost difference between isync and lwsync. Since lwsync degrades to a full sync on pre-power4 hardware, there was probably a time when such a change was not desirable even if it worked, but nowadays I don’t expect we ever run on pre-power4 cpus. Taking some timings of isync, lwsync, and sync would probably be worthwhile to see how they all compare. Is contention required to make the timings of these barrier operations meaningful?
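
As a starting point, here is a minimal sketch of the sort of uncontended timing loop I have in mind, using the xlC barrier builtins. The wrapper functions and loop count are my own scaffolding; the call overhead is common to all three flavors, and this measures only the raw uncontended cost, saying nothing about behaviour under contention:

#include <stdio.h>
#include <sys/time.h>

static void doIsync( void )  { __isync() ; }
static void doLwsync( void ) { __lwsync() ; }
static void doSync( void )   { __sync() ; }

// time n back-to-back barriers, returning microseconds elapsed
static double timeBarrier( void (*barrier)(void), int n )
{
   struct timeval b, e ;

   gettimeofday( &b, 0 ) ;
   for ( int i = 0 ; i < n ; i++ )
   {
      barrier() ;
   }
   gettimeofday( &e, 0 ) ;

   return (e.tv_sec - b.tv_sec) * 1.0e6 + (e.tv_usec - b.tv_usec) ;
}

int main()
{
   int n = 10000000 ;

   printf( "isync:  %g us\n", timeBarrier( doIsync, n ) ) ;
   printf( "lwsync: %g us\n", timeBarrier( doLwsync, n ) ) ;
   printf( "sync:   %g us\n", timeBarrier( doSync, n ) ) ;

   return 0 ;
}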

More questions get raised than answered by starting down this path. But if it works it could make a small difference in a very frequently executed codepath … perhaps even enough to measure and be worthwhile. Time will tell.
