Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

An unusual volatile required example.

Posted by peeterjoot on June 21, 2011

For code that is regularly subjected to invasive strip searches by the performance police it is not uncommon for the developers who own that code to try to avoid some of the use of mutual exclusion methods. One pattern that is routinely re-invented is taking a peek at something when the mutex is not held, then acquiring the mutex “for real” based on the state of the something. Example:

unsigned v = pSharedData->myVariable ;

if ( v )
{
   pthread_mutex_lock( &pSharedData->m ) ;

   // myVariable always updated while the mutex is locked.  Get the current value, not
   // necessarily the same as what was read above. 
   v = pSharedData->myVariable ;

   // Now do something with it:
   foo( v ) ;

   pthread_mutex_unlock( &pSharedData->m ) ;
}

This code will often not work. One hidden requirement is that myVariable must be volatile. If it is not, the compiler is free to reuse the value that was first read. If there is no “peek” at the variable outside of the lock hold scope, things work out, because the mutex acquisition has unknown side effects and the compiler won’t read from that memory until after the function call has executed. Note that if you implement your own mutual exclusion code (which can be done inline) you’ll need something to force code ordering, in addition to dealing with the multiprocess or multithread mutual exclusion mechanism itself.

I saw an example of volatile being required that is similar to this usual one, but pretty much completely inverted. Here’s what the code looked like; it is executed in a context where the mutex that protects the write accesses to myVariable has not yet been locked:

unsigned v = pSharedMem->myVariable ;
if ( v != 0 )
{
   do {
      v-- ;

      if ( pSharedMem->someArray[ v ] == pp )
      {
         foo() ;

         break ;
      }
   } while ( v ) ;
}

Without myVariable declared volatile, one compiler, even with optimization enabled, generated code for this fragment as if two loads had been done, as in the following:

if ( pSharedMem->myVariable )
{
   unsigned v = pSharedMem->myVariable ;

   do {
      v-- ;

      if ( pSharedMem->someArray[ v ] == pp )
      {
         foo() ;

         break ;
      }
   } while ( v ) ;
}

The lesson, as usual, was that there’s a big maintenance cost to attempting to avoid normal mutual exclusion mechanisms. This bit of mutex avoidance code had been carefully thought through, but once tried in production it yielded a nicely hidden, platform specific, compiler optimization level specific, run time bug. It was a surprising side effect that took the developers who owned the code a long time to track down.

This particular fragment of performance critical code was fixed by adding a volatile attribute to the myVariable declaration. Doing so means that the compiler is not free to re-load the variable if it has been read into a local and used only from there. It is curious that the compiler would choose to do so at all. There was no obvious reason, like a stack spill around a function call, for the compiler to make this code generation choice, but without the volatile it was permitted to do so.


11 Responses to “An unusual volatile required example.”

  1. Fascinating post, Peeter!

  2. Eran Borovik said

Regarding your first mutex example. I don’t see how the compiler can use the value read before the mutex was acquired, even if myVariable isn’t volatile. AFAIK, the mutex_acquire primitive inserts platform specific compiler/cpu directives to prevent this kind of optimization. It at least must enforce some kind of acquire semantics.

    • peeterjoot said

What you aren’t considering is that the mutex APIs aren’t part of the language spec, and that it takes a considerable amount of coercion to get the compiler to play the game that the developer wants.

If you are interested, this is why the next generation C++ spec has had so much trouble with some of the aspects of intrinsic thread support, and adding thread support required not only C++ runtime enhancements (mutex, atomic, …) but also designing a memory model (as Java has, for example), specifying exactly how loads and stores are ordered with respect to each other around atomics and mutual exclusion code. Google “C++0x memory model” for some of the research papers that led to the current in-progress spec.

To be more specific, considering this first example, consider a platform specific implementation of a successful mutex acquire. Here’s one for PowerPC using xlC intrinsics:

unsigned v = pSharedData->myVariable ;
      
      if ( v )
      {
         // simple mutex acquire implementation
         for ( ; ; )
         {
            unsigned m = __lwarx( &pSharedData->m ) ;
            if ( 0 == m )
            {
               m = 1 ;
            int rc = __stwcx( &pSharedData->m, m ) ;
               if ( 1 == rc )
               {
                  __isync() ;
                  break ;
               }
            }
            usleep( 10000 ) ;
         }
      
         // myVariable always updated while the mutex is locked.  Get the current value, not
         // necessarily the same as what was read above.
         v = pSharedData->myVariable ;
      
         // Now do something with it:
         foo( v ) ;
      
      //   pthread_mutex_unlock( &pSharedData->m ) ;
      }
      

There are a few special “platform specific compiler/cpu directives”, as you call them. One of them is the __lwarx that loads the mutex. This builtin is handled specially by the compiler and won’t, for example, be moved to before the preceding load of myVariable. Next, after the mutex is acquired (rc == 1 from the __stwcx), we require another special intrinsic. This one is the isync instruction, and it provides the desired acquire semantics. That means that subsequent instructions will not be executed before the isync is complete.

The problem is that the compiler doesn’t know the details of what happens in the mutex API. If the LinuxPPC or AIX mutex_lock function was written like this (they will both necessarily do something of that nature), how does the compiler know that its previous load of myVariable should be discarded? It doesn’t see the isync instruction, which is buried in the C runtime library function. If you are going to try to cheat, and not use the mutex sometimes, then the compiler requires some sort of directive to tell it that you are doing so. That’s why volatile is required in this case.

  3. Eran Borovik said

    Peeter, thanks for the detailed reply. I believe I can understand and agree with most of your points.
Nevertheless, I’ve seen (and used) the pattern you’ve mentioned so many times that I want a deeper understanding.
According to your explanation, the mutex already makes sure ordering is OK by doing isync. However, this is hidden from the compiler, and therefore the compiler is still free to do whatever it wants by using the variable read before mutex_acquire was called. I argue that given that the mutex_acquire primitive is opaque to the compiler, it cannot do such an optimization, because it cannot know whether myVariable is changed inside the mutex_acquire function. The compiler should be allowed to do such an optimization only if it knows the contents of mutex_acquire, but if it knows them, it will see the isync and know that it is not OK to reuse myVariable.
Besides the point I’ve mentioned (which I am not 100% sure of, as I am not a compiler guy), I believe declaring myVariable as volatile is a bit of overkill and can hurt performance. Assuming that myVariable is read outside the mutex only this one time, one should be able to do a cast through a volatile-qualified pointer to tell the compiler that only this access is volatile.

    • peeterjoot said

      I’m not sure that cast would work under all optimization levels (especially if strict aliasing is enabled which is the case on a number of compilers these days).

There’s one aspect of this discussion that I’m unsure of. Our product uses its own mutex implementation, and there is a first inlined pass of atomics (plus isync or similar instructions) in the acquire code that is _not_ opaque. So we definitely have a volatile requirement here that is perhaps not true if a strict function call API like pthread_mutex_lock was called. My suspicion would be that volatile would still be required, but I am no longer sure.

      • Eran Borovik said

        Thanks again. I hope I am not nagging too much (This topic is just something that interests me personally).
Seems to me that if your mutex implementation isn’t opaque, then the compiler must respect the isync that it is now seeing (unless there is a bug in the compiler…).
So in both cases, opaque and non-opaque (unless my logic is off completely), volatile shouldn’t be needed.
Where did you see that happen? My guess is on Power, where I’ve seen such cases myself: we tried pthread_spin_locks to “boost” performance over the traditional mutex. We immediately reverted back to the mutex, as we’d seen things similar to what you are describing, and everything came back to normal.

    • peeterjoot said

      Hi Eran,

      But with a non-opaque implementation, you are assuming that the compiler knows about all the side effects of the isync operation. Most compilers are more primitive in how they deal with inline assembly (even if they provide the intrinsics to do so). For example, in gcc on linuxppc you use:

__asm__ __volatile__( "isync" : : : "memory" ) ;
      

The compiler has nothing but the limited clobber constraint that you provide to work with to decide how to generate the surrounding loads and stores. While I’m sure the memory constraint results in a compile time barrier that prevents loads and stores from being reordered around the asm, without a volatile on the variable, if there is an access before and after the mutex is acquired, I’d not be the least bit confident that there will be a re-load of the variable.


      ps. Did you generate an assembly listing to compare the code in question when you tried the pthread spin lock?

      • Eran Borovik said

        Thanks, I agree that the compiler has limited knowledge about the assembly side-effects. Nevertheless, there are compiler primitives to ensure the compiler doesn’t do nasty things. The gcc inlines you’ve shown are supposed to make a “compiler barrier”, and as such to eliminate such optimization. I’ve seen them used everywhere including the Linux kernel itself, so if they don’t work many people may be in trouble.
Anyways, your main point, that one should be mega-careful about accessing variables outside a lock, is 100% valid. Any such optimization attempt should be carefully considered and reviewed by peers.

        Regarding the spin-locks, I must confess that I was lazy, and we had pressing time-tables. So, unfortunately, like in many such situations, I just used what worked and continued on to more pressing matters…

      • peeterjoot said

Still an interesting question. I’ve posed it on stack overflow and am curious what others will say:

        http://stackoverflow.com/questions/6467240/volatile-vs-compiler-barrier-with-gcc-inline-assembly

  4. Jan Hudec said

    The code is wrong. Neither i386 nor x86_64 would, but some architectures will not hesitate to give you neither the old nor the new value, but some random number if you happen to race with the write and don’t use appropriate synchronized atomic operation. On platforms, that will always give you correct value, the volatile is totally useless. The compiler may not rely on any non-local variables (and any local variables to which it ever created a pointer) to remain unchanged across function call, so it will refetch pSharedData->myVariable after the lock in any case. The compiler may not rely on any memory content, local or not, after asm statement with “memory” clobber, because that’s how it’s defined, so it will refetch pSharedData->myVariable after the lock if it is implemented using such asm statement too.

    Volatile will also hurt performance significantly, because all accesses to the member under the lock will go through to memory where they wouldn’t have to.

    • Robin said

      Jan, I believe that you have some interesting points to make, but it is difficult to understand what they are because of the way that you have expressed yourself. Help us to understand by clarifying. When you say “the code is wrong”, to which code fragment are you referring? The first one that Peeter gave in his blog post? Something from subsequent comments?

      Your following sentences are especially difficult to comprehend, even for an expert in the subject matter.

      Your final statement may be true, but don’t forget that the volatile value has been stored in a local variable, so there are no further accesses of the volatile value under the lock.
