Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

Archive for November, 2009

Two particle center of mass Laplacian change of variables.

Posted by peeterjoot on November 30, 2009

[Click here for a PDF of this post with nicer formatting]

Exercise 15.2 in [1] is to do a center of mass change of variables for the two particle Hamiltonian

\begin{aligned}H = - \frac{\hbar^2}{2 m_1} {\boldsymbol{\nabla}_1}^2- \frac{\hbar^2}{2 m_2} {\boldsymbol{\nabla}_2}^2+ V(\mathbf{r}_1 -\mathbf{r}_2).\end{aligned} \quad\quad\quad(1)

Before trying this, I was surprised that this would result in a diagonal form for the transformed Hamiltonian, so it is well worth doing the problem to see why this is the case. He uses

\begin{aligned}\boldsymbol{\xi} &= \mathbf{r}_1 - \mathbf{r}_2 \\ \boldsymbol{\eta} &= \frac{1}{{M}}( m_1 \mathbf{r}_1 + m_2 \mathbf{r}_2 ).\end{aligned} \quad\quad\quad(2)

Lets use coordinates {x_k}^{(1)} for \mathbf{r}_1, and {x_k}^{(2)} for \mathbf{r}_2. Expanding the first order partial operator for {\partial {}}/{\partial {{x_1}^{(1)}}} by chain rule in terms of \boldsymbol{\eta}, and \boldsymbol{\xi} coordinates we have

\begin{aligned}\frac{\partial {}}{\partial {x_1^{(1)}}}&=\frac{\partial {\eta_k}}{\partial {x_1^{(1)}}} \frac{\partial {}}{\partial {\eta_k}}+\frac{\partial {\xi_k}}{\partial {x_1^{(1)}}} \frac{\partial {}}{\partial {\xi_k}} \\ &=\frac{m_1}{M} \frac{\partial {}}{\partial {\eta_1}}+ \frac{\partial {}}{\partial {\xi_1}}.\end{aligned}

We also have

\begin{aligned}\frac{\partial {}}{\partial {x_1^{(2)}}}&=\frac{\partial {\eta_k}}{\partial {x_1^{(2)}}} \frac{\partial {}}{\partial {\eta_k}}+\frac{\partial {\xi_k}}{\partial {x_1^{(2)}}} \frac{\partial {}}{\partial {\xi_k}} \\ &=\frac{m_2}{M} \frac{\partial {}}{\partial {\eta_1}}- \frac{\partial {}}{\partial {\xi_1}}.\end{aligned}

The second partials for these x coordinates are not a diagonal quadratic second partial operator, but are instead

\begin{aligned}\frac{\partial {}}{\partial {x_1^{(1)}}} \frac{\partial {}}{\partial {x_1^{(1)}}}&=\frac{(m_1)^2}{M^2} \frac{\partial^2}{\partial \eta_1 \partial \eta_1}{}+\frac{\partial^2}{\partial \xi_1 \partial \xi_1}{}+2 \frac{m_1}{M} \frac{\partial^2}{\partial \xi_1 \partial \eta_1}{} \\ \frac{\partial {}}{\partial {x_1^{(2)}}} \frac{\partial {}}{\partial {x_1^{(2)}}}&=\frac{(m_2)^2}{M^2} \frac{\partial^2}{\partial \eta_1 \partial \eta_1}{}+\frac{\partial^2}{\partial \xi_1 \partial \xi_1}{}-2 \frac{m_2}{M} \frac{\partial^2}{\partial \xi_1 \partial \eta_1}{}.\end{aligned} \quad\quad\quad(4)

The desired result follows directly, since the mixed partial terms conveniently cancel when we sum (1/m_1) {\partial {}}/{\partial {x_1^{(1)}}} {\partial {}}/{\partial {x_1^{(1)}}} +(1/m_2) {\partial {}}/{\partial {x_1^{(2)}}} {\partial {}}/{\partial {x_1^{(2)}}}. This leaves us with

\begin{aligned}H = \frac{-\hbar^2}{2} \sum_{k=1}^3 \left( \frac{1}{{M}} \frac{\partial^2}{\partial \eta_k \partial \eta_k}{}+ \left( \frac{1}{{m_1}} + \frac{1}{{m_2}} \right) \frac{\partial^2}{\partial \xi_k \partial \xi_k}{}\right)+ V(\boldsymbol{\xi}),\end{aligned} \quad\quad\quad(6)

With the shorthand of the text

\begin{aligned}\boldsymbol{\nabla}_{\boldsymbol{\eta}} &= \sum_k \frac{\partial^2}{\partial \eta_k \partial \eta_k}{} \\ \boldsymbol{\nabla}_{\boldsymbol{\xi}} &= \sum_k \frac{\partial^2}{\partial \xi_k \partial \xi_k}{},\end{aligned} \quad\quad\quad(7)

this is the result to be proven.


[1] D. Bohm. Quantum Theory. Courier Dover Publications, 1989.

Posted in Math and Physics Learning. | Tagged: , , , | Leave a Comment »

An attempt to illustrate differences between memory ordering and atomic access.

Posted by peeterjoot on November 29, 2009

[Click here for a PDF alternative to this post]


Discussion of problems and solutions associated with correct atomic and memory barrier usage has been a recurrent theme of many cubicle chats within DB2 development, and I will attempt to describe here part of what I’ve learned in some of those chats. As a software developer, not a hardware implementer, I cannot pretend to understand more than a fragment of this topic, but it is enough to be dangerous and perhaps help others be dangerous too.


Use of atomic instructions for manipulating “word” (or smaller) size quantities, avoiding the use of mutual exclusion library functions is becoming increasingly easy for developers. This hasn’t made the using atomic instructions or library methods correctly any less difficult or error prone.

Information on how to use these written for an average Joe developer is hard to come by. What can be easily found is good detailed low level information targeted at or written by operating system kernel extension writers or compiler developers [1], or language implementers and designers [2] [3] [4] [5] [6] [7] [8].

The driving reason for most atomic usage is lock avoidance, and a desire for performance drives most use of atomic methods. When atomic operations are used to replace lock protected updates, this can change the semantics of the program in subtle ways that are difficult to understand or get right. In particular, additional use of memory barrier instructions or library methods may be required to correctly replace lock protected updates.

It is assumed here that most use of atomics is done for performance based lock (mutex) avoidance. An attempt to cheat, writing atomic based code that avoids a lock, unfortunately requires some understanding of what that lock provided in the first place. Here the meaning of, requirements for, and implementation of, a lock are discussed. The implementation of an atomic operation is then discussed, and we then move on to an enumeration of the types of memory barrier instructions and their effects. Finally, unless the assembly samples provided have not scared aways the reader, we cover a simple non-mutex example of barrier usage. This example uses an atomic to signal completion of a previous update and requires memory barriers for correctness on weakly ordered systems.

Why do I need locking or atomics?

Why would you want to use an atomic? If you are reading because you want to know a bit about memory barrier usage, you probably already know. However, for illustrative purposes, consider the very simplest example of non-thread-safe updates of an integer in shared memory.

   pShared->x++ ;

Any such operation requires three steps. Read the value, change the value, update the value in memory. On some platforms this is explicit and obvious since the value will have to be read into a register before making the update. Here’s a PowerPC example that illustrates this

    lwz      r4=(sharedMem)(r3,0)    ; r3 == pShared ; (r3,0) == *pShared
    addi     r0=r4,1
    stw      (sharedMem)(r3,0)=r0

The value in the memory address contained in the register r3 (offset by zero bytes) is loaded into a register (r4). One is added to this (into r0), and the value is stored back into the original memory location. Running on a CISC system this may all appear as one instruction, but it still requires the same number of steps internally.

Suppose you had a threaded application where this counter variable is updated routinely by many threads as they complete some routine and commonplace action. What can now happen, interleaved (with time downwards) across threads may now look like this

T0: pShared->x = 0

T1: R1 = pShared->x (R1 = 0)
    R1 = R1 + 1

T2: R2 = pShared->x
    R2 = R2 + 1
    pShared->x = R2

T3: R3 = pShared->x (assume T3 "sees" T2's update, a value of 1)
    R3 = R3 + 1
    pShared->x = R3

T4: R4 = pShared->x (assume T4 "sees" T3's update, a value of 2)
    R4 = R4 + 1
    pShared->x = R4

    pShared->x = R1 (T2 and T3's updates are now clobbered)

We had four updates to the counter after the initialization, but the counter value only goes up by one. Uncontrolled updates like this can oscillate in peculiar fashions, and something that may be expected to be monotonically increase perhaps only trends upwards on average. If the counter is not heavily contended you may even fluke out, running for a long time before getting unlucky with an inconsistent update of some sort.

This example leads us nicely to requirements for locking or atomic methods to protect such updates.

Use of locking to make the read, change update cycle safe.

Mutual exclusion mechanisms go by many names, locks, mutexes, critical sections, and latches to name a few. System libraries usually provide implementations, a common example of which are the Unix pthread_mutex methods. The unsafe increment above can be made thread safe, protecting it with such a guard. With error handling omitted for clarity, a modified example could be as follows

   pthread_mutex_lock( &m ) ;

   pShared->x++ ;

   pthread_mutex_unlock( &m ) ;

If all threads only ever updates this value we can use a similar bit of code to read and do something with the read value

   pthread_mutex_lock( &m ) ;

   v = pShared->x ;

   pthread_mutex_unlock( &m ) ;

   doSomethingWithIt( v ) ;

No two sequential samplings of the shared variable x would see the value go down was possible in the uncontrolled update case.

Now, you may ask why a mutex is required for the read. If that is just a single operation, why would it matter? That’s a good question, but not easy to answer portably. If your variable is appropriately aligned and of an appropriate size (practically speaking this means less than or equal to the native register size) and you aren’t compiling with funky compiler options that turn your accesses into byte accesses (such options do exist), then you may be able to do just that. However, if that variable, for example, is a 64-bit integer and you are running on a 32-bit platform, then this read may take two instructions and you have a risk of reading the two halves at different points in time. Similarly, suppose you were doing a 32-bit integer read, but that 32-bit integer was aligned improperly on a 16-bit boundary (on a platform that allows unaligned access), then your apparent single read may internally require multiple memory accesses (perhaps on different cache lines). This could cause the same sort of split read scenario.

You probably also need to declare your variable volatile and also have to be prepared to deal with a few other subtleties if lock avoidance on read is going to be attempted. By the end of these notes some of those subtleties will have been touched on.

To make the story short, it should be assumed that if you want portable correct results you need to take and release the mutex for both read and write.

What does correctness cost me?

Now, having corrected the increment code, what does that cost us. Timing the following very simple single threaded program with an without the mutex code


int x = 0 ;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER ;

int inc() ;

int main()
   for ( int i = 0 ; i < 1000000000 ; i++ )
      inc() ;

int inc()
#if defined USE_THE_MUTEX
   pthread_mutex_lock( &m ) ;

   x++ ;

   int v = x ;

#if defined USE_THE_MUTEX
   pthread_mutex_unlock( &m ) ;

   return x ;

I find that this billion sets of call a function, increment a variable and return it without the mutex takes about 2.5 seconds. With the mutex calls enabled, the total elapsed time required goes up to 8.5 seconds. It costs over three times the time to do it with a mutex, and that’s without any contention on the mutex whatsoever! With threads all fighting over the mutex and blocking on kernel resources when they cannot get it, this is as good as it gets. Correctness doesn’t come cheaply.

How does the lock work?

Any mutex implementation requires at least one platform specific method of performing a read, update, change cycle as if it is done without interruption. Your hardware will likely provide a Compare and swap or Load Link/Store Conditional (which can be used to construct higher level atomics like compare and swap). Implementing a mutex is intrinsically unportable territory. Older intel cpus did not provide a compare and swap, but one did have a bus locked exchange available (LOCK XCHG).

HP’s PARSIC, also easily argued to not be a modern architecture, only provides an atomic load and clear word instruction, also requiring 16 BYTE a aligned word. What can be assumed is that any multiple cpu machine provides some sort of low level atomic instruction suitable for building a mutual exclusion interface. If one were to imagine what a pthread_mutex_lock protected increment expands to, it would likely have the following form

   T valueRead ;

      // instructions (like PAUSE()) for hyper threaded cpus.

      // code for wasting a bit of time hoping to get the lock
      //   (when not running on a uniprocessor).

      // sleep or block using an OS primitive

   valueRead = pShared->myLockProtectedData ;
   pShared->myLockProtectedData = valueRead + 1 ;


Even before talking about the hardware, one has to assume that there are mechanisms available to the library writer that prevent the generated code from being reordered. Any instructions containing accesses of myLockProtectedData must NOT occur BEFORE the ACQUISION instructions, and must NOT occur AFTER the RELEASE instructions. If the compiler were to generate code that had the following effect

   T valueRead = pShared->myLockProtectedData ;
   pShared->myLockProtectedData++ ;

   pthread_mutex_lock( &m ) ;
   pthread_mutex_unlock( &m ) ;

Things would obviously be busted. In the case of the pthread functions we have non-inline method with unknown side effects requiring a call to an external shared library. This appears to be enough that the lock protected data does not have to be declared volatile and can be used while the mutex is held as if in a serial programming context. Understanding the murky language rules that give us the desired thread safe behavior is difficult (for an average Joe programmer like me). I came to the conclusion this is not a completely well defined area, motivating a lot of the recent C++ standards work into memory consistency and threaded behavior.

Presuming one assumes that the compiler is laying down the instructions for this code in program order (or something sufficiently close), or that there are mechanism available to ensure this, is this enough to guarantee that we have correct program behavior?

It may come as a rude surprise that, depending on the instructions used to acquire and release the mutex, having the all the instructions scheduled in program order is NOT actually sufficient. Not all hardware executes instructions in the order the compiler specifies. We also need mechanisms to prevent the hardware from starting a memory access too early or letting a memory access complete too late.

This requirement for memory ordering instructions is really the whole point of this discussion. We must have additional mechanism on top of the raw atomic instructions that only ensure read,change,update accesses on an isolated piece of memory (the lock word) can be made as if uninterpretable.

For illustrative purposes, suppose that one was using the gcc Atomic Builtins to attempt to implement a mutex. One (generally) wrong way to use these would be

   // the mutex data
   volatile int m = 0 ;
   #define HELD_BIT 1
   #define WAITER_BIT 2

   // Get the mutex.
   while ( true )
      int prev = __sync_fetch_and_or( &m, HELD_BIT ) ;

      if ( 0 == ( prev & HELD_BIT ) )
         break ;
      // sleep and other stuff, possibly blocking and setting WAITER_BIT.

   // Got the mutex, access the protected data.
   pShared->myLockProtectedData++ ;

   // Release the mutex.
   int preLockState = __sync_fetch_and_and( &m, ~HELD_BIT ) ;
   assert( preLockState & HELD_BIT ) ;
   if ( preLockState & WAITER_BIT )
      WakeUpAnyWaiters() ;

Reading the GCC docs one sees that this has desired compiler behaviour for instruction scheduling. Namely, “these builtins are considered a full barrier. That is, no memory operand will be moved across the operation, either forward or backward”. The compiler will not move instructions that access memory and move them into or out of the critical section bounded by the pair of atomic instructions.

Will the hardware retain that ordering? NO, not necessarily.

These compiler builtins are partially patterned after ones provided by intel, and intel designed them to match functionality provided by their ia64 architecture, a weakly ordered instruction set. While ia64 was effectively killed by AMD64 and Windows (now living on only in the guise of HP’s IPF systems), there are other weakly ordered instruction sets predating ia64. Some PowerPC chips that you will find on big AIX systems, or in older macs, or in your childrens ps3 will be weakly ordered. The Sparc architecture allows for weak ordering, but the chips you find in sun machines implement their TSO (total store order) model which is mostly ordered.

This allowance for unordered memory is why, way down at the end of the GCC page, one finds a couple special _lock methods. The point of these is to ensure that the hardware does not reorder the relative order of memory accesses of the lock word and the lock data, even when the compiler lays down the code in “program order”.

We can do a mutex implementation less wrongly using a memory barrier for release as in the following fragment.

   // the mutex data
   volatile int m = 0 ;
   #define HELD_BIT 1

   // Get the mutex.
   while ( true )
      int prev = __sync_fetch_and_or( &m, HELD_BIT ) ;

      if ( 0 == ( prev & HELD_BIT ) )
         break ;
      // sleep and other stuff if you didn't get it

   // Got the mutex, access the protected data.
   pShared->myLockProtectedData++ ;

   __sync_syncronize() ; 

   m = 0 ;

Observe that the use of the word barrier is overloaded on the GCC page, with some usage associated with compiler instruction scheduling, and other usage referring to memory ordering. Their syncronize function appears to be both. Looking at the generated assembly for a weakly ordered system would verify this interpretation, and one would likely find an lwsync or sync instruction on PowerPC for example. This modification of the locking code ensures the compiler lays down the code with the clearing of the lock word after the protected data access, and the hardware memory barrier instruction should ensure that reads or writes to the protected data are complete before the store to the lock word is allowed to occur. Depending on the type of instruction that is emitted for the synchronize builtin, this may not actually prevent loads and stores that occur after the lock release from being started before the lock is released. Generally if that must be prevented a full barrier is required, and it would not surprise me if most implementations of library methods like pthread_mutex_unlock do not prevent memory accesses that occur after the lock release from being initiated before the lock release occurs. This sort of subtlety is not likely to be found in documentation for the mutex library functions.

With the sample code above we are still left with the possibility that the hardware will execute the memory access instructions for the lock protected data before the lock is acquired, making the lock useless. We can fix this with the _lock test and set and release builtins, and also remove the likely overzealous full memory barrier that synchronize provides (allowing post lock release memory access into the critical section). Sample code for this would may look like

   // the mutex data
   volatile int m = 0 ;
   #define HELD_VALUE 1

   // Get the mutex.
   while ( true )
      int wasHeld = __sync_lock_test_and_set( &m, HELD_VALUE ) ;

      if ( !wasHeld )
         break ;
      // sleep and other stuff if you didn't get it

   // Got the mutex, access the protected data.
   pShared->myLockProtectedData++ ;

   __sync_lock_release( &m ) ; 

This now implements functional mutual exclusion code, even on weakly ordered systems. The compiler has provided methods that ensure it does not move the lock protected data accesses out of the critical section, we are using atomic instructions that guarantee other threads cannot also be modifying the data (presuming they all use the same mutex), and also have instructions that ensure the hardware does not inappropriately reorder the memory accesses. Inappropriate reordering means that accesses to the lock protected data remain after the lock acquisition and before the lock release. It may not mean that memory accesses from before the lock acquisition are done by the time the lock is acquired nor that memory accesses that occur after the lock is released occur after the lock is released (the compiler says it does that, but the hardware may be a sneaky bastard behind your back).

So, that’s a mutex. We can use something like this to avoid some of the excessive cost observed when using the pthread library mutex functions, and would expect that the uncontended cost of the back to back increment code could be lessened.

Presuming you do not mind the unportability of such code, and manage to get it right, and want to live with maintaining your own mutex implementation, if you do a poor job dealing with contention, you may very well get worse performance than the system provided methods in the contended codepath.

You really really have to want the performance to go down the path of implementing your own mutual exclusion code. DB2 is such a product. DB2 is also a product where first born sacrifices in the name of performance are not considered abhorrent. In our defense our mutex implementation predates the availability of widely available library methods. We had SMP exploitation before pthreads existed using mutiple processes and shared memory. Our own implementation also has the advantage of providing framework to build problem determination functionality that cannot be found in generic library implementations. Performance driven coding is not the only reason we do it ourself even in this more modern age where library methods do exist.

What is an atomic, and how does an atomic method work?

At the core of every atomic library method or compiler intrinsic is one or more special purpose instructions. In many cases these are the same types of instructions that can be used to implement mutual exclusion code (when properly augmented by memory barrier instructions).

If you don’t have an instruction for exactly what you want, it can often be built from some other lower level operation. Suppose for example you want to safely do the contended increment without using a lock, but only have a compare and swap. Something like the following can be used

int fetch_and_add( volatile int *ptr, int value )
   int v = *ptr ;
   while ( true )
      int prev = __sync_val_compare_and_swap( ptr, v, v + value ) ;
      if ( prev == v )
         break ;

      v = prev ;

   return prev ;

Generally, one can assume at least a compare and swap operation, but this may also be simulated, built up out of more direct and primitive operations. On PowerPC for example, this is done with the very flexible load-linked/store-condition instructions (load with reservation and store conditional in PowerPC lingo). A fetch and add can be built with a small loop

   lwarx     r0,0,r3       ; load word and acquire a registration
   add       r0,r0,r4      ; add in the desired amount to the possibly current value
   stwcx.    r0,0,r3       ; try to store it
   bne-      fadd32_spin   ; if loaded value old or somebody beat us repeat.
   sub       r3,r0,r4      ; fake a return using normal function return register r3

Building a portable dependence on a load-linked/store-condition primitive can be tricky, but possible. In DB2’s reader-writer mutex implementation, a data structure employing some nifty lock free algorithms ([9] [10] [11] [12] [13] [14]) we use this load-linked/store-conditional code to implement a slick compare-against-bitmask-and-increment-or-set-bit atomic operation. We can do the same thing with compare and swap, but the generated assembly code is particular pretty on AIX, where PowerPC gives us this more powerful primitive.

A historical reflection aside.

Even only a couple of years ago, it was much harder to exploit atomic instructions. Part of the reason is that we have tossed or stopped caring about older systems. Nobody cares anymore about older ia32 systems that did not have cmpxchg. Pre-PowerPC AIX implementations that did not have lwarx and stwcx. instructions are now dead and buried in landfill somewhere. PARISC systems that did not provide any sort of reasonable atomic instructions are now being replaced by HP IPF systems. We can now generally assume that some basic atomic building blocks are available and no longer have to tiptoe around avoiding breaking the old guys.

Another reason that makes atomic usage easier these days is the wide availability of compiler builtins and library infrastructure, rendering this previous area of horrendous unportability accessible. Use of the GCC builtins was illustrated above, showing some incorrect and correct lock implementations. On the same page can be found instructions for compare and swap, atomic read and increment, and others. Similar builtins are available from many other compilers. The boost library implementation of atomic types are a good example. The next revision of the C++ standard will likely include boost-like methods. It appears that C# implements Interchange methods very much like the msdev intrinsics, so this is even becoming a cross language portability area in a limited fashion.

It is fun to illustrate just how murky the portability waters were before the prevalance of compiler intrinsics. Atomic exploitation generally required writing assembly code, and that is never particularly fun. Here is a sample code showing how one would code an “inline assembly” language implementation of fetch and add using the IBM compiler targeting PowerPC

int PowerPcFetchAndAdd32Internal ( volatile int * pAtomic, int value ) ;

#pragma mc_func PowerPcFetchAndAdd32Internal { "7c001828" "7c002214" \
   "7c00192d" "40a2fff4" "7c640050" }

#pragma reg_killed_by PowerPcFetchAndAdd32Internal cr0, gr0, gr3

What we have here is literally a mechanism to instruct the compiler to shove the raw instructions, specified by hexadecimal opcodes, directly into the instruction stream, with the writer of the inline assembly routine providing information about all the registers clobbered and side effects of the code. This inline assembly code is actually generated from the listing in PowerPC assembly fragment above, which was carefully constructed to have the same structure as a real function call on AIX (input params in r3,r4, and output in r3)

Before the GCC intrinsics were available, one could use the powerful GCC inline assembly syntax to instruct the compiler what to shove into the code. Here’s an example for fetch and add

#define IA32_XADD32(pAtomic, addValue, oldValue)                    \
__asm__ __volatile__( "lock; xaddl 
 /* outputs */     : "=r"(oldValue), /* 
                     "+m" (*pAtomic) /* 
 /* inputs */      : "0" (addValue)  /* 
 /* clobbers */    : "cc"            /* condition registers (ZF) */ \

Looks kind of like ASCII barf, but does the job nicely enough if you can get it right.

On platforms that provided no good inline assembly methods was not available (which includes those platforms and compilers for which inline assembly makes the code worse than using out of line assembly) one had the option of coding standalone assembler files to create the function call methods desired. That can be just as difficult since one is then forced to learn the assembler syntax in more detail and understand the idiosyncrasies of all the platform calling conventions.

Types of barriers.

There are in general a few different types of barriers (or fences). Not all architectures that allow for unordered instructions necessarily implement all of these. Some of these may also only be available as variations of specific instructions, so when standalone use is required a more expensive instruction may be required.

To describe the barrier types I will borrow ia64 nomenclature and assembly fragments, which is nicely descriptive despite death of this platform. The ia64 architecture provides acquire, release and full barriers. Some of these are built into the load and store operations on these platforms.

Acquire barrier

If one sees assembly code containing

ld4      r3   = [r30]   ; "BEFORE"
st1      [r4] = r40     ; "BEFORE"

ld4.acq  r5   = [r42]   ; LOAD with ACQUIRE barrier

ld4      r6   = [r43]   ; AFTER
st1      [r7] = r44     ; AFTER

One has no guarantee about the relative ordering of the BEFORE-labeled memory accesses. Similarly one has no guarantee about the ordering of the pair of AFTER-labeled memory accesses, but because of the acquire barrier the load and acquire barrier must be complete before the AFTER accesses.

Note that the acquire barrier is one directional. It does not prevent the pair of BEFORE memory accesses from occurring after the barrier. Illustrating by example, the instructions could be executed as if the compiler had generated the following sequence of instructions

ld4.acq  r5   = [r42]

st1      [r7] = r44
ld4      r3   = [r30]
ld4      r6   = [r43]
st1      [r4] = r40

There may be additional architecture rules about dependent loads and stores effecting the types of code motion that is allowed by the compiler, or the way that the CPU implements the architecture, but I will ignore those here.

This sort of effective instruction reordering is often described saying that one instruction has passed another.

This acquire barrier, perhaps unsurprising, is exactly the type of barrier required when implementing code for mutex acquisition. In addition to ia64, the PowerPC platform requires such barriers, and provides the isync instruction to the programmer. An example in a higher level language that may not have the desired effect without an such an acquire barrier is

if ( pSharedMem->atomic.bitTest() ) // use a bit to flag that somethingElse is "ready"
   foo( pSharedMem->somethingElse ) ;

If correct function of your program depends on somethingElse only being read after the atomic update flagging a condition is complete, then this code is not correct without an isync or other similar platform specific acquire barrier.

Your code should really look something like this

if ( pSharedMem->atomic.bitTestAcquire() )
   foo( pSharedMem->somethingElse ) ;

Here it is assumed the atomic library provides acquire and release variations of all atomic operations. For PowerPC this could be as simple as adding an isync to the unordered atomic operation, but on a platform like ia64 one would probably use a different instruction completely (ie. cmpxchg.acq for example). With such an acquire barrier the somethingElse access or another other memory operation that follows the load associated with the bitTest necessarily follows the load itself. For example, this can mean that a non-dependent load that has been executed ahead of the acquire barrier will have to be thrown out and reexecuted.

Release barrier

A release barrier is perhaps a bit easier to understand. Illustrating again by example, suppose that we have the following set of load and store instructions, separated by a release barrier.

ld4      r3    = [r30]   ; BEFORE
st1      [r4]  = r40     ; BEFORE

st4.rel  [r42] = r5      ; STORE with release barrier.

ld4      r6    = [r43]   ; "AFTER"
st1      [r7]  = r44     ; "AFTER"

The release barrier ensures that the pair of BEFORE-labeled reads and write memory accesses are complete before the barrier instruction. There is nothing that prevents the BEFORE instructions from executing in an alternate order, but all of the effects of these instructions visible by any other CPU must be complete before the effects of the store barrier instruction is observable by that other CPU. Like the acquire barrier, the release barrier is a one way fence. Neither of the instructions labeled “AFTER” necessarily have to be executed after the release barrier.

On PowerPC this release barrier is implemented with a light weight sync (different than the isync that implements the acquire barrier). As a general rule of thumb, code that requires a release barrier will require an acquire barrier and vice-versa. These instructions are almost always required to be used in complementary pairs.

Similar but opposite to the acquire barrier, a release barrier has the semantics required for implementing the release portion of a mutex.

Full barrier.

We have seen how use of release and acquire barriers can ensure that that memory access before or after the barrier can be forced to maintain that order relative to the barrier. There are however, some types of memory ordering that neither of these can ensure.

Using ia64 code to illustrate, suppose one has memory accesses before a release barrier, and memory accesses after an acquire barrier.

ld4      r3    = [r30]   ; BEFORE
st1      [r4]  = r40     ; BEFORE

st4.rel  [r42] = r5      ; STORE with release barrier.
ld4.acq  r8    = [r45]   ; LOAD with acquire barrier.

ld4      r6    = [r43]   ; AFTER
st1      [r7]  = r44     ; AFTER

The barriers here ensure that the BEFORE-labeled instructions all complete before the release barrier effects become visible to other cpus, and also ensure that none of the AFTER-labeled instructions may start before the acquire barrier instruction is complete. This barrier cannot prevent the acquire barrier LOAD instruction from completing before the STORE barrier. Illustrating by example, the CPU could execute things like so

ld4.acq  r8    = [r45]   ; LOAD with acquire barrier.

ld4      r3    = [r30]   ; BEFORE
st1      [r4]  = r40     ; BEFORE

ld4      r6    = [r43]   ; AFTER
st1      [r7]  = r44     ; AFTER

st4.rel  [r42] = r5      ; STORE with release barrier.

This alternate ordering also meets the requirements of the original code. All of the BEFORE-labeled instructions are still before the release barrier and all the AFTER-labeled instructions occur after the acquire barrier. In order to enforce this type of ordering we require a special standalone ordering construct. On ia64 this is the mf (memory fence) instruction, and on that platform it acts as both a standalone acquire and release barrier, but also prevents store load reordering. Of the weakly ordered platforms that are widely available (PowerPC, Sparc, ia64), all of these architectures require a special standalone instruction to enforce this special ordering.

The Sparc architecture has a number of possible ordering models, but the one implemented in all the CPUs that you would find in a Solaris Sparc based system is designated total store order (TSO). For all practical purposes this means that all instruction pairs are ordered except a store load sequence, and in order to enforce such an ordering one must interject a ‘membar #storeload’ instruction.

On PowerPC, in order to guarantee store load instructions maintain their order, one must interject a heavy-weight sync instruction between the two (sync).

One gets the impression that whatever hardware black magic is required to implement store load ordering, it has the appearance of being harder to enforce this particular type of instruction ordering than any other.

One of the interesting things about these store load barrier instructions, is not that they do enforce this ordering, but that the release and acquire barrier instructions do not. In particular a mutex implemented with acquire and release barriers and not with a full barrier instruction can let exterior memory accesses into the critical section.

Suppose for example we have the following code

pSharedMem->x = 3 ;  // BEFORE
v = pSharedMem->y ;  // BEFORE

mutex.acquire() ; // assumed to contain an acquire barrier.

pSharedMem->z = 4 ;
foo( pSharedMem->w ) ;

mutex.release() ; // assumed to contain a release barrier.

pSharedMem->a = 2 ;  // AFTER
q = pSharedMem->b ;  // AFTER

Unless the mutex uses very heavy handed memory barrier instructions one has the possibility that either the AFTER or BEFORE labeled memory accesses may sneak into the critical section that the mutex provides.

This subtlety may not be documented by the mutex implementation, but one should not assume that a mutex does any more than keep stuff within the critical section from getting out.

The flag pattern illustrated.

The mutex is by far the most important example of barrier use that is required for correctness in a weakly ordered system. It is however, perhaps not the easiest to understand. A slightly simpler pattern is the use of an atomic to flag that another memory update is complete.

As mentioned it is common for correct memory barrier usage to require paired conugate acquire and release barriers. This is very much like what is in the implementation of the critical section that you are probably trying to avoid by using an atomic and the associated barriers. It will however, be split, with the release and acquire barrier usage divided into portions for the producer and consumer. As an example your critical section implementation would typically be of the form:

while (!pShared->lock.testAndSet_Acquire()) ;
// (this loop should include all the normal critical section stuff like
// spin, waste, 
// pause() instructions, and last-resort-give-up-and-blocking on a resource 
// until the lock is made available.)

// Access to shared memory.

pShared->foo = 1 
v = pShared-> goo


Acquire memory barrier above makes sure that any loads (pShared->goo) that may have been started before the successful lock modification are tossed, to be restarted if necessary.

The release memory barrier ensures that the load from goo into the (local say) variable v is complete before the lock word protecting the shared memory is cleared.

You have a similar pattern in the typical producer and consumer atomic flag scenario (it is difficult to tell by your sample if that is what you are doing but should illustrate the idea).

Suppose your producer used an atomic variable to indicate that some other state is ready to use. You’ll want something like this:

pShared->goo = 14


Without a release (“write”) barrier here in the producer you have no guarantee that the hardware isn’t going to get to the atomic store before the goo store has made it through the CPU store queues, and up through the memory hierarchy where it is visible (even if you have a mechanism that ensures the compiler orders things the way you want).

In the consumer

if ( pShared->atomic.compareAndSwap_Acquire(1,1) )
   v = pShared->goo 

Without a “read” barrier here you won’t know that the hardware hasn’t gone and fetched goo for you before the atomic access is complete. The atomic (ie: memory manipulated with the Interlocked functions doing stuff like lock cmpxchg), is only “atomic” with respect to itself, not other memory.

That’s really all there is to this example, so it is perhaps anticlimatic, as well as including a redundant discussion of a mutex implementation. However, it is a good first step and these notes are already too long. A more and better example will perhaps follow on a different day.


Some of the issues associated with using atomics and barriers to replace the simpler lock mechanism have been discussed. It is unfortunate to have to provide examples that utilize assembly code to illustrate these ideas, but it is harder to discuss these topics without doing so. Hopefully the inclusion of some of the low level details and assembly listings has not rendered this on paper discussion too incomprehensible.

The fact that some low level systems details must be understood for correct use of atomics to avoid locking should perhaps be a warning to the programmer. It can be very hard to implement lock avoidance algorithms correctly, harder still to test them. There is also a significant maintenance impact to going down this road, so be good and sure that you really need the performance provided before taking this path. Premature optimization in this area can be particularly dangerous and costly.

Do you really have a good reason for going down the atomic path instead of just using a lock? Doing all this correctly can be very difficult. Be prepared for a lot of self doubt and insecurity in code reviews and make sure you have a lot of high concurrency testing with all sorts of random timing scenarios. Use a critical section unless you have a very very good reason to avoid it, and don’t write that critical section yourself.

The correctness aspects of memory barrier utilization makes your code intrinsically unportable. Because many correct uses of atomics also require barrier instructions, this can in turn imply that the “innocent and simple” atomic class library implementation you have access to can be correspondingly unportable. This may be surprising to the developer, but then again, many aspects of weakly ordered system behavior can be surprising.

On a weakly ordered system, your compiler probably provides _acquire and _release variations for most of the atomic manipulation methods. On a strongly ordered platform like ia32 or amd64, _acquire() or _release() suffixed atomic intrinsics will not be any different than the vanilla operations. With ia64 effectively dead (except on HP where its still twitching slightly) PowerPC is probably the most pervasive surviving weakly ordered architecture around, and even it has toyed with non-weak ordering (power5 was weakly ordered, power6 was not, and power7 is again weakly ordered). It will be interesting to what path the hardware folks take, and whether these details are of any importance in 10 or 15 years.


[1] M. Lyons, E. Silha, and B. Hay. PowerPC storage model and AIX programming, 2002.

[2] D. Lea. The JSR-133 cookbook for compiler writers, 2005.

[3] L.C.H.J. Boehm. Threads and memory model for C++ [online].

[4] J. Manson and B. Goetz. Jsr 133 (java memory model) faq [online]. 2004. pugh/java/memoryModel/jsr-133-faq.html.

[5] L.C.H.J. Boehm. C++ atomic types and operations. mailings N2427. C++ standards committee, 2007.

[6] H. Sutter. {Prism-A Principle-Based Sequential Memory Model for Microsoft Native Code Platforms Draft Version 0.9. 1}, 2006.

[7] R. Silvera, M. Wong, P. McKenney, and B. Blainey. {A simple and efficient memory model for weakly-ordered architectures}, 2007.

[8] S.V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. Computer, 29(12):66–76, 1996.

[9] J.D. Valois. Lock-free linked lists using compare-and-swap.

[10] T.L. Harris, K. Fraser, and I.A. Pratt. A practical multi-word compare-and-swap operation. Lecture Notes in Computer Science, 2508:265–279, 2002.

[11] T.L. Harris. A pragmatic implementation of non-blocking linked-lists. Lecture Notes in Computer Science, 2180:300–314, 2001.

[12] T.L. Harris. Harris Papers [online].

[13] R. Bencina. lock free papers and notes. [online]. 2006. rossb/code/lockfree/.

[14] M.M. Michael. {Safe memory reclamation for dynamic lock-free objects using atomic reads and writes}. In {Proceedings of the twenty-first annual symposium on Principles of distributed computing}, pages 21–30. ACM New York, NY, USA, 2002.

Posted in C/C++ development and debugging. | Tagged: , , , , , | 1 Comment »

Non-covariant Lagrangian and Hamiltonian for Lorentz force.

Posted by peeterjoot on November 28, 2009

[Click here for a PDF containing this post with nicer formatting]

In [1], the Lagrangian for a charged particle is given as (12.9) as

\begin{aligned}\mathcal{L} = -m c^2 \sqrt{1 - \mathbf{u}^2/c^2} + \frac{e}{c} \mathbf{u} \cdot \mathbf{A} - e \Phi.\end{aligned} \quad\quad\quad(1)

Let’s work in detail from this to the Lorentz force law and the Hamiltonian and from the Hamiltonian again to the Lorentz force law using the Hamiltonian equations. We should get the same results in each case, and have enough details in doing so to render the text a bit more comprehensible.

Canonical momenta

We need the conjugate momenta for both the Euler-Lagrange evaluation and the Hamiltonian, so lets get that first. The components of this are

\begin{aligned}\frac{\partial {\mathcal{L}}}{\partial {\dot{x}_i}} &= - \frac{1}{{2}} m c^2 \gamma (-2/c^2) \dot{x}_i + \frac{e}{c} A_i \\ &= m \gamma \dot{x}_i + \frac{e}{c} A_i.\end{aligned}

In vector form the canonical momenta are then

\begin{aligned}\mathbf{P} &= \gamma m \mathbf{u} + \frac{e}{c} \mathbf{A}.\end{aligned} \quad\quad\quad(2)

Euler-Lagrange evaluation.

Completing the Euler-Lagrange equation evaluation is the calculation of

\begin{aligned}\frac{d\mathbf{P}}{dt} = \boldsymbol{\nabla} \mathcal{L}.\end{aligned} \quad\quad\quad(3)

On the left hand side we have

\begin{aligned}\frac{d\mathbf{P}}{dt} = \frac{d(\gamma m \mathbf{u})}{dt} + \frac{e}{c} \frac{d\mathbf{A} }{dt},\end{aligned} \quad\quad\quad(4)

and on the right, with implied summation over repeated indexes, we have

\begin{aligned}\boldsymbol{\nabla} \mathcal{L} = \frac{e}{c} \mathbf{e}_k (\mathbf{u} \cdot \partial_k \mathbf{A}) - e \boldsymbol{\nabla} \Phi.\end{aligned} \quad\quad\quad(5)

Putting things together we have

\begin{aligned}\frac{d(\gamma m \mathbf{u})}{dt} &= -e \left(\boldsymbol{\nabla} \Phi + \frac{1}{{c}} \frac{\partial {\mathbf{A}}}{\partial {t}}+ \frac{1}{c} \left(\frac{\partial {\mathbf{A}}}{\partial {x_a}} \frac{\partial {x_a}}{\partial {t}} - \mathbf{e}_k (\mathbf{u} \cdot \partial_k \mathbf{A}) \right)\right) \\ &= -e \left(\boldsymbol{\nabla} \Phi + \frac{1}{{c}} \frac{\partial {\mathbf{A}}}{\partial {t}}+ \frac{1}{c} \mathbf{e}_b u_a\left(\frac{\partial {A_b}}{\partial {x_a}} -\frac{\partial {A_a}}{\partial {x_b}}\right)\right).\end{aligned}


\begin{aligned}\mathbf{E} = -\boldsymbol{\nabla} \Phi - \frac{1}{{c}} \frac{\partial {\mathbf{A}}}{\partial {t}},\end{aligned} \quad\quad\quad(6)

the first two terms are recognizable as the electric field. To put some structure in the remainder start by writing

\begin{aligned}\frac{\partial {A_b}}{\partial {x_a}} - \frac{\partial {A_a}}{\partial {x_b}} = \epsilon^{fab} {(\boldsymbol{\nabla} \times \mathbf{A})}_f.\end{aligned} \quad\quad\quad(7)

The remaining term, with \mathbf{B} = \boldsymbol{\nabla} \times \mathbf{A} is now

\begin{aligned}- \frac{e}{c} \mathbf{e}_b u_a \epsilon^{gab} B_g&=\frac{e}{c} \mathbf{e}_a u_b \epsilon^{abg} B_g \\ &= \frac{e}{c} \mathbf{u} \times \mathbf{B}.\end{aligned}

We are left with the momentum portion of the Lorentz force law as expected

\begin{aligned}\frac{d(\gamma m \mathbf{u})}{dt} = e \left( \mathbf{E} + \frac{1}{c} \mathbf{u} \times \mathbf{B} \right).\end{aligned} \quad\quad\quad(8)

Observe that with a small velocity Taylor expansion of the Lagrangian we obtain the approximation

\begin{aligned}-m c^2 \sqrt{ 1 -\mathbf{u}^2/c^2} \approx - m c^2 \left( 1 - \frac{1}{{2}} \mathbf{u}^2/c^2 \right) = \frac{1}{{2}} m \mathbf{u}^2\end{aligned} \quad\quad\quad(9)

If that is our starting place, we can only obtain the non-relativistic approximation of the momentum change by evaluating the Euler-Lagrange equations

\begin{aligned}\frac{d (m \mathbf{u})}{dt} = e \left( \mathbf{E} + \frac{1}{c} \mathbf{u} \times \mathbf{B} \right).\end{aligned} \quad\quad\quad(10)

That was an exercise previously attempting working the Tong Lagrangian problem set [2].


Having confirmed the by old fashioned Euler-Lagrange equation evaluation that our Lagrangian provides the desired equations of motion, let’s now try it using the Hamiltonian approach. First we need the Hamiltonian, which is nothing more than

\begin{aligned}H = \mathbf{P} \cdot \mathbf{u} - \mathcal{L}\end{aligned} \quad\quad\quad(11)

However, in the Lagrangian and the dot product we have velocity terms that we must eliminate in favor of the canonical momenta. The Hamiltonian remains valid in either form, but to apply the Hamiltonian equations we need H = H(\mathbf{P}, \mathbf{x}), and not H = H(\mathbf{u}, \mathbf{P}, \mathbf{x}).

\begin{aligned}H = \mathbf{P} \cdot \mathbf{u} + m c^2 \sqrt{1 - \mathbf{u}^2/c^2} - \frac{e}{c} \mathbf{u} \cdot \mathbf{A} + e \Phi.\end{aligned} \quad\quad\quad(12)


\begin{aligned}H = \mathbf{u} \cdot \left(\mathbf{P} - \frac{e}{c} \mathbf{A}\right) + m c^2 \sqrt{1 - \mathbf{u}^2/c^2} + e \Phi.\end{aligned} \quad\quad\quad(13)

We can rearrange 2 for \mathbf{u}

\begin{aligned}\mathbf{u} = \frac{1}{{m \gamma}} \left( \mathbf{P} - \frac{e}{c} \mathbf{A} \right),\end{aligned} \quad\quad\quad(14)

but \gamma also has a \mathbf{u} dependence, so this is not complete. Squaring gets us closer

\begin{aligned}\mathbf{u}^2 = \frac{1 - \mathbf{u}^2/c^2}{m^2} {\left( \mathbf{P} - \frac{e}{c} \mathbf{A} \right)}^2,\end{aligned} \quad\quad\quad(15)

and a bit of final rearrangement yields

\begin{aligned}\mathbf{u}^2 = \frac{(c \mathbf{P} - e \mathbf{A})^2}{m^2 c^2 + {\left( \mathbf{P} - \frac{e}{c} \mathbf{A} \right)}^2}.\end{aligned} \quad\quad\quad(16)

Writing \mathbf{p} = \mathbf{P} - e \mathbf{A}/c, we can rearrange and find

\begin{aligned}\sqrt{1 - \mathbf{u}^2/c^2} = \frac{m c }{\sqrt{m^2 c^2 +\mathbf{p}^2}}\end{aligned} \quad\quad\quad(17)

Also, taking roots of 16 we must have the directions of \mathbf{u} and \left( \mathbf{P} - \frac{e}{c} \mathbf{A} \right) differ only by a rotation. From 14 we also know that these are colinear, and therefore have

\begin{aligned}\mathbf{u} = \frac{c \mathbf{P} - e \mathbf{A}}{\sqrt{m^2 c^2 + {\left( \mathbf{P} - \frac{e}{c} \mathbf{A} \right)}^2}}.\end{aligned} \quad\quad\quad(18)

This and 17 can now be substituted into 13, for

\begin{aligned}H = \frac{c}{m^2 c^2 + \mathbf{p}^2} \left({\left(\mathbf{P} - \frac{e}{c} \mathbf{A}\right)}^2 + m^2 c^2 \right)+ e \Phi.\end{aligned} \quad\quad\quad(19)

Dividing out the common factors we finally have the Hamiltonian in a tidy form

\begin{aligned}H = \sqrt{ (c \mathbf{P} - e \mathbf{A})^2 + m^2 c^4 } + e\Phi.\end{aligned} \quad\quad\quad(20)

Hamiltonian equation evaluation.

Let’s now go through the exercise of evaluating the Hamiltonian equations. We want the starting point to be just the energy expression 20, and the use of the Hamiltonian equations and none of what led up to that. If we were given only this Hamiltonian and the Hamiltonian principle

\begin{aligned}\frac{\partial {H}}{\partial {P_k}} &= u_k \\  \frac{\partial {H}}{\partial {x_k}} &= -\dot{P}_k, \end{aligned} \quad\quad\quad(21)

how far can we go?

For the particle velocity we have no \Phi dependence and get

\begin{aligned}u_k &= \frac{c (c P_k -e A_k)}{\sqrt{ (c \mathbf{P} - e \mathbf{A})^2 + m^2 c^4 }}\end{aligned} \quad\quad\quad(23)

This is 18 in coordinate form, one of our stepping stones on the way to the Hamiltonian, and we recover it quickly with our first set of derivatives. We have the gradient part \dot{\mathbf{P}} = -\boldsymbol{\nabla} H of the Hamiltonian left to evaluate

\begin{aligned}\frac{d\mathbf{P}}{dt} = \frac{e (c P_k -e A_k) \boldsymbol{\nabla} A_k }{\sqrt{ (c \mathbf{P} - e \mathbf{A})^2 + m^2 c^4 }} - e \boldsymbol{\nabla} \Phi.\end{aligned} \quad\quad\quad(24)


\begin{aligned}\frac{d\mathbf{P}}{dt} = e \left( \frac{u_k}{c} \boldsymbol{\nabla} A_k - \boldsymbol{\nabla} \Phi \right)\end{aligned} \quad\quad\quad(25)

This looks nothing like the Lorentz force law. Knowing that \mathbf{P} - e\mathbf{A}/c is of significance (because we know where we started which is kind of a cheat), we can subtract derivatives of this from both sides, and use the convective derivative operator d/dt = {\partial {}}/{\partial {t}} + \mathbf{u} \cdot \boldsymbol{\nabla} (ie. chain rule) yielding

\begin{aligned}\frac{d}{dt}(\mathbf{P} - e\mathbf{A}/c) = e \left( -\frac{1}{{c}}\frac{\partial {\mathbf{A}}}{\partial {t}} - \frac{1}{{c}} (\mathbf{u} \cdot \boldsymbol{\nabla}) \mathbf{A} + \frac{u_k}{c} \boldsymbol{\nabla} A_k - \boldsymbol{\nabla} \Phi \right).\end{aligned} \quad\quad\quad(26)

The first and last terms sum to the electric field, and we seen evaluating the Euler-Lagrange equations that the remainder is u_k \boldsymbol{\nabla} A_k - (\mathbf{u} \cdot \boldsymbol{\nabla}) \mathbf{A} = \mathbf{u} \times (\boldsymbol{\nabla} \times \mathbf{A}). We have therefore gotten close to the familiar Lorentz force law, and have

\begin{aligned}\frac{d}{dt}(\mathbf{P} - e\mathbf{A}/c) = e \left( \mathbf{E} + \frac{\mathbf{u}}{c} \times \mathbf{B} \right).\end{aligned} \quad\quad\quad(27)

The only untidy detail left is that \mathbf{P} - e \mathbf{A}/c doesn’t look much like \gamma m \mathbf{u}, what we recognize as the relativistically corrected momentum. We ought to have that implied somewhere and 23 looks like the place. That turns out to be the case, and some rearrangement gives us this directly

\begin{aligned}\mathbf{P} - \frac{e}{c}\mathbf{A} = \frac{m \mathbf{u}}{\sqrt{1 - \mathbf{u}^2/c^2}}\end{aligned} \quad\quad\quad(28)

This completes the exercise, and we’ve now obtained the momentum part of the Lorentz force law. This is still unsatisfactory from a relativistic context since we do not have momentum and energy on equal footing. We likely need to utilize a covariant Lagrangian and Hamiltonian formulation to fix up that deficiency.


[1] JD Jackson. Classical Electrodynamics Wiley. 2nd edition, 1975.

[2] Dr. David Tong. Classical Mechanics Lagrangian Problem Set 1. [online].

Posted in Math and Physics Learning. | Tagged: , , , | Leave a Comment »

grepping a range of lines using perl.

Posted by peeterjoot on November 27, 2009

I was asked how to use grep to select everything in a file starting with a pattern, and ending with a different one. The file is our diagnostic log and if this has originated with one of our system testers could be massive (a few hundred thousand lines long). gnu-grep can be used for this. You could do something like:

grep -A 9999999 'some first expression' < wayToBigFile | grep -B 9999999 'some other expression'

Here 9999999 is some number of lines that is guessed big enough to contain all the lines of interest (not known ahead of time), so the command says “give me everything after the expression, and then give me everything before the other expression in that output”

Perl gives you a nicer way of doing this:

#!/usr/bin/perl -n

$foundIt = 1 if ( /some first expression/ ) ;

next if ( !$foundIt ) ;

print ;

exit if ( /some other expression/ ) ;

# done.

The -n flag says to run the whole script as if it is in a ‘while (<>){ … }’ loop.  Until the initial pattern is seen $foundIt is false, and nothing will be printed, and we bail if the second pattern is seen. Note that this relies on perl’s lazy variable initialization, since $foundIt = 0 until modified.

Observe also that this script is actually also it’s own test case.

myPrompt$ chmod 755 ./theScript
myPrompt$ ./theScript < ./theScript
$foundIt = 1 if ( /some first expression/ ) ;

next if ( !$foundIt ) ;

print ;

exit if ( /some other expression/ ) ;

Posted in perl and general scripting hackery | Tagged: , | 6 Comments »

A matrix algebra attack on the multiple mass spherical pendulum problem.

Posted by peeterjoot on November 26, 2009

FULL VERSION IN PDF ONLY HERE: Lagrangian and Euler-Lagrange equation evaluation for the spherical N-pendulum problem.


The dynamics of chain like objects can be idealized as a multiple pendulum, treating the system as a set of point masses, joined by rigid massless connecting rods, and frictionless pivots. The double planar pendulum and single mass spherical pendulum problems are well treated in Lagrangian physics texts, but due to complexity a similar treatment of the spherical N-pendulum problem is not pervasive. We show that this problem can be tackled in a direct fashion, even in the general case with multiple masses and no planar constraints. A matrix factorization of the kinetic energy into allows an explicit and compact specification of the Lagrangian. Once that is obtained the equations of motion for this generalized pendulum system follow directly.


Derivation of the equations of motion for a planar motion constrained double pendulum system and a single spherical pendulum system are given as problems or examples in many texts covering Lagrangian mechanics. Setup of the Lagrangian, particularly an explicit specification of the system kinetic energy, is the difficult aspect of the multiple mass pendulum problem. Each mass in the system introduces additional interaction coupling terms, complicating the kinetic energy specification. In this paper, we matrix algebra to determine explicitly the Lagrangian for the spherical N pendulum system, and to evaluate the Euler-Lagrange equations for the system.

It is well known that the general specification of the kinetic energy for a system of independent point masses takes the form of a symmetric quadratic form \cite{goldstein1951cm} \cite{hestenes1999nfc}. However, actually calculating that energy explicitly for the general N-pendulum is likely thought too pedantic for even the most punishing instructor to inflict on students as a problem or example.

Considering this problem in a matrix algebra context motivates the introduction of multivector matrices. Given an velocity coordinate vector expressed in terms of $N$ generalized coordinates, we can factor this into $N \times M$ and $M \times 1$ matrix products. This matrix factorization allows the velocity vector for each mass to be separated into a product containing one generalized angular velocity coordinate vector and one matrix of trigonometric terms. Such a grouping can be used to tidily separate the kinetic energy into an explicit quadratic form, sandwiching a symmetric matrix between two vectors of generalized velocity coordinates.

This paper is primarily a brute force and direct attack on the problem. It contains no new science, only a systematic treatment of a problem that is omitted from mechanics texts, yet conceptually simple enough to deserve treatment.

The end result of this paper is a complete and explicit specification of the Lagrangian and evaluation of the Euler-Lagrange equations for the chain-like N spherical pendulum system. While this end result is essentially nothing more than a non-linear set of coupled differential equations, it is believed that the approach used to obtain it has some elegance. Grouping all the rotational terms of the kinetic into a symmetric kernel appears to be a tidy way to tackle multiple discrete mass problems.
At the very least, the calculation performed can show that a problem perhaps thought to be too messy for a homework exersize yields nicely to an organized and systematic attack.

Posted in Math and Physics Learning. | Tagged: , , , , , | 1 Comment »

An aborted attempt at a scientific style paper (geometric algebra N spherical pendulum)

Posted by peeterjoot on November 26, 2009

I’m not going to blog-ize this post since I’d have to first teach my latex to wordpress script to handle the theorem and subequations environment:

Matrix factorization of vectors in Geometric Algebra with application to the spherical N-pendulum problem

This was an attempt to cleanup my previous N spherical pendulum treatment (ie. equations of motion for an idealized chain). I’d used matrixes of multivector elements to arrive at an explicit expression for the kinetic energy for this system and then evaluated the Euler-Lagrange equations. I thought initially that this was pretty nifty, but as I was cleaning this up for an attempted arxiv post, I rudely realized that exactly the same result follows by factoring our a column vector of generalized coordinates (an angular velocity vector of sorts) from the scalar energy expressions (factoring a “one by one matrix” of scalar values into a product of matrixes).

No use of geometric algebra is required. Without the use of geometic algebra with its compact representation for the spherical parameterized points on the unit sphere, I don’t think I’d have even attempted this generalized problem. However, the idea that makes the problem tractable is completely independent of GA, and renders the paper as written just an obfuscated way to tackle it (since only a handful of people know this algebraic approach).

In case anybody is curious, here’s the abstract and introduction.


Geometric algebra provides a formalism that allows for coordinate and matrix free representations of vectors, linear transformations, and other mathematical structures. The replacement of matrix techniques with direct geometric algebra methods is a well studied field, but this need not be an exclusive replacement. While the geometric algebra multivector space does not constitute a field, operations on multivector element matrices can be defined in a consistent fashion. This use of geometric algebraic objects in matrices provides a hybrid mathematical framework with immediate applications to mathematical physics. Such an application is the dynamical examination of chain like structures, idealizable as an N mass pendulum with no planar constraints. The double planar pendulum and single mass spherical pendulum problems are well treated in Lagrangian physics texts, but due to complexity a similar treatment of the spherical N-pendulum problem is not pervasive. It will be shown that the multivector matrix technique lends itself nicely to this problem, allowing the complete specification of the equations of motion for this generalized pendulum system.


Derivation of the equations of motion for a planar motion constrained double pendulum system and a single spherical pendulum system are given as problems or examples in many texts covering Lagrangian mechanics. Setup of the Lagrangian, particularly an explicit specification of the system kinetic energy, is the difficult aspect of the multiple mass pendulum problem. Each mass in the system introduces additional interaction coupling terms, complicating the kinetic energy specification. In this paper, we utilize geometric (or Clifford) algebra to determine explicitly the Lagrangian for the spherical N pendulum system, and to evaluate the Euler-Lagrange equations for the system.

It is well known that the general specification of the kinetic energy for a system of independent point masses takes the form of a symmetric quadratic form \cite{goldstein1951cm} \cite{hestenes1999nfc}. However, actually calculating that energy explicitly for the general N-pendulum is likely thought too pedantic for even the most punishing instructor to inflict on students as a problem or example.

Considering this problem in a geometric algebra context motivates the introduction of multivector matrices. Given a one by one matrix containing a sum of vector terms, we can factor that matrix into a pair of $M \times 1$ and $1 \times M$ multivector matrices. In general there are many possible multivector factorizations of any given vector. For the pendulum problem all the required vector factorizations involve products with exponential rotation operators. This matrix factorization allows the velocity vector for each mass to be separated into a product containing one angular velocity coordinate vector and one multivector factor. Such a grouping can be used to tidily separate the kinetic energy into an explicit quadratic form, sandwiching a Hermitian multivector matrix between two vectors of generalized velocity coordinates.

While matrix algebra over the field of real numbers of complex numbers is familiar and well defined, geometric algebra multivector objects do not constitute a field, generally lacking both commutative multiplication and multiplicative inverses. This does not prevent a coherent definition of multivector matrix multiplication, provided that care is taken to retain appropriate order of element products. In the geometric algebra literature, there is significant focus on how to replace matrix and coordinate methods with geometric algebra, and less focus on how to use both together. A hybrid approach utilizing both matrix and geometric algebra has some elegance, and appears particularly well suited to the specific problem of the spherical N pendulum.

This paper is divided into two major sections. In the first, we start with a brief and minimal summary of required geometric algebra fundamentals and definitions. This includes the axioms, some nomenclature, and the use of one sided exponential rotation operators. To anybody already familiar with geometric algebra, this can be skipped or referred to only for notational conventions. Next, also in the first section, we define multivector matrix operations. A notion of Hermiticity useful for forming the quadratic forms used to express the kinetic energy for a multiple point mass system is defined here. A few examples of vector factorization into multivector matrix products are supplied for illustration purposes.

The second part of this paper applies the geometric algebra matrix techniques to the pendulum problem. For the single and double pendulum problems, already familiar to many, the methods and the results are compared to traditional coordinate geometry techniques used elsewhere. It is noteworthy that even the foundational geometric algebra mechanics text \cite{hestenes1999nfc} falls back to coordinate geometry for the planar double pendulum problem, instead of treating it in a way that is arguably more natural given the available tools.

Enroute to the Lagrangian we find an expression for the kinetic energy as a sum of bilinear forms, where the matrices used to express these forms are composed of pairs of block matrix products. This is surely a result that can be obtained by non geometric algebra methods, although how to obtain this result using alternate tools is not immediately obvious.

The end result of this paper is a complete and explicit specification of the Lagrangian and evaluation of the Euler-Lagrange equations for the chain-like N spherical pendulum system. While, this end result is free of geometric algebra, being nothing more than a non-linear set of coupled differential equations, it is believed that this geometric algebra matrix approach is an elegant and compact way to obtain it or to tackle similar problems. At the very least, this may prove to be one more useful addition to the physicist’s mathematical toolbox.

Posted in Math and Physics Learning. | Tagged: , , , , , , | Leave a Comment »

building a private version of gdb on a machine that has an older version.

Posted by peeterjoot on November 23, 2009

We have SLES10 linux machines, and the gdb version available on them is a old (so old that it no longer works with the version of the intel compiler that we use to build our product). Here’s a quick cheatsheet on how to download and install a newer version of gdb for private use, without having to have root privileges or replace the default version on the machine:

mkdir -p ~/tmp/gdb
cd ~/tmp/gdb
bzip2 -dc gdb-7.2.tar.bz2 | tar -xf -
mkdir g
cd g
../gdb-7.2/configure --prefix=$HOME/gdb
make install

Executing these leaves you with a private version of gdb in ~/gdb/bin/gdb that works with newer intel compiled code.

This version of gdb has some additional features (relative to 6.8 that we have on our machines) that also look interesting:

  •  disassemble start,+length looks very handy (grab just the disassembly that is of interest, or when the whole thing is desired, not more hacking around with the pager depth to get it all).
  • save and restore breakpoints.
  • current thread number variable $_thread
  • trace state variables (7.1), and fast tracepoints (will have to try that).
  • detached tracing
  • multiple program debugging (although I’m not sure I’d want that, especially when just one multi-threaded program can be pretty hairy to debug).  I recall many times when dbx would crash AIX with follow fork.  I wonder if other operating systems deal with this better?
  • reverse debugging, so that you can undo changes!  This is said to be target dependent.  I wonder if amd64 is supported?
  • catch syscalls.  I’ve seen some times when the glibc dynamic loader appeared to be able to exit the process, and breaking on exit, _exit, __exit did nothing.  I wonder if the exit syscall would catch such an issue.
  • find.  Search memory for a sequence of bytes.

Posted in Development environment | Tagged: , | Leave a Comment »

Linearizing a set of regular differential equations.

Posted by peeterjoot on November 18, 2009

[Click here for a PDF of this post with nicer formatting]


A number of discrete multiple particle systems appear to generate coupled differential equations of the following form

\begin{aligned}A z' = b(z)\end{aligned} \quad\quad\quad(1)

Where A = A(z) is a matrix, and b(z) a column vector valued non-linear function. How do we solve such an equation?

When the matrix is invertible.

Consider the nicely behaved case where A(z) is invertible for all z. Then we can write

\begin{aligned}z' = A^{-1} b(z)\end{aligned} \quad\quad\quad(2)

Now with a non-linear function b (like the sines that we have in the pendulum problem from -\nabla \cos\phi), we can’t solve this thing easily, but in some small-enough neighborhood of some point (i.e. a point in phase space containing z) we can make a linear approximation.

Suppose our initial phase space point is z_0, and we wish to solve for differential displacement from that point x, namely z = z_0 + x. Then we have for our system

\begin{aligned}x' = {\left.{{A^{-1} b(z)}}\right\vert}_{{z=z_0}} + {{\begin{bmatrix}\frac{\partial {[A^{-1}b(z)]_1}}{\partial {z_1}} & \frac{\partial {[A^{-1}b(z)]_1}}{\partial {z_2}} & \hdots & \frac{\partial {[A^{-1}b(z)]_1}}{\partial {z_N}} \\ \frac{\partial {[A^{-1}b(z)]_2}}{\partial {z_1}} & \frac{\partial {[A^{-1}b(z)]_2}}{\partial {z_2}} & \hdots & \frac{\partial {[A^{-1}b(z)]_2}}{\partial {z_N}} \\ \vdots \\ \frac{\partial {[A^{-1}b(z)]_N}}{\partial {z_1}} & \frac{\partial {[A^{-1}b(z)]_N}}{\partial {z_2}} & \hdots & \frac{\partial {[A^{-1}b(z)]_N}}{\partial {z_N}} \\ \end{bmatrix}}}_{{z = z_0}} x\end{aligned} \quad\quad\quad(3)

Now we have a linear matrix, corresponding roughly to a first order Taylor expansion of the original system of equations. Mathematically the problem is now reduced to a column vector linear system of the form

\begin{aligned}x' = B x + a\end{aligned} \quad\quad\quad(4)

When a = 0 the solution is just

\begin{aligned}x = e^{B t} x(0)\end{aligned} \quad\quad\quad(5)

(where you can evaluate e^{Bt} the usual way using an eigenvalue similarity transformation where the exponential of the inner diagonal term is then simple).

Assuming a non-homogeneous solution of the same form x = e^{B t} f(t) one finds the specific solution

\begin{aligned}x = e^{B t} \int_0^t e^{-B\tau} a d\tau\end{aligned} \quad\quad\quad(6)

So the complete solution (a specific solution plus the homogeneous solution) to the system 4 is found to be

\begin{aligned}x = e^{B t} \left( x(0) + \int_0^t e^{-B\tau} a d\tau \right)\end{aligned} \quad\quad\quad(7)

Thoughts to consider for the non-invertible or ill formed matrix

Now, this works out all nice and pleasantly when A is invertible. What do we do when it is not? Each zero eigenvalue of A looks like it corresponds to a conserved quantity. Manually removing those zeros from the system can reduce things to the form dealt with here. What is even trickier seeming is what happens if the matrix is almost invertible. Example

\begin{aligned}\begin{bmatrix}1 & 0 \\ 0 & 0.0000000001\end{bmatrix} z' = b(z)\end{aligned} \quad\quad\quad(8)

This is perfectly invertible mathematically, but numerically you really don’t want to go there, since you’ll end up with garbage. What I think would work for dealing with this sort of system is using the SVD (symmetric value decomposition) techniques to determine a orthonormal basis for all the “non-zero” eigenvalues according to a reasonable numerical threshold for these zeros.

SVD it is relatively new to applied mathematics and wasn’t covered in my linear algebra course when I was in school (now over ten years ago). I’ve read of it since and was very impressed with its power and utility (but need more study of it to fully grasp it and its applications). Prof Gilbert Strang covers this in his online lectures (on iTunesU), and I’d recommend them highly.

Exploring methods of solving (in the neighborhood of a phase space point) these sorts of differential equations using SVD is something remaining to be explored in more detail (at least by me), but intuition says it is relevant.

Posted in Math and Physics Learning. | Tagged: , , , , | Leave a Comment »

Force free relativistic motion.

Posted by peeterjoot on November 15, 2009

[Click here for a PDF of this post with nicer formatting]


Considering the Euler-Lagrange solutions for the relativistic force free covariant Lagrangian

\begin{aligned}\mathcal{L} &= \frac{1}{2} m \dot{x}^\mu \dot{x}_\mu \\ \dot{x}^\mu &= \frac{d x^\mu}{d \tau},\end{aligned} \quad\quad\quad(1)

we get a set of four constant momentum equations

\begin{aligned}m \dot{x}_\mu = m v_\mu(0).\end{aligned} \quad\quad\quad(3)

Can we make some sense of this? While this seems natural enough in comparision to Newtonian physics, we “just” add a component when switching to a four vector representation, the \gamma factors that one may expect are nowhere obvious to be seen.


A decomposition into an explicit spacetime split looks like it is the first step along the path to resolves this

\begin{aligned}X \equiv (c t, \mathbf{x}) = (ct, x^1, x^2, x^3).\end{aligned} \quad\quad\quad(4)

Considering first the time component of our equations of motion we have

\begin{aligned}m c \frac{dt}{d\tau} = m v_0(0).\end{aligned} \quad\quad\quad(5)


\begin{aligned}\frac{dt}{d\tau} = \frac{v_0(0)}{c}.\end{aligned} \quad\quad\quad(6)

For the spatial components we have

\begin{aligned}m \frac{d x_k}{d\tau} &=m \frac{d x_k}{dt}  \frac{dt}{d\tau} \\ &=m \frac{d x_k}{dt}  \frac{v_0(0)}{c}.\end{aligned}

With a switch to upper indexes, the remaining three equations of motion are then just

\begin{aligned}\frac{d x^k}{dt} = c \frac{ v^k(0) }{ v_0(0) }.\end{aligned} \quad\quad\quad(7)


\begin{aligned}\frac{d \mathbf{v}}{dt} = c \frac{ \mathbf{v}(0) }{ v_0(0) }.\end{aligned} \quad\quad\quad(8)

The math seems to be saying that relativistically, in the abscence of forces, we have constant velocity in our rest frame. This constant velocity is relative to the initial time component of the four velocity. This is not what I would have expected from the relativisitically corrected Newtons laws in three vector form

\begin{aligned}\mathbf{F} = \frac{d}{dt}\left( \frac{m \mathbf{v}}{\sqrt{1 - (\mathbf{v}/c)^2}} \right).\end{aligned} \quad\quad\quad(9)

In this equation it appears that we should only expect constant velocity in the small speed limit where \mathbf{v}/c can be neglected. If we, however, take this equation and run with it, where does it lead? Introducing a vector constant for the spatial momentum \mathbf{p}(0) we have

\begin{aligned}\frac{m \mathbf{v}}{\sqrt{1 - (\mathbf{v}/c)^2}} = \mathbf{p}_0.\end{aligned} \quad\quad\quad(10)

We can now square and rearrange, yielding

\begin{aligned}\frac{\mathbf{v}^2}{c^2} = \frac{ {\mathbf{p}_0}^2 } { m^2 c^2 + {\mathbf{p}_0}^2 }.\end{aligned} \quad\quad\quad(11)

With the additional assumption that \mathbf{v} and \mathbf{p}_0 are colinear we can take roots (the two could differ by an arbitrary spatial rotation), yielding

\begin{aligned}\frac{\mathbf{v}}{c} = \frac{ \mathbf{p}_0} { \sqrt{m^2 c^2 + {\mathbf{p}_0}^2} }.\end{aligned} \quad\quad\quad(12)

Just as seen starting from the covariant Lagrangian, we have constant spatial velocity in the abscence of external forces. There was no fundamental inconsistency between the covariant result and the relativistically corrected Newtonian force law. It was just not initially obvious to me that this was the case.

Posted in Math and Physics Learning. | Tagged: , | Leave a Comment »

C pointer aliasing violations and aggressive compiler optimizations.

Posted by peeterjoot on November 14, 2009

In our product (DB2), performance is given priority over all else in many cases. We have performance regression testing on all our platforms (lots), policing of build to build and release to release instruction counts in critical codepaths, benchmark and customer workload driven performance features, and a lot of other similar performance driven activity.

It is probably fair to say that there are folks in our performance team that are willing to sacrifice small children and dogs for 3% improvement in various workloads. Some of those sacrificed children are developer time and sanity. In particular we build our product with options like the xlC (AIX powerpc compiler) -qansialias option (there are similar options on other compilers, like the gcc -fstrict-aliasing). Additionally we build with options like -qipa (inter-procedural-analysis) and -qpdf (profile directed feedback). These last two options allow the compiler to do very aggressive optimization, using profiling information to guide a second pass of compilation, and allowing that optimization to span separate compilation modules. In a nutshell, these are the very scary optimization options. When it goes wrong, often because of incorrect code, the effects can be very mysterious, hard to isolate, and even non-deterministic (Thursday’s build can be busted, even if the code didn’t change, when Wednesday’s build was fine). What do these profile guided optimizations have to do with aliasing? I’ll return to that.

Incorrectly pointer aliasing is probably the most prominent trigger for unexpected compiler behaviour under these aggressive optimizations. Aliasing problems can also mess up plain old optimization (when strict aliasing is enabled or not disabled when on by default). These issues generally become worse with profile guided optimization since the compiler has access to more state. To get an idea of how much state we are talking about, for our product, in the -qipa “re-link” phase of our libdb2e.a library, the compiler actually has ALL the source code available to it for the whole library in a metadata form of some sort. That’s a shitload of code. Our library, containing the bulk of our product, is now something like 100Mb large, and that is without debug symbols. It is so big that we break linkers on most new platform ports. Within the profiling information that the compiler uses for this second phase compilation, is information about what to spend time doing additional optimization of. However, just as important, this profile information is actually used by the compiler to decide what NOT to spend time doing additional optimization of. The compiler’s optimization problem is so big when it has all of the source available en-mass that, unless you want to spend 3 weeks waiting for your build to complete, the compiler needs a mechanism for its optimizations to be throttled down.

I know I haven’t said what aliasing means. Sorry about that, I’ll try to get to it now. Reading the man pages may not be that enlightening. Here, for example, is the gcc man page content for its aliasing checking options:

           This option is only active when -fstrict-aliasing is active.  It
           warns about code which might break the strict aliasing rules that
           the compiler is using for optimization. The warning does not catch
           all cases, but does attempt to catch the more common pitfalls. It
           is included in -Wall.

If you don’t already know what this means it isn’t going to be of much help. The help text for their option to enable strict aliasing is a bit better

           Allows the compiler to assume the strictest aliasing rules applica-
           ble to the language being compiled.  For C (and C++), this acti-
           vates optimizations based on the type of expressions.  In particu-
           lar, an object of one type is assumed never to reside at the same
           address as an object of a different type, unless the types are
           almost the same.  For example, an "unsigned int" can alias an
           "int", but not a "void*" or a "double".  A character type may alias
           any other type.

           Pay special attention to code like this:

                   union a_union {
                     int i;
                     double d;

                   int f() {
                     a_union t;
                     t.d = 3.0;
                     return t.i;

           The practice of reading from a different union member than the one
           most recently written to (called ``type-punning'') is common.  Even
           with -fstrict-aliasing, type-punning is allowed, provided the mem-
           ory is accessed through the union type.  So, the code above will
           work as expected.  However, this code might not:

                   int f() {
                     a_union t;
                     int* ip;
                     t.d = 3.0;
                     ip = &t.i;
                     return *ip;

           Every language that wishes to perform language-specific alias anal-
           ysis should define a function that computes, given an "tree" node,
           an alias set for the node.  Nodes in different alias sets are not
           allowed to alias.  For an example, see the C front-end function

But in all honesty, this is written by a compiler developer, and another compiler developer will probably look at this and say, “sure”. As a C or C++ developer, the real impact of this option is still probably not clear. Here’s what it boils down to.

  • You can potentially get a lot of performance improvement telling the compiler you do not alias pointers.
  • The subset of your developers who haven’t worked on the compiler probably don’t know what this means.
  • When they break the rules that they don’t know about, things blow up.

Ironically, the alias options are a lot like a cast in C.  When you cast something in the compiler you are usually instructing the compiler “I know what I am doing, so please believe me that this is what I want, and do it for me”.  The aliasing options are very much like this, and are a set of explicit instructions to the compiler that you are following the rules that it finds desirable to optimize the code.

These rules are in fact an agreement and pact to the compiler that you are not going to change the type of a pointer with a cast, or if you do then you won’t dereference the pointer.  There are certain types of casts that are allowed, but many that do not cause compilation errors are in fact not allowed when these options are, knowingly or unknowingly in effect.  Here is an example:

int32_t u = INT32_MAX ;
short * ps = (short *)&u ;

printf( "0x%hx\n", *ps ) ;

What expectation do you have for this code? The value placed in the variable u is the same in both the high and low order bits, so there’s no endian gotchas. Most people would expect 0xffff to be printed. You may be rudely suprised if you got 0x0 printed or something else, perhaps something random. You’d say “whoa, that’s one buggy compiler!”. But here’s the thing, if you’ve enabled the strict pointer aliasing options of the compiler, you’ve told it,

“Hey man, I’m friendly to your optimizer, I’d never do something as evil as casting one type to something that it is not. If I do, then I won’t actually dereference this pointer. Please go ahead and optimize away assuming something like this would never happen. If it does happen, your optimization pass should probably just treat such a thing as an impossibility that could be optimized away.”

Any strict aliasing option is pretty much the most dangerous thing you can tell the compiler, especially when the developers don’t know the rules that the compiler optimizer wants followed. A product like ours has hundreds or thousands of developers all over the world, with constant turn around and role change. Subtleties like this isn’t taught in school. When it comes down to it, code that does not fail to compile code is legal (albeit possibly buggy). If the language allows a cast, why would a subsequent dereference not be allowed?

It is the dereference that is really what is not allowed. You can cast away to your hearts content, but if the final result of that cast changes the type, then you are not allowed to dereference it. Here’s another example

void printIt( const void * const    p )
   short * ps = (short *)p ;

   printf( "0x%hx\n", *ps ) ;

void ThisIsAllowed( void )
   int16_t u = INT16_MAX ;

   printIt( &u ) ;

void ThisIsNotAllowed( void )
   int32_t u = INT32_MAX ;

   printIt( &u ) ;

Do you look at this example and also say “huuhh?” Isn’t that the whole point of a ‘void *’? You can cast something to a ‘void *’, and then you can cast it from a ‘void *’ and use it. Right?


If you talk to the compiler guru who points out that your code is buggy since you’ve agreed to follow the rules they want, then the response, rudely violating your expectations, will be something like:

“You may cast a ‘TYPE pointer’ to a ‘void *’, and then cast a ‘void *’ back to a ‘TYPE pointer’. You may not use this as an opportunity to change the type from what it was originally.”

Oh. Shit. We probably do that all the time. Does this mean all our code is probably busted?


It gets worse. Consider this fragment of code again. If those functions that dereference things are in different source code modules, then the compiler can’t know the type gets changed. Right? Wrong. Here’s things modified with a fix to workaround the issue:

 * like it or not, we believe we are given a pointer to a short
void printIt( const void * const    p )
   short * ps = (short *)p ;

   printf( "0x%hx\n", *ps ) ;


/* GOO.C 
void ThisIsAllowed( void )
   int16_t u = INT16_MAX ;

   printIt( &u ) ;

void ThisWasntAllowedButIsNowBelievedFixed( void )
   int32_t u = INT32_MAX ;

   printIt( &u ) ; // BLAH.C can't know this wasn't a short.

Not only do we lie to the compiler telling it that we follow the aliasing rules, but we also now let it do cross module optimization, even inlining non-inline functions that are called and defined in completely different places. This is why strict aliasing doesn’t get along with the even more aggressive optimization techniques.

So, you walk downstairs one floor to the compiler guys and ask how can I even find where our code is breaking the rules?

These guys do want your code to work, but they have a hard problem, and you have hard problems, and here’s what they say

“Here, I’ve got this friendly compiler option for you that will give you an error, whenever you cast anything. Does that help?”

Ah. Simply run that on millons of lines of legacy and constantly changing new code, and remove all the casting. I’ll get back to you tomorrow and let you know how that worked;)

One of the options to fix this is simple. Stop lying to the compiler. Your code probably doesn’t follow the aliasing rules, so stop telling it that it does. This is what I think we ought to do with our code, but I’m now an old jaded programmer who doesn’t have the energy to fight that battle. All those baby sacrificers in the performance team aren’t going to like loosing the free performance they are getting, and we’ve now found all the problematic parts of the code. Right? Sure, until next time.

I’m actually quite laid back writing this, so if it sounds like I am ranting, that’s not the intention. I’m just enjoying playing up the drama a bit. If I was asked what the right thing to do it would be turn off the aliasing options by default, and only turn them back on for subsets of the code identified as worthy. Then impose a requirement for strongly educating all developers in that kernel of performance sensitive code what they may or may not do, and audit that subset of the code thoroughly. That’s the right way I think, but you have to ante up the people to deal with the performance regressions this will cause, and everybody is busy working on the new features. The people who pay the salaries of the developers and cater to marketing demands have a whole different idea of what the right thing to do is, and they are also perfectly correct from their point of view.

Realistically, even if we could selectively restrict our use of these strong optimizations, I wouldn’t have high expectations that it would really work. At least not permanently. Tools are required to enforce the rules (ideally the compiler), so that things do not regress. A developer who cannot compile the code is much more likely to spend the time required to understand the rules being broken.

It is unfortunate that many of the options that are available for diagnosis are too noisy, and not granular enough. void pointer and char pointer casting are allowed in some circumstances, so alias detection options that warn about these too along with all the other bad stuff means that the bad stuff will get lost in the diagnostics barrage. When you are faced with humongous amounts of legacy code that no single person understands, and code that is only understood by people who have retired or moved on, then the only hope is incremental change, gradually increasing the scope of what is disallowed.

There is so much pressure for performance that the compiler guys (at least our compiler) now deliberately weaken their optimizations when it appears there is an aliasing rule violation. Unfortunately even this is not easy fed back to us for code change since it is in the bowels of the deepest layers of the optimizer where infrastructure to cause external and meaningful messages to the developer is not easily available. With the compiler doing this deliberate optimization weakening so that we can continue to lie to them that we follow their rules, we do squeak by.

I don’t know a good way of dealing with these sorts of problems. I suspect that the status quo, tackling these things one at a time, as they happen will be with us for a while. Every few months, usually as a product or fixpack cycle is approaching the end of the cycle we’ll hit these sort of optimization issues, and have the hard task of debugging them. The cost of this in developer time is absorbed silently and not part of any sizing of effort or time.

I’d love it if we had at least one of the platform compilers that we use deal with strict aliasing in a good way, making it an error to break the rules when we say we are following them. Then we can sleep at night knowing that we do not have a timebomb, and also get the performance that we desire.

Some references

I’ve glossed over a lot of details. There are a number of ways that type changing casts are allowed, and I haven’t said how you would fix these issues when you hit them. It’s my intention to blog on these separately some time later. Until then I’ve found a couple good tidbits that do some of what I have not.


and here:


Posted in C/C++ development and debugging. | Tagged: , , | 3 Comments »