Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

Archive for September, 2010

Unitary exponential sandwich

Posted by peeterjoot on September 27, 2010

[Click here for a PDF of this post with nicer formatting]

Motivation.

One of the chapter II exercises in [1] involves a commutator exponential sandwich of the form

\begin{aligned}e^{i F} B e^{-iF}\end{aligned} \hspace{\stretch{1}}(1.1)

where F is Hermitian. Asking about commutators on physicsforums, I was told that such sandwiches (my term) preserve expectation values, and also have a Taylor-series-like expansion involving repeated commutators. Let’s derive that commutator relationship.

Guts

Let’s expand a sandwich of this form in series, and shuffle the summation order so that we sum over all the index plane diagonals k + m = \text{constant}. That is

\begin{aligned}e^{A} B e^{-A}&=\sum_{k,m=0}^\infty \frac{1}{{k!m!}} A^k B (-A)^m \\ &=\sum_{r=0}^\infty \sum_{m=0}^r \frac{1}{{(r-m)!m!}} A^{r-m} B (-A)^m \\ &=\sum_{r=0}^\infty \frac{1}{{r!}} \sum_{m=0}^r \frac{r!}{(r-m)!m!} A^{r-m} B (-A)^m \\ &=\sum_{r=0}^\infty \frac{1}{{r!}} \sum_{m=0}^r \binom{r}{m} A^{r-m} B (-A)^m.\end{aligned}
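Here the second line just regroups the double sum along the diagonals r = k + m, using

\begin{aligned}\sum_{k=0}^\infty \sum_{m=0}^\infty f_{k,m} &= \sum_{r=0}^\infty \sum_{m=0}^{r} f_{r-m, m},\end{aligned}

with f_{k,m} = A^k B (-A)^m / (k! m!).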

Assuming that these interior sums can be written as commutators, we’ll shortly have an induction exercise. Let’s write these out for a couple values of r to get a feel for things.

r=1

\begin{aligned}\binom{1}{0} A B + \binom{1}{1} B (-A) &= \left[{A},{B}\right]\end{aligned}

r=2

\begin{aligned}\binom{2}{0} A^2 B + \binom{2}{1} A B (-A) + \binom{2}{2} B (-A)^2 &= A^2 B - 2 A B A + B A^2\end{aligned}

This compares exactly to the double commutator:

\begin{aligned}\left[{A},{\left[{A},{B}\right]}\right]&= A(A B - B A) -(A B - B A)A \\ &= A^2 B - A B A - A B A + B A^2 \\ &= A^2 B - 2 A B A + B A^2 \\ \end{aligned}

r=3

\begin{aligned}\binom{3}{0} A^3 B + \binom{3}{1} A^2 B (-A) + \binom{3}{2} A B (-A)^2 + \binom{3}{3} B (-A)^3 &= A^3 B - 3 A^2 B A + 3 A B A^2 - B A^3.\end{aligned}

And this compares exactly to the triple commutator

\begin{aligned} [A, [A, [A, B] ] ] &= A^3 B - 2 A^2 B A + A B A^2 -(A^2 B A - 2 A B A^2 + B A^3) \\ &= A^3 B - 3 A^2 B A + 3 A B A^2 -B A^3 \\ \end{aligned}

The induction pattern is clear. Let’s write the r-fold commutator as

\begin{aligned}C_r(A,B) &\equiv \underbrace{[A, [A, \cdots, [A,}_{r \text{ times}} B]] \cdots ] = \sum_{m=0}^r \binom{r}{m} A^{r-m} B (-A)^m,\end{aligned} \hspace{\stretch{1}}(2.2)

and calculate this for the r+1 case to verify the induction hypothesis. Using C_{r+1}(A,B) = \left[{A},{C_r(A,B)}\right] = A C_r(A,B) - C_r(A,B) A, together with 2.2, we have

\begin{aligned}C_{r+1}(A,B) &= \sum_{m=0}^r \binom{r}{m} \left( A^{r-m+1} B (-A)^m-A^{r-m} B (-A)^{m} A \right) \\ &= \sum_{m=0}^r \binom{r}{m} \left( A^{r-m+1} B (-A)^m+A^{r-m} B (-A)^{m+1} \right) \\ &= A^{r+1} B+ \sum_{m=1}^r \binom{r}{m} A^{r-m+1} B (-A)^m+ \sum_{m=0}^{r-1} \binom{r}{m} A^{r-m} B (-A)^{m+1} + B (-A)^{r+1} \\ &= A^{r+1} B+ \sum_{k=0}^{r-1} \binom{r}{k+1} A^{r-k} B (-A)^{k+1}+ \sum_{m=0}^{r-1} \binom{r}{m} A^{r-m} B (-A)^{m+1} + B (-A)^{r+1} \\ &= A^{r+1} B+ \sum_{k=0}^{r-1} \left( \binom{r}{k+1} + \binom{r}{k} \right) A^{r-k} B (-A)^{k+1}+ B (-A)^{r+1} \\ \end{aligned}

We now have to sum those binomial coefficients. I like the search and replace technique for this: pick two visibly distinct numbers for r and k that are easy to manipulate without abstract confusion. How about r=7 and k=3? Using those we have

\begin{aligned}\binom{7}{3+1} + \binom{7}{3} &=\frac{7!}{(3+1)!(7-3-1)!}+\frac{7!}{3!(7-3)!} \\ &=\frac{7!(7-3)}{(3+1)!(7-3)!}+\frac{7!(3+1)}{(3+1)!(7-3)!} \\ &=\frac{7! \left( 7-3 + 3 + 1 \right) }{(3+1)!(7-3)!} \\ &=\frac{(7+1)! }{(3+1)!((7+1)-(3+1))!}.\end{aligned}

Straight text replacement of 7 and 3 with r and k respectively now gives the harder to follow, but more general identity

\begin{aligned}\binom{r}{k+1} + \binom{r}{k} &=\frac{r!}{(k+1)!(r-k-1)!}+\frac{r!}{k!(r-k)!} \\ &=\frac{r!(r-k)}{(k+1)!(r-k)!}+\frac{r!(k+1)}{(k+1)!(r-k)!} \\ &=\frac{r! \left( r-k + k + 1 \right) }{(k+1)!(r-k)!} \\ &=\frac{(r+1)! }{(k+1)!((r+1)-(k+1))!} \\ &=\binom{r+1}{k+1}\end{aligned}

For our commutator we now have

\begin{aligned}C_{r+1}(A,B) &= A^{r+1} B+ \sum_{k=0}^{r-1} \binom{r+1}{k+1} A^{r-k} B (-A)^{k+1} + B (-A)^{r+1} \\ &= A^{r+1} B+ \sum_{s=1}^{r} \binom{r+1}{s} A^{r+1-s} B (-A)^{s} + B (-A)^{r+1} \\ &= \sum_{s=0}^{r+1} \binom{r+1}{s} A^{r+1-s} B (-A)^{s} \qquad\square\end{aligned}

That completes the inductive proof and allows us to write

\begin{aligned}e^A B e^{-A}&=\sum_{r=0}^\infty \frac{1}{{r!}} C_{r}(A,B),\end{aligned} \hspace{\stretch{1}}(2.3)

Or, in explicit form

\begin{aligned}e^A B e^{-A}&=B + \frac{1}{{1!}} \left[{A},{B}\right]+ \frac{1}{{2!}} \left[{A},{\left[{A},{B}\right]}\right]+ \cdots\end{aligned} \hspace{\stretch{1}}(2.4)
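Returning to the motivating Hermitian form 1.1, setting A = i F gives

\begin{aligned}e^{i F} B e^{-i F}&=B + i \left[{F},{B}\right]+ \frac{i^2}{2!} \left[{F},{\left[{F},{B}\right]}\right]+ \frac{i^3}{3!} \left[{F},{\left[{F},{\left[{F},{B}\right]}\right]}\right]+ \cdots\end{aligned}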

References

[1] BR Desai. Quantum mechanics with basic field theory. Cambridge University Press, 2009.

Posted in Math and Physics Learning.

AIX function pointer trap notes

Posted by peeterjoot on September 15, 2010

The DB2 product is a massively complex system. If there is a software problem in either a development or a customer environment, there is a good chance that it will never be reproduced again. We’ve spent years incrementally building a cross-platform post mortem debugging facility where we collect and log just about everything we can think of, with the aim of being able to figure things out after the fact. In some cases this includes information that would also be available in core files, so one could perhaps wonder why we’d ever look at that info in development, but it is useful internally too. Core files for a large system like DB2, where we have GBs of memory mapped, are huge and unwieldy. They get lost on the machines they are dumped on, and often have to be disabled, especially for automated test runs (where one optimistically hopes the test will be successful).

Here’s a fragment from the disassembly listing for a trap that one such post mortem dump (“trapfile”) contained:

     0x0900000013473BCC : 38C100A0 addi	r6,r1,160
     0x0900000013473BD0 : 38A10098 addi	r5,r1,152
     0x0900000013473BD4 : E90C0000 ld	r8,0(r12)
     0x0900000013473BD8 : 7D0903A6 mtctr	r8
     0x0900000013473BDC : F8410028 std	r2,40(r1)
     0x0900000013473BE0 : E96C0010 ld	r11,16(r12)
     0x0900000013473BE4 : E84C0008 ld	r2,8(r12)
     0x0900000013473BE8 : 4E800421 bctrl                     # 20,bit0
>>>> 0x0900000013473BEC : E8410028 ld	r2,40(r1)
     0x0900000013473BF0 : 90610088 stw	r3,136(r1)
     0x0900000013473BF4 : E861008A lwa	r3,136(r1)
     0x0900000013473BF8 : 2C030000 cmpi	cr0,r3,0
     0x0900000013473BFC : 41820050 beq        cr0,0x13473C4C # 12,bit2
     0x0900000013473C00 : E861008A lwa	r3,136(r1)

However, this was for a SIGILL. How would we ever get a SIGILL with a load from gr1? Looking at the registers in question, it appears that we don’t correctly identify the trapping instruction, but have done something that is pretty close:

    IAR: 0000000000000000     MSR: A00000000008D032      LR: 0900000013473BEC
    CTR: 0000000000000000     XER: 00000010           FPSCR: A2208000
     CR: 24000224
GPR[00]: 0000000000000080 GPR[01]: 070000003B7FE060 GPR[02]: 0000000000000000 
GPR[03]: 00000001112157F0 GPR[04]: 0000000000004E20 GPR[05]: 070000003B7FE0F8 
GPR[06]: 070000003B7FE100 GPR[07]: 070000003B7FE0EC GPR[08]: 0000000000000000 
GPR[09]: 0000000000000000 GPR[10]: 0000000000000000 GPR[11]: 0000000000000000 
GPR[12]: 00000000000000A0 GPR[13]: 0000000111237800 GPR[14]: 0000000000000000 
GPR[15]: 0000000000000000 GPR[16]: 0000000000000000 GPR[17]: 0000000000000000 
GPR[18]: 0000000000000000 GPR[19]: 0000000000000000 GPR[20]: 0000000000000000 
GPR[21]: 0000000000000000 GPR[22]: 0000000000000000 GPR[23]: 0000000000000000 
GPR[24]: 0000000000000000 GPR[25]: 0000000000000000 GPR[26]: 0000000000000000 
GPR[27]: 0000000000000000 GPR[28]: FFFFFFFFCBCB0000 GPR[29]: 09001000A17B1A98 

Observe that the Instruction Address Register (IAR) is zero, and that we have identified the LR (link register) address as the location of the trap. Basically we have jumped to a zero address, probably via a function pointer call, and trapped there. LR is set by that branch and link (bctrl: branch and link to the CTR (count register)). We don’t truly know that no other instructions were executed between the bctrl and our current IAR=0 point, but looking at the other registers gives a good hint that this is likely the case. Let’s look at the assembly listing for a NULL function pointer call and see what happens:

void goo( void (*blah)(void) )
{
   blah() ;
}

The disassembly for this (edited and annotated) is:

(dbx) listi goo
(goo)      mflr   r0             ; copy LR to gr0 (ie: save our current return address before the function call).
(goo+0x4)  ld     r11,0x10(r3)   ; something of interest is apparently found 16 bytes into the memory addressed by blah!
(goo+0x8)  stdu   r1,-112(r1)    ; allocate more stack space
(goo+0xc)  std    r0,0x80(r1)    ; stack spill of the LR copy
(goo+0x10) std    r2,0x28(r1)    ; stack spill of the TableOfContents register (in case this is an out of module call)
(goo+0x14) ld     r0,0x0(r3)     ; the function pointer appears to be in the memory pointed to by blah
(goo+0x18) mtctr  r0             ; save this to the CTR register (the only branch to register mechanism on PowerPC)
(goo+0x1c) ld     r2,0x8(r3)     ; load the TOC register for this function pointer call (could be different for out of module call)
(goo+0x20) bctrl                 ; the actual function pointer "call" finally.
(goo+0x24) ld     r2,0x28(r1)    ; restore the TOC
(goo+0x28) ld     r12,0x80(r1)   ; grab our original LR address for the return from this function.
(goo+0x2c) addi   r1,0x70(r1)    ; deallocate the stack space we used.
(goo+0x30) mtlr   r12            ; copy back our original LR (from a temp var ; must not have a way to do this directly from addr -> LR)
(goo+0x34) blr                   ; return to our caller.

Wow, that’s a lot of setup and teardown for an innocent function pointer call!
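What those loads off of blah are fetching makes more sense once you know that on 64-bit AIX a function pointer doesn’t point at code; it points at a function descriptor. A rough sketch of that layout (the struct and field names here are my own; only the offsets matter, and they match the loads in the listing above) is:

struct FunctionDescriptor
{
   void *entryPoint ;   /* offset 0  : address of the code ; loaded, then moved to CTR for the bctrl */
   void *tocAnchor ;    /* offset 8  : TOC base for the callee's module ; loaded into r2             */
   void *environment ;  /* offset 16 : environment pointer ; loaded into r11                         */
} ;

So a bogus function pointer doesn’t necessarily trap on the descriptor loads at all; the failure only shows up when the bctrl transfers control to whatever garbage ended up in CTR.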

Now we can make some more sense of the disassembly fragment in the trap file

     0x0900000013473BD4 : E90C0000 ld     r8,0(r12)      ; r12=0xA0 (a bad address, close to but not exactly NULL).  We dereference this
     0x0900000013473BD8 : 7D0903A6 mtctr  r8             ; address 0xA0 appears to contain ZERO, and we copy this to CTR 
     0x0900000013473BDC : F8410028 std    r2,40(r1)      ; save our current TOC to the stack
     0x0900000013473BE0 : E96C0010 ld     r11,16(r12)    ; set up r11 for whatever reason.
     0x0900000013473BE4 : E84C0008 ld     r2,8(r12)      ; set the new TOC register for the call.
     0x0900000013473BE8 : 4E800421 bctrl                 ; branch to IAR=0 for the SIGILL.

So we appear to have had a load from a near-NULL pointer (and on AIX this, and similar exactly-NULL evil pointer dereferencing, “works” in general for reads, since low memory is readable; observe that the loads from 0xA0 above returned zeros rather than trapping). Probably somebody took the address of a member contained in a NULL object (ie: effectively computing an offsetof() == 0xA0), and then made a function pointer call through that address (or something like that).
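One hypothetical way (invented code, but consistent with r12 = 0xA0 in the registers above) that such a call could come about is taking the address of a member of a NULL object and then treating that address itself as the function pointer:

/* Invented types, purely to illustrate the suspected failure mode. */
struct vendorBlock
{
   char reserved[0xA0] ;          /* members occupying the first 0xA0 bytes    */
   void (*callback)( void ) ;     /* member that happens to sit at offset 0xA0 */
} ;

void invoke( struct vendorBlock *blk )
{
   /* With blk == NULL, &blk->callback evaluates to (void *)0xA0.  Casting that
      address itself to a function pointer (instead of reading the member) means
      the descriptor loads read zeros from low memory, CTR ends up zero, and the
      bctrl lands at IAR = 0, matching the register dump above. */
   void (*fp)( void ) = (void (*)( void )) &blk->callback ;

   fp() ;
}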

Posted in C/C++ development and debugging.

Microbenchmarking to measure costs of mutual exclusion.

Posted by peeterjoot on September 11, 2010

Some previous posts have discussed implementations of mutual exclusion code for a few hardware platforms. PowerPC is an interesting case due to the possibility of unordered memory accesses. I wanted to get a rough idea of the relative costs of an atomic sequence of instructions, plus the memory barriers that are required to implement a mutex. This is surprisingly tricky to measure, and I thought I’d start with something that would give me the absolute minimum cost: how much time does it take to repeatedly do a pair of atomic operations in a single threaded, completely uncontended fashion, with and without memory barriers? A fragment of code that illustrates this is:


inline long atomicCompareAndIncrement( volatile long * a, long c, long v )
{
   long old = c ;

   /* xlC builtin: atomically compare *a with old and, if equal, store (c + v) into *a ;
      either way old ends up holding the value that was observed in *a. */
   __compare_and_swaplp( a, &old, c + v ) ;

   return old ;
}

// ...
   else if ( doIsync && doLwsync )
   {
      for ( long i = 0 ; i < n ; i++ )
      {
         v = atomicCompareAndIncrement( &a, i, 1 ) ;
         __isync( ) ;
         __lwsync( ) ;
         v = atomicCompareAndIncrement( &a, i, 1 ) ;
      }
   }

The compiler, presumably because it knows that its builtins have side effects, does not optimize away this loop. Unlike me, it likely also doesn’t assume that there are no other threads (although it could probably figure that out, since I have only a single main function).

I wanted to compare this sort of loop with straight load and store pairs that have the same effect, and then with a loop of just the lowest-cost instructions: arithmetic ones. These last are tricky to get because the compiler optimizes away the loop. For example, the following code is almost completely optimized away:

long plainOldIncrementFunction( long v1, long v2 )
{
   return v1 + v2 ;
}
// ...
   else if ( impossibleOption )
   {
      /* guessing this will be completely optimized away ... yup. */
      for ( long i = 0 ; i < n ; i++ )
      {
         v = plainOldIncrementFunction( v, 1 ) ;
         v = plainOldIncrementFunction( v, 1 ) ;
      }
   }

Declaring the function non-inline doesn’t help. The compiler does generate a copy of this function body, but it is never called in main(), and the whole loop is missing in action when you look at the assembly listing.

I used a trick that I’d previously used, making use of xlC’s inline assembly construct, the mc_func (machine code function). In case you haven’t seen it before, this is a brute force way of getting specific desired instructions into the code, as if they were function calls. Here are the ones that I created for this:

/*
long loadAndAddAndStore( long * a, long i )
{
   long v = *a ;

   *a = v + i ;

   return v ;
}
*/
extern "C" long loadAndAddAndStore( volatile long * a, long v ) ;
#pragma mc_func loadAndAddAndStore { \
"e8030000" /*     ld r0,0(r3)          */ \
"7c802214" /*     add   r4,r0,r4       */ \
"f8830000" /*     std   r4,0(r3)       */ \
"60030000" /*     ori   r3,r0,0x0000   */ \
}
#pragma reg_killed_by loadAndAddAndStore gr0,gr3,gr4

/*
long plainOldIncrement( long v1, long v2 )
{
   return v1 + v2 ;
}
*/
extern "C" long plainOldIncrement( long v1, long v2 ) ;
#pragma mc_func plainOldIncrement { \
"7c632214" /*     add   r3,r3,r4       */ \
}
#pragma reg_killed_by plainOldIncrement gr3

There are three parts to each mc_func. One is the equivalent prototype, a fake function call interface that specifies the inputs and outputs of the function. One has to know the PowerPC calling convention to use this: specifically, input parameters (up to eight of them) come in registers gr3,gr4,…, and the output is returned in gr3. The next part of an mc_func is the set of opcodes for the machine code that you want in the generated code. Note that the comments included in this sequence (like: add r3,r3,r4) are NOT part of the mc_func. I just included those for my own benefit. As far as the compiler is concerned, the mc_func provides only a set of opcodes, so something like:

#pragma mc_func loadAndAddAndStore { "e8030000" "7c802214" "f8830000" "60030000" }

is no different from my “readable” version. Isn’t that the ultimate in an unfriendly user interface? It is probably deliberately hard to use, as a means to discourage the evil developer from even trying. Fortunately, there are sneaky ways to generate the opcodes. I dropped the C code versions of the interfaces I wanted into a separate module and compiled that with xlC -S. I was then able to run ‘as -a64 -l’ on the generated .s file to produce a listing with the desired opcodes, and cut and paste those into the mc_func blocks I desired.

I was able to use this bit of hackery to construct the mc_func’s that were desired. Since the compiler doesn’t know what’s in them, and only knows that they clobber the registers listed in the reg_killed_by constraints (the third part of each mc_func), it doesn’t optimize them away.
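For completeness, here’s roughly how these mc_func versions slot into the benchmark loops (the option names here are invented; the rest of the harness isn’t shown in this post):

   else if ( doPlainLoadStore )       /* hypothetical option name */
   {
      for ( long i = 0 ; i < n ; i++ )
      {
         /* same shape as the atomic loop, but with plain (non-atomic) load/add/store pairs */
         v = loadAndAddAndStore( &a, 1 ) ;
         v = loadAndAddAndStore( &a, 1 ) ;
      }
   }
   else if ( doPlainArithmetic )      /* hypothetical option name */
   {
      for ( long i = 0 ; i < n ; i++ )
      {
         /* just the adds, with no loads or stores at all */
         v = plainOldIncrement( v, 1 ) ;
         v = plainOldIncrement( v, 1 ) ;
      }
   }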

I was able to get some basic measurements this way. This answered the question of how much more expensive uncontended atomics are than plain old memory operations, and what minimal additional cost the isync and lwsync instructions introduce. The isync and lwsync didn’t appear to add much cost in this uncontended scenario, and that’s perhaps not too surprising since I didn’t put in any other memory operations that have to be ordered by those barriers.

The measurements I made are fairly artificial and unsatisfactory, mostly because there is no other cache access, especially no remote cache access. That can get really expensive, and remote caches can behave in a very NUMA-like fashion. The cost of going to the L2 cache on the same board (CAC is what I think AIX calls these) is much less than going to one of the remote caches. How could these sorts of costs be measured systematically? That I don’t know. I’d bet that when multiple cachelines enter the picture, especially contended ones, that is when you also start to see the costs of the memory barrier instructions. It is interesting to see how well those costs can be optimized away in the uncontended scenario!

Posted in C/C++ development and debugging.

A desire to understand init owned defunct processes.

Posted by peeterjoot on September 2, 2010

On one machine I’m using, I’ve currently got 104 defunct processes owned by my userid, and the system in question has 1122 of them in total. It will probably have to be rebooted, because there’s no indication that the parents of the processes in question still exist, so whatever caused this is somehow non-standard.

The usual way of getting a defunct process is when a child dies or exits and its parent has not yet waited for it.

This is simple enough to demonstrate:

#include <unistd.h>
#include <stdio.h>

int main()
{
   pid_t p = fork() ;

   if ( p )
   {
      printf( "parent: forked: %d.  sleep(10).\n", (int)p ) ;

      sleep( 10 ) ;
   }
   else
   {
      printf( "child: exit().\n" ) ;
   }

   return 0 ;
}

Executing this I see:

# a.out
child: exit().
parent: forked: 2270.  sleep(10).

and in the ps output:

# ps -ef | grep peeterj | grep a.out | grep -v grep
peeterj   2270  2269  0 13:21 pts/8    00:00:00 [a.out] <defunct>
peeterj   2269  5639  0 13:21 pts/8    00:00:00 a.out

However, once that parent exits, the orphaned child is re-parented to init, which reaps it, and the defunct process is gone. What is different about a persistent defunct process that appears to stay that way even with the death of the parent?
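Before digging into that, for contrast here’s a minimal sketch (my addition, not part of the original demo) of the parent reaping the child itself with waitpid(); a parent that does this promptly never leaves even a transient defunct entry behind:

#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>

int main()
{
   pid_t p = fork() ;

   if ( p )
   {
      int status = 0 ;

      /* block until the child exits, collecting its exit status ;
         this removes the <defunct> entry from the process table */
      if ( waitpid( p, &status, 0 ) == p )
      {
         printf( "parent: reaped child %d, exit status %d.\n", (int)p, WEXITSTATUS( status ) ) ;
      }
   }
   else
   {
      printf( "child: exit().\n" ) ;
   }

   return 0 ;
}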

I’ve suspected clearcase of being a factor a few times, but don’t have much to base that belief on. On the system where I’ve currently got lots of these undead beasties, I notice that they are parented by init:

peeterj  29499     1  0 Sep01 ?        00:00:00 [db2vend] 
peeterj  31001     1  0 Aug26 ?        00:00:00 [db2fmp] 

db2vend uses pipes, which could be a factor, but I don’t think that db2fmp uses pipes (instead using shared memory for its IPC).

What is it about such a process that makes init, the new parent, unable to reap the wait status of that undead child?

Posted in C/C++ development and debugging.