Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

Posts Tagged ‘SIGILL’

AIX function pointer trap notes

Posted by peeterjoot on September 15, 2010

The DB2 product is a massively complex system. If there is a software problem in either a development or a customer environment, there is a good chance that it will never be reproduced again. We’ve spent years incrementally building a cross-platform post mortem debugging facility where we collect and log just about everything we can think of, with the aim of being able to figure it out after the fact. In some cases this includes information that could be available in core files, so one could perhaps wonder why we’d ever look at that info in development, but it is useful internally too. Core files for a large system like DB2, where we have GBs of memory mapped, get lost on the machines they are dumped on, and often have to be disabled, especially for automated test runs (where one optimistically hopes the test will be successful).

Here’s a fragment from the disassembly listing for a trap that one such post mortem dump (“trapfile”) contained:

     0x0900000013473BCC : 38C100A0 addi	r6,r1,160
     0x0900000013473BD0 : 38A10098 addi	r5,r1,152
     0x0900000013473BD4 : E90C0000 ld	r8,0(r12)
     0x0900000013473BD8 : 7D0903A6 mtctr	r8
     0x0900000013473BDC : F8410028 std	r2,40(r1)
     0x0900000013473BE0 : E96C0010 ld	r11,16(r12)
     0x0900000013473BE4 : E84C0008 ld	r2,8(r12)
     0x0900000013473BE8 : 4E800421 bctrl                     # 20,bit0
>>>> 0x0900000013473BEC : E8410028 ld	r2,40(r1)
     0x0900000013473BF0 : 90610088 stw	r3,136(r1)
     0x0900000013473BF4 : E861008A lwa	r3,136(r1)
     0x0900000013473BF8 : 2C030000 cmpi	cr0,r3,0
     0x0900000013473BFC : 41820050 beq        cr0,0x13473C4C # 12,bit2
     0x0900000013473C00 : E861008A lwa	r3,136(r1)

However, this was for a SIGILL. How would we ever get a SIGILL with a load from gr1? Looking at the registers in question, it appears that we don’t correctly identify the trapping instruction, but have done something that is pretty close:

    IAR: 0000000000000000     MSR: A00000000008D032      LR: 0900000013473BEC
    CTR: 0000000000000000     XER: 00000010           FPSCR: A2208000
     CR: 24000224
GPR[00]: 0000000000000080 GPR[01]: 070000003B7FE060 GPR[02]: 0000000000000000 
GPR[03]: 00000001112157F0 GPR[04]: 0000000000004E20 GPR[05]: 070000003B7FE0F8 
GPR[06]: 070000003B7FE100 GPR[07]: 070000003B7FE0EC GPR[08]: 0000000000000000 
GPR[09]: 0000000000000000 GPR[10]: 0000000000000000 GPR[11]: 0000000000000000 
GPR[12]: 00000000000000A0 GPR[13]: 0000000111237800 GPR[14]: 0000000000000000 
GPR[15]: 0000000000000000 GPR[16]: 0000000000000000 GPR[17]: 0000000000000000 
GPR[18]: 0000000000000000 GPR[19]: 0000000000000000 GPR[20]: 0000000000000000 
GPR[21]: 0000000000000000 GPR[22]: 0000000000000000 GPR[23]: 0000000000000000 
GPR[24]: 0000000000000000 GPR[25]: 0000000000000000 GPR[26]: 0000000000000000 
GPR[27]: 0000000000000000 GPR[28]: FFFFFFFFCBCB0000 GPR[29]: 09001000A17B1A98 

Observe that the Instruction Address Register (IAR) is zero, and that we have identified the LR (link register) address as the location of the trap. Basically we have jumped to a zero address, probably via a function pointer call, and trapped there. LR is set by that branch and link (bctrl: branch to the address in CTR, the count register, and link). We don’t truly know that no other instructions were executed between the bctrl and our current IAR=0 point, but looking at the other registers gives a good hint that this is likely the case. Let’s look at the assembly listing for a NULL function pointer call and see what happens:

void goo( void (*blah)(void) )
{
   blah() ;
}

The disassembly for this (edited and annotated) is:

(dbx) listi goo
(goo)      mflr   r0             ; copy LR to gr0 (ie: save our current return address before the function call).
(goo+0x4)  ld     r11,0x10(r3)   ; something of interest is apparently found 16 bytes into the memory addressed by blah!
(goo+0x8)  stdu   r1,-112(r1)    ; allocate more stack space
(goo+0xc)  std    r0,0x80(r1)    ; stack spill of the LR copy
(goo+0x10) std    r2,0x28(r1)    ; stack spill of the TableOfContents register (in case this is an out of module call)
(goo+0x14) ld     r0,0x0(r3)     ; the function pointer appears to be in the memory pointed to by blah
(goo+0x18) mtctr  r0             ; copy this to the CTR register (CTR and LR are the only branch-to-register targets on PowerPC)
(goo+0x1c) ld     r2,0x8(r3)     ; load the TOC register for this function pointer call (could be different for out of module call)
(goo+0x20) bctrl                 ; the actual function pointer "call" finally.
(goo+0x24) ld     r2,0x28(r1)    ; restore the TOC
(goo+0x28) ld     r12,0x80(r1)   ; grab our original LR address for the return from this function.
(goo+0x2c) addi   r1,0x70(r1)    ; deallocate the stack space we used.
(goo+0x30) mtlr   r12            ; copy back our original LR (from a temp var ; must not have a way to do this directly from addr -> LR)
(goo+0x34) blr                   ; return to our caller.

Wow, that’s a lot of setup and take down for an innocent function pointer call!
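The three loads at offsets 0x0, 0x8 and 0x10 in the listing are consistent with an AIX function pointer addressing a three-word function descriptor rather than the code itself. Here is a minimal sketch of that layout, assuming 64-bit words. The struct, its field names, and branchTarget() are hypothetical, chosen only for illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of what a 64-bit AIX function pointer points at.
// Field names are illustrative; only the offsets come from the listing.
struct FunctionDescriptor {
    std::uint64_t entry;        // ld r0,0x0(r3), then mtctr r0
    std::uint64_t toc;          // ld r2,0x8(r3): TOC for the callee
    std::uint64_t environment;  // ld r11,0x10(r3)
};

// The offsets used by the three loads in the disassembly.
static_assert(offsetof(FunctionDescriptor, entry) == 0x0, "entry at 0");
static_assert(offsetof(FunctionDescriptor, toc) == 0x8, "TOC at 8");
static_assert(offsetof(FunctionDescriptor, environment) == 0x10, "env at 16");

// The address that bctrl would branch to for a given descriptor.
inline std::uint64_t branchTarget(const FunctionDescriptor &descriptor) {
    return descriptor.entry;
}
```

For a descriptor whose first word is zero, branchTarget() yields 0, which is the situation the IAR=0 in the register dump suggests.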

Now we can make some more sense of the disassembly fragment in the trap file:

     0x0900000013473BD4 : E90C0000 ld     r8,0(r12)      ; r12=0xA0 (a bad address, close to but not exactly NULL).  We dereference this
     0x0900000013473BD8 : 7D0903A6 mtctr  r8             ; address 0xA0 appears to contain ZERO, and we copy this to CTR 
     0x0900000013473BDC : F8410028 std    r2,40(r1)      ; save our current TOC to the stack
     0x0900000013473BE0 : E96C0010 ld     r11,16(r12)    ; set up r11 for whatever reason.
     0x0900000013473BE4 : E84C0008 ld     r2,8(r12)      ; set the new TOC register for the call.
     0x0900000013473BE8 : 4E800421 bctrl                 ; branch to IAR=0 for the SIGILL.

So we appear to have had a load from a near-NULL pointer (on AIX, loads through this and similar near-NULL or exactly-NULL evil pointers work in general, since low memory is readable). A NULL pointer dereference taking the address of a contained member was probably done (i.e. to get at offsetof() == 0xA0), and then we had a function pointer call from that address (or something like that).
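To illustrate how a value like 0xA0 could come out of a NULL base pointer, here is a small sketch. ControlBlock and memberAddress() are hypothetical stand-ins, not DB2 types, and the arithmetic is done on integers so that nothing is actually dereferenced:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical structure whose callback member sits 0xA0 bytes in.
struct ControlBlock {
    char otherState[0xA0];   // stand-in for whatever precedes the member
    void (*callback)();      // member of interest
};

// Address the member would have for a given base address.
// With base == 0 (a NULL structure pointer) this is just offsetof().
inline std::uintptr_t memberAddress(std::uintptr_t base) {
    return base + offsetof(ControlBlock, callback);
}
```

Here memberAddress(0) evaluates to 0xA0, the same small, non-zero address that showed up in r12.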

Posted in C/C++ development and debugging.

A fun and curious dig. GCC generation of a ud2a instruction (SIGILL)

Posted by peeterjoot on May 26, 2010

Recently some of our code started misbehaving only when compiled with the GCC compiler. Our post mortem stacktrace and data collection tools didn’t deal with this trap very gracefully, and dealing with that (or even understanding it) is a different story.

What I see in the debugger once I find the guilty thread is:

(gdb) thread 12
[Switching to thread 12 (Thread 46970517317952 (LWP 30316))]#0  0x00002ab824438ec1 in __gxx_personality_v0 ()
    at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
351     ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/ No such file or directory.
        in ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
(gdb) where
#0  0x00002ab824438ec1 in __gxx_personality_v0 ()
    at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
#1  0x00002ab824438cc9 in sleep () from /lib64/
#2  0x00002ab8203090ee in sqloEDUSleepHandler (signum=20, sigcode=0x2ab82cffa0c0, scp=0x2ab82cff9f90)
    at sqloinst.C:283
#3  <signal handler called>
#4  0x00002ab81cf03231 in __gxx_personality_v0 ()
    at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
#5  0x00002ab823b9b745 in ossSleep () from /home/hotel74/peeterj/sqllib/lib64/
#6  0x00002ab821206992 in pdInvokeCalloutScript () at /view/peeterj_m19/vbs/engn/include/sqluDMSort_inlines.h:158
#7  0x00002ab82030fe99 in sqloEDUCodeTrapHandler (signum=4, sigcode=0x2ab82cffcc60, scp=0x2ab82cffcb30)
    at sqloedu.C:4476
#8  <signal handler called>
#9  0x00002ab821393257 in sqluInitLoadEDU (pPrivateACBIn=0x2059e0080, ppPrivateACBOut=0x2ab82cffd320,
    puchAuthID=0x2ab8fcef19b8 "PEETERJ ", pNLSACB=0x2ab8fceea168, pComCB=0x2ab8fceea080, pMemPool=0x2ab8fccca2d0)
    at sqluedus.C:1696
#10 0x00002ab8212d34c2 in sqluldat (pArgs=0x2ab82cffdef0 "", argsSize=96) at sqluldat.C:737
#11 0x00002ab820310ced in sqloEDUEntry (parms=0x2ab82f3e9680) at sqloedu.C:3438
#12 0x00002ab81cefc143 in start_thread () from /lib64/
#13 0x00002ab82446674d in clone () from /lib64/
#14 0x0000000000000000 in ?? ()

Observe that there are two sets of “<signal handler called>” frames. One from the original SIGILL, and another one that our “main” thread ends up sending to all the rest of the threads as part of our process for freezing things to be able to take a peek and see what’s up.
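The handler arguments in the trace (signum, sigcode, scp) have the shape of a POSIX SA_SIGINFO handler. As a rough sketch of how such a handler gets at the siginfo_t, assuming a POSIX system; recordTrap(), installTrapRecorder() and the globals are hypothetical names, not the DB2 handlers:

```cpp
#include <csignal>
#include <cstdint>
#include <signal.h>

// Hypothetical globals that the handler fills in for later inspection.
volatile std::sig_atomic_t g_trapSignal = 0;
volatile std::sig_atomic_t g_trapCode = 0;
volatile std::uintptr_t g_trapAddress = 0;

// Three-argument handler form selected by SA_SIGINFO.
// Sketch only: a real trap handler would not simply return after a
// hardware SIGILL, since the faulting instruction would run again.
extern "C" void recordTrap(int signum, siginfo_t *info, void * /*context*/) {
    g_trapSignal = signum;
    g_trapCode = info->si_code;
    // For a hardware fault, si_addr is the faulting instruction address.
    g_trapAddress = reinterpret_cast<std::uintptr_t>(info->si_addr);
}

// Install the recorder for SIGILL; returns true on success.
bool installTrapRecorder() {
    struct sigaction action = {};
    action.sa_sigaction = recordTrap;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    return sigaction(SIGILL, &action, nullptr) == 0;
}
```

After installTrapRecorder() succeeds, a SIGILL delivered to the process leaves the signal number, si_code and si_addr in the three globals.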

Looking at the siginfo_t for the SIGILL handler we have:

(gdb) frame 7
#7  0x00002ab82030fe99 in sqloEDUCodeTrapHandler (signum=4, sigcode=0x2ab82cffcc60, scp=0x2ab82cffcb30)
    at sqloedu.C:4476
4476    sqloedu.C: No such file or directory.
        in sqloedu.C
(gdb) p *sigcode
$4 = {si_signo = 4, si_errno = 0, si_code = 2, _sifields = {_pad = {557396567, 10936, 0, 0, 1, 16777216,
      -1170923664, 10936, 754961616, 10936, 599153081, 10936, 0, 0, 15711488, 10752, 4, 0, -1170923664, 10936, 1, 0,
      0, 0, 754961680, 10936, 4292335, 0}, _kill = {si_pid = 557396567, si_uid = 10936}, _timer = {
      si_tid = 557396567, si_overrun = 10936, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {
      si_pid = 557396567, si_uid = 10936, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {
      si_pid = 557396567, si_uid = 10936, si_status = 0, si_utime = 72057594037927937, si_stime = 46972886392688},
    _sigfault = {si_addr = 0x2ab821393257}, _sigpoll = {si_band = 46970319745623, si_fd = 0}}}
(gdb) p /x *sigcode
$5 = {si_signo = 0x4, si_errno = 0x0, si_code = 0x2, _sifields = {_pad = {0x21393257, 0x2ab8, 0x0, 0x0, 0x1,
      0x1000000, 0xba351f70, 0x2ab8, 0x2cffccd0, 0x2ab8, 0x23b659b9, 0x2ab8, 0x0, 0x0, 0xefbd00, 0x2a00, 0x4, 0x0,
      0xba351f70, 0x2ab8, 0x1, 0x0, 0x0, 0x0, 0x2cffcd10, 0x2ab8, 0x417eef, 0x0}, _kill = {si_pid = 0x21393257,
      si_uid = 0x2ab8}, _timer = {si_tid = 0x21393257, si_overrun = 0x2ab8, si_sigval = {sival_int = 0x0,
        sival_ptr = 0x0}}, _rt = {si_pid = 0x21393257, si_uid = 0x2ab8, si_sigval = {sival_int = 0x0,
        sival_ptr = 0x0}}, _sigchld = {si_pid = 0x21393257, si_uid = 0x2ab8, si_status = 0x0,
      si_utime = 0x100000000000001, si_stime = 0x2ab8ba351f70}, _sigfault = {si_addr = 0x2ab821393257}, _sigpoll = {
      si_band = 0x2ab821393257, si_fd = 0x0}}}

This has got the si_addr value 0x00002AB821393257, which also matches frame 9 in the stack for sqluInitLoadEDU. What was at that line of code doesn’t appear to be something that ought to generate a SIGILL:

   1693    // Set current activity in private agent CB to
   1694    // point to the activity that the EDU is working
   1695    // on behalf of.
   1696    pPrivateACB->agtRqstCB.pActivityCB = pComCB->my_curr_activity_entry;
   1697 #ifdef DB2_DEBUG
   1698    { //!!  This debug code is only useful in conjunction with a trap described by W749645
   1699       char mesg[500];
   1700       sprintf(mesg,"W749645:uILE pPr->agtR=%p ->pAct=%p",pPrivateACB->agtRqstCB,pPrivateACB->agtRqstCB.pActivityCB);
   1701       sqlt_logerr_str(SQLT_SQLU, SQLT_sqluInitLoadEDU, __LINE__, mesg, NULL, 0, SQLT_FFSL_INF);
   1702    } //!!
   1703 #endif

So what is going on? Let’s look at the assembly for the trapping instruction address. Using ‘(gdb) set logging on’, and ‘(gdb) disassemble’ we find:

0x00002ab82139323e : mov    0xfffffffffffffd68(%rbp),%rax
0x00002ab821393245 : mov    0x6498(%rax),%rdx
0x00002ab82139324c : mov    0xffffffffffffffb0(%rbp),%rax
0x00002ab821393250 : mov    %rdx,0x5bd0(%rax)
0x00002ab821393257 : ud2a
0x00002ab821393259 : cmpl   $0x0,0xffffffffffffffac(%rbp)
0x00002ab82139325f : mov    0xfffffffffffffd80(%rbp),%rdi
0x00002ab821393266 : callq  0x2ab81dcd4218 
0x00002ab82139326b : mov    0xffffffffffffffd8(%rbp),%rax
0x00002ab82139326f : and    $0x82,%eax
0x00002ab821393274 : test   %rax,%rax

Hmm. What is a ud2a instruction? Google is our friend and we find that the Linux kernel uses this as a “guaranteed invalid instruction”. It is used to fault the processor and halt the kernel in case you did something really really bad.

Other similar references can be found, also explaining the use in the Linux kernel. So what is this doing in userspace code? It seems too specific to get there by accident, and since the instruction stream itself contains it, stack corruption or any other sneaky nasty mechanism doesn’t seem likely. The instruction doesn’t immediately follow a callq, so a runtime loader malfunction or something else equally odd doesn’t seem likely either.

Perhaps the compiler put this instruction into the code for some reason. A compiler bug perhaps? A new Google search for “GCC ud2a instruction” finds me:

   ...generates this warning (using gcc 4.4.1 but I think it applies to most
   gcc versions):

   main.cpp:12: warning: cannot pass objects of non-POD type 'class A'
   through '...'; call will abort at runtime

   1. Why is this a "warning" rather than an "error"? When I run the program
   it hits a "ud2a" instruction emitted by gcc and promptly hits SIGILL.

Oh my! It sounds like GCC has cowardly refused to generate an error, but also bravely refuses to generate bad code for whatever this code sequence is. Do I have such an error in my build log? In fact, I have three, all of which look like:

sqluedus.C:1464: warning: deprecated conversion from string constant to 'char*'
sqluedus.C:1700: warning: cannot pass objects of non-POD type 'struct sqlrw_request_cb' through '...'; call will abort at runtime

At 1700 of that file we have:

sprintf(mesg,"W749645:uILE pPr->agtR=%p ->pAct=%p",pPrivateACB->agtRqstCB,pPrivateACB->agtRqstCB.pActivityCB);

It turns out that agtRqstCB is a rather large structure, and certainly doesn’t match the %p that the developer used in this debug build special code. The debug code actually makes things worse, and certainly won’t help on any platform. It probably won’t crash on any platform either (except when using the GCC compiler), since there are no subsequent %s format parameters that will get messed up by placing gob-loads of structure data in the varargs data area inappropriately.

Fixing that debug line should resolve this issue and allow me to go back to avoiding the (much slower!) Intel compiler that is used by our nightly build process.
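As a hedged sketch of what a corrected debug line could look like, the version below passes the address of the structure for the first %p instead of the structure itself. RequestCB, ActivityCB and describeRequest() are hypothetical stand-ins for the real DB2 types, and snprintf is used here in place of the original sprintf to bound the write:

```cpp
#include <cstdio>
#include <string>

struct ActivityCB {};            // hypothetical stand-in

struct RequestCB {               // hypothetical stand-in for a large control block
    char payload[256];
    ActivityCB *pActivityCB;
};

// Format a debug message with one pointer argument per %p.
std::string describeRequest(const RequestCB &request) {
    char mesg[128];
    // Problem form: passing `request` itself hands a whole structure
    // to a varargs call where %p expects a single pointer.
    std::snprintf(mesg, sizeof mesg, "agtR=%p ->pAct=%p",
                  static_cast<const void *>(&request),
                  static_cast<const void *>(request.pActivityCB));
    return std::string(mesg);
}
```

The exact text that %p produces is implementation-defined, so only the surrounding labels are stable across platforms.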

Posted in C/C++ development and debugging.