Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

Posts Tagged ‘gdb’

The beauty of gdb hardware watchpoints

Posted by peeterjoot on March 5, 2012

I had a fun memory corruption to debug over the last couple of days. That’s probably something that not many people would be caught dead saying, but it happens that DB2’s internal memory allocation infrastructure has some awesome and powerful cross-platform capabilities. One of these is a memory debugging runtime option that enables unwritable guard pages on allocation (using mprotect and similar operating system primitives where available).

By enabling this memory debugging code in my test scenario, I end up with a nice friendly SIGSEGV when an attempt is made to write past the end of an allocation. I say this is friendly because, compared to the internal functions of free() barfing long after the corruption, with no idea what could have caused it or when, a SIGSEGV at the point of corruption is very nice!
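The guard page idea can be sketched outside of DB2 in a few lines. This is not the DB2 allocator, just a minimal illustration assuming Linux, an allocation no bigger than one page, and a hypothetical helper named guarded_alloc(). The allocation is placed so that its last byte sits right before a PROT_NONE page, and a SIGSEGV handler is used only to report that the overrun was caught.

```c
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf guard_jump;

/* SIGSEGV handler: unwind back to the sigsetjmp() in overrun_is_caught(). */
static void on_segv(int signum)
{
    (void)signum;
    siglongjmp(guard_jump, 1);
}

/* Hypothetical helper: return `size` usable bytes whose end abuts a
 * PROT_NONE guard page. Assumes 0 < size <= page size; never freed. */
static unsigned char *guarded_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *base;

    if (size == 0 || size > page)
        return NULL;
    base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    /* Make the second page inaccessible: any touch raises SIGSEGV. */
    if (mprotect(base + page, page, PROT_NONE) != 0)
        return NULL;
    return base + page - size;
}

/* Write `count` bytes of 0xAA into a guarded 1000 byte allocation.
 * Returns 1 if the write faulted on the guard page, 0 if it completed,
 * and -1 if the allocation could not be set up. */
static int overrun_is_caught(size_t count)
{
    struct sigaction sa, old;
    volatile int faulted = 0;
    unsigned char *p = guarded_alloc(1000);

    if (p == NULL)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(guard_jump, 1) == 0) {
        volatile unsigned char *q = p;
        for (size_t i = 0; i < count; i++)
            q[i] = 0xAA;            /* byte 1001 lands in the guard page */
    } else {
        faulted = 1;                /* we arrived here via the handler */
    }
    sigaction(SIGSEGV, &old, NULL); /* restore the previous disposition */
    return faulted;
}
```

Calling overrun_is_caught(1000) stays within the allocation, while overrun_is_caught(1001) trips the guard page on the very first byte past the end, which is exactly the "friendly" behaviour described above.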

However, it happened that the SIGSEGV in this case was actually a side effect of an earlier corruption. I see the following in the debugger when the exception occurs:

(gdb) p /x *gbptoken->xiinfo
$6 = {address = 0x2aaad54baaaa, length = 0x3e8, key = 0x28584929}

Now that address looks a bit fishy, doesn’t it? I happen to know this xiinfo->address was heap allocated, so my expectation was that it would be aligned nicely. What we’ve got here is a pair of 0xAA corruptions of the address, and that was enough to push a later dereference of the memory it points to into the guard page region past the allocation (the allocation size in this case was 1000 bytes < 0xAAAA). While I have a reproducible scenario, noticing that my pointer is being corrupted unfortunately reduces the problem to tracking down a corruption that’s no longer occurring in my guard page region of memory. I first thought of instrumenting the code in question with a validation routine that checks this address against an earlier cached value. That worked, but only triggered after the fact, and it was hard to see exactly what was causing the corruption. As a third step in the debugging process I used a hardware watchpoint for the first time, something I’d wanted to try for a while:

(gdb) help watch
Set a watchpoint for an expression.
A watchpoint stops execution of your program whenever the value of an expression changes.

Here’s an example of a fragment of the debugging session that shows a hardware watchpoint in action (with some names changed to protect the guilty)

(gdb) p /x hCachedGbpPointer->xiinfo->address
$4 = 0x2aaad54be000
(gdb) watch hCachedGbpPointer->xiinfo->address
Hardware watchpoint 2: hCachedGbpPointer->xiinfo->address
(gdb) c
Hardware watchpoint 2: hCachedGbpPointer->xiinfo->address

Old value = 46913211326464
New value = 46913211326634
foo (xiaddr=0x2aaad549bfe8, wobp=0x207e80000, first=1, last=2) at foo.c:6100
6100    foo.c: No such file or directory.
        in foo.c
(gdb) p /x 46913211326464
$5 = 0x2aaad54be000
(gdb) p /x 46913211326634
$6 = 0x2aaad54be0aa
(gdb) where 3
#0  foo (xiaddr=0x2aaad549bfe8, wobp=0x207e80000, first=1, last=2) at foo.c:6100
#1  0x00002aaab822fd50 in goo (mcb=0x2aaad43989e8, mrb=0x2aaad4398de8, cmdparms=0x2aaaccbf2c28, gbptoken=0x2aaad5491f28,
    contoken=0x2aaad439ef58) at foo.c:6410
#2  0x00002aaab823d862 in moo (struct_token=0x2aaad5491f28, contoken=0x2aaad439ef58, timeout=5000000)
    at moo.c:670

The watchpoint catches the corruption after one byte, at the very point that it occurs, and the job is reduced to inspecting the code in question and seeing what’s wrong!
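For anybody who wants to try a watchpoint on something small, here is a toy reproduction of the same shape of bug. Everything in it is made up for illustration: struct token, buggy_fill() and address_is_intact() are hypothetical names, and the last of these plays the role of the "compare against a cached value" validation routine that only notices the damage after the fact.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the token in the post: a small scratch
 * buffer that sits directly in front of a heap pointer. */
struct token {
    unsigned char scratch[16];
    void         *address;
};

/* Deliberately buggy: fills two bytes more than scratch holds, so the
 * 0xAA pattern spills into the first two bytes of `address`. Writing
 * through an unsigned char pointer to the enclosing struct keeps the
 * demonstration itself well defined. */
static void buggy_fill(struct token *t)
{
    assert(t != NULL);
    memset((unsigned char *)t, 0xAA, offsetof(struct token, address) + 2);
}

/* After-the-fact validation: compare the stored pointer bytes against
 * a copy that was cached while the pointer was known to be good. */
static int address_is_intact(const struct token *t, const void *cached)
{
    return memcmp(&t->address, &cached, sizeof cached) == 0;
}
```

With a struct token variable named t in scope, something like ‘watch t.address’ followed by ‘c’ should stop inside the memset as soon as the first byte of the pointer changes, whereas address_is_intact() can only say that something went wrong at some point before it was called.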

Posted in C/C++ development and debugging.

a handy multithreading debugging technique: a local variable controlled semi-infinite loop.

Posted by peeterjoot on November 17, 2011

I had a three-thread timing hole scenario that I wanted to confirm with the debugger. Adding blocks of code to selected points like this turned out to be really handy:

      volatile int loop = 1 ;
      while (loop)
      {
         loop = 1 ;

         sleep(1) ;
      }

Because the variable loop is local, I could have two different functions paused where I wanted them. Once I break on the sleep line, I can let each go at exactly the right point in time with a debugger command like so:

(gdb) p loop=0

(assigns a value of zero to the loop variable after switching to the thread of interest). The gdb ‘set scheduler-locking on/off’ and ‘info threads’ ‘thread N’ commands are also very handy for this sort of race condition debugging (this one was actually debugged by code inspection, but I wanted to see it in action to confirm that I had it right).

I suppose that I could have done this with a thread-specific breakpoint. I wonder if that’s also possible (probably). I’ll have to try that next time, but hopefully I don’t have to look at race conditions like today’s for quite a while!

Posted in C/C++ development and debugging.

avoiding gdb signal noise.

Posted by peeterjoot on July 7, 2010

A quick note for future reference (recorded elsewhere and subsequently lost).

Suppose your program handles a signal that gdb intercepts by default, like the following example

(gdb) c

Program received signal SIGUSR1, User defined signal 1.
[Switching to Thread 47133440862528 (LWP 4833)]
0x00002ade149d6baa in semtimedop () from /lib64/
(gdb) c

You can hit ‘c’ to continue at this point, but if it happens repeatedly in various threads (like when one thread is calling pthread_kill() to force each other thread in turn to dump its stack and stuff) this repeated ‘c’ing can be a bit of a pain.

For the same SIGUSR1 example above, you can query the gdb handler rules like so:

(gdb) info signal SIGUSR1
Signal        Stop      Print   Pass to program Description
SIGUSR1       Yes       Yes     Yes             User defined signal 1

And if deemed to not be of interest, where you just want your program to continue without prompting or spamming, something like the following does the trick:

(gdb) handle SIGUSR1 noprint nostop
Signal        Stop      Print   Pass to program Description
SIGUSR1       No        No      Yes             User defined signal 1
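To play with these handle settings without a big multithreaded product, a tiny test program is enough. The following is only a sketch: poke_self() and on_usr1() are names invented here. It installs a SIGUSR1 handler and then signals itself a few times, so running it under gdb shows the default stop on each delivery, and the ‘handle SIGUSR1 noprint nostop’ setting makes it run straight through.

```c
#define _POSIX_C_SOURCE 200809L   /* for sigaction under strict -std= modes */
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr1_count = 0;

/* The program's own SIGUSR1 handler; gdb still passes the signal on. */
static void on_usr1(int signum)
{
    (void)signum;
    usr1_count++;
}

/* Install the handler, then deliver SIGUSR1 to ourselves `times` times.
 * Returns how many deliveries the handler saw, or -1 on failure. */
static int poke_self(int times)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    usr1_count = 0;
    for (int i = 0; i < times; i++)
        raise(SIGUSR1);     /* the handler runs before raise() returns */
    return (int)usr1_count;
}
```

Because "Pass to program" stays Yes in both tables above, the handler count is the same with or without the debugger attached; only the amount of prompting changes.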

Posted in C/C++ development and debugging.

The meaning of continue in C

Posted by peeterjoot on June 16, 2010

I found myself looking at some code and unsure how it would behave. Being a bit tired today I couldn’t remember if continue pops you to the beginning of the loop, or back to the predicate that allows you to break from it. Here’s an example:

int main()
{
   volatile int rc = 0 ;
   volatile int x = 0 ;

   do {
      rc = 1 ;

      if ( 1 == x )
      {
         rc = 0 ;

         continue ;
      }
   } while ( rc == 1 ) ;

   return 0 ;
}

And if you say, “What, you’ve been programming for 12 years and don’t know?” Well, it looks that way. I chose to walk through this code in the debugger to see how it worked:

(gdb) b main
Breakpoint 1 at 0x40055c: file t.C, line 3.
(gdb) run
Starting program: /vbs/engn/.t/a.out

Breakpoint 1, main () at t.C:3
3          volatile int rc = 0 ;
(gdb) n
4          volatile int x = 0 ;
(gdb) n
7             rc = 1 ;
(gdb) n
9             if ( 1 == x )
(gdb) n
6          do {
(gdb) n
7             rc = 1 ;
(gdb) n
9             if ( 1 == x )
(gdb) p x=1
$1 = 1
(gdb) n
11               rc = 0 ;
(gdb) n
6          do {
(gdb) n
17         return 0 ;
(gdb) n
18      }
(gdb) q

Once the variable x is modified, sure enough we break from the loop (note the sneaky way you have to modify variables in gdb, using the print statement to implicitly assign). There’s no chance to go back to the beginning and reset rc = 1 to keep going.

The conclusion: continue means go to the loop exit predicate, not back to the beginning of the loop to retry. In the code in question a goto will actually be clearer, since what was desired was a retry, not a retry-if.
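The same conclusion can be checked without a debugger by counting how many times the loop body runs. This is a small sketch with made-up function names, one using continue in a do/while and one spelling the intended retry with a goto.

```c
#include <assert.h>

/* Counts how many times the loop body runs. `continue` inside a
 * do/while jumps to the `while (...)` test, so once rc is 0 the loop
 * ends without ever getting back to `rc = 1`. */
static int body_runs_with_continue(void)
{
    int runs = 0;
    int rc;

    do {
        rc = 1;
        runs++;
        if (runs == 2) {
            rc = 0;
            continue;   /* goes to the predicate, not to `rc = 1` */
        }
    } while (rc == 1);
    return runs;
}

/* The unconditional retry that the original code wanted, written with
 * an explicit goto. Runs once plus `max_retries` retries. */
static int body_runs_with_goto(int max_retries)
{
    int runs = 0;

retry:
    runs++;
    if (runs <= max_retries)
        goto retry;     /* jumps straight back to the top */
    return runs;
}
```

The first function returns 2: the second pass sets rc to 0, the continue lands on the predicate, and the loop exits. The second returns max_retries + 1, since the goto does not consult any predicate.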

Posted in C/C++ development and debugging.

A fun and curious dig. GCC generation of a ud2a instruction (SIGILL)

Posted by peeterjoot on May 26, 2010

Recently some of our code started misbehaving only when compiled with the GCC compiler. Our post mortem stacktrace and data collection tools didn’t deal with this trap very gracefully, and dealing with that (or even understanding it) is a different story.

What I see in the debugger once I find the guilty thread is:

(gdb) thread 12
[Switching to thread 12 (Thread 46970517317952 (LWP 30316))]#0  0x00002ab824438ec1 in __gxx_personality_v0 ()
    at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
351     ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/ No such file or directory.
        in ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
(gdb) where
#0  0x00002ab824438ec1 in __gxx_personality_v0 ()
    at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
#1  0x00002ab824438cc9 in sleep () from /lib64/
#2  0x00002ab8203090ee in sqloEDUSleepHandler (signum=20, sigcode=0x2ab82cffa0c0, scp=0x2ab82cff9f90)
    at sqloinst.C:283
#4  0x00002ab81cf03231 in __gxx_personality_v0 ()
    at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
#5  0x00002ab823b9b745 in ossSleep () from /home/hotel74/peeterj/sqllib/lib64/
#6  0x00002ab821206992 in pdInvokeCalloutScript () at /view/peeterj_m19/vbs/engn/include/sqluDMSort_inlines.h:158
#7  0x00002ab82030fe99 in sqloEDUCodeTrapHandler (signum=4, sigcode=0x2ab82cffcc60, scp=0x2ab82cffcb30)
    at sqloedu.C:4476
#9  0x00002ab821393257 in sqluInitLoadEDU (pPrivateACBIn=0x2059e0080, ppPrivateACBOut=0x2ab82cffd320,
    puchAuthID=0x2ab8fcef19b8 "PEETERJ ", pNLSACB=0x2ab8fceea168, pComCB=0x2ab8fceea080, pMemPool=0x2ab8fccca2d0)
    at sqluedus.C:1696
#10 0x00002ab8212d34c2 in sqluldat (pArgs=0x2ab82cffdef0 "", argsSize=96) at sqluldat.C:737
#11 0x00002ab820310ced in sqloEDUEntry (parms=0x2ab82f3e9680) at sqloedu.C:3438
#12 0x00002ab81cefc143 in start_thread () from /lib64/
#13 0x00002ab82446674d in clone () from /lib64/
#14 0x0000000000000000 in ?? ()

Observe that there are two sets of signal handler frames: one from the original SIGILL, and another from the signal that our “main” thread ends up sending to all the rest of the threads as part of our process for freezing things to be able to take a peek and see what’s up.

Looking at the siginfo_t for the SIGILL handler we have:

(gdb) frame 7
#7  0x00002ab82030fe99 in sqloEDUCodeTrapHandler (signum=4, sigcode=0x2ab82cffcc60, scp=0x2ab82cffcb30)
    at sqloedu.C:4476
4476    sqloedu.C: No such file or directory.
        in sqloedu.C
(gdb) p *sigcode
$4 = {si_signo = 4, si_errno = 0, si_code = 2, _sifields = {_pad = {557396567, 10936, 0, 0, 1, 16777216,
      -1170923664, 10936, 754961616, 10936, 599153081, 10936, 0, 0, 15711488, 10752, 4, 0, -1170923664, 10936, 1, 0,
      0, 0, 754961680, 10936, 4292335, 0}, _kill = {si_pid = 557396567, si_uid = 10936}, _timer = {
      si_tid = 557396567, si_overrun = 10936, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {
      si_pid = 557396567, si_uid = 10936, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {
      si_pid = 557396567, si_uid = 10936, si_status = 0, si_utime = 72057594037927937, si_stime = 46972886392688},
    _sigfault = {si_addr = 0x2ab821393257}, _sigpoll = {si_band = 46970319745623, si_fd = 0}}}
(gdb) p /x *sigcode
$5 = {si_signo = 0x4, si_errno = 0x0, si_code = 0x2, _sifields = {_pad = {0x21393257, 0x2ab8, 0x0, 0x0, 0x1,
      0x1000000, 0xba351f70, 0x2ab8, 0x2cffccd0, 0x2ab8, 0x23b659b9, 0x2ab8, 0x0, 0x0, 0xefbd00, 0x2a00, 0x4, 0x0,
      0xba351f70, 0x2ab8, 0x1, 0x0, 0x0, 0x0, 0x2cffcd10, 0x2ab8, 0x417eef, 0x0}, _kill = {si_pid = 0x21393257,
      si_uid = 0x2ab8}, _timer = {si_tid = 0x21393257, si_overrun = 0x2ab8, si_sigval = {sival_int = 0x0,
        sival_ptr = 0x0}}, _rt = {si_pid = 0x21393257, si_uid = 0x2ab8, si_sigval = {sival_int = 0x0,
        sival_ptr = 0x0}}, _sigchld = {si_pid = 0x21393257, si_uid = 0x2ab8, si_status = 0x0,
      si_utime = 0x100000000000001, si_stime = 0x2ab8ba351f70}, _sigfault = {si_addr = 0x2ab821393257}, _sigpoll = {
      si_band = 0x2ab821393257, si_fd = 0x0}}}

This has the si_addr value 0x00002AB821393257, which also matches frame 9 in the stack, for sqluInitLoadEDU. What was at that line of code doesn’t appear to be something that ought to generate a SIGILL:

   1693    // Set current activity in private agent CB to
   1694    // point to the activity that the EDU is working
   1695    // on behalf of.
   1696    pPrivateACB->agtRqstCB.pActivityCB = pComCB->my_curr_activity_entry;
   1697 #ifdef DB2_DEBUG
   1698    { //!!  This debug code is only useful in conjunction with a trap described by W749645
   1699       char mesg[500];
   1700       sprintf(mesg,"W749645:uILE pPr->agtR=%p ->pAct=%p",pPrivateACB->agtRqstCB,pPrivateACB->agtRqstCB.pActivityCB);
   1701       sqlt_logerr_str(SQLT_SQLU, SQLT_sqluInitLoadEDU, __LINE__, mesg, NULL, 0, SQLT_FFSL_INF);
   1702    } //!!
   1703 #endif

So what is going on? Let’s look at the assembly for the trapping instruction address. Using ‘(gdb) set logging on’, and ‘(gdb) disassemble’ we find:

0x00002ab82139323e : mov    0xfffffffffffffd68(%rbp),%rax
0x00002ab821393245 : mov    0x6498(%rax),%rdx
0x00002ab82139324c : mov    0xffffffffffffffb0(%rbp),%rax
0x00002ab821393250 : mov    %rdx,0x5bd0(%rax)
0x00002ab821393257 : ud2a
0x00002ab821393259 : cmpl   $0x0,0xffffffffffffffac(%rbp)
0x00002ab82139325f : mov    0xfffffffffffffd80(%rbp),%rdi
0x00002ab821393266 : callq  0x2ab81dcd4218 
0x00002ab82139326b : mov    0xffffffffffffffd8(%rbp),%rax
0x00002ab82139326f : and    $0x82,%eax
0x00002ab821393274 : test   %rax,%rax

Hmm. What is a ud2a instruction? Google is our friend and we find that the linux kernel uses this as a “guaranteed invalid instruction”. It is used to fault the processor and halt the kernel in case you did something really really bad.

Other similar references can be found, also explaining the use in the linux kernel. So what is this doing in userspace code? It seems like something too specific to get there by accident, and since the instruction stream itself contains this, stack corruption or any other sneaky nasty mechanism doesn’t seem likely. The instruction doesn’t immediately follow a callq, so a runtime loader malfunction or something else equally odd doesn’t seem likely either.

Perhaps the compiler put this instruction into the code for some reason. A compiler bug perhaps? A new google search for GCC ud2a instruction finds me

   ...generates this warning (using gcc 4.4.1 but I think it applies to most
   gcc versions):

   main.cpp:12: warning: cannot pass objects of non-POD type 'class A'
   through '...'; call will abort at runtime

   1. Why is this a "warning" rather than an "error"? When I run the program
   it hits a "ud2a" instruction emitted by gcc and promptly hits SIGILL.

Oh my! It sounds like GCC has cowardly refused to generate an error, but also bravely refuses to generate bad code for whatever this code sequence is. Do I have such an error in my build log? In fact, I have three, all of which look like:

sqluedus.C:1464: warning: deprecated conversion from string constant to 'char*'
sqluedus.C:1700: warning: cannot pass objects of non-POD type 'struct sqlrw_request_cb' through '...'; call will abort at runtime

At 1700 of that file we have:

sprintf(mesg,"W749645:uILE pPr->agtR=%p ->pAct=%p",pPrivateACB->agtRqstCB,pPrivateACB->agtRqstCB.pActivityCB);

It turns out that agtRqstCB is a rather large structure, and certainly doesn’t match the %p that the developer used in this debug-build-only code. The debug code actually makes things worse, and certainly won’t help on any platform. It probably won’t crash on any platform either (except when using the GCC compiler), since there are no subsequent %s format parameters that will get messed up by placing gob-loads of structure data in the varargs data area inappropriately.

Fixing that sprintf call should resolve this issue and allow me to go back to avoiding the (much slower!) Intel compiler that is used by our nightly build process.
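One plausible way to fix the debug message, assuming the intent was to log where the structure lives, is to pass the address of agtRqstCB for the first %p rather than the structure itself. The types below are hypothetical stand-ins (the real DB2 structures are much larger and are not shown here), and snprintf is swapped in for sprintf only to keep the sketch safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the DB2 types; only the shape matters. */
struct request_cb { char payload[256]; void *pActivityCB; };
struct agent_cb   { struct request_cb agtRqstCB; };

/* Format the debug message, passing the ADDRESS of the structure for
 * the first %p instead of the structure itself, so that each %p gets
 * exactly one pointer-sized argument. Returns snprintf's result. */
static int format_debug_mesg(char *mesg, size_t size,
                             struct agent_cb *pPrivateACB)
{
    assert(mesg != NULL && pPrivateACB != NULL);
    return snprintf(mesg, size, "W749645:uILE pPr->agtR=%p ->pAct=%p",
                    (void *)&pPrivateACB->agtRqstCB,
                    pPrivateACB->agtRqstCB.pActivityCB);
}
```

With only pointers going through the '...', there is nothing left for GCC to warn about, and no ud2a gets planted in the instruction stream.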

Posted in C/C++ development and debugging.

Some gdb dumping examples.

Posted by peeterjoot on April 28, 2010

I often forget how to dump memory in raw form with various debuggers. Here’s a quick note to myself on how to do it in gdb.

As bytes (in hex):

(gdb) x/256xb 0x73d2e0
0x73d2e0:       0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x73d2e8:       0x01    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x73d2f0:       0x01    0x40    0x00    0x00    0x00    0x00    0x00    0x00
0x73d3d8:       0x00    0x00    0x00    0x00    0x00    0x00    0x00    0xaa

As 4-byte “words”:

(gdb) x/64xw 0x73d2e0
0x73d2e0:       0x00000000      0x00000000      0x00000001      0x00000000
0x73d2f0:       0x00004001      0x00000000      0x00000000      0x00000000
0x73d300:       0x00000000      0x00000000      0x00000000      0x00000000
0x73d310:       0x00000000      0x00000000      0x00010001      0x00040013
0x73d3d0:       0x00000000      0x00000000      0x00000000      0xaa000000

Note that the repeat count isn’t the total number of bytes to dump, but the total number of objects in the size specification:

(gdb) help x
Examine memory: x/FMT ADDRESS.
ADDRESS is an expression for the memory address to examine.
FMT is a repeat count followed by a format letter and a size letter.
Format letters are o(octal), x(hex), d(decimal), u(unsigned decimal),
  t(binary), f(float), a(address), i(instruction), c(char) and s(string).
Size letters are b(byte), h(halfword), w(word), g(giant, 8 bytes).
The specified number of objects of the specified size are printed
according to the format.

Defaults for format and size letters are those previously used.
Default count is 1.  Default address is following last thing printed
with this command or "print".
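The count-versus-size point is easy to get wrong, so here is a rough imitation of the x command in C that makes the arithmetic explicit. It is only a sketch: dump_objects() is a made-up helper, the row layout only approximates gdb’s, and values are shown in host byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Print `count` objects of `size` bytes each (1, 2, 4 or 8) in hex,
 * loosely imitating x/<count>x<b|h|w|g>. Returns the number of BYTES
 * consumed, which is count * size, or 0 for an unsupported size. */
static size_t dump_objects(FILE *out, const void *addr,
                           size_t count, size_t size)
{
    const unsigned char *p = addr;

    assert(out != NULL && addr != NULL);
    if (size != 1 && size != 2 && size != 4 && size != 8)
        return 0;

    for (size_t i = 0; i < count; i++) {
        const unsigned char *obj = p + i * size;
        uint8_t b; uint16_t h; uint32_t w; uint64_t g, v;

        /* memcpy into a correctly sized integer yields the host byte
         * order value without making any alignment assumptions. */
        switch (size) {
        case 1:  memcpy(&b, obj, 1); v = b; break;
        case 2:  memcpy(&h, obj, 2); v = h; break;
        case 4:  memcpy(&w, obj, 4); v = w; break;
        default: memcpy(&g, obj, 8); v = g; break;
        }
        if ((i * size) % 16 == 0)   /* start a new row every 16 bytes */
            fprintf(out, "%s%p:", i ? "\n" : "", (const void *)obj);
        fprintf(out, "\t0x%0*llx", (int)(2 * size), (unsigned long long)v);
    }
    fputc('\n', out);
    return count * size;
}
```

This matches the two dumps above: x/256xb and x/64xw both cover the same 256 bytes, because 256 * 1 == 64 * 4.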

Posted in C/C++ development and debugging.

building a private version of gdb on a machine that has an older version.

Posted by peeterjoot on November 23, 2009

We have SLES10 linux machines, and the gdb version available on them is old (so old that it no longer works with the version of the Intel compiler that we use to build our product). Here’s a quick cheatsheet on how to download and install a newer version of gdb for private use, without having to have root privileges or replace the default version on the machine:

mkdir -p ~/tmp/gdb
cd ~/tmp/gdb
bzip2 -dc gdb-7.2.tar.bz2 | tar -xf -
mkdir g
cd g
../gdb-7.2/configure --prefix=$HOME/gdb
make install

Executing these leaves you with a private version of gdb in ~/gdb/bin/gdb that works with newer intel compiled code.

This version of gdb has some additional features (relative to 6.8 that we have on our machines) that also look interesting:

  • disassemble start,+length looks very handy (grab just the disassembly that is of interest, or, when the whole thing is desired, no more hacking around with the pager depth to get it all).
  • save and restore breakpoints.
  • current thread number variable $_thread
  • trace state variables (7.1), and fast tracepoints (will have to try that).
  • detached tracing
  • multiple program debugging (although I’m not sure I’d want that, especially when just one multi-threaded program can be pretty hairy to debug).  I recall many times when dbx would crash AIX with follow fork.  I wonder if other operating systems deal with this better?
  • reverse debugging, so that you can undo changes!  This is said to be target dependent.  I wonder if amd64 is supported?
  • catch syscalls.  I’ve seen some times when the glibc dynamic loader appeared to be able to exit the process, and breaking on exit, _exit, __exit did nothing.  I wonder if the exit syscall would catch such an issue.
  • find.  Search memory for a sequence of bytes.

Posted in Development environment

turning off the gdb pager to collect gobs of info.

Posted by peeterjoot on September 30, 2009

I’ve got stuff interrupted with the debugger, so I can’t invoke our external tool to collect stacks. Since gdb doesn’t have output redirection for most commands, here’s how I was able to collect all my stacks, leaving my debugger attached:

(gdb) set height 0
(gdb) set logging on
(gdb) thread apply all where

Now I can go edit gdb.txt when it finishes (in the directory where I initially attached the debugger to my pid), and examine things. A small tip, but it took me 10 minutes to figure out how to do it (yet again), so it’s worth jotting down for future reference.

Posted in debugging

Making function calls within gdb.

Posted by peeterjoot on August 31, 2009

Here’s a quick debugging tidbit for the day, somewhat obscure, but it is a cool one and can be useful (I used it today)

Gdb has a particularly nice feature of being able to call arbitrary functions on the gdb command line, and print their output (if any):


(gdb) p getpid()
$8 = 6649
(gdb) p this->SAL_DumpMyStateToAFile()
[New Thread 47323035986240 (LWP 32200)]
$9 = void

You may have to move up and down your stack frames to find the context required to make the call, or to get the parameters you need in scope. You have to think about (or exploit) the side effects of the functions you call.

Somewhat like modification of variables in the debugger, this capability allows you to shoot yourself fairly easily, but that’s part of the power.

I don’t recall if many other debuggers had this functionality. I have a vague recollection that the Sun Workshop’s dbx did too, but I could be wrong.
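Since side effects are the thing to watch for, it can pay to keep a small dump routine around that is safe to call from the gdb prompt. Here is a sketch of what such a helper might look like. struct app_state and dump_state() are invented for illustration and have nothing to do with the SAL_DumpMyStateToAFile() method shown above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state, standing in for whatever the real
 * program would want to inspect. */
struct app_state {
    int         connections;
    const char *phase;
};

/* Written to be called from the gdb prompt, e.g. `p dump_state(&state)`.
 * It writes to stderr and modifies nothing, so the only side effect is
 * the output itself. Returns the number of characters printed. */
int dump_state(const struct app_state *s)
{
    if (s == NULL)
        return fprintf(stderr, "state: (null)\n");
    return fprintf(stderr, "state: connections=%d phase=%s\n",
                   s->connections, s->phase ? s->phase : "(none)");
}
```

The function is left non-static on purpose, on the assumption that an unused static helper might otherwise be discarded by the compiler and so not be there when gdb goes looking for it.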

Posted in debugging

gdb on linux. Finding your damn thread.

Posted by peeterjoot on August 28, 2009

Suppose you are debugging a threaded process, and know that somewhere in there you have one of many threads that’s running the code you want to debug. How do you find it?

Listing the running threads isn’t terribly helpful if you’ve got a lot of them. You may see something unhelpful like:

(gdb) info threads
  30 Thread 47529939401216 (LWP 13827)  0x00002b3a5ff8e5c5 in pthread_join () from /lib64/
  29 Thread 47529945196864 (LWP 13831)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  17 Thread 47530159106368 (LWP 14065)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  16 Thread 47530150717760 (LWP 14067)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  15 Thread 47530154912064 (LWP 14559)  0x00002b3a5ff94231 in __gxx_personality_v0 ()
    at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
  14 Thread 47530146523456 (LWP 14561)  0x00002b3a66fc9476 in poll () from /lib64/
  13 Thread 47530142329152 (LWP 14564)  0x00002b3a66fc9476 in poll () from /lib64/
  12 Thread 47530138134848 (LWP 14580)  0x00002b3a66fc9476 in poll () from /lib64/
  11 Thread 47530133940544 (LWP 14581)  0x00002b3a66fc9476 in poll () from /lib64/
  10 Thread 47530129746240 (LWP 14582)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  9 Thread 47530125551936 (LWP 14583)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  8 Thread 47530121357632 (LWP 14584)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  7 Thread 47530117163328 (LWP 14585)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  6 Thread 47530112969024 (LWP 14586)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  5 Thread 47530108774720 (LWP 14587)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  4 Thread 47530104580416 (LWP 14588)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  3 Thread 47530100386112 (LWP 14589)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  2 Thread 47530096191808 (LWP 14590)  0x00002b3a66fd2baa in semtimedop () from /lib64/
  1 Thread 47530091997504 (LWP 14591)  0x00002b3a66fd2baa in semtimedop () from /lib64/

Unless you are running on a 128 way (and god help you if you have to actively debug with that kind of concurrency), most of your threads will be blocked all the time, stuck in a kernel or C runtime function, and only that shows at the top of the stack.

You can list the top frames of all your threads easily enough, doing something like:

(gdb) thread apply all where 4

Thread 30 (Thread 47529939401216 (LWP 13827)):
#0  0x00002b3a5ff8e5c5 in pthread_join () from /lib64/
#1  0x00002b3a6312e635 in sqloSpawnEDU (FuncPtr=0x2b3a6312bd7e ,
    pcArguments=0x7fff4ac3c380 "4'A", ulArgSize=24, pEDUInfo=0x7fff4ac3c340, pEDUid=0x7fff4ac3c3a0) at sqloedu.C:2206
#2  0x00002b3a6312e928 in sqloRunMainAsEDU (pFncPtr=0x412734 , argc=2, argv=0x7fff4ac3c4b8) at sqloedu.C:2445
#3  0x000000000041272c in main (argc=2, argv=0x7fff4ac3c4b8) at sqlesysc.C:1495

Thread 29 (Thread 47529945196864 (LWP 13831)):
#0  0x00002b3a66fd2baa in semtimedop () from /lib64/
#1  0x00002b3a63050880 in sqlo_waitlist::timeoutWait (this=0x2004807e0, timeout=10000)
    at /view/peeterj_kseq/vbs/engn/include/sqlowlst_inlines.h:557
#2  0x00002b3a6304eb1c in sqloWaitEDUWaitPost (pEDUWaitPostArea=0x200e90528, pUserPostCode=0x2b3a6d7d9170, timeOut=10000, flags=0)
    at sqlowaitpost.C:942
#3  0x00002b3a61b95e13 in sqeSyscQueueEdu::syscWaitRequest (this=0x200e904c0, reason=@0x2b3a6d7d9590) at sqlesyscqueue.C:510
(More stack frames follow...)

Thread 28 (Thread 47530205243712 (LWP 13894)):
#0  0x00002b3a5ff94231 in __gxx_personality_v0 () at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
#1  0x00002b3a66722625 in ossSleep (milliseconds=1000) at osstime.C:204
#2  0x00002b3a6305f69d in sqloAlarmThreadEntry (pArgs=0x0, argSize=0) at sqloalarm.C:453
#3  0x00002b3a63131703 in sqloEDUEntry (parms=0x2b3a6d7d91d0) at sqloedu.C:3402
(More stack frames follow...)

then page through that output, and find what you are looking for, set breakpoints and start debugging, but that can be tedious.

A different way, which requires some preparation, is to dump the thread id to a log file. There’s still a gotcha for that though: you can see in the ‘info threads’ output that the thread ids (what you’d get if you call and log the value of pthread_self()) are big ass numbers that aren’t particularly easy to find in the ‘info threads’ output. Note that pthread_self() will return the base address of the stack itself (or something close to it) on a number of platforms since this can be used as a unique identifier, and linux currently appears to do this (AIX no longer does since around 4.3).

Also observe that gdb prints out (LWP ….) values in the ‘info threads’ output. These are the Linux kernel task values, roughly equivalent to a thread’s pid as far as the linux kernel is concerned (linux threads and processes are all types of “tasks” … threads just happen to share more than processes, like virtual memory and signal handlers and file descriptors). At the time of this writing there isn’t a super easy way to dump this task id, but a helper function of the following form will do the trick:

#include <unistd.h>
#include <sys/syscall.h>
      int GetMyKernelThreadId(void)
      {
         return syscall(__NR_gettid);
      }

You’ll probably have to put this code in a separate module from other stuff since kernel headers and C runtime headers don’t get along well. Having done that you can call this in your dumping code, like the output below tagged with the prefix KTID (i.e. what a DB2 developer will find in n-builds in the coral project db2diag.log).

2009-08-28- I735590E1447          LEVEL: Severe
PID     : 13827                TID  : 47530154912064 KTID : 14559
PROC    : db2sysc 0

This identifier is much easier to pick out in the ‘info threads’ output (it is thread 15 in this case), so getting yourself up and debugging now requires just:

(gdb) thread 15
[Switching to thread 15 (Thread 47530154912064 (LWP 14559))]#0  0x00002b3a5ff94231 in __gxx_personality_v0 ()
    at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
351     ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/ No such file or directory.
        in ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
(gdb) where 5
#0  0x00002b3a5ff94231 in __gxx_personality_v0 () at ../../../../gcc-4.2.2/libstdc++-v3/libsupc++/
#1  0x00002b3a66722625 in ossSleep (milliseconds=100) at osstime.C:204
#2  0x00002b3a6a29e674 in traceCrash () from /home/hotel77/peeterj/sqllib/lib64/
#3  0x00002b3a66732b66 in _gtraceEntryVar (threadID=47530154912064, ecfID=423100446, eduID=25, eduIndex=3, pNargs=3)
    at gtrace.C:2130
#4  0x00002b3a61145ade in pdtEntry3 (ecfID=423100446, t1=423100418, s1=16, p1=0x2b3a8a01c478, t2=36, s2=8, p2=0x2b3a79fde418,
    t3=3, s3=8, p3=0x2b3a79fde410) at pdtraceapi.C:2012
(More stack frames follow...)
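The KTID logging described above can be sketched as follows. This is not the DB2 logging code: format_thread_tag() is a made-up helper, the field layout only imitates the db2diag.log excerpt, and casting pthread_t to unsigned long is an assumption that happens to hold on Linux with glibc.

```c
#define _DEFAULT_SOURCE 1   /* for syscall() under strict -std= modes */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux-specific: the kernel task id that gdb shows as "LWP nnnn". */
static int GetMyKernelThreadId(void)
{
    return (int)syscall(SYS_gettid);
}

/* Format a log header for the calling thread with the process id, the
 * pthread_self() value, and the kernel task id side by side. Returns
 * snprintf's result. */
static int format_thread_tag(char *buf, size_t size)
{
    assert(buf != NULL);
    return snprintf(buf, size, "PID : %d TID : %lu KTID : %d",
                    (int)getpid(), (unsigned long)pthread_self(),
                    GetMyKernelThreadId());
}
```

One handy sanity check is that the initial thread of a process has a kernel task id equal to the pid, which is why thread 30 in the ‘info threads’ listing above (the one sitting in pthread_join) shows LWP 13827, the same value as the PID in the log excerpt.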

Posted in debugging