Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

quirks with glibc backtrace vs. memcpy on Linux AMD64.

Posted by peeterjoot on January 7, 2011

Our product has a home grown post mortum facility that collects many details about a software failure when it occurs.  As well as being useful in a customer situation, given the complexity of our product, this has a huge utility in the development phase, where bugs can be easily introduced before our huge regression test suite has been run on the new code.

On some of our linux platforms we use the glibc backtrace() function to do the stack walking.  It is not a perfect API however.  In particular, gdb can be observed to do a better job in some circumstances.  Here is an example of the complete backtrace that we include in our “trapfile” when a call to memcpy traps in a particular way:

-----FUNC-ADDR---- ------FUNCTION + OFFSET------
0x00002AB7984E5938 ossDumpStackTraceInternal(unsigned long, OSSTrapFile&, int, siginfo*, void*, unsigned long, unsigned long) + 0x02b0
      (/home/hotel81/peeterj/sqllib/lib64/libdb2osse.so.1)
0x00002AB7984E5D95 ossDumpStackTraceV98 + 0x006b
      (/home/hotel81/peeterj/sqllib/lib64/libdb2osse.so.1)
0x00002AB7984E145F OSSTrapFile::dumpEx(unsigned long, int, siginfo*, void*, unsigned long) + 0x020d
      (/home/hotel81/peeterj/sqllib/lib64/libdb2osse.so.1)
0x00002AB7945A5BF9 sqlo_trce + 0x03af
      (/home/hotel81/peeterj/sqllib/lib64/libdb2e.so.1)
0x00002AB79469730C sqloEDUCodeTrapHandler + 0x0272
      (/home/hotel81/peeterj/sqllib/lib64/libdb2e.so.1)
0x00002AB7910BFD60 address: 0x00002AB7910BFD60 ; dladdress: 0x00002AB7910B2000 ; offset in lib: 0x000000000000DD60 ;
      (/lib64/libpthread.so.0)
0x00002AB798D49592 memcpy + 0x0052
      (/lib64/libc.so.6)
0x00002AB7ACBF7E60 address: 0x2ab7acbf7e60

(I’ve also filtered this through c++filt to make it slightly more readable). Observe that there’s a bunch of junk from the signal handler itself, and then we get to our memcpy call that is wrong, and then nothing. The caller of the memcpy is missing. Thankfully, this was an error that I introduced in development, and I was able to reproduce it in the debugger easily enough. Observe that the debugger is able to give me the whole stacktrace:

(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00002ab798d49592 in memcpy () from /lib64/libc.so.6
(gdb) where
#0  0x00002ab798d49592 in memcpy () from /lib64/libc.so.6
#1  0x00002ab7acbf7e60 in ?? ()
#2  0x00002ab792f02093 in SAL_GBP_HANDLE::SAL_CatchupSecondarySCA (this=0x2ab7b2cca8c0, flags=1, caKeySample=@0x2ab7acbfd290)
    at sqle_ca_cmd.C:32748
#3  0x00002ab7b2ac9fce in DO_WARM_TEST (pszTitle=0x2ab7b2ada060 "LITTLE_WARM_TEST", h=@0x2ab7b2cca8c0,
    what=SAL_STRUCTURE_IDENTITY_CACHE_SCA, n=30) at testca.C:3308
#4  0x00002ab7b2aca5dc in CACHE_TESTS (h=@0x2ab7b2cca8c0, what=SAL_STRUCTURE_IDENTITY_CACHE_SCA) at testca.C:7304
#5  0x00002ab7b2aced94 in testcode (krcbp=0x200480080,
    inFilePath=0x2ab7acbfdef0 "/home/hotel81/peeterj/sqllib/db2dump/rca.config") at testca.C:8484
#6  0x00002ab792d359f2 in sqleInvokeProcedure (Args=0x2ab7acbfdef0) at sqleinvp.C:204
#7  0x00002ab7946984bc in sqloEDUEntry () from /home/hotel81/peeterj/sqllib/lib64/libdb2e.so.1
#8  0x00002ab7910b8193 in start_thread () from /lib64/libpthread.so.0
#9  0x00002ab798d94dfd in clone () from /lib64/libc.so.6
(gdb) q

My initial thought was stack corruption, and that the debugger did something fancy to find the pre-stack corruption frame in the stack and show the complete backtrace. That doesn’t appear to be the case. When I look at the saved rip instruction for the return point after the memcpy, it appears that this API is actually handled differently. There is a PLT jump to the memcpy code, but no stack frame appears to be allocated.

It’s worth making a note about how the stack is handled in a linuxamd64 function call. Here’s what I saw in this case:

3307             SAL_ENCODED_CA_STATE fakeCaKeySample = 0 ;
0x00002b80fbd35fbd :   movq   $0x0,-0x70(%rbp)

3308             rc = h.SAL_CatchupSecondarySCA( SAL_REBUILD_TEST_CONTEXT, fakeCaKeySample ) ;
0x00002b80fbd35fc5 :   lea    -0x70(%rbp),%rdx
0x00002b80fbd35fc9 :   mov    -0x5030(%rbp),%rdi
0x00002b80fbd35fd0 :   mov    $0x1,%esi
0x00002b80fbd35fd5 :   callq  0x2b80fbd28b90 
0x00002b80fbd35fda :   mov    %eax,-0x4c(%rbp)

See the callq above. Just after that is the address 0x00002b80fbd35fda, and that’s what I expect to see saved to the stack so that the return from the function goes to the right place. An overwrite of that stack location would cause the sort of chaos that I incorrectly guessed that I had in this case. What do I see on the stack after this call. When I step into this function I see from ‘info registers’

rip            0x2b80dc16d65a   0x2b80dc16d65a 

that the step has gotten me a few instructions into the new function body. Where exactly am I and what’s occured so far?

Dump of assembler code for function SAL_GBP_HANDLE::SAL_CatchupSecondarySCA(unsigned long, unsigned long&):

32466                                                   SAL_ENCODED_CA_STATE &   caKeySample )
0x00002b80dc16d63a :        push   %rbp
0x00002b80dc16d63b :        mov    %rsp,%rbp
0x00002b80dc16d63e :        sub    $0x5e0,%rsp
0x00002b80dc16d645 :       mov    %rdi,-0x438(%rbp)
0x00002b80dc16d64c :       mov    %rsi,-0x440(%rbp)
0x00002b80dc16d653 :       mov    %rdx,-0x448(%rbp)

32467   {
32468      sqlzRc rc = 0 ;
0x00002b80dc16d65a :       movl   $0x0,-0x68(%rbp)

NOTES: %rsp gets decremented here to allocate stack storage. Expect that the pre-decrement value has our return address nearby

I see that $rbp has the pre-decrement value of the stack address, so rsp indexes into our allocated stack frame for this function call, while we are left with the return info in $rbp

(gdb) p $rbp-0x5e0
$10 = (void *) 0x2b80f5ff7c90
(gdb) p $rsp
$11 = (void *) 0x2b80f5ff7c90

Dumping the raw stack at this point, I find that the return address is tucked away nicely:

(gdb) x/2xg $rbp
0x2b80f5ff8270: 0x00002b80f5ffd300      0x00002b80fbd35fda

Unless there is any sort of alloca logic or locally scoped stack storage in this function, this should be an invariant after any call that returns us to this function scope (ie: right before the malloc that takes us out for example). Let’s check on rbp and rsp when we get to the basic block of interest. There’s actually two memcpy’s in this code, and it seemed that somehow the first one must have been the culprit. Right before that first memcpy I had:

(gdb) p $rsp
$14 = (void *) 0x2b80f5ff7c90
(gdb) p $rbp
$15 = (void *) 0x2b80f5ff8270

Good, nothing has allocated any more stack in this function compared to the prologue stack allocation in the beginning of the function itself. How about our return address in *rbp:

(gdb) x/2xg $rbp
0x2b80f5ff8270: 0x00002b80f5ffd300      0x00002b80fbd35fda

Hmm. This is still the same? Curious. And after each of the intermediate instructions, leading up to the memcpy that takes us down:

(gdb) n
32745         cmd.pagename   = &filterPageName ;
(gdb) x/2xg $rbp
0x2b80f5ff8270: 0x00002b80f5ffd300      0x00002b80fbd35fda
(gdb) n
32746         cmd.pagedata   = &primaryNamesTriplet ;
(gdb) n
32747         cmd.rtoken     = &restart ;
(gdb) n
32748         memcpy( cmd.udf, pUdf, sizeof(cmd.udf) ) ;
(gdb) x/2xg $rbp
0x2b80f5ff8270: 0x00002b80f5ffd300      0x00002b80fbd35fda
(gdb) n

Program received signal SIGSEGV, Segmentation fault.
0x00002b80e1fb5592 in memcpy () from /lib64/libc.so.6

My conclusion at this point is that this isn’t a stack corruption after all. The backtrace() related lack of info is actually a red-herring in this case!

If I look at the trapping point in memcpy, I see that it is here:

0x00002b80e1fb5592 : mov    (%rsi),%rax

Out of curiousity, is %rax in range? Yes:

%RAX at $rbp + 0x1d0 is within the stack frame range (0x5e0).

However, since we actually have a %rsi dereference and not a %rax dereference:

Note that this is src,dest order in gcc binutils style disassembly. How does rsi get a 0 in it? What happens to rsi before this trap point in the memcpy code:

(gdb) disassemble memcpy
Dump of assembler code for function memcpy:
0x00002b80e1fb5540 :  cmp    $0x8,%rdx
0x00002b80e1fb5544 :  mov    %rdi,%rax
0x00002b80e1fb5547 :  jae    0x2b80e1fb5580 
0x00002b80e1fb5549 :  test   $0x4,%dl
0x00002b80e1fb554c : je     0x2b80e1fb555a 
0x00002b80e1fb554e : mov    (%rsi),%ecx
0x00002b80e1fb5550 : mov    %ecx,(%rdi)
0x00002b80e1fb5552 : add    $0x4,%rsi
0x00002b80e1fb5556 : add    $0x4,%rdi
0x00002b80e1fb555a : test   $0x2,%dl
0x00002b80e1fb555d : je     0x2b80e1fb556d 
0x00002b80e1fb555f : mov    (%rsi),%cx
0x00002b80e1fb5562 : mov    %cx,(%rdi)
0x00002b80e1fb5565 : add    $0x2,%rsi
0x00002b80e1fb5569 : add    $0x2,%rdi
0x00002b80e1fb556d : test   $0x1,%dl
0x00002b80e1fb5570 : je     0x2b80e1fb5576 
0x00002b80e1fb5572 : mov    (%rsi),%cl
0x00002b80e1fb5574 : mov    %cl,(%rdi)
0x00002b80e1fb5576 : repz retq
0x00002b80e1fb5578 : xchg   %ax,%ax
0x00002b80e1fb557c : xchg   %ax,%ax
0x00002b80e1fb5580 : push   %rax
0x00002b80e1fb5581 : cmp    $0x20,%rdx
0x00002b80e1fb5585 : jae    0x2b80e1fb55b0 
0x00002b80e1fb5587 : mov    %edx,%ecx
0x00002b80e1fb5589 : shr    $0x3,%ecx
0x00002b80e1fb558c : je     0x2b80e1fb55a2 
0x00002b80e1fb558e : xchg   %ax,%ax
0x00002b80e1fb5590 : dec    %ecx
0x00002b80e1fb5592 : mov    (%rsi),%rax

Not much. It looks like rsi is actually just a parameter, and that we are getting a NULL from the caller.

After a bit of digging, and a bit more on the amd64 calling conventions, I see that the registers RDI, RSI, RDX, RCX, R8 and R9 are used for integer and pointer arguments, so it sounds like my memcpy src argument is in fact NULL.

Very dumb of me to not have checked for that first and that is in fact the case! The moral of the story is that we cannot always trust our tools. In this case it was the backtrace() API. What seemed like it was probably a stack corruption due to incomplete backtrace is probably really an artifact of the memcpy routine having been hand coded in assembly in some way that backtrace() didn’t quite get along with. The offending trap was really a plain old NULL pointer, it just looked like it was something more.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: