I had a fun memory corruption to debug the last couple of days. That’s probably something that not many people would be caught dead saying, but it happens that DB2’s internal memory allocation infrastructure has some awesome and powerful cross-platform capabilities. One of these is a memory debugging runtime option that enables unwritable guard pages on allocation (using mprotect and other similar operating system primitives if available).
By enabling this memory debugging code in my test scenario, I ended up with a nice friendly SIGSEGV when an attempt was made to write past the end of an allocation. I say this is friendly because, compared to the internal functions of free() barfing long after the corruption, with no idea what could have caused it or when, a SIGSEGV at the point of corruption is very nice!
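To make the guard page idea concrete, here is a minimal sketch of the general technique, assuming a POSIX system with mmap and mprotect. It is not DB2’s allocator: the names guarded_alloc, guarded_free and overrun_is_trapped are hypothetical, invented for this illustration. The block is placed so that its last byte sits directly in front of a page with all access revoked, which means the first byte written past the end faults immediately.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: return `size` usable bytes whose end abuts an
 * inaccessible guard page. The result is only as aligned as `size`
 * allows, which a real allocator would have to deal with. */
void *guarded_alloc(size_t size)
{
    if (size == 0)
        return NULL;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;  /* round up to whole pages */
    size_t total = (data_pages + 1) * page;        /* plus one guard page */

    unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Revoke all access to the final page. */
    unsigned char *guard = base + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return guard - size;  /* block ends exactly where the guard page begins */
}

/* Release a block obtained from guarded_alloc() with the same size. */
void guarded_free(void *ptr, size_t size)
{
    assert(ptr != NULL && size > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t data_pages = (size + page - 1) / page;
    unsigned char *guard = (unsigned char *)ptr + size;
    munmap(guard - data_pages * page, (data_pages + 1) * page);
}

/* Demonstration: returns 1 if writing one byte past the end of a guarded
 * block kills a child process with a memory fault, 0 if it does not,
 * and -1 if the demonstration could not be set up. */
int overrun_is_trapped(size_t size)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        volatile unsigned char *p = guarded_alloc(size);
        if (p == NULL)
            _exit(2);
        p[size - 1] = 0xAA;  /* last valid byte: allowed */
        p[size] = 0xAA;      /* first byte of the guard page: faults */
        _exit(0);            /* only reached if the guard did not work */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    /* SIGSEGV on Linux; some systems report SIGBUS instead. */
    return WIFSIGNALED(status) &&
           (WTERMSIG(status) == SIGSEGV || WTERMSIG(status) == SIGBUS);
}
```

The fork in overrun_is_trapped is only there so that the fault can be observed without taking down the calling process. In a real debugging session you want the opposite: let the process die under the debugger so that the faulting instruction is sitting right there in the stack trace.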
However, it happened that the SIGSEGV in this case was actually a side effect of an earlier corruption. I see the following in the debugger when the exception occurs:
(gdb) p /x *gbptoken->xiinfo
$6 = {address = 0x2aaad54baaaa, length = 0x3e8, key = 0x28584929}
Now that address looks a bit fishy, doesn’t it? I happen to know this xiinfo->address was heap allocated, so my expectation was that it would be aligned nicely. What we’ve got here is a pair of 0xAA bytes overwriting the low end of the address, and that was enough to push a later dereference of the memory it pointed to into the guard page region past the allocation (the allocation size in this case was 1000 bytes, which is less than 0xAAAA).
While I have a reproducible scenario, noticing that my pointer is being corrupted unfortunately reduces the problem to tracking down a corruption that no longer occurs in my guard page region of memory. I first tried instrumenting the code in question with a validation routine that checks this address against an earlier cached value. That worked, but it only triggered after the fact, and it was hard to see exactly what was causing the corruption. As a third step in the debugging process, I used a hardware watchpoint for the first time, something I’d wanted to try for a while:
(gdb) help watch
Set a watchpoint for an expression.
A watchpoint stops execution of your program whenever the value of an expression changes.
Here’s a fragment of the debugging session that shows a hardware watchpoint in action (with some names changed to protect the guilty):
(gdb) p /x hCachedGbpPointer->xiinfo->address
$4 = 0x2aaad54be000
(gdb) watch hCachedGbpPointer->xiinfo->address
Hardware watchpoint 2: hCachedGbpPointer->xiinfo->address
(gdb) c
Continuing.
Hardware watchpoint 2: hCachedGbpPointer->xiinfo->address
Old value = 46913211326464
New value = 46913211326634
foo (xiaddr=0x2aaad549bfe8, wobp=0x207e80000, first=1, last=2) at foo.c:6100
6100 foo.c: No such file or directory.
in foo.c
(gdb) p /x 46913211326464
$5 = 0x2aaad54be000
(gdb) p /x 46913211326634
$6 = 0x2aaad54be0aa
(gdb) where 3
#0 foo (xiaddr=0x2aaad549bfe8, wobp=0x207e80000, first=1, last=2) at foo.c:6100
#1 0x00002aaab822fd50 in goo (mcb=0x2aaad43989e8, mrb=0x2aaad4398de8, cmdparms=0x2aaaccbf2c28, gbptoken=0x2aaad5491f28,
contoken=0x2aaad439ef58) at foo.c:6410
#2 0x00002aaab823d862 in moo (struct_token=0x2aaad5491f28, contoken=0x2aaad439ef58, timeout=5000000)
at moo.c:670
The watchpoint catches the corruption after a single byte has been written, at the very point that it occurs, and the job is reduced to inspecting the code in question and seeing what’s wrong!
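For contrast, the cached-value validation I tried before the watchpoint can be sketched roughly as follows. This is an illustration rather than the real instrumentation: struct xi_info is a hypothetical stand-in that only mirrors the three fields visible in the gdb output, and xi_snapshot_take, xi_snapshot_intact and looks_heap_aligned are made-up names. The 8-byte alignment test is an assumed heuristic for spotting a "fishy" heap pointer, not a guarantee of any particular allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the structure printed in the gdb session. */
struct xi_info {
    unsigned char *address;
    uint32_t length;
    uint32_t key;
};

/* A remembered copy of the pointer, taken when it is known to be good. */
struct xi_snapshot {
    const struct xi_info *info;
    unsigned char *expected;
};

struct xi_snapshot xi_snapshot_take(const struct xi_info *info)
{
    assert(info != NULL);
    struct xi_snapshot snap = { info, info->address };
    return snap;
}

/* Returns 1 if the live pointer still matches the cached value. */
int xi_snapshot_intact(const struct xi_snapshot *snap)
{
    return snap->info->address == snap->expected;
}

/* Heuristic (assumption): heap blocks are at least 8-byte aligned,
 * so a pointer ending in 0xAA is suspicious. */
int looks_heap_aligned(const void *p)
{
    return ((uintptr_t)p % 8u) == 0;
}
```

Calls to xi_snapshot_intact() have to be sprinkled through the suspect code by hand, and a failure only says that the pointer changed somewhere between two calls. The watchpoint needs no code changes at all and stops on the very instruction that does the write, which is why it got me to the answer so much faster.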