Peeter Joot's (OLD) Blog.

Math, physics, perl, and programming obscurity.

A start at a rudimentary perl decompiler for linux amd64 objdump output.

Posted by peeterjoot on January 12, 2012

I’ve got two different versions of some code that appear to behave significantly differently under optimization (in complex performance scenarios where diagnosis, or even tracing, is difficult). There are some minor differences in the source code, and I’m now left wondering whether the compiler is doing something unexpected for the two sets of code.

I went looking for a decompiler for Linux amd64 code. I remember once using IDA Pro in an ethical hacking course I took, but that’s not available in IBM’s standard licensed software offerings I can ask for. There are a couple of free decompilers that I found listed on Wikipedia. The only one that appeared to have any sort of amd64 support was backerstreet’s reverse engineering compiler. I happened to have an Ubuntu VM around that I tried this on, but it crashed even on a 32-bit executable (tried command prompt: ‘open /bin/bash’), so I don’t really expect it to behave better on a cross target executable.

Since the code in question is not too big (~300 lines of disassembly), I was wondering if I could hack together something that at least removed all the addresses from the disassembly that weren’t jump targets.

Another thing that was required to make the disassembly sensible was the use of the linked output, not just the .o files, as the objdump source. Otherwise I end up with all the function calls left unresolved, like:

     272:       e8 00 00 00 00          callq  277 
     277:       48 83 c4 20             add    $0x20,%rsp

So, from the shared lib, I ran ‘objdump -d --no-show-raw-insn’ and filtered that with a simple relabeler:

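#!/usr/bin/perl

use strict ;
use warnings ;

my @lines = () ;
my %addressMap = () ;
my $labelCount = 0 ;

# First pass: give each direct jump target a label, numbered in order of first use.
while (<>)
{
   chomp ;

   # Only hex targets count, so indirect jumps (example: jmpq *%rax) are skipped.
   if ( /\tj\S+\s+([0-9a-f]+)\b/ )
   {
      unless ( defined $addressMap{$1} )
      {
         $addressMap{$1} = "L$labelCount" ;

         $labelCount++ ;
      }
   }

   push( @lines, $_ ) ;
}

my @addrs = ( keys %addressMap ) ;

# Second pass: substitute labels for the jump targets, and strip every other address.
foreach my $line (@lines)
{
   foreach ( @addrs )
   {
      # /e evaluates the replacement as code; sprintf returns the formatted jump.
      $line =~ s/\t(j\S+)  # example: <tab>je
                 \s+
                 $_
                 \s.*
                /sprintf("\t%-6s %s", $1, $addressMap{$_})/xe ;

      # An address that is a jump target becomes its label.
      $line =~ s/^\s*$_:/ $addressMap{$_}:/ ;
   }

   # Any address still left is not a jump target, so drop it.
   $line =~ s/^\s*[0-9a-f]+:// ;

   print "$line\n" ;
}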

So, now instead of a mess like:

 3639ff3:       test   %rbp,%rbp
 3639ff6:       mov    (%rax),%r14
 3639ff9:       je     363a02a 
 3639ffb:       mov    0x788(%r14),%rbx
 363a002:       test   %rbx,%rbx
 363a005:       je     363a02c 
 363a007:       mov    $0x1,%edi

I get something like:

   test   %rbp,%rbp
   mov    (%rax),%r14
   je     L21
   mov    0x788(%r14),%rbx
   test   %rbx,%rbx
   je     L31
   mov    $0x1,%edi

(with the labels in the jump targets also renamed, and retained).
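The relabeled listing also makes basic block boundaries easy to find: a new block starts at each label, and a block ends after each jump. Here is a minimal sketch of that split. The splitBasicBlocks name is hypothetical, and it assumes the input holds only instruction lines in the format produced above (label lines like " L12:<tab>insn"), with any j* or ret* instruction ending a block.

```perl
#!/usr/bin/perl

use strict ;
use warnings ;

# Hypothetical helper: split relabeled instruction lines into basic blocks.
# Returns a list of array references, one per block.
sub splitBasicBlocks
{
   my @blocks = () ;
   my @current = () ;

   foreach my $line (@_)
   {
      # A label starts a new block, since some jump may land here.
      if ( $line =~ /^\s*L\d+:/ and @current )
      {
         push( @blocks, [ @current ] ) ;
         @current = () ;
      }

      push( @current, $line ) ;

      # A jump or return ends the block, since control can leave here.
      if ( $line =~ /^(?:\s*L\d+:)?\s*(?:j|ret)\w*\b/ )
      {
         push( @blocks, [ @current ] ) ;
         @current = () ;
      }
   }

   push( @blocks, [ @current ] ) if @current ;

   return @blocks ;
}

my @blocks = splitBasicBlocks(
   "\ttest   %rbp,%rbp",
   "\tje     L0",
   "\tmov    \$0x1,%edi",
   " L0:\tretq",
) ;

print scalar( @blocks ), " blocks\n" ;
```

Calls are deliberately not treated as block ends here, which is the usual convention but still a choice made for this sketch.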

The next logical step would be to implement a register renamer. Since I’ve now got all the basic blocks identified, it should be possible to figure out any time a general purpose register is clobbered, and give the register a new name at each clobber point. For instance, in the following BB the two pairs of rdx uses are logically different:

   mov    0x128(%rsp),%rdx
   inc    %rdx
   test   %rbx,%rbx
   je     L191
   mov    %rdx,0x128(%rsp)
   mov    0x78(%rbx),%rdi
   test   %rdi,%rdi
   je     L191
   mov    0xc88(%rdi),%rsi
   test   %rsi,%rsi
   je     L191
   decq   0xcb0(%rdi)
   mov    0x78(%rbx),%r8
   mov    0xcb0(%r8),%rdx
   test   %rdx,%rdx

The first rdx use above could be renamed without trouble since it is clobbered in the same BB, so you know that its use is purely local to that block. However, to rename registers intelligently in general you’d also have to identify what register dependencies exist between the basic blocks.
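The purely local part of that renaming might look something like the sketch below. This is only an illustration under loud assumptions: renameRegister is a hypothetical helper, it handles a single register name with no alias tracking, and it treats only a mov or lea with a bare register destination (the last operand in AT&T syntax) as a clobber. Read-modify-write instructions like inc keep the current name, and nothing here looks at dependencies between blocks.

```perl
#!/usr/bin/perl

use strict ;
use warnings ;

# Hypothetical helper: give register $reg a new name at each clobber point
# in a run of straight-line instructions.
sub renameRegister
{
   my ($reg, @insns) = @_ ;
   my $version = 0 ; # version 0 is whatever value was live on entry
   my @out = () ;

   foreach my $insn (@insns)
   {
      if ( $insn =~ /^(\s*(?:mov|lea)\w*\s+.*,\s*)%\Q$reg\E\s*$/ )
      {
         # The sources still refer to the old value; the destination gets the new one.
         my $sources = $1 ;
         $sources =~ s/%\Q$reg\E\b/%${reg}_$version/g ;
         $version++ ;
         push( @out, "$sources%${reg}_$version" ) ;
      }
      else
      {
         (my $renamed = $insn) =~ s/%\Q$reg\E\b/%${reg}_$version/g ;
         push( @out, $renamed ) ;
      }
   }

   return @out ;
}

my @fragment = (
   "   mov    0x128(%rsp),%rdx",
   "   inc    %rdx",
   "   mov    %rdx,0x128(%rsp)",
   "   mov    0xcb0(%r8),%rdx",
   "   test   %rdx,%rdx",
) ;

print "$_\n" foreach renameRegister( "rdx", @fragment ) ;
```

On the rdx lines taken from the listing above, the first three references come out as rdx_1 and the last two as rdx_2, which matches the two logically different pairs.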

Identifying the dependencies would be extra messy on amd64, since we also have different aliases for the same register, depending on the size of the access (i.e., rdx, edx, dx, …)
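One way to tame the aliases would be to map every name to its 64-bit register before doing any dependency analysis. A rough sketch follows, where the %canonical table and the canonicalRegister helper are names made up for illustration. It only covers the general purpose registers, and it ignores the fact that the aliases do not all clobber the same way (a 32-bit write zeroes the upper half of the 64-bit register, while 8 and 16-bit writes leave the remaining bits alone).

```perl
#!/usr/bin/perl

use strict ;
use warnings ;

my %canonical = () ;

# Legacy registers: rax/eax/ax/al/ah style names.
foreach my $r (qw(a b c d))
{
   $canonical{$_} = "r${r}x" foreach ( "r${r}x", "e${r}x", "${r}x", "${r}l", "${r}h" ) ;
}

# Index and pointer registers: rsi/esi/si/sil style names.
foreach my $r (qw(si di bp sp))
{
   $canonical{$_} = "r$r" foreach ( "r$r", "e$r", $r, "${r}l" ) ;
}

# Numbered registers: r8/r8d/r8w/r8b style names.
foreach my $n (8 .. 15)
{
   $canonical{$_} = "r$n" foreach ( "r$n", "r${n}d", "r${n}w", "r${n}b" ) ;
}

# Hypothetical helper: 64-bit name for any alias, with or without the
# leading %.  Returns undef for anything not in the table (xmm0, rip, ...).
sub canonicalRegister
{
   my ($name) = @_ ;

   $name =~ s/^%// ;

   return $canonical{$name} ;
}

print canonicalRegister( "%edx" ), "\n" ;
```

A renamer could run each operand through this lookup first, so that rdx, edx and dx are all tracked as the same register.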

In the end I’ve come to the conclusion that taking this any further would really be too much work. Perhaps I can spot a difference just by inspection. Without some preprocessing the assembly is fairly hard to read though.
