I was tasked to review some inline assembly, essentially like so:

Uint64 negativeLessThanX(Uint64 v0, Uint64 v1) { Uint64 mask; __asm__ volatile ("subfc %0,%2,%1; subfe %0,%0,%0" \ /* outputs */ : "=r"(mask) /* %0 */ \ /* inputs */ : "r"(v0), /* %1 */ \ "r"(v1) /* %2 */ \ /* clobbers */: "xer" /* condition registers (CF, ...) */ \ ); return mask; }

This should have the effect of doing:

Uint64 negativeLessThanX(Uint64 v0, Uint64 v1) { Uint64 mask; subfc mask,v1,v0 subfe mask, mask, mask return mask; }

From the powerpc instruction set reference (PowerISA_V2.07_PUBLIC.pdf), our subfc, and subfe instructions are respectively ‘Subtract From Carrying XO-form’, ‘Subtract From Extended XO-form’ :

subfc RT,RA,RB (OE=0 Rc=0) RT <- ¬ (RA) + (RB) + 1 (RT = RB - RA) subfe RT,RA,RB (OE=0 Rc=0) RT <- ¬ (RA) + (RB) + CA

Since we have RA = RB in the subfe, and self plus complement is all bits set, we essentially have

RT <- 0xFFFFFFFFFFFFFFFF + CA

Let’s walk through this in the debugger to understand it. We have:

(dbx) listi negativeLessThanX 0x100000a24 (negativeLessThanX(unsigned long,unsigned long)+0x24) 7c030010 subfc r0,r3,r0 0x100000a28 (negativeLessThanX(unsigned long,unsigned long)+0x28) 7c000110 subfe r0,r0,r0

So, we have r0 = r0 – r3. After this we have a r0 = r0 – r0, but also bringing in the carry flag (CA bit in XER). Let’s see this in the debugger, first with r0=2, r3=1 :

(dbx) stop in negativeLessThanX [1] stop in negativeLessThanX(unsigned long,unsigned long) (dbx) c [1] stopped in negativeLessThanX(unsigned long,unsigned long) at line 6 in file "w.C" ($t1) 6 __asm__ volatile ("subfc %0,%2,%1; subfe %0,%0,%0" \ (dbx) stepi stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a20 ($t1) 0x100000a20 (negativeLessThanX(unsigned long,unsigned long)+0x20) e86100b8 ld r3,0xb8(r1) (dbx) stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a24 ($t1) 0x100000a24 (negativeLessThanX(unsigned long,unsigned long)+0x24) 7c030010 subfc r0,r3,r0 (dbx) p $r0 0x0000000000000002 (dbx) p $r3 0x0000000000000001 (dbx) p $xer 0x0000000020000002 (dbx) stepi stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a28 ($t1) 0x100000a28 (negativeLessThanX(unsigned long,unsigned long)+0x28) 7c000110 subfe r0,r0,r0 (dbx) p $r0 0x0000000000000001 (dbx) p $xer 0x0000000020000002 (dbx) stepi stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a2c ($t1) 0x100000a2c (negativeLessThanX(unsigned long,unsigned long)+0x2c) f8010070 std r0,0x70(r1) (dbx) p $r0

We see that the subfc does generate r0=1 as expected. The CA bit of the XER is ’34 Carry (CA)’, and out XER value is: 0b[0….]00100000000000000000000000000010. Bits 32, 33 are clear, but CA (34) is set. At first this seems curiously inverted. We have $r0 = v0-v1 > 0, so why is CA set?

How about with v0=1, v1=2:

(dbx) stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a24 ($t1) 0x100000a24 (negativeLessThanX(unsigned long,unsigned long)+0x24) 7c030010 subfc r0,r3,r0 (dbx) p $r0 0x0000000000000001 (dbx) p $r3 0x0000000000000002 (dbx) stepi stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a28 ($t1) 0x100000a28 (negativeLessThanX(unsigned long,unsigned long)+0x28) 7c000110 subfe r0,r0,r0 (dbx) p $r0 0xffffffffffffffff (dbx) p $xer 0x0000000000000013 (dbx) stepi stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a2c ($t1) 0x100000a2c (negativeLessThanX(unsigned long,unsigned long)+0x2c) f8010070 std r0,0x70(r1) (dbx) p $r0 0xffffffffffffffff

The intermediate subtraction now produces a -1 (0xffffffffffffffff), but now we have CA clear from the subfc, since we see XER=0b[0….]00000000000000000000000000010011 (with bit 34 clear). Again, this seems backwards.

The trick to understanding this is that the subtract isn’t implemented as a subtraction, but an addition. For 2-1, where we don’t have to borrow, our subfc is actually doing:

~1 + 2 + 1: 1110 +0010 +0001 ===== 10001

No borrow is required, but we do generate a carry when doing this _addition_ operation!

Compare that to the a 1-2 operation:

~2 + 1 + 1: 1101 +0001 +0001 ===== 1111

Also compare to a 1-1 operation:

~1 + 1 + 1: 1110 +0001 +0001 ===== 10000

We generate a carry (now borrow!) when v0=v1 and v0>v1. We do not generate a carry (CA clear) for v0<v1, so that the end result is that for v0<v1 we have -1, and 0 otherwise.