I was tasked with reviewing some inline assembly, essentially like so:

Uint64 negativeLessThanX(Uint64 v0, Uint64 v1)
{
Uint64 mask;
__asm__ volatile ("subfc %0,%2,%1; subfe %0,%0,%0" \
/* outputs */ : "=r"(mask) /* %0 */ \
/* inputs */ : "r"(v0), /* %1 */ \
"r"(v1) /* %2 */ \
/* clobbers */: "xer" /* XER holds the carry bit (CA) */ \
);
return mask;
}

This should have the effect of doing:

Uint64 negativeLessThanX(Uint64 v0, Uint64 v1)
{
Uint64 mask;
subfc mask,v1,v0
subfe mask, mask, mask
return mask;
}

From the PowerPC instruction set reference (PowerISA_V2.07_PUBLIC.pdf), our subfc and subfe instructions are, respectively, ‘Subtract From Carrying XO-form’ and ‘Subtract From Extended XO-form’:

subfc RT,RA,RB (OE=0 Rc=0)
RT <- ¬ (RA) + (RB) + 1
(RT = RB - RA)
subfe RT,RA,RB (OE=0 Rc=0)
RT <- ¬ (RA) + (RB) + CA

Since we have RA = RB in the subfe, and a value plus its own complement has all bits set, we essentially have

RT <- 0xFFFFFFFFFFFFFFFF + CA
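To make the two definitions concrete, here is a minimal portable C sketch of the same sequence. The names add_with_carry and model_negativeLessThanX are hypothetical helpers invented for illustration. This models the arithmetic described in the ISA text above, not the compiler's actual output.

```c
#include <stdint.h>

/* Hypothetical helper: 64-bit a + b + carry_in, reporting the carry out.
   This plays the role of the adder behind both subfc and subfe. */
static uint64_t add_with_carry(uint64_t a, uint64_t b,
                               unsigned carry_in, unsigned *carry_out)
{
    uint64_t partial = a + b;          /* may wrap around */
    uint64_t sum = partial + carry_in; /* may wrap around */
    /* At most one of the two additions can wrap, so OR is enough. */
    *carry_out = (unsigned)((partial < a) | (sum < partial));
    return sum;
}

/* Hypothetical model of the two-instruction sequence. */
static uint64_t model_negativeLessThanX(uint64_t v0, uint64_t v1)
{
    unsigned ca;
    /* subfc mask,v1,v0 : mask = ~v1 + v0 + 1, CA = carry out */
    uint64_t mask = add_with_carry(~v1, v0, 1u, &ca);
    /* subfe mask,mask,mask : mask = ~mask + mask + CA = all ones + CA */
    mask = add_with_carry(~mask, mask, ca, &ca);
    return mask;
}
```

Under this model, model_negativeLessThanX(2, 1) and model_negativeLessThanX(1, 1) return 0, while model_negativeLessThanX(1, 2) returns 0xFFFFFFFFFFFFFFFF.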

Let’s walk through this in the debugger to understand it. We have:

(dbx) listi negativeLessThanX
0x100000a24 (negativeLessThanX(unsigned long,unsigned long)+0x24) 7c030010 subfc r0,r3,r0
0x100000a28 (negativeLessThanX(unsigned long,unsigned long)+0x28) 7c000110 subfe r0,r0,r0

So, the subfc computes r0 = r0 - r3. The subfe that follows looks like r0 = r0 - r0, except that it adds the carry flag (the CA bit in XER) in place of the +1. Let’s see this in the debugger, first with r0=2, r3=1:

(dbx) stop in negativeLessThanX
[1] stop in negativeLessThanX(unsigned long,unsigned long)
(dbx) c
[1] stopped in negativeLessThanX(unsigned long,unsigned long) at line 6 in file "w.C" ($t1)
6 __asm__ volatile ("subfc %0,%2,%1; subfe %0,%0,%0" \
(dbx) stepi
stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a20 ($t1)
0x100000a20 (negativeLessThanX(unsigned long,unsigned long)+0x20) e86100b8 ld r3,0xb8(r1)
(dbx)
stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a24 ($t1)
0x100000a24 (negativeLessThanX(unsigned long,unsigned long)+0x24) 7c030010 subfc r0,r3,r0
(dbx) p $r0
0x0000000000000002
(dbx) p $r3
0x0000000000000001
(dbx) p $xer
0x0000000020000002
(dbx) stepi
stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a28 ($t1)
0x100000a28 (negativeLessThanX(unsigned long,unsigned long)+0x28) 7c000110 subfe r0,r0,r0
(dbx) p $r0
0x0000000000000001
(dbx) p $xer
0x0000000020000002
(dbx) stepi
stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a2c ($t1)
0x100000a2c (negativeLessThanX(unsigned long,unsigned long)+0x2c) f8010070 std r0,0x70(r1)
(dbx) p $r0

We see that the subfc does generate r0=1 as expected. The CA bit is bit 34 of the XER (‘34 Carry (CA)’ in the reference), and our XER value is: 0b[0….]00100000000000000000000000000010. Bits 32, 33 are clear, but CA (34) is set. At first this seems curiously inverted. We have $r0 = v0-v1 > 0, so why is CA set?

How about with v0=1, v1=2:

(dbx)
stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a24 ($t1)
0x100000a24 (negativeLessThanX(unsigned long,unsigned long)+0x24) 7c030010 subfc r0,r3,r0
(dbx) p $r0
0x0000000000000001
(dbx) p $r3
0x0000000000000002
(dbx) stepi
stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a28 ($t1)
0x100000a28 (negativeLessThanX(unsigned long,unsigned long)+0x28) 7c000110 subfe r0,r0,r0
(dbx) p $r0
0xffffffffffffffff
(dbx) p $xer
0x0000000000000013
(dbx) stepi
stopped in negativeLessThanX(unsigned long,unsigned long) at 0x100000a2c ($t1)
0x100000a2c (negativeLessThanX(unsigned long,unsigned long)+0x2c) f8010070 std r0,0x70(r1)
(dbx) p $r0
0xffffffffffffffff

The intermediate subtraction now produces -1 (0xffffffffffffffff), but CA is clear after the subfc: XER=0b[0….]00000000000000000000000000010011 has bit 34 clear. Again, this seems backwards.
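Reading bit 34 out of a hex dump is easy to get wrong, because the Power ISA numbers bits from the most significant end (bit 0) down to the least significant (bit 63). A small sketch for decoding the two XER values printed above follows. The helper reg_bit and the enum names are hypothetical, introduced only for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: return bit `ibm_bit` of a 64-bit register value,
   using the Power ISA convention where bit 0 is the most significant
   bit and bit 63 is the least significant. */
static unsigned reg_bit(uint64_t value, unsigned ibm_bit)
{
    return (unsigned)((value >> (63u - ibm_bit)) & 1u);
}

/* XER bit positions discussed in the text (illustrative names). */
enum { XER_SO = 32, XER_OV = 33, XER_CA = 34 };
```

With this, reg_bit(0x0000000020000002, XER_CA) is 1 for the v0=2, v1=1 run, and reg_bit(0x0000000000000013, XER_CA) is 0 for the v0=1, v1=2 run.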

The trick to understanding this is that the subtract isn’t implemented as a subtraction, but as an addition. For 2-1, where we don’t have to borrow, our subfc is actually doing:

~1 + 2 + 1:
1110
+0010
+0001
=====
10001

No borrow is required, but we do generate a carry when doing this _addition_ operation!

Compare that to a 1-2 operation:

~2 + 1 + 1:
1101
+0001
+0001
=====
1111

Also compare to a 1-1 operation:

~1 + 1 + 1:
1110
+0001
+0001
=====
10000

We generate a carry (no borrow!) when v0=v1 or v0>v1. We do not generate a carry (CA clear) when v0<v1. The subfe then computes 0xFFFFFFFFFFFFFFFF + CA, which wraps to 0 when CA is set and stays at -1 when CA is clear, so the end result is -1 for v0<v1, and 0 otherwise.
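The three 4-bit sums above can be reproduced, and the carry rule checked for every 4-bit input pair, with a short sketch. The functions subfc4 and count_carry_mismatches are hypothetical names for this illustration, and the 4-bit width is only a stand-in for the real 64-bit registers.

```c
#include <stdint.h>

/* Hypothetical 4-bit analogue of subfc RT,RA,RB: returns the 5-bit sum
   ~RA + RB + 1, so bit 4 is the carry out and bits 0..3 are the result. */
static unsigned subfc4(unsigned ra, unsigned rb)
{
    return ((~ra) & 0xFu) + (rb & 0xFu) + 1u;
}

/* Count the 4-bit (v0, v1) pairs for which "carry set" disagrees with
   "v0 >= v1". A result of 0 means the rule holds for all 256 pairs. */
static int count_carry_mismatches(void)
{
    int mismatches = 0;
    for (unsigned v0 = 0; v0 < 16; ++v0) {
        for (unsigned v1 = 0; v1 < 16; ++v1) {
            unsigned carry = subfc4(v1, v0) >> 4; /* models v0 - v1 */
            if (carry != (unsigned)(v0 >= v1))
                ++mismatches;
        }
    }
    return mismatches;
}
```

Here subfc4(1, 2) gives 0x11 (10001), subfc4(2, 1) gives 0xF (1111), and subfc4(1, 1) gives 0x10 (10000), matching the hand-worked sums.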