09 May 2021

Fun with floating-point assembly (in RISC-V and x64)

There was a fun little exercise in an old Fortran book that involved approximating a cube root using the Newton method to a specified accuracy. I’ve implemented this in multiple high-level languages and then moved on to two different assembly versions - one targeting the RISC-V instruction set, another the ubiquitous x64 with SSE2 extensions.

All of the samples below calculate a cube root of 27 (or of the number from the standard input) and print the result to the standard output.

Python pseudocode

The original algorithm from the book in pseudocode:

1: read A,e
2: X2 = A
3: X1 = X2  
4: U = 1/3 * (A / X1^2 - X1)
5: X2 = X1+U
6: if abs(U) < epsilon, go to 7, else go to 3
7: X2 contains cube root of A

Transferring that to a FORTRAN implementation (using an ancient FORTRAN 66 standard) is straightforward:

C  CUBE ROOT CALCULATION
    1 FORMAT (F6.3)
      READ(5,1)A
      X2=A
    4 X1=X2
      U=(A/(X1*X1)-X1)/3.0
      X2=U+X1
      IF(ABS(U)-0.001)9,4,4
    9 WRITE(6,1)X2
      STOP
      END

For those of you that prefer a bit more modern languages, there’s also a Python implementation

def cube(a):
    epsilon = 0.001
    x2 = a
    u = sys.maxsize
    while abs(u) >= epsilon:
        x1 = x2
        u = (a/(x1*x1) - x1 ) / 3.0
        x2 = x1 + u
        print(x2)
    return x2

RISC-V assembly:

Now how would we write this in RISCV assembly? A most straightforward way is to utilize as many registers as we can, decomposing the most complicated source line (u = (a/(x1*x1) - x1 ) / 3.0) into the individual operations and stretching the temporary results across registers instead of storing the intermediate values on the stack.

Compile with: (if you save the source as cube.s):

gcc -no-pie -o cube cube.s

# cube root approximation
.option nopic
.text
.section .rodata
.LC0:
    .string "%f\n"
.text
.globl main
.type main, @function

main:
#reserve stack space
addi sp, sp, -64

# input number in a0, epsilon in a1
li a0, 0x41d80000 # 27
li a1, 0x3a83126f # 0.001

# store parameters on stack
sw a0, (sp)
sw a1, -4(sp)

# load parameters into floating point registers

# variables: a,   x1,  x2,  u,   epsilon
# registers: fa0, fa1, fa2, fa3, fa7
# constants: ft3 = 3.0

flw fa0, (sp)  	#a
flw fa7, -4(sp) #epsilon
li t0, 0x40400000 # 3.0
sw t0, -4(sp)
flw ft3, -4(sp) # 3.0

#2: x2 = a
flw fa2, (sp)	# x2
loop: 
#3: x1 = x2
fsw fa2, -8(sp) 
flw fa1, -8(sp)
#4: U = (a/(x1*x1) - x1 ) / 3.0f

fmul.s fs0, fa1, fa1 # x1*x1
fdiv.s fs1, fa0, fs0 # a/(x1^2)
fsub.s fs2, fs1, fa1 # (a/(x1*x1)) - x1
fdiv.s fa3, fs2, ft3 # u =(a/(x1*x1)) - x1) / 3.0

#5: x2 = x1 + u
fadd.s fa2, fa1, fa3  #x2 = x1 + u

#6: if abs(u) < epsilon, go to 7, else go to 3
fabs.s fs3, fa3 # we can also overwrite u with abs(u)

fle.s t0, fs3, fa7 # t0 will be 1 if u < epsilon ?

# go to 3
beqz t0, loop

#the result is stored at fa2
#print it out
fcvt.d.s fa2,fa2
fmv.x.d a1,fa2
lui a5,%hi(.LC0)
addi a0,a5,%lo(.LC0)
call printf
addi sp, sp, 64
ret

x64/SSE assembly:

I’ve preferred the Intel syntax and NASM for the x64 version. You’ll also need a 64-bit processor capable of the SSE2 instruction set. I’m also using only the floating-point registers for temporary variables instead of the stack.

Compile with:

nasm -f elf64 cube.s && gcc -no-pie -o cube cube.o

global main
extern printf

section .data
    string db `%f\n`, 0       ; our format string
    num dq 27.0                ; the value to calculate root from
    eps dq 0.001              ; epsilon
    three dq 3.0              ; constant 3.0

section .text

;xmm10 to xmm15 are constants and variables
main:
    sub rsp, 8              ; reserve stack space
    
    mov rdi, string         ; load format string for printf
    mov rax, 1              ; tell printf we use 1 SSE register as argument

    movq xmm15, qword [num]   ; load number from data section to SSE register
    movq xmm14, qword [eps] 
    movq xmm13, qword [three]

    movq xmm12, xmm15   ;   2: x2 = a

loop:

    movq xmm11, xmm12   ;   3: x1 = x2

    movq xmm1, xmm11    ;   4: U = (a/(x1*x1) - x1 ) / 3.0f decomposed 
    mulsd xmm1, xmm1    ;  x1*x1
    movq xmm0, xmm15    ; a -> xmm0
    divsd xmm0, xmm1    ; a / x1^2 -> xmm0
    subsd xmm0, xmm11   ; (a/(x1*x1)) - x1 -> xmm0
    divsd xmm0, xmm13   ; u =(a/(x1*x1)) - x1) / 3.0
    movsd xmm10, xmm0   ; u (xmm10)

    addsd xmm0, xmm11   ;   5: x2 = x1 + u
    movsd xmm12, xmm0
;   6: if abs(u) < epsilon, go to 7, else go to 3
    psllq xmm10, 1      ; abs(u): get rid of sign bit on xmm10 
    psrlq xmm10, 1
    ucomisd xmm10, xmm14 ; compare abs(u) with epsilon
    jae loop            ; jump if above or equal to step 3

    movq xmm0, xmm12        ; the result x2 is at xmm12, move to xmm0 for printf
    mov rdi, string         ; load format string for printf
    mov rax, 1              ; tell printf we use 1 SSE register as argument
    call printf             ; 

    add rsp, 8
    mov eax, 0x60
    xor edi, edi
    syscall

To run the compiled program, just execute ./cube

$ ./cube
3.000000

What’s different?

The programs are basically the same length, however, the x64 SSE instructions are at a slight disadvantage as the addsd/mulsd/... instructions only have two operands, so the first source register is the same as the destination registers, requiring us to shuffle the registers slightly for the operations.

However, the x64 SSE instructions are capable of vectorization, so a smart compiler could generate code that operates on multiple numbers in parallel.

Debugging floating point stuff in gdb:

A quick reminder of gdb commands useful in debugging these programs (instructions for x64, but should be similar enough for RISC-V):

layout asm switches gdb into an assembly debugging layout displaying the code
b main sets a breakpoint at the main entry point
si steps over an instruction
info all-registers shows all of the registers
i r xmm0 shows the contents of the xmm0 register
p $xmm0.v2_double shows the contentx of the xmm0 register interpreted as a double precision floating point number