Testbug's speed tests

Comparing basic coding

Note that the code here does not necessarily write properly to the screen - the code is intended for speed tests only

FPU v CPU speeds @ 64 bits
These test compare the speed of various actions using the floating point instructions and also compares the same actions using the ordinary registers. In the early years when the floating point (maths co-processor) was first introduced, it was rather slow but it seems to be speeding up with each new processor.
Dividing a 64 bit number
Converting 64 bits to ascii
   Converting 64 bits to ascii using the CPU and DIV
   Convert 64 bits to ascii using the CPU bit shifting and bcd addition
   Convert 64 bits to ascii using the FPU instruction FPREM
   Convert 64 bits to ascii using the FPU instruction FRNDINT

Dividing a 64 bit number
The first two tests compare the speed of dividing a 64 bit number by 10 using the ordinary registers against the same action using the floating point registers. Here is the coding of these two tests starting with the CPU coding and then the FPU coding:-

Divide 64 bits in edx:eax by 10 (CPU method using DIV)

MOV EDI,500        ;do this 500 times
L1:            
MOV ESI,EAX        ;keep low order part in esi
MOV EAX,EDX        ;ready to divide high order part
XOR EDX,EDX
MOV ECX,10D
DIV ECX            ;divide-remainder now in edx
XCHG EAX,ESI       ;get low order part in eax, quotient in esi
DIV ECX
MOV EBX,EDX        ;remainder now in ebx
MOV EDX,ESI        ;high order quotient in edx, low in eax
DEC EDI
JNZ L1

Divide 64 bits by 10 using FIDIV

FINIT             ;initialise fpu
FLDCW W[INIT]     ;0F3Fh initialise precision=64 bits rounding=0
MOV ECX,500       ;do this 500 times
L1:
FLD T[MEM_64]     ;load 64 bit number (in 80 bit real format)
FIDIV W[TEN]      ;divide by 10
FISTP Q[RESULT]   ;get 64 bit long integer result
DEC ECX
JNZ L1

Converting 64 bits to ascii
The next eight tests compare the speed of various methods of converting a 64 bit number to ascii so that it can be displayed to the user as a decimal number. In these tests the actual screen-write is not carried out, and only the basic coding is compared to avoid obscuring the differences in speed. One way to achieve the conversion to ascii using the CPU is to divide the number by 10 and write the remainder to the screen, then continue this until all 64 bits have been dealt with. Another method described below is to use bcd (binary coded decimal) numbers. The speed tests are carried out both with small and with large numbers. In these tests the small number is 128 (in floating point terms this is a mantissa of +8000000000000000 and an exponent of 8); and the large number is 1234567890123456789 when using the CPU (for the FPU a real number of 1234567890123456789012345.0E-6 is used which gives a mantissa of 891087A3EF4C0860 and an exponent of 61). The CPU DIV test also drops out after 20 digits have been written. The actual coding of the main part of the CPU DIV test is as follows (where the number to be dealt with is in edx:eax, where ECX is 19):-

Convert 64 bits to ASCII using the CPU and DIV

MOV EDI,10          ;ready to divide by 10
L4:
MOV ESI,EAX         ;keep low order part in esi
MOV EAX,EDX         ;ready to divide high order part
XOR EDX,EDX
DIV EDI             ;rem now in edx ready for 2nd div
XCHG EAX,ESI        ;restore low order part in eax
DIV EDI
MOV [MEM+ECX],DL    ;write remainder to memory
MOV EDX,ESI         ;high order now in edx, low in eax
DEC ECX             ;see if 20 digits written already
JS >L5              ;yes
OR EAX,EAX          ;see if number is zero (low)
JNZ L4              ;no
OR EDX,EDX          ;see if number is zero (high)
JNZ L4              ;no
L5:

Convert 64 bits to ASCII using the CPU bit shifting and bcd addition.
The conversion from a 64 bit binary number to ascii is done here in an entirely different way from the divide by 10 method. This method looks at each bit in turn and if the bit is "set" (on) the value represented by that bit is added to a "result" buffer in memory. So for example we know that bit 0 has a value of 1 (decimal), and bit 2 has a value of 4 decimal. If bits 0 to 2 are looked at and only bits 0 and 2 are set these two values would be added together in the result buffer making 5 so far. After all 64 bits have been dealt with in this way the actual value in the result buffer will be a representation of the 64 bit number we started with. The tricky part is to arrange for the numbers in the result buffer to be a string of decimal digits which can then be written to the screen.
This is where the binary coded decimal (bcd) numbers come in. Continuing with my example of addition, bit 3 has a value of 8 decimal and if this is also set, you will have to add 8 and 5 making 13 decimal. What we need to do here is to ensure that this value of 13 decimal is drawn out as 1 then 3 decimal, rather than 13. Bcd numbers are digits with a value of zero to 9 only. There are a series of assembler instructions which use bcd numbers and these ensure that the limit of 9 is maintained in the register being used. For example, if you use ADD AL,AH to add 8 (in ah) to 5 (in al) this will result in a value of 13 decimal in al. However, following the instruction with the special bcd instruction AAA will reduce the value in al to 3 and set the carry flag to signify that there is a carry. Using the bcd instruction DAA allows you to use two bcd digits per byte (hence these are called "packed" bcd numbers). There is one in the first 4 bits (the first nibble) and one in the second 4 bits.
In the coding which follows, two buffers are set up. On is the result buffer, which is held in the first 512 bytes of the memory area. The other is the "2* buffer" which holds the value of the bit currently being looked at. This buffer starts off with the value of 1. This is when bit 0 is being considered. The value in this buffer has to be doubled ready for the next bit to be considered, so it increases as follows: 1 then 2 then 4 then 8 then 1,6 then 3,2 then 6,4 then 1,2,8 and so on. So at any one time the 2* buffer contains a representation of the value of the bit in the 64-bit number which is currently being considered. If a bit in the 64-bit number is set then each "active" nibble in the 2* buffer needs to be added to the corresponding nibble in the results buffer (using bcd instructions), and the carry if present passed to the next digit on the left. At the end of the procedure the result buffer contains a representation of the actual value in the original 64-bit number. All that remains is to take each nibble in turn to write the result to the screen.
Although this method is usually slower than the others it is capable of very accurate results since no rounding occurs whatsover (unlike when using divide), and it can be extended well beyond 64 bits.
The coding is in two parts, firstly to set things up (where the 64 bit number to convert is in edx:eax):-

Setting up the buffers and feeding blocks of 32 bits to the shift and bcd addition procedure

PUSH EDX,EAX            ;save 64-bit value
MOV EDI,[MEMAREA]       ;get memory area to use
XOR EAX,EAX
MOV ECX,128
REP STOSD               ;fill result buffer with zeroes
MOV ECX,128
REP STOSD               ;fill 2* buffer with zeroes
DEC EDI                 ;back to end of 2* buffer
MOV B[EDI],1            ;insert starter there
MOV [BUFF],EDI          ;keep end of active 2* buffer
DEC EDI
MOV EBP,EDI             ;keep buffer stopper in ebp
POP EDX                 ;get first 32 bits to look at
MOV ECX,32              ;ready to look at 32 bits
L6:
CALL DOABIT             ;do one bit only
DEC ECX                 ;see if any more to do
JNZ L6                  ;yes
POP EDX                 ;get next 32 bits to look at
OR EDX,EDX              ;see if any are set
JZ >L10                 ;no, so may finish
MOV ECX,32              ;ready to look at 32 bits
L8:
CALL DOABIT             ;do one bit only
DEC ECX                 ;see if any more to do
JNZ L8                  ;yes
L10:
This is now the DOABIT procedure. Look at each bit. If set, add value in 2* buffer to result buffer. Then in any case double the value of the 2* buffer.
DOABIT:
SHR EDX,1               ;see if bit is set and get to next
JNC >L26                ;no, so just double the 2* buffer
;***************** add 2* buffer to result buffer
MOV ESI,[BUFF]          ;get active end of 2* buffer
MOV EDI,ESI
SUB EDI,512             ;get to corresp active end of result buffer
XOR AH,AH
L21:
MOV AL,[EDI]            ;get current value in result buffer
ADD AL,AH               ;add previous carry if any
ADD AL,[ESI]            ;add the 2* value
DAA                     ;adjust to correct to BCD
JZ >L23                 ;al is zero - so may be finished
L22:
MOV AH,0                ;no effect on carry flag
MOV [EDI],AL            ;put value back into result buffer
DEC EDI,ESI             ;no effect on carry flag
JNC L21                 
INC AH                  ;give effect to carry flag
JMP L21
L23:                    ;may be finished
JC L22                  ;if c not all done (yes! 50+50 would do this)
CMP ESI,EBP             ;see if stopper in 2* buffer reached yet
JA L22                  ;no, return to loop nc
MOV EBX,EDI             ;keep earliest current place in result buffer
;******************* double the 2* buffer
L26:
MOV EDI,[BUFF]          ;get active end of 2* buffer
XOR AH,AH
L28:
MOV AL,[EDI]            ;get current value
ADD AH,AL               ;add previous carry if any
ADD AL,AH               ;double the value, add carry too
DAA                     ;adjust to correct to BCD
JZ >L32                 ;zero reached - possibly all done
L230:
MOV AH,0                ;no effect on carry flag
MOV [EDI],AL            ;put value back into buffer
DEC EDI                 ;no effect on carry flag
JNC L28
INC AH                  ;give effect to carry
JMP L28
L32:                    ;possibly all done
JC L30                  ;if c not all done (yes! 50+50 would do this)
CMP EDI,EBP             ;see if reached stopper in 2* buffer yet
JA L30                  ;no, return to loop nc
JZ >L34                 ;if same, stopper need not move
DEC EBP                 ;move back stopper by one
MOV EAX,[BUFF]          ;get end of active buffer so far
SUB EAX,EBP             ;get distance between them now
CMP EAX,42              ;see if 42 done so far (83/4 bcd digits)
JNA >L34                ;no, max required precision not reached
DEC D[BUFF]             ;bring active end of buffer back one
L34:
RET

Convert 64 bits to ASCII using the FPU instruction FPREM
Since FPREM computes the remainder after a pretended ST0/ST1 (that is, contents of floating point stack register 0 divided by stack register 1), it would appear to be ideal to produce an ascii string from a large number. The idea would be to use FPREM to pretend to divide the number by 10 and get the remainder, put that in the ascii string, then actually carry out the division by 10 and continue carrying out these actions until the number goes to less than one. Unfortunately in practice some complicated code is required to make this work properly and this also slows the whole process down.
One complication is that because of the way FPREM works, it is not necessarily a one-shot instruction. FPREM computes the remainder by carrying out successive subtractions of the divisor. If the difference between the exponents of the dividend and the divisor is less than 64 FPREM needs only one shot, but if the difference is higher than this, it needs to be called more than once. FPREM indicates that it has not finished by setting the C2 condition code. In the example below, lines 7 to 11 get the C2 status word into AX after each FPREM operation and if the C2 condition code is set execution loops back to do another FPREM.
After the remainder has been calculated it is necessary to get it out of the floating point register and into memory. This is done in line 13 using the instruction FISTP with a dword destination. Now comes the second complication. FPREM did not perform as expected, and it was found to produce a negative result sometimes. So for example, if the remainder should be 7, the result might be minus 3 (0FFFFFFFDh), rather than 7. This is dealt with in lines 15 to 18 by testing for the negative and adding 10 if found.
Before the FISTP operation the precision flag is changed to round down (line 12). This is to ensure that another of FPREM's strange results, a negative zero becomes -1 after the FISTP and then is changed to 9 by the test carried out in lines 15 to 18. Rounding is changed back to round near immediately afterwards. The precision and rounding values are words in data initialised to the correct value.
The next step is to divide the number at the top of the stack using FDIV (line 19). This is an actual divide instead of a pretended one as before. Note FDIV simply throws away the remainder, which is why we are trying FPREM in the first place!
Then lines 20 to 25 check whether the number has yet gone below 1. If not, the process must be continued until it has, or until 20 digits have been done.
This procedure is accurate at least to 20 decimal digits, which represents the accuracy of the 64 bit floating point register. Beyond that the FDIV starts to lose precision. Round near reduces these inaccuracies. FDIV sets the precision flag if precision has been lost.
In this example I used the instruction FPREM1 which superceded FPREM after the IEEE finalised the floating point standard. Before any of this code is executed (as is required in all cases) the instruction FINIT should be executed and the precision and rounding flags should be set (in this case) to 64-bit precision and round near. Note also that the data "TEN" is a word initialised to the value of 10. This is pushed on the first floating point register ready for each divide by 10 (line 5).

1.  MOV ECX,20              ;do 20 digits only (max resolution anyway)
2.  MOV ESI,ADDR DWORDNO    ;data in memory declared as dword
3.  FLD Q[QWORDNO]          ;load 64 bit format number to convert
4.  L30:                    ;(number is always at the top of the stack)
5.  FILD W[TEN]             ;make ST0=10 and PUSH: number now pushed to ST1
6.  FLD 1                   ;PUSH copy of ST1 at ST0: now ST1=10, ST0=ST2
7.  L34:
8.  FPREM1                  ;get remainder following pretended ST0/ST1
9.  FSTSW AX                ;get status word into ax
10. TEST AH,4h              ;check C2 condition code-see if FPREM1 done
11. JNZ L34                 ;no, so need to continue
12. FLDCW W[PREC64_DOWN]    ;073Fh=precision=64 bits, rounding=down
13. FISTP D[ESI]            ;convert remainder to integer and POP
14. FLDCW W[PREC64_NEAR]    ;033Fh=precision=64 bits, rounding=nearest
15. TEST D[ESI],80000000h   ;see if negative number
16. JZ >L38                 ;no
17. ADD D[ESI],10
18. L38:
19. FDIV                    ;ST1=ST1/ST0 then POP leaving result at top of stack
20. DEC ECX
21. JZ >L39
22. FICOM W[ONE]            ;compare stack top with 1
23. FSTSW AX                ;get status word
24. TEST AH,45h             ;check C0,C2,C3 condition code - see if number less than 1
25. JZ L30                  ;no, need to continue
26. L39:
27. FFREE 0                 ;free last stack element
28. FINCSTP                 ;and increment stack pointer

Convert 64 bits to ASCII using the FPU instruction FRNDINT
This conversion method is to divide the number to be converted by 10, convert the result of this division to an integer using FRNDINT (ie. remove the fractional part of the result) then multiply that integer by 10. Then subtract that from the original number. This produces the equivalent of the "remainder" of a division of the original number by 10. This is quite a precise method although not as precise as the method using FPREM, but it is about 20% quicker. Of course (as before) the processor must be initialised using FINIT and the precision and rounding flags must be set.

MOV EDI,ADDR BUFFER
MOV ECX,20              ;20 digits only
MOV ESI,ADDR DWORDNO
FLD Q[QWORDNO]          ;load 64 bit format number to convert
L30:                    ;(number is always at the top of the stack)
FLD 0                   ;PUSH copy of number at top of stack now ST0=ST1
FILD W[TEN]             ;ST0=10 and PUSH: number now pushed to ST1 & ST2
FDIV 2,0                ;ST2=ST2/ST0 divided value now at ST2, orig no. at ST1
FLD 2                   ;make copy at stack top of divided value
FRNDINT                 ;round stack top up to an integer
FMUL                    ;ST0*ST1 (div value*10) leaving result on stack top
FSUB                    ;subtract mul result from div res result in stack top
FISTP D[ESI]            ;convert remainder to integer and POP
DEC ECX                 ;see if max digits reached
JZ >L36                 ;yes - finish
FICOM W[ONE]            ;compare stack top with 1
FSTSW AX                ;get status word
TEST AH,45h             ;check C0,C2,C3 condition code - see if number less than 1
JZ L30                  ;no, need to continue
L36:
FFREE 0                 ;free last stack element
FINCSTP                 ;and increment stack pointer