"Use of mnemonics" demonstrations
3DNow! instructions

About the 3DNow! instructions
The 3DNow! instructions were introduced by AMD in their K6-2 processor. They use the MMX registers (which share the floating point registers), treating each one as two 32-bit single-precision floating point values. This was intended to support enhanced 3D graphics and audio processing. The problem for programmers is that Intel did not follow suit, so you have the strange situation that the 3DNow! instructions are supported on fairly old AMD processors but not on the latest Intel ones. Nevertheless, AMD continue to support the 3DNow! instructions in their AMD64 processor, so it may be that they are here to stay!
Obviously the program must test whether the CPU it is running on supports these instructions and, if not, switch over to an algorithm that does not need them. To test for support, move the value 80000001h into the EAX register and then execute CPUID. This reports the CPU's extended feature flags in the EDX register: bit 31 is set if the 3DNow! instructions can be used. Five new 3DNow! instructions were added in the AMD Athlon processor, and if these are supported bit 30 is also set. (Strictly, a program should first execute CPUID with 80000000h in EAX and check that the value returned in EAX is at least 80000001h, since older processors do not implement the extended functions; the demos omit this step.)
Note also that FEMMS is used at the end of each sequence. It marks the floating point tag words as empty (leaving the register contents undefined) so that, if necessary, floating point instructions can then be used.
Except for the sample for the reciprocal instructions, the 3DNow! instructions are simple to use. The examples here use only the register-to-register form of the instructions to make the demonstration easier to follow. In practice, instead of loading a number from memory into a register and then carrying out the register-to-register form of the instruction, you would use the memory-to-register form of the instruction wherever possible. An example of this is given in the demo of the conversion instructions.
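The feature-flag test described above comes down to two bit masks applied to EDX. As a minimal sketch (in Python rather than assembler, because Python cannot execute CPUID), the following decodes a raw EDX value that is assumed to have been obtained from CPUID function 80000001h. The function name and the sample values are made up for illustration.

```python
# Model of the EDX feature-flag test. The CPUID result is passed in as a plain
# integer; obtaining it from the processor is outside the scope of this sketch.
BIT_3DNOW = 0x80000000           # bit 31: 3DNow! instructions
BIT_3DNOW_EXTENDED = 0x40000000  # bit 30: extensions added in the Athlon


def decode_3dnow_flags(edx):
    """Return (has_3dnow, has_extended) for a raw 32-bit EDX value."""
    return bool(edx & BIT_3DNOW), bool(edx & BIT_3DNOW_EXTENDED)


if __name__ == "__main__":
    for edx in (0xC0000000, 0x80000000, 0x00000000):  # made-up sample values
        print(hex(edx), decode_3dnow_flags(edx))
```

The two masks are the same constants that the demos use in their TEST EDX instructions.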

Each of the following demonstrations relies on the following data declarations in the DATA SECTION:-

_3DPI_VALUE       DD PI          ;floating point value (low)
                  DD 6.294       ;floating point value (high)
_3DSTARTER_VALUE2 DD 5.1E0       ;floating point value (low)
                  DD 1.2E0       ;floating point value (high)
_3DSTARTER_VALUE3 DD 33333       ;integer value (low)
                  DD 66666       ;integer value (high)

There are 5 demonstrations:-
Arithmetic instructions
Comparison instructions
Conversion instructions
Reciprocal instructions
Extended instructions

Arithmetic instructions
These are all packed instructions, so they operate on the high and low dwords of the registers at the same time. For each test each dword holds a number in single-precision floating point format.
Set the breakpoint to _3D_ARITHMETIC, run the test and single-step through the code:-

_3D_ARITHMETIC:
MOV EAX,80000001h      ;request CPU extended feature flags
CPUID                  ;0Fh, 0A2h CPUID instruction
TEST EDX,80000000h     ;test bit 31
JNZ >L2                ;3DNow! available
CALL NO_3DNOWMESS
RET
L2:
MOVQ MM0,[_3DPI_VALUE]
MOVQ MM1,[_3DSTARTER_VALUE2]
MOVQ MM2,MM1           ;copy to MM2
PFADD MM2,MM0          ;packed floating point add (result in MM2)
;
MOVQ MM3,MM1           ;copy to MM3
PFACC MM3,MM0          ;packed accumulation: low=MM3 low+high, high=MM0 low+high
;
MOVQ MM4,MM1           ;copy to MM4
PFSUB MM4,MM0          ;packed subtraction, MM4-MM0 (result in MM4)
;
MOVQ MM5,MM1           ;copy to MM5
PFSUBR MM5,MM0         ;packed reverse subtraction, MM0-MM5 (result in MM5)
;
MOVQ MM6,MM1           ;copy to MM6
PFMUL MM6,MM0          ;packed multiply (result in MM6)
FEMMS
RET
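To see what to expect in each register while single-stepping, the five arithmetic instructions can be modelled on pairs of numbers. This is a sketch in Python under stated assumptions: a register is represented as a (low, high) tuple, the function names simply mirror the mnemonics, and results are rounded to single precision with round-to-nearest, so the last bit may differ from what a particular processor produces.

```python
import struct


def f32(x):
    """Round a Python float to single precision, as held in one dword."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# A register is modelled as a (low, high) tuple; dst is the first operand.
def pfadd(dst, src):
    return (f32(dst[0] + src[0]), f32(dst[1] + src[1]))


def pfsub(dst, src):
    return (f32(dst[0] - src[0]), f32(dst[1] - src[1]))


def pfsubr(dst, src):
    # Reverse subtraction: source minus destination.
    return (f32(src[0] - dst[0]), f32(src[1] - dst[1]))


def pfmul(dst, src):
    return (f32(dst[0] * src[0]), f32(dst[1] * src[1]))


def pfacc(dst, src):
    # Sums the two halves of each register rather than adding lane to lane.
    return (f32(dst[0] + dst[1]), f32(src[0] + src[1]))


if __name__ == "__main__":
    mm1 = (1.5, 2.0)   # illustrative values, exactly representable
    mm0 = (0.25, 4.0)
    for op in (pfadd, pfacc, pfsub, pfsubr, pfmul):
        print(op.__name__, op(mm1, mm0))
```

Note that PFACC is the odd one out: its low result comes entirely from the destination register and its high result entirely from the source.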

Comparison instructions
These are all packed instructions and so operate on two dwords (containing single-precision floating point numbers) at once. The comparison instructions PFCMPEQ, PFCMPGT and PFCMPGE give the result for each dword as either all one bits (condition true) or all zero bits (condition false), whereas PFMAX returns the larger of two floating point numbers, and PFMIN returns the smaller.
Set the breakpoint to _3D_COMPARISON, run the test and single-step through the code:-

_3D_COMPARISON:
MOV EAX,80000001h      ;request CPU extended feature flags
CPUID                  ;0Fh, 0A2h CPUID instruction
TEST EDX,80000000h     ;test bit 31
JNZ >L2                ;3DNow! available
CALL NO_3DNOWMESS
RET
L2:
MOVQ MM0,[_3DPI_VALUE]
MOVQ MM1,[_3DSTARTER_VALUE2]
PFCMPEQ MM0,MM1        ;see if equal, result in MM0
MOVQ MM0,[_3DSTARTER_VALUE2] ;make MM0 the same as MM1
PFCMPEQ MM0,MM1        ;see if equal, result in MM0
MOVQ MM0,[_3DPI_VALUE]
PFCMPGT MM0,MM1        ;see if MM0 greater, result in MM0
MOVQ MM0,[_3DPI_VALUE]
PFCMPGT MM1,MM0        ;see if MM1 greater, result in MM1
MOVQ MM1,[_3DSTARTER_VALUE2]
PFCMPGE MM0,MM1        ;see if MM0 greater or equal, result in MM0
MOVQ MM0,[_3DPI_VALUE]
PFCMPGE MM1,MM0        ;see if MM1 greater or equal, result in MM1
MOVQ MM1,[_3DSTARTER_VALUE2]
PFCMPGE MM0,MM1        ;see if MM0 greater or equal, result in MM0
MOVQ MM0,[_3DPI_VALUE]
PFMAX MM0,MM1          ;return the larger in MM0
MOVQ MM0,[_3DPI_VALUE]
PFMAX MM1,MM0          ;return the larger in MM1
MOVQ MM1,[_3DSTARTER_VALUE2]
PFMIN MM0,MM1          ;return the smaller in MM0
MOVQ MM0,[_3DPI_VALUE]
PFMIN MM1,MM0          ;return the smaller in MM1
FEMMS
RET
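The masks produced by the comparison instructions are easier to predict with a small model. The following Python sketch again treats a register as a (low, high) tuple; the function names mirror the mnemonics and are otherwise hypothetical. It ignores special cases such as signed zeros and unsupported values, which the processor may treat differently.

```python
ALL_ONES = 0xFFFFFFFF  # the "true" mask written to a dword


def _mask(flag):
    return ALL_ONES if flag else 0


# dst is the first operand; each result dword is all ones or all zeros.
def pfcmpeq(dst, src):
    return tuple(_mask(d == s) for d, s in zip(dst, src))


def pfcmpgt(dst, src):
    return tuple(_mask(d > s) for d, s in zip(dst, src))


def pfcmpge(dst, src):
    return tuple(_mask(d >= s) for d, s in zip(dst, src))


def pfmax(dst, src):
    return tuple(max(d, s) for d, s in zip(dst, src))


def pfmin(dst, src):
    return tuple(min(d, s) for d, s in zip(dst, src))


if __name__ == "__main__":
    mm0 = (3.1415927, 6.294)  # same pattern as _3DPI_VALUE (low, high)
    mm1 = (5.1, 1.2)          # same pattern as _3DSTARTER_VALUE2
    print([hex(v) for v in pfcmpgt(mm0, mm1)])
    print(pfmax(mm0, mm1), pfmin(mm0, mm1))
```

With the demo's data the low and high dwords give opposite answers for the greater-than tests, which makes it easy to check in the debugger that each dword is compared independently.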

Conversion instructions
These are both packed instructions. PF2ID converts two dwords in floating point format to signed 32-bit integers in the destination register, truncating any fractional part. PI2FD converts two signed 32-bit integers to dwords in floating point format in the destination register.
Set the breakpoint to _3D_CONVERSION, run the test and single-step through the code:-

_3D_CONVERSION:
MOV EAX,80000001h      ;request CPU extended feature flags
CPUID                  ;0Fh, 0A2h CPUID instruction
TEST EDX,80000000h     ;test bit 31
JNZ >L4                ;3DNow! available
CALL NO_3DNOWMESS
RET
L4:
MOVQ MM1,[_3DPI_VALUE]
PF2ID MM0,MM1           ;packed floating to integer dwords
;
;or this, which is the same ..
PF2ID MM0,[_3DPI_VALUE] ;packed floating to integer dwords
;
MOVQ MM1,[_3DSTARTER_VALUE3]
PI2FD MM0,MM1          ;packed integer to floating dwords
;
;or this, which is the same ..
PI2FD MM0,[_3DSTARTER_VALUE3] ;packed integer to floating dwords
;
FEMMS
RET
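The two conversions can be modelled one dword at a time. This Python sketch assumes that PF2ID truncates toward zero and saturates at the signed 32-bit limits, as described in the AMD documentation. For PI2FD it rounds to the nearest single-precision value, which is exact for the demo's integers but may differ from the hardware for integers too large to be represented exactly. The function names are illustrative only.

```python
import struct


def f32(x):
    """Round a Python float to single precision, as held in one dword."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def pf2id(x):
    """One dword of PF2ID: truncate toward zero, saturating at int32 limits."""
    if x >= 2147483648.0:
        return 0x7FFFFFFF
    if x <= -2147483648.0:
        return -0x80000000
    return int(x)  # int() discards the fractional part


def pi2fd(n):
    """One dword of PI2FD: signed 32-bit integer to single precision."""
    return f32(float(n))


if __name__ == "__main__":
    print(pf2id(3.1415927), pf2id(6.294))  # pattern of _3DPI_VALUE
    print(pi2fd(33333), pi2fd(66666))      # pattern of _3DSTARTER_VALUE3
```

Because the conversion truncates rather than rounds, a value such as 6.294 becomes 6, and a negative value such as -2.7 becomes -2.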

Reciprocal instructions
Since you cannot divide directly using the 3DNow! instructions, the reciprocal instructions take on a greater importance. To divide x by y you can instead multiply x by the reciprocal of y, since (1/y)*x is the same as x/y. Note that PFRCP is quick but it is only accurate to 14 bits because it gets its value from a ROM-based lookup table. It uses only the low dword of the source, and gives the result in both the low and high dwords of the destination.
Set the breakpoint to _3D_RECIPROCAL, run the test and single-step through the code:-
This test divides PI by itself, leaving the result in the low part of MM1. Note the result is not exactly 1 due to the imprecision:-

MOVQ MM0,[_3DPI_VALUE]
PFRCP MM1,MM0           ;give MM1 the reciprocal of MM0(low)
PFMUL MM1,MM0           ;packed multiply, result in MM1
To achieve greater accuracy (to 24 bits) you need to use two further instructions, PFRCPIT1 and PFRCPIT2. These take the reciprocal produced by PFRCP and the original value given to PFRCP, and refine the reciprocal in two steps using the Newton-Raphson formula. The refined reciprocal can then be multiplied as before to produce a division with 24-bit accuracy. This example carries out the same division as the one above, but to 24-bit accuracy. This time the result is exactly 1:-
MOVQ MM2,[_3DPI_VALUE]
PFRCP MM3,MM2           ;give MM3 the reciprocal of MM2(low)
PFRCPIT1 MM2,MM3        ;first refining iteration
PFRCPIT2 MM2,MM3        ;second refining iteration
;the accurate reciprocal is now in MM2(low)
PFMUL MM2,MM0           ;packed multiply result in MM2(low)
;compare with imprecise result in MM1(low)
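The refinement can be illustrated with the textbook Newton-Raphson step for a reciprocal, x1 = x0*(2 - b*x0), which roughly doubles the number of correct bits. The Python sketch below is not a bit-exact model of PFRCPIT1 and PFRCPIT2. It assumes that the two instructions together amount to one such step, and it stands in for the PFRCP lookup table by truncating the true reciprocal to 14 mantissa bits. Both helper names are hypothetical.

```python
import math


def truncate_bits(x, bits):
    """Keep only the top `bits` mantissa bits of a positive x (crude estimate)."""
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= m < 1
    scale = 1 << bits
    return math.ldexp(math.floor(m * scale) / scale, e)


def refine_reciprocal(b, x0):
    """One Newton-Raphson step towards 1/b: x1 = x0 * (2 - b * x0)."""
    return x0 * (2.0 - b * x0)


b = math.pi
x0 = truncate_bits(1.0 / b, 14)  # stands in for the PFRCP table value
x1 = refine_reciprocal(b, x0)    # stands in for PFRCPIT1 followed by PFRCPIT2

print("error before refining:", abs(b * x0 - 1.0))
print("error after refining: ", abs(b * x1 - 1.0))
```

If the estimate has a relative error e, the refined value has a relative error of e squared. This is why a single refinement of a 14-bit estimate is enough to fill the 24-bit mantissa of a single-precision number.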
Finally, here is an example of how to obtain the reciprocal of a square root using PFRSQRT. Again this is a table-lookup instruction, which is fast but only accurate to 15 bits:-
PFRSQRT MM1,MM0         ;give MM1 the reciprocal of sq root of MM0(low)
To achieve greater accuracy (to 24 bits) you need to refine the result using PFMUL, PFRSQIT1 and PFRCPIT2:-
PFRSQRT MM2,MM0         ;give MM2 the reciprocal of sq root of MM0(low)
MOVQ  MM3,MM2           ;and keep in MM3
PFMUL MM2,MM2           ;square the value in MM2
PFRSQIT1 MM2,MM0        ;first refining iteration
PFRCPIT2 MM2,MM3        ;second refining iteration
;the accurate reciprocal of sq root is now in MM2(low)
;compare with imprecise result in MM1(low)
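The refinement of the reciprocal square root follows the same idea, using the Newton-Raphson step x1 = x0*(3 - b*x0*x0)/2. The Python sketch below is again an illustration under assumptions rather than a bit-exact model: the square of x0 corresponds to the PFMUL in the listing, the rest of the formula is assumed to be shared between PFRSQIT1 and PFRCPIT2, and the PFRSQRT table value is imitated by truncating the true result to 15 mantissa bits. The helper names are hypothetical.

```python
import math


def truncate_bits(x, bits):
    """Keep only the top `bits` mantissa bits of a positive x (crude estimate)."""
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= m < 1
    scale = 1 << bits
    return math.ldexp(math.floor(m * scale) / scale, e)


def refine_rsqrt(b, x0):
    """One Newton-Raphson step towards 1/sqrt(b): x1 = x0 * (3 - b*x0*x0) / 2."""
    return x0 * (3.0 - b * x0 * x0) / 2.0


b = math.pi
x0 = truncate_bits(1.0 / math.sqrt(b), 15)  # stands in for the PFRSQRT value
x1 = refine_rsqrt(b, x0)

print("error before refining:", abs(x0 * math.sqrt(b) - 1.0))
print("error after refining: ", abs(x1 * math.sqrt(b) - 1.0))
```

Multiplying the refined result by b gives the square root itself, since b*(1/sqrt(b)) is the same as sqrt(b).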
Note that PFRSQIT1, PFRCPIT1 and PFRCPIT2 are packed instructions so, with a little tweaking, it's possible to carry out two divisions, or obtain two reciprocal square roots, at once - see the AMD 3DNow! Technology Manual.

Extended instructions
These instructions were introduced in the AMD Athlon processor. Note that this time bit 30 of EDX is tested to see if the processor supports the instructions. Again all the instructions are packed, working on two dwords in the registers at once.
Set the breakpoint to _3D_EXTENDED, run the test and single-step through the code:-

_3D_EXTENDED:
MOV EAX,80000001h      ;request CPU extended feature flags
CPUID                  ;0Fh, 0A2h CPUID instruction
TEST EDX,40000000h     ;test bit 30
JNZ >L8                ;3DNow! extended available
PUSH 40h               ;information+ok button only
PUSH 'Testbug - 3DNow! (extended) test'
PUSH 'Sorry 3DNow! extended instructions are not available on your processor',[hWnd]
CALL MessageBoxA       ;wait till ok pressed
RET
L8:
MOVQ MM1,[_3DPI_VALUE]
PF2IW MM0,MM1          ;packed floating point to integer word
MOVQ MM0,[_3DPI_VALUE]
MOVQ MM1,[_3DSTARTER_VALUE2]
PFNACC MM0,MM1         ;packed negative accumulation
MOVQ MM1,[_3DSTARTER_VALUE2]
PFPNACC MM0,MM1        ;packed positive-negative accumulation
MOVQ MM1,[_3DSTARTER_VALUE3]
PI2FW MM0,MM1          ;packed integer word to floating point (low word of each dword)
MOVQ MM0,[_3DPI_VALUE]
PSWAPD MM0,MM1         ;MM0 = MM1 with its dwords swapped
FEMMS
RET
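The extended instructions are less obvious than the basic ones, so a model helps when checking the registers in the debugger. The Python sketch below follows the operation descriptions in the AMD documentation as far as they are understood here. In particular it assumes that PF2IW truncates and saturates to a signed 16-bit value, and that PI2FW converts the low word of each dword. A register is a (low, high) tuple except where a single dword is passed, and the function names are illustrative only.

```python
import struct


def f32(x):
    """Round a Python float to single precision, as held in one dword."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def pfnacc(dst, src):
    # Low: difference within dst. High: difference within src.
    return (f32(dst[0] - dst[1]), f32(src[0] - src[1]))


def pfpnacc(dst, src):
    # Low: difference within dst. High: sum within src.
    return (f32(dst[0] - dst[1]), f32(src[0] + src[1]))


def pswapd(src):
    # The destination receives the source with its dwords exchanged.
    return (src[1], src[0])


def pf2iw(x):
    """One dword of PF2IW: truncate, saturating at the signed 16-bit limits."""
    if x >= 32767.0:
        return 32767
    if x <= -32768.0:
        return -32768
    return int(x)


def pi2fw(dword):
    """One dword of PI2FW: low 16 bits taken as a signed word, then converted."""
    word = dword & 0xFFFF
    if word >= 0x8000:
        word -= 0x10000
    return float(word)


if __name__ == "__main__":
    print(pfnacc((5.0, 2.0), (1.0, 4.0)), pfpnacc((5.0, 2.0), (1.0, 4.0)))
    print(pi2fw(33333), pi2fw(66666))  # pattern of _3DSTARTER_VALUE3
```

Under these assumptions the demo's integers 33333 and 66666 do not fit in a signed 16-bit word, so PI2FW sees only their low words and the floating point results are not 33333.0 and 66666.0.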