About the 3DNow! instructions
Each of the demonstrations below relies on the data declarations in the
DATA SECTION, which are listed at the end of this introduction.
The 3DNow! instructions were introduced by AMD in their K6-2 processor.
They use the MMX registers (which share the floating point registers),
treating each 64-bit register as two 32-bit single-precision floating point
values. This was intended to support enhanced 3D graphics and audio
processing. The problem for programmers is that Intel did not follow suit,
so you have the strange situation that the 3DNow! instructions are supported
on fairly old AMD processors but not on the latest Intel ones. Nevertheless,
AMD continue to support the 3DNow! instructions in their AMD64 processor, so
it may be that they are here to stay!

The programmer must therefore test whether the CPU running the program
supports these instructions and, if it does not, the algorithms concerned
must be carried out in some other way. To test for support, move the value
80000001h into the EAX register and then execute CPUID. This reports the CPU
extended feature flags in the EDX register: bit 31 is set if the 3DNow!
instructions can be used. Five new 3DNow! instructions were added in the
AMD Athlon processor and if these are supported bit 30 is also set.

Note also that FEMMS is used at the end of each sequence to clear the
registers and the tag words so that, if necessary, floating point instructions
can then be used.
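The bit tests described above can also be sketched in C. This is only an illustration under stated assumptions: the helper names has_3dnow and has_3dnow_ext are hypothetical, and the sketch takes the EDX value as a parameter rather than executing CPUID itself, so obtaining that value is left to the caller.

```c
#include <stdint.h>

/* Hypothetical helpers (not part of the original demo): decode the EDX
   value returned by CPUID after it was executed with EAX = 80000001h. */

/* Bit 31 of EDX is set when the 3DNow! instructions are available. */
int has_3dnow(uint32_t edx)
{
    return (edx & 0x80000000u) != 0;
}

/* Bit 30 of EDX is set when the extended 3DNow! instructions
   (added in the AMD Athlon) are available. */
int has_3dnow_ext(uint32_t edx)
{
    return (edx & 0x40000000u) != 0;
}
```

For instance, an EDX value of C0000000h would report both the basic and the extended instructions as available, while 80000000h would report only the basic ones.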
Except for the sample for the reciprocal instructions, the 3DNow!
instructions are simple to use. The examples here use only the
register-to-register form of the instructions
to make the demonstration easier to follow. In practice, instead of loading
a number from memory into a register and then carrying out the register-to-register
form of the instruction, you would use the memory-to-register form of the
instruction wherever possible. An example of this is given in the demo
of the conversion instructions.
_3DPI_VALUE DD PI ;floating point value (low)
DD 6.294 ;floating point value (high)
_3DSTARTER_VALUE2 DD 5.1E0 ;floating point value (low)
DD 1.2E0 ;floating point value (high)
_3DSTARTER_VALUE3 DD 33333 ;integer value (low)
DD 66666 ;integer value (high)
There are 5 demonstrations:-
Arithmetic instructions
Comparison instructions
Conversion instructions
Reciprocal instructions
Extended instructions
Arithmetic instructions
These are all packed instructions, so they operate
on the high and low dwords of the registers at the same time. For each
test each dword holds a number in single-precision floating point format.
Set the breakpoint to _3D_ARITHMETIC, run the test and single-step through the code:-
MOV EAX,80000001h ;request CPU extended feature flags
CPUID ;0Fh, 0A2h CPUID instruction
TEST EDX,80000000h ;test bit 31
JNZ >L2 ;3DNow! available
CALL NO_3DNOWMESS
RET
L2:
MOVQ MM0,[_3DPI_VALUE]
MOVQ MM1,[_3DSTARTER_VALUE2]
MOVQ MM2,MM1 ;copy to MM2
PFADD MM2,MM0 ;packed floating point add (result in MM2)
;
MOVQ MM3,MM1 ;copy to MM3
PFACC MM3,MM0 ;packed accumulation (result in MM3)
;
MOVQ MM4,MM1 ;copy to MM4
PFSUB MM4,MM0 ;packed subtraction (result in MM4)
;
MOVQ MM5,MM1 ;copy to MM5
PFSUBR MM5,MM0 ;packed reverse subtraction (result in MM5)
;
MOVQ MM6,MM1 ;copy to MM6
PFMUL MM6,MM0 ;packed multiply (result in MM6)
FEMMS
RET
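To make clear which dwords each arithmetic instruction combines, here is a minimal C model of the five instructions used in this demo. It is a sketch of the arithmetic only, not of the hardware: the type name mm_reg and the function names are hypothetical, and rounding details of the real instructions are not modelled.

```c
/* Hypothetical model of one 64-bit register holding two single-precision
   values: lo is bits 0-31 and hi is bits 32-63. */
typedef struct { float lo, hi; } mm_reg;

/* PFADD dest,src: add the two pairs of dwords. */
mm_reg pfadd(mm_reg dest, mm_reg src)
{
    mm_reg r = { dest.lo + src.lo, dest.hi + src.hi };
    return r;
}

/* PFSUB dest,src: dest minus src in each dword. */
mm_reg pfsub(mm_reg dest, mm_reg src)
{
    mm_reg r = { dest.lo - src.lo, dest.hi - src.hi };
    return r;
}

/* PFSUBR dest,src: the reverse, src minus dest in each dword. */
mm_reg pfsubr(mm_reg dest, mm_reg src)
{
    mm_reg r = { src.lo - dest.lo, src.hi - dest.hi };
    return r;
}

/* PFMUL dest,src: multiply the two pairs of dwords. */
mm_reg pfmul(mm_reg dest, mm_reg src)
{
    mm_reg r = { dest.lo * src.lo, dest.hi * src.hi };
    return r;
}

/* PFACC dest,src: the two dwords of dest are summed into the low dword
   of the result, and the two dwords of src into the high dword. */
mm_reg pfacc(mm_reg dest, mm_reg src)
{
    mm_reg r = { dest.lo + dest.hi, src.lo + src.hi };
    return r;
}
```

Note that PFACC is the odd one out: it adds across a register rather than between registers, so with dest holding (1.5, 2.5) and src holding (0.25, 4.0) this model gives (4.0, 4.25).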
Comparison instructions
These are all packed instructions and so operate on two dwords (containing
single-precision floating point numbers) at once. The strict comparison
instructions (PFCMPEQ, PFCMPGT and PFCMPGE) give the result of each dword's
comparison either as all one bits (true) or all zero bits (false), whereas
PFMAX returns the larger of two floating point numbers, and PFMIN returns
the smaller.
Set the breakpoint to _3D_COMPARISON, run the test and single-step through the code:-
MOV EAX,80000001h ;request CPU extended feature flags
CPUID ;0Fh, 0A2h CPUID instruction
TEST EDX,80000000h ;test bit 31
JNZ >L2 ;3DNow! available
CALL NO_3DNOWMESS
RET
L2:
MOVQ MM0,[_3DPI_VALUE]
MOVQ MM1,[_3DSTARTER_VALUE2]
PFCMPEQ MM0,MM1 ;see if equal, result in MM0
MOVQ MM0,[_3DSTARTER_VALUE2]
PFCMPEQ MM0,MM1 ;see if equal, result in MM0
MOVQ MM0,[_3DPI_VALUE]
PFCMPGT MM0,MM1 ;see if greater, result in MM0
MOVQ MM0,[_3DPI_VALUE]
PFCMPGT MM1,MM0 ;see if greater, result in MM1
MOVQ MM1,[_3DSTARTER_VALUE2]
PFCMPGE MM0,MM1 ;see if greater or equal, result in MM0
MOVQ MM0,[_3DPI_VALUE]
PFCMPGE MM1,MM0 ;see if greater or equal, result in MM1
MOVQ MM1,[_3DSTARTER_VALUE2]
PFCMPGE MM0,MM1 ;see if greater or equal, result in MM0
MOVQ MM0,[_3DPI_VALUE]
PFMAX MM0,MM1 ;return the larger in MM0
MOVQ MM0,[_3DPI_VALUE]
PFMAX MM1,MM0 ;return the larger in MM1
MOVQ MM1,[_3DSTARTER_VALUE2]
PFMIN MM0,MM1 ;return the smaller in MM0
MOVQ MM0,[_3DPI_VALUE]
PFMIN MM1,MM0 ;return the smaller in MM1
FEMMS
RET
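The results you should expect to see in the registers when single-stepping can be modelled in C as follows. This is a sketch under stated assumptions: the types mm_reg and mm_mask and all function names are hypothetical, and only ordinary finite numbers are considered.

```c
#include <stdint.h>

typedef struct { float lo, hi; } mm_reg;       /* two packed floats (hypothetical type) */
typedef struct { uint32_t lo, hi; } mm_mask;   /* two packed comparison results */

/* All one bits when the condition holds, all zero bits otherwise. */
uint32_t mask_of(int cond)
{
    return cond ? 0xFFFFFFFFu : 0u;
}

/* PFCMPEQ dest,src: is dest equal to src, dword by dword? */
mm_mask pfcmpeq(mm_reg dest, mm_reg src)
{
    mm_mask r = { mask_of(dest.lo == src.lo), mask_of(dest.hi == src.hi) };
    return r;
}

/* PFCMPGT dest,src: is dest greater than src, dword by dword? */
mm_mask pfcmpgt(mm_reg dest, mm_reg src)
{
    mm_mask r = { mask_of(dest.lo > src.lo), mask_of(dest.hi > src.hi) };
    return r;
}

/* PFCMPGE dest,src: is dest greater than or equal to src, dword by dword? */
mm_mask pfcmpge(mm_reg dest, mm_reg src)
{
    mm_mask r = { mask_of(dest.lo >= src.lo), mask_of(dest.hi >= src.hi) };
    return r;
}

/* PFMAX dest,src: the larger value of each pair of dwords. */
mm_reg pfmax(mm_reg dest, mm_reg src)
{
    mm_reg r = { dest.lo > src.lo ? dest.lo : src.lo,
                 dest.hi > src.hi ? dest.hi : src.hi };
    return r;
}

/* PFMIN dest,src: the smaller value of each pair of dwords. */
mm_reg pfmin(mm_reg dest, mm_reg src)
{
    mm_reg r = { dest.lo < src.lo ? dest.lo : src.lo,
                 dest.hi < src.hi ? dest.hi : src.hi };
    return r;
}
```

Because each dword is compared separately, one register can hold a mixed result. With dest holding (3.0, 6.0) and src holding (5.0, 1.0), the model's PFCMPGT gives all zero bits in the low dword and all one bits in the high dword.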
Conversion instructions
These are both packed instructions. PF2ID converts two dwords in floating
point format to signed 32-bit integers in the destination register, discarding
any fractional part. PI2FD converts two signed 32-bit integers to
dwords in floating point format in the destination register.
Set the breakpoint to _3D_CONVERSION, run the test and single-step through the code:-
_3D_CONVERSION:
MOV EAX,80000001h ;request CPU extended feature flags
CPUID ;0Fh, 0A2h CPUID instruction
TEST EDX,80000000h ;test bit 31
JNZ >L4 ;3DNow! available
CALL NO_3DNOWMESS
RET
L4:
MOVQ MM1,[_3DPI_VALUE]
PF2ID MM0,MM1 ;packed floating to integer dwords
;
;or this, which is the same ..
PF2ID MM0,[_3DPI_VALUE] ;packed floating to integer dwords
;
MOVQ MM1,[_3DSTARTER_VALUE3]
PI2FD MM0,MM1 ;packed integer to floating dwords
;
;or this, which is the same ..
PI2FD MM0,[_3DSTARTER_VALUE3] ;packed integer to floating dwords
;
FEMMS
RET
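The effect of the two conversions on a single dword can be sketched in C. The function names are hypothetical and several things are assumed rather than taken from this demo: inputs are ordinary finite numbers, out-of-range values given to PF2ID saturate to the largest or smallest 32-bit integer, and the integers given to PI2FD are small enough (as they are here) to be represented exactly as single-precision values.

```c
#include <stdint.h>

/* Model of PF2ID on one dword: the fractional part is discarded
   (truncation toward zero). Saturation of out-of-range values is an
   assumption of this sketch. */
int32_t pf2id(float f)
{
    if (f >= 2147483648.0f)  return INT32_MAX;   /* too large: 7FFFFFFFh */
    if (f <= -2147483648.0f) return INT32_MIN;   /* too small: 80000000h */
    return (int32_t)f;                           /* a C cast also truncates toward zero */
}

/* Model of PI2FD on one dword: signed 32-bit integer to float. */
float pi2fd(int32_t i)
{
    return (float)i;
}
```

Under this model the two values in _3DPI_VALUE (roughly 3.14159 and 6.294) become the integers 3 and 6, and the two values in _3DSTARTER_VALUE3 become 33333.0 and 66666.0.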
Reciprocal instructions
Since you cannot divide directly using the 3DNow! instructions, the reciprocal
instructions take on a greater importance. To divide x by y you can instead
multiply x by the reciprocal of y, since (1/y)*x is the same as x/y.
Note that PFRCP is quick but it is only accurate to 14
bits because it gets its value from a ROM-based lookup table. It reads only the
low dword of the source, and gives the result in both the low and high dwords
of the destination.
Set the breakpoint to _3D_RECIPROCAL, run the test and single-step through the code:-
This test divides PI (in the low dword of MM0) by itself. Note that the
result in MM1 is not exactly 1 because of the imprecision:-
MOVQ MM0,[_3DPI_VALUE]
PFRCP MM1,MM0 ;give MM1 the reciprocal of MM0(low)
PFMUL MM1,MM0 ;packed multiply, result in MM1
To achieve greater accuracy (to 24 bits) you need to use two further
instructions, PFRCPIT1 and PFRCPIT2. These take the reciprocal
produced by PFRCP and the original value given to PFRCP and use these two
values to refine it further in two steps using the Newton-Raphson formula.
The final output can then be multiplied as before to produce a division
with 24 bit accuracy. This practical example carries out the same division
as in the above example, but to 24 bit accuracy. This time the result is
exactly 1:-
MOVQ MM2,[_3DPI_VALUE]
PFRCP MM3,MM2 ;give MM3 the reciprocal of MM2(low)
PFRCPIT1 MM2,MM3 ;first refining iteration
PFRCPIT2 MM2,MM3 ;second refining iteration
;the accurate reciprocal is now in MM2(low)
PFMUL MM2,MM0 ;packed multiply, result in MM2(low)
;compare with imprecise result in MM1(low)
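The Newton-Raphson refinement of a reciprocal can be sketched in C. This shows the general mathematical step x1 = x0 * (2 - b * x0) only. It is not a bit-exact model of PFRCPIT1 and PFRCPIT2, the way the work is split between those two instructions is not shown, and the function name refine_reciprocal is hypothetical.

```c
/* One Newton-Raphson step towards 1/b, starting from the estimate x0.
   Each step roughly doubles the number of correct bits in the estimate. */
float refine_reciprocal(float b, float x0)
{
    return x0 * (2.0f - b * x0);
}
```

For example, starting from the deliberately poor estimate 0.24 for the reciprocal of 4, one step gives about 0.2496 and a second step gives about 0.249999. This is why a 14 bit table lookup needs so little extra work to reach full single precision.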
Finally here is an example of how to obtain the reciprocal of a square
root using PFRSQRT. Again this is a table lookup instruction which is fast
but is only accurate to 15 bits.
PFRSQRT MM1,MM0 ;give MM1 the reciprocal of sq root of MM0(low)
To achieve greater accuracy (to 24 bits) you need to refine the result
using PFMUL, PFRSQIT1 and PFRCPIT2:-
PFRSQRT MM2,MM0 ;give MM2 the reciprocal of sq root of MM0(low)
MOVQ MM3,MM2 ;and keep in MM3
PFMUL MM2,MM2 ;square the value in MM2
PFRSQIT1 MM2,MM0 ;first refining iteration
PFRCPIT2 MM2,MM3 ;second refining iteration
;the accurate reciprocal of sq root is now in MM2(low)
;compare with imprecise result in MM1(low)
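The corresponding Newton-Raphson step for a reciprocal square root is x1 = x0 * (3 - b * x0 * x0) / 2, which is why the estimate is squared before it is refined. The C sketch below shows that general formula only. It is not a bit-exact model of PFRSQIT1 and PFRCPIT2, and the function name refine_rsqrt is hypothetical.

```c
/* One Newton-Raphson step towards 1/sqrt(b), starting from the estimate x0.
   The estimate is squared first, then corrected. */
float refine_rsqrt(float b, float x0)
{
    return 0.5f * x0 * (3.0f - b * x0 * x0);
}
```

For example, starting from 0.49 as an estimate of the reciprocal square root of 4 (which is 0.5), one step gives about 0.4997.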
Note that PFRSQIT1, PFRCPIT1 and PFRCPIT2 are packed instructions so,
with a little tweaking, it's possible to carry out two divisions, or obtain
two square roots, at once - see the AMD 3DNow! Technology Manual.
Extended instructions
These instructions were introduced in the AMD Athlon
processor. Note that this time bit 30 of EDX is tested to see if the
processor supports the instructions. Again all the instructions are packed,
working on two dwords in the registers at once.
Set the breakpoint to _3D_EXTENDED, run the test and single-step through the code:-
_3D_EXTENDED:
MOV EAX,80000001h ;request CPU extended feature flags
CPUID ;0Fh, 0A2h CPUID instruction
TEST EDX,40000000h ;test bit 30
JNZ >L8 ;3DNow! extended available
PUSH 40h ;information+ok button only
PUSH 'Testbug - 3DNow! (extended) test'
PUSH 'Sorry 3DNow! extended instructions are not available on your processor',[hWnd]
CALL MessageBoxA ;wait till ok pressed
RET
L8:
MOVQ MM1,[_3DPI_VALUE]
PF2IW MM0,MM1 ;packed floating point to integer word
MOVQ MM0,[_3DPI_VALUE]
MOVQ MM1,[_3DSTARTER_VALUE2]
PFNACC MM0,MM1 ;packed negative accumulation
MOVQ MM1,[_3DSTARTER_VALUE2]
PFPNACC MM0,MM1 ;packed positive-negative accumulation
MOVQ MM1,[_3DSTARTER_VALUE3]
PI2FW MM0,MM1 ;packed integer word to floating point
MOVQ MM0,[_3DPI_VALUE]
PSWAPD MM0,MM1 ;MM0 receives MM1 with its two dwords swapped
FEMMS
RET
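Some of the extended instructions are easier to follow with a small C model. This is a sketch only: the type mm_reg and the function names are hypothetical, PSWAPD is shown on floating point values although it simply moves dwords, and PF2IW is left out.

```c
#include <stdint.h>

typedef struct { float lo, hi; } mm_reg;   /* two packed floats (hypothetical type) */

/* PFNACC dest,src: low dword minus high dword of dest goes to the low
   result, low minus high of src goes to the high result. */
mm_reg pfnacc(mm_reg dest, mm_reg src)
{
    mm_reg r = { dest.lo - dest.hi, src.lo - src.hi };
    return r;
}

/* PFPNACC dest,src: low result as for PFNACC (a subtraction), but the
   high result is the sum of the two dwords of src. */
mm_reg pfpnacc(mm_reg dest, mm_reg src)
{
    mm_reg r = { dest.lo - dest.hi, src.lo + src.hi };
    return r;
}

/* PSWAPD dest,src: dest receives src with its two dwords exchanged. */
mm_reg pswapd(mm_reg src)
{
    mm_reg r = { src.hi, src.lo };
    return r;
}

/* PI2FW on one dword: only the low 16 bits are used, and they are
   treated as a signed word before being converted to float. */
float pi2fw(uint32_t dword)
{
    int32_t w = (int32_t)(dword & 0xFFFFu);
    if (w >= 0x8000)
        w -= 0x10000;                      /* sign-extend the 16-bit value */
    return (float)w;
}
```

One consequence worth noticing when single-stepping: the values 33333 and 66666 in _3DSTARTER_VALUE3 do not fit in a signed 16-bit word, so under this model PI2FW gives -32203.0 and 1130.0 rather than the original numbers.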