"Use of mnemonics" demonstrations
XMM SSE2 floating point instructions

About the XMM SSE2 floating point instructions
These are SSE2 (Streaming SIMD Extensions 2) floating point instructions which use the 128-bit XMM registers and which can handle double-precision (64-bit) floating point values. There are also some instructions which work on single-precision (32-bit) floating point values. Support for SSE2 was introduced in the Pentium 4 and Xeon processors.
Generally the instructions are very similar to the SSE floating point instructions except for the data size that they work with.
Before using these instructions in your code you need to check whether they are available on the processor which is running your program. To do this, execute CPUID with EAX=1 and then test bit 26 of EDX. The bit will be set if the SSE2 instructions can be used.
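As a quick check of the mask used in the demos, bit 26 corresponds to the value 4000000h. The following is only a small Python model of the test on EDX, not GoAsm code; the name has_sse2 and the sample EDX values are hypothetical.

```python
# Model of the SSE2 feature test: CPUID with EAX=1 returns feature flags
# in EDX, and bit 26 of EDX reports SSE2 support.
SSE2_BIT = 26
SSE2_MASK = 1 << SSE2_BIT      # same value as 4000000h used with TEST EDX

def has_sse2(edx):
    """Return True if bit 26 is set in the given EDX feature-flag value."""
    return (edx & SSE2_MASK) != 0

print(hex(SSE2_MASK))          # 0x4000000
print(has_sse2(0x04000000))    # only bit 26 set (made-up EDX value)
print(has_sse2(0x03FFFFFF))    # all bits below 26 set, bit 26 clear
```

The zero/non-zero result of the AND is what TEST EDX,4000000h followed by JNZ acts on.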
In the tests the following data declarations are used:-

DOUBLEFP1 DQ 1.1
          DQ 3.3
DOUBLEFP2 DQ 20.66
          DQ 40.66
DOUBLEFPN DQ -5.1
          DQ +6.3
Since it is possible that these labels may not be on a 16-byte boundary, the MOVUPD instruction must be used to transfer the data from memory into an XMM register. MOVUPD (move two unaligned packed double-precision values) does not care about alignment. If you specify ALIGN 16 immediately before the data declaration, however, then the faster MOVAPD (move two aligned packed double-precision values) can be used instead. If you get this wrong (that is, if you use MOVAPD with a memory address which is not on a 16-byte boundary) your program will cause an exception. See more about this (in the case of MOVDQA and MOVDQU which work in the same way). When transferring between registers, either MOVUPD or MOVAPD may be used.
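The alignment rule can be stated as a one-line test on the address. This is a Python sketch only; the function names and the sample addresses are hypothetical and are not taken from the demo program.

```python
def is_aligned16(address):
    """True if the address is on a 16-byte boundary (low four bits are zero)."""
    return (address & 0xF) == 0

def packed_double_load(address):
    """Name the memory load which is safe to use for this address."""
    return "MOVAPD" if is_aligned16(address) else "MOVUPD"

# Made-up addresses for illustration
print(packed_double_load(0x403000))   # MOVAPD
print(packed_double_load(0x403008))   # MOVUPD
```

MOVUPD is safe at either address; the sketch only shows when the aligned form becomes an option.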
The instructions we are looking at here fall largely into two types. The first type of instruction deals with two 64-bit floating point numbers at once. These instructions have "PD" in their mnemonic name, referring to "packed double-precision". The second type of instruction deals with just one 64-bit floating point value. These instructions have "SD" in their mnemonic name, referring to "scalar double-precision". They work on the lowest part of the XMM register only, that is to say the first 64 bits of the register (bits 0 to 63).
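The difference between the "PD" and "SD" forms can be modelled in a few lines. This is a Python sketch, not GoAsm code: an XMM register is represented as a tuple (low, high), and the function names simply borrow the mnemonics. Python floats are IEEE 754 double-precision values, the same format these instructions work on.

```python
# An XMM register is modelled as a tuple (low, high) of two doubles.

def addpd(dest, src):
    """Packed add: both 64-bit values are added."""
    return (dest[0] + src[0], dest[1] + src[1])

def addsd(dest, src):
    """Scalar add: only the low value (bits 0 to 63) is added;
    the high value of the destination is kept as it was."""
    return (dest[0] + src[0], dest[1])

DOUBLEFP1 = (1.1, 3.3)        # same values as the data declarations
DOUBLEFP2 = (20.66, 40.66)

print(addpd(DOUBLEFP1, DOUBLEFP2))   # both values changed
print(addsd(DOUBLEFP1, DOUBLEFP2))   # high value is still 3.3
```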
To watch these tests properly you need to set the appropriate breakpoint, start the test and then single step through the instructions. You can then watch how they change the XMM registers. Using GoBug you can make the XMM registers appear in their floating point SSE2 format using the appropriate button on the toolbar.

SSE2 instructions:-
Data movement instructions
Arithmetic instructions
Logical instructions
Comparison instructions
Shuffle and unpack instructions
Conversion instructions

SSE2 Data movement instructions
This demonstrates moving data into the registers and between the registers. MOVUPD, MOVAPD (aligned version), MOVSD, MOVLPD and MOVHPD can also be used to move values from the registers out to memory. MOVMSKPD can be used after a comparison instruction to get the result of the compare into EAX for analysis.
As an experiment we also try the SSE2 integer instruction MOVDQU and the SSE floating point instruction MOVUPS to see if they do the same as MOVUPD. It seems they do, merely performing a bit transfer into the XMM register. However, Intel do warn against using different instructions from those specified, since this may have unstated performance implications.
The breakpoint is XMMSSE2_FPDATA:-

XMMSSE2_FPDATA:
MOV EAX,1               ;request CPU feature flags
CPUID                   ;0Fh, 0A2h CPUID instruction
TEST EDX,4000000h       ;test bit 26 (SSE2)
JNZ >L20                ;SSE2 available
CALL NOSSE2FPMESS       ;displays message if SSE2 not available
RET
L20:
;***** display XMM registers in SSE2 mode ..
MOVUPD XMM0,[DOUBLEFP1]      ;move two double precision fp values into XMM0
MOVAPD XMM7,XMM0             ;copying to XMM7
MOVSD  XMM2,[DOUBLEFP2]      ;move fp value to XMM2 low, clearing XMM2 high
MOVLPD XMM3,[DOUBLEFP2]      ;similar, but leaves XMM3 high unchanged
MOVHPD XMM4,[DOUBLEFP2]      ;but this moves the value into XMM4 high
MOVUPD XMM0,[DOUBLEFPN]      ;move two new values, one is negative
MOVMSKPD EAX,XMM0            ;get both sign bits in XMM0 into eax
;************ and as an experiment, see if this does the same as MOVUPD ..
MOVDQU XMM1,[DOUBLEFPN]      ;use integer instruction to transfer the bits
;************ and this too (one byte smaller) ..
MOVUPS XMM2,[DOUBLEFPN]      ;use SSE instruction to transfer the bits
RET
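
To know what to expect when single-stepping through the loads above, here is a Python model of the memory forms of MOVSD, MOVLPD and MOVHPD and of MOVMSKPD. It is a sketch only: a register is a tuple (low, high) of 64-bit patterns, "mem" stands for the two doubles found at the label, and the helper names are hypothetical.

```python
import struct

def bits(value):
    """64-bit pattern of a double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def movupd_load(mem):
    """Load both doubles at the address: (low, high) bit patterns."""
    return (bits(mem[0]), bits(mem[1]))

def movsd_load(dest, mem):
    """MOVSD from memory: low value loaded, high half cleared to zero."""
    return (bits(mem[0]), 0)

def movlpd_load(dest, mem):
    """MOVLPD: low value loaded, high half of dest left unchanged."""
    return (bits(mem[0]), dest[1])

def movhpd_load(dest, mem):
    """MOVHPD: the 64 bits at the address go into the high half;
    the low half of dest is left unchanged."""
    return (dest[0], bits(mem[0]))

def movmskpd(reg):
    """Sign bit of the low value goes to bit 0, sign bit of the high
    value goes to bit 1; all other bits of the result are zero."""
    return (reg[0] >> 63) | ((reg[1] >> 63) << 1)

DOUBLEFPN = (-5.1, 6.3)
print(movmskpd(movupd_load(DOUBLEFPN)))   # 1: only the low value is negative
```

Under this model EAX should hold 1 after the MOVMSKPD in the demo, because only the low value (-5.1) has its sign bit set.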

SSE2 Arithmetic instructions
This demonstrates the arithmetic instructions which can work in the XMM registers using double-precision (64-bit) numbers.
The breakpoint is XMMSSE2_FPARITH:-

XMMSSE2_FPARITH:
MOV EAX,1               ;request CPU feature flags
CPUID                   ;0Fh, 0A2h CPUID instruction
TEST EDX,4000000h       ;test bit 26 (SSE2)
JNZ >L22                ;SSE2 available
CALL NOSSE2FPMESS       ;displays message if SSE2 not available
RET
L22:
;***** display XMM registers in SSE2 mode ..
MOVUPD XMM0,[DOUBLEFP1] ;move two double precision fp values into XMM0
MOVAPD XMM2,XMM0        ;copying to XMM2
MOVUPD XMM1,[DOUBLEFP2] ;move 2nd tester fp values into XMM1
MOVAPD XMM3,XMM1        ;copying to XMM3
ADDPD  XMM0,XMM1        ;add both fp values result in XMM0
MOVAPD XMM0,XMM2        ;restore value in XMM0
SUBPD  XMM0,XMM1        ;subtract both fp values result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
ADDSD  XMM0,XMM1        ;add low fp value result in XMM0
SUBSD  XMM0,XMM1        ;subtract low fp value result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
MULPD  XMM0,XMM1        ;multiply both fp values result in XMM0 
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
MULSD  XMM0,XMM1        ;multiply low fp value result in XMM0
;*******                
MOVAPD XMM0,XMM2        ;restore value in XMM0
DIVPD  XMM0,XMM1        ;divide both fp values result in XMM0 
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
DIVSD  XMM0,XMM1        ;divide low fp value result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
SQRTPD XMM0,XMM1        ;get square roots of both fp values in XMM1 result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
SQRTSD XMM0,XMM1        ;get square root of low fp value in XMM1 result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
MAXPD XMM0,XMM1         ;get numerically greater fp values result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
MAXSD XMM0,XMM1         ;get numerically greater of low fp values result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
MINPD XMM0,XMM1         ;get numerically smaller fp values result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
MINSD XMM0,XMM1         ;get numerically smaller of low fp values result in XMM0
RET
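
Two points in the listing above are easy to miss: the square root instructions take the root of the source operand (XMM1 here), and MAX/MIN return the source operand whenever the comparison fails, which includes the case where either value is a NaN. The following Python sketch models this, with registers as (low, high) tuples; it is not GoAsm code. One difference to note: math.sqrt raises an error for a negative input, whereas the processor would return a NaN.

```python
import math

def sqrtpd(dest, src):
    """Square roots of both values in src; old dest contents are not used."""
    return (math.sqrt(src[0]), math.sqrt(src[1]))

def sqrtsd(dest, src):
    """Square root of the low value in src; high value of dest is kept."""
    return (math.sqrt(src[0]), dest[1])

def maxpd(dest, src):
    """dest value if it is greater, otherwise the src value."""
    return tuple(d if d > s else s for d, s in zip(dest, src))

def minpd(dest, src):
    """dest value if it is smaller, otherwise the src value."""
    return tuple(d if d < s else s for d, s in zip(dest, src))

XMM0 = (1.1, 3.3)          # DOUBLEFP1
XMM1 = (20.66, 40.66)      # DOUBLEFP2
print(maxpd(XMM0, XMM1))   # (20.66, 40.66)
print(minpd(XMM0, XMM1))   # (1.1, 3.3)
print(sqrtsd(XMM0, XMM1))  # low is the root of 20.66, high is still 3.3
```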

SSE2 Logical instructions
This demonstrates the logical instructions which can work in the XMM registers using double-precision (64-bit) numbers.
The breakpoint is XMMSSE2_FPLOGIC:-

XMMSSE2_FPLOGIC:
MOV EAX,1               ;request CPU feature flags
CPUID                   ;0Fh, 0A2h CPUID instruction
TEST EDX,4000000h       ;test bit 26 (SSE2)
JNZ >L24                ;SSE2 available
CALL NOSSE2FPMESS       ;displays message if SSE2 not available
RET
L24:
;***** display XMM registers in SSE2 mode ..
MOVUPD XMM0,[DOUBLEFP1] ;move two double precision fp values into XMM0
MOVAPD XMM2,XMM0        ;copying to XMM2
MOVUPD XMM1,[DOUBLEFP2] ;move 2nd tester fp values into XMM1
MOVAPD XMM3,XMM1        ;copying to XMM3
ANDPD  XMM0,XMM1        ;perform AND on both fp values result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
ANDNPD XMM0,XMM1        ;perform AND NOT (invert XMM0, then AND with XMM1) result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
ORPD   XMM0,XMM1        ;perform OR on both fp values result in XMM0
;*******
MOVAPD XMM0,XMM2        ;restore value in XMM0
XORPD  XMM0,XMM1        ;perform XOR on both fp values result in XMM0
RET
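
The logical instructions work on the raw bit patterns, so the results in the debugger rarely look like meaningful numbers. A common use is to act on the sign bit (bit 63 of each value). The Python sketch below models one 64-bit value at a time; the helper names and the sign-mask examples are illustrative assumptions and are not part of the demo.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF
SIGN_MASK = 0x8000000000000000      # only bit 63 set

def bits(value):
    """64-bit pattern of a double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def from_bits(pattern):
    """Double whose 64-bit pattern is the given unsigned integer."""
    return struct.unpack("<d", struct.pack("<Q", pattern))[0]

def andpd(dest, src):
    return dest & src

def andnpd(dest, src):
    """AND NOT: the destination is inverted first, then ANDed with src."""
    return (~dest & MASK64) & src

def orpd(dest, src):
    return dest | src

def xorpd(dest, src):
    return dest ^ src

# ANDNPD with the sign mask in the destination clears the sign bit of src
print(from_bits(andnpd(SIGN_MASK, bits(-5.1))))   # 5.1
# XORPD with the sign mask flips the sign bit
print(from_bits(xorpd(bits(6.3), SIGN_MASK)))     # -6.3
```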

SSE2 Comparison instructions
This demonstrates the comparison instructions which can work in the XMM registers using double-precision (64-bit) numbers.
You tell CMPPD and CMPSD what to do by specifying an immediate value in the third operand. It is not easy to remember which value does what, so some assemblers (including GoAsm) also provide pseudo mnemonics in the form recommended by Intel (given here in the comments). COMISD and UCOMISD are somewhat easier to use because they set the ordinary flags, although they only work on one floating point value in each XMM register (contained in bits 0-63).
The breakpoint is XMMSSE2_FPCOMP:-

XMMSSE2_FPCOMP:
MOV EAX,1               ;request CPU feature flags
CPUID                   ;0Fh, 0A2h CPUID instruction
TEST EDX,4000000h       ;test bit 26 (SSE2)
JNZ >L26                ;SSE2 available
CALL NOSSE2FPMESS       ;displays message if SSE2 not available
RET
L26:
;***** display XMM registers in SSE2 mode ..
MOVUPD XMM0,[DOUBLEFP1] ;move two double precision fp values into XMM0
MOVAPD XMM2,XMM0        ;copying to XMM2
MOVUPD XMM1,[DOUBLEFP2] ;move 2nd tester fp values into XMM1
MOVAPD XMM3,XMM1        ;copying to XMM3
;********************* compare instructions working on both fp values
CMPPD XMM0,XMM1,0       ;=CMPEQPD see whether equal, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPPD XMM0,XMM1,1       ;=CMPLTPD see whether less than, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPPD XMM0,XMM1,2       ;=CMPLEPD see whether less than or equal, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPPD XMM0,XMM1,3       ;=CMPUNORDPD see whether unordered, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPPD XMM0,XMM1,4       ;=CMPNEQPD see whether not equal, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPPD XMM0,XMM1,5       ;=CMPNLTPD see whether not less than, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPPD XMM0,XMM1,6       ;=CMPNLEPD see whether not less than or equal, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPPD XMM0,XMM1,7       ;=CMPORDPD see whether ordered, result in XMM0
;********************* compare instructions working on low value only
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPSD XMM0,XMM1,0       ;=CMPEQSD see whether equal, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPSD XMM0,XMM1,1       ;=CMPLTSD see whether less than, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPSD XMM0,XMM1,2       ;=CMPLESD see whether less than or equal, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPSD XMM0,XMM1,3       ;=CMPUNORDSD see whether unordered, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPSD XMM0,XMM1,4       ;=CMPNEQSD see whether not equal, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPSD XMM0,XMM1,5       ;=CMPNLTSD see whether not less than, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPSD XMM0,XMM1,6       ;=CMPNLESD see whether not less than or equal, result in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
CMPSD XMM0,XMM1,7       ;=CMPORDSD see whether ordered, result in XMM0
;********************* compare and give result in eflags
MOVAPD XMM0,XMM2        ;restore original value to XMM0
COMISD XMM0,XMM1        ;look at lowest only result in eflags
UCOMISD XMM0,XMM1       ;(unordered compare)
MOVUPD XMM1,[DOUBLEFPN] ;move one -ve and one +ve value into XMM1
COMISD XMM0,XMM1        ;look at lowest only - result in eflags
UCOMISD XMM0,XMM1       ;(unordered compare)
RET
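
The results to look for when stepping through the comparisons can be worked out with a small model. CMPPD writes a mask into each 64-bit half of the destination (all bits set if the test is true, all bits clear if not), while COMISD reports through ZF, PF and CF. This is a Python sketch with hypothetical function names; registers are (low, high) tuples of doubles and the immediate values 0 to 7 are those used in the listing.

```python
import math

ALL_ONES = 0xFFFFFFFFFFFFFFFF

def unordered(a, b):
    """True if either value is a NaN."""
    return math.isnan(a) or math.isnan(b)

PREDICATES = {
    0: lambda a, b: a == b,                # CMPEQ
    1: lambda a, b: a < b,                 # CMPLT
    2: lambda a, b: a <= b,                # CMPLE
    3: unordered,                          # CMPUNORD
    4: lambda a, b: not (a == b),          # CMPNEQ
    5: lambda a, b: not (a < b),           # CMPNLT
    6: lambda a, b: not (a <= b),          # CMPNLE
    7: lambda a, b: not unordered(a, b),   # CMPORD
}

def cmppd(dest, src, imm):
    """Mask for each half: all ones if the predicate holds, else zero."""
    test = PREDICATES[imm]
    return tuple(ALL_ONES if test(d, s) else 0 for d, s in zip(dest, src))

def comisd(a, b):
    """Flags after comparing the low values a (first operand) and b."""
    if unordered(a, b):
        return {"ZF": 1, "PF": 1, "CF": 1}
    if a > b:
        return {"ZF": 0, "PF": 0, "CF": 0}
    if a < b:
        return {"ZF": 0, "PF": 0, "CF": 1}
    return {"ZF": 1, "PF": 0, "CF": 0}

XMM0 = (1.1, 3.3)         # DOUBLEFP1
XMM1 = (20.66, 40.66)     # DOUBLEFP2
print(cmppd(XMM0, XMM1, 1))     # both masks all ones: 1.1<20.66, 3.3<40.66
print(comisd(XMM0[0], XMM1[0])) # CF set: 1.1 is below 20.66
```

CMPSD applies the same predicates to the low value only. Because the flags follow the unsigned pattern, the usual follow-up jumps after COMISD are JA, JB and JE rather than the signed JG and JL.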

SSE2 Shuffle and unpack instructions
With these instructions you can move the double-precision (64-bit) floating point values around the XMM registers.
The breakpoint is XMMSSE2_SHUFF:-

XMMSSE2_SHUFF:
MOV EAX,1               ;request CPU feature flags
CPUID                   ;0Fh, 0A2h CPUID instruction
TEST EDX,4000000h       ;test bit 26 (SSE2)
JNZ >L28                ;SSE2 available
CALL NOSSE2FPMESS       ;displays message if SSE2 not available
RET
L28:
;***** display XMM registers in SSE2 mode ..
MOVUPD XMM0,[DOUBLEFP1] ;move two double precision fp values into XMM0
MOVAPD XMM2,XMM0        ;copying to XMM2
MOVUPD XMM1,[DOUBLEFP2] ;move 2nd tester fp values into XMM1
MOVAPD XMM3,XMM1        ;copying to XMM3
SHUFPD XMM0,XMM1,3h     ;shuffle: XMM0 high to XMM0 low, XMM1 high to XMM0 high
SHUFPD XMM0,XMM0,1h     ;swap the values in XMM0
MOVAPD XMM0,XMM2        ;restore original value to XMM0
UNPCKHPD XMM0,XMM1      ;unpack (high): XMM0 high to XMM0 low, XMM1 high to XMM0 high
MOVAPD XMM0,XMM2        ;restore original value to XMM0
UNPCKLPD XMM0,XMM0      ;unpack (low): XMM0 low is copied into XMM0 high
RET
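
The selection rules are compact enough to model directly. In SHUFPD, bit 0 of the immediate picks which value of the destination goes to the low half of the result, and bit 1 picks which value of the source goes to the high half. This is a Python sketch with registers as (low, high) tuples, following the sequence in the listing; it is not GoAsm code.

```python
def shufpd(dest, src, imm):
    """Low result comes from dest (bit 0 of imm picks low or high);
    high result comes from src (bit 1 of imm picks low or high)."""
    low = dest[1] if imm & 1 else dest[0]
    high = src[1] if imm & 2 else src[0]
    return (low, high)

def unpckhpd(dest, src):
    """High value of dest goes to low, high value of src goes to high."""
    return (dest[1], src[1])

def unpcklpd(dest, src):
    """Low value of dest stays in low, low value of src goes to high."""
    return (dest[0], src[0])

XMM0 = (1.1, 3.3)         # DOUBLEFP1 as (low, high)
XMM1 = (20.66, 40.66)     # DOUBLEFP2 as (low, high)

step1 = shufpd(XMM0, XMM1, 3)
print(step1)                      # (3.3, 40.66)
print(shufpd(step1, step1, 1))    # (40.66, 3.3), the two values swapped
print(unpckhpd(XMM0, XMM1))       # (3.3, 40.66)
print(unpcklpd(XMM0, XMM0))       # (1.1, 1.1)
```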

SSE2 Conversion instructions
These instructions convert between integers, single-precision and double-precision floating point values. They should be read together with the SSE conversion instructions.
The breakpoint is XMMSSE2_CONV:-

XMMSSE2_CONV:
MOV EAX,1               ;request CPU feature flags
CPUID                   ;0Fh, 0A2h CPUID instruction
TEST EDX,4000000h       ;test bit 26 (SSE2)
JNZ >L30                ;SSE2 available
CALL NOSSE2FPMESS       ;displays message if SSE2 not available
RET
L30:
;***** display XMM registers in both SSE and SSE2 modes ..
;***** conversion between single and double-precision fp values ..
CVTPS2PD XMM0,[SINGLEFP1]  ;put single-precision fp values into XMM0 as double-precision
CVTPD2PS XMM6,XMM0         ;convert double precision to single precision in XMM6
CVTSS2SD XMM1,[SINGLEFP1]  ;as CVTPS2PD but working with only one value
CVTSD2SS XMM7,XMM1         ;as CVTPD2PS but working with only one value
;***** conversion between integers and double-precision fp values ..
;***** open the MMX integer pane for these tests ..
CVTPD2PI MM0,XMM0          ;convert fp values in XMM0 to integers in MM0
CVTTPD2PI MM1,XMM0         ;same as above with truncation
CVTPI2PD XMM0,[DINTEGER]   ;convert 23 and 24 to double-precision fp values
;***** open the XMM integer display and switch to dword display
CVTPD2DQ XMM7,XMM0         ;and convert 23 and 24 to dword integers into XMM7 (low)
CVTTPD2DQ XMM7,XMM0        ;same as above with truncation
CVTDQ2PD XMM3,XMM7         ;and back into fp values in XMM3
CVTSD2SI EAX,XMM0          ;take low fp value and convert as integer in EAX
CVTTSD2SI EDX,XMM0         ;same as above with truncation
CVTSI2SD XMM4,EAX          ;and back again into XMM4 (low)
;***** conversion between single-precision and integers ..
;***** watch these in XMM integer display switched to dword display
CVTPS2DQ XMM0,[SINGLEFP1]  ;move 4 single-precision fp values to dwords as integers
CVTTPS2DQ XMM1,[SINGLEFP1] ;same as above with truncation
;***** and watch this in the SSE fp pane ..
CVTDQ2PS XMM6,XMM0         ;and convert back to 4 single-precision fp values
CVTDQ2PS XMM7,XMM1         ;ditto
RET
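
The difference between the ordinary and the "with truncation" conversions only shows up with fractional values. The ordinary forms round according to the rounding mode held in MXCSR, while the CVTT forms always chop towards zero. The Python sketch below models CVTSD2SI and CVTTSD2SI with a 32-bit destination; it assumes MXCSR is at its default setting (round to nearest, ties to even), and the helper names are hypothetical.

```python
import math

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1
INDEFINITE = INT32_MIN     # bit pattern 80000000h, returned when the
                           # value cannot be represented

def _to_int32(value, rounder):
    """Apply the rounding function, then check the 32-bit range."""
    if math.isnan(value) or math.isinf(value):
        return INDEFINITE
    n = rounder(value)
    return n if INT32_MIN <= n <= INT32_MAX else INDEFINITE

def cvtsd2si(value):
    """Round to nearest, ties to even (assumed default MXCSR mode)."""
    return _to_int32(value, round)

def cvttsd2si(value):
    """Truncate towards zero, whatever the MXCSR rounding mode."""
    return _to_int32(value, math.trunc)

print(cvtsd2si(2.5), cvtsd2si(3.5))      # 2 4
print(cvttsd2si(3.99), cvttsd2si(-3.99)) # 3 -3
print(cvtsd2si(23.0), cvttsd2si(23.0))   # 23 23
```

With whole numbers such as 23 and 24 both forms give the same answer, so a fractional test value is needed to see the two behaviours diverge in the debugger.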