Testbug's speed tests

Comparing basic coding

Optimisation tests
   LOOP v DEC ECX, JZ ..
   REP v DEC ECX, JNZ ..
   Alignment of oft-called functions
   Alignment for memory reads, writes and compares
   Conditional jump branch hints

The first two tests compare the speed of different types of loops. Since you can't use a loop to do the tests themselves, prior to performing each test, Testbug creates an area of execute/write/read memory using the api VirtualAlloc. Testbug then writes the opcodes for the instruction into the memory area so created . This is written 500 times so that there are 500 instructions to be executed. Having made a note of the tick count using QueryPerformaceCounter, execution of the processor is then transferred to the beginning of the area of memory containing these instructions. The 500 instructions are therefore carried out. At the end of the 500 instructions the tick count is tested and displayed. Testbug then frees the memory which was used and moves on to the next test.

LOOP v DEC ECX, JZ ..
This test was inspired by Rick Booth's excellent book "Inner Loops" which studies optimisation of instructions in Intel processors. It was a great shock to learn that for some processors the "LOOP" mnemonic was much slower than the DEC ECX, JZ .. combination. I suspect that as processors develop the speed of the "LOOP" instruction will be restored to its former glory. This test shows just how much slower it is on the processor on your machine. On earlier processors, the LOOPZ and LOOPNZ instructions were even slower and in this test they are compared with their "expanded" versions that is, in GoAsm syntax where ecx=5000 decimal:-

Tested instruction Compared with

L1:
LOOP L1

L1:
DEC ECX
JNZ L1

L1:
XOR EAX,EAX
LOOPZ L1

L1:
XOR EAX,EAX
JNZ >L2
DEC ECX
JNZ L1
INC ECX
DEC ECX
L2:

OR ECX,ECX
L1:
LOOPNZ L1

OR ECX,ECX
L1:
JZ >L2
DEC ECX
JNZ L1
INC ECX
DEC ECX
L2:

REP v DEC ECX, JNZ ..
Another favourite amongst assembler programmers is the REP STOS, REP MOV, and REP SCAS instructions and their variants. Rick Booth pointed out in his book that they could run very slow. These tests compare the speed of the basic mnemonic with their expanded versions that is, in GoAsm syntax where ecx=5000 decimal and edi points to somewhere in memory:-

Tested instruction Compared with

REP STOSD

L1:
MOV [EDI],EAX
ADD EDI,4
DEC ECX
JNZ L1

REPNZ SCASB

L1:
CMP [EDI],AL
JZ >L2
INC EDI
DEC ECX
JNZ L1
L2:

REPNZ SCASD

L1:
CMP [EDI],EAX
JZ >L2
ADD EDI,4
DEC ECX
JNZ L1
L2:

REP MOVSD

L1:
MOV EAX,[ESI]
MOV [EDI],EAX
ADD ESI,4
ADD EDI,4
DEC ECX
JNZ L1

Alignment of oft-called functions
At one time I had a theory that functions which are often called (for example in a loop) ought to be aligned at least on a dword boundary. I established this test to see if this was right. So far it does not seem to be.
Here is the code used in these tests. The breakpoint is FALIGNMENT_TEST. For each test this procedure is called with the address of the appropriate function in edx, and the number of iterations to do in ebx:-

ALIGNMENTTEST_SHELL:     ;edx holds function to call
L1:
CALL EDX
DEC EBX
JNZ L1
RET
Here are the functions themselves, with their alignment adjusted using the ALIGN directive and NOP (no operation byte):-
ALIGN 16
NOP
FPARAGRAPH_PLUS_1:       ;aligned on paragraph + 1 byte
RET
;
ALIGN 16
NOP
NOP
FPARAGRAPH_PLUS_2:       ;aligned on paragraph + 2 bytes
RET
;
ALIGN 16
NOP
NOP
NOP
FPARAGRAPH_PLUS_3:       ;aligned on paragraph + 3 bytes
RET
;
ALIGN 16
FPARAGRAPH:              ;aligned on paragraph
RET
;
ALIGN 16
NOP
NOP
NOP
NOP
FPARAGRAPH_PLUS_4:       ;aligned on paragraph + 4 bytes
RET
;
ALIGN 16
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
FPARAGRAPH_PLUS_8:       ;aligned on paragraph + 8 bytes
RET
;
ALIGN 16
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
FPARAGRAPH_PLUS_12:      ;aligned on paragraph + 12 bytes
RET

Alignment for memory reads, writes and compares
We are always being told to ensure that our dword memory is dword aligned. Some APIs, particularly when running under NT 2000, and XP will fail if memory is not aligned properly, so we know that memory areas sent to an API (typically for example a structure) must be on a dword boundary. Some instructions such as MOVDQA and MOVAPD require that the data they address is aligned on a 16 byte boundary. But what about ordinary assembler code? This test demonstrates the severe penalty which can occur on a P4 if it is asked to read from, write to or compare data which is not aligned properly (I found from the write test that the speed could be reduced by a staggering 36 times in some cases!).
I won't give the full coding here since it can be seen from the debugger. The following instructions are tested in a loop for the memory read test (breakpoint is MRALIGNMENT_TEST):- BR11E DB 'BRANCHHINT_TEST',0

MOV EAX,[EDX]                 ;and
MOV EAX,[MEMORY_LABEL]
and for the memory write test (breakpoint is MWALIGNMENT_TEST):-
MOV [EDX],EAX                 ;and
MOV [MEMORY_LABEL],EAX
and for the memory compare test (breakpoint is MCALIGNMENT_TEST):-
CMP [EDX],EAX                 ;and
CMP [MEMORY_LABEL],EAX
In each case the address either in EDX or of the MEMORY_LABEL was aligned in memory at dword+1, dword+2, dword+3, and at a dword respectively.

Conditional jump branch hints
Using branch hints on modern processors allows us to optimise our code. Let me explain a bit more about this before coming on to the tests which prove the theory.

What is branch prediction?

Modern processors (for example the Pentium upwards) are able to run very fast by predicting in advance the next instruction which will be executed after a conditional jump instruction. For example:-
CMP EDX,EAX       ;compare edx and eax
JZ >L1            ;branch to L1 if edx=eax
                  ;lots of other instructions
L1:
The processor cannot be 100% sure in advance whether edx will equal eax when line 1 is executed. It might predict this to be unlikely, however, in which case it will set up the instructions just after the conditional jump to be executed next. If this prediction is right, those instruction will be executed immediately without any time loss. If the prediction is wrong, there will be some time loss in switching to the correct instruction (just after L1) instead. Various branch prediction algorithms are used by the processor, including ones which learn from their mistakes. One of the most basic predictions used as a starting point assume that:-
All forward conditional jumps will not take place, and
All backwards conditional jumps (loops back) will take place.
It is documented that in normal code, backward conditional jumps tend to take place 80% of the time, whereas forward conditional jumps will be likely not to take place. It is said that overall, the default prediction is correct 65% of the time. So predicting whether or not the conditional jump will take place can speed up the code particularly on a series of loops back. As an assembler programmer, you can produce fast code knowing the default prediction for the processor that you are programming for since you are in complete control over your code.

What are branch hints?

In the P4 (and possibly in some earlier processors) you can give the processor a "hint" as to whether or not the branch is likely to occur.
This is done by using 2Eh and 3Eh branch hint bytes which are inserted immediately before the conditional jump instruction. Respectively they mean as follows:-
2Eh - hint that the branch will not occur most of the time.
3Eh - hint that the branch will occur most of the time.
Using branch hint 2Eh would be useful (for example) if you have a backwards branch in a conditional jump which occurs only in the case of an error. The processor would usually predict that the backwards branch is likely to happen, thereby slowing down the code at that particular point. Inserting 2Eh will stop the processor making that prediction.
Branch hint 3Eh would be useful if (unusually) you want to create a loop where the conditional jump is at the beginning of a code fragment and the destination of the jump is forward in the code. This loop would normally run more slowly than a normal type of loop because of the default prediction that the forward branch is unlikely to occur, unless the processor is capable of learning from its prediction mistakes.

Inserting branch hints

Intel does not recommend any particular mnemonic to insert the branch hint bytes. You could do this as follows:-
DB 2Eh  ;or
DB 3Eh
If you prefer you can insert the branch hint bytes automatically.
GoAsm uses the following syntax to insert the branch hint bytes suitable for the P4 processor:-
hint.nobranch           ;insert 2Eh
hint.branch             ;insert 3Eh
Going back to the first example, if edx normally does equal eax, then this code will run faster on the P4 by adding a branch hint as follows:-
CMP EDX,EAX             ;compare edx and eax
hint.branch JZ >L1      ;normally does branch to L1
                        ;lots of other instructions
L1:
I found in my tests that whereas the branch hint made no difference at all on a 486 processor, on a P4 the speed of the loop was increased substantially (by 1.5 times) if the branch hint was given in a case where the processor would have predicted wrongly. This was the case in the unusual loops of the first and third tests. Where the processor would predict correctly, as in the case of the typical loops in the last four tests, the branch hint made no difference to speed.
The breakpoint is BRANCHHINT_TEST. Here is the code used for the tests, where ebx holds the number of iterations to do:-
ALIGN 4
UNUSUALBRFOR_HINT:
L1:
DEC EBX
hint.branch JNZ >L4     ;branch forward hint
RET
L4:
JMP L1
;
ALIGN 4
UNUSUALBRFOR_NOHINT:
L6:
DEC EBX
NOP                     ;ensure no alignment differences
JNZ >L8                 ;branch forward no hint
RET
L8:
JMP L6
;
ALIGN 4
L10:
RET
UNUSUALBRBACK_HINT:
L12:
DEC EBX
hint.nobranch JZ L10    ;branch back hint
JMP L12
;
ALIGN 4
L14:
RET
UNUSUALBRBACK_NOHINT:
L16:
DEC EBX
NOP                     ;ensure no alignment differences
JZ L14                  ;branch back no hint
JMP L16
;
ALIGN 4
TYPICALBRFOR_HINT:
L18:
DEC EBX
hint.nobranch JZ >L20   ;branch forward hint
JMP L18
L20:
RET
;
ALIGN 4
TYPICALBRFOR_NOHINT:
L22:
DEC EBX
NOP                     ;ensure no alignment differences
JZ >L24                 ;branch forward no hint
JMP L22
L24:
RET
;
ALIGN 4
TYPICALBRBACK_HINT:
L26:
DEC EBX
hint.branch JNZ L26     ;branch back hint
RET
;
ALIGN 4
TYPICALBRBACK_NOHINT:
L28:
DEC EBX
NOP                     ;ensure no alignment differences
JNZ L28                 ;branch back no hint
RET