Comparing basic coding
Optimisation tests
   LOOP v DEC ECX, JZ ..
   REP v DEC ECX, JNZ ..
   Alignment of oft-called functions
   Alignment for memory reads, writes and compares
   Conditional jump branch hints
The first two tests compare the speed of different types of loops. Since you can't use a loop to do the tests themselves, prior to performing each test, Testbug creates an area of execute/write/read memory using the api VirtualAlloc. Testbug then writes the opcodes for the instruction into the memory area so created . This is written 500 times so that there are 500 instructions to be executed. Having made a note of the tick count using QueryPerformaceCounter, execution of the processor is then transferred to the beginning of the area of memory containing these instructions. The 500 instructions are therefore carried out. At the end of the 500 instructions the tick count is tested and displayed. Testbug then frees the memory which was used and moves on to the next test.
LOOP v DEC ECX, JZ ..
This test was inspired by Rick Booth's excellent book "Inner Loops" which studies optimisation of instructions in Intel processors. It was a great shock to learn that for some processors the "LOOP" mnemonic was much slower than the DEC ECX, JZ .. combination. I suspect that as processors develop the speed of the "LOOP" instruction will be restored to its former glory. This test shows just how much slower it is on the processor on your machine. On earlier processors, the LOOPZ and LOOPNZ instructions were even slower and in this test they are compared with their "expanded" versions that is, in GoAsm syntax where ecx=5000 decimal:-
Tested instruction
Compared with
L1:
LOOP L1
L1:
DEC ECX
JNZ L1
L1:
XOR EAX,EAX
LOOPZ L1
L1:
XOR EAX,EAX
JNZ >L2
DEC ECX
JNZ L1
INC ECX
DEC ECX
L2:
OR ECX,ECX
L1:
LOOPNZ L1
OR ECX,ECX
L1:
JZ >L2
DEC ECX
JNZ L1
INC ECX
DEC ECX
L2:
REP v DEC ECX, JNZ ..
Another favourite amongst assembler programmers is the REP STOS, REP MOV, and REP SCAS instructions and their variants. Rick Booth pointed out in his book that they could run very slow. These tests compare the speed of the basic mnemonic with their expanded versions that is, in GoAsm syntax where ecx=5000 decimal and edi points to somewhere in memory:-
Tested instruction
Compared with
REP STOSD
L1:
MOV [EDI],EAX
ADD EDI,4
DEC ECX
JNZ L1
REPNZ SCASB
L1:
CMP [EDI],AL
JZ >L2
INC EDI
DEC ECX
JNZ L1
L2:
REPNZ SCASD
L1:
CMP [EDI],EAX
JZ >L2
ADD EDI,4
DEC ECX
JNZ L1
L2:
REP MOVSD
L1:
MOV EAX,[ESI]
MOV [EDI],EAX
ADD ESI,4
ADD EDI,4
DEC ECX
JNZ L1
Alignment of oft-called functions
At one time I had a theory that functions which are often called (for example in a loop) ought to be aligned at least on a dword boundary. I established this test to see if this was right. So far it does not seem to be.
Here is the code used in these tests. The breakpoint is FALIGNMENT_TEST.
For each test this procedure
is called with the address of the appropriate function in edx, and
the number of iterations to do in ebx:-
ALIGNMENTTEST_SHELL: ;edx holds function to call
L1:
CALL EDX
DEC EBX
JNZ L1
RET
Here are the functions themselves, with their alignment adjusted using
the ALIGN directive and NOP (no operation byte):-
ALIGN 16
NOP
FPARAGRAPH_PLUS_1: ;aligned on paragraph + 1 byte
RET
;
ALIGN 16
NOP
NOP
FPARAGRAPH_PLUS_2: ;aligned on paragraph + 2 bytes
RET
;
ALIGN 16
NOP
NOP
NOP
FPARAGRAPH_PLUS_3: ;aligned on paragraph + 3 bytes
RET
;
ALIGN 16
FPARAGRAPH: ;aligned on paragraph
RET
;
ALIGN 16
NOP
NOP
NOP
NOP
FPARAGRAPH_PLUS_4: ;aligned on paragraph + 4 bytes
RET
;
ALIGN 16
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
FPARAGRAPH_PLUS_8: ;aligned on paragraph + 8 bytes
RET
;
ALIGN 16
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
NOP
FPARAGRAPH_PLUS_12: ;aligned on paragraph + 12 bytes
RET
Alignment for memory reads, writes and compares
We are always being told to ensure that our dword
memory is dword aligned. Some APIs, particularly when running under NT
2000, and XP will fail if memory is not aligned properly, so we know
that memory areas sent to an API (typically for example a structure)
must be on a dword boundary. Some instructions such as MOVDQA and MOVAPD
require that the data they address is aligned on a 16 byte boundary.
But what about ordinary assembler code?
This test demonstrates the severe penalty which can occur on a P4 if it
is asked to read from, write to or compare data which is not aligned
properly (I found from the write test that the speed could be reduced
by a staggering 36 times in some cases!).
I won't give the full coding here since it can be seen from the
debugger. The following instructions are tested in a loop for the
memory read test (breakpoint is MRALIGNMENT_TEST):-
BR11E DB 'BRANCHHINT_TEST',0
MOV EAX,[EDX] ;and
MOV EAX,[MEMORY_LABEL]
and for the memory write test (breakpoint is MWALIGNMENT_TEST):-
MOV [EDX],EAX ;and
MOV [MEMORY_LABEL],EAX
and for the memory compare test (breakpoint is MCALIGNMENT_TEST):-
CMP [EDX],EAX ;and
CMP [MEMORY_LABEL],EAX
In each case the address either in EDX or of the MEMORY_LABEL was
aligned in memory at dword+1, dword+2, dword+3, and at a dword
respectively.
Conditional jump branch hints
Using branch hints on modern processors allows us to
optimise our code. Let me explain a bit more about this before coming
on to the tests which prove the theory.
What is branch prediction?
Modern processors (for example the Pentium upwards) are able to run very
fast by predicting in advance the next instruction which will be executed
after a conditional jump instruction. For example:-
CMP EDX,EAX ;compare edx and eax
JZ >L1 ;branch to L1 if edx=eax
;lots of other instructions
L1:
The processor cannot be 100% sure in advance whether edx will equal eax
when line 1 is executed. It might predict this to be unlikely, however,
in which case it will set up the instructions just after the conditional
jump to be executed next. If this prediction is right, those instruction
will be executed immediately without any time loss. If the prediction
is wrong, there will be some time loss in switching to the correct
instruction (just after L1) instead.
Various branch prediction algorithms are used by the processor, including
ones which learn from their mistakes. One of the most basic predictions
used as a starting point assume that:-
All forward conditional jumps will not take place, and
All backwards conditional jumps (loops back) will take place.
It is documented that in normal code, backward conditional jumps tend to
take place 80% of the time, whereas forward conditional jumps will be
likely not to take place. It is said that overall, the default prediction
is correct 65% of the time.
So predicting whether or not the conditional jump will take place can
speed up the code particularly on a series of loops back.
As an assembler programmer, you can produce fast code knowing the default
prediction for the processor that you are programming for since you are
in complete control over your code.
What are branch hints?
In the P4 (and possibly in some earlier processors) you can
give the processor a "hint" as to whether or not the branch is likely
to occur.
This is done by using 2Eh and 3Eh branch hint bytes which are
inserted immediately before the conditional jump instruction. Respectively
they mean as follows:-
2Eh - hint that the branch will not occur most of the time.
3Eh - hint that the branch will occur most of the time.
Using branch hint 2Eh would be useful (for example) if you have a
backwards branch in a conditional jump which occurs only in the case of
an error. The processor would usually predict that the backwards branch
is likely to happen, thereby slowing down the code at that particular
point. Inserting 2Eh will stop the processor making that prediction.
Branch hint 3Eh would be useful if (unusually) you want to create a
loop where the conditional jump is at the beginning of a code fragment
and the destination of the jump is forward in the code. This loop would
normally run more slowly than a normal type of loop because of
the default prediction that the forward branch is unlikely to occur,
unless the processor is capable of learning from its prediction mistakes.
Inserting branch hints
Intel does not recommend any particular mnemonic to insert the branch
hint bytes. You could do this as follows:-
DB 2Eh ;or
DB 3Eh
If you prefer you can insert the branch hint bytes automatically.
GoAsm uses the following syntax to insert the branch hint bytes
suitable for the P4 processor:-
hint.nobranch ;insert 2Eh
hint.branch ;insert 3Eh
Going back to the first example, if edx normally does equal eax, then
this code will run faster on the P4 by adding a branch hint as follows:-
CMP EDX,EAX ;compare edx and eax
hint.branch JZ >L1 ;normally does branch to L1
;lots of other instructions
L1:
I found in my tests that whereas the branch hint made no difference
at all on a 486 processor, on a P4 the speed of the loop was increased
substantially (by 1.5 times) if the branch hint was given in a case
where the processor would have predicted wrongly. This was the case
in the unusual loops of the first and third tests. Where the processor
would predict correctly, as in the case of the typical loops in the
last four tests, the branch hint made no difference to speed.
The breakpoint is BRANCHHINT_TEST.
Here is the code used for the tests, where ebx holds the number
of iterations to do:-
ALIGN 4
UNUSUALBRFOR_HINT:
L1:
DEC EBX
hint.branch JNZ >L4 ;branch forward hint
RET
L4:
JMP L1
;
ALIGN 4
UNUSUALBRFOR_NOHINT:
L6:
DEC EBX
NOP ;ensure no alignment differences
JNZ >L8 ;branch forward no hint
RET
L8:
JMP L6
;
ALIGN 4
L10:
RET
UNUSUALBRBACK_HINT:
L12:
DEC EBX
hint.nobranch JZ L10 ;branch back hint
JMP L12
;
ALIGN 4
L14:
RET
UNUSUALBRBACK_NOHINT:
L16:
DEC EBX
NOP ;ensure no alignment differences
JZ L14 ;branch back no hint
JMP L16
;
ALIGN 4
TYPICALBRFOR_HINT:
L18:
DEC EBX
hint.nobranch JZ >L20 ;branch forward hint
JMP L18
L20:
RET
;
ALIGN 4
TYPICALBRFOR_NOHINT:
L22:
DEC EBX
NOP ;ensure no alignment differences
JZ >L24 ;branch forward no hint
JMP L22
L24:
RET
;
ALIGN 4
TYPICALBRBACK_HINT:
L26:
DEC EBX
hint.branch JNZ L26 ;branch back hint
RET
;
ALIGN 4
TYPICALBRBACK_NOHINT:
L28:
DEC EBX
NOP ;ensure no alignment differences
JNZ L28 ;branch back no hint
RET