Table of
Contents:
choosing the right
options
loop unrolling
understanding
loop unrolling
vectorization
understanding
vectorization
results of
optimization and information
Unless
you develop your code to
drive a washing machine, speed is one of the most important issues in
DSP application. To achieve high performance, one may wait for 18-24
month for new generation of microprocessors to double the speed of the
currently available CPUs. If you intend to follow this approach - read
no more from this article (or, maybe, read it - will help to kill an
hour from the long wait).
The
other approach is to use
advanced tools that optimize your code. In this our StarCore Optimizer
(sco)
may help a lot.
sco is
a software package
specifically designed to optimize StarCore code by taking full
advantage of the options and features provided by the StarCore DSP.
This
document explains some of the
optimizations, provided by sco, and gives advise on the best
way to utilize them.
sco
utilizes incredible number of
different optimizations and algorithms to speed up the code. Reviewing
all of them will exceed the amount of space allowed for this article
and therefore we will omit it. Most of the optimization options are
enabled by default but many may be disabled using appropriate command
line or source code option (see the documentation for details).
The
special optimizations provided
by sco are loop unrolling and vectorization.
Many
applications spend bulk of
their time inside inner loop(s) thus making it extremely important to
optimize code of the inner loop. One of the efficient ways to optimize
loops is to perform loop unrolling - to duplicate the body of a loop
specified number of times (sco parameter '-lu #'). Generating
more instructions in the body of the loop, generally, creates more
opportunities for generating multinstuctions. However, before using
this powerful optimization, it is important to understand what exactly
sco
perform.
The
standard single precision dot
product loop:
for
( i = 0; i < n; i++ )
{
s += x[i]*y[i];
}
is
translated by the compiler (
maximally optimized ) into:
doen3
d7
loopstart3
PL001
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
[
impyuu d8,d9,d12
iadd d10,d0
impysu d8,d9,d10
]
imacus d8,d9,d10
asll #<16,d10
loopend3
which
sco, when unrolled
by 3, transfer to:
PL001:
bmclr
#$4000,sr.h
move.w #-3,d2
add d2,d7,d4
move.w #-1,d3
sub d3,d4,d4
tstgt d4
[
move.w #0,d1
jfd __dco_lu_0
]
iadd d7,d1
move.l d4,d1
__dco_lu_2:
[
move.l (r1)+,d4
move.l (r0)+,d11
iadd d12,d10
iadd d2,d1
]
[
impysu d11,d4,d5
move.l (r1)+,d9
move.l (r0)+,d8
iadd d10,d0
]
[
imacus d11,d4,d5
impyuu d11,d4,d13
move.l (r1)+,d14
]
[
impysu d8,d9,d4
move.l (r0)+,d15
asll #16,d5
tstgt d1
]
[
imacus d8,d9,d4
impysu d15,d14,d10
iadd d13,d5
]
[
asll #16,d4
impyuu d8,d9,d11
imacus d15,d14,d10
]
[
iadd d11,d4
iadd d5,d0
jtd __dco_lu_2
]
[
asll #16,d10
impyuu d15,d14,d12
iadd d4,d0
]
sub d2,d1,d1
iadd d3,d1
tstgt d1
jf __dco_lu_1
__dco_lu_0:
__dco_lu_3:
[
move.l (r1)+,d9
move.l (r0)+,d8
iadd d12,d10
]
iadd d3,d1
[
iadd d10,d0
impysu d8,d9,d10
impyuu d8,d9,d12
]
tstgt d1
imacus d8,d9,d10
jtd __dco_lu_3
asll #16,d10
__dco_lu_1:
Close
examination of this code
reveals that it consists of 5 different units:
1.
Check
if loop is executed for at least 3 iterations and therefore may be
unrolled by 3:
move.w
#-3,d2
add d2,d7,d4
move.w #-1,d3
sub d3,d4,d4
tstgt d4
[
move.w #0,d1
jfd __dco_lu_0
]
iadd d7,d1
move.l d4,d1
2.
Unrolled
loop setup:
bmclr
#$4000,sr.h
3.
Unrolled
loop body:
__dco_lu_2:
[
move.l (r1)+,d4
move.l (r0)+,d11
iadd d12,d10
iadd d2,d1
]
[
impysu d11,d4,d5
move.l (r1)+,d9
move.l (r0)+,d8
iadd d10,d0
]
[
imacus d11,d4,d5
impyuu d11,d4,d13
move.l (r1)+,d14
]
[
impysu d8,d9,d4
move.l (r0)+,d15
asll #16,d5
tstgt d1
]
[
imacus d8,d9,d4
impysu d15,d14,d10
iadd d13,d5
]
[
asll #16,d4
impyuu d8,d9,d11
imacus d15,d14,d10
]
[
iadd d11,d4
iadd d5,d0
jtd __dco_lu_2
]
[
asll #16,d10
impyuu d15,d14,d12
iadd d4,d0
]
4.
Check
if there are iterations left to be performed and if there are not -
jump to exit from the unrolled loop; otherwise - execute the code that
executes original ( not unrolled ) loop code ( see unit 5 ).
sub
d2,d1,d1
iadd d3,d1
tstgt d1
jf __dco_lu_1
5.
Execute
the original loop body:
__dco_lu_0:
__dco_lu_3:
[
move.l (r1)+,d9
move.l (r0)+,d8
iadd d12,d10
]
iadd d3,d1
[
iadd d10,d0
impysu d8,d9,d10
impyuu d8,d9,d12
]
tstgt d1
imacus d8,d9,d10
jtd __dco_lu_3
asll #16,d10
This
code is executed when it is
determined that loop doesn't have enough iterations to be unrolled (see
unit 1) or when there are iterations
left after unrolled loop is executed (see unit 4).
Note,
that when unrolling count is
power of 2, sco generates more efficient code. For example,
when unrolling by 4, the generated code ( for the loop body ) looks as
following:
falign
loopstart3
__dco_lu_1:
[
move.l (r1)+,d3
move.l (r0)+,d15
iadd d12,d10
]
[
impysu d15,d3,d4
move.l (r1)+,d2
move.l (r0)+,d5
iadd d10,d0
]
[
imacus d15,d3,d4
move.l (r1)+,d9
move.l (r0)+,d8
]
[
impyuu d15,d3,d13
impysu d5,d2,d3
asll #16,d4
]
[
imacus d5,d2,d3
impyuu d5,d2,d11
move.l (r1)+,d14
]
[
impysu d8,d9,d2
move.l (r0)+,d15
asll #16,d3
iadd d13,d4
]
[
imacus d8,d9,d2
impysu d15,d14,d10
iadd d11,d3
]
[
asll #16,d2
impyuu d8,d9,d5
iadd d4,d0
]
[
imacus d15,d14,d10
iadd d5,d2
iadd d3,d0
]
[
asll #16,d10
iadd d2,d0
impyuu d15,d14,d12
]
loopend3
All of
the above leads to the
following guidelines for using sco's loop unroller:
- Do not unroll loops that has only few iterations to execute
or unroll by the count significantly less then number of iterations.
- Use power of 2 as loop unrolling count and/or choose the
count that divides number of loop iterations.
- Do not use loop unrolling if it is important to minimize
code size.
Vectorization
(sco
parameter '-v') is another powerful optimization of the inner loop
body. It overlaps epilog code execution for current loop iteration with
prolog code execution for the next loop iteration. Essentially, the two
consecutive loop iterations are fitted in the block of the size of the
loop iteration (of size n). m instruction are chosen from the bottom of
the first iteration and combined with n - m instructions from the top
of the second iteration. The generated m + ( n - m ) = n instructions
are optimized. This is done for all m from 1 to n-1 and the best
resulting code is chosen. Of course, all necessary checkups to preserve
logic of the code are performed by sco and resulting code.
As an
example, consider again the
kernel of the dot product
for
( i = 0; i < n; i++ )
{
s += x[i]*y[i];
}
unrolled
by 2:
doen3
d7
loopstart3
PL001
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
[
impyuu d8,d9,d12
iadd d10,d0
impysu d8,d9,d10
]
imacus d8,d9,d10
asll #<16,d10
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
[
impyuu d8,d9,d12
iadd d10,d0
impysu d8,d9,d10
]
imacus d8,d9,d10
asll #<16,d10
loopend3
Running
vectorization on this code
( sco -v ) produces the following result:
move.l
d7,d1
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
deceq d1
jfd __dco_vr_0
[
iadd d10,d0
move.l (r0)+,d4
move.l (r1)+,d3
]
[
impysu d8,d9,d1
impysu d4,d3,d10
]
[
imacus d8,d9,d1
impyuu d8,d9,d2
]
asll #16,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d2,d1
asll #16,d10
iadd d1,d0
bmclr #$4000,sr.h
jmp __dco_vr_1
__dco_vr_0:
doen3
d1
[
impysu d8,d9,d1
impyuu d8,d9,d2
]
[
imacus d8,d9,d1
impysu d4,d3,d10
]
dosetup3 __dco_vr_2
asll #16,d1
falign
loopstart3
__dco_vr_2:
[
imacus d4,d3,d10
move.l (r0)+,d8
move.l (r1)+,d9
iadd d2,d1
]
[
impyuu d4,d3,d12
asll #16,d10
iadd d1,d0
move.l (r0)+,d4
]
[
impysu d8,d9,d1
iadd d12,d10
move.l (r1)+,d3
]
[
imacus d8,d9,d1
iadd d10,d0
impyuu d8,d9,d2
]
[
asll #16,d1
impysu d4,d3,d10
]
loopend3
iadd d2,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d1,d0
asll #16,d10
__dco_vr_1:
In this
case, the loop body is:
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
[
impyuu d8,d9,d12
iadd d10,d0
impysu d8,d9,d10
]
imacus d8,d9,d10
asll #<16,d10
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
[
impyuu d8,d9,d12
iadd d10,d0
impysu d8,d9,d10
]
imacus d8,d9,d10
asll #<16,d10
and m
instructions
iadd
d2,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d1,d0
asll #16,d10
are
overlapped with n-m
instructions
[
iadd
d12,d10
move.l
(r0)+,d8
move.l
(r1)+,d9
]
[
impyuu
d8,d9,d12
iadd
d10,d0
impysu
d8,d9,d10
]
imacus
d8,d9,d10
asll
#<16,d10
[
move.l
(r0)+,d8
move.l
(r1)+,d9
]
[
impysu
d8,d9,d10
]
Again,
as with loop unrolling,
this code consist of number different units:
1.
Check
if loop is executed for at least 2 iterations (jump to __dco_vr_0) and
if not - perform one iteration
and exit iterations (jump to __dco_vr_1):
move.l
d7,d1
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
deceq d1
jfd __dco_vr_0
[
iadd d10,d0
move.l (r0)+,d4
move.l (r1)+,d3
]
[
impysu d8,d9,d1
impysu d4,d3,d10
]
[
imacus d8,d9,d1
impyuu d8,d9,d2
]
asll #16,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d2,d1
asll #16,d10
iadd d1,d0
bmclr #$4000,sr.h
jmp __dco_vr_1
2.
Execute
NM instructions from the top of the first iteration that were not
chosen to be combined with instructions of second iteration:
__dco_vr_0:
doen3 d1
[
impysu d8,d9,d1
impyuu d8,d9,d2
]
[
imacus d8,d9,d1
impysu d4,d3,d10
]
dosetup3 __dco_vr_2
asll #16,d1
3.
Execute
vectorized loop:
falign
loopstart3
__dco_vr_2:
[
imacus d4,d3,d10
move.l (r0)+,d8
move.l (r1)+,d9
iadd d2,d1
]
[
impyuu d4,d3,d12
asll #16,d10
iadd d1,d0
move.l (r0)+,d4
]
[
impysu d8,d9,d1
iadd d12,d10
move.l (r1)+,d3
]
[
imacus d8,d9,d1
iadd d10,d0
impyuu d8,d9,d2
]
[
asll #16,d1
impysu d4,d3,d10
]
loopend3
4.
Execute
m instructions from the button of the second iteration that were not
chosen to be combined with instructions of first iteration:
iadd
d2,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d1,d0
asll #16,d10
__dco_vr_1:
So the
guidelines for using sco's
vectorizer will be:
- Do not attempt to vectorize code
that has only few iterations to execute.
- Use loop unrolling if there are
too few instructions in the body of a loop being vectorized.
- Beware that vectorization of large
inner loop may consume great amount of CPU time.
The
code patterns studied above
(dot-product) show the following improvements achieved over the code
fully
optimized by the ccsc100 compiler:
Code fully
optimized by the compiler
|
4 clocks/point
|
Loop unrolled by 3
|
3.3 clocks/point, 18% improvement
|
Loop unrolled by 4
|
2.5 clocks/point, 38% improvement
|
Loop unrolled by
4 and vectorized
|
2.25 clocks/point, 44% improvement
|
The
similar or better code
improvements were achieved for many others benchmarks and kernels -
click here for more information.
|