Table of Contents:

choosing the right options
loop unrolling
understanding loop unrolling
vectorization
understanding vectorization
results of optimization and information

Unless you develop your code to drive a washing machine, speed is one of the most important issues in DSP application. To achieve high performance, one may wait for 18-24 month for new generation of microprocessors to double the speed of the currently available CPUs. If you intend to follow this approach - read no more from this article (or, maybe, read it - will help to kill an hour from the long wait).

The other approach is to use advanced tools that optimize your code. In this our StarCore Optimizer (sco) may help a lot.

sco is a software package specifically designed to optimize StarCore code by taking full advantage of the options and features provided by the StarCore DSP.

This document explains some of the optimizations, provided by sco, and gives advise on the best way to utilize them.

choosing the right options

sco utilizes incredible number of different optimizations and algorithms to speed up the code. Reviewing all of them will exceed the amount of space allowed for this article and therefore we will omit it. Most of the optimization options are enabled by default but many may be disabled using appropriate command line or source code option (see the documentation for details).

The special optimizations provided by sco are loop unrolling and vectorization.

Loop unrolling

Many applications spend bulk of their time inside inner loop(s) thus making it extremely important to optimize code of the inner loop. One of the efficient ways to optimize loops is to perform loop unrolling - to duplicate the body of a loop specified number of times (sco parameter '-lu #'). Generating more instructions in the body of the loop, generally, creates more opportunities for generating multinstuctions. However, before using this powerful optimization, it is important to understand what exactly sco perform.

Understanding Loop unrolling.

The standard single precision dot product loop:

for ( i = 0; i < n; i++ )
  {
    s += x[i]*y[i];
  }

is translated by the compiler ( maximally optimized ) into:

doen3    d7
loopstart3

PL001

 [
    iadd     d12,d10
    move.l   (r0)+,d8
    move.l   (r1)+,d9
 ]
 [
    impyuu   d8,d9,d12
    iadd     d10,d0
    impysu   d8,d9,d10
 ]
    imacus   d8,d9,d10
    asll     #<16,d10
    loopend3

which sco, when unrolled by 3, transfer to:

PL001:

bmclr #$4000,sr.h
move.w #-3,d2
add d2,d7,d4
move.w #-1,d3
sub d3,d4,d4
tstgt d4
[
move.w #0,d1
jfd __dco_lu_0
]
iadd d7,d1
move.l d4,d1

__dco_lu_2:

[
move.l (r1)+,d4
move.l (r0)+,d11
iadd d12,d10
iadd d2,d1
]
[
impysu d11,d4,d5
move.l (r1)+,d9
move.l (r0)+,d8
iadd d10,d0
]
[
imacus d11,d4,d5
impyuu d11,d4,d13
move.l (r1)+,d14
]
[
impysu d8,d9,d4
move.l (r0)+,d15
asll #16,d5
tstgt d1
]
[
imacus d8,d9,d4
impysu d15,d14,d10
iadd d13,d5
]
[
asll #16,d4
impyuu d8,d9,d11
imacus d15,d14,d10
]
[
iadd d11,d4
iadd d5,d0
jtd __dco_lu_2
]
[
asll #16,d10
impyuu d15,d14,d12
iadd d4,d0
]

sub d2,d1,d1
iadd d3,d1
tstgt d1
jf __dco_lu_1


__dco_lu_0:
__dco_lu_3:

[
move.l (r1)+,d9
move.l (r0)+,d8
iadd d12,d10
]
iadd d3,d1
[
iadd d10,d0
impysu d8,d9,d10
impyuu d8,d9,d12
]
tstgt d1
imacus d8,d9,d10
jtd __dco_lu_3
asll #16,d10

__dco_lu_1:

Close examination of this code reveals that it consists of 5 different units:

1. Check if loop is executed for at least 3 iterations and therefore may be unrolled by 3:

move.w #-3,d2
add d2,d7,d4
move.w #-1,d3
sub d3,d4,d4
tstgt d4
[
move.w #0,d1
jfd __dco_lu_0
]
iadd d7,d1
move.l d4,d1

2. Unrolled loop setup:

bmclr #$4000,sr.h

3. Unrolled loop body:

__dco_lu_2:

[
move.l (r1)+,d4
move.l (r0)+,d11
iadd d12,d10
iadd d2,d1
]
[
impysu d11,d4,d5
move.l (r1)+,d9
move.l (r0)+,d8
iadd d10,d0
]
[
imacus d11,d4,d5
impyuu d11,d4,d13
move.l (r1)+,d14
]
[
impysu d8,d9,d4
move.l (r0)+,d15
asll #16,d5
tstgt d1
]
[
imacus d8,d9,d4
impysu d15,d14,d10
iadd d13,d5
]
[
asll #16,d4
impyuu d8,d9,d11
imacus d15,d14,d10
]
[
iadd d11,d4
iadd d5,d0
jtd __dco_lu_2
]
[
asll #16,d10
impyuu d15,d14,d12
iadd d4,d0
]

4. Check if there are iterations left to be performed and if there are not - jump to exit from the unrolled loop; otherwise - execute the code that executes original ( not unrolled ) loop code ( see unit 5 ).

sub d2,d1,d1
iadd d3,d1
tstgt d1
jf __dco_lu_1

5. Execute the original loop body:

__dco_lu_0:
__dco_lu_3:

[
move.l (r1)+,d9
move.l (r0)+,d8
iadd d12,d10
]
iadd d3,d1
[
iadd d10,d0
impysu d8,d9,d10
impyuu d8,d9,d12
]
tstgt d1
imacus d8,d9,d10
jtd __dco_lu_3
asll #16,d10

This code is executed when it is determined that loop doesn't have enough iterations to be unrolled (see unit 1) or when there are iterations left after unrolled loop is executed (see unit 4).

Note, that when unrolling count is power of 2, sco generates more efficient code. For example, when unrolling by 4, the generated code ( for the loop body ) looks as following:

falign
loopstart3

__dco_lu_1:

[
move.l (r1)+,d3
move.l (r0)+,d15
iadd d12,d10
]
[
impysu d15,d3,d4
move.l (r1)+,d2
move.l (r0)+,d5
iadd d10,d0
]
[
imacus d15,d3,d4
move.l (r1)+,d9
move.l (r0)+,d8
]
[
impyuu d15,d3,d13
impysu d5,d2,d3
asll #16,d4
]
[
imacus d5,d2,d3
impyuu d5,d2,d11
move.l (r1)+,d14
]
[
impysu d8,d9,d2
move.l (r0)+,d15
asll #16,d3
iadd d13,d4
]
[
imacus d8,d9,d2
impysu d15,d14,d10
iadd d11,d3
]
[
asll #16,d2
impyuu d8,d9,d5
iadd d4,d0
]
[
imacus d15,d14,d10
iadd d5,d2
iadd d3,d0
]
[
asll #16,d10
iadd d2,d0
impyuu d15,d14,d12
]
loopend3

All of the above leads to the following guidelines for using sco's loop unroller:

  • Do not unroll loops that has only few iterations to execute or unroll by the count significantly less then number of iterations.
  • Use power of 2 as loop unrolling count and/or choose the count that divides number of loop iterations.
  • Do not use loop unrolling if it is important to minimize code size.

Vectorization

Vectorization (sco parameter '-v') is another powerful optimization of the inner loop body. It overlaps epilog code execution for current loop iteration with prolog code execution for the next loop iteration. Essentially, the two consecutive loop iterations are fitted in the block of the size of the loop iteration (of size n). m instruction are chosen from the bottom of the first iteration and combined with n - m instructions from the top of the second iteration. The generated m + ( n - m ) = n instructions are optimized. This is done for all m from 1 to n-1 and the best resulting code is chosen. Of course, all necessary checkups to preserve logic of the code are performed by sco and resulting code.

Understanding vectorization

As an example, consider again the kernel of the dot product

for ( i = 0; i < n; i++ )
  {
    s += x[i]*y[i];
  }

unrolled by 2:

doen3    d7
loopstart3

PL001

[
    iadd     d12,d10
    move.l   (r0)+,d8
    move.l   (r1)+,d9
 ]
 [
    impyuu   d8,d9,d12
    iadd     d10,d0
    impysu   d8,d9,d10
 ]
    imacus   d8,d9,d10
    asll     #<16,d10
[
    iadd     d12,d10
    move.l   (r0)+,d8
    move.l   (r1)+,d9
 ]
 [
    impyuu   d8,d9,d12
    iadd     d10,d0
    impysu   d8,d9,d10
 ]
    imacus   d8,d9,d10
    asll     #<16,d10
    loopend3

Running vectorization on this code ( sco -v ) produces the following result:


move.l d7,d1
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
deceq d1
jfd __dco_vr_0
[
iadd d10,d0
move.l (r0)+,d4
move.l (r1)+,d3
]
[
impysu d8,d9,d1
impysu d4,d3,d10
]
[
imacus d8,d9,d1
impyuu d8,d9,d2
]
asll #16,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d2,d1
asll #16,d10
iadd d1,d0
bmclr #$4000,sr.h
jmp __dco_vr_1

__dco_vr_0:

doen3 d1
[
impysu d8,d9,d1
impyuu d8,d9,d2
]
[
imacus d8,d9,d1
impysu d4,d3,d10
]
dosetup3 __dco_vr_2
asll #16,d1
falign
loopstart3

__dco_vr_2:

[
imacus d4,d3,d10
move.l (r0)+,d8
move.l (r1)+,d9
iadd d2,d1
]
[
impyuu d4,d3,d12
asll #16,d10
iadd d1,d0
move.l (r0)+,d4
]
[
impysu d8,d9,d1
iadd d12,d10
move.l (r1)+,d3
]
[
imacus d8,d9,d1
iadd d10,d0
impyuu d8,d9,d2
]
[
asll #16,d1
impysu d4,d3,d10
]
loopend3
iadd d2,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d1,d0
asll #16,d10

__dco_vr_1:

In this case, the loop body is:

[
    iadd     d12,d10
    move.l   (r0)+,d8
    move.l   (r1)+,d9
 ]
 [
    impyuu   d8,d9,d12
    iadd     d10,d0
    impysu   d8,d9,d10
 ]
    imacus   d8,d9,d10
    asll     #<16,d10
[
    iadd     d12,d10
    move.l   (r0)+,d8
    move.l   (r1)+,d9
 ]
 [
    impyuu   d8,d9,d12
    iadd     d10,d0
    impysu   d8,d9,d10
 ]
    imacus   d8,d9,d10
    asll     #<16,d10

and m instructions

iadd d2,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d1,d0
asll #16,d10

are overlapped with n-m instructions

[
    iadd     d12,d10
    move.l   (r0)+,d8
    move.l   (r1)+,d9
 ]
 [
    impyuu   d8,d9,d12
    iadd     d10,d0
    impysu   d8,d9,d10
 ]
    imacus   d8,d9,d10
    asll     #<16,d10
[
    move.l   (r0)+,d8
    move.l   (r1)+,d9
 ]
 [
    impysu   d8,d9,d10
 ]

Again, as with loop unrolling, this code consist of number different units:

1. Check if loop is executed for at least 2 iterations (jump to __dco_vr_0) and if not - perform one iteration and exit iterations (jump to __dco_vr_1):

move.l d7,d1
[
iadd d12,d10
move.l (r0)+,d8
move.l (r1)+,d9
]
deceq d1
jfd __dco_vr_0
[
iadd d10,d0
move.l (r0)+,d4
move.l (r1)+,d3
]
[
impysu d8,d9,d1
impysu d4,d3,d10
]
[
imacus d8,d9,d1
impyuu d8,d9,d2
]
asll #16,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d2,d1
asll #16,d10
iadd d1,d0
bmclr #$4000,sr.h
jmp __dco_vr_1

2. Execute NM instructions from the top of the first iteration that were not chosen to be combined with instructions of second iteration:

__dco_vr_0:

doen3 d1
[
impysu d8,d9,d1
impyuu d8,d9,d2
]
[
imacus d8,d9,d1
impysu d4,d3,d10
]
dosetup3 __dco_vr_2
asll #16,d1

3. Execute vectorized loop:

falign
loopstart3

__dco_vr_2:

[
imacus d4,d3,d10
move.l (r0)+,d8
move.l (r1)+,d9
iadd d2,d1
]
[
impyuu d4,d3,d12
asll #16,d10
iadd d1,d0
move.l (r0)+,d4
]
[
impysu d8,d9,d1
iadd d12,d10
move.l (r1)+,d3
]
[
imacus d8,d9,d1
iadd d10,d0
impyuu d8,d9,d2
]
[
asll #16,d1
impysu d4,d3,d10
]
loopend3

4. Execute m instructions from the button of the second iteration that were not chosen to be combined with instructions of first iteration:

iadd d2,d1
[
imacus d4,d3,d10
impyuu d4,d3,d12
]
iadd d1,d0
asll #16,d10

__dco_vr_1:

So the guidelines for using sco's vectorizer will be:

  • Do not attempt to vectorize code that has only few iterations to execute.
  • Use loop unrolling if there are too few instructions in the body of a loop being vectorized.
  •  Beware that vectorization of large inner loop may consume great amount of CPU time.

Results of optimization and Information

The code patterns studied above (dot-product) show the following improvements achieved over the code fully optimized by the ccsc100 compiler:

Code fully optimized by the compiler

4 clocks/point

Loop unrolled by 3

3.3 clocks/point, 18% improvement

Loop unrolled by 4

2.5 clocks/point, 38% improvement

Loop unrolled by 4 and vectorized

2.25 clocks/point, 44% improvement

The similar or better code improvements were achieved for many others benchmarks and kernels - click here for more information.