In this article we demonstrate optimization results achieved by applying sco to a number of benchmarks and applications.

Table of Contents:

    1. The Framework
    2. Results
      1. Convolution & Correlation
      2. Edge Detection
      3. Matrix Multiplication
      4. Reed-Solomon encoder/decoder
      5. Customer-created application

The Framework

The study comprises five benchmarks:
  • Convolution & Correlation
  • Edge Detection
  • Matrix Multiplication
  • Reed-Solomon encoder/decoder
  • Customer-created application

In the sections that follow, we consider each of these algorithms in turn, showing an excerpt of its source code together with an excerpt of the corresponding generated assembly language.

All benchmarks were compiled with the ccsc100 compiler (version 1.5, build 113000-1633) using all optimizations it provides (-Ot2).

The following describes how a benchmark file foo.c was processed:

ccsc100 -S -Ot2 foo.c
sco -i foo.sl -o foo.slo <sco_options>

ccsc100 -S -Ot2 foo.c
compiles the input file 'foo.c' and generates the assembly output file 'foo.sl'. Note that -Ot2 was specified, enabling all optimizations provided by the compiler.

sco -i foo.sl -o foo.slo <sco_options>
optimizes the input file 'foo.sl', generating the output file 'foo.slo'. The <sco_options> depend on the application and are specified for each benchmark below.
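
For example, if the convolution benchmark of the next section were kept in a file named conv.c (a hypothetical name used only for illustration), the two steps would be:

ccsc100 -S -Ot2 conv.c
sco -i conv.sl -o conv.slo -lu 4 -v

where -lu 4 -v is the <sco_options> setting actually used for that benchmark (see below).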

The files foo.sl and foo.slo were then analyzed and compared.

Note that all reported code improvements are measured against code already maximally optimized by the compiler. Note also that all results were obtained by calling sco with particular parameter settings (<sco_options>); other settings could yield even better improvements.

Results

Convolution & Correlation

This is a classical algorithm that may be found as part of many DSP applications:

for ( i = 0; i < N; i++ ) {
    ap = A - I;
    bp = B - J;
    sum = 0;
    for ( q = 0; q < P; q++ ) {
        temp1 = *(ap += I);
        temp2 = *(bp += J);
        temp1 = temp1 * temp2;
        sum = sum + temp1;
    }
    *(C += K) = sum;
    A += I;
}
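
For reference, the pointer arithmetic above is equivalent to the following index form (our own restatement using the same variables, not part of the benchmark; it assumes the i-th result is stored at the i-th output location with stride K):

for ( i = 0; i < N; i++ ) {
    sum = 0;
    for ( q = 0; q < P; q++ )
        sum = sum + A[(i + q) * I] * B[q * J];   /* A is read with stride I, B with stride J */
    C[i * K] = sum;                              /* i-th output point, stride K */
}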


The compiler-generated code for the innermost loop looks as follows:

loopstart3
PL003
[
iadd d9,d5 ;[54] 4%=1
move.l (r1)+,d1 ;[53] 0%=0
move.l (r2)+,d4 ;[53] 0%=0
]
[
impyuu d1,d4,d9 ;[54] 1%=0
iadd d5,d3 ;[55] 5%=1
impysu d1,d4,d5 ;[54] 1%=0
]
imacus d1,d4,d5 ;[54] 2%=0
asll #<16,d5 ;[54] 3%=0
loopend3


and processes one point in 4 clocks (the loop body consists of four execution sets, each issuing in one clock).

Using sco with <sco_options> set to -lu 4 -v generates the following equivalent of the above kernel:

loopstart3
__dco_vr_7:
[
move.l (r2)+,d8
impysu d13,d12,d5
iadd d7,d3
iadd d10,d6
]
[
iadd d6,d3
move.l (r1)+,d6
impyuu d15,d14,d14
asll #16,d2
]
[
impysu d6,d8,d7
move.l (r1)+,d1
move.l (r2)+,d4
iadd d14,d2
]
[
imacus d13,d12,d5
imacus d6,d8,d7
move.l (r1)+,d15
]
[
move.l (r2)+,d14
impyuu d13,d12,d9
asll #16,d5
iadd d2,d3
]
[
impyuu d6,d8,d11
impysu d1,d4,d6
iadd d9,d5
]
[
asll #16,d7
imacus d1,d4,d6
impysu d15,d14,d2
]
[
iadd d5,d3
iadd d11,d7
impyuu d1,d4,d10
asll #16,d6
]
[
imacus d15,d14,d2
move.l (r1)+,d13
move.l (r2)+,d12
]
loopend3

and processes four points in 9 clocks (2.25 clocks per point versus the original 4), a 44% improvement.

Edge Detection

This application detects the edges in a 256 gray-level 128 x 128 pixel image and is based on the routines and algorithms found in the book "C Language Algorithms for Digital Signal Processing" by P.M. Embree and B. Kimble.

for (r = 0; r < N - K + 1; r++) {
    output_image_offset = output_image_ptr;
    output_image_offset += dead_cols;
    col = 0;
    for (c = 0; c < N - K + 1; c++) {
        input_image_ptr = input_image;
        input_image_ptr += (N * row);
        kernel_ptr = kernel;
        sum = 0;
        for (i = 0; i < K; i++) {
            input_image_offset = input_image_ptr;
            input_image_offset += col;
            kernel_offset = kernel_ptr;
            for (j = 0; j < K; j++) {
                temp1 = *input_image_offset++;
                temp2 = *kernel_offset++;
                sum += temp1 * temp2;
            }
            kernel_ptr += K;
            input_image_ptr += N;
        }
        *output_image_offset++ = (sum / normal_factor);
        col++;
    }
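
In index form, the two inner loops compute a K x K dot product for each output pixel (our own restatement using the same variables, not part of the benchmark; it assumes row tracks the outer loop counter in the same way col tracks the inner one):

sum = 0;
for (i = 0; i < K; i++)
    for (j = 0; j < K; j++)
        sum += input_image[N * (row + i) + (col + j)] * kernel[K * i + j];
*output_image_offset++ = (sum / normal_factor);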

The compiler-generated code for the inner loop (which itself contains the innermost loop) looks as follows:


loopstart2
L44
[
doen3 #2 ;[0] @II4
move.l (r0)+,d5 ;[300] 0%=0
]
move.l (r3)+,d6 ;[301] 0%=0
[
impysu d5,d6,d7 ;[302] 1%=0
impyuu d5,d6,d12 ;[302] 1%=0
]
imacus d5,d6,d7 ;[302] 2%=0
asll #<16,d7 ;[302] 3%=0
falign
loopstart3
L43
[
iadd d12,d7 ;[302] 4%=1
move.l (r0)+,d5 ;[300] 0%=0
move.l (r3)+,d6 ;[301] 0%=0
]
[
impyuu d5,d6,d12 ;[302] 1%=0
add d0,d7,d0 ;[301] 5%=1
impysu d5,d6,d7 ;[302] 1%=0
]
imacus d5,d6,d7 ;[302] 2%=0
asll #<16,d7 ;[302] 3%=0
loopend3
[
iadd d12,d7 ;[302] 4%=1
adda #>116,r0,r0 ;[0]
]
add d0,d7,d0 ;[301] 5%=1
loopend2

each iteration of which is executed in 15 clocks.

Using sco with <sco_options> set to -lu 2 generates the following equivalent of the above:


loopstart2
L44:
[
move.l (r3)+,d6
move.l (r0)+,d5
]
[
move.l (r3)+,d13
move.l (r0)+,d14
impysu d5,d6,d7
]
[
move.l (r3)+,d8
move.l (r0)+,d9
impysu d14,d13,d1
]
[
imacus d5,d6,d7
impysu d9,d8,d10
]
[
imacus d14,d13,d1
asll #16,d7
impyuu d5,d6,d12
]
[
imacus d9,d8,d10
asll #16,d1
impyuu d14,d13,d4
]
[
iadd d12,d7
asll #16,d10
impyuu d9,d8,d11
iadd d4,d1
]
[
add d0,d7,d0
iadd d11,d10
adda #116,r0,r0
]
add d0,d1,d0
add d0,d10,d0
loopend2

each iteration of which is executed in 10 clocks, a 33% improvement.

Matrix Multiplication

This is another important algorithm that may be found as part of many DSP applications:

A = a_matrix;
B = b_matrix;
for (i = 0; i < A_ROW; i++) {
    for (j = 0; j < B_COL; j++) {
        a = A;
        b = B;
        sum = 0;
        for (k = 0; k < B_ROW; k++) {
            term_a = *a;
            term_b = *b;
            a++;
            b += B_COL;
            sum += term_a * term_b;
        }
        *c_matrix++ = sum;
        B++;
    }
    A += A_COL;
    B = b_matrix;
}
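
In index form the same computation reads (our own restatement, not part of the benchmark; it assumes row-major storage, which the strides above imply, and that B_ROW equals A_COL):

for (i = 0; i < A_ROW; i++) {
    for (j = 0; j < B_COL; j++) {
        sum = 0;
        for (k = 0; k < B_ROW; k++)
            sum += a_matrix[i * A_COL + k] * b_matrix[k * B_COL + j];
        c_matrix[i * B_COL + j] = sum;
    }
}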

The compiler-generated code for the innermost loop looks as follows:

loopstart3
L56
[
iadd d8,d4 ;[81] 4%=1
move.l (r1)+,d2 ;[77] 0%=0
move.l (r3)+n3,d3 ;[78] 0%=0
]
[
impyuu d2,d3,d8 ;[81] 1%=0
iadd d4,d1 ;[80] 5%=1
impysu d2,d3,d4 ;[81] 1%=0
]
imacus d2,d3,d4 ;[81] 2%=0
asll #<16,d4 ;[81] 3%=0
loopend3

and processes one point in 4 clocks.

Using sco with <sco_options> set to -lu 2 -v generates the following equivalent of the above kernel:


loopstart3
__dco_vr_5:
[
imacus d7,d6,d4
move.l (r1)+,d2
move.l (r3)+n3,d3
iadd d5,d0
]
[
impyuu d7,d6,d8
asll #16,d4
iadd d0,d1
move.l (r1)+,d7
]
[
impysu d2,d3,d0
iadd d8,d4
move.l (r3)+n3,d6
]
[
imacus d2,d3,d0
iadd d4,d1
impyuu d2,d3,d5
]
[
asll #16,d0
impysu d7,d6,d4
]
loopend3

and processes two points in 5 clocks (2.5 clocks per point versus the original 4), a 38% improvement.

Reed-Solomon encoder/decoder

This benchmark is an encoder/decoder for Reed-Solomon codes; we include it to show how effective sco can be when optimizing code that is not "loop based". The source code of the Reed-Solomon encoder/decoder is too large to reproduce here, so we study only the inner part of the encoder:

for (i=kk-1; i>=0; i--)
{
    feedback = index_of[data[i]^bb[nn-kk-1]] ;
    if (feedback != -1)
    {
        for (j=nn-kk-1; j>0; j--)
            if (gg[j] != -1)
                bb[j] = bb[j-1]^alpha_to[(gg[j]+feedback)%nn] ;
            else
                bb[j] = bb[j-1] ;
        bb[0] = alpha_to[(gg[0]+feedback)%nn] ;
    }
    else
    {
        for (j=nn-kk-1; j>0; j--)
            bb[j] = bb[j-1] ;
        bb[0] = 0 ;
    } ;
} ;
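
To make the excerpt easier to follow, here is a sketch of the data structures it uses, based on the usual layout of this well-known encoder (these declarations are assumptions, not part of the listing; nn is the code length, kk the number of data symbols, and both are taken to be compile-time constants):

int alpha_to[nn + 1];  /* antilog table of the Galois field: alpha_to[i] = alpha^i       */
int index_of[nn + 1];  /* log table of the Galois field: index_of[alpha^i] = i           */
int gg[nn - kk + 1];   /* generator polynomial coefficients, stored in index (log) form, */
                       /* which is why gg[j] is added to feedback and reduced modulo nn  */
int bb[nn - kk];       /* the nn - kk parity symbols being computed                      */
int data[kk];          /* the kk data symbols being encoded                              */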

This loop nest is translated by the compiler into the following (only a part is listed):

L24
[
move.l (r5),d2 ;[133]
move.l (r2),d3 ;[134] B6
]
[
cmpeq.w #-1,d2 ;[133]
add d2,d4,d0 ;[134] B6
move.w #255,d1 ;[134] B6
]
bt <L27 ;[133]
jsr ___Qmod32_s ;[134]
[
asll #<2,d0 ;[134]
move.l #_alpha_to,r7 ;[134]
]
move.l d0,r6 ;[134]
nop ;[0] AGU stall
adda r7,r6 ;[134]
move.l (r6),d2 ;[134]
[
eor d3,d2 ;[134]
jmpd L26 ;[135]
]
move.l d2,(r4) ;[134]
L27
move.l (r2),d3 ;[136]
move.l d3,(r4) ;[136]
L26
[
sub #1,d5 ;[132]
suba #<4,r4 ;[132]
suba #<4,r5 ;[132]
]
[
tstgt d5 ;[132]
suba #<4,r2 ;[132]
]
bt L24

Using sco with default options (<sco_options> is left empty) generates the following equivalent of the above code:

L24:
[
move.l (r5),d2
move.w #255,d1
]
[
cmpeq.w #-1,d2
add d2,d4,d0
move.l (r2),d3
]
btd <L27
ift move.l d3,(r4)
; gso 4.15 - 3.12
jsr ___Qmod32_s
; gso 3.3 - 3.3
[
move.l #_alpha_to,r7
move.l d0,r6
]
nop
move.l (r7+r6),d2
eor d3,d2
move.l d2,(r4)
; gso 6.9 - 8.14
L27:
; gso 0.0 - 2.2
L26:
[
sub #1,d5
suba #4,r2
suba #4,r5
]
[
tstgt d5
suba #4,r4
]
bt L24

It is difficult to quantify the improvement achieved by sco here, but the difference in code quality is apparent. For example, note how the block at L27 was transformed: the store it contained is now performed conditionally (ift) in the delay slot of the branch, the block itself becomes empty, and the jmpd L26 is eliminated.

Customer-created application

For obvious reasons we omit the C source listing of this code. It was pointed out that the following compiler-generated loop greatly affects the overall performance of the application:

loopstart3
PL003
[
ift tfr d2,d5 ;[454] 7%=1
ift tfr d11,d9 ;[455] 7%=1
ifa move.f (r9)+n1,d2 ;[447] 0%=0
]
[
ift tfr d13,d8 ;[456] 7%=1
ift tfr d14,d7 ;[458] 7%=1
ifa move.f (r11)+n1,d11 ;[446] 0%=0
]
[
add d15,d11,d11 ;[446] 1%=0
add #<5,d14 ;[445] 8%=1
move.l (sp-68),d13 ;[447] 1%=0
move.f #2048,d12 ;[448] 1%=0
]
[
ift move.w (sp-98),d10 ;[457] 9%=1
ifa mac #8192,d2,d13 ;[447] 2%=0
ifa move.f (r10)+n1,d2 ;[448] 2%=0
]
[
macr d2,d12,d13 ;[448] 3%=0
mpy d11,d11,d2 ;[449] 3%=0
]
mpy d8,d2,d12 ;[451] 4%=0
mac -d5,d13,d12 ;[451] 5%=0
tstgt d12 ;[453] 6%=0
loopend3

which takes 10 clocks per iteration.

Using sco with <sco_options> set to -v generates the following equivalent of the above kernel:

falign
loopstart3
__dco_vr_2:
[
mac -d5,d13,d12
move.f (r11)+n1,d4
move.f (r9)+n1,d3
]
[
ift
move.w (r1),d10
ifa
tstgt d12
move.f #2048,d1
]
[
tfrt d13,d8
move.l (r0),d13
tfrt d11,d9
add d15,d4,d11
]
[
mac #8192,d3,d13
move.f (r10)+n1,d0
tfrt d2,d5
mpy d11,d11,d2
]
[
tfrt d14,d7
add #5,d14
macr d0,d1,d13
mpy d8,d2,d12
]
loopend3

which executes one loop iteration in 5 clocks, a 50% improvement.