In this article we demonstrate the optimization results achieved by applying sco to a number of benchmarks and applications.
Table of Contents:
- The Framework
- Results
  - Convolution & Correlation
  - Edge Detection
  - Matrix Multiplication
  - Reed-Solomon encoder/decoder
  - Customer-created application
The Framework

The study comprises five benchmarks:
- Convolution & Correlation
- Edge Detection
- Matrix Multiplication
- Reed-Solomon encoder/decoder
- Customer-created application
In the sections that follow, we consider each of these algorithms in turn, showing an excerpt of the algorithm's source code and an excerpt of the assembly language generated for it.
All benchmarks were compiled with the ccsc100 compiler (version 1.5, build 113000-1633) with all of the optimizations it provides enabled (-Ot2).
The benchmark foo.c was processed as follows:

ccsc100 -S -Ot2 foo.c
sco -i foo.sl -o foo.slo <sco_options>

ccsc100 -S -Ot2 foo.c compiles the input file 'foo.c' and generates the assembly output file 'foo.sl'. Note that -Ot2 was specified, enabling all of the optimizations provided by the compiler.

sco -i foo.sl -o foo.slo <sco_options> optimizes the input file 'foo.sl' and generates the output file 'foo.slo'. The <sco_options> depend on the application and are specified for each benchmark below.

The files foo.sl and foo.slo were then analyzed and compared.
Note that all of the code improvements reported here are relative to the performance of code already maximally optimized by the compiler. Note also that all results were obtained by calling sco with particular parameter settings (<sco_options>); other settings could yield even better improvements.
Results
Convolution & Correlation

A classical algorithm found as part of many DSP applications:
for ( i = 0; i < N; i++ ) {
    ap = A - I;
    bp = B - J;
    sum = 0;
    for ( q = 0; q < P; q++ ) {
        temp1 = *(ap += I);
        temp2 = *(bp += J);
        temp1 = temp1 * temp2;
        sum = sum + temp1;
    }
    *(C += K) = sum;
    A += I;
}
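For readers who wish to experiment with this kernel on a host machine, the following is a minimal, self-contained sketch of the same loop. The data type, the buffer sizes, and the stride values I, J, and K are assumptions chosen for illustration only; the benchmark's actual declarations are not part of the excerpt above.

#include <stdio.h>

#define N 8   /* number of output points (assumed)          */
#define P 4   /* number of taps per output point (assumed)  */
#define I 1   /* stride through A (assumed)                 */
#define J 1   /* stride through B (assumed)                 */
#define K 1   /* stride through the output C (assumed)      */

int main(void)
{
    /* Buffer sizes follow from the assumed strides and loop bounds. */
    static int a_data[(N + P) * I];
    static int b_data[P * J];
    static int c_data[N * K + 1];

    int *A = a_data, *B = b_data, *C = c_data;
    int *ap, *bp;
    int temp1, temp2, sum;
    int i, q;

    /* Deterministic test inputs. */
    for (i = 0; i < (N + P) * I; i++) a_data[i] = i + 1;
    for (i = 0; i < P * J; i++)       b_data[i] = 2 * i + 1;

    /* The kernel from the excerpt above, unchanged. */
    for (i = 0; i < N; i++) {
        ap = A - I;
        bp = B - J;
        sum = 0;
        for (q = 0; q < P; q++) {
            temp1 = *(ap += I);
            temp2 = *(bp += J);
            temp1 = temp1 * temp2;
            sum = sum + temp1;
        }
        *(C += K) = sum;
        A += I;
    }

    /* The kernel writes its results starting at c_data[K]. */
    for (i = 1; i <= N; i++)
        printf("c[%d] = %d\n", i, c_data[i * K]);
    return 0;
}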
The compiler-generated code for the innermost loop of the benchmark looks as follows:
loopstart3
PL003
    [
      iadd    d9,d5            ;[54] 4%=1
      move.l  (r1)+,d1         ;[53] 0%=0
      move.l  (r2)+,d4         ;[53] 0%=0
    ]
    [
      impyuu  d1,d4,d9         ;[54] 1%=0
      iadd    d5,d3            ;[55] 5%=1
      impysu  d1,d4,d5         ;[54] 1%=0
    ]
    imacus  d1,d4,d5           ;[54] 2%=0
    asll    #<16,d5            ;[54] 3%=0
loopend3
and processes one point in 4 clocks.
Using sco with <sco_options> set to -lu 4 -v generates the following equivalent of the above kernel:
loopstart3
__dco_vr_7:
    [
      move.l  (r2)+,d8
      impysu  d13,d12,d5
      iadd    d7,d3
      iadd    d10,d6
    ]
    [
      iadd    d6,d3
      move.l  (r1)+,d6
      impyuu  d15,d14,d14
      asll    #16,d2
    ]
    [
      impysu  d6,d8,d7
      move.l  (r1)+,d1
      move.l  (r2)+,d4
      iadd    d14,d2
    ]
    [
      imacus  d13,d12,d5
      imacus  d6,d8,d7
      move.l  (r1)+,d15
    ]
    [
      move.l  (r2)+,d14
      impyuu  d13,d12,d9
      asll    #16,d5
      iadd    d2,d3
    ]
    [
      impyuu  d6,d8,d11
      impysu  d1,d4,d6
      iadd    d9,d5
    ]
    [
      asll    #16,d7
      imacus  d1,d4,d6
      impysu  d15,d14,d2
    ]
    [
      iadd    d5,d3
      iadd    d11,d7
      impyuu  d1,d4,d10
      asll    #16,d6
    ]
    [
      imacus  d15,d14,d2
      move.l  (r1)+,d13
      move.l  (r2)+,d12
    ]
loopend3
and processes four points in 9 clocks (2.25 clocks per point instead of 4), a 44% improvement.
Edge Detection

This application detects the edges in a 256 gray-level 128 x 128 pixel image and is based on the routines and algorithms found in the book "C Language Algorithms for Digital Signal Processing" by P.M. Embree and B. Kimble.
for (r = 0; r < N - K + 1; r++) {
    output_image_offset = output_image_ptr;
    output_image_offset += dead_cols;
    col = 0;
    for (c = 0; c < N - K + 1; c++) {
        input_image_ptr = input_image;
        input_image_ptr += (N * row);
        kernel_ptr = kernel;
        sum = 0;
        for (i = 0; i < K; i++) {
            input_image_offset = input_image_ptr;
            input_image_offset += col;
            kernel_offset = kernel_ptr;
            for (j = 0; j < K; j++) {
                temp1 = *input_image_offset++;
                temp2 = *kernel_offset++;
                sum += temp1 * temp2;
            }
            kernel_ptr += K;
            input_image_ptr += N;
        }
        *output_image_offset++ = (sum / normal_factor);
        col++;
    }
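Stripped of the pointer arithmetic, the kernel above is an ordinary K x K two-dimensional convolution over the image. The following is a minimal sketch of the same computation with explicit indexing; the kernel size, the kernel weights, the normal_factor value, and the synthetic test image are all assumptions made for illustration, not the benchmark's actual data.

#include <stdio.h>

#define N 128   /* the image is N x N pixels (128 x 128 in the benchmark) */
#define K 3     /* the convolution kernel is K x K (size assumed here)    */

static int input_image[N][N];
static int output_image[N - K + 1][N - K + 1];

/* Placeholder Laplacian-style kernel; the benchmark's actual kernel
 * and normal_factor are not shown in the excerpt above.               */
static int kernel[K][K] = {
    { -1, -1, -1 },
    { -1,  8, -1 },
    { -1, -1, -1 }
};
static int normal_factor = 1;

int main(void)
{
    int r, c, i, j, sum;

    /* Synthetic input: a vertical step edge in the middle of the image. */
    for (r = 0; r < N; r++)
        for (c = 0; c < N; c++)
            input_image[r][c] = (c < N / 2) ? 0 : 255;

    /* The same computation as the excerpt, with explicit indexing:
     * each output pixel is the normalized weighted sum of the K x K
     * neighborhood under the kernel.                                   */
    for (r = 0; r < N - K + 1; r++) {
        for (c = 0; c < N - K + 1; c++) {
            sum = 0;
            for (i = 0; i < K; i++)
                for (j = 0; j < K; j++)
                    sum += input_image[r + i][c + j] * kernel[i][j];
            output_image[r][c] = sum / normal_factor;
        }
    }

    /* A pixel far from the edge and a pixel straddling it. */
    printf("flat region:   %d\n", output_image[0][0]);
    printf("near the edge: %d\n", output_image[0][N / 2 - 2]);
    return 0;
}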
The compiler-generated code for the benchmark's inner loop (which itself contains another loop) looks as follows:
loopstart2
L44
    [
      doen3   #2               ;[0] @II4
      move.l  (r0)+,d5         ;[300] 0%=0
    ]
    move.l  (r3)+,d6           ;[301] 0%=0
    [
      impysu  d5,d6,d7         ;[302] 1%=0
      impyuu  d5,d6,d12        ;[302] 1%=0
    ]
    imacus  d5,d6,d7           ;[302] 2%=0
    asll    #<16,d7            ;[302] 3%=0
    falign
loopstart3
L43
    [
      iadd    d12,d7           ;[302] 4%=1
      move.l  (r0)+,d5         ;[300] 0%=0
      move.l  (r3)+,d6         ;[301] 0%=0
    ]
    [
      impyuu  d5,d6,d12        ;[302] 1%=0
      add     d0,d7,d0         ;[301] 5%=1
      impysu  d5,d6,d7         ;[302] 1%=0
    ]
    imacus  d5,d6,d7           ;[302] 2%=0
    asll    #<16,d7            ;[302] 3%=0
loopend3
    [
      iadd    d12,d7           ;[302] 4%=1
      adda    #>116,r0,r0      ;[0]
    ]
    add     d0,d7,d0           ;[301] 5%=1
loopend2
Each iteration of this loop executes in 15 clocks.
Using sco with <sco_options> set to -lu 2 generates the following equivalent of the above:
loopstart2
L44:
    [
      move.l  (r3)+,d6
      move.l  (r0)+,d5
    ]
    [
      move.l  (r3)+,d13
      move.l  (r0)+,d14
      impysu  d5,d6,d7
    ]
    [
      move.l  (r3)+,d8
      move.l  (r0)+,d9
      impysu  d14,d13,d1
    ]
    [
      imacus  d5,d6,d7
      impysu  d9,d8,d10
    ]
    [
      imacus  d14,d13,d1
      asll    #16,d7
      impyuu  d5,d6,d12
    ]
    [
      imacus  d9,d8,d10
      asll    #16,d1
      impyuu  d14,d13,d4
    ]
    [
      iadd    d12,d7
      asll    #16,d10
      impyuu  d9,d8,d11
      iadd    d4,d1
    ]
    [
      add     d0,d7,d0
      iadd    d11,d10
      adda    #116,r0,r0
    ]
    add     d0,d1,d0
    add     d0,d10,d0
loopend2
Each iteration of this loop executes in 10 clocks, a 33% improvement.
Matrix Multiplication

This is another important algorithm that may be found as part of many DSP applications:
A = a_matrix;
B = b_matrix;
for (i = 0; i < A_ROW; i++) {
    for (j = 0; j < B_COL; j++) {
        a = A;
        b = B;
        sum = 0;
        for (k = 0; k < B_ROW; k++) {
            term_a = *a;
            term_b = *b;
            a++;
            b += B_COL;
            sum += term_a * term_b;
        }
        *c_matrix++ = sum;
        B++;
    }
    A += A_COL;
    B = b_matrix;
}
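In the excerpt, a walks across a row of A while b strides through B in steps of B_COL, i.e. it walks down a column of B. As a point of reference, here is a minimal, self-contained sketch of the same multiplication; the matrix dimensions and the integer element type are assumptions for illustration only.

#include <stdio.h>

#define A_ROW 4          /* rows of A               (assumed) */
#define A_COL 3          /* columns of A, = B_ROW   (assumed) */
#define B_ROW A_COL
#define B_COL 5          /* columns of B            (assumed) */

int main(void)
{
    static int a_matrix[A_ROW * A_COL];
    static int b_matrix[B_ROW * B_COL];
    static int c_result[A_ROW * B_COL];

    int *A, *B, *a, *b, *c_matrix;
    int term_a, term_b, sum;
    int i, j, k;

    /* Deterministic test data. */
    for (i = 0; i < A_ROW * A_COL; i++) a_matrix[i] = i + 1;
    for (i = 0; i < B_ROW * B_COL; i++) b_matrix[i] = i + 1;

    A = a_matrix;
    B = b_matrix;
    c_matrix = c_result;

    /* The kernel from the excerpt: a walks row i of A while b,
     * striding by B_COL, walks column j of B.                   */
    for (i = 0; i < A_ROW; i++) {
        for (j = 0; j < B_COL; j++) {
            a = A;
            b = B;
            sum = 0;
            for (k = 0; k < B_ROW; k++) {
                term_a = *a;
                term_b = *b;
                a++;
                b += B_COL;
                sum += term_a * term_b;
            }
            *c_matrix++ = sum;
            B++;
        }
        A += A_COL;
        B = b_matrix;
    }

    /* Print the A_ROW x B_COL result. */
    for (i = 0; i < A_ROW; i++) {
        for (j = 0; j < B_COL; j++)
            printf("%6d ", c_result[i * B_COL + j]);
        printf("\n");
    }
    return 0;
}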
The compiler-generated code for the benchmark's innermost loop looks as follows:
loopstart3
L56
    [
      iadd    d8,d4            ;[81] 4%=1
      move.l  (r1)+,d2         ;[77] 0%=0
      move.l  (r3)+n3,d3       ;[78] 0%=0
    ]
    [
      impyuu  d2,d3,d8         ;[81] 1%=0
      iadd    d4,d1            ;[80] 5%=1
      impysu  d2,d3,d4         ;[81] 1%=0
    ]
    imacus  d2,d3,d4           ;[81] 2%=0
    asll    #<16,d4            ;[81] 3%=0
loopend3
and processes one point in 4 clocks.
Using sco with <sco_options> set to -lu 2 -v generates the following equivalent of the above kernel:
loopstart3
__dco_vr_5:
    [
      imacus  d7,d6,d4
      move.l  (r1)+,d2
      move.l  (r3)+n3,d3
      iadd    d5,d0
    ]
    [
      impyuu  d7,d6,d8
      asll    #16,d4
      iadd    d0,d1
      move.l  (r1)+,d7
    ]
    [
      impysu  d2,d3,d0
      iadd    d8,d4
      move.l  (r3)+n3,d6
    ]
    [
      imacus  d2,d3,d0
      iadd    d4,d1
      impyuu  d2,d3,d5
    ]
    [
      asll    #16,d0
      impysu  d7,d6,d4
    ]
loopend3
and processes two points in 5 clocks (2.5 clocks per point instead of 4), a 38% improvement.
Reed-Solomon encoder/decoder

An encoder/decoder for Reed-Solomon codes; we include it to show how effective sco can be when optimizing code that is not "loop based". The source code of the Reed-Solomon encoder/decoder is too large to reproduce here, so we study only the inner part of the encoder:
for (i = kk - 1; i >= 0; i--) {
    feedback = index_of[data[i] ^ bb[nn - kk - 1]];
    if (feedback != -1) {
        for (j = nn - kk - 1; j > 0; j--)
            if (gg[j] != -1)
                bb[j] = bb[j - 1] ^ alpha_to[(gg[j] + feedback) % nn];
            else
                bb[j] = bb[j - 1];
        bb[0] = alpha_to[(gg[0] + feedback) % nn];
    } else {
        for (j = nn - kk - 1; j > 0; j--)
            bb[j] = bb[j - 1];
        bb[0] = 0;
    }
}
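For orientation, the names in this excerpt follow the conventions of widely used reference implementations of Reed-Solomon coding over GF(2^m). The declarations below are a plausible sketch of what they might look like, not the benchmark's actual source; the concrete values of mm, nn, and kk are assumptions.

/* Plausible declarations for the symbols used above (assumed, for
 * orientation only):
 *
 *   nn        - code length in symbols, 2^mm - 1
 *   kk        - number of data symbols; nn - kk parity symbols
 *   alpha_to  - antilog table of GF(2^mm): alpha_to[i] = alpha^i
 *   index_of  - log table of GF(2^mm): index_of[alpha^i] = i,
 *               with index_of[0] = -1
 *   gg        - generator polynomial coefficients, stored in log form
 *               (-1 marks a zero coefficient)
 *   data      - the kk data symbols to be encoded
 *   bb        - the nn - kk parity symbols produced by the encoder
 */
#define mm  8                     /* bits per symbol (assumed)   */
#define nn  255                   /* (1 << mm) - 1               */
#define kk  223                   /* example value only          */

int alpha_to[nn + 1], index_of[nn + 1];
int gg[nn - kk + 1];
int data[kk], bb[nn - kk];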
The compiler translates this loop into the following (only part of the output is listed):
L24
    [
      move.l  (r5),d2          ;[133]
      move.l  (r2),d3          ;[134] B6
    ]
    [
      cmpeq.w #-1,d2           ;[133]
      add     d2,d4,d0         ;[134] B6
      move.w  #255,d1          ;[134] B6
    ]
    bt      <L27               ;[133]
    jsr     ___Qmod32_s        ;[134]
    [
      asll    #<2,d0           ;[134]
      move.l  #_alpha_to,r7    ;[134]
    ]
    move.l  d0,r6              ;[134]
    nop                        ;[0] AGU stall
    adda    r7,r6              ;[134]
    move.l  (r6),d2            ;[134]
    [
      eor     d3,d2            ;[134]
      jmpd    L26              ;[135]
    ]
    move.l  d2,(r4)            ;[134]
L27
    move.l  (r2),d3            ;[136]
    move.l  d3,(r4)            ;[136]
L26
    [
      sub     #1,d5            ;[132]
      suba    #<4,r4           ;[132]
      suba    #<4,r5           ;[132]
    ]
    [
      tstgt   d5               ;[132]
      suba    #<4,r2           ;[132]
    ]
    bt      L24
Using sco with default options (<sco_options> left empty) generates the following equivalent of the above kernel:
L24:
    [
      move.l  (r5),d2
      move.w  #255,d1
    ]
    [
      cmpeq.w #-1,d2
      add     d2,d4,d0
      move.l  (r2),d3
    ]
    btd     <L27
    ift move.l  d3,(r4)
                               ; gso 4.15 - 3.12
    jsr     ___Qmod32_s
                               ; gso 3.3 - 3.3
    [
      move.l  #_alpha_to,r7
      move.l  d0,r6
    ]
    nop
    move.l  (r7+r6),d2
    eor     d3,d2
    move.l  d2,(r4)
                               ; gso 6.9 - 8.14
L27:
                               ; gso 0.0 - 2.2
L26:
    [
      sub     #1,d5
      suba    #4,r2
      suba    #4,r5
    ]
    [
      tstgt   d5
      suba    #4,r4
    ]
    bt      L24
It is difficult to quantify the improvement achieved by sco here, but the difference in code quality is apparent. Note, for example, how the block at L27 was transformed and how the jmpd L26 was eliminated.
Customer-created application

For obvious reasons we omit the C source listing of this code. It was pointed out that the following compiler-generated loop greatly affects the overall performance of the application:
loopstart3
PL003
    [
      ift tfr     d2,d5        ;[454] 7%=1
      ift tfr     d11,d9       ;[455] 7%=1
      ifa move.f  (r9)+n1,d2   ;[447] 0%=0
    ]
    [
      ift tfr     d13,d8       ;[456] 7%=1
      ift tfr     d14,d7       ;[458] 7%=1
      ifa move.f  (r11)+n1,d11 ;[446] 0%=0
    ]
    [
      add     d15,d11,d11      ;[446] 1%=0
      add     #<5,d14          ;[445] 8%=1
      move.l  (sp-68),d13      ;[447] 1%=0
      move.f  #2048,d12        ;[448] 1%=0
    ]
    [
      ift move.w  (sp-98),d10  ;[457] 9%=1
      ifa mac     #8192,d2,d13 ;[447] 2%=0
      ifa move.f  (r10)+n1,d2  ;[448] 2%=0
    ]
    [
      macr    d2,d12,d13       ;[448] 3%=0
      mpy     d11,d11,d2       ;[449] 3%=0
    ]
    mpy     d8,d2,d12          ;[451] 4%=0
    mac     -d5,d13,d12        ;[451] 5%=0
    tstgt   d12                ;[453] 6%=0
loopend3
which takes 10 clocks per iteration.
Using sco with <sco_options> set to -v generates the following equivalent of the above kernel:
falign
loopstart3
__dco_vr_2:
    [
      mac     -d5,d13,d12
      move.f  (r11)+n1,d4
      move.f  (r9)+n1,d3
    ]
    [
      ift move.w  (r1),d10
      ifa tstgt   d12
      move.f  #2048,d1
    ]
    [
      tfrt    d13,d8
      move.l  (r0),d13
      tfrt    d11,d9
      add     d15,d4,d11
    ]
    [
      mac     #8192,d3,d13
      move.f  (r10)+n1,d0
      tfrt    d2,d5
      mpy     d11,d11,d2
    ]
    [
      tfrt    d14,d7
      add     #5,d14
      macr    d0,d1,d13
      mpy     d8,d2,d12
    ]
loopend3
which processes a loop iteration in 5 clocks, a 50% improvement.