Hi,
I would like to hear some opinions about a rather "philosophic" problem
(with severe practical consequences) that I have run into.
First, lets start with two "programming paradigms" that are taught in every
course and every book, and which are valid for almost every system, every
language and every problem:
1. Divide a problem into smaller (probably self-contained) subproblems, and
combine their solutions to get a solution of the original problem.
This goes hand in hand with:
2. Write code (functions) that accomplish one and exactly one thing, and
that are general (i.e. general enough to be reused for different problems
leading to the same subproblems).
These both lead to: write a set of routines that can be combined in
different ways to solve a large amount of problems easily.
I will illustrate my problem using a specific example (although I am sure it
is a rather general problem): the implementation of the Perona-Malik filter
for gray images. For that we have defined a class "image" (which is
essentially an array of floats). The P-M filter is (in its most
straightforward C++ implementation):
image u, d;
float l;
for(i=0;i<N;i++) {
d = 1/(1 + gradient_magnitude_squared(u)/(l*l));
u += 0.25*(gradient_left_x(d*gradient_right_x(u)) +
gradient_left_y(d*gradient_right_y(u)));
}
Naturally, we have defined image::operator+, image::operator-,
image::operator*, image::operator/ for different arguments (image and
float) and all the flavors of the gradient. Of course they are also
essential for many other image processing algorithms. Since in effect every
function call is the same operation applied to a lot of numbers it was
logical to accelerate the program by implementing it using the Intel SSE
instruction set. Since the operators and the gradient functions are not too
complex they were easily implemented using SSE, giving us about 100-200%
more performance (also in other algorithms => reusability). But looking at
the generated code you see that it looks like the following "pseudo code":
[gradient_magnitude_squared]
for_each_pixel_in_image:
load pixel into CPU
operate on pixel
store pixel in memory
[operator/]
for_each_pixel_in_image:
load pixel into CPU
operate on pixel
store pixel in memory
.....
Counting function and operator calls in the P-M loop gives us 11 function
calls, each one starting with a load and ending with a store operation
(leading to a lot of notoriously slow data transfer memory<=>CPU). Now the
idea was to implement the whole P-M filter as a single function in SSE to
reduce that to one load/store and perform all operations to one pixel at
once. I.e.:
[PM filter]
for_each_pixel_in_image:
load pixel into CPU
load neighbour pixels into CPU (for calulcating gradients)
perform all operations from the PM filter
store the pixel to memory
That gave as a performance improvement of more than 1000%! The downside: it
is a rather lenghty, complex and difficult to debug function. And it can
hardly ever be used for something else than a P-M filter. Thus it
contradicts the two paradigms given at the beginning.
So where to draw the line? How should we weigh performance against writing
clean reusable code? Is it worth the work and all the disadvantages to
write high-level functions fully optimized and in full length in assembler?
|
|
0
|
|
|
|
Reply
|
harald.grossauer (25)
|
9/18/2003 11:44:47 AM |
|
Harald Grossauer <harald.grossauer@uibk.ac.at> writes:
>
> So where to draw the line? How should we weigh performance against writing
> clean reusable code? Is it worth the work and all the disadvantages to
> write high-level functions fully optimized and in full length in assembler?
>
My inclination is to believe that your profiler has a better answer
to this question than I do.
--
Howard Ding
<hading@hading.dnsalias.com>
|
|
0
|
|
|
|
Reply
|
hading (66)
|
9/18/2003 11:52:34 AM
|
|
"Harald Grossauer" <harald.grossauer@uibk.ac.at> schreef in bericht
news:3f699aac@sia.uibk.ac.at...
> Hi,
>
> I would like to hear some opinions about a rather "philosophic" problem
> (with severe practical consequences) that I have run into.
[snip]
> That gave as a performance improvement of more than 1000%! The downside:
it
> is a rather lenghty, complex and difficult to debug function. And it can
> hardly ever be used for something else than a P-M filter. Thus it
> contradicts the two paradigms given at the beginning.
You can make functions inline, meaning these functions won't be called but
'weaved' into the caller. That would save you a bunch of functions calls
(and hopefully increase speed).
> So where to draw the line? How should we weigh performance against writing
> clean reusable code? Is it worth the work and all the disadvantages to
> write high-level functions fully optimized and in full length in
assembler?
I would use a profiler to see where your existing code is slow, then
optimize that code.
Floris
|
|
0
|
|
|
|
Reply
|
flvdbergMASTER (36)
|
9/18/2003 12:03:16 PM
|
|
Floris van den Berg wrote:
> You can make functions inline, meaning these functions won't be called but
> 'weaved' into the caller. That would save you a bunch of functions calls
> (and hopefully increase speed).
>
These functions are supposed to be in a library. Besides they are too long
for typical inlined functions.
|
|
0
|
|
|
|
Reply
|
harald.grossauer (25)
|
9/18/2003 12:52:11 PM
|
|
"Harald Grossauer" <harald.grossauer@uibk.ac.at> wrote in message
news:3f699aac@sia.uibk.ac.at...
> So where to draw the line? How should we weigh performance against writing
> clean reusable code? Is it worth the work and all the disadvantages to
> write high-level functions fully optimized and in full length in
assembler?
Write clean reusable code first. If that code is not fast enough there are
two solutions. Write faster code or use faster hardware. Which one to
choose is a business question, the short answer is to choose the cheapest
approach.
Optimization leads to specialization, at least to a certain degree, and that
leads to code which is in general less reusable. Is the trade off between
speed and reusability worth it? Sometimes an easy question, sometimes
hard and not one that I can help you with.
Often you are lucky and you can do algorithmic and source level
optimizations which does not affect reuse or maintainability. If you are
using assembly then reusability is to a certain extent thrown out of the
window anyway.
I am inclined to say that if you find these questions to hard to answer
for your current project you should go with what is best for the current
project and do not care to much about future reuse. But do care about
future maintenance. Easy to say, hard to do :)
Programming is certainly not a matter of black and white.
--
Thomas.
|
|
0
|
|
|
|
Reply
|
tstegen (281)
|
9/18/2003 12:54:30 PM
|
|
"Harald Grossauer" <harald.grossauer@uibk.ac.at> wrote in message
news:3f69aa77@sia.uibk.ac.at...
> Floris van den Berg wrote:
>
> > You can make functions inline, meaning these functions won't be called
but
> > 'weaved' into the caller. That would save you a bunch of functions calls
> > (and hopefully increase speed).
> >
>
> These functions are supposed to be in a library. Besides they are too long
> for typical inlined functions.
So write two versions. Use an inline version for other library routines and
wrap it in whatever is necessary to create one that can be called from
outside
the library. (Writing graphics routines myself, I've created some very large
inline functions). If performance is the problem to be solved, you need to
focus on methods that solve that problem, not methods that solve other
problems such as flexibility and code sharing. One solution does not always
solve all problems. -Wm
|
|
0
|
|
|
|
Reply
|
reply34 (474)
|
9/18/2003 3:07:37 PM
|
|
"Harald Grossauer" wrote
<snipped>
> So where to draw the line? How should we weigh performance against writing
> clean reusable code? Is it worth the work and all the disadvantages to
> write high-level functions fully optimized and in full length in
assembler?
My rule of thumb, always go in the direction of clarity, reusability and
manutenability. As performance is needed for special reasons, then
concentrate on that specific problem and optimize it.
A good book to read regarding these issues is "Code Complete", there you'll
see that when you design a specific class, method or system, besides
specifying functional requirements, you also specify non-functional
requirements.
For example, one thing may need to be as robust as possible, as another
thing may be required to be as fast as possible, but from experience, 80% of
time you're looking for clear, easy to mantain code. Just imagine having a
system that has 1 million lines of code and all of them are "tricky"
code.... I wouldn't want to work in a system like that. ;-)
Regards
Padu
|
|
0
|
|
|
|
Reply
|
padu (45)
|
9/18/2003 4:52:27 PM
|
|
Harald Grossauer wrote:
> So where to draw the line? How should we weigh performance against
> writing clean reusable code? Is it worth the work and all the
> disadvantages to write high-level functions fully optimized and in
> full length in assembler?
What's the cost of not having the speed verses the cost of not
having easily maintainable code? Can you afford to be slower?
Or can you better afford to create a fast black box, seal it
and leave it alone?
--
|_ CJSonnack <Chris@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL |
|_____________________________________________|_______________________|
|
|
0
|
|
|
|
Reply
|
Chris7 (2511)
|
9/18/2003 6:02:26 PM
|
|
Harald Grossauer <harald.grossauer@uibk.ac.at> wrote:
> image u, d;
> float l;
> for(i=0;i<N;i++) {
> d = 1/(1 + gradient_magnitude_squared(u)/(l*l));
> u += 0.25*(gradient_left_x(d*gradient_right_x(u)) +
> gradient_left_y(d*gradient_right_y(u)));
> }
As I understand it, this function can operate not only on whole images,
but also on individual pixels. Callers can then use it in a loop for
each pixel. If it is inlined, they get speed and modularity at the same
time. If you don't trust compiler inlining, you can use macro. If you
want to change it at run-time (via function pointer or library upgrade),
you can generate code at run-time, compile it, and load as shared
library.
|
|
0
|
|
|
|
Reply
|
robertvazan (81)
|
9/19/2003 9:26:31 AM
|
|
|
8 Replies
23 Views
(page loaded in 0.149 seconds)
|