CPU specific extensions contradict programming paradigms!?! [LONG]

  • Follow


Hi,

I would like to hear some opinions about a rather "philosophic" problem 
(with severe practical consequences) that I have run into.

First, lets start with two "programming paradigms" that are taught in every 
course and every book, and which are valid for almost every system, every 
language and every problem:

1. Divide a problem into smaller (probably self-contained) subproblems, and 
combine their solutions to get a solution of the original problem.

This goes hand in hand with:

2. Write code (functions) that accomplish one and exactly one thing, and 
that are general (i.e. general enough to be reused for different problems 
leading to the same subproblems).

These both lead to: write a set of routines that can be combined in 
different ways to solve a large amount of problems easily.

I will illustrate my problem using a specific example (although I am sure it 
is a rather general problem): the implementation of the Perona-Malik filter 
for gray images. For that we have defined a class "image" (which is 
essentially an array of floats). The P-M filter is (in its most 
straightforward C++ implementation):

image u, d;
float l;
for(i=0;i<N;i++) {
d = 1/(1 + gradient_magnitude_squared(u)/(l*l));
u += 0.25*(gradient_left_x(d*gradient_right_x(u)) + 
gradient_left_y(d*gradient_right_y(u)));
}

Naturally, we have defined image::operator+, image::operator-, 
image::operator*, image::operator/ for different arguments (image and 
float) and all the flavors of the gradient. Of course they are also 
essential for many other image processing algorithms. Since in effect every 
function call is the same operation applied to a lot of numbers it was 
logical to accelerate the program by implementing it using the Intel SSE 
instruction set. Since the operators and the gradient functions are not too 
complex they were easily implemented using SSE, giving us about 100-200% 
more performance (also in other algorithms => reusability). But looking at 
the generated code you see that it looks like the following "pseudo code":

[gradient_magnitude_squared]
for_each_pixel_in_image:
load pixel into CPU
operate on pixel
store pixel in memory
[operator/]
for_each_pixel_in_image:
load pixel into CPU
operate on pixel
store pixel in memory
.....

Counting function and operator calls in the P-M loop gives us 11 function 
calls, each one starting with a load and ending with a store operation 
(leading to a lot of notoriously slow data transfer memory<=>CPU). Now the 
idea was to implement the whole P-M filter as a single function in SSE to 
reduce that to one load/store and perform all operations to one pixel at 
once. I.e.:

[PM filter]
for_each_pixel_in_image:
load pixel into CPU
load neighbour pixels into CPU (for calulcating gradients)
perform all operations from the PM filter
store the pixel to memory

That gave as a performance improvement of more than 1000%! The downside: it 
is a rather lenghty, complex and difficult to debug function. And it can 
hardly ever be used for something else than a P-M filter. Thus it 
contradicts the two paradigms given at the beginning.

So where to draw the line? How should we weigh performance against writing 
clean reusable code? Is it worth the work and all the disadvantages to 
write high-level functions fully optimized and in full length in assembler?

0
Reply harald.grossauer (25) 9/18/2003 11:44:47 AM

Harald Grossauer <harald.grossauer@uibk.ac.at> writes:

>
> So where to draw the line? How should we weigh performance against writing 
> clean reusable code? Is it worth the work and all the disadvantages to 
> write high-level functions fully optimized and in full length in assembler?
>

My inclination is to believe that your profiler has a better answer
to this question than I do.

-- 
Howard Ding
<hading@hading.dnsalias.com>
0
Reply hading (66) 9/18/2003 11:52:34 AM


"Harald Grossauer" <harald.grossauer@uibk.ac.at> schreef in bericht
news:3f699aac@sia.uibk.ac.at...
> Hi,
>
> I would like to hear some opinions about a rather "philosophic" problem
> (with severe practical consequences) that I have run into.

[snip]

> That gave as a performance improvement of more than 1000%! The downside:
it
> is a rather lenghty, complex and difficult to debug function. And it can
> hardly ever be used for something else than a P-M filter. Thus it
> contradicts the two paradigms given at the beginning.

You can make functions inline, meaning these functions won't be called but
'weaved' into the caller. That would save you a bunch of functions calls
(and hopefully increase speed).

> So where to draw the line? How should we weigh performance against writing
> clean reusable code? Is it worth the work and all the disadvantages to
> write high-level functions fully optimized and in full length in
assembler?

I would use a profiler to see where your existing code is slow, then
optimize that code.

Floris


0
Reply flvdbergMASTER (36) 9/18/2003 12:03:16 PM

Floris van den Berg wrote:

> You can make functions inline, meaning these functions won't be called but
> 'weaved' into the caller. That would save you a bunch of functions calls
> (and hopefully increase speed).
> 

These functions are supposed to be in a library. Besides they are too long 
for typical inlined functions.

0
Reply harald.grossauer (25) 9/18/2003 12:52:11 PM

"Harald Grossauer" <harald.grossauer@uibk.ac.at> wrote in message
news:3f699aac@sia.uibk.ac.at...

> So where to draw the line? How should we weigh performance against writing
> clean reusable code? Is it worth the work and all the disadvantages to
> write high-level functions fully optimized and in full length in
assembler?

Write clean reusable code first. If that code is not fast enough there are
two solutions. Write faster code or use faster hardware. Which one to
choose is a business question, the short answer is to choose the cheapest
approach.

Optimization leads to specialization, at least to a certain degree, and that
leads to code which is in general less reusable. Is the trade off between
speed and reusability worth it? Sometimes an easy question, sometimes
hard and not one that I can help you with.

Often you are lucky and you can do algorithmic and source level
optimizations which does not affect reuse or maintainability. If you are
using assembly then reusability is to a certain extent thrown out of the
window anyway.

I am inclined to say that if you find these questions to hard to answer
for your current project you should go with what is best for the current
project and do not care to much about future reuse. But do care about
future maintenance. Easy to say, hard to do :)

Programming is certainly not a matter of black and white.

--
Thomas.


0
Reply tstegen (281) 9/18/2003 12:54:30 PM

"Harald Grossauer" <harald.grossauer@uibk.ac.at> wrote in message
news:3f69aa77@sia.uibk.ac.at...
> Floris van den Berg wrote:
>
> > You can make functions inline, meaning these functions won't be called
but
> > 'weaved' into the caller. That would save you a bunch of functions calls
> > (and hopefully increase speed).
> >
>
> These functions are supposed to be in a library. Besides they are too long
> for typical inlined functions.

So write two versions. Use an inline version for other library routines and
wrap it in whatever is necessary to create one that can be called from
outside
the library. (Writing graphics routines myself, I've created some very large
inline functions). If performance is the problem to be solved, you need to
focus on methods that solve that problem, not methods that solve other
problems such as flexibility and code sharing. One solution does not always
solve all problems. -Wm


0
Reply reply34 (474) 9/18/2003 3:07:37 PM

"Harald Grossauer" wrote
<snipped>
> So where to draw the line? How should we weigh performance against writing
> clean reusable code? Is it worth the work and all the disadvantages to
> write high-level functions fully optimized and in full length in
assembler?


My rule of thumb, always go in the direction of clarity, reusability and
manutenability. As performance is needed for special reasons, then
concentrate on that specific problem and optimize it.
A good book to read regarding these issues is "Code Complete", there you'll
see that when you design a specific class, method or system, besides
specifying functional requirements, you also specify non-functional
requirements.

For example, one thing may need to be as robust as possible, as another
thing may be required to be as fast as possible, but from experience, 80% of
time you're looking for clear, easy to mantain code. Just imagine having a
system that has 1 million lines of code and all of them are "tricky"
code.... I wouldn't want to work in a system like that. ;-)

Regards

Padu


0
Reply padu (45) 9/18/2003 4:52:27 PM

Harald Grossauer wrote:

> So where to draw the line? How should we weigh performance against
> writing clean reusable code? Is it worth the work and all the
> disadvantages to write high-level functions fully optimized and in
> full length in assembler?

What's the cost of not having the speed verses the cost of not
having easily maintainable code?  Can you afford to be slower?
Or can you better afford to create a fast black box, seal it
and leave it alone?

-- 
|_ CJSonnack <Chris@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL  |
|_____________________________________________|_______________________|
0
Reply Chris7 (2511) 9/18/2003 6:02:26 PM

Harald Grossauer <harald.grossauer@uibk.ac.at> wrote:

> image u, d;
> float l;
> for(i=0;i<N;i++) {
> d = 1/(1 + gradient_magnitude_squared(u)/(l*l));
> u += 0.25*(gradient_left_x(d*gradient_right_x(u)) + 
> gradient_left_y(d*gradient_right_y(u)));
> }

As I understand it, this function can operate not only on whole images,
but also on individual pixels. Callers can then use it in a loop for
each pixel. If it is inlined, they get speed and modularity at the same
time. If you don't trust compiler inlining, you can use macro. If you
want to change it at run-time (via function pointer or library upgrade),
you can generate code at run-time, compile it, and load as shared
library.
0
Reply robertvazan (81) 9/19/2003 9:26:31 AM

8 Replies
23 Views

(page loaded in 0.149 seconds)

Similiar Articles:






7/21/2012 6:32:45 PM


Reply: