Measuring Load Balancing

Hi,

I am writing an MPI app in which the load on each node changes
dynamically at runtime. It is a code that simulates a fluid using an
adaptive mesh refinement algorithm for some number of timesteps. In
order to work out how serious this problem is in terms of time wasted,
and to test load balancing strategies, I would like to know the time
wasted due to load imbalance. I am having a lot of difficulty obtaining
such information.

The profilers I have tried give me the time spent in functions, with a
separation between my computational code and the MPI communication
code, and they give this information for each node. If the load were
imbalanced but did not change dynamically, I could just use the time
spent in the computational code on each node to work out the time
wasted:

Ideal computing time per node =
    (Sum over nodes(Time in computational code)) / number of nodes
Total time wasted =
    Max over nodes(Time in computational code) - Ideal computing time per node

However, such a figure would be meaningless if the load on each node
changed dynamically. In a two-node system, the nodes could be imbalanced
at every timestep, with node 1 carrying most of the load for half of the
run and node 2 carrying most of the load for the other half. Each node's
total "Time in computational code" would then equal the "Ideal computing
time per node" figure, giving a Total time wasted of 0, when this is
definitely not the case.
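
To make the two-node case concrete, here is a small sketch of the whole-run formula above. It is in Python only for brevity; the function name and the timings are made up for illustration and do not come from any real run.

```python
def aggregate_waste(times):
    """Whole-run formula: times[step][node] is seconds of computation."""
    # Total computational time per node over the whole run.
    per_node = [sum(column) for column in zip(*times)]
    ideal = sum(per_node) / len(per_node)
    return max(per_node) - ideal

# Hypothetical run: node 1 is three times slower in step 1,
# node 2 is three times slower in step 2.
two_node_run = [[3.0, 1.0],
                [1.0, 3.0]]
print(aggregate_waste(two_node_run))  # prints 0.0
```

Both nodes total 4 seconds, so the formula reports no waste even though one node sits idle for 2 seconds in each step.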

Ideally a profiler could work out the total time wasted for each
timestep, where load does not change, and then sum over all the
timesteps. However, I cannot find a profiler that gives me this, or one
that gives me information with which I could work this out.
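
As a rough sketch of that per-timestep sum (Python for brevity; the function name and the sample timings are hypothetical), assuming the load is constant within a timestep and that per-node compute times are available for each step:

```python
def per_step_waste(times):
    """Sum over timesteps of (slowest node - ideal time per node).

    times[step][node] is seconds of computation in that timestep.
    """
    return sum(max(step) - sum(step) / len(step) for step in times)

# Hypothetical two-node run where the heavy node alternates.
alternating_run = [[3.0, 1.0],
                   [1.0, 3.0]]
print(per_step_waste(alternating_run))  # prints 2.0
```

Each step contributes 3 - 2 = 1 second of waste, so the alternating run that looks balanced over the whole run shows 2 seconds wasted.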

Can anyone help with this?

Thanks,

Nym.

Reply neverwillreply (16) 11/13/2006 3:56:43 PM

Hello again,

Nym wrote:
> Ideally a profiler could work out the total time wasted for each
> timestep, where load does not change, and then sum over all the
> timesteps. However, I cannot find a profiler that gives me this, or one
> that gives me information with which I could work this out.

I have realised that it shouldn't be too difficult to roll my own,
using MPI_Wtime, so this is what I will do.

Nym.

Reply Nym 11/14/2006 8:16:16 PM


> Hello again,
>
> Nym wrote:
>> Ideally a profiler could work out the total time wasted for each
>> timestep, where load does not change, and then sum over all the
>> timesteps. However, I cannot find a profiler that gives me this, or one
>> that gives me information with which I could work this out.
>
> I have realised that it shouldn't be too difficult to roll my own,
> using MPI_Wtime, so this is what I will do.

You might consider evaluating the Intel Trace Analyzer and Collector
to get some insight into the behavior of your code.
Reply Georg 11/15/2006 9:28:48 AM

Georg Bisseling wrote:
> > Hello again,
> >
> > Nym wrote:
> >> Ideally a profiler could work out the total time wasted for each
> >> timestep, where load does not change, and then sum over all the
> >> timesteps. However, I cannot find a profiler that gives me this, or one
> >> that gives me information with which I could work this out.
> >
> > I have realised that it shouldn't be too difficult to roll my own,
> > using MPI_Wtime, so this is what I will do.
>
> You might consider evaluating the Intel Trace Analyzer and Collector
> to get some insight into the behavior of your code.

I did have a look at that, and there are nice timeline graphs.
However, I couldn't figure out a way for it to give me a figure for the
time wasted due to load imbalance. I've looked at "Tau" as well, and it
seems to be able to give lots of information, just not this information.
(Although, yes, perhaps I am not using them properly!)

Now I am profiling with mpiP, which gives basic information, and I use
calls to MPI_Wtime to figure out my time wasted due to load imbalance
statistic.
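
A minimal sketch of this kind of bookkeeping, not the actual code from the app. It is written in Python for brevity, and the class and its parameter names are hypothetical. `clock` stands in for MPI_Wtime, and `reduce_max` / `reduce_sum` stand in for MPI_Allreduce with MPI_MAX and MPI_SUM over all ranks (with mpi4py these could be `comm.allreduce(x, op=MPI.MAX)` and `comm.allreduce(x, op=MPI.SUM)`).

```python
import time

class ImbalanceTracker:
    """Accumulates time wasted to load imbalance, one timestep at a time."""

    def __init__(self, nprocs, reduce_max, reduce_sum, clock=time.perf_counter):
        self.nprocs = nprocs          # number of ranks taking part
        self.reduce_max = reduce_max  # assumed: max of a value over all ranks
        self.reduce_sum = reduce_sum  # assumed: sum of a value over all ranks
        self.clock = clock            # assumed: wall-clock timer (MPI_Wtime)
        self.wasted = 0.0
        self._start = None

    def begin_compute(self):
        # Call just before this rank's computation for the timestep.
        self._start = self.clock()

    def end_compute(self):
        # Call just after the computation, before any communication.
        local = self.clock() - self._start
        slowest = self.reduce_max(local)
        ideal = self.reduce_sum(local) / self.nprocs
        self.wasted += slowest - ideal
        return local
```

Each rank would call `begin_compute()` and `end_compute()` around the computational part of every timestep and read `wasted` at the end. The two reductions per step synchronize the ranks and add some overhead of their own; storing the local times and reducing once at the end of the run is one possible alternative.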

Nym.

Reply Nym 11/16/2006 11:07:23 AM

On Thu, 16 Nov 2006 12:07:23 +0100, Nym <neverwillreply@gmail.com> wrote:

>> You might consider evaluating the Intel Trace Analyzer and Collector
>> to get some insight into the behavior of your code.
> I did have a look at that - and there are nice timeline graphs.
> However, I couldn't figure out a way for it to give me a figure of time
> wasted due to load imbalance.

Maybe try the function profile or the collective operations profile, which
can show the time spent in function calls (operations) by process (thread),
rank, operation type, communicator etc.

Well, I admit that it will not calculate exactly the formula you gave, but
it should allow you to analyze whether there is a load balancing problem.
It can also export the profile values for arbitrary time intervals, to make
them available to scripts or spreadsheets.


-- 
This signature intentionally left almost blank.
Reply Georg 11/22/2006 9:16:21 PM
