Methods for predicting/estimating runtime

We have a heterogeneous, distributed computing (cluster/grid) environment
to which users submit single-task and parallel (MPI) jobs.  The jobs
themselves may be multiprocessor jobs, and they can come from many
different applications (Ansys, Fluent, etc.).

We are trying to come up with a method to predict or estimate the amount of
time a job may run for prior to the job being dispatched for execution.
This could be based on historical data (past runs), if necessary.  Such
a prediction would help in dispatching jobs to appropriate queues (which
have different runtime constraints), especially for classes of urgent jobs.
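
For concreteness, here is a rough sketch (in Python) of the kind of thing
we have in mind.  The log fields, queue limits, and padding factor below
are invented purely for illustration and are not tied to our actual setup:

from statistics import median

# (user, application) -> list of observed wall-clock times in seconds
history = {
    ("alice", "fluent"): [3600, 4100, 3900],
    ("bob", "ansys"):    [300, 280, 450],
}

# queue name -> wall-clock limit in seconds, shortest first
queues = [("short", 1800), ("medium", 14400), ("long", 86400)]

def estimate_runtime(user, app, padding=1.5):
    """Median of past runs for this (user, app), padded; None if no history."""
    runs = history.get((user, app))
    if not runs:
        return None
    return median(runs) * padding

def pick_queue(user, app):
    """Smallest queue whose limit covers the padded estimate."""
    est = estimate_runtime(user, app)
    if est is None:
        return "long"              # no history yet: be conservative
    for name, limit in queues:
        if est <= limit:
            return name
    return "long"

print(pick_queue("alice", "fluent"))   # -> medium

Anything smarter (per-application regression on job parameters, confidence
intervals, etc.) would presumably sit behind the same kind of interface.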

What research and literature would one recommend I examine to help us in
this task?

--john
jsassojr (John Sasso), comp.parallel, 6/30/2003

Nick Maclaren wrote:
> 
> In article <gZ8La.2783$4J6.272@twister.nyroc.rr.com>,
> John Sasso <jsassojr@nycap.rr.com> wrote:
> > ...
> >We are trying to come up with a method to predict or estimate the amount of
> >time a job may run for prior to the job being dispatched for execution.
> >This could be based on historical data (past runs), if necessary.
> > ...
> >What research and literature would one recommend I examine to help us in
> >this task?
> 
> You may have a job finding it, but look at the scheduler work of the 1970s.
> The executive summary is "forget it."

Maybe you're referring to work where the performance is determined on
the fly, or based on static analysis of the code--e.g. predicting
performance on a platform where the code has never (or rarely) executed
before.  The operative phrase, above, is "This could be based on
historical data (past runs) if necessary."

> The ONLY solution to this is to demand that the user supplies a
> limit, in one way or another - and even that is a partial solution.

And a user comes up with this time limit by...running the code a few
times and seeing how long it usually takes.  Yes, the user could
theoretically adjust the limits each time based on some sort of pre-
(and/or ill-) conceived non-constant performance model, but barring the
remote chance that they do and that that model is at all accurate, it
would seem that automating users' standard simple recording/notating
process could at least potentially remove some human error.  (Of course,
providing tools and monitoring to help them *build* such performance
models is a different issue.)
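
A minimal sketch of that kind of automation, in Python.  The record format
and the padding rule (largest observed run plus 25%) are assumptions for
illustration, not anyone's actual tooling:

def suggest_limit(past_runtimes, margin=1.25, default=86400):
    """Suggest a wall-clock limit (seconds) from past runs of this job."""
    if not past_runtimes:
        return default                      # nothing recorded yet
    return int(max(past_runtimes) * margin)

def record_run(log, job_key, runtime):
    """Append an observed runtime (seconds) to a per-job history."""
    log.setdefault(job_key, []).append(runtime)

log = {}
for t in (3550, 3700, 3620):                # three observed runs
    record_run(log, ("alice", "fluent", "wing_mesh"), t)

print(suggest_limit(log[("alice", "fluent", "wing_mesh")]))   # -> 4625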

> You can do better only if you can categorise your jobs and collect
> a fair amount of data on each (i.e. the 'commercial' scheduling
> problem).  The 'research' scheduling problem is insoluble, both
> theoretically and practically.

I don't know your definition of 'research' and 'commercial', but I'd
guess that an IBM SP at Argonne National Labs, an IBM SP at Cornell
Theory Center, an Intel Paragon at SDSC, and/or 3 SGI Origins at NASA
Ames would more likely run 'research' jobs, in which case see 2.1 and
2.2 of
http://www.nas.nasa.gov/Research/Reports/Techreports/2002/PDF/nas-02-003.pdf

Even if the distributions have "no means", the real question is whether
or not there are better predictions than either (a) none at all, or (b)
user limits.  The report above suggests that the answer to both might
well be "yes".
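
Even when the distribution has no useful mean, candidate predictors can
still be compared on observed data with robust measures such as the median
error.  A toy comparison along those lines, in Python, with invented
numbers:

from statistics import median

actual     = [100, 120, 110, 4000, 130, 115]   # observed runtimes, seconds
user_limit = 7200                              # the limit the user asked for

def rolling_median_estimates(runs):
    """Predict each run as the median of the runs seen before it."""
    preds = []
    for i in range(len(runs)):
        preds.append(median(runs[:i]) if i > 0 else user_limit)
    return preds

def median_abs_error(preds, truth):
    return median(abs(p - t) for p, t in zip(preds, truth))

hist_preds  = rolling_median_estimates(actual)
limit_preds = [user_limit] * len(actual)

print(median_abs_error(hist_preds, actual))    # history-based estimate
print(median_abs_error(limit_preds, actual))   # flat user limit

Whether the history-based column actually wins on a real workload is, of
course, exactly the empirical question the report addresses.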

The original poster might also check out the Grid Forum Performance Area
(http://www-didc.lbl.gov/GGF-PERF/AREA/), which is at least addressing
ways to monitor performance if not predict it, and might also try
posting the question to the comp.distributed newsgroup, which is
focussed more on grid (a.k.a. distributed resource
collectives/cooperatives) issues.

-- Dave
-----------------------------------------------------------------
David C. DiNucci  Elepar                 Tools for portable grid,
dave@elepar.com   http://www.elepar.com  parallel, distributed, &
503-439-9431      Beaverton, OR 97006    peer-to-peer computing

dave239 (David C. DiNucci), 7/2/2003
In article <3F02640B.79234741@elepar.com>,
David C DiNucci <dave@elepar.com> writes:
|> > You may have a job finding it, but look at the scheduler work of the 1970s.
|> > The executive summary is "forget it."
|> 
|> Maybe you're referring to work where the performance is determined on
|> the fly, or based on static analysis of the code--e.g. predicting
|> performance on a platform where the code has never (or rarely) executed
|> before.  The operative phrase, above, is "This could be based on
|> historical data (past runs) if necessary."

No, I am not.  I saw that sentence, and my investigations in the
1970s did use such data.

|> > The ONLY solution to this is to demand that the user supplies a
|> > limit, in one way or another - and even that is a partial solution.
|> 
|> And a user comes up with this time limit by...running the code a few
|> times and seeing how long it usually takes.  Yes, the user could
|> theoretically adjust the limits each time based on some sort of pre-
|> (and/or ill-) conceived non-constant performance model, but barring the
|> remote chance that they do and that that model is at all accurate, it
|> would seem that automating users' standard simple recording/notating
|> process could at least potentially remove some human error.  (Of course,
|> providing tools and monitoring to help them *build* such performance
|> models is a different issue.)

Yes, it would seem so.  It seemed so in the 1970s, and was shown
not to be so.  And, for reasons I shall try to explain, it will
probably be forever so.

|> > You can do better only if you can categorise your jobs and collect
|> > a fair amount of data on each (i.e. the 'commercial' scheduling
|> > problem).  The 'research' scheduling problem is insoluble, both
|> > theoretically and practically.
|> 
|> I don't know your definition of 'research' and 'commercial', but I'd
|> guess that an IBM SP at Argonne National Labs, an IBM SP at Cornell
|> Theory Center, an Intel Paragon at SDSC, and/or 3 SGI Origins at NASA
|> Ames would more likely run 'research' jobs, in which case see 2.1 and
|> 2.2 of
|> http://www.nas.nasa.gov/Research/Reports/Techreports/2002/PDF/nas-02-003.pdf

No.  In THIS context, they are closer to 'commercial'.

Part of the reason is that each 'user' tends to be tackling one
particular task for a long time.  This is the norm in the commercial
world, a typical scenario in development and that sort of lab., but
much rarer in normal academic research computing.

Those papers are getting good predictability.  When I looked at
what users were doing, I was often predicting only 10% of the
variance.  That is hopeless.  And my techniques were pretty
similar to theirs - it was the type of workload that differed.
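
By "predicting X% of the variance" I mean, roughly, the usual R-squared;
a sketch of the computation, in Python, with invented numbers:

def r_squared(actual, predicted):
    """1 - residual variance / total variance; predicting only 10% of
    the variance means this comes out at about 0.1."""
    mean_a = sum(actual) / len(actual)
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot

actual    = [200, 250, 3000, 220, 5000, 240]    # invented runtimes, seconds
predicted = [400, 400, 1000, 400, 1500, 400]    # a weak predictor

print(round(r_squared(actual, predicted), 2))   # -> 0.22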

Part of the reason that I could get nowhere was the fact that
many users ran a variable mix of jobs (i.e. multiple parallel
analyses), where the job parameters provided no indication of
which one it was.  This, I understand, is relatively rare at
those sites.

Another part is that it is common for a user to run only a few
jobs on one particular analysis, go away and think or write up,
and then run another apparently identical analysis with very
different properties.

Another example is a researcher who is sometimes running
'production' jobs and sometimes investigating errors with hacked
code or data - and with no external evidence as to which!  When
that is the NORM, you have a problem, and my experience is that
it is the norm for many researchers in academia.

|> Even if the distributions have "no means", the real question is whether
|> or not there are better predictions than either (a) none at all, or (b)
|> user limits.  The report above suggests that the answer to both might
|> well be "yes".

The answer to the former is "obviously yes, unless there is no
association between the parameters of a job and its running time."

The report above indicates that the answer to the latter might be
"yes, IN THAT CONTEXT".  What I am saying is that, in a more typical
research environment (which those are definitely NOT), the answer
is probably "no".

The effectiveness of user limits will also depend on the
intelligence of the typical user, of course.  They are pretty
clued up here, which is one reason that user limits work well.
Actually, what they do is to adapt the running time to the
'time slice' provided by the scheduler, and run their jobs in
sections.
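
In practice that amounts to a checkpoint-and-resubmit loop, something like
the sketch below (Python; the step and checkpoint functions are
placeholders, not any real site's API):

import time

TIME_SLICE    = 4 * 3600     # seconds granted by the scheduler
SAFETY_MARGIN = 10 * 60      # stop early, leaving time to checkpoint

def run_section(state, step, save_checkpoint):
    """Advance the computation until the slice is nearly used up, then
    checkpoint.  The caller resubmits the job if it is not finished."""
    start = time.monotonic()
    while not state["done"]:
        state = step(state)
        if time.monotonic() - start > TIME_SLICE - SAFETY_MARGIN:
            break
    save_checkpoint(state)
    return state["done"]

if __name__ == "__main__":
    demo_state = {"iterations_left": 5, "done": False}
    def demo_step(s):
        s["iterations_left"] -= 1
        s["done"] = (s["iterations_left"] == 0)
        return s
    print("finished:", run_section(demo_state, demo_step, lambda s: None))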


Regards,
Nick Maclaren.

nmm1 (Nick Maclaren), 7/2/2003
Nick Maclaren wrote:
> 
> In article <3F02640B.79234741@elepar.com>,
> David C DiNucci <dave@elepar.com> writes:
> |> Nick Maclaren wrote:
> |> > You may have a job finding it, but look at the scheduler work of the 1970s.
> |> > The executive summary is "forget it."
....

> Yes, it would seem so.  It seemed so in the 1970s, and was shown
> not to be so.  And, for reasons I shall try to explain, it will
> probably be forever so.

So, if I understand so far, it was "shown not to be so" by some as-yet
un-cited published studies in the '70s, and your personal memories of
unpublished studies you made back then.

Even assuming that all of those studies showed exactly what you say they
do (and, given what you've said so far, I would find it hard to either
verify or dismiss that assumption), I would further have to believe that
the results of those studies still hold on some significant set of
today's job mixes (30 years later), and that the job mix represented by
the questioner largely intersected that set, before I could come to your
"forget it" conclusion (or "the ONLY solution is...user-supplied time
limits" conclusion).

> |> > You can do better only if you can categorise your jobs and collect
> |> > a fair amount of data on each (i.e. the 'commercial' scheduling
> |> > problem).  The 'research' scheduling problem is insoluble, both
> |> > theoretically and practically.
> |>
> |> I don't know your definition of 'research' and 'commercial', but I'd
> |> guess that an IBM SP at Argonne National Labs, an IBM SP at Cornell
> |> Theory Center, an Intel Paragon at SDSC, and/or 3 SGI Origins at NASA
> |> Ames would more likely run 'research' jobs, in which case see 2.1 and
> |> 2.2 of
> |> http://www.nas.nasa.gov/Research/Reports/Techreports/2002/PDF/nas-02-003.pdf
> 
> No.  In THIS context, they are closer to 'commercial'.
> 
> Part of the reason is that each 'user' tends to be tackling one
> particular task for a long time.  This is the norm in the commercial
> world, a typical scenario in development and that sort of lab., but
> much rarer in normal academic research computing.

You seem to concur, then, that there may be a good chance of
accomplishing what the original poster requested both in commercial
settings and in a number of research settings, where I use the term
"research settings" to mean "on machines at government research labs
and/or university research support centers", ANL and NASA Ames being
examples of the former, CTC and SDSC being examples of the latter.

In fact, seeing as how the original question made no statement of
whether it was a 'research' or 'commercial' setting, one might even
guess from the applications mentioned ("Ansys, Fluent, etc.") that the
job mix mentioned was more 'commercial' in nature, using your definition
of 'commercial'.  If so, then we apparently also concur that predictions
such as those made in the paper I cited might be useful to help address
the original question.

> Those papers are getting good predictability.  When I looked at
> what users were doing, I was often predicting only 10% of the
> variance.  That is hopeless.  And my techniques were pretty
> similar to theirs - it was the type of workload that differed.

> |> Even if the distributions have "no means", the real question is whether
> |> or not there are better predictions than either (a) none at all, or (b)
> |> user limits.  The report above suggests that the answer to both might
> |> well be "yes".
> 
> ...
> 
> The report above indicates that the answer to the latter might be
> "yes, IN THAT CONTEXT".  What I am saying is that, in a more typical
> research environment (which those are definitely NOT), the answer
> is probably "no".

To come to the same conclusion, I would need both some sort of
definition of your term "typical research environment" that clearly
delineates it from those settings (on grounds other than observed
workload, or it becomes a tautology), and some sort of reason to
believe that your observations of workloads in typical research
environments 30 years ago still hold.  While your explanations of
your observations appear believable for the time, I don't see that
they explain why "it will probably be forever so".  People's use of
machines can change dramatically depending on the cost (in money) of
usage, speed, tools, ease of use, and all sorts of other factors,
all of which have changed considerably over 30 years.

-- Dave

dave239 (David C. DiNucci), 7/7/2003