How to detect if other processes are still alive

Hi there,

I am parallelizing a package (written in Fortran)
which has lots of check points and the program
may stop if anything goes wrong.

That kind of feature (check, then stop) becomes a
problem since one process may stop while others
are still running and might wait to receive data
from the dead process somewhere.  I wonder if it
is possible to detect whether the corresponding process
is alive before sending/receiving data?

Any idea or suggestion is appreciated.

Reggie

Reply reggie (1) 2/4/2005 3:05:56 AM

Can you arrange it so that the stopped process calls MPI_ABORT?  Depending
on your MPI implementation, that might just 'do the right thing'.

Otherwise, I don't think there is any portable way.  The best bet would
probably be something like a watchdog timer using TCP sockets.  Or maybe
you could do this externally: have a daemon process on some frontend node
that periodically checks whether all of the job processes are still
running.
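
The external-daemon idea could be sketched roughly as below: a small
launcher starts the job processes itself, polls them, and stops the rest
as soon as one of them exits with an error.  This is only a sketch in
Python, not part of the Fortran package discussed here; the name
`supervise` and its parameters are hypothetical, and it assumes the
workers can be started as ordinary child processes rather than through
an MPI launcher.

```python
import subprocess
import time


def supervise(commands, poll_interval=0.1):
    """Start every command; if one exits with a nonzero code, stop the rest.

    `commands` is a list of argument lists (hypothetical worker commands).
    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            codes = [p.poll() for p in procs]    # None means "still running"
            if any(c not in (None, 0) for c in codes):
                break                            # at least one worker failed
            if all(c is not None for c in codes):
                break                            # all workers finished cleanly
            time.sleep(poll_interval)
    finally:
        # Stop whatever is still running, then collect every exit code.
        for p in procs:
            if p.poll() is None:
                p.terminate()
        for p in procs:
            p.wait()
    return [p.returncode for p in procs]
```

Polling child handles with `Popen.poll()` avoids sending signals to
arbitrary process IDs, so the same loop works on both Linux and Windows.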

HTH
Ian McCulloch

Reply Ian 2/4/2005 4:49:04 AM


Ian,

Thank you very much for the message.

Although I can call MPI_ABORT in the process to
be stopped, I don't think the other processes would
know about that action.

I am considering other options similar to what you
recommended.  However, I want to check with other
MPI users about the possibility of using MPI calls,
since I'm a rookie in MPI.

Best regards,

Reggie

Reply reggie 2/4/2005 5:36:06 PM

A)
Assuming that the processes can be synchronized,
then you could change the check routines to calculate
an ok flag via MPI_Allreduce and let them only
continue if you have an ok from all parties.

B)
If that synchronization is not an option you could let
a failing process issue a message of death to all others.
Before trying to communicate you would have to check
for the absence of any death messages.

C)
If your targeted platforms allow, you could use
an MPI window (see one-sided communication in the
MPI-2 standard) to hold the "still moving" flags.
Unfortunately one-sided communication is not a
feature that you can really rely on to be present
in every implementation.

Or some variation/combination of the schemes.
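
Scheme A amounts to making every check collective: instead of stopping
on its own, a process reports its local result and all processes decide
together.  A minimal sketch, assuming Python with mpi4py rather than the
Fortran of the package in question; `checkpoint` and `main` are
hypothetical names, and `comm` only needs an mpi4py-style `allreduce`
(which sums by default).  It does not help if a process crashes outright
before it reaches the check.

```python
def checkpoint(comm, local_ok):
    """Collective check: return True only if every rank reports success.

    `comm` is assumed to behave like an mpi4py communicator, whose
    lowercase allreduce() sums the contributions by default.
    Every rank must call this at the same point in the program.
    """
    failures = comm.allreduce(0 if local_ok else 1)  # total failed ranks
    return failures == 0


def main():
    # Assumes mpi4py is installed; start with e.g. `mpiexec -n 4 python script.py`.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    local_ok = (rank != 1)   # pretend rank 1 fails its local check
    if not checkpoint(comm, local_ok):
        # All ranks reach this branch together, so nobody is left waiting.
        if rank == 0:
            print("a check failed somewhere; all ranks stop")
        return 1
    return 0
```

To try it under an MPI launcher, call `main()` at the bottom of the
script.  Because every rank gets the same answer from `checkpoint`, they
can all leave through the same code path and finalize normally.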

Reply Georg 2/4/2005 5:45:15 PM

Thanks for the info, Georg.

A)  I don't think using MPI_Allreduce would help.  Like MPI_Reduce,
MPI_Allreduce needs some form of send/receive underneath.  If any
process goes wrong, some process(es) will keep waiting.

B)  It seems that this also needs a send/receive mechanism.  Again, if any
process goes wrong, some process(es) will keep waiting.

C)  I will try to study this.  I hope it is an option for me.

So far the closest solution that I have found is FT-MPI.  According to
an article by some scientists from cs.utk.edu, its spawn and notify
service "can notify other interested processes when a failure or exit
of the invoked process occurs".  FT-MPI has a daemon running and uses
additional libraries.  The FT-MPI libraries might be a problem for me since
I have to deal with multiple platforms.

Best regards,

Reggie

Reply reggie 2/5/2005 1:19:12 AM

reggie@zonatech.com wrote:
> Although I can call MPI_ABORT in the process to
> be stopped, I don't think other processes would
> know that action.

A correct MPI implementation will handle MPI_Abort() in a way that will
terminate all other processes in the given communicator. If you use
MPI_COMM_WORLD as the argument, it will shut down your complete
application (I assume you don't use dynamic process creation).

-- 
Joachim - reply to joachim at domain ccrl-nece dot de

Opinion expressed is personal and does not constitute
an opinion or statement of NEC Laboratories.
Reply Joachim 2/8/2005 9:52:49 AM

Joachim,

Thanks a lot for the message.

I did try to use MPI_ABORT.  But on Windows the other
processes, which do not call MPI_ABORT themselves, just
hang instead of being stopped.  MPI_ABORT does work
as expected on Linux.

Is this because of the difference between the platforms?

Reggie

Reply reggie 2/11/2005 2:50:42 AM

reggie@zonatech.com wrote:
> I did try to use MPI_ABORT.  But other processes
> without calling MPI_ABORT just hang without being
> stopped on Windows.  MPI_ABORT does work
> as expected on linux.
> 
> Is this because the difference of the platforms?

The MPI implementation (which one is it?) that you are using under 
Windows does not behave correctly. You should verify this with a minimal 
example and then ask the vendor/authors for help.
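
A minimal example of that kind might look like the sketch below: one
rank calls Abort while the others idle without communicating, and any
rank that survives the waiting period reports it.  This assumes Python
with mpi4py purely for illustration (the application in question is
Fortran); `abort_test`, `main`, and the default values are hypothetical.

```python
import time


def abort_test(comm, failing_rank=0, wait_seconds=30.0):
    """One rank calls Abort; the others idle and should be stopped by it.

    `comm` is assumed to behave like an mpi4py communicator.
    Returns this rank's number only if it outlives the waiting period,
    i.e. only if the abort did not terminate it.
    """
    rank = comm.Get_rank()
    if rank == failing_rank:
        comm.Abort(1)          # expected to bring down every rank in comm
    deadline = time.monotonic() + wait_seconds
    while time.monotonic() < deadline:
        time.sleep(0.05)       # idle without any MPI communication
    return rank


def main():
    # Assumes mpi4py is installed; start with e.g. `mpiexec -n 4 python script.py`.
    from mpi4py import MPI

    survivor = abort_test(MPI.COMM_WORLD)
    print("rank", survivor, "was not stopped by Abort", flush=True)
```

To try it under an MPI launcher, call `main()` at the bottom of the
script.  If any rank prints its message, the abort did not propagate on
that platform, which is the behaviour worth reporting to the vendor or
authors.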

Reply Joachim 2/11/2005 8:44:27 AM

Joachim Worringen wrote:
> The MPI implementation (which one is it?) that you are using under
> Windows does not behave correctly. You should verify this with a
> minimal example and then ask the vendor/authors for help.

Joachim,

Thank you for the message.

I downloaded the executable file from the official MPI
website and built MPICH 1.2.5 on my Windows XP Pro machine.

The way I built the application was to compile and link from the
command line.  (I got the flags
/nologo /MT /I"C:\Program Files\MPICH\SDK\include" /D WIN32 /D NDEBUG /D _CONSOLE /D _MBCS
from the User's Guide and the sample examples for Visual Fortran.)

If anyone has been able to use MPI_ABORT to stop all processes
on a Windows machine, that may imply that I did not build
MPI properly or that I used the wrong command options.

Best regards,

Reggie

Reply reggie 2/18/2005 4:01:01 AM
