Hi there,
I am parallelizing a package (written in Fortran)
which has lots of check points, and the program
may stop if anything goes wrong.
That kind of feature (check, then stop) becomes a
problem, since one process may stop while others
are still running and might wait to receive data
from the dead process somewhere. I wonder if it
is possible to detect whether the corresponding process
is alive before sending/receiving data?
Any ideas or suggestions are appreciated.
Reggie
reggie, 2/4/2005 3:05:56 AM

Can you arrange it so that the stopped process calls MPI_ABORT? Depending
on your MPI implementation, that might just 'do the right thing'.
Otherwise, I don't think there is any portable way. The best bet would
probably be something like a watchdog timer using TCP sockets. Or maybe
you could do this externally: have a daemon process on some frontend node
that periodically checks to see if all of the job processes are still
running?
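A minimal sketch of the external-watchdog idea, in Python rather than Fortran and without any MPI calls. It assumes the watchdog itself launched the job processes on the same machine, so it can poll them directly; the names `watch` and `spawn` are made up for illustration, and the workers are stand-in one-liners, not a real parallel job.

```python
import subprocess
import sys
import time


def watch(procs, poll_interval=0.1):
    """Poll worker processes; if one exits with a nonzero code, kill the rest.

    Returns the list of exit codes once every process has finished.
    """
    while True:
        codes = [p.poll() for p in procs]       # None means "still running"
        if any(c not in (None, 0) for c in codes):
            for p in procs:
                if p.poll() is None:
                    p.kill()                    # stop survivors so they cannot hang
            return [p.wait() for p in procs]
        if all(c is not None for c in codes):
            return codes                        # everybody finished cleanly
        time.sleep(poll_interval)


def spawn(code):
    """Start a stand-in worker: a Python one-liner run in a child process."""
    return subprocess.Popen([sys.executable, "-c", code])


if __name__ == "__main__":
    workers = [
        spawn("import sys; sys.exit(3)"),       # worker that fails a check and stops
        spawn("import time; time.sleep(60)"),   # worker that would keep waiting
    ]
    print(watch(workers))
```

On a cluster the polling would have to go through whatever the job launcher or batch system offers, so treat this only as the shape of the loop: poll, detect the first abnormal exit, stop everything else.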
HTH
Ian McCulloch
Ian, 2/4/2005 4:49:04 AM

Ian McCulloch wrote:
> Can you arrange it so that the stopped process calls MPI_ABORT? Depending
> on your MPI implementation, that might just 'do the right thing'.
>
> Otherwise, I don't think there is any portable way. The best bet would
> probably be something like a watchdog timer using TCP sockets. Or maybe
> you could do this externally, have a daemon process on some frontend node
> that periodically checks to see if all of the job processes are still
> running?
Ian,
Thank you very much for the message.
Although I can call MPI_ABORT in the process to
be stopped, I don't think the other processes would
know about that action.
I am considering other options similar to what you
recommended. However, I want to check with other
MPI users about the possibility of using MPI calls,
since I'm a rookie in MPI.
Best regards,
Reggie
reggie, 2/4/2005 5:36:06 PM

A)
Assuming that the processes can be synchronized,
then you could change the check routines to calculate
an ok flag via MPI_Allreduce and let them only
continue if you have an ok from all parties.
B)
If that synchronization is not an option you could let
a failing process issue a message of death to all others.
Before trying to communicate you would have to check
for the absence of any death messages.
C)
If your targeted platforms allow, you could use
an MPI window (see one-sided communication in the
MPI-2 standard) to hold the "still moving" flags.
Unfortunately one-sided communication is not a
feature that you can really rely on to be present
in every implementation.
Or some variation/combination of the schemes.
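Scheme A can be sketched without MPI as a small thread-based simulation in Python. This is not MPI code: `AllreduceMin`, `run` and `failing_rank` are made-up names, and threads stand in for processes. With MPI the agreement step would be a single MPI_Allreduce on an integer flag with MPI_MIN. The point it illustrates is that a failing rank must still take part in the collective and only stop afterwards, together with everyone else.

```python
import threading


class AllreduceMin:
    """Thread-based stand-in for an allreduce with a MIN operation."""

    def __init__(self, n):
        self._values = [None] * n
        self._barrier = threading.Barrier(n)

    def __call__(self, rank, value):
        self._values[rank] = value
        self._barrier.wait()            # every rank has contributed its flag
        result = min(self._values)
        self._barrier.wait()            # every rank has read the result
        return result


def run(n, failing_rank):
    """Run n workers; the worker numbered `failing_rank` fails its local check."""
    allreduce = AllreduceMin(n)
    log = [None] * n

    def worker(rank):
        local_ok = 0 if rank == failing_rank else 1   # outcome of the local check
        all_ok = allreduce(rank, local_ok)            # collective agreement step
        # Every rank sees the same flag, so all of them continue or all stop.
        log[rank] = "continue" if all_ok else "stop"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run(4, failing_rank=2))      # every rank decides to stop
    print(run(4, failing_rank=None))   # every rank decides to continue
```

If a rank dies before reaching the check, the others would block in the collective, so this only helps when the check routines are changed to report a failure instead of stopping on the spot.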
Georg, 2/4/2005 5:45:15 PM

Thanks for the info, Georg.
A) I don't think using MPI_Allreduce would help. Like MPI_Reduce,
MPI_Allreduce needs to have some form of send/receive involved. If any
process goes wrong, some process(es) will keep waiting.
B) It seems that this also needs a send/receive mechanism. Again, if any
process goes wrong, some process(es) will keep waiting.
C) I will try to study this. I hope it is an option for me.
So far the closest solution that I have found is FT-MPI. According to
an article by some scientists from cs.utk.edu, its spawn and notify
service "can notify other interested processes when a failure or exit
of the invoked process occurs". FT-MPI has a daemon running and uses
additional libraries. The FT-MPI libraries might be a problem for me, since
I have to deal with multiple platforms.
Best regards,
Reggie
reggie, 2/5/2005 1:19:12 AM

reggie@zonatech.com wrote:
> Although I can call MPI_ABORT in the process to
> be stopped, I don't think other processes would
> know that action.
A correct MPI implementation will handle MPI_Abort() in a way that will
terminate all other processes in the given communicator. If you use
MPI_COMM_WORLD as an argument, it will shut down your complete
application (I assume you don't use dynamic process creation).
--
Joachim - reply to joachim at domain ccrl-nece dot de
Opinion expressed is personal and does not constitute
an opinion or statement of NEC Laboratories.
Joachim, 2/8/2005 9:52:49 AM

Joachim,
Thanks a lot for the message.
I did try to use MPI_ABORT. But on Windows, the other
processes, which do not call MPI_ABORT, just hang
instead of being stopped. MPI_ABORT does work
as expected on Linux.
Is this because of the difference between the platforms?
Reggie
reggie, 2/11/2005 2:50:42 AM

reggie@zonatech.com wrote:
> I did try to use MPI_ABORT. But other processes
> without calling MPI_ABORT just hang without being
> stopped on Windows. MPI_ABORT does work
> as expected on linux.
>
> Is this because the difference of the platforms?
The MPI implementation (which one is it?) that you are using under
Windows does not behave correctly. You should verify this with a minimal
example and then ask the vendor/authors for help.
Joachim, 2/11/2005 8:44:27 AM

Joachim Worringen wrote:
> The MPI implementation (which one is it?) that you are using under
> Windows does not behave correctly. You should verify this with a minimal
> example and then ask the vendor/authors for help.
Joachim,
Thank you for the message.
I downloaded the executable file from the official MPI
website and built MPICH 1.2.5 on my Windows XP Pro machine.
I built the application by compiling and linking from the
command line. (I got the flags
/nologo /MT /I"C:\Program Files\MPICH\SDK\include" /D WIN32 /D NDEBUG /D _CONSOLE /D _MBCS
from the User's Guide and the sample examples for Visual Fortran.)
If anyone can use MPI_ABORT to stop all processes
on a Windows machine, that may imply that I did not build
MPI properly or that I used the wrong command options.
Best regards,
Reggie
reggie, 2/18/2005 4:01:01 AM