Hi all,
I have a two-node cluster running Linux.
Each node has a full OS.
Both nodes have the same username.
ssh has been set up so that the Master can log into the same username on the
slave without using a password.
The machine file has been edited to include the slave node (erik).
The /etc/hosts file was updated to include the IP addr/name of the slave.
Under the username in both nodes, there is a subdirectory that contains the
mpich install and I'm trying to run the test program called "cpi".
I have a window open on the slave and I am running "top" on it.
When I run the mpirun command, I see cpi starting on the slave (in top), and
then I get the following error message:
$ ../../bin/mpirun -np 2 cpi
rm_1400: p4_error: rm_start: net_conn_to_listener failed: 33408
p0_12586: p4_error: Child process exited while making connection to remote
process on erik: 0
p0_12586: (10.472656) net_send: could not write to fd=4, errno = 32
Can anyone give me a pointer to some information that would help me figure
out what the problem is?
thanks
Saville
3/5/2008 1:02:31 AM
On Wed, 05 Mar 2008 02:02:31 +0100, Saville <saville@comcast.net> wrote:
> ssh has been set up so that the Master can log into the same username on the
> slave without using a password.
That should be possible in both directions, just to be safe.
>
> The machine file has been edited to include the slave node (erik).
>
> The /etc/hosts file was updated to include the IP addr/name of the slave.
On both nodes? Consider setting up a DNS server.
>
> Under the username in both nodes, there is a subdirectory that contains the
> mpich install and I'm trying to run the test program called "cpi".
Why don't you use NFS to make absolutely sure you have identical software?
(Same machine file etc.)
>
> I have a window open on the slave and I am running "top" on it.
>
> When I run the mpirun command, I see cpi starting on the slave (in top), and
> then I get the following error message:
>
>
> $ ../../bin/mpirun -np 2 cpi
> rm_1400: p4_error: rm_start: net_conn_to_listener failed: 33408
> p0_12586: p4_error: Child process exited while making connection to remote
> process on erik: 0
> p0_12586: (10.472656) net_send: could not write to fd=4, errno = 32
>
> Can anyone give me a pointer to some information that would help me figure
> out what the problem is?
>
> thanks
>
>
What MPI are you using and what device does it use?
If you configured and compiled it yourself you will
have to look into the respective log files to find
out which config was chosen.
The "p4" stuff indicates MPICH or MPICH2.
Maybe it uses ssh to start up the program on the other
node - but maybe it uses a set of daemons to do it, and
those must be started before mpirun will work.
RTFM and follow the instructions!
What Linux do you use? Does it come with a restrictive
firewall software that prevents "erik" from being a
TCP server? My opinion is that firewall software should
not be present on a cluster...
Georg
3/6/2008 1:57:20 PM
Georg Bisseling wrote:
> On Wed, 05 Mar 2008 02:02:31 +0100, Saville <saville@comcast.net> wrote:
>
>> ssh has been set up so that the Master can log into the same username
>> on the slave without using a password.
>
> That should be possible in both directions, just to be safe.
It is. I can ssh from one node to the other.
>
>
>>
>> The machine file has been edited to include the slave node (erik).
>>
>> The /etc/hosts file was updated to include the IP addr/name of the
>> slave.
>
> On both nodes? Consider setting up a DNS server.
Yes on both nodes.
>
>>
>> Under the username in both nodes, there is a subdirectory that contains
>> the
>> mpich install and I'm trying to run the test program called "cpi".
>
> Why don't you use NFS to make absolutely sure you have identical
> software? (Same machine file etc.)
But is this required? I can issue the following command on the Master and it
works (csc10 is the Master and csc11 is the slave):
ssh csc11 /home/flashman/mpich-1.2.7p1/examples/basic/cpi
This works just fine.
>
>>
>> I have a window open on the slave and I am running "top" on it.
>>
>> When I run the mpirun command, I see cpi starting on the slave (in top),
>> and then I get the following error message:
>>
>>
>> $ ../../bin/mpirun -np 2 cpi
>> rm_1400: p4_error: rm_start: net_conn_to_listener failed: 33408
>> p0_12586: p4_error: Child process exited while making connection to
>> remote process on erik: 0
>> p0_12586: (10.472656) net_send: could not write to fd=4, errno = 32
>>
>> Can anyone give me a pointer to some information that would help me
>> figure out what the problem is?
>>
>> thanks
>>
>>
>
> What MPI are you using and what device does it use?
mpich-1.2.7p1
I take the default device which seems to be ch_p4. That's what it says
inside of mpirun:
DEFAULT_DEVICE=ch_p4
RSHCOMMAND="ssh"
> If you configured and compiled it yourself you will
> have to look into the respective log files to find
> out which config was chosen.
>
> The "p4" stuff indicates MPICH or MPICH2.
>
> Maybe it uses ssh to start up the program on the other
> node - but maybe it uses a set of daemons to do it, and
> those must be started before mpirun will work.
>
> RTFM and follow the instructions!
I did and I did.
>
> What Linux do you use?
Fedora Core 8 on both nodes.
> Does it come with a restrictive
> firewall software that prevents "erik" from being a
> TCP server? My opinion is that firewall software should
> not be present on a cluster...
I selected no firewall.
I really would like to know what the error message means or where it comes
from:
rm_23914: p4_error: rm_start: net_conn_to_listener failed: 41488
p0_30790: p4_error: Child process exited while making connection to remote
process on csc11: 0
p0_30790: (11.355131) net_send: could not write to fd=4, errno = 32
thanks
Saville
3/8/2008 2:20:07 AM
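On the last line of that output: "errno = 32" is an ordinary OS error number, so it can be looked up directly on the machine that printed it. A minimal sketch using only Python's standard library; the helper name describe_errno is made up for illustration and is not part of MPICH:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an OS error number on this platform."""
    # errno.errorcode maps numbers to names such as 'EPIPE'; unknown codes get '?'.
    name = errno.errorcode.get(code, "?")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 32 is the value reported by net_send in the output above.
    print(describe_errno(32))
```

On Linux, 32 is EPIPE ("Broken pipe"): p0 wrote to a socket whose other end had already gone away, which fits the preceding line saying the child process exited. That makes the net_send line look like a follow-on symptom, and the first line (net_conn_to_listener failed) the one worth chasing.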
On 08.03.2008 at 03:20, Saville <saville@comcast.net> wrote:
Excuse my long list of gotchas. Seems you did everything right.
I never ran into error messages exactly like yours.
To make the p4 device work it is required to have the fully
qualified host names in the machine file. I experienced weird
effects otherwise: incomprehensible mapping of processes to nodes.
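A quick way to spot short names in a machine file is sketched below. It assumes the usual one-host-per-line layout with an optional ":ncpus" suffix and "#" comments; the function name unqualified_hosts and the example.org domain are invented for the example, and a name is treated as qualified simply if it contains a dot.

```python
def unqualified_hosts(machine_file_text):
    """Return host names from a machine file that contain no domain part."""
    short_names = []
    for line in machine_file_text.splitlines():
        entry = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not entry:
            continue
        host = entry.split(":", 1)[0]           # drop an optional ':ncpus' suffix
        if "." not in host:
            short_names.append(host)
    return short_names


if __name__ == "__main__":
    # Hypothetical machine file contents; example.org is a placeholder domain.
    sample = """\
# master and slave
csc10.example.org
csc11:2
"""
    print(unqualified_hosts(sample))
```

Any name it reports would be a candidate for replacing with the fully qualified form on both nodes.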
If you want to avoid NFS then you will have to take great care
that the processes on both ranks will start in the same working
directory using a compatible (not necessarily identical) set
of environment variables etc. NFS with shared config files
just makes that much easier. You may call it superstition,
but I am quite sure that your setup is not covered by Argonne's
regular tests.
> I really would like to know what the error message means or where it comes
> from:
>
> rm_23914: p4_error: rm_start: net_conn_to_listener failed: 41488
> p0_30790: p4_error: Child process exited while making connection to remote
> process on csc11: 0
> p0_30790: (11.355131) net_send: could not write to fd=4, errno = 32
Whether you like it or not, the best place to look for an explanation
might be the source of the p4 device and the functions rm_start
and net_conn_to_listener. And in the logs of the remote machine.
Did you have a look into /var/log/messages?
But the fact that the error message mentions a child process
(presumably of mpirun) that cannot connect to a listener seems
to indicate that mpirun does not use ssh to start the remote
processes but tries to connect to a locally running daemon.
But that would be the device p4mpd. Weird.
There is a dedicated chp4 user's guide
ftp://info.mcs.anl.gov/pub/tech_reports/reports/ANL9217.ps.Z
maybe that can help.
You can make the p4 device more verbose by saying
mpirun -np 2 myprog -p4dbg 20 -p4rdbg
The exact meaning of these options is explained in the mentioned user's guide.
BTW: using the daemons gives you a much faster startup.
BTW2: mpich-1.2.7 is no longer maintained; if there are no
backward compatibility concerns I would always recommend
Open MPI.
Good Luck!
Georg
--
This signature was left intentionally almost blank.
http://www.this-page-intentionally-left-blank.org/
Georg
3/8/2008 4:56:07 PM
> mpirun -np 2 myprog -p4dbg 20 -p4rdbg
It has to be:
mpirun -np 2 myprog -p4dbg 20 -p4rdbg 20
Georg
3/8/2008 5:00:16 PM
Georg Bisseling wrote:
> On 08.03.2008 at 03:20, Saville <saville@comcast.net> wrote:
>
> Excuse my long list of gotchas. Seems you did everything right.
> I never ran into error messages exactly like yours.
I found the problem:
I opened up the firewall on the Master and everything worked.
So now I need to find what ports are used by MPI and only open those.
Thanks for all the help.
Saville
3/11/2008 10:30:47 PM
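That result fits the first line of the error output: the process started on the slave has to open a TCP connection back to a listener on the Master, and the number at the end of that line (33408 in one run, 41488 in the next) looks like a port that changes between runs. A firewall on the Master that blocks incoming connections would therefore stop the job even though ssh from the Master to the slave works. A rough way to check that direction by hand is sketched below; open_listener and can_connect are made-up helper names, not anything MPICH provides.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def open_listener(bind_addr="0.0.0.0", port=0):
    """Listen on bind_addr; port 0 lets the OS pick a free port, as a stand-in
    for the changing port numbers seen in the error messages."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, port))
    srv.listen(1)
    return srv


if __name__ == "__main__":
    # Loopback self-check only. For the real test, run open_listener() on the
    # Master, print srv.getsockname()[1], and call can_connect() from the slave.
    srv = open_listener("127.0.0.1")
    print(can_connect("127.0.0.1", srv.getsockname()[1]))
    srv.close()
```

If can_connect() run on the slave returns False for a port the Master is listening on, the firewall (or routing) between the two is the likely culprit rather than MPICH itself.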
On Tue, 11 Mar 2008 23:30:47 +0100, Saville <saville@comcast.net> wrote:
> So now I need to find what ports are used by MPI and only open those.
>
> Thanks for all the help.
One route to ease the pain is to have two network
cards in the master: one outbound with the firewall
watching over it and one inbound to the other cluster
nodes that is considered internal==harmless by the
firewall. The master then acts as a gateway and router
for the cluster nodes.
Configuring that can be much easier and the fast
ethernet card comes for $5.
Cheers
Georg
Georg
3/13/2008 11:47:47 AM
Georg Bisseling wrote:
> On Tue, 11 Mar 2008 23:30:47 +0100, Saville <saville@comcast.net> wrote:
>
>> So now I need to find what ports are used by MPI and only open those.
>>
>> Thanks for all the help.
>
> One route to ease the pain is to have two network
> cards in the master: one outbound with the firewall
> watching over it and one inbound to the other cluster
> nodes that is considered internal==harmless by the
> firewall. The master then acts as a gateway and router
> for the cluster nodes.
>
> Configuring that can be much easier and the fast
> ethernet card comes for $5.
Hi Georg,
I already have two ethernet cards in my Master. However I didn't see
anything in the MPICH install document that allowed me to specify which
ethernet card it should use.
I'd very much like to use that second card to isolate the cluster.
thanks
Gregg
Saville
3/15/2008 1:52:33 PM
On 15.03.2008 at 14:52, Saville <saville@comcast.net> wrote:
> Hi Georg,
>
> I already have two ethernet cards in my Master. However I didn't see
> anything in the MPICH install document that allowed me to specify which
> ethernet card it should use.
>
> I'd very much like to use that second card to isolate the cluster.
>
> thanks
>
> Gregg
Quite simple: put the two cards in the same IP network,
name the IP addresses (in /etc/hosts for a start) and put
the respective names in your machine file.
Georg
3/16/2008 10:12:49 PM
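Reading "the two cards" as the Master's cluster-facing card and the slave's card (an assumption), the "same IP network" part can be sanity-checked before touching /etc/hosts or the machine file. A small sketch with Python's standard ipaddress module; all addresses and the /24 prefix are hypothetical, and same_network is a name invented here:

```python
import ipaddress


def same_network(addr_a, addr_b, prefix_len):
    """Return True if both IPv4 addresses fall in the same network."""
    # strict=False lets us pass a host address instead of the network address.
    net_a = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in net_a


if __name__ == "__main__":
    # Hypothetical addresses: Master's cluster-facing card and the slave's card.
    print(same_network("192.168.10.1", "192.168.10.2", 24))
    # Hypothetical address of the Master's outward-facing card.
    print(same_network("10.0.0.5", "192.168.10.2", 24))
```

The names given to the two cluster-side addresses in /etc/hosts are then the ones that go into the machine file, so that traffic between the nodes stays on the internal network.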