send recv performance


We have a program in which each task sends data to and receives data from
some of the other tasks.
When I time each send and recv, most of them take between
1 and 5 milliseconds, but at random one of the sends and its corresponding
recv takes 500 milliseconds.
Setup:
Itanium HP cluster
Intel compiler 9
LAM/MPI 7.0.4

Why do I get this time difference when all sends and recvs are of the same
data size?

Hod

0
Reply hodwi (3) 9/23/2006 7:17:44 AM

Having a bit of experience with benchmarking, the following questions come
to mind:
* What is the topology of the cluster (processors per node, nodes per
rack, hypercube, star, tree, etc.)?
* What is the network hardware (InfiniBand, Myrinet, Ethernet, ...)?
* How do you time your sends/recvs? Do you take the average of 1000
of these operations? What is the resolution of your timing function?
* How many nodes does your program use?
* How is your program distributed among the processors/nodes?
* Are there any other programs running on the cluster? What resources
do they use?
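
On the timing questions above: a mean over many operations can hide a rare slow send, so it may help to keep the individual samples and look at the median, the maximum, and the number of samples above a threshold. The following is a minimal Python sketch, not the original program. The names `timer_resolution` and `summarize`, the 10x-median threshold, and the sample values are all assumptions made for illustration, and `time.perf_counter` merely stands in for the program's own timer. (Within MPI itself, `MPI_Wtick` reports the resolution of `MPI_Wtime`.)

```python
import statistics
import time


def timer_resolution(samples=1000):
    """Estimate the smallest observable step of time.perf_counter, in seconds."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = time.perf_counter()
        t1 = time.perf_counter()
        while t1 == t0:  # spin until the clock visibly advances
            t1 = time.perf_counter()
        smallest = min(smallest, t1 - t0)
    return smallest


def summarize(durations, factor=10.0):
    """Summarize per-operation timings and flag samples above factor * median."""
    median = statistics.median(durations)
    slow = [d for d in durations if d > factor * median]
    return {
        "count": len(durations),
        "mean": statistics.fmean(durations),
        "median": median,
        "max": max(durations),
        "slow_count": len(slow),
        "slow_fraction": len(slow) / len(durations),
    }


# Made-up samples in seconds: 99 sends of 3 ms and one of 500 ms.
example = [0.003] * 99 + [0.5]
stats = summarize(example)
print(stats)
print("timer resolution estimate:", timer_resolution(), "s")
```

With these made-up numbers the mean is about 8 ms, which matches neither the typical 3 ms send nor the single 500 ms outlier; the median and the slow count describe the situation better.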

Personally, I use 'MPI_Alltoall/MPI_Alltoallv' a lot. I found that these
have predictable and good performance, even if not all processors are
communicating.

regards,
Wijnand


0
Reply wijnand 9/24/2006 8:27:02 AM


Hi,
Topology: I do not know the terminology, but our setup is 12 HP rx2600
dual-Itanium boxes, one master (2 CPUs) connected directly by Ethernet to
all nodes (2 CPUs each).

The timing is done with mpi_wtime with the default time intervals. Since
I get numbers as small as 1e-4, I assume its resolution is in the 0.1
millisecond range.
The problem I am seeing is that some sends/recvs take longer than the
others by an order of magnitude.

The program uses 4 nodes and runs 1 copy per node, on one CPU.

No other processes are running on the nodes.

Hod



0
Reply hodw 9/24/2006 1:46:11 PM

hodw wrote:
> We have a program where each task sends and recvs data from some of the
> other tasks.
> When I check the timing for each send and recv most of them are between
> 1-5 milliseconds then randomly one of the sends and corresponding recv
> takes 500 milliseconds.
> setup:
> Itanium hp cluster
> intel compiler 9
> lam-mpi 7.0.4
> 
> Why do I get this time difference - all send and recv are of the same
> data size?

My first guess would be that the 1-5 ms sends are between processors of the
same machine and the 500 ms ones go through the network.

Do some simple math: check np (the number of processes) and the CPUs per
machine. Work out the proportion of process pairs that talk through the
network versus those that talk locally on the same machine, then count the
half-second sends you have and check whether the proportion is the same.
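
The check suggested above can be sketched as follows. This is a rough Python illustration that assumes ranks are placed in consecutive blocks (ranks 0 to c-1 on the first machine, and so on) and that every pair of ranks communicates equally often. The function name `network_pair_fraction` is hypothetical, and the real placement depends on how the job was launched.

```python
from itertools import combinations


def network_pair_fraction(num_procs, cpus_per_machine):
    """Fraction of rank pairs whose two ranks sit on different machines.

    Assumes block placement: rank r runs on machine r // cpus_per_machine.
    """
    pairs = list(combinations(range(num_procs), 2))
    if not pairs:
        return 0.0
    remote = sum(
        1 for a, b in pairs
        if a // cpus_per_machine != b // cpus_per_machine
    )
    return remote / len(pairs)


# 8 processes on dual-CPU machines: 4 local pairs out of 28.
print(network_pair_fraction(8, 2))
# One process per machine: every pair crosses the network.
print(network_pair_fraction(4, 1))
```

If the observed share of half-second sends is far from this fraction, the local-versus-network explanation becomes unlikely.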

--renato
0
Reply Renato 9/25/2006 9:38:09 AM

The case I am running is one process per node, although each node has 2
CPUs, since the memory requirement on each node is high.
All sends/recvs are to nodes across the network.

Hod


0
Reply hodw 9/25/2006 7:38:56 PM

hodw wrote:
> topology -  I do not know the terminology but our setup is hp 12 double
> itanium rx2600 boxes , one master (2 cpus) connected by ethernet to all
> nodes (2 cpus each) directly.

What flavour of Ethernet do you use:
plain Ethernet (10 Mbit/s), Fast Ethernet (100 Mbit/s), Gigabit Ethernet
(1 Gbit/s), or 10 Gigabit Ethernet (10 Gbit/s)?
How much data is each node trying to send to another node? Is it on
the order of 1 byte, 100 bytes, 10 KB, 1 MB, or 100 MB?

If you have a 'star topology' (each node is connected to the central
switch), my best guess would be that the messages (sends/recvs) of the
nodes collide with each other. Although I am not an Ethernet expert, I
know that two nodes cannot communicate simultaneously over a single
wire. One of them has to back off, and I guess that this causes one node
to wait 500 ms.

If this is the case, I would recommend another topology and/or other
network hardware. You use very expensive machines, but your network is
much too cheap.

regards,
wijnand

0
Reply wijnand 9/26/2006 9:31:21 AM

wijnand wrote:
> If you have a 'star topology' (each node is connected to the central
> switch), my best guess would be that the messages (sends/recvs) of the
> nodes collide with each other. Although I am not an ethernet expert, I
> know that two nodes cannot communicate simultaneously over a single
> wire. One of them has to back off and I guess that this causes one node
> to wait 500 ms.

Unless he's running those tests on a heavily loaded network (or on 10 Mbit
Ethernet, or through a hub), collisions shouldn't be a problem, or should at
least occur at a very low rate.

A 1 KB packet over a roughly 1 MB/s (10 Mbps) link has a 0.1% chance of
colliding on a clean network. The same packet on a 100 Mbps link has a 0.01%
chance.

However, as all computing nodes are trying to talk to the head node,
that probability may be multiplied by the number of nodes you have.
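
The percentages above appear to correspond to the time one packet occupies the wire, expressed as a share of one second. A minimal Python sketch of that arithmetic follows, assuming a 1000-byte packet, nominal link rates, one packet per sender per second, and no protocol overhead. The names `wire_time` and `busy_fraction` are illustrative, and this is a back-of-the-envelope occupancy estimate, not a model of Ethernet collisions.

```python
def wire_time(packet_bytes, link_bits_per_s):
    """Seconds a packet of the given size occupies a link at the nominal rate."""
    return packet_bytes * 8 / link_bits_per_s


def busy_fraction(packet_bytes, link_bits_per_s, packets_per_s=1.0, senders=1):
    """Share of each second the link is busy, capped at 1.0."""
    busy = wire_time(packet_bytes, link_bits_per_s) * packets_per_s * senders
    return min(busy, 1.0)


for name, rate in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9)]:
    one = busy_fraction(1000, rate)
    four = busy_fraction(1000, rate, senders=4)
    print(f"{name}: 1 sender {one:.4%}, 4 senders {four:.4%}")
```

At 10 Mbps a 1000-byte packet takes 0.8 ms on the wire, close to the 0.1% figure, which rounds the link rate to 1 MB/s. Multiplying by the number of senders mirrors the scaling by node count mentioned above.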

Hod, can you work out what percentage of the sends are the long ones?

--renato
0
Reply Renato 9/26/2006 12:18:15 PM
