IPMP group failure for (seemingly) no reason.

Hi all. I've got a Netra 440 with IPMP set up (config below). Every so
often I get messages that the interfaces have failed and then that the
group has failed. The messages come only from IPMP; nothing from the
physical layer says anything is wrong with the connection. Also, I have
other connections on this server going to the same switch, and those
appear to be fine. I am doing something I think is a bit unorthodox:
the multipathing is over a 16-bit (/16) subnet with no router.
The 2 individual addresses are 10.20.10.61 and 10.20.11.61, and they
share 10.20.10.60. Could that be the cause of my problem? I've never
seen that done before, but I don't really know why it wouldn't work. I'm
running Solaris 8. I've read that patch 108528 is needed for an
issue similar to this, and I'm currently running 108528-22.

Thanks.

ce1: flags=78040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,NOFAILOVER,FAILED,STANDBY,INACTIVE> mtu 1500
        inet 10.20.10.61 netmask ffff0000 broadcast 10.20.255.255
        groupname ipmp
        ether 0:14:4f:25:dd:f8
ce5: flags=18040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,NOFAILOVER,FAILED> mtu 1500
        inet 10.20.11.61 netmask ffff0000 broadcast 10.20.255.255
        groupname ipmp
        ether 0:14:4f:25:dc:c4
ce5:1: flags=10000843<UP,BROADCAST,RUNNING,MULTICAST,FAILED> mtu 1500
        inet 10.20.10.60 netmask ffff0000 broadcast 10.20.255.255

Reply bozothedeathmachine16 (49) 4/17/2007 9:08:26 PM
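
On the subnet question: with netmask ffff0000, the two interface addresses and the shared address all mask to the same network, so they are on-link for one another without a router. A minimal sketch of that arithmetic follows. The function name network_of is made up for illustration, and the stock Solaris 8 /bin/sh does not understand $(( )), so this assumes ksh, bash, or another POSIX shell.

```shell
#!/bin/sh
# network_of ADDR MASK: print ADDR ANDed with MASK, both as dotted quads.
# (Hypothetical helper for illustration; not part of Solaris.)
network_of() {
    old_ifs=$IFS
    IFS=.
    # Word-split both arguments on dots, giving octets in $1..$8.
    set -- $1 $2
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

# ffff0000 from the ifconfig output is 255.255.0.0 in dotted-quad form.
for addr in 10.20.10.61 10.20.11.61 10.20.10.60; do
    echo "$addr -> $(network_of "$addr" 255.255.0.0)"
done
```

All three lines print 10.20.0.0 as the network, so the addressing by itself should not be what makes the group fail.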

Probe-based IPMP requires ping partners (probe targets), which by
default means the default router.  A ping partner must reply
consistently to the ICMP requests from in.mpathd.  If several
consecutive requests go without replies, the interface is considered
failed.  You can specify ping partners by defining static routes.

Reply Adam 4/18/2007 5:04:58 AM



I did a "route add net 10.20.0.0/16 10.20.10.50" on both machines.
Let's see if it works. Thanks.

Reply bozothedeathmachine 4/18/2007 12:59:31 PM

On 2007-04-18, bozothedeathmachine <bozothedeathmachine@gmail.com> wrote:
> I did a "route add net 10.20.0.0/16 10.20.10.50" on both machines.
> Let's see if it works. Thanks.

I *think* that they need to be static host routes, so if what you've
done doesn't work, try that (and let me know either way if you don't
mind).

Cheers,

Ceri
-- 
That must be wonderful!  I don't understand it at all.
                                                  -- Moliere
Reply Ceri 4/18/2007 5:11:15 PM
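
For anyone wanting to try the host-route suggestion, here is a minimal sketch. As far as I know, the usual form is a host route with the probe target as both destination and gateway. The script only prints the commands so they can be reviewed before being run as root. The function name print_host_routes is hypothetical, and 10.20.10.50 is simply the address used earlier in this thread.

```shell
#!/bin/sh
# print_host_routes TARGET...: emit one Solaris "route add" command per
# probe target, using the target as both destination and gateway.
# (Hypothetical helper; it prints the commands instead of running them.)
print_host_routes() {
    for target in "$@"; do
        echo "route add -host $target $target -static"
    done
}

# 10.20.10.50 is the on-link host mentioned earlier in the thread.
print_host_routes 10.20.10.50
```

Routes added by hand like this are lost at reboot, so they would also need to go into a startup script to be permanent.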

Ceri Davies wrote:
> On 2007-04-18, bozothedeathmachine <bozothedeathmachine@gmail.com> wrote:
>> On Apr 18, 12:04 am, Adam Sanders <sanders.a...@gmail.com> wrote:
>>> On Apr 17, 4:08 pm, bozothedeathmachine

>> I did a "route add net 10.20.0.0/16 10.20.10.50" on both machines.
>> Let's see if it works. Thanks.
> 
> I *think* that they need to be static host routes, so if what you've
> done doesn't work, try that (and let me know either way if you don't
> mind).

In the past, I have dealt with default routes that would fail over in 
case one of the routers went down. I was unable to get a guarantee from 
the network group that the failover would happen within the default 
in.mpathd failure detection time of 10 seconds.

To get around the situation, as you suggested, I added static host 
routes to the NAS head in that subnet, so that if the routers were 
failing over, the IPMP interfaces always had a fallback host route that 
they could ping. The likelihood of the routers failing over and the NAS 
being unavailable at the same time was considered low enough for this 
solution to be deemed acceptable.

The OP could also just use link-based failure detection rather than 
probe-based detection 
(http://docs.sun.com/app/docs/doc/816-4554/6maoq027k?a=view#emqra), thus 
not requiring the use of static host routes.

Kind Regards,

Nathan Dietsch
Reply Nathan 4/22/2007 11:28:53 AM

I'm getting all sorts of errors in one of the servers' messages logs
(below). I'm not quite sure if adding the static route worked.

Could those errors be indicative of a hardware problem?

Thanks again,
Ben..

Apr 23 11:56:58 utspptslee1 in.mpathd[38]: [ID 122137 daemon.error] Improved failure detection time 23646 ms
Apr 23 11:56:59 utspptslee1 in.mpathd[38]: [ID 122137 daemon.error] Improved failure detection time 11823 ms
Apr 23 11:56:59 utspptslee1 in.mpathd[38]: [ID 122137 daemon.error] Improved failure detection time 10000 ms
Apr 23 12:01:59 utspptslee1 in.mpathd[38]: [ID 398532 daemon.error] Cannot meet requested failure detection time of 10000 ms on (inet  ce1) new failure detection time is 63042 ms
Apr 23 12:03:00 utspptslee1 in.mpathd[38]: [ID 122137 daemon.error] Improved failure detection time 31521 ms
Apr 23 12:03:00 utspptslee1 in.mpathd[38]: [ID 122137 daemon.error] Improved failure detection time 15760 ms
Apr 23 12:03:00 utspptslee1 in.mpathd[38]: [ID 122137 daemon.error] Improved failure detection time 10000 ms

Reply bozothedeathmachine 4/23/2007 6:17:18 PM
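
A note on reading these messages: the numbers follow a simple pattern. After in.mpathd backs off to a longer detection time, each "Improved" line is half the previous value, rounded down, until it is back at the requested 10000 ms. The sketch below reproduces that sequence from the log. The function name improve_sequence is hypothetical, and this models only the arithmetic visible in the log, not in.mpathd's internals.

```shell
#!/bin/sh
# improve_sequence START TARGET: print successive halvings of START (ms),
# never going below TARGET. Hypothetical helper modelling the log pattern.
improve_sequence() {
    current=$1
    target=$2
    while [ "$current" -gt "$target" ]; do
        current=$(( current / 2 ))
        # Clamp at the requested failure detection time.
        [ "$current" -lt "$target" ] && current=$target
        echo "$current"
    done
}

# Start from the backed-off value reported at 12:01:59.
improve_sequence 63042 10000
```

Starting from 63042 this prints 31521, 15760 and 10000, matching the three lines logged at 12:03:00.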

In article <1177352237.982077.310430@y5g2000hsa.googlegroups.com>,
bozothedeathmachine  <bozothedeathmachine@gmail.com> wrote:
>I'm getting all sort of errors in one the server's messages log
>(below). I'm not quite sure if adding the static route worked.
>
>Could those errors be indicative of a hardware problem?
>
>Thanks again,
>Ben..

Silly question, but: does the router you added respond to pings, or does 
any device on your segment respond to broadcast pings? Is the system that 
you're having issues with relatively idle on the network? I've noticed 
issues like the ones you've described mostly on systems that have low 
amounts of traffic on a given IPMP group.

-tom

-- 

"You can only be -so- accurate with a claw-hammer."  --me
Reply ferric 4/26/2007 7:26:00 PM
