System freezes with high NFS activity

  • Follow


Hello,

My Solaris 10 x64 system freezes when there is a lot of activity
(writing, in particular) on an NFS mounted disk.  A typical example is
when I run "make".

The system is currently running Solaris 10/2008.  I upgraded because
Solaris 05/2008 had the same problem, hoping the new version would fix
it.

I didn't have the problem when the system was running Solaris 11/2006.
The system died about the time Solaris 05/2008 came out (it died
because I was stupid enough to install some patches), so I decided to
upgrade to the newversion.

I'm now considering reinstalling Solaris 11/2006 (and this time I
won't patch it).  But I would rather fix the real problem.

Any ideas?

Thanks,

---
Frank Jensen,   Frank.Jensen@hugin.com

Hugin Expert A/S
Gasv�rksvej 5
9000 Aalborg
DENMARK

0
Reply Frank 1/23/2009 2:55:57 PM

On 2009-01-23, Frank Jensen <fj@hugin.dk> wrote:

> Any ideas?

Have you tried debugging the system with dtrace? Are you using ZFS?


-- 
|    Jakov Sosic    |    ICQ: 28410271    |   PGP: 0x965CAE2D   |
=================================================================
| start fighting cancer -> http://www.worldcommunitygrid.org/   |
0
Reply Jakov 1/23/2009 3:51:13 PM


Frank Jensen <fj@hugin.dk> wrote:
> Hello,
> 
> My Solaris 10 x64 system freezes when there is a lot of activity
> (writing, in particular) on an NFS mounted disk.  A typical example is
> when I run "make".
> 
> The system is currently running Solaris 10/2008.  I upgraded because
> Solaris 05/2008 had the same problem, hoping the new version would fix
> it.
> 
> I didn't have the problem when the system was running Solaris 11/2006.
> The system died about the time Solaris 05/2008 came out (it died
> because I was stupid enough to install some patches), so I decided to
> upgrade to the newversion.
> 
> I'm now considering reinstalling Solaris 11/2006 (and this time I
> won't patch it).  But I would rather fix the real problem.
> 
> Any ideas?

NFS is tricky to tame.

Anways, what does the NFS server show? It is hanging too or acting 
strange?

how are you mounting it, what sort of host is it?

In general, when NFS breaks your host will hang waiting for the mount to 
work again, which may be never- it's like losing a local disk or running 
umount -f on a mount.




0
Reply Cydrome 1/23/2009 4:00:02 PM

Cydrome Leader <presence@MUNGEpanix.com> writes:

> Frank Jensen <fj@hugin.dk> wrote:
>> Hello,
>> 
>> My Solaris 10 x64 system freezes when there is a lot of activity
>> (writing, in particular) on an NFS mounted disk.  A typical example is
>> when I run "make".
>> 
>> The system is currently running Solaris 10/2008.  I upgraded because
>> Solaris 05/2008 had the same problem, hoping the new version would fix
>> it.
>> 
>> I didn't have the problem when the system was running Solaris 11/2006.
>> The system died about the time Solaris 05/2008 came out (it died
>> because I was stupid enough to install some patches), so I decided to
>> upgrade to the newversion.
>> 
>> I'm now considering reinstalling Solaris 11/2006 (and this time I
>> won't patch it).  But I would rather fix the real problem.
>> 
>> Any ideas?
>
> NFS is tricky to tame.
>
> Anways, what does the NFS server show? It is hanging too or acting 
> strange?

The server works perfectly.  We also have a Linux and a Mac computer
mounting via NFS from the server without problems.

> how are you mounting it, what sort of host is it?

I have added this line to /etc/vfstab:

192.168.0.144:/export/home  -  /hugin  nfs  -  yes -

The server is a Sun T5120 running Solaris 05/2008.  However, the
problem also exists with an old Sun E250 running Solaris 8.

> In general, when NFS breaks your host will hang waiting for the mount to 
> work again, which may be never- it's like losing a local disk or running 
> umount -f on a mount.

Thanks

---
Frank Jensen,   Frank.Jensen@hugin.com

Hugin Expert A/S
Gasv�rksvej 5
9000 Aalborg
DENMARK

0
Reply Frank 1/23/2009 4:13:15 PM

Frank Jensen <fj@hugin.dk> wrote:
> Cydrome Leader <presence@MUNGEpanix.com> writes:
> 
>> Frank Jensen <fj@hugin.dk> wrote:
>>> Hello,
>>> 
>>> My Solaris 10 x64 system freezes when there is a lot of activity
>>> (writing, in particular) on an NFS mounted disk.  A typical example is
>>> when I run "make".
>>> 
>>> The system is currently running Solaris 10/2008.  I upgraded because
>>> Solaris 05/2008 had the same problem, hoping the new version would fix
>>> it.
>>> 
>>> I didn't have the problem when the system was running Solaris 11/2006.
>>> The system died about the time Solaris 05/2008 came out (it died
>>> because I was stupid enough to install some patches), so I decided to
>>> upgrade to the newversion.
>>> 
>>> I'm now considering reinstalling Solaris 11/2006 (and this time I
>>> won't patch it).  But I would rather fix the real problem.
>>> 
>>> Any ideas?
>>
>> NFS is tricky to tame.
>>
>> Anways, what does the NFS server show? It is hanging too or acting 
>> strange?
> 
> The server works perfectly.  We also have a Linux and a Mac computer
> mounting via NFS from the server without problems.

ok, so it sounds like the server is not maxed out, even though mac and 
linux NFS clients are different in behavior from the sun one.

 
>> how are you mounting it, what sort of host is it?
> 
> I have added this line to /etc/vfstab:
> 
> 192.168.0.144:/export/home  -  /hugin  nfs  -  yes -

What does the output of mount -p show?


> The server is a Sun T5120 running Solaris 05/2008.  However, the
> problem also exists with an old Sun E250 running Solaris 8.

Hmm, they should default to nfs v4 if the nfs domains are correct, but 
let's see what mount shows 
0
Reply Cydrome 1/23/2009 5:27:01 PM

Cydrome Leader <presence@MUNGEpanix.com> writes:

> Frank Jensen <fj@hugin.dk> wrote:
>> Cydrome Leader <presence@MUNGEpanix.com> writes:
>> 
>>> Frank Jensen <fj@hugin.dk> wrote:
>>>> Hello,
>>>> 
>>>> My Solaris 10 x64 system freezes when there is a lot of activity
>>>> (writing, in particular) on an NFS mounted disk.  A typical example is
>>>> when I run "make".
>>>> 
>>>> The system is currently running Solaris 10/2008.  I upgraded because
>>>> Solaris 05/2008 had the same problem, hoping the new version would fix
>>>> it.
>>>> 
>>>> I didn't have the problem when the system was running Solaris 11/2006.
>>>> The system died about the time Solaris 05/2008 came out (it died
>>>> because I was stupid enough to install some patches), so I decided to
>>>> upgrade to the newversion.
>>>> 
>>>> I'm now considering reinstalling Solaris 11/2006 (and this time I
>>>> won't patch it).  But I would rather fix the real problem.
>>>> 
>>>> Any ideas?
>>>
>>> NFS is tricky to tame.
>>>
>>> Anways, what does the NFS server show? It is hanging too or acting 
>>> strange?
>> 
>> The server works perfectly.  We also have a Linux and a Mac computer
>> mounting via NFS from the server without problems.
>
> ok, so it sounds like the server is not maxed out, even though mac and 
> linux NFS clients are different in behavior from the sun one.
>
>  
>>> how are you mounting it, what sort of host is it?
>> 
>> I have added this line to /etc/vfstab:
>> 
>> 192.168.0.144:/export/home  -  /hugin  nfs  -  yes -
>
> What does the output of mount -p show?

The output is:

/dev/dsk/c1t0d0s0 - / ufs - no rw,intr,largefiles,logging,xattr,onerror=panic
/devices - /devices devfs - no
ctfs - /system/contract ctfs - no
proc - /proc proc - no
mnttab - /etc/mnttab mntfs - no
swap - /etc/svc/volatile tmpfs - no xattr
objfs - /system/object objfs - no
sharefs - /etc/dfs/sharetab sharefs - no
/usr/lib/libc/libc_hwcap1.so.1 - /lib/libc.so.1 lofs - no
fd - /dev/fd fd - no rw
swap - /tmp tmpfs - no xattr
swap - /var/run tmpfs - no xattr
/dev/dsk/c0d0s7 - /export/home0 ufs - no rw,intr,largefiles,logging,xattr,onerror=panic
/dev/dsk/c1t0d0s7 - /export/home ufs - no rw,intr,largefiles,logging,xattr,onerror=panic
192.168.0.144:/export/home - /hugin nfs - no rw,xattr
/hugin/users/fj - /home/fj lofs - no

>> The server is a Sun T5120 running Solaris 05/2008.  However, the
>> problem also exists with an old Sun E250 running Solaris 8.
>
> Hmm, they should default to nfs v4 if the nfs domains are correct, but 
> let's see what mount shows 

Thanks,

/Frank
0
Reply Frank 1/25/2009 4:49:37 PM

Jakov Sosic <jsosic@jsosic.homeunix.org> writes:

> On 2009-01-23, Frank Jensen <fj@hugin.dk> wrote:
>
>> Any ideas?
>
> Have you tried debugging the system with dtrace? Are you using ZFS?

No, I'm not using ZFS.

And I haven't tried to debug the system with DTrace.  I'm not familiar
with DTrace, but I have heard that it should be quite amazing.  Can it
help in this situation?

Thanks,

/Frank
0
Reply Frank 1/25/2009 6:19:28 PM

On Fri, 23 Jan 2009 15:55:57 +0100
Frank Jensen <fj@hugin.dk> wrote:

> My Solaris 10 x64 system freezes when there is a lot of activity
> (writing, in particular) on an NFS mounted disk.  A typical example is
> when I run "make".

What do you mean by "freeze"? Does the system crash, or is it merely
the "make" run that stalls? If the latter, can you interrupt the
process and regain control of your session? 

You could try and run the "make" under truss to see where the process
blocks. 

> The system is currently running Solaris 10/2008.  I upgraded because
> Solaris 05/2008 had the same problem, hoping the new version would fix
> it.

I have 11/2008 deployed in production, with lots of writes to 7 NFS
mounts (the servers are one Solaris 9, and the other Solaris 10), and I
haven't encountered a problem (keeps fingers crossed :).

What does "dmesg" say when the freeze occurs (or /var/adm/messages if
you've been rebooting a lot)?

> The server is a Sun T5120 running Solaris 05/2008.  However, the
> problem also exists with an old Sun E250 running Solaris 8.

Try mounting with a different protocol level and see if that fixes your
problem (but the fact that you also have problems with Solaris 8 makes
me doubt this will work). 

> I have added this line to /etc/vfstab:
>
> 192.168.0.144:/export/home  -  /hugin  nfs  -  yes -

Why not using the automounter for home directories? That's the most
flexible and most reliable way to use NFS. When you put NFS mounts in
the crontab like you do, you might not be able to boot to completion
when the NFS server is not available.

Assuming that all home directories are on 192.168.0.144
under /export/home, put the following in /etc/auto_home:

*       192.168.0.144:/export/home/&

ensure that /etc/auto_master contains the following line:

/home           auto_home       -nobrowse

and start the automount service:

# svcadm enable svc:/system/filesystem/autofs:default

This will automount user directories under /home (where $DEITY intended
them to be). If you insist on having them under /hugin, simply make the
obvious replacements.

You can check for network problems with netstat:

# netstat -i

If you see Ierrs, Oerrs or Collisions, your problem could be network
related. Are you using a Gigabit connection? If so, have you enabled
jumbo frames on the client system, and not on the servers?

-- 
Stefaan A Eeckels
-- 
"Technically, Windows is an 'operating system,' which means that it
supplies your computer with the basic commands that it needs to
suddenly, with no warning whatsoever, stop operating." -Dave Barry 
0
Reply Stefaan 2/2/2009 11:38:58 PM

Stefaan A Eeckels <hoendech@ecc.lu> writes:

> On Fri, 23 Jan 2009 15:55:57 +0100
> Frank Jensen <fj@hugin.dk> wrote:
>
>> My Solaris 10 x64 system freezes when there is a lot of activity
>> (writing, in particular) on an NFS mounted disk.  A typical example is
>> when I run "make".
>
> What do you mean by "freeze"? Does the system crash, or is it merely
> the "make" run that stalls? If the latter, can you interrupt the
> process and regain control of your session? 

The system doesn't crash, but it freezes for a while (several
minutes).  It is not possible to interrupt the process, and everything
else that touches the NFS file system also freeze.  But a simple
"return" in a shell doesn't freeze.

> You could try and run the "make" under truss to see where the process
> blocks. 

I have tried to run one of the commands, the "ar" command, invoked by
"make" under "truss -d".  The freeze happens when the archive is
written:

 0.1160	write(2, "\n", 1)				= 1
189.3060	write(3, " ! < a r c h >\n /      ".., 4352200)	= 4352200

Running the same "make" command on a Linux computer, mounting the same
disk from the Sun T5120 NFS server, takes about 3.5 seconds (real time).

>> The system is currently running Solaris 10/2008.  I upgraded because
>> Solaris 05/2008 had the same problem, hoping the new version would fix
>> it.
>
> I have 11/2008 deployed in production, with lots of writes to 7 NFS
> mounts (the servers are one Solaris 9, and the other Solaris 10), and I
> haven't encountered a problem (keeps fingers crossed :).
>
> What does "dmesg" say when the freeze occurs (or /var/adm/messages if
> you've been rebooting a lot)?

Here is the output from the most recent reboot:

Jan 29 19:42:31 frank rootnex: [ID 349649 kern.info] xsvc0 at root
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] xsvc0 is /xsvc
Jan 29 19:42:31 frank genunix: [ID 935449 kern.info] 	ATA DMA off: disabled.  Control with "atapi-cd-dma-enabled" property
Jan 29 19:42:31 frank genunix: [ID 882269 kern.info] 	PIO mode 4 selected
Jan 29 19:42:31 frank genunix: [ID 935449 kern.info] 	ATA DMA off: disabled.  Control with "atapi-cd-dma-enabled" property
Jan 29 19:42:31 frank genunix: [ID 882269 kern.info] 	PIO mode 4 selected
Jan 29 19:42:31 frank genunix: [ID 935449 kern.info] 	ATA DMA off: disabled.  Control with "atapi-cd-dma-enabled" property
Jan 29 19:42:31 frank genunix: [ID 882269 kern.info] 	PIO mode 4 selected
Jan 29 19:42:31 frank genunix: [ID 935449 kern.info] 	ATA DMA off: disabled.  Control with "atapi-cd-dma-enabled" property
Jan 29 19:42:31 frank genunix: [ID 882269 kern.info] 	PIO mode 4 selected
Jan 29 19:42:31 frank genunix: [ID 773945 kern.info] 	UltraDMA mode 6 selected
Jan 29 19:42:31 frank pcplusmp: [ID 637496 kern.info] pcplusmp: asy (asy) instance 0 vector 0x4 ioapic 0x2 intin 0x4 is bound to cpu 0
Jan 29 19:42:31 frank isa: [ID 202937 kern.info] ISA-device: asy0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] asy0 is /isa/asy@1,3f8
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: ramdisk1024
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] ramdisk1024 is /pseudo/ramdisk@1024
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: lockstat0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] lockstat0 is /pseudo/lockstat@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: llc10
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] llc10 is /pseudo/llc1@0
Jan 29 19:42:31 frank cpudrv: [ID 805513 kern.info] NOTICE: cpu_acpi: _PSS package not found.
Jan 29 19:42:31 frank cpudrv: [ID 978953 kern.warning] WARNING: cpu_acpi: error parsing _PSS for CPU instance 0
Jan 29 19:42:31 frank cpudrv: [ID 805513 kern.info] NOTICE: cpu_acpi: _PSS package not found.
Jan 29 19:42:31 frank cpudrv: [ID 978953 kern.warning] WARNING: cpu_acpi: error parsing _PSS for CPU instance 1
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: lofi0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] lofi0 is /pseudo/lofi@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: profile0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] profile0 is /pseudo/profile@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: systrace0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] systrace0 is /pseudo/systrace@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: fbt0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] fbt0 is /pseudo/fbt@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: sdt0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] sdt0 is /pseudo/sdt@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: fcp0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] fcp0 is /pseudo/fcp@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: fcsm0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] fcsm0 is /pseudo/fcsm@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: lx_systrace0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] lx_systrace0 is /pseudo/lx_systrace@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: ucode0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] ucode0 is /pseudo/ucode@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: fssnap0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] fssnap0 is /pseudo/fssnap@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: winlock0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] winlock0 is /pseudo/winlock@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: pm0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: rsm0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] rsm0 is /pseudo/rsm@0


>> The server is a Sun T5120 running Solaris 05/2008.  However, the
>> problem also exists with an old Sun E250 running Solaris 8.
>
> Try mounting with a different protocol level and see if that fixes your
> problem (but the fact that you also have problems with Solaris 8 makes
> me doubt this will work). 

I have tried to force the mount to use version 3 of the NFS protocol,
but that doesn't change anything.

>> I have added this line to /etc/vfstab:
>>
>> 192.168.0.144:/export/home  -  /hugin  nfs  -  yes -
>
> Why not using the automounter for home directories? That's the most
> flexible and most reliable way to use NFS. When you put NFS mounts in
> the crontab like you do, you might not be able to boot to completion
> when the NFS server is not available.

The disk contains more than just home directories.  And without that
disk, it is meaningless to boot the computer.

> Assuming that all home directories are on 192.168.0.144
> under /export/home, put the following in /etc/auto_home:
>
> *       192.168.0.144:/export/home/&
>
> ensure that /etc/auto_master contains the following line:
>
> /home           auto_home       -nobrowse
>
> and start the automount service:
>
> # svcadm enable svc:/system/filesystem/autofs:default
>
> This will automount user directories under /home (where $DEITY intended
> them to be). If you insist on having them under /hugin, simply make the
> obvious replacements.

I haven't tried this yet, but I suspect it won't make a difference
(but perhaps I'm wrong).

> You can check for network problems with netstat:
>
> # netstat -i

This produces the following output:

frank> netstat -i
Name  Mtu  Net/Dest      Address        Ipkts  Ierrs Opkts  Oerrs Collis Queue
lo0   8232 loopback      localhost      318    0     318    0     0      0
elxl0 1500 frank         frank          424328672 0     369015398 0     0      0

> If you see Ierrs, Oerrs or Collisions, your problem could be network
> related. Are you using a Gigabit connection? If so, have you enabled
> jumbo frames on the client system, and not on the servers?

The network is 100 Mbps.  The network devices should autodetect this
(I haven't explicitly enabled or disabled anything).

The NFS server reports:

# dladm show-dev
e1000g0         link: up        speed: 100   Mbps       duplex: full

So I don't think it is a network problem (at least not on the server
side).

> -- 
> Stefaan A Eeckels
> -- 
> "Technically, Windows is an 'operating system,' which means that it
> supplies your computer with the basic commands that it needs to
> suddenly, with no warning whatsoever, stop operating." -Dave Barry 

Thanks,

/Frank
0
Reply Frank 2/3/2009 6:00:39 PM

On Tue, 03 Feb 2009 19:00:39 +0100
Frank Jensen <fj@hugin.dk> wrote:

> > You could try and run the "make" under truss to see where the
> > process blocks. 
> 
> I have tried to run one of the commands, the "ar" command, invoked by
> "make" under "truss -d".  The freeze happens when the archive is
> written:
> 
>  0.1160	write(2, "\n", 1)				= 1
> 189.3060	write(3, " ! < a r c h >\n /      "..,
> 4352200)	= 4352200
> 
> Running the same "make" command on a Linux computer, mounting the same
> disk from the Sun T5120 NFS server, takes about 3.5 seconds (real
> time).

I have a sneaky suspicion I have seen something similar before, long
time ago (1992) between a SCO system and a SunOS 4.0 server. A small
write to a file would work OK, larger writes would hang for a long time.

Notice the rather large block size on the write performed by ar. 

I cannot reproduce this behaviour on my systems. It would be useful to
snoop the line to see what exactly is going on.

-- 
Stefaan A Eeckels
-- 
He who will not reason, is a bigot;
he who cannot is a fool;
and he who dares not is a slave.     (Sir William Drummond)
0
Reply Stefaan 2/4/2009 11:44:02 PM

On Tue, 03 Feb 2009 19:00:39 +0100, Frank Jensen <fj@hugin.dk> wrote:

>Here is the output from the most recent reboot:

It would be interesting to sie a Sniffer/Snoop output of the write.

You can then see how the request flow over the wire, how large the gaps
are and whether there are any errors in the NFS protocol.
0
Reply Emil 2/5/2009 5:59:29 AM

Stefaan A Eeckels <hoendech@ecc.lu> writes:

> On Tue, 03 Feb 2009 19:00:39 +0100
> Frank Jensen <fj@hugin.dk> wrote:
>
>> > You could try and run the "make" under truss to see where the
>> > process blocks. 
>> 
>> I have tried to run one of the commands, the "ar" command, invoked by
>> "make" under "truss -d".  The freeze happens when the archive is
>> written:
>> 
>>  0.1160	write(2, "\n", 1)				= 1
>> 189.3060	write(3, " ! < a r c h >\n /      "..,
>> 4352200)	= 4352200
>> 
>> Running the same "make" command on a Linux computer, mounting the same
>> disk from the Sun T5120 NFS server, takes about 3.5 seconds (real
>> time).
>
> I have a sneaky suspicion I have seen something similar before, long
> time ago (1992) between a SCO system and a SunOS 4.0 server. A small
> write to a file would work OK, larger writes would hang for a long time.
>
> Notice the rather large block size on the write performed by ar. 
>
> I cannot reproduce this behaviour on my systems. It would be useful to
> snoop the line to see what exactly is going on.

Unfortunately, I have no experience diagnosing network problems.  I
have now tried "snoop", and it shows quite a few packets marked NFS,
and each NFS "write" packet is followed by 20 or so packets marked
TCP.  Any suggestions about what to look for?

Thanks,

/Frank

0
Reply Frank 2/5/2009 6:15:50 PM

On Thu, 05 Feb 2009 19:15:50 +0100, Frank Jensen <fj@hugin.dk> wrote:

>Unfortunately, I have no experience diagnosing network problems.  I
>have now tried "snoop", and it shows quite a few packets marked NFS,
>and each NFS "write" packet is followed by 20 or so packets marked
>TCP.  Any suggestions about what to look for?

You should use a snaplen of at least 120 (-s 120) to capture the
relevant header data. Look for the the time distance between NFS
requests and responses if there are any long delays. When it takes 189 s
to write 4 MB there must be long delays somewhere. Normally you should
see a constant flow of NFS write requests and NFS replies and some other
requests like getattr, commit, ... The requests you look for should all
use the same filehandle. The TCP packets after a write are normally not
interesting. They contain the data written in a NFS write. Interesting
are the NFS protocol packets. May be you can post a few lines. 

It may be easier to save the data in a file (-o file) and analyse this
with http://www.wireshark.org/download.html .
0
Reply Emil 2/7/2009 5:52:07 AM

Emil Naepflein <netnewsegn@kabelmail.de> writes:

> On Thu, 05 Feb 2009 19:15:50 +0100, Frank Jensen <fj@hugin.dk> wrote:
>
>>Unfortunately, I have no experience diagnosing network problems.  I
>>have now tried "snoop", and it shows quite a few packets marked NFS,
>>and each NFS "write" packet is followed by 20 or so packets marked
>>TCP.  Any suggestions about what to look for?
>
> You should use a snaplen of at least 120 (-s 120) to capture the
> relevant header data. Look for the the time distance between NFS
> requests and responses if there are any long delays. When it takes 189 s
> to write 4 MB there must be long delays somewhere. Normally you should
> see a constant flow of NFS write requests and NFS replies and some other
> requests like getattr, commit, ... The requests you look for should all
> use the same filehandle. The TCP packets after a write are normally not
> interesting. They contain the data written in a NFS write. Interesting
> are the NFS protocol packets. May be you can post a few lines. 

How do I identify the file handle in the output?

Here are some of the NFS packets.  In the beginning, there are "short"
delays, but they get larger -- the largest ones being about 1 minute
each.

475   0.25103        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 196608 for 32768 
476   0.25120        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 229376 for 32768 
477   0.25755 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
478   0.64359 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
479   0.64397        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 262144 for 32768 


498   1.47464        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 753664 for 32768 
499   1.47821 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
500   3.07381 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
501   3.07421        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 786432 for 32768 
502   3.07442        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB


529  12.77272        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 1540096 for 32768 
530  12.77935 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
531  25.65555 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
532  25.65595        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB


565  59.55866        frank -> 192.168.0.144 NFS C 4 (lookup valid) PUTFH FH=5311 NVERIFY GETATTR 10011a b0a23a ACCESS rd,lk,mo,ext,dl LOOKUP apps GETFH GETATTR 1001...
566  59.55893 192.168.0.144 -> frank        NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK NVERIFY NFS4ERR_SAME 
567 103.05160 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
568 103.05201        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB


582 103.07585 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
583 104.04859        frank -> 192.168.0.144 NFS C 4 (renew       ) RENEW CL=49889bf600000067 
584 104.04894 192.168.0.144 -> frank        NFS R 4 (renew       ) NFS4_OK RENEW NFS4_OK 
585 163.06621 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
586 163.06666        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB
0
Reply Frank 2/9/2009 8:01:10 PM

On Mon, 09 Feb 2009 21:01:10 +0100, Frank Jensen <fj@hugin.dk> wrote:

>How do I identify the file handle in the output?

The file handle is the value behind "FH=" .

>Here are some of the NFS packets.  In the beginning, there are "short"
>delays, but they get larger -- the largest ones being about 1 minute
>each.
>
>475   0.25103        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 196608 for 32768 

You can see that the file is written in 32 k chunks. 

>476   0.25120        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 229376 for 32768 
>477   0.25755 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>478   0.64359 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>479   0.64397        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 262144 for 32768 
>
>
>498   1.47464        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 753664 for 32768 

Either some packets are missing, or the file is written randomly because
the offset jumps from 262144 to 753664.

>499   1.47821 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>500   3.07381 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>501   3.07421        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 786432 for 32768 
>502   3.07442        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB

Is information missing here?

>
>
>529  12.77272        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 1540096 for 32768 

Here is the same. May be random writes cause the long delay. 

>530  12.77935 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>531  25.65555 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>532  25.65595        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB
>
>
>565  59.55866        frank -> 192.168.0.144 NFS C 4 (lookup valid) PUTFH FH=5311 NVERIFY GETATTR 10011a b0a23a ACCESS rd,lk,mo,ext,dl LOOKUP apps GETFH GETATTR 1001...
>566  59.55893 192.168.0.144 -> frank        NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK NVERIFY NFS4ERR_SAME 
>567 103.05160 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>568 103.05201        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB
>
>
>582 103.07585 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>583 104.04859        frank -> 192.168.0.144 NFS C 4 (renew       ) RENEW CL=49889bf600000067 
>584 104.04894 192.168.0.144 -> frank        NFS R 4 (renew       ) NFS4_OK RENEW NFS4_OK 
>585 163.06621 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>586 163.06666        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB

You can try to mount the filesystem with NFSv3 and/or to create the
archive at a temporary location and move it then to the final location. 
0
Reply Emil 2/11/2009 5:39:16 AM

Emil Naepflein <netnewsegn@kabelmail.de> writes:

> On Mon, 09 Feb 2009 21:01:10 +0100, Frank Jensen <fj@hugin.dk> wrote:
>
>>How do I identify the file handle in the output?
>
> The file handle is the value behind "FH=" .

Yes, I figured that out soon after I posted this.

>>Here are some of the NFS packets.  In the beginning, there are "short"
>>delays, but they get larger -- the largest ones being about 1 minute
>>each.
>>
>>475   0.25103        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 196608 for 32768 
>
> You can see that the file is written in 32 k chunks. 
>
>>476   0.25120        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 229376 for 32768 
>>477   0.25755 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>478   0.64359 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>479   0.64397        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 262144 for 32768 
>>
>>
>>498   1.47464        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 753664 for 32768 
>
> Either some packets are missing, or the file is written randomly because
> the offset jumps from 262144 to 753664.

I only posted some of the output from "snoop" -- the parts showing
some of the delays (the full output is hundreds of kilobytes).  The
two blank lines indicates that something was cut out.

>>499   1.47821 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>500   3.07381 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>501   3.07421        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 786432 for 32768 
>>502   3.07442        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB
>
> Is information missing here?
>
>>
>>
>>529  12.77272        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 1540096 for 32768 
>
> Here is the same. May be random writes cause the long delay. 
>
>>530  12.77935 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>531  25.65555 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>532  25.65595        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB
>>
>>
>>565  59.55866        frank -> 192.168.0.144 NFS C 4 (lookup valid) PUTFH FH=5311 NVERIFY GETATTR 10011a b0a23a ACCESS rd,lk,mo,ext,dl LOOKUP apps GETFH GETATTR 1001...
>>566  59.55893 192.168.0.144 -> frank        NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK NVERIFY NFS4ERR_SAME 
>>567 103.05160 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>568 103.05201        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB
>>
>>
>>582 103.07585 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>583 104.04859        frank -> 192.168.0.144 NFS C 4 (renew       ) RENEW CL=49889bf600000067 
>>584 104.04894 192.168.0.144 -> frank        NFS R 4 (renew       ) NFS4_OK RENEW NFS4_OK 
>>585 163.06621 192.168.0.144 -> frank        NFS R 4 (write       ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC) 
>>586 163.06666        frank -> 192.168.0.144 NFS C 4 (write       ) PUTFH FH=5DB
>
> You can try to mount the filesystem with NFSv3 and/or to create the
> archive at a temporary location and move it then to the final location. 

Mounting as NFSv3 was one of the first things I tried.  Same result.

Late yesterday I finally managed to solve the problem.  I'm 99.9% sure
that the "elxl" network driver is broken in the newer releases of
Solaris.  It works flawlessly in Solaris 11/2006 and many earlier
releases, but fails under certain conditions in the current Solaris
release.  For example, writing large files on NFS file systems leads
to the slow performance shown here.  On the other hand, using "ftp"
for file transfers works in both directions -- transferring a 1GB file
takes about 90 seconds (which is about the best that can be expected
on a 100 Mbps network).

The ASUS motherboard has a Marvell Yukon LAN chip, so I downloaded the
newest version of the Solaris driver from www.marvell.com.  Then I
disabled the elxl driver, and now the computer works perfectly.

Thanks for the assistance.  I now know a lot more about networking.

/Frank




0
Reply Frank 2/11/2009 4:38:57 PM

15 Replies
554 Views

(page loaded in 0.135 seconds)

Similiar Articles:


















7/24/2012 8:34:24 PM


Reply: