Hello,
My Solaris 10 x64 system freezes when there is a lot of activity
(writing, in particular) on an NFS mounted disk. A typical example is
when I run "make".
The system is currently running Solaris 10/2008. I upgraded because
Solaris 05/2008 had the same problem, hoping the new version would fix
it.
I didn't have the problem when the system was running Solaris 11/2006.
The system died about the time Solaris 05/2008 came out (it died
because I was stupid enough to install some patches), so I decided to
upgrade to the newversion.
I'm now considering reinstalling Solaris 11/2006 (and this time I
won't patch it). But I would rather fix the real problem.
Any ideas?
Thanks,
---
Frank Jensen, Frank.Jensen@hugin.com
Hugin Expert A/S
Gasv�rksvej 5
9000 Aalborg
DENMARK
|
|
0
|
|
|
|
Reply
|
Frank
|
1/23/2009 2:55:57 PM |
|
On 2009-01-23, Frank Jensen <fj@hugin.dk> wrote:
> Any ideas?
Have you tried debugging the system with dtrace? Are you using ZFS?
--
| Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D |
=================================================================
| start fighting cancer -> http://www.worldcommunitygrid.org/ |
|
|
0
|
|
|
|
Reply
|
Jakov
|
1/23/2009 3:51:13 PM
|
|
Frank Jensen <fj@hugin.dk> wrote:
> Hello,
>
> My Solaris 10 x64 system freezes when there is a lot of activity
> (writing, in particular) on an NFS mounted disk. A typical example is
> when I run "make".
>
> The system is currently running Solaris 10/2008. I upgraded because
> Solaris 05/2008 had the same problem, hoping the new version would fix
> it.
>
> I didn't have the problem when the system was running Solaris 11/2006.
> The system died about the time Solaris 05/2008 came out (it died
> because I was stupid enough to install some patches), so I decided to
> upgrade to the newversion.
>
> I'm now considering reinstalling Solaris 11/2006 (and this time I
> won't patch it). But I would rather fix the real problem.
>
> Any ideas?
NFS is tricky to tame.
Anways, what does the NFS server show? It is hanging too or acting
strange?
how are you mounting it, what sort of host is it?
In general, when NFS breaks your host will hang waiting for the mount to
work again, which may be never- it's like losing a local disk or running
umount -f on a mount.
|
|
0
|
|
|
|
Reply
|
Cydrome
|
1/23/2009 4:00:02 PM
|
|
Cydrome Leader <presence@MUNGEpanix.com> writes:
> Frank Jensen <fj@hugin.dk> wrote:
>> Hello,
>>
>> My Solaris 10 x64 system freezes when there is a lot of activity
>> (writing, in particular) on an NFS mounted disk. A typical example is
>> when I run "make".
>>
>> The system is currently running Solaris 10/2008. I upgraded because
>> Solaris 05/2008 had the same problem, hoping the new version would fix
>> it.
>>
>> I didn't have the problem when the system was running Solaris 11/2006.
>> The system died about the time Solaris 05/2008 came out (it died
>> because I was stupid enough to install some patches), so I decided to
>> upgrade to the newversion.
>>
>> I'm now considering reinstalling Solaris 11/2006 (and this time I
>> won't patch it). But I would rather fix the real problem.
>>
>> Any ideas?
>
> NFS is tricky to tame.
>
> Anways, what does the NFS server show? It is hanging too or acting
> strange?
The server works perfectly. We also have a Linux and a Mac computer
mounting via NFS from the server without problems.
> how are you mounting it, what sort of host is it?
I have added this line to /etc/vfstab:
192.168.0.144:/export/home - /hugin nfs - yes -
The server is a Sun T5120 running Solaris 05/2008. However, the
problem also exists with an old Sun E250 running Solaris 8.
> In general, when NFS breaks your host will hang waiting for the mount to
> work again, which may be never- it's like losing a local disk or running
> umount -f on a mount.
Thanks
---
Frank Jensen, Frank.Jensen@hugin.com
Hugin Expert A/S
Gasv�rksvej 5
9000 Aalborg
DENMARK
|
|
0
|
|
|
|
Reply
|
Frank
|
1/23/2009 4:13:15 PM
|
|
Frank Jensen <fj@hugin.dk> wrote:
> Cydrome Leader <presence@MUNGEpanix.com> writes:
>
>> Frank Jensen <fj@hugin.dk> wrote:
>>> Hello,
>>>
>>> My Solaris 10 x64 system freezes when there is a lot of activity
>>> (writing, in particular) on an NFS mounted disk. A typical example is
>>> when I run "make".
>>>
>>> The system is currently running Solaris 10/2008. I upgraded because
>>> Solaris 05/2008 had the same problem, hoping the new version would fix
>>> it.
>>>
>>> I didn't have the problem when the system was running Solaris 11/2006.
>>> The system died about the time Solaris 05/2008 came out (it died
>>> because I was stupid enough to install some patches), so I decided to
>>> upgrade to the newversion.
>>>
>>> I'm now considering reinstalling Solaris 11/2006 (and this time I
>>> won't patch it). But I would rather fix the real problem.
>>>
>>> Any ideas?
>>
>> NFS is tricky to tame.
>>
>> Anways, what does the NFS server show? It is hanging too or acting
>> strange?
>
> The server works perfectly. We also have a Linux and a Mac computer
> mounting via NFS from the server without problems.
ok, so it sounds like the server is not maxed out, even though mac and
linux NFS clients are different in behavior from the sun one.
>> how are you mounting it, what sort of host is it?
>
> I have added this line to /etc/vfstab:
>
> 192.168.0.144:/export/home - /hugin nfs - yes -
What does the output of mount -p show?
> The server is a Sun T5120 running Solaris 05/2008. However, the
> problem also exists with an old Sun E250 running Solaris 8.
Hmm, they should default to nfs v4 if the nfs domains are correct, but
let's see what mount shows
|
|
0
|
|
|
|
Reply
|
Cydrome
|
1/23/2009 5:27:01 PM
|
|
Cydrome Leader <presence@MUNGEpanix.com> writes:
> Frank Jensen <fj@hugin.dk> wrote:
>> Cydrome Leader <presence@MUNGEpanix.com> writes:
>>
>>> Frank Jensen <fj@hugin.dk> wrote:
>>>> Hello,
>>>>
>>>> My Solaris 10 x64 system freezes when there is a lot of activity
>>>> (writing, in particular) on an NFS mounted disk. A typical example is
>>>> when I run "make".
>>>>
>>>> The system is currently running Solaris 10/2008. I upgraded because
>>>> Solaris 05/2008 had the same problem, hoping the new version would fix
>>>> it.
>>>>
>>>> I didn't have the problem when the system was running Solaris 11/2006.
>>>> The system died about the time Solaris 05/2008 came out (it died
>>>> because I was stupid enough to install some patches), so I decided to
>>>> upgrade to the newversion.
>>>>
>>>> I'm now considering reinstalling Solaris 11/2006 (and this time I
>>>> won't patch it). But I would rather fix the real problem.
>>>>
>>>> Any ideas?
>>>
>>> NFS is tricky to tame.
>>>
>>> Anways, what does the NFS server show? It is hanging too or acting
>>> strange?
>>
>> The server works perfectly. We also have a Linux and a Mac computer
>> mounting via NFS from the server without problems.
>
> ok, so it sounds like the server is not maxed out, even though mac and
> linux NFS clients are different in behavior from the sun one.
>
>
>>> how are you mounting it, what sort of host is it?
>>
>> I have added this line to /etc/vfstab:
>>
>> 192.168.0.144:/export/home - /hugin nfs - yes -
>
> What does the output of mount -p show?
The output is:
/dev/dsk/c1t0d0s0 - / ufs - no rw,intr,largefiles,logging,xattr,onerror=panic
/devices - /devices devfs - no
ctfs - /system/contract ctfs - no
proc - /proc proc - no
mnttab - /etc/mnttab mntfs - no
swap - /etc/svc/volatile tmpfs - no xattr
objfs - /system/object objfs - no
sharefs - /etc/dfs/sharetab sharefs - no
/usr/lib/libc/libc_hwcap1.so.1 - /lib/libc.so.1 lofs - no
fd - /dev/fd fd - no rw
swap - /tmp tmpfs - no xattr
swap - /var/run tmpfs - no xattr
/dev/dsk/c0d0s7 - /export/home0 ufs - no rw,intr,largefiles,logging,xattr,onerror=panic
/dev/dsk/c1t0d0s7 - /export/home ufs - no rw,intr,largefiles,logging,xattr,onerror=panic
192.168.0.144:/export/home - /hugin nfs - no rw,xattr
/hugin/users/fj - /home/fj lofs - no
>> The server is a Sun T5120 running Solaris 05/2008. However, the
>> problem also exists with an old Sun E250 running Solaris 8.
>
> Hmm, they should default to nfs v4 if the nfs domains are correct, but
> let's see what mount shows
Thanks,
/Frank
|
|
0
|
|
|
|
Reply
|
Frank
|
1/25/2009 4:49:37 PM
|
|
Jakov Sosic <jsosic@jsosic.homeunix.org> writes:
> On 2009-01-23, Frank Jensen <fj@hugin.dk> wrote:
>
>> Any ideas?
>
> Have you tried debugging the system with dtrace? Are you using ZFS?
No, I'm not using ZFS.
And I haven't tried to debug the system with DTrace. I'm not familiar
with DTrace, but I have heard that it should be quite amazing. Can it
help in this situation?
Thanks,
/Frank
|
|
0
|
|
|
|
Reply
|
Frank
|
1/25/2009 6:19:28 PM
|
|
On Fri, 23 Jan 2009 15:55:57 +0100
Frank Jensen <fj@hugin.dk> wrote:
> My Solaris 10 x64 system freezes when there is a lot of activity
> (writing, in particular) on an NFS mounted disk. A typical example is
> when I run "make".
What do you mean by "freeze"? Does the system crash, or is it merely
the "make" run that stalls? If the latter, can you interrupt the
process and regain control of your session?
You could try and run the "make" under truss to see where the process
blocks.
> The system is currently running Solaris 10/2008. I upgraded because
> Solaris 05/2008 had the same problem, hoping the new version would fix
> it.
I have 11/2008 deployed in production, with lots of writes to 7 NFS
mounts (the servers are one Solaris 9, and the other Solaris 10), and I
haven't encountered a problem (keeps fingers crossed :).
What does "dmesg" say when the freeze occurs (or /var/adm/messages if
you've been rebooting a lot)?
> The server is a Sun T5120 running Solaris 05/2008. However, the
> problem also exists with an old Sun E250 running Solaris 8.
Try mounting with a different protocol level and see if that fixes your
problem (but the fact that you also have problems with Solaris 8 makes
me doubt this will work).
> I have added this line to /etc/vfstab:
>
> 192.168.0.144:/export/home - /hugin nfs - yes -
Why not using the automounter for home directories? That's the most
flexible and most reliable way to use NFS. When you put NFS mounts in
the crontab like you do, you might not be able to boot to completion
when the NFS server is not available.
Assuming that all home directories are on 192.168.0.144
under /export/home, put the following in /etc/auto_home:
* 192.168.0.144:/export/home/&
ensure that /etc/auto_master contains the following line:
/home auto_home -nobrowse
and start the automount service:
# svcadm enable svc:/system/filesystem/autofs:default
This will automount user directories under /home (where $DEITY intended
them to be). If you insist on having them under /hugin, simply make the
obvious replacements.
You can check for network problems with netstat:
# netstat -i
If you see Ierrs, Oerrs or Collisions, your problem could be network
related. Are you using a Gigabit connection? If so, have you enabled
jumbo frames on the client system, and not on the servers?
--
Stefaan A Eeckels
--
"Technically, Windows is an 'operating system,' which means that it
supplies your computer with the basic commands that it needs to
suddenly, with no warning whatsoever, stop operating." -Dave Barry
|
|
0
|
|
|
|
Reply
|
Stefaan
|
2/2/2009 11:38:58 PM
|
|
Stefaan A Eeckels <hoendech@ecc.lu> writes:
> On Fri, 23 Jan 2009 15:55:57 +0100
> Frank Jensen <fj@hugin.dk> wrote:
>
>> My Solaris 10 x64 system freezes when there is a lot of activity
>> (writing, in particular) on an NFS mounted disk. A typical example is
>> when I run "make".
>
> What do you mean by "freeze"? Does the system crash, or is it merely
> the "make" run that stalls? If the latter, can you interrupt the
> process and regain control of your session?
The system doesn't crash, but it freezes for a while (several
minutes). It is not possible to interrupt the process, and everything
else that touches the NFS file system also freeze. But a simple
"return" in a shell doesn't freeze.
> You could try and run the "make" under truss to see where the process
> blocks.
I have tried to run one of the commands, the "ar" command, invoked by
"make" under "truss -d". The freeze happens when the archive is
written:
0.1160 write(2, "\n", 1) = 1
189.3060 write(3, " ! < a r c h >\n / ".., 4352200) = 4352200
Running the same "make" command on a Linux computer, mounting the same
disk from the Sun T5120 NFS server, takes about 3.5 seconds (real time).
>> The system is currently running Solaris 10/2008. I upgraded because
>> Solaris 05/2008 had the same problem, hoping the new version would fix
>> it.
>
> I have 11/2008 deployed in production, with lots of writes to 7 NFS
> mounts (the servers are one Solaris 9, and the other Solaris 10), and I
> haven't encountered a problem (keeps fingers crossed :).
>
> What does "dmesg" say when the freeze occurs (or /var/adm/messages if
> you've been rebooting a lot)?
Here is the output from the most recent reboot:
Jan 29 19:42:31 frank rootnex: [ID 349649 kern.info] xsvc0 at root
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] xsvc0 is /xsvc
Jan 29 19:42:31 frank genunix: [ID 935449 kern.info] ATA DMA off: disabled. Control with "atapi-cd-dma-enabled" property
Jan 29 19:42:31 frank genunix: [ID 882269 kern.info] PIO mode 4 selected
Jan 29 19:42:31 frank genunix: [ID 935449 kern.info] ATA DMA off: disabled. Control with "atapi-cd-dma-enabled" property
Jan 29 19:42:31 frank genunix: [ID 882269 kern.info] PIO mode 4 selected
Jan 29 19:42:31 frank genunix: [ID 935449 kern.info] ATA DMA off: disabled. Control with "atapi-cd-dma-enabled" property
Jan 29 19:42:31 frank genunix: [ID 882269 kern.info] PIO mode 4 selected
Jan 29 19:42:31 frank genunix: [ID 935449 kern.info] ATA DMA off: disabled. Control with "atapi-cd-dma-enabled" property
Jan 29 19:42:31 frank genunix: [ID 882269 kern.info] PIO mode 4 selected
Jan 29 19:42:31 frank genunix: [ID 773945 kern.info] UltraDMA mode 6 selected
Jan 29 19:42:31 frank pcplusmp: [ID 637496 kern.info] pcplusmp: asy (asy) instance 0 vector 0x4 ioapic 0x2 intin 0x4 is bound to cpu 0
Jan 29 19:42:31 frank isa: [ID 202937 kern.info] ISA-device: asy0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] asy0 is /isa/asy@1,3f8
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: ramdisk1024
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] ramdisk1024 is /pseudo/ramdisk@1024
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: lockstat0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] lockstat0 is /pseudo/lockstat@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: llc10
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] llc10 is /pseudo/llc1@0
Jan 29 19:42:31 frank cpudrv: [ID 805513 kern.info] NOTICE: cpu_acpi: _PSS package not found.
Jan 29 19:42:31 frank cpudrv: [ID 978953 kern.warning] WARNING: cpu_acpi: error parsing _PSS for CPU instance 0
Jan 29 19:42:31 frank cpudrv: [ID 805513 kern.info] NOTICE: cpu_acpi: _PSS package not found.
Jan 29 19:42:31 frank cpudrv: [ID 978953 kern.warning] WARNING: cpu_acpi: error parsing _PSS for CPU instance 1
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: lofi0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] lofi0 is /pseudo/lofi@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: profile0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] profile0 is /pseudo/profile@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: systrace0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] systrace0 is /pseudo/systrace@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: fbt0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] fbt0 is /pseudo/fbt@0
Jan 29 19:42:31 frank pseudo: [ID 129642 kern.info] pseudo-device: sdt0
Jan 29 19:42:31 frank genunix: [ID 936769 kern.info] sdt0 is /pseudo/sdt@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: fcp0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] fcp0 is /pseudo/fcp@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: fcsm0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] fcsm0 is /pseudo/fcsm@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: lx_systrace0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] lx_systrace0 is /pseudo/lx_systrace@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: ucode0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] ucode0 is /pseudo/ucode@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: fssnap0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] fssnap0 is /pseudo/fssnap@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: winlock0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] winlock0 is /pseudo/winlock@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: pm0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] pm0 is /pseudo/pm@0
Jan 29 19:42:32 frank pseudo: [ID 129642 kern.info] pseudo-device: rsm0
Jan 29 19:42:32 frank genunix: [ID 936769 kern.info] rsm0 is /pseudo/rsm@0
>> The server is a Sun T5120 running Solaris 05/2008. However, the
>> problem also exists with an old Sun E250 running Solaris 8.
>
> Try mounting with a different protocol level and see if that fixes your
> problem (but the fact that you also have problems with Solaris 8 makes
> me doubt this will work).
I have tried to force the mount to use version 3 of the NFS protocol,
but that doesn't change anything.
>> I have added this line to /etc/vfstab:
>>
>> 192.168.0.144:/export/home - /hugin nfs - yes -
>
> Why not using the automounter for home directories? That's the most
> flexible and most reliable way to use NFS. When you put NFS mounts in
> the crontab like you do, you might not be able to boot to completion
> when the NFS server is not available.
The disk contains more than just home directories. And without that
disk, it is meaningless to boot the computer.
> Assuming that all home directories are on 192.168.0.144
> under /export/home, put the following in /etc/auto_home:
>
> * 192.168.0.144:/export/home/&
>
> ensure that /etc/auto_master contains the following line:
>
> /home auto_home -nobrowse
>
> and start the automount service:
>
> # svcadm enable svc:/system/filesystem/autofs:default
>
> This will automount user directories under /home (where $DEITY intended
> them to be). If you insist on having them under /hugin, simply make the
> obvious replacements.
I haven't tried this yet, but I suspect it won't make a difference
(but perhaps I'm wrong).
> You can check for network problems with netstat:
>
> # netstat -i
This produces the following output:
frank> netstat -i
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 loopback localhost 318 0 318 0 0 0
elxl0 1500 frank frank 424328672 0 369015398 0 0 0
> If you see Ierrs, Oerrs or Collisions, your problem could be network
> related. Are you using a Gigabit connection? If so, have you enabled
> jumbo frames on the client system, and not on the servers?
The network is 100 Mbps. The network devices should autodetect this
(I haven't explicitly enabled or disabled anything).
The NFS server reports:
# dladm show-dev
e1000g0 link: up speed: 100 Mbps duplex: full
So I don't think it is a network problem (at least not on the server
side).
> --
> Stefaan A Eeckels
> --
> "Technically, Windows is an 'operating system,' which means that it
> supplies your computer with the basic commands that it needs to
> suddenly, with no warning whatsoever, stop operating." -Dave Barry
Thanks,
/Frank
|
|
0
|
|
|
|
Reply
|
Frank
|
2/3/2009 6:00:39 PM
|
|
On Tue, 03 Feb 2009 19:00:39 +0100
Frank Jensen <fj@hugin.dk> wrote:
> > You could try and run the "make" under truss to see where the
> > process blocks.
>
> I have tried to run one of the commands, the "ar" command, invoked by
> "make" under "truss -d". The freeze happens when the archive is
> written:
>
> 0.1160 write(2, "\n", 1) = 1
> 189.3060 write(3, " ! < a r c h >\n / "..,
> 4352200) = 4352200
>
> Running the same "make" command on a Linux computer, mounting the same
> disk from the Sun T5120 NFS server, takes about 3.5 seconds (real
> time).
I have a sneaky suspicion I have seen something similar before, long
time ago (1992) between a SCO system and a SunOS 4.0 server. A small
write to a file would work OK, larger writes would hang for a long time.
Notice the rather large block size on the write performed by ar.
I cannot reproduce this behaviour on my systems. It would be useful to
snoop the line to see what exactly is going on.
--
Stefaan A Eeckels
--
He who will not reason, is a bigot;
he who cannot is a fool;
and he who dares not is a slave. (Sir William Drummond)
|
|
0
|
|
|
|
Reply
|
Stefaan
|
2/4/2009 11:44:02 PM
|
|
On Tue, 03 Feb 2009 19:00:39 +0100, Frank Jensen <fj@hugin.dk> wrote:
>Here is the output from the most recent reboot:
It would be interesting to sie a Sniffer/Snoop output of the write.
You can then see how the request flow over the wire, how large the gaps
are and whether there are any errors in the NFS protocol.
|
|
0
|
|
|
|
Reply
|
Emil
|
2/5/2009 5:59:29 AM
|
|
Stefaan A Eeckels <hoendech@ecc.lu> writes:
> On Tue, 03 Feb 2009 19:00:39 +0100
> Frank Jensen <fj@hugin.dk> wrote:
>
>> > You could try and run the "make" under truss to see where the
>> > process blocks.
>>
>> I have tried to run one of the commands, the "ar" command, invoked by
>> "make" under "truss -d". The freeze happens when the archive is
>> written:
>>
>> 0.1160 write(2, "\n", 1) = 1
>> 189.3060 write(3, " ! < a r c h >\n / "..,
>> 4352200) = 4352200
>>
>> Running the same "make" command on a Linux computer, mounting the same
>> disk from the Sun T5120 NFS server, takes about 3.5 seconds (real
>> time).
>
> I have a sneaky suspicion I have seen something similar before, long
> time ago (1992) between a SCO system and a SunOS 4.0 server. A small
> write to a file would work OK, larger writes would hang for a long time.
>
> Notice the rather large block size on the write performed by ar.
>
> I cannot reproduce this behaviour on my systems. It would be useful to
> snoop the line to see what exactly is going on.
Unfortunately, I have no experience diagnosing network problems. I
have now tried "snoop", and it shows quite a few packets marked NFS,
and each NFS "write" packet is followed by 20 or so packets marked
TCP. Any suggestions about what to look for?
Thanks,
/Frank
|
|
0
|
|
|
|
Reply
|
Frank
|
2/5/2009 6:15:50 PM
|
|
On Thu, 05 Feb 2009 19:15:50 +0100, Frank Jensen <fj@hugin.dk> wrote:
>Unfortunately, I have no experience diagnosing network problems. I
>have now tried "snoop", and it shows quite a few packets marked NFS,
>and each NFS "write" packet is followed by 20 or so packets marked
>TCP. Any suggestions about what to look for?
You should use a snaplen of at least 120 (-s 120) to capture the
relevant header data. Look for the the time distance between NFS
requests and responses if there are any long delays. When it takes 189 s
to write 4 MB there must be long delays somewhere. Normally you should
see a constant flow of NFS write requests and NFS replies and some other
requests like getattr, commit, ... The requests you look for should all
use the same filehandle. The TCP packets after a write are normally not
interesting. They contain the data written in a NFS write. Interesting
are the NFS protocol packets. May be you can post a few lines.
It may be easier to save the data in a file (-o file) and analyse this
with http://www.wireshark.org/download.html .
|
|
0
|
|
|
|
Reply
|
Emil
|
2/7/2009 5:52:07 AM
|
|
Emil Naepflein <netnewsegn@kabelmail.de> writes:
> On Thu, 05 Feb 2009 19:15:50 +0100, Frank Jensen <fj@hugin.dk> wrote:
>
>>Unfortunately, I have no experience diagnosing network problems. I
>>have now tried "snoop", and it shows quite a few packets marked NFS,
>>and each NFS "write" packet is followed by 20 or so packets marked
>>TCP. Any suggestions about what to look for?
>
> You should use a snaplen of at least 120 (-s 120) to capture the
> relevant header data. Look for the the time distance between NFS
> requests and responses if there are any long delays. When it takes 189 s
> to write 4 MB there must be long delays somewhere. Normally you should
> see a constant flow of NFS write requests and NFS replies and some other
> requests like getattr, commit, ... The requests you look for should all
> use the same filehandle. The TCP packets after a write are normally not
> interesting. They contain the data written in a NFS write. Interesting
> are the NFS protocol packets. May be you can post a few lines.
How do I identify the file handle in the output?
Here are some of the NFS packets. In the beginning, there are "short"
delays, but they get larger -- the largest ones being about 1 minute
each.
475 0.25103 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 196608 for 32768
476 0.25120 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 229376 for 32768
477 0.25755 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
478 0.64359 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
479 0.64397 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 262144 for 32768
498 1.47464 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 753664 for 32768
499 1.47821 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
500 3.07381 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
501 3.07421 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 786432 for 32768
502 3.07442 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
529 12.77272 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 1540096 for 32768
530 12.77935 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
531 25.65555 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
532 25.65595 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
565 59.55866 frank -> 192.168.0.144 NFS C 4 (lookup valid) PUTFH FH=5311 NVERIFY GETATTR 10011a b0a23a ACCESS rd,lk,mo,ext,dl LOOKUP apps GETFH GETATTR 1001...
566 59.55893 192.168.0.144 -> frank NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK NVERIFY NFS4ERR_SAME
567 103.05160 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
568 103.05201 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
582 103.07585 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
583 104.04859 frank -> 192.168.0.144 NFS C 4 (renew ) RENEW CL=49889bf600000067
584 104.04894 192.168.0.144 -> frank NFS R 4 (renew ) NFS4_OK RENEW NFS4_OK
585 163.06621 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
586 163.06666 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
|
|
0
|
|
|
|
Reply
|
Frank
|
2/9/2009 8:01:10 PM
|
|
On Mon, 09 Feb 2009 21:01:10 +0100, Frank Jensen <fj@hugin.dk> wrote:
>How do I identify the file handle in the output?
The file handle is the value behind "FH=" .
>Here are some of the NFS packets. In the beginning, there are "short"
>delays, but they get larger -- the largest ones being about 1 minute
>each.
>
>475 0.25103 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 196608 for 32768
You can see that the file is written in 32 k chunks.
>476 0.25120 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 229376 for 32768
>477 0.25755 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>478 0.64359 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>479 0.64397 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 262144 for 32768
>
>
>498 1.47464 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 753664 for 32768
Either some packets are missing, or the file is written randomly because
the offset jumps from 262144 to 753664.
>499 1.47821 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>500 3.07381 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>501 3.07421 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 786432 for 32768
>502 3.07442 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
Is information missing here?
>
>
>529 12.77272 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 1540096 for 32768
Here is the same. May be random writes cause the long delay.
>530 12.77935 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>531 25.65555 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>532 25.65595 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
>
>
>565 59.55866 frank -> 192.168.0.144 NFS C 4 (lookup valid) PUTFH FH=5311 NVERIFY GETATTR 10011a b0a23a ACCESS rd,lk,mo,ext,dl LOOKUP apps GETFH GETATTR 1001...
>566 59.55893 192.168.0.144 -> frank NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK NVERIFY NFS4ERR_SAME
>567 103.05160 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>568 103.05201 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
>
>
>582 103.07585 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>583 104.04859 frank -> 192.168.0.144 NFS C 4 (renew ) RENEW CL=49889bf600000067
>584 104.04894 192.168.0.144 -> frank NFS R 4 (renew ) NFS4_OK RENEW NFS4_OK
>585 163.06621 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>586 163.06666 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
You can try to mount the filesystem with NFSv3 and/or to create the
archive at a temporary location and move it then to the final location.
|
|
0
|
|
|
|
Reply
|
Emil
|
2/11/2009 5:39:16 AM
|
|
Emil Naepflein <netnewsegn@kabelmail.de> writes:
> On Mon, 09 Feb 2009 21:01:10 +0100, Frank Jensen <fj@hugin.dk> wrote:
>
>>How do I identify the file handle in the output?
>
> The file handle is the value behind "FH=" .
Yes, I figured that out soon after I posted this.
>>Here are some of the NFS packets. In the beginning, there are "short"
>>delays, but they get larger -- the largest ones being about 1 minute
>>each.
>>
>>475 0.25103 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 196608 for 32768
>
> You can see that the file is written in 32 k chunks.
>
>>476 0.25120 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 229376 for 32768
>>477 0.25755 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>478 0.64359 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>479 0.64397 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 262144 for 32768
>>
>>
>>498 1.47464 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 753664 for 32768
>
> Either some packets are missing, or the file is written randomly because
> the offset jumps from 262144 to 753664.
I only posted some of the output from "snoop" -- the parts showing
some of the delays (the full output is hundreds of kilobytes). The
two blank lines indicates that something was cut out.
>>499 1.47821 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>500 3.07381 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>501 3.07421 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 786432 for 32768
>>502 3.07442 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
>
> Is information missing here?
>
>>
>>
>>529 12.77272 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB9 WRITE ST=1EAE:0 at 1540096 for 32768
>
> Here is the same. May be random writes cause the long delay.
>
>>530 12.77935 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>531 25.65555 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>532 25.65595 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
>>
>>
>>565 59.55866 frank -> 192.168.0.144 NFS C 4 (lookup valid) PUTFH FH=5311 NVERIFY GETATTR 10011a b0a23a ACCESS rd,lk,mo,ext,dl LOOKUP apps GETFH GETATTR 1001...
>>566 59.55893 192.168.0.144 -> frank NFS R 4 (lookup valid) NFS4ERR_SAME PUTFH NFS4_OK NVERIFY NFS4ERR_SAME
>>567 103.05160 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>568 103.05201 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
>>
>>
>>582 103.07585 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>583 104.04859 frank -> 192.168.0.144 NFS C 4 (renew ) RENEW CL=49889bf600000067
>>584 104.04894 192.168.0.144 -> frank NFS R 4 (renew ) NFS4_OK RENEW NFS4_OK
>>585 163.06621 192.168.0.144 -> frank NFS R 4 (write ) NFS4_OK PUTFH NFS4_OK WRITE NFS4_OK 32768 (ASYNC)
>>586 163.06666 frank -> 192.168.0.144 NFS C 4 (write ) PUTFH FH=5DB
>
> You can try to mount the filesystem with NFSv3 and/or to create the
> archive at a temporary location and move it then to the final location.
Mounting as NFSv3 was one of the first things I tried. Same result.
Late yesterday I finally managed to solve the problem. I'm 99.9% sure
that the "elxl" network driver is broken in the newer releases of
Solaris. It works flawlessly in Solaris 11/2006 and many earlier
releases, but fails under certain conditions in the current Solaris
release. For example, writing large files on NFS file systems leads
to the slow performance shown here. On the other hand, using "ftp"
for file transfers works in both directions -- transferring a 1GB file
takes about 90 seconds (which is about the best that can be expected
on a 100 Mbps network).
The ASUS motherboard has a Marvell Yukon LAN chip, so I downloaded the
newest version of the Solaris driver from www.marvell.com. Then I
disabled the elxl driver, and now the computer works perfectly.
Thanks for the assistance. I now know a lot more about networking.
/Frank
|
|
0
|
|
|
|
Reply
|
Frank
|
2/11/2009 4:38:57 PM
|
|
|
15 Replies
554 Views
(page loaded in 0.135 seconds)
Similiar Articles: How to handle if nfs hangs - comp.unix.programmerSystem freezes with high NFS activity - comp.unix.solaris ... How do I identify the file handle in the output? Here are some of the NFS ... Network File System ... clients ... Solaris 10 NFS 4 Problem - comp.unix.solarisSystem freezes with high NFS activity - comp.unix.solaris ..... the servers are one Solaris 9, and the other Solaris 10), and I haven't encountered a problem ... Autofs issue with home directory on AIX.- anyone? - comp.unix ...I'm having trouble get user's home directories to mount via the Automounter on AIX 5.3 NIS client. The directories are exported from a Solaris NFS fil... Solaris 10 not mounting NFS via /etc/vfstab on reboot - comp.unix ...I have an NFS client which has an entry like the following in /etc ... default enabled. svc:/system/identity:domain enabled. svc:/network/nfs/status enabled. svc:/system ... What happens when you use the vxfs option delaylog? - comp.sys.hp ...System freezes with high NFS activity - comp.unix.solaris ... The freeze happens when the archive is written: 0.1160 write(2, "\n ... What happens when you use the vxfs ... Session Freezes - comp.unix.solarisSystem freezes with high NFS activity - comp.unix.solaris ... If the latter, can you interrupt the > process and regain control of your session? Samba mount on Solaris 10. - comp.sys.sun.adminSystem freezes with high NFS activity - comp.unix.solaris ... Hello, My Solaris 10 x64 system freezes ... lots of writes to 7 NFS > mounts (the servers are one Solaris 9 ... run following shell error on solaris but OK on RedHat, why ...Error Booting... putting me in system maintence mode - comp.unix ... Console login ... NFS slowdown - Solaris client, Linux server. - comp.unix.solaris ... Running snoop ... snoop- capture and inspect network packets - comp.unix.solaris ...snoop captures packets from the network and displays their contents. snoop uses both the network packet filter and streams buffer modules to provide e... Strange problem with CPU spikes on solaris 8 - comp.unix.solaris ...System freezes with high NFS activity - comp.unix.solaris ... It is hanging too or acting >>> strange ... but the fact that you also have problems with Solaris 8 ... PSP9.01 Patch>Fix For Capture Problems> - comp.graphics.apps.paint ...System freezes with high NFS activity - comp.unix.solaris ..... 11/2006 (and this time I won't patch it). But I would rather fix the real problem. ... Slow Commit New Record - Looking for suggestions - comp.databases ...Still trying to get a converted system to run acceptably: I'm stymied on this issue: This is the main orders table of a complex FM6-> FM7 converted ... NFS slowdown - Solaris client, Linux server. - comp.unix.solaris ...> You should snoop activity between client and server ... I think Solaris 7 NFS over high bandwidth links (eg ... Network File System ... ... can I chown(1) an NFS-mounted file ... Odd BACKUP error: unsupported file structure ! - comp.os.vms ...... tried to send the backup to my xserve via NFS, but it ... With my system having been essentially down for days now ... (note that in this case, there was no activity so the fact ... top 10 uses for random data compression?? anyone? - comp ...Who freezes increasingly, when Yani rushs the empirical ... Hardly any tropical fundamental system reproduces ... might through rebuild except Salahuddin when the high ... Session Freezes - comp.unix.solaris | Computer GroupSystem freezes with high NFS activity - comp.unix.solaris ... If the latter, can you interrupt the > process and regain control of your session? System Hangs with Constant Disk Access - Club MyCEGeneral Hardware Forum Discuss, System Hangs with Constant Disk Access at ... drive access hang, systeem bevriest muis reageert niet, system freezes high disk activity ... 7/24/2012 8:34:24 PM
|