Are Suns fussy about fibre channel disks?



I'm thinking of buying a used Blade 2000. I've been offered a Dell 147 
GB F-CAL (fibre) disk. Are Suns fussy about their disks, or will pretty 
much any FCAL disk work in a Blade 2000?

I know on the older machines, it rarely seems to matter who makes the 
SCSI disk. One usually has to label non-Sun disks, but that is about it. 
But I've no idea if this freedom extends to fibre disks.


Reply Dave 9/27/2007 1:37:45 PM

On 2007-09-27, Dave <sorry-no-email@nowhere.com> wrote:
> I'm thinking of buying a used Blade 2000. I've been offered a Dell 147 
> GB F-CAL (fibre) disk. Are Suns fussy about their disks, or will pretty 
> much any FCAL disk work in a Blade 2000?
>
> I know on the older machines, it rarely seems to matter who makes the 
> SCSI disk. One usually has to label non-Sun disks, but that is about it. 
> But I've no idea if this freedom extends to fibre disks.

W-e-e-e-e-e-e-e-e-e-e-e-llll. This is long, but bear with me...

I bought a SB2000 a couple of months ago, from eBay, after my main home machine
(an Ultra 60) died in a lightning strike. (Trying to claim on the insurance for
a non-PC, non-Mac computer was amusing, but that's another story). The SB2000
has twin 73GB disks, 4 GB of memory and twin 1.2GHz processors and initially I
was delighted with it.

I reinstalled everything, restored from the U60 backup tapes and was poised on
the brink of getting disk mirroring set up (which I'd put off, because it's fiddly
to do and the sort of thing I only do rarely (the systems I work on at work are
set up by other people)) when it started crashing. No errors, nothing in syslog,
it just died. I dd'd /home to the other disk to save my work. Then I noticed I
was getting UFS log rollover errors from the boot disk - they weren't getting
syslogged, just coming on the console, so I never saw them unless I was actually
there. After a day of crashes of increasing frequency, it would no longer boot
from the main disk. Booting from DVD, the main disk could no longer be seen by
the system at all. probe-scsi also couldn't see it at all.

So I contacted the supplier, who was absolutely brilliant throughout, and he
sent me another 73GB disk. This is where it gets relevant to you. The original
disks were Sun badged ones. The replacement disk was a Fujitsu one. The original
disk was c1t1d0, the new one came up as c1t33d0, and I *could* *not* make its
logical ID correct (and no-one responded when I posted here asking for help -
not to worry, I learned loads about luxadm, OBP, FC-AL disks and so on). I
decided to press ahead regardless and run the machine with c1t33d0 and c1t2d0,
but first I ran a surface analysis on the new disk. Hundreds of errors, where it
said that the error was repairable, but was unable to determine the block ID to
repair it.
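
For anyone unfamiliar with Solaris disk names: c1t33d0 reads as controller 1,
target 33, disk (LUN) 0, optionally followed by a slice such as s2. So the
Fujitsu drive appeared at target 33 where the Sun drive had appeared at
target 1. A minimal Python sketch that splits such names follows; the
`parse_ctd` helper and `CtdName` type are hypothetical names made up for
illustration and are not part of any Solaris tool.

```python
import re
from typing import NamedTuple, Optional


class CtdName(NamedTuple):
    """Parts of a Solaris cXtYdZ[sN] disk name (illustrative type)."""
    controller: int
    target: Optional[int]    # some buses have no target field
    disk: int
    slice_no: Optional[int]  # None when the name carries no slice


# cX, optional tY, dZ, optional sN
_CTD_RE = re.compile(r"^c(\d+)(?:t(\d+))?d(\d+)(?:s(\d+))?$")


def parse_ctd(name: str) -> CtdName:
    """Split a name such as 'c1t33d0' into its numeric fields."""
    match = _CTD_RE.match(name)
    if match is None:
        raise ValueError(f"not a cXtYdZ[sN] name: {name!r}")
    c, t, d, s = match.groups()
    return CtdName(
        controller=int(c),
        target=int(t) if t is not None else None,
        disk=int(d),
        slice_no=int(s) if s is not None else None,
    )


if __name__ == "__main__":
    for name in ("c1t1d0", "c1t2d0", "c1t33d0"):
        print(name, parse_ctd(name))
```

Comparing the `target` field of the old and new disk makes the mismatch
described above explicit.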

I contacted the supplier again, and he sent me another new disk, only this time
another Sun badged one, identical to the original. This came up as c1t1d0
immediately, surface analysis ran fine, so I installed it. By this time, I'd
reinstalled everything (again) on c1t2d0, so I just made c1t1d0 the mirror. It's
been absolutely fine for about 10 days now. (I do need to swap the boot disk
back to c1t1d0, I suppose).

So ... my conclusion? I'd be inclined to think that the SB2000 *is* picky about
disks, and if I ever need to fit another, I shall be choosy about what I buy.
Would I ever buy another machine with FC-AL disks? Probably not.

Oh, and "Dear Mr. Sun", what's the point in fitting hot swappable disks to a
machine with a power supply interlock on its access panel, so you can't open it
up "hot" to swap the disks anyway?


-- 
                 "Religion poisons everything."
            [email me at huge {at} huge (dot) org <dot> uk]
Reply Huge 9/27/2007 2:33:17 PM


Huge wrote:
> On 2007-09-27, Dave <sorry-no-email@nowhere.com> wrote:
> 
>>I'm thinking of buying a used Blade 2000. I've been offered a Dell 147 
>>GB F-CAL (fibre) disk. Are Suns fussy about their disks, or will pretty 
>>much any FCAL disk work in a Blade 2000?
>>
>>I know on the older machines, it rarely seems to matter who makes the 
>>SCSI disk. One usually has to label non-Sun disks, but that is about it. 
>>But I've no idea if this freedom extends to fibre disks.
> 
> 
> W-e-e-e-e-e-e-e-e-e-e-e-llll. This is long, but bear with me...
> 
> I bought a SB2000 a couple of months ago, from eBay, after my main home machine
> (an Ultra 60) died in a lightning strike. (Trying to claim on the insurance for
> a non-PC, non-Mac computer was amusing, but that's another story). The SB2000
> has twin 73GB disks, 4 GB of memory and twin 1.2GHz processors and initially I
> was delighted with it.
> 
> I reinstalled everything, restored from the U60 backup tapes and was poised on
> the brink of getting disk mirroring set up (which I'd put off, because it's fiddly
> to do and the sort of thing I only do rarely (the systems I work on at work are
> set up by other people)) when it started crashing. No errors, nothing in syslog,
> it just died. I dd'd /home to the other disk to save my work. Then I noticed I
> was getting UFS log rollover errors from the boot disk - they weren't getting
> syslogged, just coming on the console, so I never saw them unless I was actually
> there. After a day of crashes of increasing frequency, it would no longer boot
> from the main disk. Booting from DVD, the main disk could no longer be seen by
> the system at all. probe-scsi also couldn't see it at all.
> 
> So I contacted the supplier, who was absolutely brilliant throughout, and he
> sent me another 73GB disk. This is where it gets relevant to you. The original
> disks were Sun badged ones. The replacement disk was a Fujitsu one. The original
> disk was c1t1d0, the new one came up as c1t33d0, and I *could* *not* make its
> logical ID correct (and no-one responded when I posted here asking for help -
> not to worry, I learned loads about luxadm, OBP, FC-AL disks and so on). I
> decided to press ahead regardless and run the machine with c1t33d0 and c1t2d0,
> but first I ran a surface analysis on the new disk. Hundreds of errors, where it
> said that the error was repairable, but was unable to determine the block ID to
> repair it.
> 
> I contacted the supplier again, and he sent me another new disk, only this time
> another Sun badged one, identical to the original. This came up as c1t1d0
> immediately, surface analysis ran fine, so I installed it. By this time, I'd
> reinstalled everything (again) on c1t2d0, so I just made c1t1d0 the mirror. It's
> been absolutely fine for about 10 days now. (I do need to swap the boot disk
> back to c1t1d0, I suppose).
> 
> So ... my conclusion? I'd be inclined to think that the SB2000 *is* picky about
> disks, and if I ever need to fit another, I shall be choosy about what I buy.
> Would I ever buy another machine with FC-AL disks? Probably not.
> 
> Oh, and "Dear Mr. Sun", what's the point in fitting hot swappable disks to a
> machine with a power supply interlock on its access panel, so you can't open it
> up "hot" to swap the disks anyway?
> 
> 

Just guessing but it might have something to do with buying disks in 
quantity and/or not needing to stock hundreds of different replacement 
disks.

Reply Richard 9/27/2007 3:28:17 PM

On Sep 27, 6:37 am, Dave <sorry-no-em...@nowhere.com> wrote:
> I'm thinking of buying a used Blade 2000. I've been offered a Dell 147
> GB F-CAL (fibre) disk. Are Suns fussy about their disks, or will pretty
> much any FCAL disk work in a Blade 2000?
>
> I know on the older machines, it rarely seems to matter who makes the
> SCSI disk. One usually has to label non-Sun disks, but that is about it.
> But I've no idea if this freedom extends to fibre disks.

YMMV, as they say. Huge had problems with one disk, whereas I have 50-odd
of them and nary a squawk from one in the last X number of years. Lest we
forget, SAN used to be (pretty much) exclusively FC-AL. Anyway, that said,
all my disks are Sun branded, and each one has the latest firmware updates.
How would you, if you had to, update a Dell disk? Further, I have used lots
of SCSI too, and the non-Sun ones did on occasion act strangely. Some died
the death. All the Sun ones still live :> Recently I purchased an HBA for
my 2000. Too bad the PCI only has one 66 MHz slot, as I built an external
500 GB SATA-II drive and it works very well and was under 200 CDN dollars.
Fast even at half speed. The 2000 with dual 1.2's is a fine box. I don't
think they have released anything to replace it YET that's worth buying
new or used. Still waiting.

Reply gerryt 9/27/2007 3:41:42 PM

On 2007-09-27, Richard B. Gilbert <rgilbert88@comcast.net> wrote:
> Huge wrote:


>> So ... my conclusion? I'd be inclined to think that the SB2000 *is* picky about
>> disks, and if I ever need to fit another, I shall be choosy about what I buy.
>> Would I ever buy another machine with FC-AL disks? Probably not.
>> 
>> Oh, and "Dear Mr. Sun", what's the point in fitting hot swappable disks to a
>> machine with a power supply interlock on its access panel, so you can't open it
>> up "hot" to swap the disks anyway?
>> 
>> 
>
> Just guessing but it might have something to do with buying disks in 
> quantity and/or not needing to stock hundreds of different replacement 
> disks.

It was a rhetorical question...   :o)


-- 
                 "Religion poisons everything."
            [email me at huge {at} huge (dot) org <dot> uk]
Reply Huge 9/27/2007 4:00:36 PM

In comp.sys.sun.admin Huge <Huge@nowhere.much.invalid> wrote:
> So ... my conclusion? I'd be inclined to think that the SB2000 *is* picky
> about disks, and if I ever need to fit another, I shall be choosy about
> what I buy. Would I ever buy another machine with FC-AL disks? Probably not.

Not my experience. I recently put "defective" disks from an EMC
Symmetrix DMX2 into my SB1000 and had no issues at all.

These are Seagate disks with a custom EMC firmware (they identify themselves
as SX3146707 instead of ST3146707). The two 146GB disks replaced the
original 36GB disks.

-- 
Daniel
Reply Daniel 9/27/2007 6:14:31 PM

Daniel Rock wrote:
> In comp.sys.sun.admin Huge <Huge@nowhere.much.invalid> wrote:
> 
>>So ... my conclusion? I'd be inclined to think that the SB2000 *is* picky
>>about disks, and if I ever need to fit another, I shall be choosy about
>>what I buy. Would I ever buy another machine with FC-AL disks? Probably not.
> 
> 
> Not my experience. I recently put "defective" disks from an EMC
> Symmetrix DMX2 into my SB1000 and had no issues at all.
> 
> These are Seagate disks with a custom EMC firmware (they identify themselves
> as SX3146707 instead of ST3146707). The two 146GB disks replaced the
> original 36GB disks.
> 

If EMC thinks they are defective they MIGHT be OK.  EMC tends to be 
ultra conservative; I worked with an EMC 3630 for several years without 
a single failure!  Every once in a while someone from EMC would show up 
and replace a cable or a circuit board but this was almost always done 
without down time!  What they replaced was working but they considered 
the replacement to be "better" or "more reliable".  If it's not the most 
reliable storage on the planet, it comes close!  If it's not the most 
expensive storage on the planet, it comes close!

It's worth noting that there is almost NO secondary market in EMC 
equipment; if you didn't buy it from EMC, they won't support it!


Reply Richard 9/27/2007 7:31:26 PM

In comp.sys.sun.admin Richard B. Gilbert <rgilbert88@comcast.net> wrote:
> If EMC thinks they are defective they MIGHT be OK.

I know. That's why I put them into the SB1000. We have a policy of not
giving back failed disks. Instead we keep them and physically destroy
them from time to time.

90% of the "failed" EMC disks are indeed Ok. They may have spin-up problems
or a few entries in the grown defect list. But basically they are Ok.

The EMC firmware also doesn't prevent using them in another environment.

> It's worth noting that there is almost NO secondary market in EMC 
> equipment; if you didn't buy it from EMC, they won't support it!

Why would EMC support its disks in a SB1000?

-- 
Daniel
Reply Daniel 9/27/2007 9:56:57 PM

In comp.sys.sun.hardware Daniel Rock <v200739@deadcafe.de> wrote:
> In comp.sys.sun.admin Richard B. Gilbert <rgilbert88@comcast.net> wrote:
>> If EMC thinks they are defective they MIGHT be OK.
> 
> I know. That's why I put them into the SB1000. We have a policy of not
> giving back failed disks. Instead we keep them and physically destroy
> them from time to time.
> 
> 90% of the "failed" EMC disks are indeed Ok. They may have spin-up problems
> or a few entries in the grown defect list. But basically they are Ok.

I'd say a disk that doesn't always spin up correctly is not a good place 
to store data.
Reply Cydrome 9/28/2007 3:47:45 AM

According to Huge  <huge@huge.org.uk>:

	[ ... ]

> So I contacted the supplier, who was absolutely brilliant throughout, and he
> sent me another 73GB disk. This is where it gets relevant to you. The original
> disks were Sun badged ones. The replacement disk was a Fujitsu one. The original
> disk was c1t1d0, the new one came up as c1t33d0, and I *could* *not* make its
> logical ID correct (and no-one responded when I posted here asking for help -
> not to worry, I learned loads about luxadm, OBP, FC-AL disks and so on). I
> decided to press ahead regardless and run the machine with c1t33d0 and c1t2d0,
> but first I ran a surface analysis on the new disk. Hundreds of errors, where it
> said that the error was repairable, but was unable to determine the block ID to
> repair it.

	Hmm ... that may be because the new disk had a different WWN
(World Wide Name) which makes all FC-AL disks unique, and Solaris had
already allocated c1t1d0 to another WWN.  The way to fix that is to run
devfsadm with the -C (cleanup) option, so it removes the data about the
old WWN and frees c1t1d0 for the new one.

	Perhaps the reason that things worked as desired with the third
disk is that you had done a fresh install of Solaris on c1t2d0 while
there was no disk in the c1t1d0 slot.

	An interesting thing, BTW: the Sun Fire 280R (which I have) uses
the same system board, but a slightly different drive cage, and it
assigns c1t0d0 and c1t1d0 instead of the c1t1d0 and c1t2d0 which the Sun
Blade 2000 does.  Another difference is that the Sun Fire 280R's drive
cage will only accept 1" high drives, while the Sun Blade 2000's drive
cage will accept 1.6" high drives.

> I contacted the supplier again, and he sent me another new disk, only this time
> another Sun badged one, identical to the original. This came up as c1t1d0
> immediately, surface analysis ran fine, so I installed it. By this time, I'd
> reinstalled everything (again) on c1t2d0,

	With nothing in the c1t1d0 slot?  So Solaris was able to start
from scratch, with no WWN conflicts.

>                                           so I just made c1t1d0 the mirror. It's
> been absolutely fine for about 10 days now. (I do need to swap the boot disk
> back to c1t1d0, I suppose).
> 
> So ... my conclusion? I'd be inclined to think that the SB2000 *is* picky about
> disks, and if I ever need to fit another, I shall be choosy about what I buy.
> Would I ever buy another machine with FC-AL disks? Probably not.

	My Sun Fire 280R came with no disks, and I bought a pair of 146
GB drives (non Sun) and both work with no problems.  Before I knew that
the Sun Fire 280R would not accept the 1.6" high drives, I had gotten a
pair of them (at about 180 GB) from eBay -- and I was able to test them
in a friend's Sun Blade 2000, and they just would not work at all.  The
vendor took them back, since he had not been able to test them either.

> Oh, and "Dear Mr. Sun", what's the point in fitting hot swappable disks to a
> machine with a power supply interlock on its access panel, so you can't open it
> up "hot" to swap the disks anyway?

	Because the same system board goes in the Sun Fire 280R, which
has both (1") drives, and both hot-swap power supplies changeable from
the front panel with no interlocks.  I guess that they figure that if
you don't need the hot swappable power supplies, you also don't need hot
swappable disks. :-)

	When you *do* hot swap them in the Sun Fire 280R, you do have to
umount them anyway, which may require rebooting onto another drive.

	I have a card cage for a Sun Blade 2000 which I plan to set up
for making duplicate drives for backups using the external FC-AL
connector.

	And I did not answer the original questions probably because I
did not have the Sun Fire 280R yet and thus had no experience with the
FC-AL disks.

	Good Luck,
		DoN.

-- 
 Email:   <dnichols@d-and-d.com>   | Voice (all times): (703) 938-4564
	(too) near Washington D.C. | http://www.d-and-d.com/dnichols/DoN.html
           --- Black Holes are where God is dividing by zero ---
Reply dnichols 9/28/2007 5:48:32 AM

In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
> I'd say a disk that doesn't always spin up correctly is not a good place 
> to store data.

So don't let it spin down.

The disk is in a workstation. If one fails the mirror is still Ok. If both
fail, I can restore the data from the backup.

-- 
Daniel
Reply Daniel 9/28/2007 8:45:40 AM

On 2007-09-28, DoN. Nichols <dnichols@d-and-d.com> wrote:
> According to Huge  <huge@huge.org.uk>:
>

[snippage]

>> but first I ran a surface analysis on the new disk. Hundreds of errors, where it
>> said that the error was repairable, but was unable to determine the block ID to
>> repair it.
>
> 	Hmm ... that may be because the new disk had a different WWN
> (World Wide Name) which makes all FC-AL disks unique, and Solaris had
> already allocated c1t1d0 to another WWN.  The way to fix that is to run
> devfsadm with the -C (cleanup) option, so it removes the data about the
> old WWN and frees c1t1d0 for the new one.
>
> 	Perhaps the reason that things worked as desired with the third
> disk is that you had done a fresh install of Solaris on c1t2d0 while
> there was no disk in the c1t1d0 slot.

This was happening even when booting from the install media, which builds its
device tree from scratch each time.

>> Oh, and "Dear Mr. Sun", what's the point in fitting hot swappable disks to a
>> machine with a power supply interlock on its access panel, so you can't open it
>> up "hot" to swap the disks anyway?
>
> 	Because the same system board goes in the Sun Fire 280R, which
> has both (1") drives, and both hot-swap power supplies changeable from
> the front panel with no interlocks.  I guess that they figure that if
> you don't need the hot swappable power supplies, you also don't need hot
> swappable disks. :-)

Ahhh, that makes sense. Although 30 seconds with some Scotch tape sorted out the
interlock.

> 	When you *do* hot swap them in the Sun Fire 280R, you do have to
> umount them anyway, which may require rebooting onto another drive.

In theory you can metadetach, then luxadm {remove - whatever the command is} the
drive on the SB2K, except you can't open the box!   :o)

> 	And I did not answer the original questions probably because I
> did not have the Sun Fire 280R yet and thus had no experience with the
> FC-AL disks.




-- 
                 "Religion poisons everything."
            [email me at huge {at} huge (dot) org <dot> uk]
Reply Huge 9/28/2007 9:09:50 AM

Huge wrote:

>>> Oh, and "Dear Mr. Sun", what's the point in fitting hot swappable disks to a
>>> machine with a power supply interlock on its access panel, so you can't open it
>>> up "hot" to swap the disks anyway?
>> 	Because the same system board goes in the Sun Fire 280R, which
>> has both (1") drives, and both hot-swap power supplies changeable from
>> the front panel with no interlocks.  I guess that they figure that if
>> you don't need the hot swappable power supplies, you also don't need hot
>> swappable disks. :-)
> 
> Ahhh, that makes sense. Although 30 seconds with some Scotch tape sorted out the
> interlock.

Personally, given the Blade 2000 is not designed for hot swapping of 
disks, I suspect it could be risky to swap them. I would suspect that both 
the disk and the socket it plugs into must be designed to allow hot-swap. It 
seems unlikely Sun would have designed the Blade 2000 to be 
hot-swappable, then put an interlock on it.

Of course, it may be that the disk just connects to a standard SCSI 
chip, and that takes care of it all. But unless I knew that to be the 
case, I personally would not risk it.

Reply Dave 9/28/2007 9:45:25 AM

On 2007-09-28, Dave <sorry-no-email@nowhere.com> wrote:
> Huge wrote:
>
>>>> Oh, and "Dear Mr. Sun", what's the point in fitting hot swappable disks to a
>>>> machine with a power supply interlock on its access panel, so you can't open it
>>>> up "hot" to swap the disks anyway?
>>> 	Because the same system board goes in the Sun Fire 280R, which
>>> has both (1") drives, and both hot-swap power supplies changeable from
>>> the front panel with no interlocks.  I guess that they figure that if
>>> you don't need the hot swappable power supplies, you also don't need hot
>>> swappable disks. :-)
>> 
>> Ahhh, that makes sense. Although 30 seconds with some Scotch tape sorted out the
>> interlock.
>
> Personally, given the Blade 2000 is not designed for hot swapping of 
> disks, I suspect it could be risky to swap them. I would suspect that both 
> the disk and the socket it plugs into must be designed to allow hot-swap. It 
> seems unlikely Sun would have designed the Blade 2000 to be 
> hot-swappable, then put an interlock on it.

Oh, I think you underestimate the stupidity of the Health and Safety fascists.


-- 
                 "Religion poisons everything."
            [email me at huge {at} huge (dot) org <dot> uk]
Reply Huge 9/28/2007 10:38:41 AM

In comp.sys.sun.hardware Daniel Rock <v200739@deadcafe.de> wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> I'd say a disk that doesn't always spin up correctly is not a good place 
>> to store data.
> 
> So don't let it spin down.
> 
> The disk is in a workstation. If one fails the mirror is still Ok. If both
> fail, I can restore the data from the backup.

Do you drive around with a flat tire? Three out of four isn't too bad.
Reply Cydrome 10/1/2007 1:42:03 AM

In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
> Do you drive around with a flat tire? Three out of four isn't too bad.

Bad analogy.

Do you change your LCD screen if it shows a bad pixel?

-- 
Daniel
Reply Daniel 10/1/2007 12:59:25 PM

In comp.sys.sun.hardware Daniel Rock <v200740@deadcafe.de> wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> Do you drive around with a flat tire? Three out of four isn't too bad.
> 
> Bad analogy.
> 
> Do you change your LCD screen if it shows a bad pixel?

If my display has two pixels, and I know one is broken to start with, yes, 
I replace it.
Reply Cydrome 10/1/2007 3:15:36 PM

In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
> If my display has two pixels, and I know one is broken to start with, yes, 
> I replace it.

Do you replace your car if you jump-started it once?

-- 
Daniel
Reply Daniel 10/1/2007 4:37:25 PM

Daniel Rock wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> If my display has two pixels, and I know one is broken to start with, yes, 
>> I replace it.
> 
> Do you replace your car if you jump-started it once?

If my car doesn't work reliably, I find out what's wrong and fix or
replace it.  If the disk is purely scratch storage and I wouldn't miss
what's on it, then I would reuse the disk.  I however would not store
anything critical on the disk even if it is part of a mirror set.

If your car's steering failed once, would you keep on driving it?
Reply Douglas 10/1/2007 4:48:59 PM

In comp.sys.sun.hardware Daniel Rock <v200740@deadcafe.de> wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> If my display has two pixels, and I know one is broken to start with, yes, 
>> I replace it.
> 
> Do you replace your car if you jump-started it once?

I would replace what was broken or failing.

I prefer preventative maintenance, not cleaning up larger messes later. 
Reply Cydrome 10/1/2007 6:36:24 PM

In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
> I prefer preventative maintenance, not cleaning up larger messes later. 

A few "metattach" or "metareplace" are a larger mess?

-- 
Daniel
Reply Daniel 10/1/2007 9:39:25 PM

In comp.sys.sun.hardware Daniel Rock <v200740@deadcafe.de> wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> I prefer preventative maintenance, not cleaning up larger messes later. 
> 
> A few "metattach" or "metareplace" are a larger mess?

I guess if you're bored and have nothing better to do with a computer than 
stuff it full of blatantly broken drives from a junk pile, then rebuild 
the data you're just going to lose anyways next week, because you're 
probably also using RAM that's "mostly" OK and SCSI card that "sort of 
works" on a system board that's "almost always fine" with power supplies 
with fans that "usually" spin, then go for it.

Some people have slightly different standards, and know that drives don't 
fix themselves, and always just get worse and should be replaced at the 
first signs of trouble, at your convenience, not when they finally do 
catastrophically fail.





 
Reply Cydrome 10/2/2007 5:12:23 AM

In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
> In comp.sys.sun.hardware Daniel Rock <v200740@deadcafe.de> wrote:
>> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>> I prefer preventative maintenance, not cleaning up larger messes later. 
>> 
>> A few "metattach" or "metareplace" are a larger mess?
> 
> Some people have slightly different standards, and know that drives don't 
> fix themselves, and always just get worse and should be replaced at the 
> first signs of trouble, at your convenience, not when they finally do 
> catastrophically fail.

Do you just replace a flat tire or the entire car?


Let's calculate the probability of a total failure...

Normal SCSI drives have an AFR of ~3%. Let's say the AFR of these drives
is 10 times higher (i.e. 30%). Let's also assume it takes on average 48 hours
to replace a broken drive.

What is the probability that two drives fail within 48 hours?

The probability is ~0.05% p.a. (0.3 * 0.3 * (2/365))
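
The same arithmetic as a small Python sketch, for anyone who wants to try
other numbers. It only restates the formula above (AFR of each drive times
the fraction of the year covered by the replacement window) and assumes the
two drives fail independently; the function name is made up for this example.

```python
def double_failure_probability(afr_a: float, afr_b: float,
                               window_days: float,
                               days_per_year: float = 365.0) -> float:
    """Rough yearly probability that both mirror halves fail inside the
    same replacement window, assuming independent failures."""
    return afr_a * afr_b * (window_days / days_per_year)


# 30% AFR per drive, 48 hours (2 days) to replace a failed drive
p = double_failure_probability(0.3, 0.3, 2)
print(f"{p:.4%}")  # prints 0.0493%
```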


BTW this is the SMART output of one of the drives:

Device: SEAGATE  SX3146807FC      Version: D010
Device type: disk
Transport protocol: Fibre channel (FCP-2)
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Elements in grown defect list: 8
Vendor (Seagate) cache information
  Blocks sent to initiator = 323870666916662
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 27478.45
  number of minutes until next internal SMART test = 10
Reply Daniel 10/2/2007 7:55:51 AM

Daniel Rock wrote:
> <snip>
> 
> Let's calculate the probability of a total failure...
> 
> Normal SCSI drives have an AFR of ~3%. Let's say the AFR of these drives
> is 10 times higher (i.e. 30%). Let's also assume it takes on average 48 hours
> to replace a broken drive.
> 
> What is the probability that two drives fail within 48 hours?
> 
> The probability is ~0.05% p.a. (0.3 * 0.3 * (2/365))

You've assumed that drives fail independently of each other.  If the
drives have a higher probability of failure following some event (e.g.,
a power cycle), then your calculation is flawed.  Take another example;
the drive has a 30% chance of not spinning up after a power cycle.  The
probability of a catastrophic failure is
  0.3 * 0.3 * P(power cycle)
Since you know you will power cycle at some point, you have a 9% chance
of losing your data at that point.  Not a risk I'd take.
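
A rough Python sketch of that event-based model, for comparison with the
time-window one. It uses the 30% figure from the example; the function names
are made up here, and the second function adds an assumption of my own (each
power cycle is an independent trial) that is not part of the argument above.

```python
def p_both_fail_at_event(p_fail_a: float, p_fail_b: float) -> float:
    """Probability that both drives fail at the same event (for example a
    power cycle), given each drive's per-event failure probability."""
    return p_fail_a * p_fail_b


def p_loss_over_events(p_per_event: float, events: int) -> float:
    """Probability of at least one double failure across several events.
    Assumption: events are independent trials with the same probability."""
    return 1.0 - (1.0 - p_per_event) ** events


per_cycle = p_both_fail_at_event(0.3, 0.3)
print(f"per power cycle: {per_cycle:.1%}")
print(f"over 10 cycles:  {p_loss_over_events(per_cycle, 10):.1%}")
```

Unlike the window model, nothing here is scaled by 2/365, which is why the
per-event figure comes out so much larger.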
Reply Douglas 10/2/2007 2:16:55 PM

In comp.sys.sun.admin Douglas O'Neal <oneal@dbi.udel.edu> wrote:
> Since you know you will power cycle at some point, you have a 9% chance
> of losing your data at that point.  Not a risk I'd take.

You are assuming that the drive will never again spin up after a power
cycle.

This assumption is flawed.

-- 
Daniel
Reply Daniel 10/2/2007 2:39:47 PM

In comp.sys.sun.hardware Daniel Rock <v200740@deadcafe.de> wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> In comp.sys.sun.hardware Daniel Rock <v200740@deadcafe.de> wrote:
>>> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>>> I prefer preventative maintenance, not cleaning up larger messes later. 
>>> 
>>> A few "metattach" or "metareplace" are a larger mess?
>> 
>> Some people have slightly different standards, and know that drives don't 
>> fix themselves, and always just get worse and should be replaced at the 
>> first signs of trouble, at your convenience, not when they finally do 
>> catastrophically fail.
> 
> Do you just replace a flat tire or the entire car?

The car/tires analogy is best summed up as: the person who uses broken 
drives puts leaky tires on their car and hopes they don't go flat, and if 
one does, they're fine with three good tires, and they saved a few 
dollars because they're witty.

> Let's calculate the probability of a total failure...
> 
> Normal SCSI drives have an AFR of ~3%. Let's say the AFR of these drives
> is 10 times higher (i.e. 30%). Let's also assume it takes on average 48 hours
> to replace a broken drive.
> 
> What is the probability that two drives fail within 48 hours?

more than you'd expect. I've seen plenty of double disk failures.

> The probability is ~0.05% p.a. (0.3 * 0.3 * (2/365))

I can simplify that equation into:

it's stupid to put broken disks back into a machine, no matter what 
nonsense math you try to justify it with.

> 
> BTW this is the SMART output of one of the drives:
> 
> Device: SEAGATE  SX3146807FC      Version: D010
> Device type: disk
> Transport protocol: Fibre channel (FCP-2)
> Device supports SMART and is Enabled
> Temperature Warning Disabled or Not Supported
> SMART Health Status: OK
> 
> Elements in grown defect list: 8
> Vendor (Seagate) cache information
>  Blocks sent to initiator = 323870666916662
> Vendor (Seagate/Hitachi) factory information
>  number of hours powered up = 27478.45
>  number of minutes until next internal SMART test = 10

You're blinding yourself.

You know the drive doesn't always spin up. No amount of smart data cancels 
that out.

just throw the drive out or RMA it.
Reply Cydrome 10/2/2007 4:08:40 PM

In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
> just throw the drive out or RMA it.

Why should I pay for it?

-- 
Daniel
Reply Daniel 10/2/2007 4:11:04 PM

Daniel Rock wrote:
> In comp.sys.sun.admin Douglas O'Neal <oneal@dbi.udel.edu> wrote:
>> Since you know you will power cycle at some point, you have a 9% chance
>> of losing your data at that point.  Not a risk I'd take.
> 
> You are assuming that the drive will never again spin up after a power
> cycle.
> 
> This assumption is flawed.

Agreed, my probability is too high.  But the point is that the 0.05%
catastrophic probability you calculated is too low.  And if we take
a number somewhere in the middle, say 0.5% chance of catastrophic
failure per power cycle, that would be way too high for me to trust
with critical data.
Reply Douglas 10/2/2007 5:14:52 PM

Daniel Rock wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> In comp.sys.sun.hardware Daniel Rock <v200740@deadcafe.de> wrote:
>>> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>>> I prefer preventative maintenance, not cleaning up larger messes later. 
>>> A few "metattach" or "metareplace" are a larger mess?
>> Some people have slightly different standards, and know that drives don't 
>> fix themselves, and always just get worse and should be replaced at the 
>> first signs of trouble, at your convenience, not when they finally do 
>> catastrophically fail.
> 
> Do you just replace a flat tire or the entire car?
> 
> 
> Let's calculate the probability of a total failure...
> 
> Normal SCSI drives have a AFR of ~3%. Let's say the AFR of these drives
> is 10 times higher (i.e. 30%). Let's also assume it takes on average 48 hours
> to replace a broken drive.

MTBF is around 800kh for SCSI disks giving AFR=1.095%

> What is the probability that two drives fail within 48 hours?
> 
> The probability is ~0.05% p.a. (0.3 * 0.3 * (2/365))

0.0000657% is more correct from above.


> 
> BTW this is the SMART output of one of the drives:
> 
> Device: SEAGATE  SX3146807FC      Version: D010

But this disk has 1200000 h MTBF so here we have an AFR of 0.73%
and your probability therefore 0.0000292%

Quite a difference...
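
For anyone who wants to redo the arithmetic, a small Python sketch follows. It uses the same formula as the quoted calculation (AFR x AFR x 2/365) and converts MTBF to AFR as 8760 hours per year divided by MTBF. It assumes independent failures at a constant rate; the function names are only illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate as a fraction, assuming a constant failure rate."""
    return HOURS_PER_YEAR / mtbf_hours


def double_failure_probability(afr: float, window_days: float = 2) -> float:
    """Same formula as quoted above: AFR * AFR * (window / 365).

    Treats the two drive failures as independent events.
    """
    return afr * afr * (window_days / 365)


if __name__ == "__main__":
    cases = (
        ("AFR 30% (pessimistic guess)", 0.30),
        ("MTBF 800,000 h", afr_from_mtbf(800_000)),
        ("MTBF 1,200,000 h", afr_from_mtbf(1_200_000)),
    )
    for label, afr in cases:
        p = double_failure_probability(afr)
        print(f"{label}: AFR = {afr:.3%}, double failure = {p:.7%} p.a.")
```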


http://www.seagate.com/support/disc/specs/fc/st3146807fc.html


> Device type: disk
> Transport protocol: Fibre channel (FCP-2)
> Device supports SMART and is Enabled
> Temperature Warning Disabled or Not Supported
> SMART Health Status: OK
> 
> Elements in grown defect list: 8
> Vendor (Seagate) cache information
>   Blocks sent to initiator = 323870666916662
> Vendor (Seagate/Hitachi) factory information
>   number of hours powered up = 27478.45
>   number of minutes until next internal SMART test = 10
0
Reply Thommy 10/2/2007 5:42:44 PM

In comp.sys.sun.admin Douglas O'Neal <oneal@dbi.udel.edu> wrote:
> that would be way too high for me to trust with critical data.

Who said there was critical data?

-- 
Daniel
0
Reply Daniel 10/2/2007 7:16:52 PM

In comp.sys.sun.hardware Daniel Rock <v200740@deadcafe.de> wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> just throw the drive out or RMA it.
> 
> Why should I pay for it?
> 

You're a cheap hobbyist, you shouldn't pay for anything.

do what suits your needs best.


0
Reply Cydrome 10/2/2007 7:58:32 PM

Thommy M. wrote:

>> Let's calculate the probability of a total failure...
>>
>> Normal SCSI drives have a AFR of ~3%. Let's say the AFR of these drives
>> is 10 times higher (i.e. 30%). Let's also assume it takes on average 48 hours
>> to replace a broken drive.
> 
> MTBF is around 800kh for SCSI disks giving AFR=1.095%
> 
>> What is the probability that two drives fail within 48 hours?
>>
>> The probability is ~0.05% p.a. (0.3 * 0.3 * (2/365))
> 
> 0.0000657% is more correct from above.


You need to be careful in interpreting MTBF of disks. The MTBF is based 
on the assumption that the disk will be replaced (even if working) at 
the end of the service life, which is typically 5 years for a SCSI disk 
- I found that on the Seagate web site once.

An MTBF of 1,000,000 hours does *not* mean the disks will last an 
average time of 1,000,000 hours or 114 years if you switch them on and 
never replace them. They will on average last a LOT less than that. 
Although I have no data on it, I doubt any single disk would be working 
114 years later!

During that 5 years, the disk is likely to be under warranty anyway.
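
One way to read the MTBF figure is as a statement about a population of disks within their service life, not about the lifetime of a single disk. The Python sketch below illustrates that reading. It assumes a constant failure rate during the service life, and the fleet size of 1000 disks is just a made-up example.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def expected_failures_per_year(n_disks: int, mtbf_hours: float) -> float:
    """Expected failures per year in a fleet, all disks within service life.

    Assumes a constant failure rate (an illustrative simplification).
    """
    return n_disks * HOURS_PER_YEAR / mtbf_hours


def naive_lifetime_years(mtbf_hours: float) -> float:
    """The misleading reading: MTBF taken as the life of one disk."""
    return mtbf_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    mtbf = 1_000_000
    print(f"Naive single-disk 'lifetime': {naive_lifetime_years(mtbf):.0f} years")
    print(f"Expected failures per year among 1000 disks: "
          f"{expected_failures_per_year(1000, mtbf):.2f}")
```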

I think the point at which one disposes of disks depends on one's 
circumstances. If it's an important server in your company, it might be 
wise to replace them every 5 years. If it's on a less important system, 
you might not do so until logs indicate a problem. If it's for a home 
machine, and not one used to store important information, one might 
tolerate a few errors.

I don't know how many Suns are used by hobbyists, but I suspect there 
are quite a few. A previous employer had a site licence for a piece of 
software, which allowed one to use a copy at home. I asked for a SPARC 
licence for use at home, and was initially declined this as "a Sun SPARC 
is not considered a home computer". After some discussions they agreed 
as a "one-off".

I would not personally use a disk that did not reliably spin up (even at 
home as a scratch disk), but I would not criticise someone who felt in 
their circumstances that was appropriate. Clearly anyone doing that on an 
important server in their company would need their head tested!






0
Reply Dave 10/3/2007 7:24:40 AM

On 2007-10-03, Dave <someplace@nowhere-nice.com> wrote:

> If it's an important server in your company, it might be 
> wise to replace them every 5 years.

You jest. We have over 3000 Unix servers. Wild guesstimate, 12,000 disks.
Replace 2400 disks a year? Nonsense.

-- 
                 "Religion poisons everything."
            [email me at huge {at} huge (dot) org <dot> uk]
0
Reply Huge 10/4/2007 9:02:28 AM

In comp.sys.sun.admin Huge <Huge@nowhere.much.invalid> wrote:
> On 2007-10-03, Dave <someplace@nowhere-nice.com> wrote:
> 
>> If it's an important server in your company, it might be 
>> wise to replace them every 5 years.
> 
> You jest. We have over 3000 Unix servers. Wild guesstimate, 12,000 disks.
> Replace 2400 disks a year? Nonsense.

Just let the machines age and you will be replacing that many at some 
point.
0
Reply Cydrome 10/5/2007 12:00:02 AM

Daniel Rock wrote:
> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>> just throw the drive out or RMA it.
> 
> Why should I pay for it?
> 
Rock on Daniel, I think you know more about disks than the others know 
about cars :-)
/Jorgen
0
Reply Jorgen 10/5/2007 1:13:03 AM

In comp.sys.sun.hardware Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
> Daniel Rock wrote:
>> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>> just throw the drive out or RMA it.
>> 
>> Why should I pay for it?
>> 
> Rock on Daniel, I think you know more about disks than the others know 
> about cars :-)
> /Jorgen

yup, it's always best to use broken parts, and when things do fail, to do 
nothing. Problems with machines only get better with time, they're self 
healing.
0
Reply Cydrome 10/5/2007 3:03:06 PM

Cydrome Leader wrote:
> In comp.sys.sun.hardware Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>> Daniel Rock wrote:
>>> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>>> just throw the drive out or RMA it.
>>> Why should I pay for it?
>>>
>> Rock on Daniel, I think you know more about disks than the others know 
>> about cars :-)
>> /Jorgen
> 
> yup, it's always best to use broken parts, and when things do fail, to do 
> nothing. Problems with machines only get better with time, they're self 
> healing.

SCSI and FC-AL disks are "self-healing": lots of spare tracks/cylinders,
replacement and caching tables, one spare sector per cylinder and two spare
cylinders per surface, as I recall,
and several copies of the bootcode/os/rtc.
It is very easy to monitor the grown defect list or I/O errors, or to use SMART.
I can only see one scary situation: if two drives were manufactured on the same day.
Well, that's fine if you have backups :-)
/jorgen
0
Reply Jorgen 10/5/2007 9:51:25 PM

In comp.sys.sun.admin Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
> Cydrome Leader wrote:
>> In comp.sys.sun.hardware Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>>> Daniel Rock wrote:
>>>> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>>>> just throw the drive out or RMA it.
>>>> Why should I pay for it?
>>>>
>>> Rock on Daniel, I think you know more about disks than the others know 
>>> about cars :-)
>>> /Jorgen
>> 
>> yup, it's always best to use broken parts, and when things do fail, to do 
>> nothing. Problems with machines only get better with time, they're self 
>> healing.
> scsi and fcal disks are "selfhealing", lots of spare tracks/cyls.
> replacement and cacheing tables, one spare sector per cyl and two spare
> cyls per surface as i recall.

None of this keeps errors from happening in the first place. Spare sectors 
don't prevent unrecoverable read errors. It's also pretty well known 
that once you start to see errors (and that means the drive is warning 
you because there's something wrong), things only go downhill from there.

These media errors tend to grow. I know this is a link to PC Magazine of all 
places, but it's a short link

http://www.pcmag.com/encyclopedia_term/0,2542,t=hard+disk+defect+management&i=55545,00.asp


> and several copies of the bootcode/os/rtc.
> very easy to monitoring grown defect list or ioerrors or use SMART.
> can only see one scary situation, if 2 drives are manufactured the same day.
> well if having backups :-)
> /jorgen

Plenty of drive problems are mechanical. Having 9000% spare data on 
platters doesn't help if you crashed a head or your disk won't spin up.

drives don't heal themselves. They never improve the state they're in. 
If they start to throw errors, replace them.


0
Reply Cydrome 10/5/2007 11:22:41 PM

Cydrome Leader wrote:
> In comp.sys.sun.admin Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
> 
>>Cydrome Leader wrote:
>>
>>>In comp.sys.sun.hardware Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>>>
>>>>Daniel Rock wrote:
>>>>
>>>>>In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>>>>
>>>>>>just throw the drive out or RMA it.
>>>>>
>>>>>Why should I pay for it?
>>>>>
>>>>
>>>>Rock on Daniel, I think you know more about disks than the others know 
>>>>about cars :-)
>>>>/Jorgen
>>>
>>>yup, it's always best to use broken parts, and when things do fail, to do 
>>>nothing. Problems with machines only get better with time, they're self 
>>>healing.
>>
>>scsi and fcal disks are "selfhealing", lots of spare tracks/cyls.
>>replacement and cacheing tables, one spare sector per cyl and two spare
>>cyls per surface as i recall.
> 
> 
> None of this keeps errors from happening in the first place. Spare sectors 
> don't make unrecoverable read errors not happen. It's also pretty known 
> that once you start to see errors (and that means the drive has warning 
> you because there's something wrong), things only go downhill from there.
> 

Disk drives can and do survive a block becoming unreadable.  SCSI drives 
can "revector" a bad block.  In some operating systems, the disk driver 
works in conjunction with the disk to copy data from a questionable 
block to a replacement block.  This looks, to the user, like "self 
healing".  If a bad block is revectored, it's not an indication of a 
serious problem.

What IS an indication of a serious problem is A PATTERN of bad blocks 
being revectored.  When you see that, it's time to replace the disk.
Do it NOW!  Tomorrow may be too late.

0
Reply Richard 10/6/2007 1:53:59 AM

In comp.sys.sun.admin Richard B. Gilbert <rgilbert88@comcast.net> wrote:
> Cydrome Leader wrote:
>> In comp.sys.sun.admin Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>> 
>>>Cydrome Leader wrote:
>>>
>>>>In comp.sys.sun.hardware Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>>>>
>>>>>Daniel Rock wrote:
>>>>>
>>>>>>In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>>>>>
>>>>>>>just throw the drive out or RMA it.
>>>>>>
>>>>>>Why should I pay for it?
>>>>>>
>>>>>
>>>>>Rock on Daniel, I think you know more about disks than the others know 
>>>>>about cars :-)
>>>>>/Jorgen
>>>>
>>>>yup, it's always best to use broken parts, and when things do fail, to do 
>>>>nothing. Problems with machines only get better with time, they're self 
>>>>healing.
>>>
>>>scsi and fcal disks are "selfhealing", lots of spare tracks/cyls.
>>>replacement and cacheing tables, one spare sector per cyl and two spare
>>>cyls per surface as i recall.
>> 
>> 
>> None of this keeps errors from happening in the first place. Spare sectors 
>> don't make unrecoverable read errors not happen. It's also pretty known 
>> that once you start to see errors (and that means the drive has warning 
>> you because there's something wrong), things only go downhill from there.
>> 
> 
> Disk drives can and do survive a block becoming unreadable.  SCSI drives 
> can "revector" a bad block.  In some operating systems, the disk driver 
> works in conjunction with the disk to copy data from a questionable 
> block to a replacement block.  This looks, to the user, like "self 
> healing".  If a bad block is revectored, it's not an indication of a 
> serious problem.
> 
> What IS an indication of a serious problem is A PATTERN of bad blocks 
> being revectored.  When you see that, it's time to replace the the disk.
> Do it NOW!  Tomorrow may be too late.
> 

and it generally always is a pattern of failing blocks, not one random one 
and then things are great again for years.


0
Reply Cydrome 10/6/2007 9:33:44 PM

Cydrome Leader wrote:
> In comp.sys.sun.admin Richard B. Gilbert <rgilbert88@comcast.net> wrote:
> 
>>Cydrome Leader wrote:
>>
>>>In comp.sys.sun.admin Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>>>
>>>
>>>>Cydrome Leader wrote:
>>>>
>>>>
>>>>>In comp.sys.sun.hardware Jorgen Moquist <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>>>>>
>>>>>
>>>>>>Daniel Rock wrote:
>>>>>>
>>>>>>
>>>>>>>In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>>just throw the drive out or RMA it.
>>>>>>>
>>>>>>>Why should I pay for it?
>>>>>>>
>>>>>>
>>>>>>Rock on Daniel, I think you know more about disks than the others know 
>>>>>>about cars :-)
>>>>>>/Jorgen
>>>>>
>>>>>yup, it's always best to use broken parts, and when things do fail, to do 
>>>>>nothing. Problems with machines only get better with time, they're self 
>>>>>healing.
>>>>
>>>>scsi and fcal disks are "selfhealing", lots of spare tracks/cyls.
>>>>replacement and cacheing tables, one spare sector per cyl and two spare
>>>>cyls per surface as i recall.
>>>
>>>
>>>None of this keeps errors from happening in the first place. Spare sectors 
>>>don't make unrecoverable read errors not happen. It's also pretty known 
>>>that once you start to see errors (and that means the drive has warning 
>>>you because there's something wrong), things only go downhill from there.
>>>
>>
>>Disk drives can and do survive a block becoming unreadable.  SCSI drives 
>>can "revector" a bad block.  In some operating systems, the disk driver 
>>works in conjunction with the disk to copy data from a questionable 
>>block to a replacement block.  This looks, to the user, like "self 
>>healing".  If a bad block is revectored, it's not an indication of a 
>>serious problem.
>>
>>What IS an indication of a serious problem is A PATTERN of bad blocks 
>>being revectored.  When you see that, it's time to replace the the disk.
>>Do it NOW!  Tomorrow may be too late.
>>
> 
> 
> and it generally always is a patter of failing blocks, not one random one 
> and then things are great again for years.
> 
> 

I think "always" is an overstatement.  I've seen a lot of disks over the 
years.  Some of them showed the pattern of failure I've described.  Some 
of them revectored a bad block or two and ran for several more years.

If a disk has critical data and is not a member of a RAID set you may be 
justified in replacing it the first time it detects a bad block.  In 
most cases I would not get excited about a single bad block.

It also makes a difference if you have a service contract or are doing 
"self maintenance".

0
Reply Richard 10/6/2007 10:54:56 PM

In comp.sys.sun.admin Richard B. Gilbert <rgilbert88@comcast.net> wrote:
> If a disk has critical data and is not a member of a RAID set you may be 
> justified in replacing it the first time it detects a bad block.  In 
> most cases I would not get excited about a single bad block.

If there is a power failure while the disk is in the middle of a write you
will most likely have a bad block. I don't worry if the number of entries
in the grown defect list keeps constant (five or less). I get suspicious
if the grown defect list grows silently.

Then I run a read analysis (or write analysis if possible) of the disk.
If the defect list has grown again, it is time to replace the disk.

But Sun doesn't care about the grown defect list if you want to replace
a disk under service contract. "Hopefully" a read analysis of the disk
finds an unrecoverable read error. When enough SCSI errors have filled up
/var/adm/messages you can finally convince Sun to replace the disk.
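
The rule of thumb above could be sketched in Python roughly as follows. This is only an illustration: the function name, the return strings and the "keep watching" outcome are assumptions added for the sketch, and the counts would have to come from whatever tool reports the grown defect list on your system.

```python
from typing import Optional


def triage_grown_defects(previous: int,
                         current: int,
                         after_analysis: Optional[int] = None,
                         stable_limit: int = 5) -> str:
    """Suggest an action from grown-defect-list counts.

    previous       -- count at the last check
    current        -- count now
    after_analysis -- count after a read (or write) analysis, if one was run
    stable_limit   -- "five or less" from the post; anything else is a guess
    """
    if current > previous:
        # The list grew silently: run an analysis, then look again.
        if after_analysis is None:
            return "run read analysis"
        # Grown again during the analysis: time to replace the disk.
        if after_analysis > current:
            return "replace"
        return "keep watching"  # hypothetical outcome, not from the post
    # Constant list: not a worry as long as it is small.
    if current <= stable_limit:
        return "ok"
    return "keep watching"  # hypothetical outcome, not from the post


if __name__ == "__main__":
    print(triage_grown_defects(3, 3))     # constant and small
    print(triage_grown_defects(3, 4))     # grew since the last check
    print(triage_grown_defects(3, 4, 6))  # grew again after the analysis
```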

-- 
Daniel
0
Reply Daniel 10/6/2007 11:47:35 PM

Richard B. Gilbert wrote:
> Cydrome Leader wrote:
>> In comp.sys.sun.admin Richard B. Gilbert <rgilbert88@comcast.net> wrote:
>>
>>> Cydrome Leader wrote:
>>>
>>>> In comp.sys.sun.admin Jorgen Moquist 
>>>> <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>>>>
>>>>
>>>>> Cydrome Leader wrote:
>>>>>
>>>>>
>>>>>> In comp.sys.sun.hardware Jorgen Moquist 
>>>>>> <jorgen.moquist@n.o.s.p.a.m.mailbox.swipnet.se> wrote:
>>>>>>
>>>>>>
>>>>>>> Daniel Rock wrote:
>>>>>>>
>>>>>>>
>>>>>>>> In comp.sys.sun.admin Cydrome Leader <presence@mungepanix.com> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> just throw the drive out or RMA it.
>>>>>>>>
>>>>>>>> Why should I pay for it?
>>>>>>>>
>>>>>>>
>>>>>>> Rock on Daniel, I think you know more about disks than the others 
>>>>>>> know about cars :-)
>>>>>>> /Jorgen
>>>>>>
>>>>>> yup, it's always best to use broken parts, and when things do 
>>>>>> fail, to do nothing. Problems with machines only get better with 
>>>>>> time, they're self healing.
>>>>>
>>>>> scsi and fcal disks are "selfhealing", lots of spare tracks/cyls.
>>>>> replacement and cacheing tables, one spare sector per cyl and two 
>>>>> spare
>>>>> cyls per surface as i recall.
>>>>
>>>>
>>>> None of this keeps errors from happening in the first place. Spare 
>>>> sectors don't make unrecoverable read errors not happen. It's also 
>>>> pretty known that once you start to see errors (and that means the 
>>>> drive has warning you because there's something wrong), things only 
>>>> go downhill from there.
>>>>
>>>
>>> Disk drives can and do survive a block becoming unreadable.  SCSI 
>>> drives can "revector" a bad block.  In some operating systems, the 
>>> disk driver works in conjunction with the disk to copy data from a 
>>> questionable block to a replacement block.  This looks, to the user, 
>>> like "self healing".  If a bad block is revectored, it's not an 
>>> indication of a serious problem.
>>>
>>> What IS an indication of a serious problem is A PATTERN of bad blocks 
>>> being revectored.  When you see that, it's time to replace the the disk.
>>> Do it NOW!  Tomorrow may be too late.
>>>
>>
>>
>> and it generally always is a patter of failing blocks, not one random 
>> one and then things are great again for years.
>>
>>
> 
> I think "always" is an overstatement.  I've seen a lot of disks over the 
> years.  Some of them showed the pattern of failure I've described.  Some 
> of them revectored a bad block or two and ran for several more years.
> 
> If a disk has critical data and is not a member of a RAID set you may be 
> justified in replacing it the first time it detects a bad block.  In 
> most cases I would not get excited about a single bad block.
> 
> It also makes a difference if you have a service contract or are doing 
> "self maintenance".
> 


It would be useful if you cut out irrelevant stuff when quoting - there 
is rarely much point in quoting this amount, most of which is totally 
irrelevant.

Dave
0
Reply Dave 10/8/2007 1:05:39 PM
