Metareplace one device for another on active mounted filesystem

Hello,

I have asked about this before in a slightly different way, and I'm
not sure that I expressed it well.  I do not want to do this without
at least asking the opinion of other admins.

I have a Sun E250 server attached to an A1000 hardware RAID5 box.  On
this RAID box are 2 soft partitions containing source code
repositories and other data.  The data is rsynced to at least 2
machines.

Anyway, after replacing a faulty battery on the A1000 controller,
something strange happened.  I do not know when exactly or why this
happened, but I didn't do it.

When the machine boots the fsck fails on the device that contains the
soft partitions b/c it can't find it.  The system boot stops and waits
for confirmation to proceed.  After that it boots fine and everything
is mounted and fine.  When researching this disturbing behavior I
found out the following puzzling scenario.

The Solaris Volume Manager sees the device containing the soft
partitions (d30 & d31) as:

c2t10d0s6

When you do a metastat on them that is the device.  That was the
actual device.  At some time and for an as yet unknown reason, the
format utility sees this device as

c3t10d0s6

I believe that this discrepancy is why the fsck fails.  It sees the
vfstab logical devices as pointing to c2, which is what it once was,
yet the real true device is now (for unknown reason) c3.  Thankfully
it can still boot and mount the correct devices, (have no idea how it
figures it out) but this boot stopping is annoying b/c if someone
isn't here it can't complete a boot and that means that I can't do it
remotely b/c sadly I have no remote console server hooked up to it.
(I know I know I should - working on it)

So, is it SAFE to do a mere

metareplace -e d30 c3t10d0s6

and

metareplace -e d31 c3t10d0s6

while they are mounted?  Wouldn't that correct the SVM to point to the
correct device and make the fsck succeed b/c now it would be pointing
to the correct device designation?  Does my assessment sound correct?

Thanks.

5/11/2007 8:56:11 PM
comp.sys.sun.admin

worlok wrote:
> Hello,
> 
> I have asked about this before in a slightly different way, and I'm
> not sure that I expressed it well.  I do not want to do this without
> at least asking the opinion of other admins.
> 
> I have a Sun E250 server attached to an A1000 hardware RAID5 box.  On
> this RAID box are 2 soft partitions containing source code
> repositories and other data.  The data is rsynced to at least 2
> machines.
> 
> Anyway, after replacing a faulty battery on the A1000 controller,
> something strange happened.  I do not kno when exactly or why this
> happened, but I didn't do it.
> 
> When the machine boots the fsck fails on the device that contains the
> soft partitions b/c it can't find it.  The system boot stops and waits
> for confirmation to proceed.  After that it boots fine and everything
> is mounted and fine.  When researching this disturbing behavior I
> found out the following puzzling scenario.
> 
> The Solaris Volume Manager sees the device containing the soft
> partitions (d30 & d31) as:
> 
> c2t10d0s6
> 
> When you do a metastat on them that is the device.  That was the
> actual device.  At some time and for an as yet unknown reason, the
> format utility sees this device as
> 
> c3t10d0s6
> 
> I believe that this discrepancy is why the fsck fails.  It sees the
> vfstab logical devices as pointing to c2, which is what it once was,
> yet the real true device is now (for unknown reason) c3.  Thankfully
> it can still boot and mount the correct devices, (have no idea how it
> figures it out) but this boot stopping is annoying b/c if someone
> isn't here it can't complete a boot and that means that I can't do it
> remotely b/c sadly I have no remote console server hooked up to it.
> (I know I know I should - working on it)
> 
> So, is it SAFE to do a mere
> 
> metareplace -e d30 c3t10d0s6
> 
> and
> 
> metareplace -e d31 c3t10d0s6
> 
> while they are mounted?  Wouldn't that correct the SVM to point to the
> correcet device and make the fsck succeed b/c now it would be pointing
> to the correct device designation?  Does my assessment sound correct?
> 
> Thanks.
> 

Don't even think of doing anything until you have at least two copies of 
a current backup!!!!!!  Not just the disks in question but the whole 
system!  And make sure those two copies are both readable.

Something is badly fscked up and you don't know what.  Anything you try
to do may be the wrong thing to do and have disastrous results.  Give
yourself every possible chance to recover.

Richard
5/11/2007 9:07:24 PM
On May 11, 5:07 pm, "Richard B. Gilbert" <rgilber...@comcast.net>
wrote:
>
> Don't even think of doing anything until you have at least two copies of
> a current backup!!!!!!  Not just the disks in question but the whole
> system!  And make sure those two copies are both readable.
>
> Something is badly fscked up and you don't know what.  Anything you try
> to to do may be the wrong thing to do and have disastrous results.  Give
>   yourself every possible chance to recover.

I understand, but the soft partitions that are affected and that
"device" which is the RAID box in no way affect the system itself.
The OS and boot areas are all on internal disks.  The RAID device soft
partitions contain data only and the data is mirrored to 2 other
machines via rsync.




worlok
5/11/2007 9:27:28 PM
worlok wrote:
> On May 11, 5:07 pm, "Richard B. Gilbert" <rgilber...@comcast.net>
> wrote:
> 
>>Don't even think of doing anything until you have at least two copies of
>>a current backup!!!!!!  Not just the disks in question but the whole
>>system!  And make sure those two copies are both readable.
>>
>>Something is badly fscked up and you don't know what.  Anything you try
>>to to do may be the wrong thing to do and have disastrous results.  Give
>>  yourself every possible chance to recover.
> 
> 
> I understand, but the soft partitions that are affected and that
> "device' which is the RAID box in no way affect the system itself.
> The OS and boot areas are all on internal disks.  The RAID device soft
> partitions contain data only and the data is mirrored to 2 other
> machines via rsync.
> 

But the system affects those soft partitions and the RAID device.  It's 
just barely conceivable that the problem might be hardware but I think 
it's more likely to be a software and/or configuration problem.  Thus, 
fixing it will probably require installing patches, new software, or 
modifying configuration files.  The effects of a failed attempt to fix 
the problem are unpredictable, especially since you are essentially
shooting in the dark and hoping to hit something.

If you have a good backup, you can restore the status quo ante 
regardless of what happens.




Richard
5/11/2007 9:38:29 PM
> The Solaris Volume Manager sees the device containing the soft
> partitions (d30 & d31) as:
>
> c2t10d0s6
>
> When you do a metastat on them that is the device.  That was the
> actual device.  At some time and for an as yet unknown reason, the
> format utility sees this device as
>
> c3t10d0s6
>

Probably when the battery was out of the disk controller, it was
invisible to the system, and then somehow during a reconfiguration
boot the system re-arranged the newly-refound disk controller that
used to be c2 as c3.
Basically, your controller 2 is now seen by the system as controller
3.
I think there might be an easy way to fix it in /etc/path_to_inst - if
you swap the instance numbers for the physical device names for the
2nd and 3rd disk controllers. But you'd have to be comfortable with
what you're doing and know what to do if your system can't boot if you
screw up your path_to_inst (i.e. boot -s and then fix it up).

noident
5/14/2007 6:35:56 AM
On May 14, 2:35 am, noid...@my-deja.com wrote:
>
> Probably when the battery was out of the disk controller, it was
> invisible to the system, and then somehow during a reconfiguration
> boot the system re-arranged the newly-refound disk controller that
> used to be c2 as c3.
> Basically, your controller 2 is now seen by the system as controller
> 3.
> I think there might be an easy way to fix it in /etc/path_to_inst - if
> you swap the instance numbers for the physical device names for the
> 2nd and 3rd disk controllers. But you'd have to be comfortable with
> what you're doing and know what to do if you system can't boot if you
> screw up your path_to_inst (i.e. boot -s and then fix it up).

Yes, I think that is what happened.

I checked that config file, but so far can't match up any of the
devices to the device info that I have on the array.  I have to go
through it further, but I will post my info here. I still wonder if a
simple metareplace to the SVM wouldn't fix it since it also sees it as
c2, not c3 - but maybe this other file is also a factor for the fsck
problem which messes up the OS boot?

====================

Metastat info:

bash-2.05# metastat d30
d30: Soft Partition
    Device: c2t10d0s6
    State: Okay
    Size: 75497472 blocks (36 GB)
        Device      Start Block  Dbase Reloc
        c2t10d0s6      10816     No    Yes

        Extent              Start Block              Block count
             0                    10817                 16777216
             1                281029187                 58720256

Device Relocation Information:
Device    Reloc Device ID
c2t10d0   Yes   id1,iver@w600a0b80000a5ed90000000f3d94158d
bash-2.05# metastat d31
d31: Soft Partition
    Device: c2t10d0s6
    State: Okay
    Size: 327155712 blocks (156 GB)
        Device      Start Block  Dbase Reloc
        c2t10d0s6      10816     No    Yes

        Extent              Start Block              Block count
             0                 16788034                264241152
             1                339749444                 62914560

Device Relocation Information:
Device    Reloc Device ID
c2t10d0   Yes   id1,iver@w600a0b80000a5ed90000000f3d94158d
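
A quick sanity check on the metastat numbers above, as a minimal Python sketch. It assumes the text layout shown in this post and 512-byte disk blocks (an assumption, though it agrees with the "(36 GB)" figure). The sample string and the function name are illustrative only, not part of any SVM tool.

```python
import re

# Sample text copied from the `metastat d30` output above.
METASTAT_D30 = """\
d30: Soft Partition
    Device: c2t10d0s6
    State: Okay
    Size: 75497472 blocks (36 GB)
        Device      Start Block  Dbase Reloc
        c2t10d0s6      10816     No    Yes

        Extent              Start Block              Block count
             0                    10817                 16777216
             1                281029187                 58720256
"""

BLOCK_BYTES = 512  # assumption: metastat counts 512-byte disk blocks


def parse_soft_partition(text):
    """Return name, device, reported size, and extents from metastat text."""
    name = re.search(r"^(d\d+): Soft Partition", text, re.M).group(1)
    device = re.search(r"^\s*Device: (\S+)", text, re.M).group(1)
    size = int(re.search(r"^\s*Size: (\d+) blocks", text, re.M).group(1))
    # Extent rows are three integers: index, start block, block count.
    extents = [
        (int(start), int(count))
        for _idx, start, count in re.findall(
            r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.M
        )
    ]
    return {"name": name, "device": device, "size": size, "extents": extents}


info = parse_soft_partition(METASTAT_D30)
extent_total = sum(count for _start, count in info["extents"])
size_gib = info["size"] * BLOCK_BYTES / 2**30

print(info["name"], "on", info["device"])
print("extents add up to reported size:", extent_total == info["size"])
print("size in GiB:", size_gib)
```

The point of the check is only that SVM's own view of d30 is internally consistent; the device name it records is the one part that disagrees with format.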

FORMAT output:

bash-2.05# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,4000/scsi@3/sd@0,0
       1. c0t8d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,4000/scsi@3/sd@8,0
       2. c0t9d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,4000/scsi@3/sd@9,0
       3. c0t10d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,4000/scsi@3/sd@a,0
       4. c0t11d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,4000/scsi@3/sd@b,0
       5. c0t12d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,4000/scsi@3/sd@c,0
       6. c3t10d0 <Symbios-StorEDGEA1000-0301 cyl 65533 alt 2 hd 64 sec 169>
          /pseudo/rdnexus@3/rdriver@a,0
Specify disk (enter its number):
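
To make the mismatch explicit, here is a small Python sketch that pulls the A1000 entry out of a format-style listing and compares its controller number with the device name SVM has recorded. The sample text is an excerpt of the listing above; the helper names are hypothetical, and the parsing assumes the "N. cXtYdZ <label>" layout shown here.

```python
import re

# Device name recorded by SVM, from the metastat output above.
SVM_DEVICE = "c2t10d0s6"

# Excerpt of the format listing above (the wrapped A1000 line is rejoined).
FORMAT_LINES = """\
       5. c0t12d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
          /pci@1f,4000/scsi@3/sd@c,0
       6. c3t10d0 <Symbios-StorEDGEA1000-0301 cyl 65533 alt 2 hd 64 sec 169>
          /pseudo/rdnexus@3/rdriver@a,0
"""

CTD = re.compile(r"c(\d+)t(\d+)d(\d+)")


def split_ctd(name):
    """Split a cXtYdZ[sN] name into (controller, target, disk) integers."""
    match = CTD.match(name)
    if match is None:
        raise ValueError(f"not a cXtYdZ device name: {name!r}")
    return tuple(int(part) for part in match.groups())


def find_disks(format_text, label_fragment):
    """Return device names whose format label contains label_fragment."""
    entry = re.compile(r"^\s*\d+\.\s+(c\d+t\d+d\d+)\s+<([^>]*)>", re.M)
    return [dev for dev, label in entry.findall(format_text)
            if label_fragment in label]


a1000 = find_disks(FORMAT_LINES, "A1000")
svm_ctl, svm_tgt, svm_disk = split_ctd(SVM_DEVICE)
for dev in a1000:
    ctl, tgt, disk = split_ctd(dev)
    same_address = (tgt, disk) == (svm_tgt, svm_disk)
    print(dev, "controller", ctl, "vs SVM controller", svm_ctl,
          "same target/disk:", same_address)
```

Only the controller number differs (2 versus 3); target and disk number are the same in both names.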

Contents of /etc/path_to_inst

#
#	Caution! This file contains critical kernel state
#
"/pci@1f,4000" 0 "pcipsy"
"/pci@1f,4000/scsi@3,1" 1 "glm"
"/pci@1f,4000/scsi@3,1/ses@7,0" 23 "ses"
"/pci@1f,4000/scsi@3,1/ses@6,0" 22 "ses"
"/pci@1f,4000/scsi@3,1/ses@5,0" 21 "ses"
"/pci@1f,4000/scsi@3,1/ses@4,0" 20 "ses"
"/pci@1f,4000/scsi@3,1/ses@3,0" 19 "ses"
"/pci@1f,4000/scsi@3,1/ses@2,0" 18 "ses"
"/pci@1f,4000/scsi@3,1/ses@1,0" 17 "ses"
"/pci@1f,4000/scsi@3,1/ses@0,0" 16 "ses"
"/pci@1f,4000/scsi@3,1/ses@9,0" 25 "ses"
"/pci@1f,4000/scsi@3,1/ses@8,0" 24 "ses"
"/pci@1f,4000/scsi@3,1/ses@f,0" 31 "ses"
"/pci@1f,4000/scsi@3,1/ses@e,0" 30 "ses"
"/pci@1f,4000/scsi@3,1/ses@d,0" 29 "ses"
"/pci@1f,4000/scsi@3,1/ses@c,0" 28 "ses"
"/pci@1f,4000/scsi@3,1/scg@0,0" 1 "scg"
"/pci@1f,4000/scsi@3,1/ses@b,0" 27 "ses"
"/pci@1f,4000/scsi@3,1/ses@a,0" 26 "ses"
"/pci@1f,4000/scsi@3,1/st@5,0" 12 "st"
"/pci@1f,4000/scsi@3,1/sd@e,0" 28 "sd"
"/pci@1f,4000/scsi@3,1/sd@a,4" 98 "sd"
"/pci@1f,4000/scsi@3,1/st@4,0" 11 "st"
"/pci@1f,4000/scsi@3,1/sd@d,0" 27 "sd"
"/pci@1f,4000/scsi@3,1/sd@a,5" 99 "sd"
"/pci@1f,4000/scsi@3,1/sd@a,6" 100 "sd"
"/pci@1f,4000/scsi@3,1/st@6,0" 13 "st"
"/pci@1f,4000/scsi@3,1/sd@f,0" 29 "sd"
"/pci@1f,4000/scsi@3,1/sd@a,7" 101 "sd"
"/pci@1f,4000/scsi@3,1/st@1,0" 8 "st"
"/pci@1f,4000/scsi@3,1/sd@a,0" 24 "sd"
"/pci@1f,4000/scsi@3,1/st@0,0" 7 "st"
"/pci@1f,4000/scsi@3,1/sd@a,1" 95 "sd"
"/pci@1f,4000/scsi@3,1/st@3,0" 10 "st"
"/pci@1f,4000/scsi@3,1/sd@c,0" 26 "sd"
"/pci@1f,4000/scsi@3,1/sd@a,2" 96 "sd"
"/pci@1f,4000/scsi@3,1/st@2,0" 9 "st"
"/pci@1f,4000/scsi@3,1/sd@b,0" 25 "sd"
"/pci@1f,4000/scsi@3,1/sd@a,3" 97 "sd"
"/pci@1f,4000/scsi@3,1/sd@5,0" 20 "sd"
"/pci@1f,4000/scsi@3,1/sd@4,1" 81 "sd"
"/pci@1f,4000/scsi@3,1/sd@4,0" 19 "sd"
"/pci@1f,4000/scsi@3,1/sd@5,1" 88 "sd"
"/pci@1f,4000/scsi@3,1/sd@4,3" 83 "sd"
"/pci@1f,4000/scsi@3,1/sd@5,2" 89 "sd"
"/pci@1f,4000/scsi@3,1/sd@6,0" 21 "sd"
"/pci@1f,4000/scsi@3,1/sd@4,2" 82 "sd"
"/pci@1f,4000/scsi@3,1/sd@5,3" 90 "sd"
"/pci@1f,4000/scsi@3,1/sd@1,0" 16 "sd"
"/pci@1f,4000/scsi@3,1/sd@4,5" 85 "sd"
"/pci@1f,4000/scsi@3,1/sd@5,4" 91 "sd"
"/pci@1f,4000/scsi@3,1/sd@0,0" 15 "sd"
"/pci@1f,4000/scsi@3,1/sd@4,4" 84 "sd"
"/pci@1f,4000/scsi@3,1/sd@5,5" 92 "sd"
"/pci@1f,4000/scsi@3,1/sd@3,0" 18 "sd"
"/pci@1f,4000/scsi@3,1/sd@4,7" 87 "sd"
"/pci@1f,4000/scsi@3,1/sd@5,6" 93 "sd"
"/pci@1f,4000/scsi@3,1/sd@2,0" 17 "sd"
"/pci@1f,4000/scsi@3,1/sd@4,6" 86 "sd"
"/pci@1f,4000/scsi@3,1/sd@5,7" 94 "sd"
"/pci@1f,4000/scsi@3,1/sd@9,0" 23 "sd"
"/pci@1f,4000/scsi@3,1/sd@8,0" 22 "sd"
"/pci@1f,4000/scsi@4,1" 3 "glm"
"/pci@1f,4000/scsi@4,1/ses@0,0" 48 "ses"
"/pci@1f,4000/scsi@4,1/ses@1,0" 49 "ses"
"/pci@1f,4000/scsi@4,1/ses@2,0" 50 "ses"
"/pci@1f,4000/scsi@4,1/ses@3,0" 51 "ses"
"/pci@1f,4000/scsi@4,1/ses@4,0" 52 "ses"
"/pci@1f,4000/scsi@4,1/ses@5,0" 53 "ses"
"/pci@1f,4000/scsi@4,1/ses@6,0" 54 "ses"
"/pci@1f,4000/scsi@4,1/ses@7,0" 55 "ses"
"/pci@1f,4000/scsi@4,1/ses@8,0" 56 "ses"
"/pci@1f,4000/scsi@4,1/ses@9,0" 57 "ses"
"/pci@1f,4000/scsi@4,1/ses@a,0" 58 "ses"
"/pci@1f,4000/scsi@4,1/ses@b,0" 59 "ses"
"/pci@1f,4000/scsi@4,1/scg@0,0" 3 "scg"
"/pci@1f,4000/scsi@4,1/ses@c,0" 60 "ses"
"/pci@1f,4000/scsi@4,1/ses@d,0" 61 "ses"
"/pci@1f,4000/scsi@4,1/ses@e,0" 62 "ses"
"/pci@1f,4000/scsi@4,1/ses@f,0" 63 "ses"
"/pci@1f,4000/scsi@4,1/sd@b,0" 55 "sd"
"/pci@1f,4000/scsi@4,1/st@2,0" 23 "st"
"/pci@1f,4000/scsi@4,1/sd@a,3" 139 "sd"
"/pci@1f,4000/scsi@4,1/sd@c,0" 56 "sd"
"/pci@1f,4000/scsi@4,1/st@3,0" 24 "st"
"/pci@1f,4000/scsi@4,1/sd@a,2" 138 "sd"
"/pci@1f,4000/scsi@4,1/st@0,0" 21 "st"
"/pci@1f,4000/scsi@4,1/sd@a,1" 137 "sd"
"/pci@1f,4000/scsi@4,1/sd@a,0" 54 "sd"
"/pci@1f,4000/scsi@4,1/st@1,0" 22 "st"
"/pci@1f,4000/scsi@4,1/sd@f,0" 59 "sd"
"/pci@1f,4000/scsi@4,1/st@6,0" 27 "st"
"/pci@1f,4000/scsi@4,1/sd@a,7" 143 "sd"
"/pci@1f,4000/scsi@4,1/sd@a,6" 142 "sd"
"/pci@1f,4000/scsi@4,1/sd@d,0" 57 "sd"
"/pci@1f,4000/scsi@4,1/st@4,0" 25 "st"
"/pci@1f,4000/scsi@4,1/sd@a,5" 141 "sd"
"/pci@1f,4000/scsi@4,1/sd@e,0" 58 "sd"
"/pci@1f,4000/scsi@4,1/st@5,0" 26 "st"
"/pci@1f,4000/scsi@4,1/sd@a,4" 140 "sd"
"/pci@1f,4000/scsi@4,1/sd@2,0" 47 "sd"
"/pci@1f,4000/scsi@4,1/sd@4,6" 128 "sd"
"/pci@1f,4000/scsi@4,1/sd@5,7" 136 "sd"
"/pci@1f,4000/scsi@4,1/sd@3,0" 48 "sd"
"/pci@1f,4000/scsi@4,1/sd@4,7" 129 "sd"
"/pci@1f,4000/scsi@4,1/sd@5,6" 135 "sd"
"/pci@1f,4000/scsi@4,1/sd@0,0" 45 "sd"
"/pci@1f,4000/scsi@4,1/sd@4,4" 126 "sd"
"/pci@1f,4000/scsi@4,1/sd@5,5" 134 "sd"
"/pci@1f,4000/scsi@4,1/sd@1,0" 46 "sd"
"/pci@1f,4000/scsi@4,1/sd@4,5" 127 "sd"
"/pci@1f,4000/scsi@4,1/sd@5,4" 133 "sd"
"/pci@1f,4000/scsi@4,1/sd@6,0" 51 "sd"
"/pci@1f,4000/scsi@4,1/sd@4,2" 124 "sd"
"/pci@1f,4000/scsi@4,1/sd@5,3" 132 "sd"
"/pci@1f,4000/scsi@4,1/sd@4,3" 125 "sd"
"/pci@1f,4000/scsi@4,1/sd@5,2" 131 "sd"
"/pci@1f,4000/scsi@4,1/sd@4,0" 49 "sd"
"/pci@1f,4000/scsi@4,1/sd@5,1" 130 "sd"
"/pci@1f,4000/scsi@4,1/sd@5,0" 50 "sd"
"/pci@1f,4000/scsi@4,1/sd@4,1" 123 "sd"
"/pci@1f,4000/scsi@4,1/sd@8,0" 52 "sd"
"/pci@1f,4000/scsi@4,1/sd@9,0" 53 "sd"
"/pci@1f,4000/scsi@3" 0 "glm"
"/pci@1f,4000/scsi@3/scg@0,0" 0 "scg"
"/pci@1f,4000/scsi@3/ses@b,0" 11 "ses"
"/pci@1f,4000/scsi@3/ses@c,0" 12 "ses"
"/pci@1f,4000/scsi@3/ses@a,0" 10 "ses"
"/pci@1f,4000/scsi@3/ses@f,0" 15 "ses"
"/pci@1f,4000/scsi@3/ses@d,0" 13 "ses"
"/pci@1f,4000/scsi@3/ses@e,0" 14 "ses"
"/pci@1f,4000/scsi@3/ses@8,0" 8 "ses"
"/pci@1f,4000/scsi@3/ses@9,0" 9 "ses"
"/pci@1f,4000/scsi@3/ses@2,0" 2 "ses"
"/pci@1f,4000/scsi@3/ses@3,0" 3 "ses"
"/pci@1f,4000/scsi@3/ses@0,0" 0 "ses"
"/pci@1f,4000/scsi@3/ses@1,0" 1 "ses"
"/pci@1f,4000/scsi@3/ses@6,0" 6 "ses"
"/pci@1f,4000/scsi@3/ses@7,0" 7 "ses"
"/pci@1f,4000/scsi@3/ses@4,0" 4 "ses"
"/pci@1f,4000/scsi@3/ses@5,0" 5 "ses"
"/pci@1f,4000/scsi@3/sd@8,0" 7 "sd"
"/pci@1f,4000/scsi@3/sd@9,0" 8 "sd"
"/pci@1f,4000/scsi@3/sd@0,0" 0 "sd"
"/pci@1f,4000/scsi@3/sd@4,4" 63 "sd"
"/pci@1f,4000/scsi@3/sd@5,5" 71 "sd"
"/pci@1f,4000/scsi@3/sd@1,0" 1 "sd"
"/pci@1f,4000/scsi@3/sd@4,5" 64 "sd"
"/pci@1f,4000/scsi@3/sd@5,4" 70 "sd"
"/pci@1f,4000/scsi@3/sd@2,0" 2 "sd"
"/pci@1f,4000/scsi@3/sd@4,6" 65 "sd"
"/pci@1f,4000/scsi@3/sd@5,7" 73 "sd"
"/pci@1f,4000/scsi@3/sd@3,0" 3 "sd"
"/pci@1f,4000/scsi@3/sd@4,7" 66 "sd"
"/pci@1f,4000/scsi@3/sd@5,6" 72 "sd"
"/pci@1f,4000/scsi@3/sd@4,0" 4 "sd"
"/pci@1f,4000/scsi@3/sd@5,1" 67 "sd"
"/pci@1f,4000/scsi@3/sd@5,0" 5 "sd"
"/pci@1f,4000/scsi@3/sd@4,1" 60 "sd"
"/pci@1f,4000/scsi@3/sd@6,0" 6 "sd"
"/pci@1f,4000/scsi@3/sd@4,2" 61 "sd"
"/pci@1f,4000/scsi@3/sd@5,3" 69 "sd"
"/pci@1f,4000/scsi@3/sd@4,3" 62 "sd"
"/pci@1f,4000/scsi@3/sd@5,2" 68 "sd"
"/pci@1f,4000/scsi@3/st@0,0" 0 "st"
"/pci@1f,4000/scsi@3/sd@a,1" 74 "sd"
"/pci@1f,4000/scsi@3/st@1,0" 1 "st"
"/pci@1f,4000/scsi@3/sd@a,0" 9 "sd"
"/pci@1f,4000/scsi@3/st@2,0" 2 "st"
"/pci@1f,4000/scsi@3/sd@b,0" 10 "sd"
"/pci@1f,4000/scsi@3/sd@a,3" 76 "sd"
"/pci@1f,4000/scsi@3/st@3,0" 3 "st"
"/pci@1f,4000/scsi@3/sd@c,0" 11 "sd"
"/pci@1f,4000/scsi@3/sd@a,2" 75 "sd"
"/pci@1f,4000/scsi@3/st@4,0" 4 "st"
"/pci@1f,4000/scsi@3/sd@d,0" 12 "sd"
"/pci@1f,4000/scsi@3/sd@a,5" 78 "sd"
"/pci@1f,4000/scsi@3/st@5,0" 5 "st"
"/pci@1f,4000/scsi@3/sd@e,0" 13 "sd"
"/pci@1f,4000/scsi@3/sd@a,4" 77 "sd"
"/pci@1f,4000/scsi@3/st@6,0" 6 "st"
"/pci@1f,4000/scsi@3/sd@f,0" 14 "sd"
"/pci@1f,4000/scsi@3/sd@a,7" 80 "sd"
"/pci@1f,4000/scsi@3/sd@a,6" 79 "sd"
"/pci@1f,4000/scsi@4" 2 "glm"
"/pci@1f,4000/scsi@4/ses@e,0" 46 "ses"
"/pci@1f,4000/scsi@4/ses@d,0" 45 "ses"
"/pci@1f,4000/scsi@4/ses@f,0" 47 "ses"
"/pci@1f,4000/scsi@4/ses@a,0" 42 "ses"
"/pci@1f,4000/scsi@4/ses@c,0" 44 "ses"
"/pci@1f,4000/scsi@4/ses@b,0" 43 "ses"
"/pci@1f,4000/scsi@4/scg@0,0" 2 "scg"
"/pci@1f,4000/scsi@4/ses@9,0" 41 "ses"
"/pci@1f,4000/scsi@4/ses@8,0" 40 "ses"
"/pci@1f,4000/scsi@4/ses@5,0" 37 "ses"
"/pci@1f,4000/scsi@4/ses@4,0" 36 "ses"
"/pci@1f,4000/scsi@4/ses@7,0" 39 "ses"
"/pci@1f,4000/scsi@4/ses@6,0" 38 "ses"
"/pci@1f,4000/scsi@4/ses@1,0" 33 "ses"
"/pci@1f,4000/scsi@4/ses@0,0" 32 "ses"
"/pci@1f,4000/scsi@4/ses@3,0" 35 "ses"
"/pci@1f,4000/scsi@4/ses@2,0" 34 "ses"
"/pci@1f,4000/scsi@4/sd@9,0" 38 "sd"
"/pci@1f,4000/scsi@4/sd@8,0" 37 "sd"
"/pci@1f,4000/scsi@4/sd@4,3" 104 "sd"
"/pci@1f,4000/scsi@4/sd@5,2" 110 "sd"
"/pci@1f,4000/scsi@4/sd@6,0" 36 "sd"
"/pci@1f,4000/scsi@4/sd@4,2" 103 "sd"
"/pci@1f,4000/scsi@4/sd@5,3" 111 "sd"
"/pci@1f,4000/scsi@4/sd@5,0" 35 "sd"
"/pci@1f,4000/scsi@4/sd@4,1" 102 "sd"
"/pci@1f,4000/scsi@4/sd@4,0" 34 "sd"
"/pci@1f,4000/scsi@4/sd@5,1" 109 "sd"
"/pci@1f,4000/scsi@4/sd@3,0" 33 "sd"
"/pci@1f,4000/scsi@4/sd@4,7" 108 "sd"
"/pci@1f,4000/scsi@4/sd@5,6" 114 "sd"
"/pci@1f,4000/scsi@4/sd@2,0" 32 "sd"
"/pci@1f,4000/scsi@4/sd@4,6" 107 "sd"
"/pci@1f,4000/scsi@4/sd@5,7" 115 "sd"
"/pci@1f,4000/scsi@4/sd@1,0" 31 "sd"
"/pci@1f,4000/scsi@4/sd@4,5" 106 "sd"
"/pci@1f,4000/scsi@4/sd@5,4" 112 "sd"
"/pci@1f,4000/scsi@4/sd@0,0" 30 "sd"
"/pci@1f,4000/scsi@4/sd@4,4" 105 "sd"
"/pci@1f,4000/scsi@4/sd@5,5" 113 "sd"
"/pci@1f,4000/scsi@4/sd@a,6" 121 "sd"
"/pci@1f,4000/scsi@4/sd@f,0" 44 "sd"
"/pci@1f,4000/scsi@4/st@6,0" 20 "st"
"/pci@1f,4000/scsi@4/sd@a,7" 122 "sd"
"/pci@1f,4000/scsi@4/sd@e,0" 43 "sd"
"/pci@1f,4000/scsi@4/st@5,0" 19 "st"
"/pci@1f,4000/scsi@4/sd@a,4" 119 "sd"
"/pci@1f,4000/scsi@4/sd@d,0" 42 "sd"
"/pci@1f,4000/scsi@4/st@4,0" 18 "st"
"/pci@1f,4000/scsi@4/sd@a,5" 120 "sd"
"/pci@1f,4000/scsi@4/sd@c,0" 41 "sd"
"/pci@1f,4000/scsi@4/st@3,0" 17 "st"
"/pci@1f,4000/scsi@4/sd@a,2" 117 "sd"
"/pci@1f,4000/scsi@4/sd@b,0" 40 "sd"
"/pci@1f,4000/scsi@4/st@2,0" 16 "st"
"/pci@1f,4000/scsi@4/sd@a,3" 118 "sd"
"/pci@1f,4000/scsi@4/sd@a,0" 39 "sd"
"/pci@1f,4000/scsi@4/st@1,0" 15 "st"
"/pci@1f,4000/scsi@4/st@0,0" 14 "st"
"/pci@1f,4000/scsi@4/sd@a,1" 116 "sd"
"/pci@1f,4000/pci@5" 0 "pci_pci"
"/pci@1f,4000/pci@5/network@0" 0 "ce"
"/pci@1f,4000/ebus@1" 0 "ebus"
"/pci@1f,4000/ebus@1/power@14,724000" 0 "power"
"/pci@1f,4000/ebus@1/SUNW,envctrltwo@14,600000" 0 "envctrltwo"
"/pci@1f,4000/ebus@1/se@14,400000" 0 "se"
"/pci@1f,4000/ebus@1/su@14,3083f8" 0 "su"
"/pci@1f,4000/ebus@1/se@14,200000" 1 "se"
"/pci@1f,4000/ebus@1/su@14,3062f8" 1 "su"
"/pci@1f,4000/ebus@1/ecpp@14,3043bc" 0 "ecpp"
"/pci@1f,4000/network@1,1" 0 "hme"
"/pci@1f,4000/TSI,gfxp@2" 0 "gfxp"
"/options" 0 "options"
"/pci@1f,2000" 1 "pcipsy"
"/scsi_vhci" 0 "scsi_vhci"
"/pseudo" 0 "pseudo"

==================================

There has to be an easy way to fix this.


Thanks,


Tom



worlok
5/14/2007 3:22:01 PM
I'm guessing that these are your scsi controllers:
"/pci@1f,4000/scsi@3,1" 1 "glm"
"/pci@1f,4000/scsi@4,1" 3 "glm"
"/pci@1f,4000/scsi@3" 0 "glm"

I can't say anything beyond that. At any rate - tweaking the
path_to_inst, or metareplace - can be easily tested on a non-
production system.
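
Rather than picking the controller lines out by eye, the listing can be filtered mechanically. This is a rough Python sketch, assuming each entry has the form "physical path" instance "driver" as in the file posted above; the sample is a short excerpt and the function names are made up for illustration.

```python
import re

# A few entries excerpted from the /etc/path_to_inst listing posted above.
PATH_TO_INST = '''\
"/pci@1f,4000" 0 "pcipsy"
"/pci@1f,4000/scsi@3,1" 1 "glm"
"/pci@1f,4000/scsi@3,1/sd@a,0" 24 "sd"
"/pci@1f,4000/scsi@4,1" 3 "glm"
"/pci@1f,4000/scsi@3" 0 "glm"
"/pci@1f,4000/scsi@4" 2 "glm"
"/pci@1f,4000/scsi@4/sd@a,0" 39 "sd"
'''

ENTRY = re.compile(r'^"([^"]+)"\s+(\d+)\s+"([^"]+)"\s*$')


def parse_path_to_inst(text):
    """Yield (physical_path, instance, driver) for each entry line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and the comment header
        match = ENTRY.match(line)
        if match:
            path, instance, driver = match.groups()
            yield path, int(instance), driver


def instances_for(text, driver_name):
    """Map instance number -> physical path for one driver."""
    return {inst: path
            for path, inst, drv in parse_path_to_inst(text)
            if drv == driver_name}


glm = instances_for(PATH_TO_INST, "glm")
for inst in sorted(glm):
    print(f"glm{inst}: {glm[inst]}")
```

Run against the full file, this lists four glm entries (instances 0 through 3), one more than the three quoted in this reply.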

noident
5/15/2007 3:09:52 AM

On Mon, 14 May 2007, worlok wrote:

> On May 14, 2:35 am, noid...@my-deja.com wrote:
> >
> > Probably when the battery was out of the disk controller, it was
> > invisible to the system, and then somehow during a reconfiguration
> > boot the system re-arranged the newly-refound disk controller that
> > used to be c2 as c3.
> > Basically, your controller 2 is now seen by the system as controller
> > 3.
> > I think there might be an easy way to fix it in /etc/path_to_inst - if
> > you swap the instance numbers for the physical device names for the
> > 2nd and 3rd disk controllers. But you'd have to be comfortable with
> > what you're doing and know what to do if you system can't boot if you
> > screw up your path_to_inst (i.e. boot -s and then fix it up).
>
> Yes, I think that is what happened.

No, what seems to have happened is that it looks like the diff scsi cable
has been moved during the "downtime"

<snip>

> "/pci@1f,4000/scsi@3,1" 1 "glm"
> "/pci@1f,4000/scsi@4,1" 3 "glm"
> "/pci@1f,4000/scsi@3" 0 "glm"
> "/pci@1f,4000/scsi@4" 2 "glm"

As you can see, both c2 and c3 are on the same PCI card, just different
ports on it, so to me that seems like you accidentally moved the cable
from one of them to the other while the system was down for the battery
exchange.

Solaris does not change the internal order of one PCI card upon
reconfiguration, it might switch places of cards, but not the individual
ports on them.

And the cards probing order is determined by the probe-order OBP variable
if available, but since c0 and c1 haven't switched on you, it is much more
likely that the cable position has been switched.

As usual, I might be wrong... but this seems logical to me...

/Johan A

Mr
5/15/2007 8:23:56 AM
On May 15, 4:23 am, "Mr. Johan Andersson" <j...@solace.miun.se> wrote:
> ----snip-----
>
> As you can see, both c2 and c3 are on the same PCI card, just different
> ports on it, so to me that seems like you accidentally moved the cable
> from one of them to the other while the systems was down for the battery
> exchange.
>
> ----snip-----
>
> And the cards probing order is determined by the probe-order OBP variable
> if available, but since c0 and c1 hasnt switched on you, it much more
> likely that the cable position has been switched.
>
> As usual, I might be wrong... but this seems logical to me...
>
> /Johan A

Oh if only it were that simple.  I checked the back of the RAID box
and the cables were not moved.  There are only two ports, one with the
terminator and the other with the scsi cable connector.  They haven't
been changed.  I didn't need to move them for the battery change as
they aren't in the way.  I didn't change/move the ones on the server
side either.

---Tom

worlok
5/21/2007 9:27:27 PM

On Tue, 15 May 2007, Mr. Johan Andersson wrote:

> No, what seems to have happened is that it looks like the diff scsi cable
> has been moved during the "downtime"
>
> <snip>
>
> > "/pci@1f,4000/scsi@3,1" 1 "glm"
> > "/pci@1f,4000/scsi@4,1" 3 "glm"
> > "/pci@1f,4000/scsi@3" 0 "glm"
> > "/pci@1f,4000/scsi@4" 2 "glm"
>
> As you can see, both c2 and c3 are on the same PCI card, just different
> ports on it, so to me that seems like you accidentally moved the cable
> from one of them to the other while the systems was down for the battery
> exchange.
>
> Solaris does not change the internal order of one PCI card upon
> reconfiguration, it might switch places of cards, but not the individual
> ports on them.
>
> And the cards probing order is determined by the probe-order OBP variable
> if available, but since c0 and c1 hasnt switched on you, it much more
> likely that the cable position has been switched.
>
> As usual, I might be wrong... but this seems logical to me...
>
> /Johan A
>
>

As I never saw any response to this, I googled it and saw your response, Tom;
funny enough, it never reached my news node :-/

Anyway, the cable order on the A1000 side is of no consequence; it's the
host side that matters... one channel is c2 and one is c3

"/pci@1f,4000/scsi@4" 2 "glm"    <- c2
"/pci@1f,4000/scsi@4,1" 3 "glm"  <- c3

There is no way, known to me, that would change the order of these two
devices, and if there were, then the order of the other two devices

"/pci@1f,4000/scsi@3" 0 "glm"    <- c0
"/pci@1f,4000/scsi@3,1" 1 "glm"  <- c1

should also have changed...

It is reasonable to expect that the device would be connected that way, and
if you had your disk on c2 and you now find it on c3, then the only
logical explanation is that the host-side cable has moved from the 4(.0)
channel to the 4.1 channel.
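
The pairing argument can be sketched in Python as follows. It assumes the trailing unit address "scsi@D,F" means PCI device D, function F, with a missing F read as 0, and it takes the glm instance numbers from the path_to_inst lines quoted above. Treating glm instance N as controller cN is the reading used in this post, not something the code verifies.

```python
import re
from collections import defaultdict

# glm entries as quoted above: physical path -> instance number.
GLM_INSTANCES = {
    "/pci@1f,4000/scsi@3": 0,
    "/pci@1f,4000/scsi@3,1": 1,
    "/pci@1f,4000/scsi@4": 2,
    "/pci@1f,4000/scsi@4,1": 3,
}

UNIT = re.compile(r"/scsi@([0-9a-f]+)(?:,([0-9a-f]+))?$")


def pci_device_and_function(path):
    """Split the trailing scsi@D[,F] unit address; a missing F means 0."""
    match = UNIT.search(path)
    if match is None:
        raise ValueError(f"no scsi@ unit address in {path!r}")
    device, function = match.groups()
    return int(device, 16), int(function or "0", 16)


def group_by_device(instances):
    """Return {pci_device: [(function, instance), ...]} sorted by function."""
    groups = defaultdict(list)
    for path, inst in instances.items():
        device, function = pci_device_and_function(path)
        groups[device].append((function, inst))
    return {dev: sorted(ports) for dev, ports in groups.items()}


for device, ports in sorted(group_by_device(GLM_INSTANCES).items()):
    for function, inst in ports:
        print(f"scsi@{device:x},{function:x} -> glm{inst}")
```

Grouped this way, instances 0 and 1 sit on device 3 and instances 2 and 3 sit on device 4, which is the basis for saying the two channels in question belong to the same adapter.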

If it hasn't, as you say, I have no idea how it came to do what it did :-/
Possibly the rdriver has done something funny, but I can't imagine what.
You could possibly check the rm6 logs to see if something happened there.

What you could try is to take the system down, boot it from CD-ROM in single
user mode and check if the device ends up in the same place... or if it has
moved between the 4 and 4.1 channels, you could then try moving the cable to
the other channel on the host side and see if it moves back.

Not something you do on a running system though, but rather on a
maintenance weekend with backups done beforehand.

Another thing to check is whether you have a system backup from before the
battery change (or disk move), and check the same files for clues to the
attachment before the problem.

Anyway... I don't think a metareplace will work for you. A delete and
reinit of the device will remove the error, but once again, I haven't
used soft partitions much, so I can't say that they would reinit as non-soft
ones would, where you can recreate it with the same info and get it back.
Sadly I have no hardware to test it on nowadays :-/

/Johan A




Mr
5/24/2007 9:01:23 AM
On May 24, 5:01 am, "Mr. Johan Andersson" <jo...@solace.miun.se>
wrote:
>
> As I never aw any response to this i goodled it and saw your response Tom,
> funny enough never reached my newsnode :-/
>
> Anyway, the cable-order on the A1000 side is of no consequence,its the
> host side that matters... one channel is the c2 and one is c3
>
> "/pci@1f,4000/scsi@4" 2 "glm"    <- c2
> "/pci@1f,4000/scsi@4,1" 3 "glm"  <- c3
>
> There is no way, known to me, that would change the order of these two
> devices, and if there were, then the order of the other two devices
>
> "/pci@1f,4000/scsi@3" 0 "glm"    <- c0
> "/pci@1f,4000/scsi@3,1" 1 "glm"  <- c1
>
> should also have changed...

---snip-----

>
> /Johan A

Johan,

This is intriguing.  There is, of course, another channel on that
adapter which is unused.  If you look at the rear of the host machine,
the second (to the right) one is hosting the cable that leads to the
A1000 while the first one is unused.

These cable positions were not changed during the battery changing on
the A1000.

One thing has me wondering.  There was a major server room move a few
years back, which may have been when the original battery was ALREADY
dead.

Could it have been that attaching the wrong cable after the move
didn't cause any upset since the battery was dead and the machine
didn't "see" the controller due to the bad battery?  Then, when I
changed out the original dead battery and the controller became
"active" again, it read 3 when it should have been 2 if during that
long ago move the cable was changed by someone else?  I would have
never noticed it until the battery was changed by me and hence this
odd behavior.  Follow what I am saying?

I want to test this but I have to wait until the machine can be
downed.  If this fixes it then that would be something and would be
added to my crazy scenario notes but I don't want to count my chickens
just yet.

Thanks for the clue and I will post a followup on the result.

--Tom

worlok
7/18/2007 5:16:13 PM

On Wed, 18 Jul 2007, worlok wrote:

> On May 24, 5:01 am, "Mr. Johan Andersson" <jo...@solace.miun.se>
> wrote:
> >
> > As I never aw any response to this i goodled it and saw your response Tom,
> > funny enough never reached my newsnode :-/
> >
> > Anyway, the cable-order on the A1000 side is of no consequence,its the
> > host side that matters... one channel is the c2 and one is c3
> >
> > "/pci@1f,4000/scsi@4" 2 "glm"    <- c2
> > "/pci@1f,4000/scsi@4,1" 3 "glm"  <- c3
> >
> > There is no way, known to me, that would change the order of these two
> > devices, and if there were, then the order of the other two devices
> >
> > "/pci@1f,4000/scsi@3" 0 "glm"    <- c0
> > "/pci@1f,4000/scsi@3,1" 1 "glm"  <- c1
> >
> > should also have changed...
>
> ---snip-----
>
> >
> > /Johan A
>
> Johan,
>
> This is intriguing.  There is, of course, another channel on that
> adapter which is unused.  If you look at the rear of the host machine,
> the second (to the right) one is hosting the cable that leads to the
> A1000 while the first one is unused.
>
> These cable positions were not changed during the battery changing on
> the A1000.

Ok...

> One thing has me wondering.  There was a major server room move a few
> years back, which may have been when the original battery was ALREADY
> dead.

Depends on what was wrong with the battery: it could have been broken, in
which case it could have happened at any time, or it could have expired,
which I believe was two years after they were installed. There is a time
on the battery which you can check...

This is how it looks on my A2000, same software

# /usr/lib/osa/bin/lad
c5t5d0 1T71322076 LUNS: 0 2 4 6
c9t4d1 1T74447964 LUNS: 1 3 5
# /usr/lib/osa/bin/raidutil -c c5t5d0 -B
LUNs found on c5t5d0.
  LUN 0    RAID 5    34389 MB
  LUN 2    RAID 5    34389 MB
  LUN 4    RAID 5    34389 MB
  LUN 6    RAID 5    34389 MB
Battery age is between 270 days and 360 days.
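
For anyone who wants to script this check, here is a small Python sketch that reads the two outputs above. It assumes exactly the line formats shown here; the meaning of the second `lad` column is not stated in this thread, so it is kept as an opaque string. The two-year expiry figure is the one recalled in this post, not a documented value, and the function names are illustrative.

```python
import re

# Sample lines copied from the lad and raidutil output above.
LAD_OUTPUT = """\
c5t5d0 1T71322076 LUNS: 0 2 4 6
c9t4d1 1T74447964 LUNS: 1 3 5
"""

RAIDUTIL_OUTPUT = """\
LUNs found on c5t5d0.
  LUN 0    RAID 5    34389 MB
Battery age is between 270 days and 360 days.
"""

EXPIRY_DAYS = 2 * 365  # assumption: the two-year figure mentioned above


def parse_lad(text):
    """Return {device: (second_column, [lun, ...])} from lad-style lines."""
    result = {}
    for line in text.splitlines():
        if "LUNS:" not in line:
            continue
        head, luns = line.split("LUNS:", 1)
        device, ident = head.split()
        result[device] = (ident, [int(n) for n in luns.split()])
    return result


def battery_age_range(text):
    """Return (low_days, high_days) from the battery age line, or None."""
    match = re.search(
        r"Battery age is between (\d+) days and (\d+) days", text
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


modules = parse_lad(LAD_OUTPUT)
age = battery_age_range(RAIDUTIL_OUTPUT)
for device, (ident, luns) in modules.items():
    print(device, ident, luns)
if age is not None:
    print("battery age range:", age, "past expiry:", age[0] >= EXPIRY_DAYS)
```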

> Could it have been that attaching the wrong cable after the move
> didn't cause any upset since the battery was dead and the machine
> didn't "see" the controller due to the bad battery?  Then, when I
> changed out the original dead battery and the controller became
> "active" again, it read 3 when it should have been 2 if during that
> long ago move the cable was changed by someone else?  I would have
> never noticed it until the battery was changed by me and hence this
> odd behavior.  Follow what I am saying?

Yes, I follow, I think, but I believe the system would have seen the
controller even without a battery; it just means the cache will be off,
which lowers the performance some.

But if the A1000 was off, or not used, and the system never booted with
reconfigure, then it might have been "invisible" to the OS, other than
maybe showing a line or two during boots where the device might show up.

I have nothing to try it with, so I can't help you there...

> I want to test this but I have to wait until the machine can be
> downed.  If this fixes it then that would be something and would be
> added to my crazy scenario notes but I don't want to count my chickens
> just yet.

No, but if it's only the path that is wrong, then moving the cable to the
"correct" one, for whatever reason, should make it work again.

> Thanks for the clue and i will post a followup on the result.
>
> --Tom

Just hope it turns up, this post did anyways :-)

/Johan A
Mr
7/20/2007 9:16:28 AM