We had a reboot recently that was a result of this hardware fault:
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS  Critical
Fault class : fault.cpu.intel.nb.fsb
FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
faulty
How do I determine which CPU or core is at fault? This is on an E4450
with four four-core CPUs. `psrinfo -vp' says:
The physical processor has 4 virtual processors (0 4-6)
x86 (chipid 0x0 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (1 7-9)
x86 (chipid 0x2 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (2 10-12)
x86 (chipid 0x4 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
The physical processor has 4 virtual processors (3 13-15)
x86 (chipid 0x6 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
Intel(r) Xeon(r) CPU X7350 @ 2.93GHz
--
-Gary Mills- -Unix Group- -Computer and Network Services-
Gary, 6/10/2010 8:54:41 PM

Gary Mills <mills@cc.umanitoba.ca> wrote:
> We had a reboot recently that was a result of this hardware fault:
>
> --------------- ------------------------------------ -------------- ---------
> TIME EVENT-ID MSG-ID SEVERITY
> --------------- ------------------------------------ -------------- ---------
> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>
> Fault class : fault.cpu.intel.nb.fsb
> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
> faulty
>
> How do I determine which CPU or core is at fault? This is on an E4450
> with four four-core CPUs. `psrinfo -vp' says:
While you can disable cores/processors on Solaris x86, it's not clear
whether that really does anything. On a SPARC platform, yes, you can
really disable memory and processors, and it's for real.
I've seen Xeon processors (really cores) fail in Solaris before, and in
real life there's nothing wrong at all with the CPU. For Intel hardware
just rebooting seems to be the fix. I suspect it's some sort of software
issue.
Cydrome, 6/10/2010 11:58:40 PM

In <huru7f$ing$1@reader1.panix.com> Cydrome Leader <presence@MUNGEpanix.com> writes:
>Gary Mills <mills@cc.umanitoba.ca> wrote:
>> We had a reboot recently that was a result of this hardware fault:
>>
>> --------------- ------------------------------------ -------------- ---------
>> TIME EVENT-ID MSG-ID SEVERITY
>> --------------- ------------------------------------ -------------- ---------
>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>>
>> Fault class : fault.cpu.intel.nb.fsb
>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>> faulty
>>
>> How do I determine which CPU or core is at fault? This is on an E4450
>> with four four-core CPUs. `psrinfo -vp' says:
In this instance, I'd really like to know which CPU was faulty.
I can guess, but I might be wrong. (It was actually an X4450.)
>I've seen xeon processors (really cores) fail in solaris before and in
>real life there's nothing wrong at all with the CPU. For intel hardware
>just rebooting seems to be the fix. I suspect it's some sort of software
>issue.
This server needed a power-cycle before it came back to normal. A
reboot wasn't sufficient. Either something didn't get reset fully
or it was a real hardware failure.
--
-Gary Mills- -Unix Group- -Computer and Network Services-
Gary, 6/11/2010 11:25:20 PM

Gary Mills <mills@cc.umanitoba.ca> wrote:
> In <huru7f$ing$1@reader1.panix.com> Cydrome Leader <presence@MUNGEpanix.com> writes:
>
>>Gary Mills <mills@cc.umanitoba.ca> wrote:
>>> We had a reboot recently that was a result of this hardware fault:
>>>
>>> --------------- ------------------------------------ -------------- ---------
>>> TIME EVENT-ID MSG-ID SEVERITY
>>> --------------- ------------------------------------ -------------- ---------
>>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>>>
>>> Fault class : fault.cpu.intel.nb.fsb
>>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>>> faulty
>>>
>>> How do I determine which CPU or core is at fault? This is on an E4450
>>> with four four-core CPUs. `psrinfo -vp' says:
>
> In this instance, I'd really like to know which CPU was faulty.
> I can guess, but I might be wrong. (It was actually an X4450.)
>
>>I've seen xeon processors (really cores) fail in solaris before and in
>>real life there's nothing wrong at all with the CPU. For intel hardware
>>just rebooting seems to be the fix. I suspect it's some sort of software
>>issue.
>
> This server needed a power-cycle before it came back to normal. A
> reboot wasn't sufficient. Either something didn't get reset fully
> or it was a real hardware failure.
If you have any core files, Sun might be able to tell you which CPU it
thinks faulted. Since you're running on Sun hardware they should probably
be able to help with this.
If you can, running VTS for a few days might be a good idea.
Cydrome, 6/12/2010 3:49:22 AM

In <huv042$2rn$2@reader1.panix.com> Cydrome Leader <presence@MUNGEpanix.com> writes:
>Gary Mills <mills@cc.umanitoba.ca> wrote:
>> In <huru7f$ing$1@reader1.panix.com> Cydrome Leader <presence@MUNGEpanix.com> writes:
>>
>>>Gary Mills <mills@cc.umanitoba.ca> wrote:
>>>> We had a reboot recently that was a result of this hardware fault:
>>>>
>>>> --------------- ------------------------------------ -------------- ---------
>>>> TIME EVENT-ID MSG-ID SEVERITY
>>>> --------------- ------------------------------------ -------------- ---------
>>>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>>>>
>>>> Fault class : fault.cpu.intel.nb.fsb
>>>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>>>> faulty
>>
>> This server needed a power-cycle before it came back to normal. A
>> reboot wasn't sufficient. Either something didn't get reset fully
>> or it was a real hardware failure.
>If you have any core files, sun might be able to tell you which cpu it
>feels faulted. Since you're running on sun hardware they should probably
>be able to help with this.
There was no core file or traceback, just a sudden reboot. Oracle/Sun
is going to replace one of the CPUs. I just wanted an independent
way to verify which one it was.
--
-Gary Mills- -Unix Group- -Computer and Network Services-
Gary, 6/12/2010 12:06:17 PM

In article <hurjeh$r53$1@canopus.cc.umanitoba.ca>,
Gary Mills <mills@cc.umanitoba.ca> writes:
> We had a reboot recently that was a result of this hardware fault:
>
> --------------- ------------------------------------ -------------- ---------
> TIME EVENT-ID MSG-ID SEVERITY
> --------------- ------------------------------------ -------------- ---------
> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>
> Fault class : fault.cpu.intel.nb.fsb
> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
> faulty
>
> How do I determine which CPU or core is at fault? This is on an E4450
Look at the output from /usr/lib/fm/fmd/fmtopo -V
for the same FRU and see if that entry tells you which socket.
Also, you might find the system is numbering the chips 0, 2, 4, 6 in
the fmtopo output, which would make it the third socket.
I believe the fm output has recently been changed to be more helpful
in this case, but I don't know if/when that's gone back into S10.
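The chip-to-socket reasoning above can be sketched in a few lines of Python. This is only an illustration: the helper names `chip_id_from_fru` and `socket_position` are hypothetical, and it assumes the sockets are populated in ascending chip-ID order (0, 2, 4, 6), which should be checked against the fmtopo output and the board diagram before anyone pulls a CPU.

```python
import re


def chip_id_from_fru(fru):
    """Return the N from the chip=N component of an hc:// FRU string."""
    match = re.search(r"/chip=(\d+)", fru)
    if match is None:
        raise ValueError("no chip component in %r" % fru)
    return int(match.group(1))


def socket_position(chip_id, chip_ids):
    """Zero-based position of chip_id among the sorted chip IDs.

    Assumption: physical sockets follow ascending chip-ID order.
    """
    return sorted(chip_ids).index(chip_id)


# FRU string copied verbatim from the fault report in this thread.
fru = ("hc://:product-id=-zMemory-Scrub:chassis-id=(stuck"
       ":server-id=electra/motherboard=0/chip=4")

chip = chip_id_from_fru(fru)
position = socket_position(chip, [0, 2, 4, 6])
print("chip=%d is socket position %d, counting from 0" % (chip, position))
```

With the chip IDs numbered 0, 2, 4, 6, chip=4 lands at position 2, i.e. the third socket.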
--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
andrew, 6/12/2010 7:58:07 PM

In article <huru7f$ing$1@reader1.panix.com>,
Cydrome Leader <presence@MUNGEpanix.com> writes:
>
> I've seen xeon processors (really cores) fail in solaris before and in
> real life there's nothing wrong at all with the CPU. For intel hardware
> just rebooting seems to be the fix. I suspect it's some sort of software
> issue.
Blimey. We go to extraordinary effort to retrieve and decode all the Intel
chip telemetry (which Intel tell me no other OS has managed to do to
anywhere near the same degree) to ensure you don't get any data corruption
when parts of chips/busses/memory/etc detect error situations, as you'd
expect from an Enterprise grade OS. Then when it happens, someone says
"I suspect it's some sort of software issue."
;-)
--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
andrew, 6/12/2010 8:09:20 PM

In <hv0osf$s49$1@news.eternal-september.org> andrew@cucumber.demon.co.uk (Andrew Gabriel) writes:
>In article <hurjeh$r53$1@canopus.cc.umanitoba.ca>,
> Gary Mills <mills@cc.umanitoba.ca> writes:
>> We had a reboot recently that was a result of this hardware fault:
>>
>> --------------- ------------------------------------ -------------- ---------
>> TIME EVENT-ID MSG-ID SEVERITY
>> --------------- ------------------------------------ -------------- ---------
>> Jun 03 14:41:55 f7fe7526-3295-c81b-ff45-a996faed8072 INTEL-8000-WS Critical
>>
>> Fault class : fault.cpu.intel.nb.fsb
>> FRU : hc://:product-id=-zMemory-Scrub:chassis-id=(stuck:server-id=electra/motherboard=0/chip=4
>> faulty
>>
>> How do I determine which CPU or core is at fault? This is on an E4450
>Look at the output from /usr/lib/fm/fmd/fmtopo -V
>for the same FRU and see if that entry tells you which socket.
>Also, you might find the system is numbering the chips 0, 2, 4, 6 in
>the fmtopo output, which would make it the third socket.
Ah, that clears up the confusion. I wasn't sure if `chip' meant CPU
socket or CPU core. I see that individual cores are represented this
way:
motherboard=0/chip=4/cpu=0
Yes, they are numbered 0, 2, 4, 6 in fmtopo and `psrinfo -vp' output.
The board diagram is labelled 0, 1, 2, 3, making it #2 that's faulty.
The FE is going to replace CPU3. I suspect that's the same one.
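For cross-checking, here is a rough Python sketch that pairs each chipid in the `psrinfo -vp' output with the virtual processors listed for it. The function names are hypothetical, the sample text is the output quoted at the start of this thread, and the parsing assumes that exact two-line layout (a "virtual processors" line followed by a "chipid" line); other Solaris releases may format it differently.

```python
import re

# psrinfo -vp output as posted earlier in the thread (brand lines omitted).
PSRINFO_SAMPLE = """\
The physical processor has 4 virtual processors (0 4-6)
  x86 (chipid 0x0 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
The physical processor has 4 virtual processors (1 7-9)
  x86 (chipid 0x2 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
The physical processor has 4 virtual processors (2 10-12)
  x86 (chipid 0x4 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
The physical processor has 4 virtual processors (3 13-15)
  x86 (chipid 0x6 GenuineIntel family 6 model 15 step 11 clock 2933 MHz)
"""


def expand_cpu_list(spec):
    """Expand a list such as '2 10-12' into [2, 10, 11, 12]."""
    cpus = []
    for part in spec.split():
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def chips_to_cpus(psrinfo_text):
    """Map each chipid to the virtual processor IDs listed just above it."""
    mapping = {}
    pending = None
    for line in psrinfo_text.splitlines():
        header = re.search(r"virtual processors? \(([^)]*)\)", line)
        if header:
            pending = expand_cpu_list(header.group(1))
            continue
        chip = re.search(r"chipid (0x[0-9a-fA-F]+)", line)
        if chip and pending is not None:
            mapping[int(chip.group(1), 16)] = pending
            pending = None
    return mapping


mapping = chips_to_cpus(PSRINFO_SAMPLE)
print(mapping[4])  # virtual processors on the chip named in the fault
```

For the output posted here, chipid 0x4 carries virtual processors 2 and 10-12, so those are the processor IDs that sit on the suspect chip.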
--
-Gary Mills- -Unix Group- -Computer and Network Services-
Gary, 6/12/2010 10:14:58 PM

Andrew Gabriel <andrew@cucumber.demon.co.uk> wrote:
> In article <huru7f$ing$1@reader1.panix.com>,
> Cydrome Leader <presence@MUNGEpanix.com> writes:
>>
>> I've seen xeon processors (really cores) fail in solaris before and in
>> real life there's nothing wrong at all with the CPU. For intel hardware
>> just rebooting seems to be the fix. I suspect it's some sort of software
>> issue.
>
> Blimey. We go to extraordinary effort to retrieve and decode all the Intel
> chip telemetry (which Intel tell me no other OS has managed to do to
> anywhere near the same degree) to ensure you don't get any data corruption
> when parts of chips/busses/memory/etc detect error situations, as you'd
> expect from an Enterprise grade OS. Then when it happens, someone says
>
> "I suspect it's some sort of software issue."
>
> ;-)
You work for Sun?
While I agree a machine with a nonrecoverable fault should just crash, I
will point out that writing software to just crash a machine over and over
again without any meaningful error output is in fact a software issue as
well.
Cydrome, 6/12/2010 10:33:36 PM

In article <hv1200$hda$5@reader1.panix.com>,
Cydrome Leader <presence@MUNGEpanix.com> writes:
> Andrew Gabriel <andrew@cucumber.demon.co.uk> wrote:
>> In article <huru7f$ing$1@reader1.panix.com>,
>> Cydrome Leader <presence@MUNGEpanix.com> writes:
>>>
>>> I've seen xeon processors (really cores) fail in solaris before and in
>>> real life there's nothing wrong at all with the CPU. For intel hardware
>>> just rebooting seems to be the fix. I suspect it's some sort of software
>>> issue.
>>
>> Blimey. We go to extraordinary effort to retrieve and decode all the Intel
>> chip telemetry (which Intel tell me no other OS has managed to do to
>> anywhere near the same degree) to ensure you don't get any data corruption
>> when parts of chips/busses/memory/etc detect error situations, as you'd
>> expect from an Enterprise grade OS. Then when it happens, someone says
>>
>> "I suspect it's some sort of software issue."
>>
>> ;-)
>
> You work for sun?
Yes, well Oracle now, although I don't speak for them.
> While I agree a machine with a nonrecoverable fault should just crash, I
> will point out that writing software to just crash a machine over and over
> again without any meaningful error output is in fact a software issue as
> well.
I agree. The fact that Solaris managed to record the necessary chip
failure telemetry after a hardware failure which hit the system hard
enough for it to be unable to dump and unable to recover even after
a reset is quite remarkable. I don't think [m]any other OS's would
give you the slightest clue what went wrong with the system in this
case, yet here we have the relevant faulty chip identified (hopefully,
although in some cases the chip which detects a fault isn't the one
where the fault lies;-), and the more detailed fm record should include
details of exactly what's wrong, for those intimately familiar with its
innards.
--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]
andrew, 6/14/2010 2:51:44 PM

Andrew Gabriel <andrew@cucumber.demon.co.uk> wrote:
> In article <hv1200$hda$5@reader1.panix.com>,
> Cydrome Leader <presence@MUNGEpanix.com> writes:
>> Andrew Gabriel <andrew@cucumber.demon.co.uk> wrote:
>>> In article <huru7f$ing$1@reader1.panix.com>,
>>> Cydrome Leader <presence@MUNGEpanix.com> writes:
>>>>
>>>> I've seen xeon processors (really cores) fail in solaris before and in
>>>> real life there's nothing wrong at all with the CPU. For intel hardware
>>>> just rebooting seems to be the fix. I suspect it's some sort of software
>>>> issue.
>>>
>>> Blimey. We go to extraordinary effort to retrieve and decode all the Intel
>>> chip telemetry (which Intel tell me no other OS has managed to do to
>>> anywhere near the same degree) to ensure you don't get any data corruption
>>> when parts of chips/busses/memory/etc detect error situations, as you'd
>>> expect from an Enterprise grade OS. Then when it happens, someone says
>>>
>>> "I suspect it's some sort of software issue."
>>>
>>> ;-)
>>
>> You work for sun?
>
> Yes, well Oracle now, although I don't speak for them.
>
>> While I agree a machine with a nonrecoverable fault should just crash, I
>> will point out that writing software to just crash a machine over and over
>> again without any meaningful error output is in fact a software issue as
>> well.
>
> I agree. The fact that Solaris managed to record the necessary chip
> failure telemetry after a hardware failure which hit the system hard
> enough for it to be unable to dump and unable to recover even after
> a reset is quite remarkable. I don't think [m]any other OS's would
Is "necessary chip failure telemetry" data that can only be decoded by
hitting a tech group on Usenet and finding a Sun employee?
Still, it's some PC platform in this case so I don't really expect awesome
diagnostics or failure recovery.
I still like the older RS/6000s that would log there was a power supply
fault and commit it to disk if you just pulled the plugs on the server.
That's impressive, and new stuff from Sun still can't pull that off.
Cydrome, 6/15/2010 10:25:33 PM