f



high system CPU usage

This is on an idle system, a W2100z running an OpenSolaris build-24 kernel.
(DEBUG). I think I had the same problem with the Nevada build that was on
the box before.

One of the CPUs is all bound up with system usage. I *know* this is a job
for DTrace, but I'm not yet savvy enough to work out exactly what to do
with it. (I've only previously studied the behaviour of applications.)

Here's what the standard tools tell me. First vmstat:

0 0 0 2220160 1387472 0 73  0  0  0  0  0  0  0  0  0 26000 1436 25997 2 39
60
 0 0 0 2220160 1387476 0  0  0  0  0  0  0  0  0  0  0 25945 1589 25952 3 39
58
 0 0 0 2220160 1387476 0  0  0  0  0  0  0  0  0  0  0 26070 752 25932 1 38
61

and now mpstat.

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    3   0    4 13119  411 25292  211  111   61    0   273    0  75   0 
24
  1    0   0   10 12784 12635  593   29  109   52    0  1135    4   2   0 
94
  0    0   0    2 13078  424 25171  129  124  116    0   244    1  75   0 
25
  1    0   0   14 12810 12637  807   75  126  100    0  1232    3   3   0 
94

I've shut down all the apps. When I run prstat, prstat is the top task.
Obviously the kernel is doing something, but what? This is a problem which
gets worse over time - a reboot fixes it, but I want to know how to track
down the problem. Hopefully someone can help.

TIA


Rob
0
rdf
10/20/2005 8:43:35 PM
comp.unix.solaris 26025 articles. 2 followers. Post Follow

14 Replies
846 Views

Similar Articles

[PageSpeed] 18

rdf wrote:
> This is on an idle system, a W2100z running an OpenSolaris build-24 kernel.
> (DEBUG). I think I had the same problem with the Nevada build that was on
> the box before.
> 
> One of the CPUs is all bound up with system usage. I *know* this is a job
> for DTrace, but I'm not yet savvy enough to work out exactly what to do
> with it. (I've only previously studied the behaviour of applications.)
> 
> Here's what the standard tools tell me. First vmstat:
> 
> 0 0 0 2220160 1387472 0 73  0  0  0  0  0  0  0  0  0 26000 1436 25997 2 39
> 60
>  0 0 0 2220160 1387476 0  0  0  0  0  0  0  0  0  0  0 25945 1589 25952 3 39
> 58
>  0 0 0 2220160 1387476 0  0  0  0  0  0  0  0  0  0  0 26070 752 25932 1 38
> 61
> 
> and now mpstat.
> 
> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
>   0    3   0    4 13119  411 25292  211  111   61    0   273    0  75   0 
> 24
>   1    0   0   10 12784 12635  593   29  109   52    0  1135    4   2   0 
> 94
>   0    0   0    2 13078  424 25171  129  124  116    0   244    1  75   0 
> 25
>   1    0   0   14 12810 12637  807   75  126  100    0  1232    3   3   0 
> 94
> 
> I've shut down all the apps. When I run prstat, prstat is the top task.
> Obviously the kernel is doing something, but what? This is a problem which
> gets worse over time - a reboot fixes it, but I want to know how to track
> down the problem. Hopefully someone can help.
> 
> TIA
> 

Many interrupts. My ideas:

Network traffic? Disconnect while watching.
Keyboard defective? Disconnect while watching.


-- 
Michael Tosch @ hp : com
0
Michael
10/20/2005 9:13:50 PM
Run the top command and see which command is taking too much time.
Regards,
Thobias
Michael Tosch wrote:

> rdf wrote:
>
>> This is on an idle system, a W2100z running an OpenSolaris build-24 
>> kernel.
>> (DEBUG). I think I had the same problem with the Nevada build that 
>> was on
>> the box before.
>>
>> One of the CPUs is all bound up with system usage. I *know* this is a 
>> job
>> for DTrace, but I'm not yet savvy enough to work out exactly what to do
>> with it. (I've only previously studied the behaviour of applications.)
>>
>> Here's what the standard tools tell me. First vmstat:
>>
>> 0 0 0 2220160 1387472 0 73  0  0  0  0  0  0  0  0  0 26000 1436 
>> 25997 2 39
>> 60
>>  0 0 0 2220160 1387476 0  0  0  0  0  0  0  0  0  0  0 25945 1589 
>> 25952 3 39
>> 58
>>  0 0 0 2220160 1387476 0  0  0  0  0  0  0  0  0  0  0 26070 752 
>> 25932 1 38
>> 61
>>
>> and now mpstat.
>>
>> CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  
>> wt idl
>>   0    3   0    4 13119  411 25292  211  111   61    0   273    0  
>> 75   0 24
>>   1    0   0   10 12784 12635  593   29  109   52    0  1135    4   
>> 2   0 94
>>   0    0   0    2 13078  424 25171  129  124  116    0   244    1  
>> 75   0 25
>>   1    0   0   14 12810 12637  807   75  126  100    0  1232    3   
>> 3   0 94
>>
>> I've shut down all the apps. When I run prstat, prstat is the top task.
>> Obviously the kernel is doing something, but what? This is a problem 
>> which
>> gets worse over time - a reboot fixes it, but I want to know how to 
>> track
>> down the problem. Hopefully someone can help.
>>
>> TIA
>>
>
> Many interrupts. My ideas:
>
> Network traffic? Disconnect while watching.
> Keyboard defective? Disconnect while watching.
>
>


-- 
Thobias Vakayil
Alcatel Development India (ADI)
PH: 2349961/72/86 EXTN :7018
0
Thobias
10/21/2005 3:15:34 AM
What does intrstat tell you?

# intrstat

      device |      cpu0 %tim
-------------+---------------
 audioi810#0 |        70  0.0
      ehci#0 |         1  0.0
      gani#0 |       193  0.4
    nvidia#0 |       193  0.1
      ohci#0 |        70  0.1

      device |      cpu0 %tim
-------------+---------------
       ata#0 |         1  0.0
 audioi810#0 |       168  0.2
      ehci#0 |         1  0.0
      gani#0 |       122  0.3
    nvidia#0 |       122  0.0
      ohci#0 |       168  0.1

0
barts
10/21/2005 6:12:14 AM
rdf <lists@thecatflap.co.uk> writes:

> This is on an idle system, a W2100z running an OpenSolaris build-24 kernel.
> (DEBUG). I think I had the same problem with the Nevada build that was on
> the box before.
>
> One of the CPUs is all bound up with system usage. I *know* this is a job
> for DTrace, but I'm not yet savvy enough to work out exactly what to do
> with it. (I've only previously studied the behaviour of applications.)
>
> Here's what the standard tools tell me. First vmstat:
>
> 0 0 0 2220160 1387472 0 73  0  0  0  0  0  0  0  0  0 26000 1436 25997 2 39
> 60

The problem is the 26000 interrupts / second.

Try to find out which device is sending the 26000 interrupts using
the "intrstat" command.

It is possible that the number of interrupts reported by "intrstat" is
nowhere near the 26000 interrupts / second.  If that is the case, try
the following dtrace script (save as "inststat.d", "chmod +x intrstat.d",
run intrstat.d):

========================================================================
#!/usr/sbin/dtrace -s

interrupt-start
{
        self->ts = vtimestamp;
}


interrupt-complete
/arg0 == 0/
{
        this->dn_name = "<unknown>";
        @counts[this->dn_name, arg1] = count();
        @times[this->dn_name, arg1] = sum(vtimestamp - self->ts);
        self->ts = 0;
}

interrupt-complete
/arg0 != 0/
{
        this->devi = (struct dev_info*)arg0;
        this->dn_name = stringof(`devnamesp[this->devi->devi_major].dn_name);
        @counts[this->dn_name, this->devi->devi_instance] = count();
        @times[this->dn_name, this->devi->devi_instance] =
                                                sum(vtimestamp - self->ts);
        self->ts = 0;
}

tick-1sec
{
        printf("\n%-10s %-16s %s\n---------------------------------------\n",
                "driver", "instance", "intr-count");
        printa("%-10s %-16a %@8u\n", @counts);
        /* printa(@times); */
        clear(@counts);
        clear(@times);
}
========================================================================


See also (the yahoo thread is probably about exactly the same problem):

  http://www.opensolaris.org/jive/thread.jspa?threadID=2942&tstart=0
  http://groups.yahoo.com/group/solarisx86/message/29841
  http://groups.yahoo.com/group/solarisx86/message/29998
0
Juergen
10/21/2005 2:38:47 PM
> Try to find out which device is sending the 26000 interrupts using
> the "intrstat" command.

I forgot to mention that I'd run intrstat and seen nothing untoward. Of
course, now I've got something to try, the machine is behaving perfectly!

> It is possible that the number of interrupts reported by "intrstat" is
> nowhere near the 26000 interrupts / second.  If that is the case, try
> the following dtrace script (save as "inststat.d", "chmod +x intrstat.d",
> run intrstat.d):

Cool. That's just what I was looking for.

> See also (the yahoo thread is probably about exactly the same problem):

Looks just the same as mine. Thanks a lot Juergen - lots of interesting
reading in that group. I'll report back if and when I solve my problem.



R
0
rdf
10/22/2005 12:42:45 AM
hi,

i'm seeing the same symptoms you do. juergen already referenced my thread
on the solarisx86 list.

rdf <lists@thecatflap.co.uk> writes:

> I forgot to mention that I'd run intrstat and seen nothing untoward. Of
> course, now I've got something to try, the machine is behaving perfectly!
hihi, i know that... took me a week of waiting (until today) to see that happen 
again. 

> > It is possible that the number of interrupts reported by "intrstat" is
> > nowhere near the 26000 interrupts / second.  If that is the case, try
> > the following dtrace script (save as "inststat.d", "chmod +x intrstat.d",
> > run intrstat.d):
> 
> Cool. That's just what I was looking for.
for me it's the acpi interrupts running up after about a week.

using juergens dtrace script i see:

bernd.charon ~/build/autovec # ./acpi3.d
dtrace: script './acpi3.d' matched 4 probes
CPU     ID                    FUNCTION:NAME
  0  48441                       :tick-1sec
driver     instance         intr-count
---------------------------------------
adpu320    0x1                     4
bge        0x0                     4
hci1394    0x0                   106
ohci       0x1                   106
<unknown>  unix`xc_serv          140
ehci       0x0                   172
nvidia     0x0                   172
<unknown>  unix`cbe_fire         207
<unknown>  acpica`acpi_wrapper_isr    16024

-- 
As they say, never underestimate the bandwidth of a 747 full of CDROMs!
0
Bernd
10/24/2005 9:24:12 AM
Bernd Markgraf <markgraf@mail.cs.uni-magdeburg.de> writes:

>adpu320    0x1                     4
>bge        0x0                     4
>hci1394    0x0                   106
>ohci       0x1                   106
><unknown>  unix`xc_serv          140
>ehci       0x0                   172
>nvidia     0x0                   172
><unknown>  unix`cbe_fire         207
><unknown>  acpica`acpi_wrapper_isr    16024

Ah, interesting; so ACPI gives you all these interrupts.

This is somewhat surprising unless we programmed the interrupt incorrectly;
acpica generally disables interrupts for devices which have no handler defined.

Casper
-- 
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
0
Casper
10/24/2005 9:47:32 AM
Casper H.S. Dik <Casper.Dik@Sun.COM> writes:

> Bernd Markgraf <markgraf@mail.cs.uni-magdeburg.de> writes:
>
>><unknown>  unix`xc_serv          140
>>ehci       0x0                   172
>>nvidia     0x0                   172
>><unknown>  unix`cbe_fire         207
>><unknown>  acpica`acpi_wrapper_isr    16024
>
> Ah, interesting; so ACPI gives you all these interrupts.
>
> This is somewhat surprising unless we programmed the interrupt
> incorrectly;

That has been my theory so far; bug 6284164 might be related and it
seems the acpi sci interrupt was installed on the wrong vector in the
past.  But bug 6284164 is supposed to be fixed in solaris
nevada(snv_22); the original poster that started this usenet thread
reported the problem for a debug build 24 kernel. OTOH, Bernd is using
snv_20, ...


Bernd has just mailed me the result of dtrace'ing acpi_wrapper_isr()
return values using the following script, and it returns "1" all the
time (== DDI_INTR_CLAIMED):

#!/usr/sbin/dtrace -s

fbt:acpica:acpi_wrapper_isr:return
{
        printf("return %d", arg1);
}


So "acpica" is apparently working hard, handling all of the
acpi_wrapper_isr interrupts.

> acpica generally disables interrupts for devices which
> have no handler defined.

But isn't this an acpi "sci" interrupt?  What kind of interrupts
are send to that acpi "sci" interrupt handler?  Things like
"cpu is running too hot", "fan is running too slow", "battery
needs charging" (on laptops),  and things like that, perhaps ?
0
Juergen
10/24/2005 11:02:46 AM
Casper H.S. Dik <Casper.Dik@Sun.COM> writes:

> Ah, interesting; so ACPI gives you all these interrupts.
yes. juergen suspected this a while ago...

> This is somewhat surprising unless we programmed the interrupt incorrectly;
> acpica generally disables interrupts for devices which have no handler defined.
how do look further into this? juergen sent me an dtrace script, which 
gives the return values of the acpi_wrapper_isr function. this one always
returns 1 (interrupt claimed). is there any way to fid out, where the interrupts
come from?

 bernd

-- 
Du beginnst mit einem vollen Sack Gl�ck und einem leeren Sack
Erfahrung. Der Trick ist, den Erfahrungssack aufzuf�llen bevor 
Du den Gl�ck-Sack geleert hast.

0
Bernd
10/24/2005 11:12:46 AM
Juergen Keil <jk@tools.de> writes:

> But isn't this an acpi "sci" interrupt?  What kind of interrupts
> are send to that acpi "sci" interrupt handler?  Things like
> "cpu is running too hot", "fan is running too slow", "battery
> needs charging" (on laptops),  and things like that, perhaps ?
i was wondering what triggers this. as it happens after about a
week i suspected that something might get too hot. but that boxed
is in an airconditioned room (21ged C all the time) so i don't 
think it's temperature related. 

-- 
As they say, never underestimate the bandwidth of a 747 full of CDROMs!

0
Bernd
10/24/2005 11:16:56 AM
Juergen Keil <jk@tools.de> writes:

>But isn't this an acpi "sci" interrupt?  What kind of interrupts
>are send to that acpi "sci" interrupt handler?  Things like
>"cpu is running too hot", "fan is running too slow", "battery
>needs charging" (on laptops),  and things like that, perhaps ?

Yes, but those would be the result of GPEs and such and when there
are no interested parties, ACPICA disables those GPEs.

What is more likely is that the SCI is incorrectly inserted and that
there's another device which is claiming interrupts.

(The way b21 and before work are such that the interrupt cannot be shared
and if another device attaches before ACPI then it loses the vector once
acpica attached and vice versa).

So upgrade to b24 or 25 and see if that fixes the issue.

Casper
-- 
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.
0
Casper
10/24/2005 1:32:17 PM
> (The way b21 and before work are such that the interrupt cannot be shared
> and if another device attaches before ACPI then it loses the vector once
> acpica attached and vice versa).
> 
> So upgrade to b24 or 25 and see if that fixes the issue.

Looks like this is the same issue I'm having. Here's the output of Juergen's
dtrace script. I'm afraid this is on build-25.

driver     instance         intr-count
---------------------------------------
ata        0x1                     0
bge        0x0                     2
adpu320    0x1                    21
ehci       0x0                   152
nvidia     0x0                   152
<unknown>  unix`xc_serv          172
<unknown>  unix`cbe_fire         206
<unknown>  acpica`acpi_wrapper_isr    12232

I've tried disabling acpi with eeprom,

[tap:/home/rob]$ eeprom acpi-user-options
acpi-user-options=0x2

and rebooted, but having read these pages:

 http://blogs.sun.com/roller/page/danasblog
 http://blogs.sun.com/roller/page/chewmyblog/20050614

I'm less certain that was the right thing to do. I've also just upgraded the
BIOS on the W2100z.


Rob
0
rdf
10/28/2005 2:34:14 PM
rdf <lists@thecatflap.co.uk> writes:

> Looks like this is the same issue I'm having. Here's the output of Juergen's
> dtrace script. I'm afraid this is on build-25.
> 
> driver     instance         intr-count
> ---------------------------------------
> ata        0x1                     0
> bge        0x0                     2
> adpu320    0x1                    21
> ehci       0x0                   152
> nvidia     0x0                   152
> <unknown>  unix`xc_serv          172
> <unknown>  unix`cbe_fire         206
> <unknown>  acpica`acpi_wrapper_isr    12232

i also updated to b25 and still see the same symptoms:

bernd.charon ~/build/autovec # uptime
  2:55pm  up  2:33,  1 user,  load average: 2.08, 2.04, 1.77

bernd.charon ~/build/autovec # uname -a
SunOS charon 5.11 snv_25 i86pc i386 i86pc

driver     instance         intr-count
---------------------------------------
ata        0x1                     0
ohci       0x0                     0
adpu320    0x1                     1
bge        0x0                     4
<unknown>  unix`xc_serv          111
hci1394    0x0                   112
ohci       0x1                   112
ehci       0x0                   176
nvidia     0x0                   176
<unknown>  unix`cbe_fire         204
<unknown>  acpica`acpi_wrapper_isr    12579

 bernd

-- 
As they say, never underestimate the bandwidth of a 747 full of CDROMs!
0
Bernd
10/31/2005 1:56:41 PM
Here's the report of the last week:

I set acpi-user-options to 0x8 (legacy mode) which disables the acpi
interpreter. This worked fine for the last week. No symptoms whatsoever.

Juergen sent me another dtrace script which catches the ACPI events.
So after a week of no trouble i switched back to the default acpi 
mechanism (acpi-user-options=0x0) and five minutes later i had the
same symptoms again and the script indicated GPE26 to be the source.

Just a quick note. Maybe 'rdf' could check if acpi-user-options=0x8
make his box symptom free as well.

--
As they say, never underestimate the bandwidth of a 747 full of CDROMs!
0
Bernd
11/7/2005 11:23:57 AM
Reply: