SIGILL ILL_ILLOPN in write() on 11.23 ia64

I'm trying to track down an odd SIGILL and I'm hoping someone will
have a useful suggestion, or reports of similar behavior.

We have a large, complex process running under HP-UX 11.23 on an
Itanium system, and it's occasionally receiving SIGILL when under
heavy load. The same code runs on many other Unix platforms, and we
haven't seen this particular error anywhere else (which doesn't prove
anything, of course, but suggests it might be something this platform
is particularly sensitive to).

gdb core file analysis says the fault is happening in a call to
write(2), sending data on a TCP socket:

-----
Program terminated with signal 4, Illegal instruction.
ILL_ILLOPN - Illegal Operand
(no debugging symbols found)...#0  0x0 in <unknown_procedure> ()
(gdb) bt
#0  0x0 in <unknown_procedure> ()
warning: Attempting to unwind past bad PC 0x0
#1  0xe0000001200028e0 in <unknown_procedure> ()
#2  0x60000000e6f29d70:0 in _write_sys+0x30 () from /usr/lib/hpux32/libc.so.1
#3  0x60000000e6f3f430:0 in write+0xb0 () from /usr/lib/hpux32/libc.so.1
#4  0x60000000e5e90700:0 in mFt_NET_tcp_write () at tcpnet.c:11448
-----

The rest of the stack trace looks valid, and the source line at frame
4 (the first frame in our code) is indeed a call to write, so I'm
inclined to trust it.

Thus far we've only reproduced the problem in a release (optimized, no
symbols) build, so I can't validate the parameters in the call. (Well,
I might be able to, if I spend some time figuring out ia64 assembly
and the ABI, but extracting accurate values from a core dump of an
optimized, stripped binary is chancy anyway.) But the program logic
implies that they've mostly been verified already by other calls; for
example, the descriptor's been used in select and other socket calls
in this same function. And I wouldn't expect SIGILL / ILL_ILLOPN from
a typical bad-parameter error - SIGSEGV or SIGBUS, maybe, but ILLOPN?
(And write only takes three parameters: the descriptor, a char*, and
an unsigned int. The first and third shouldn't even be able to have
trap representations.)

I'm working on reproducing it in a debug build, with the debugger
already attached to the process. In the meantime, though, I was
wondering if anyone had any clever suggestions for tracking this down,
or if anyone's seen anything like it.

Also, does anyone know under what conditions HP-UX sets si_code to
ILL_ILLOPN on a SIGILL? Some Google searches suggest that Linux for
ia64 sets it for NaT register consumption, but I don't know if that's
true of HP-UX as well.

-- 
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University
Reply Michael 1/28/2010 5:34:41 PM

Michael Wojcik wrote:
> gdb core file analysis says the fault is happening in a call to
> write(2), sending data on a TCP socket:
> Program terminated with signal 4, Illegal instruction.
> ILL_ILLOPN - Illegal Operand
> #0  0x0 in <unknown_procedure>
> warning: Attempting to unwind past bad PC 0x0
> #1  0xe0000001200028e0 in <unknown_procedure>
> #2  0x60000000e6f29d70:0 in _write_sys+0x30

This says that you are jumping to location 0 from the kernel gateway
page.  That's why you get Illegal Operand.  You should link with -z so
you get signal 11.

> I can't validate the parameters in the call. (Well,
> I might be able to, if I spend some time figuring out ia64 assembly
> and the ABI, but extracting accurate values from a core dump of an
> optimized, stripped binary is chancy anyway.)

It's pretty easy, assuming the registers haven't been reused.  Select
the frame, then:
p /x $r32
x /20x $r33
p /d $r34

> (And write only takes three parameters: the descriptor, a char*, and
> an unsigned int.

No, the size is unsigned long.

> does anyone know under what conditions HP-UX sets ILL_ILLOPN on
> a SIGILL?

You get it when you have a bad instruction.

> Some Google searches suggest that Linux for ia64 sets it for
> NaT register consumption, but I don't know if that's true of HP-UX as well.

It is only for Integrity.  And that's the case where you have bad data.
ILL_REGNAT    9  /* +  Register NaT Consumption */
Reply Dennis 1/29/2010 11:33:39 AM


Dennis Handly wrote:
> Michael Wojcik wrote:
>> gdb core file analysis says the fault is happening in a call to
>> write(2), sending data on a TCP socket:
>> Program terminated with signal 4, Illegal instruction.
>> ILL_ILLOPN - Illegal Operand
>> #0  0x0 in <unknown_procedure>
>> warning: Attempting to unwind past bad PC 0x0
>> #1  0xe0000001200028e0 in <unknown_procedure>
>> #2  0x60000000e6f29d70:0 in _write_sys+0x30
> 
> This is saying from the kernel gateway page you are jumping to location
> 0.  That's why you get Illegal Operand.

Thanks. So this is an actual branch or call to location 0? Any idea
what would cause this below a call to write? (Presumably something got
stomped, but what? Why would _write_sys be calling through a vector
that's writable from usermode code?)
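
One way to see which signal a given platform and set of link options delivers for a transfer of control to address 0 is to provoke it deliberately in a child process. The following is only a sketch: `signal_from_null_call` and `void_fn` are hypothetical names, calling through a null function pointer is undefined behavior in ISO C, and the result is platform-specific. It says nothing about why `_write_sys` would end up at 0.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*void_fn)(void);

/*
 * Fork a child that calls through a null function pointer and report
 * how the child died.
 * Returns the terminating signal number, 0 if the child exited
 * normally, or -1 if fork/waitpid failed.
 */
int signal_from_null_call(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: suppress the core file, then jump to address 0. */
        struct rlimit no_core = { 0, 0 };
        void_fn volatile fn = (void_fn)0;  /* volatile keeps the call from being optimized away */

        setrlimit(RLIMIT_CORE, &no_core);
        fn();       /* undefined behavior; expected to raise a signal */
        _exit(0);   /* reached only if the call somehow returned */
    }

    /* Parent: collect the child's termination status. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Comparing the returned value against SIGILL and SIGSEGV, with and without -z, would show whether the platform's behavior matches what the core file reports.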

> You should link with -z so you get signal 11.

The process *is* linked with -z, and chatr says "nulptr dereferences
trap enabled". (Though it doesn't much matter to me whether I get
SIGILL or SIGSEGV.)

>> I can't validate the parameters in the call. (Well,
>> I might be able to, if I spend some time figuring out ia64 assembly
>> and the ABI, but extracting accurate values from a core dump of an
>> optimized, stripped binary is chancy anyway.)
> 
> It's pretty easy, assuming the registers haven't been reused, set the
> frame then:
> p /x $r32
> x /20x $r33
> p /d $r34

Thanks. I printed the contents of a bunch of registers, but I wasn't
having much luck finding the values I was looking for. They may well
have been changed since the call.

>> (And write only takes three parameters: the descriptor, a char*, and
>> an unsigned int.
> 
> No, the size is unsigned long.

Technically size_t. I meant the actual argument was an unsigned int,
but of course the prototype causes it to be converted to the
underlying type of size_t (an unsigned long in this implementation).
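
The widening of the third argument can be illustrated with a small sketch. `widened_count` is a hypothetical stand-in for a function prototyped like write(2), and `pass_unsigned_int` plays the role of the caller; neither is from the code under discussion. With a prototype in scope the argument is converted to the parameter type as if by assignment, and the value is preserved on any implementation where size_t is at least as wide as unsigned int (assumed here).

```c
#include <limits.h>
#include <stddef.h>

/* Stand-in for a prototyped function whose parameter is a size_t,
 * as with the byte count of write(2). It just returns what it got. */
size_t widened_count(size_t nbyte)
{
    return nbyte;
}

/* The caller holds the length in an unsigned int; the prototype above
 * makes the compiler convert it to size_t at the call. */
size_t pass_unsigned_int(unsigned int len)
{
    return widened_count(len);
}
```

So the value the callee sees is the caller's unsigned int value, zero-extended if size_t is wider.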

>> does anyone know under what conditions HP-UX sets ILL_ILLOPN on
>> a SIGILL?
> 
> You get it when you have a bad instruction.

Do you mean a bad opcode, or just a generic "bad instruction"? The
latter is obvious. I figured a bad opcode would set si_code to
ILL_ILLOPC. I'm asking about what specifically causes si_code to be
set to ILL_ILLOPN.

>> Some Google searches suggest that Linux for ia64 sets it for
>> NaT register consumption, but I don't know if that's true of HP-UX as
>> well.
> 
> It is only for Integrity.  And that's the case where you have bad data.
> ILL_REGNAT    9  /* +  Register NaT Consumption */

Yeah, I see that now in /usr/include/ia64/sys/siginfo.h. Thanks.

-- 
Michael Wojcik
Micro Focus
Rhetoric & Writing, Michigan State University
Reply Michael 1/29/2010 5:04:48 PM

Michael Wojcik wrote:
> So this is an actual branch or call to location 0? Any idea
> what would cause this below a call to write?

That's what the debugger thinks; perhaps there is a problem with it?

> (Presumably something got
> stomped, but what? Why would _write_sys be calling through a vector
> that's writable from usermode code?)

I'm not sure why.  Perhaps it detected a problem with the parameters?

> Do you mean a bad opcode, or just a generic "bad instruction"? The
> latter is obvious. I figured a bad opcode would set si_code to
> ILL_ILLOPC. I'm asking about what specifically causes si_code to be
> set to ILL_ILLOPN.

I don't see any Illegal Operand errors in IPF.  Just Illegal Operation.
Reply Dennis 1/30/2010 12:27:30 AM

Michael Wojcik wrote:
> what specifically causes si_code to be set to ILL_ILLOPN.


ILL_ILLOPN indicates a reserved register/field fault.
Reply Dennis 1/30/2010 7:47:56 AM
