Invalid instruction pointer



I have been working on a problem with an installed product for
approximately a year now.  After much investigation including careful
review of the code and repeat of ANSI hardware testing, we have been
unable to recreate the problem in house.  However, through analysis of
the symptoms we have come to believe that something is causing the
instruction pointer in this embedded application to be pointed to the
wrong code address.

My question is what external events can affect a microprocessor in
such a way that it essentially gets "lost" in execution?  We are
reasonably certain that an external event is the cause, rather than a
stack problem, as the majority of the installed product are working
fine and have been for over a year.

The microprocessor we are using does not have an illegal instruction
trap or watchdog timer, so in order to fix the problem, we would
likely need hardware modifications.  I would like any information that
any of you might have gleaned in past experience with similar issues
so that we can pursue testing based on most likely causes.

Reply ginger.zinkowski1 (1) 7/24/2007 3:54:30 PM

ginger.zinkowski@ge.com wrote:

<snip>
> 
> My question is what external events can affect a microprocessor in
> such a way that it essentially gets "lost" in execution?  We are
> reasonably certain that an external event is the cause, rather than a
> stack problem, as the majority of the installed product are working
> fine and have been for over a year.
> 
> The microprocessor we are using does not have an illegal instruction
> trap or watchdog timer, so in order to fix the problem, we would
> likely need hardware modifications.

<snip>

I have a project that is using salvaged components (desoldered ICs with
short leads) installed in tin-plate sockets; with temp. and humidity
changes, the parts move and integrity of connections suffers so that
the mcu does bad external memory fetches.  All unused areas of the
firmware store contain a 'jump relative to self' instruction so that
when the execution goes south I can reset and debug with less chance
of losing state.  Is your product using socketed ICs?
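
A minimal host-side sketch of that fill step, for anyone who wants to try it on a firmware image before programming. The `used_region_t` type and the `fill_unused` function are hypothetical names, not from any toolchain, and the thread never names the processor, so the trap pattern is passed in as a parameter. The bytes 0x80 0xFE mentioned in the comment are the 8051 encoding of a short jump to self and serve only as an example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one used region of the firmware image. */
typedef struct {
    size_t start;   /* offset of the first used byte */
    size_t len;     /* number of used bytes          */
} used_region_t;

/* Write the repeating trap pattern into image[from..to). The pattern restarts
 * at the beginning of each gap so the first trap instruction is intact. A gap
 * whose length is not a multiple of pattern_len ends in a partial pattern. */
static void fill_gap(uint8_t *image, size_t from, size_t to,
                     const uint8_t *pattern, size_t pattern_len)
{
    for (size_t i = from; i < to; i++) {
        image[i] = pattern[(i - from) % pattern_len];
    }
}

/* Fill every byte not covered by a used region with the trap pattern.
 * Assumptions: regions[] is sorted by start, regions do not overlap, and
 * every region lies inside the image. Example pattern: {0x80, 0xFE}. */
void fill_unused(uint8_t *image, size_t image_len,
                 const used_region_t *regions, size_t region_count,
                 const uint8_t *pattern, size_t pattern_len)
{
    size_t cursor = 0;

    if (pattern_len == 0) {
        return;
    }
    for (size_t r = 0; r < region_count; r++) {
        if (regions[r].start > cursor) {
            fill_gap(image, cursor, regions[r].start, pattern, pattern_len);
        }
        cursor = regions[r].start + regions[r].len;
    }
    if (cursor < image_len) {
        fill_gap(image, cursor, image_len, pattern, pattern_len);
    }
}
```

Whether a jump-to-self or a jump to recovery code is the better pattern depends on whether you want to freeze the state for debugging or recover in the field.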

I also have some projects working in very harsh RFI environments;
these necessitated complete enclosure in tin can faraday shields,
together with ample power rail isolation and ferrite bead installations
in order to stop the glitches. What is the operating environment of
your product?

Regards,

Michael

Reply msg 7/24/2007 5:45:59 PM


ginger.zinkowski@ge.com wrote:
> I have been working on a problem with an installed product for
> approximately a year now.  After much investigation including careful
> review of the code and repeat of ANSI hardware testing, we have been
> unable to recreate the problem in house.  However, through analysis of
> the symptoms we have come to believe that something is causing the
> instruction pointer in this embedded application to be pointed to the
> wrong code address.
> 
> My question is what external events can affect a microprocessor in
> such a way that it essentially gets "lost" in execution?  We are
> reasonably certain that an external event is the cause, rather than a
> stack problem, as the majority of the installed product are working
> fine and have been for over a year.
> 
> The microprocessor we are using does not have an illegal instruction
> trap or watchdog timer, so in order to fix the problem, we would
> likely need hardware modifications.  I would like any information that
> any of you might have gleaned in past experience with similar issues
> so that we can pursue testing based on most likely causes.
> 
Intel issued an app-note many years ago (for the 8048, no less!), 
"Designing High-Reliability Software for Automotive Applications", or 
something like that. It assumed that someone would be careless with a 
hot sparkplug lead some day, & the CPU would make a random jump to *any* 
accessible location. The idea was to be able to recover from that.
  They programmed in assembler, not C, which allowed some cunning 
tricks. For instance, share out the unused ROM space, so that there is a 
dead zone after every unconditional jump/return. Fill such space with 
jumps to recovery code. There was much more in that vein. (Sorry, I 
don't have a copy to hand.)
Reply davebXXX (145) 7/24/2007 6:25:48 PM

On Jul 24, 11:54 am, ginger.zinkow...@ge.com wrote:
> I have been working on a problem with an installed product for
> approximately a year now.  After much investigation including careful
> review of the code and repeat of ANSI hardware testing, we have been
> unable to recreate the problem in house.  However, through analysis of
> the symptoms we have come to believe that something is causing the
> instruction pointer in this embedded application to be pointed to the
> wrong code address.
>
> My question is what external events can affect a microprocessor in
> such a way that it essentially gets "lost" in execution?  We are
> reasonably certain that an external event is the cause, rather than a
> stack problem, as the majority of the installed product are working
> fine and have been for over a year.
>
> The microprocessor we are using does not have an illegal instruction
> trap or watchdog timer, so in order to fix the problem, we would
> likely need hardware modifications.  I would like any information that
> any of you might have gleaned in past experience with similar issues
> so that we can pursue testing based on most likely causes.

Stack overruns are only one source of this kind of problem. (And you
are absolutely 100% positively certain with no doubt that it is not a
stack overrun?)

The overrun does not have to be corrupting the stack.
 Does the code run from RAM?
 You may have had a pointer overwrite code with some data.

 Do you have any state machine tables that use function pointers?
 You may have a bad state value so that you jump to a nonexistent
function.

 Do you have fully debugged interrupt routines?
 You might be forgetting to push/pop a value on the stack in some
certain special case.

 Are you 100% sure the hardware is working?
 No overclocking the CPU or memory? Do you run diagnostics on power-
up? ESD protection??
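
On the function-pointer question, a minimal sketch of a range-checked dispatch, assuming a table-driven state machine. The states, handlers, and the `step` function are all hypothetical names made up for illustration. The point is only that a corrupted state value gets counted and rejected instead of being used as an index into the table.

```c
#include <stddef.h>

/* Hypothetical states; ST_COUNT is the number of valid states. */
typedef enum { ST_IDLE, ST_RUN, ST_DONE, ST_COUNT } state_t;
typedef state_t (*state_fn)(void);

static state_t on_idle(void) { return ST_RUN; }
static state_t on_run(void)  { return ST_DONE; }
static state_t on_done(void) { return ST_IDLE; }

static const state_fn handlers[ST_COUNT] = { on_idle, on_run, on_done };

/* Count of rejected state values, kept as evidence for later analysis. */
static unsigned bad_state_count;

unsigned bad_state_seen(void)
{
    return bad_state_count;
}

/* Run one step. The state arrives as a raw integer because a corrupted
 * variable can hold any value, not just a valid enumerator. */
state_t step(unsigned raw_state)
{
    if (raw_state >= ST_COUNT || handlers[raw_state] == NULL) {
        bad_state_count++;      /* record the event */
        return ST_IDLE;         /* fall back to a known-safe state */
    }
    return handlers[raw_state]();
}
```

If the counter ever becomes nonzero in the field, that points toward memory corruption rather than the instruction pointer itself.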

What is the difference between your lab set up and the field? There
could be some keys there.
(In this regard, I spent nearly a year trying to debug an intermittent
freeze up on a machine. In the lab I ran the system using an ICE (In
Circuit Emulator). I finally concluded there was a hardware issue and
using the ICE changed the impedance of the circuit enough to avoid the
problem. Of course we couldn't ship an ICE with each unit sold!)

Without knowing your code, only vague generalities like above come to
mind.

It is very hard to track down intermittent problems. Patience and lots
of information are required. You need a lot of evidence to push the
problem to the hardware side. That's why I'll end the way I started:
Are you really sure this is not a buffer overrun (stack) issue?

HTH,
  Ed Prochak
Let me know if you want more detailed help off-line

Reply edprochak (546) 7/24/2007 6:40:36 PM

On Jul 24, 10:54 am, ginger.zinkow...@ge.com wrote:

> We are
> reasonably certain that an external event is the cause, rather than a
> stack problem, as the majority of the installed product are working
> fine and have been for over a year.

That's a fairly risky basis for certainty.  The one problem unit may
be experiencing very different task loading, external timing
intervals, etc. than the others.  Just because most of them work
doesn't mean they don't all have a deadly bug.

If you cannot recreate the failure, you will probably have to go to
the problem location and try modifications to both software and
hardware.  Add tracing output.  Change the power supply to an external
one.  Add shielding.  Etc... figure out what it is that makes the
difference.

Also, you may not have dedicated trap capabilities, but with some care
you may be able to insert jumps to a trap routine between your
operational code and data.

Reply cs_posting (543) 7/24/2007 6:51:17 PM

ginger.zinkowski@ge.com wrote:
> My question is what external events can affect a microprocessor in
> such a way that it essentially gets "lost" in execution?  We are
> reasonably certain that an external event is the cause, rather than a
> stack problem, as the majority of the installed product are working
> fine and have been for over a year.

Not long ago, I was working on a PowerPC 860 operating
at 3.3V that had a 74HCxxx OR gate operating at 5V
driving an interrupt input. Occasionally, an undershoot
of about 2.5V on the interrupt would cause the system
to go out into the weeds. The moral of the story is...
beware of mixed-voltage systems and fast edges ;-)

-- 
Michael N. Moran
Reply mnmoran (182) 7/24/2007 11:57:26 PM

ginger.zinkowski@ge.com wrote:
> I have been working on a problem with an installed product for
> approximately a year now.  After much investigation including careful
> review of the code and repeat of ANSI hardware testing, we have been
> unable to recreate the problem in house.  However, through analysis of
> the symptoms we have come to believe that something is causing the
> instruction pointer in this embedded application to be pointed to the
> wrong code address.
> 
> My question is what external events can affect a microprocessor in
> such a way that it essentially gets "lost" in execution?  We are
> reasonably certain that an external event is the cause, rather than a
> stack problem, as the majority of the installed product are working
> fine and have been for over a year.
> 
> The microprocessor we are using does not have an illegal instruction
> trap or watchdog timer, so in order to fix the problem, we would
> likely need hardware modifications.  I would like any information that
> any of you might have gleaned in past experience with similar issues
> so that we can pursue testing based on most likely causes.
> 
Electrical noise.
Insufficient filtering or decoupling.
Noisy peripherals (solenoids or motors).
Reply NeilKurzm (321) 7/25/2007 6:46:17 AM

On Tue, 24 Jul 2007 08:54:30 -0700, ginger.zinkowski wrote:

> I have been working on a problem with an installed product for
> approximately a year now.  After much investigation including careful
> review of the code and repeat of ANSI hardware testing, we have been
> unable to recreate the problem in house.  However, through analysis of
> the symptoms we have come to believe that something is causing the
> instruction pointer in this embedded application to be pointed to the
> wrong code address.
> 
> My question is what external events can affect a microprocessor in
> such a way that it essentially gets "lost" in execution?  We are
> reasonably certain that an external event is the cause, rather than a
> stack problem, as the majority of the installed product are working
> fine and have been for over a year.
> 
> The microprocessor we are using does not have an illegal instruction
> trap or watchdog timer, so in order to fix the problem, we would
> likely need hardware modifications.  I would like any information that
> any of you might have gleaned in past experience with similar issues
> so that we can pursue testing based on most likely causes.

Some of these are repeats of what others have said:

* Voltage spikes.  Is the power supply or some inputs to the processor
  exceeding the allowable input limits?

* Brown-outs.  Is the power supply dipping below the recommended 
  minimum voltage?

* Interrupt frequency.  Is some external process causing an interrupt
  to be hammered at a much higher frequency than you anticipated?  This
  can cause problems by disturbing the timing of other code, or by
  using a higher than anticipated amount of memory on the stack (or
  stacks, if you use a kernel that puts interrupt responses on the task
  stacks).

* Noisy communication.  Is the equipment that your thing is connected
  to sending invalid comms data, or is your comms data getting
  otherwise corrupted?  Bad comms data in conjunction with fragile
  parsing could lead to stack overflows, memory leaks, or other primary
  faults that then result in branches to East Fishkill.
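
As a contrast to fragile parsing, here is a minimal sketch of a receiver that checks every length before it copies anything. The frame layout (a length byte, the payload, then an 8-bit additive checksum), the `MAX_PAYLOAD` limit, and the `parse_frame` function are assumptions made up for illustration and do not describe any particular protocol.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAYLOAD 16u   /* hypothetical limit for this sketch */

typedef struct {
    uint8_t payload[MAX_PAYLOAD];
    uint8_t len;
} frame_t;

/* Assumed frame layout: [len][payload bytes...][sum], where sum is the
 * 8-bit sum of the length byte and all payload bytes.
 * Returns true and fills *out only when every check passes. */
bool parse_frame(const uint8_t *buf, size_t buf_len, frame_t *out)
{
    if (buf_len < 2u) {
        return false;                   /* too short to hold len and sum */
    }

    uint8_t len = buf[0];
    if (len > MAX_PAYLOAD) {
        return false;                   /* would overflow out->payload */
    }
    if ((size_t)len + 2u != buf_len) {
        return false;                   /* claimed length disagrees with data */
    }

    uint8_t sum = len;
    for (uint8_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + buf[1u + i]);
    }
    if (sum != buf[1u + len]) {
        return false;                   /* corrupted in transit */
    }

    for (uint8_t i = 0; i < len; i++) {
        out->payload[i] = buf[1u + i];
    }
    out->len = len;
    return true;
}
```

A simple additive checksum misses many error patterns. It is used here only to keep the sketch short.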

In summary:  Look for strange electrical events that take the pins of the
processor out of their safe operating range, or for unusual environmental
effects that may be lighting up software bugs that you never tested for.

If you can, you should make some software that's instrumented for things
like heap usage (if you use a heap), buffer usage for all your comms,
stack usage, etc., and that either logs events (carefully -- event logging
can cause problems on its own) or that saves the state of the machine for
later analysis.  Then try to use these results to further your
investigations.
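
One common way to instrument stack usage is to paint the stack region with a known byte at start-up and later count how much of the paint survives. A minimal sketch follows, assuming a stack that grows downward from the top of the region toward `base`. The function names and the 0xA5 fill value are hypothetical, and on real hardware the region bounds would come from the linker script.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5u   /* arbitrary paint value for this sketch */

/* Paint the whole stack region. Call once at start-up, before the region
 * is in use as a stack. */
void stack_paint(uint8_t *base, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        base[i] = STACK_FILL;
    }
}

/* Return the deepest usage seen so far, in bytes.
 * Assumes the stack grows downward, so untouched paint sits at the low
 * addresses starting at base. The result can under-report if the program
 * happened to store the fill value at the deepest point it reached. */
size_t stack_high_water(const uint8_t *base, size_t size)
{
    size_t untouched = 0;

    while (untouched < size && base[untouched] == STACK_FILL) {
        untouched++;
    }
    return size - untouched;
}
```

A high-water mark that equals the region size in a returned unit would be strong evidence for the stack overrun that several replies ask about.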

-- 
Tim Wescott
Control systems and communications consulting
http://www.wescottdesign.com

Reply tim177 (4404) 7/25/2007 2:45:39 PM

David R Brooks wrote:

> Intel issued an app-note many years ago (for the 8048, no less!), 
> "Designing High-Reliability Software for Automotive Applications", or 
> something like that. It assumed that someone would be careless with a 
> hot sparkplug lead some day, & the CPU would make a random jump to *any* 
> accessible location. The idea was to be able to recover from that.
> They programmed in assembler, not C, which allowed some cunning tricks. 
> For instance, share out the unused ROM space, so that there is a dead zone 
> after every unconditional jump/return. Fill such space with jumps to 
> recovery code. There was much more in that vein. (Sorry, I don't have a 
> copy to hand.)

Are you perhaps referring to the following document?
Designing Microcontroller Systems for Electrically Noisy Environments
http://www.intel.com/design/auto/mcs96/applnots/210313.htm

Which itself refers to the following document:
Yarkoni, B. and Wharton, J.
Designing Reliable Software for Automotive Applications
SAE Transactions, 790237, July 1979

cf. also http://www.intel.com/design/auto/docs_auto.htm
Reply Spoon 7/25/2007 4:26:57 PM


Spoon wrote:

> David R Brooks wrote:
>
> > Intel issued an app-note many years ago (for the 8048, no less!),
> > "Designing High-Reliability Software for Automotive Applications", or
> > something like that.
>
> Are you perhaps referring to the following document?
> Designing Microcontroller Systems for Electrically Noisy Environments
> http://www.intel.com/design/auto/mcs96/applnots/210313.htm
>
> Which itself refers to the following document:
> Yarkoni, B. and Wharton, J. Designing Reliable Software for Automotive Applications
> SAE Transactions, 790237, July 1979
>
> cf. also http://www.intel.com/design/auto/docs_auto.htm

see also

http://www.dbicorporation.com/esd-anno.htm

w..



Reply walter20 (872) 7/25/2007 5:35:34 PM

Spoon wrote:
> David R Brooks wrote:
> 
>> Intel issued an app-note many years ago (for the 8048, no less!), 
>> "Designing High-Reliability Software for Automotive Applications", or 
>> something like that. It assumed that someone would be careless with a 
>> hot sparkplug lead some day, & the CPU would make a random jump to 
>> *any* accessible location. The idea was to be able to recover from that.
>> They programmed in assembler, not C, which allowed some cunning 
>> tricks. For instance, share out the unused ROM space, so that there is a 
>> dead zone after every unconditional jump/return. Fill such space with 
>> jumps to recovery code. There was much more in that vein. (Sorry, I 
>> don't have a copy to hand.)
> 
> Are you perhaps referring to the following document?
> Designing Microcontroller Systems for Electrically Noisy Environments
> http://www.intel.com/design/auto/mcs96/applnots/210313.htm
> 
> Which itself refers to the following document:
> Yarkoni, B. and Wharton, J.
> Designing Reliable Software for Automotive Applications
> SAE Transactions, 790237, July 1979
> 
> cf. also http://www.intel.com/design/auto/docs_auto.htm

Yes indeed :-)
Yarkoni & Wharton is the one I was thinking of.
Reply davebXXX (145) 7/25/2007 10:15:26 PM
