Hi all,
I am debugging an MPI program and got the error below. The weird thing is
that the code where the error is reported just calls a constructor to
create a new object, and it worked fine until the program had been
running for some time.
Does anyone have any idea why "p2_12374: p4_error: interrupt SIGBUS:
10" occurred? Is there a method that might help me find where the real
bug is? Many thanks.
Patricia
Posted by lanchenn (28), 8/10/2005 10:46:54 AM
Patricia wrote:
> Hi all,
>
> I am debugging an MPI program and got the error below. The weird thing is
> that the code where the error is reported just calls a constructor to
> create a new object, and it worked fine until the program had been
> running for some time.
>
> Does anyone have any idea why "p2_12374: p4_error: interrupt SIGBUS:
> 10" occurred? Is there a method that might help me find where the real
> bug is? Many thanks.
>
> Patricia
SIGBUS errors usually indicate that an invalid memory address was
dereferenced. (SIGBUS is like a SIGSEGV: a SIGSEGV typically means the
address is not mapped into your process's address space or is protected
against that kind of access, while a SIGBUS typically means the address
could not be accessed for some other reason, most often because it is
misaligned for the type being read or written.)
http://en.wikipedia.org/wiki/SIGBUS
Possibly your object's constructor, or something it calls, is
dereferencing a corrupted or uninitialized pointer, causing the bus
error. (A failed allocation followed by a dereference of NULL would more
typically show up as a SIGSEGV.)
The p4 error handler obviously lies within the MPI library code, but the
error might still be occurring in your code. It's being caught by the
only SIGBUS error handler that's available: it's within MPI and was
registered when you called MPI_INIT.
Randy
--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu
"If English was good enough for Jesus Christ, it ought to be good enough
for the children of Texas." -- Texas Governor Ma Ferguson (1924)
Posted by Randy, 8/10/2005 8:47:17 PM
Hi,
Thanks for your reply. I changed the code, and the SIGBUS: 10 error
changed to SIGSEGV: 11. A core file was produced. dbx reads the core as
shown below:
0x7f741f10: _malloc_unlocked+0x021c: ld [%o0 + 8], %o1
Current function is std::allocator<std::pair<const int,ProcessState> >::allocate
389 void * tmp = _RWSTD_STATIC_CAST(void*,(::operator new(_RWSTD_STATIC_CAST(size_t,(n)))));
The stack trace is:
[1] _malloc_unlocked(0x7f7c2858, 0x1bbfa8, 0x7f7bc008, 0x2008, 0x1bc040, 0x0), at 0x7f741f10
[2] malloc(0x2008, 0x0, 0x7fbdc0c0, 0x7fbdc790, 0x7fb40b60, 0x7fbded84), at 0x7f741cd8
[3] _filbuf(0x142238, 0x1, 0x7f7bc008, 0x0, 0x2000, 0x1), at 0x7f78edf8
[4] _doprnt(0x0, 0xffbedf68, 0x142238, 0x7f7bc008, 0x40, 0x11cf6c), at 0x7f784afc
[5] printf(0x11cf6c, 0x142228, 0x7f7c3a54, 0x0, 0x2, 0x0), at 0x7f788154
[6] p4_error(0x139048, 0xb, 0xb, 0x0, 0x0, 0x0), at 0xfe7d0
[7] sig_err_handler(0xb, 0x0, 0xffbee0e0, 0x7f7bc008, 0x0, 0x0), at 0xfea14
[8] _setpgid(0xb, 0x0, 0xffbee0e0, 0x7f7c284c, 0x7f7c2848, 0x0), at 0x7f79ebc4
[9] _malloc_unlocked(0x7f7c2858, 0x1bbfa8, 0x7f7bc008, 0xb8, 0x1bc040, 0x0), at 0x7f741ea0
[10] malloc(0xb4, 0xffbee8c3, 0x0, 0xffbee8c2, 0x2234c, 0x7f741ce4), at 0x7f741cd8
[11] operator new(0xb4, 0xffbee8c3, 0x13740, 0xffbeee38, 0x7fa4a8d4, 0xb4), at 0x7fa371b8
=>[12] std::allocator<std::pair<const int,ProcessState> >::allocate(this = 0xffbee5ff, n = 180U, _ARG3 = (nil)), line 389 in "memory"
[13] std::allocator_interface<std::allocator<std::pair<const int,ProcessState> >,__rwstd::__rb_tree<int,std::pair<const int,ProcessState>,__rwstd::__select1st<std::pair<const int,ProcessState>,int>,std::less<int>,std::allocator<std::pair<const int,ProcessState> > >::__rb_tree_node>::allocate(this = 0xffbee5ff, n = 1U, p = (nil)), line 488 in "memory"
[14] __rwstd::__rb_tree<int,std::pair<const int,ProcessState>,__rwstd::__select1st<std::pair<const int,ProcessState>,int>,std::less<int>,std::allocator<std::pair<const int,ProcessState> > >::__add_new_buffer(this = 0x1b9c28), line 167 in "tree"
[15] __rwstd::__rb_tree<int,std::pair<const int,ProcessState>,__rwstd::__select1st<std::pair<const int,ProcessState>,int>,std::less<int>,std::allocator<std::pair<const int,ProcessState> > >::__get_link(this = 0x1b9c28), line 189 in "tree"
[16] __rwstd::__rb_tree<int,std::pair<const int,ProcessState>,__rwstd::__select1st<std::pair<const int,ProcessState>,int>,std::less<int>,std::allocator<std::pair<const int,ProcessState> > >::__get_node(this = 0x1b9c28), line 223 in "tree"
[17] __rwstd::__rb_tree<int,std::pair<const int,ProcessState>,__rwstd::__select1st<std::pair<const int,ProcessState>,int>,std::less<int>,std::allocator<std::pair<const int,ProcessState> > >::init(this = 0x1b9c28), line 483 in "tree"
[18] __rwstd::__rb_tree<int,std::pair<const int,ProcessState>,__rwstd::__select1st<std::pair<const int,ProcessState>,int>,std::less<int>,std::allocator<std::pair<const int,ProcessState> > >::__rb_tree(this = 0x1b9c28, _RWSTD_COMP = STRUCT, always = false, alloc = CLASS), line 499 in "tree"
[19] std::map<int,ProcessState,std::less<int>,std::allocator<std::pair<const int,ProcessState> > >::map(this = 0x1b9c28, comp = STRUCT, alloc = CLASS), line 148 in "map"
[20] ContContrlState::ContContrlState(this = 0x1b9c00), line 18 in "ContContrlState.hh"
[21] ContContrlObject::allocateState(this = 0x1b3c20), line 804 in "ContContrlObject.cc"
[22] StateManager::saveState(this = 0x1b3e78), line 126 in "StateManager.cc"
[23] TimeWarp::saveState(this = 0x1b3c20), line 429 in "TimeWarp.cc"
[24] TimeWarp::executeSimulation(this = 0x1b3c20), line 325 in "TimeWarp.cc"
[25] LTSFScheduler::runProcesses(this = 0xffbef270), line 50 in "LTSFScheduler.cc"
[26] LogicalProcess::simulate(this = 0xffbeee90, _ARG2 = 2147483647), line 885 in "LogicalProcess.cc"
[27] main(argc = 1, argv = 0x16a420), line 299 in "main.cc"
Could you give me some hints about this? What is _malloc_unlocked?
Patricia
Posted by Patricia, 9/2/2005 2:25:37 PM
Patricia wrote:
> Hi,
>
> Thanks for your reply. I changed the code, and the SIGBUS: 10 error
> changed to SIGSEGV: 11. A core file was produced. dbx reads the core as
> shown below:
>
> 0x7f741f10: _malloc_unlocked+0x021c: ld [%o0 + 8], %o1
> Current function is std::allocator<std::pair<const int,ProcessState> >::allocate
> 389 void * tmp = _RWSTD_STATIC_CAST(void*,(::operator new(_RWSTD_STATIC_CAST(size_t,(n)))));
>
> The stack trace is:
[...]
> [5] printf(0x11cf6c, 0x142228, 0x7f7c3a54, 0x0, 0x2, 0x0), at 0x7f788154
> [6] p4_error(0x139048, 0xb, 0xb, 0x0, 0x0, 0x0), at 0xfe7d0
> [7] sig_err_handler(0xb, 0x0, 0xffbee0e0, 0x7f7bc008, 0x0, 0x0), at 0xfea14
> [8] _setpgid(0xb, 0x0, 0xffbee0e0, 0x7f7c284c, 0x7f7c2848, 0x0), at 0x7f79ebc4
> [9] _malloc_unlocked(0x7f7c2858, 0x1bbfa8, 0x7f7bc008, 0xb8, 0x1bc040, 0x0), at 0x7f741ea0
> [10] malloc(0xb4, 0xffbee8c3, 0x0, 0xffbee8c2, 0x2234c, 0x7f741ce4), at 0x7f741cd8
[...]
This has nothing to do with MPI; your own code is causing a segmentation
fault, which is being trapped by a signal handler installed by MPI. That is
rather rude of the MPI library, I think, even more so since it _appears_ to
be calling functions that are not async-signal-safe (such as printf):
http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html#tag_02_04_03
But the stack trace looks pretty confusing to me - you might be better off
reverting to the standard SIGSEGV handler to get rid of the p4_error junk
in the stack trace. To do this, add
#include <signal.h>

/* Restore the default SIGSEGV action, replacing the handler installed by MPI. */
struct sigaction Action;
sigemptyset(&Action.sa_mask);
Action.sa_handler = SIG_DFL;
Action.sa_flags = 0;
sigaction(SIGSEGV, &Action, NULL);
somewhere after MPI_Init(). That *might* help the debugging. Another thing
that might help is using a debugging version of malloc, since it looks
like heap corruption is causing a fault inside malloc (although I don't
understand why setpgid() is in there too).
Since this has nothing to do with MPI (well, except for the b0rken SEGV
handler), it would be better to follow up in a group relating to your
compiler and platform instead.
HTH,
Ian McCulloch
Posted by Ian, 9/2/2005 5:40:43 PM
Patricia,
From the little bit of web searching I've done, it sounds like your
heap's integrity has been corrupted. Probably a write to a dynamically
created object exceeded the memory space that was allocated for it. A
subsequent call to "new" then encounters invalid metadata within the
heap, and a corrupted pointer is dereferenced: one that is misaligned
typically raises SIGBUS, and one that points to unmapped memory (such as
address zero) raises SIGSEGV.
Be sure to check the memory footprint of your object and its constituent
objects. I assume "rb_tree" is a red-black tree. One possibility is that
the data held within the RB tree's nodes is actually larger than what
was requested from new, so that not enough memory is allocated.
Subsequent writes into the tree would then exceed the space available
for each datum and corrupt the heap.
Of course, the heap corruption could have arisen from any other object's
earlier misuse of the heap. It needn't be within your rb_tree code.
That's just where the heap corruption was first encountered.
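To make the overrun scenario concrete, here is a minimal sketch of the guard-byte technique that debugging allocators commonly use. It is not any particular library's implementation; guarded_malloc, guarded_check, guarded_free, GUARD_BYTES and GUARD_FILL are all hypothetical names. Each block gets a known byte pattern after the user area, and a check reports whether something wrote past the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTES 16    /* size of the trailing guard region */
#define GUARD_FILL  0xA5  /* pattern written into the guard region */

/* Header placed in front of each block. The union keeps the user
 * pointer aligned for any type (max_align_t requires C11). */
typedef union {
    size_t size;          /* user-requested size */
    max_align_t align;
} guard_header;

/* Allocate size bytes followed by a guard region. NULL on failure. */
void *guarded_malloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(guard_header) - GUARD_BYTES)
        return NULL;
    unsigned char *raw = malloc(sizeof(guard_header) + size + GUARD_BYTES);
    if (raw == NULL)
        return NULL;
    ((guard_header *)raw)->size = size;
    unsigned char *user = raw + sizeof(guard_header);
    memset(user + size, GUARD_FILL, GUARD_BYTES);
    return user;
}

/* Return 1 if the guard region is intact, 0 if it was overwritten. */
int guarded_check(const void *ptr)
{
    const unsigned char *user = ptr;
    const guard_header *h =
        (const guard_header *)(user - sizeof(guard_header));
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        if (user[h->size + i] != GUARD_FILL)
            return 0;
    }
    return 1;
}

/* Release a block obtained from guarded_malloc. */
void guarded_free(void *ptr)
{
    if (ptr != NULL)
        free((unsigned char *)ptr - sizeof(guard_header));
}
```

Calling guarded_check on suspect objects at a few points in the program narrows down when the overrun happens, which is usually well before the allocator trips over the damage. A sketch like this only catches small writes past the end of a block; a real debugging malloc or a memory checker covers far more cases.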
That's my best guess anyway. Not sure if it helps.
_malloc_unlocked() appears to be the core function within malloc() that
actually traverses the heap and returns the allocated memory:
http://cvs.opensolaris.org/source/xref/usr/src/lib/libc/port/gen/malloc.c
void *
malloc(size_t size)
{
    void *ret;

    if (!primary_link_map) {
        errno = ENOTSUP;
        return (NULL);
    }
    assert_no_libc_locks_held();
    lmutex_lock(&libc_malloc_lock);
    ret = _malloc_unlocked(size);
    lmutex_unlock(&libc_malloc_lock);
    return (ret);
}
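The same "locked wrapper around an unlocked worker" pattern can be sketched with ordinary POSIX threads. This is only an illustration of the naming convention, with a trivial counter standing in for the heap; counter_add and counter_add_unlocked are hypothetical names, not libc functions.

```c
#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Worker: does the real work and assumes the caller holds counter_lock. */
static long counter_add_unlocked(long delta)
{
    counter += delta;
    return counter;
}

/* Public entry point: takes the lock, calls the worker, releases the lock. */
long counter_add(long delta)
{
    long ret;

    pthread_mutex_lock(&counter_lock);
    ret = counter_add_unlocked(delta);
    pthread_mutex_unlock(&counter_lock);
    return ret;
}
```

This is why the trace shows malloc calling _malloc_unlocked: the fault occurs in the worker that walks the heap's data structures, while malloc itself only handles the locking.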
Randy
--
Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu
"Overstatement sucks." -- William of Ockham
Posted by Randy, 9/2/2005 7:27:52 PM