Hello,
unfortunately our storage colleagues strongly advised against changing the
configuration.
Is there any other way to accomplish what you are asking (i.e. get the
"image" of the file system)?
Thanks,
George
________________________________________
From: Betzos Giorgos
Sent: Friday, 23 September 2011 4:25 pm
To: Sunil Mushran
CC: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] Linux kernel crash due to ocfs2
Currently, we don't have an x86_64 system that accesses the disk.
I'll talk to our storage and DB colleagues to check if the disk can be
shared with an x86_64 cluster.
But I can't see why the x86_64 code would run any differently from the ppc
code.
Endianness? (Intel systems are little-endian, while Power systems are big-endian,
I believe)
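On the endianness question: as far as I know, OCFS2 keeps its on-disk metadata little-endian, so a big-endian Power host has to byte-swap every multi-byte field it reads, while an x86_64 host can use the bytes as they are. That extra conversion path is one place where the two builds could behave differently. A minimal C sketch of the idea follows; host_is_little_endian() and disk_le32_to_host() are hypothetical helper names made up for illustration and are not taken from ocfs2-tools.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 on a little-endian host (e.g. x86_64) and 0 on a
 * big-endian host (e.g. Power running in big-endian mode). */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    /* On a little-endian host the low-order byte is stored first. */
    return *(const unsigned char *)&probe == 1;
}

/* Hypothetical helper: decode a 32-bit little-endian on-disk field
 * one byte at a time, so the result is the same on any host. */
static uint32_t disk_le32_to_host(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Decoding the bytes 78 56 34 12 with disk_le32_to_host() gives 0x12345678 on both architectures, because the helper never depends on the host's own byte order.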
By the way, I also ran o2image on another, small ocfs2 disk and it failed again,
but in a different piece of code.
Thanks,
George
________________________________________
From: Sunil Mushran [sunil.mushran at oracle.com]
Sent: Friday, 23 September 2011 1:55 ??
To: Betzos Giorgos
CC: ocfs2-users at oss.oracle.com
Subject: Re: RE: [Ocfs2-users] Linux kernel crash due to ocfs2
Thanks. Yes, I recently fixed that.
Thanks to your analysis I can see the location, but I cannot see
any specific problem that could cause this. This calls for valgrind, etc.
But we have been seriously sidetracked. The intent is to get the image
to analyse another problem. So it would be easier if we could generate
the image on a different machine. Do you have an x86/64 box that can
access the same array?
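For context on why the abort shows up inside free(): glibc only checks its heap bookkeeping when a chunk is allocated or released, so a write that runs past the end of a buffer can go unnoticed until a later free() finds the damaged list pointers. Tools such as valgrind catch the bad write itself. The sketch below shows the same idea in miniature, with guard bytes placed after a buffer. guarded_alloc(), guard_intact(), the 16-byte guard and the 0xA5 pattern are all assumptions for illustration, not o2image code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  16    /* assumed guard size, chosen for illustration */
#define GUARD_BYTE 0xA5  /* assumed sentinel pattern */

/* Allocate `size` usable bytes followed by GUARD_LEN sentinel bytes. */
static void *guarded_alloc(size_t size)
{
    unsigned char *p = malloc(size + GUARD_LEN);
    if (p == NULL)
        return NULL;
    memset(p, 0, size);
    memset(p + size, GUARD_BYTE, GUARD_LEN);
    return p;
}

/* Return 1 if the sentinel bytes after the first `size` bytes are
 * untouched, 0 if something wrote past the usable area. */
static int guard_intact(const void *ptr, size_t size)
{
    const unsigned char *g = (const unsigned char *)ptr + size;
    for (size_t i = 0; i < GUARD_LEN; i++) {
        if (g[i] != GUARD_BYTE)
            return 0;
    }
    return 1;
}
```

Calling guard_intact() right before each free() of a block buffer would report an overrun at the buffer that was overrun, instead of at whichever later free() happens to trip over the damaged heap.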
On 09/22/2011 02:33 AM, Betzos Giorgos wrote:
> After my previous post I found some time to download the source and build it.
> Apparently the ocfs2-tools debuginfo package is missing the o2image source code.
>
> This is what I got from the debugger:
>
> run /dev/mapper/mpath0 /files_shared/u02.o2image
> Starting program: /root/o2image.files/ocfs2-tools/o2image/o2image /dev/mapper/mpath0 /files_shared/u02.o2image
> [Thread debugging using libthread_db enabled]
> *** glibc detected *** /root/o2image.files/ocfs2-tools/o2image/o2image: corrupted double-linked list: 0x10075000 ***
> ======= Backtrace: =========
> /lib/libc.so.6[0xfeb1ab4]
> /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x1000d1e8]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x10002acc]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x10002008]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x100023dc]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x10002954]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x10002008]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x100023dc]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x10002954]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x10003d0c]
> /root/o2image.files/ocfs2-tools/o2image/o2image[0x100045d0]
> /lib/libc.so.6[0xfe4dc60]
> /lib/libc.so.6[0xfe4dea0]
> ======= Memory map: ========
> 00100000-00120000 r-xp 00100000 00:00 0 [vdso]
> 0f430000-0f440000 r-xp 00000000 08:13 180307 /lib/libcom_err.so.2.1
> 0f440000-0f450000 rw-p 00000000 08:13 180307 /lib/libcom_err.so.2.1
> 0f900000-0f9c0000 r-xp 00000000 08:13 180293 /lib/libglib-2.0.so.0.1200.3
> 0f9c0000-0f9d0000 rw-p 000b0000 08:13 180293 /lib/libglib-2.0.so.0.1200.3
> 0fa40000-0fa50000 r-xp 00000000 08:13 180292 /lib/librt-2.5.so
> 0fa50000-0fa60000 r--p 00000000 08:13 180292 /lib/librt-2.5.so
> 0fa60000-0fa70000 rw-p 00010000 08:13 180292 /lib/librt-2.5.so
> 0fce0000-0fd00000 r-xp 00000000 08:13 180291 /lib/libpthread-2.5.so
> 0fd00000-0fd10000 r--p 00010000 08:13 180291 /lib/libpthread-2.5.so
> 0fd10000-0fd20000 rw-p 00020000 08:13 180291 /lib/libpthread-2.5.so
> 0fe30000-0ffa0000 r-xp 00000000 08:13 180288 /lib/libc-2.5.so
> 0ffa0000-0ffb0000 r--p 00160000 08:13 180288 /lib/libc-2.5.so
> 0ffb0000-0ffc0000 rw-p 00170000 08:13 180288 /lib/libc-2.5.so
> 0ffc0000-0ffe0000 r-xp 00000000 08:13 180287 /lib/ld-2.5.so
> 0ffe0000-0fff0000 r--p 00010000 08:13 180287 /lib/ld-2.5.so
> 0fff0000-10000000 rw-p 00020000 08:13 180287 /lib/ld-2.5.so
> 10000000-10050000 r-xp 00000000 08:13 7537720 /root/o2image.files/ocfs2-tools/o2image/o2image
> 10050000-10060000 rw-p 00040000 08:13 7537720 /root/o2image.files/ocfs2-tools/o2image/o2image
> 10060000-10090000 rwxp 10060000 00:00 0 [heap]
> f7680000-f7ff0000 rw-p f7680000 00:00 0
> ffa40000-ffb90000 rw-p ffa40000 00:00 0 [stack]
> [New Thread 268379248 (LWP 31433)]
>
> Program received signal SIGABRT, Aborted.
> [Switching to Thread 268379248 (LWP 31433)]
> 0x0fe66110 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> 64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) where
> #0 0x0fe66110 in *__GI_raise (sig=6)
> at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1 0x0fe67e14 in *__GI_abort () at abort.c:88
> #2 0x0fea77f4 in __libc_message (do_abort=2,
> fmt=0xff8714c "*** glibc detected *** %s: %s: 0x%s ***\n")
> at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
> #3 0x0feb1ab4 in _int_free (av=0xffb14a8, mem=<value optimized out>)
> at malloc.c:5768
> #4 0x0feb5b68 in *__GI___libc_free (mem=0x10076000) at malloc.c:3545
> #5 0x1000d1e8 in ocfs2_free (ptr=0xffb8f4d0) at memory.c:65
> #6 0x10002acc in traverse_inode (ofs=0x10060018, inode=25190403)
> at o2image.c:332
> #7 0x10002008 in traverse_group_desc (ofs=0x10060018, grp=0x10074000,
> dump_type=2, bpc=1) at o2image.c:76
> #8 0x100023dc in traverse_chains (ofs=0x10060018, cl=0x100720c0, dump_type=2)
> at o2image.c:159
> #9 0x10002954 in traverse_inode (ofs=0x10060018, inode=17) at o2image.c:295
> #10 0x10002008 in traverse_group_desc (ofs=0x10060018, grp=0x10070000,
> dump_type=2, bpc=1) at o2image.c:76
> #11 0x100023dc in traverse_chains (ofs=0x10060018, cl=0x1006e0c0, dump_type=2)
> at o2image.c:159
> #12 0x10002954 in traverse_inode (ofs=0x10060018, inode=8) at o2image.c:295
> #13 0x10003d0c in scan_raw_disk (ofs=0x10060018) at o2image.c:640
> #14 0x100045d0 in main (argc=3, argv=0xffb8fa14) at o2image.c:785
> (gdb) thread apply all bt full
>
> Thread 1 (Thread 268379248 (LWP 31433)):
> #0 0x0fe66110 in *__GI_raise (sig=6)
> at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> r4 =<value optimized out>
> r7 =<value optimized out>
> r12 =<value optimized out>
> r5 = 6
> r8 =<value optimized out>
> r10 =<value optimized out>
> r0 = 250
> r3 =<value optimized out>
> r6 =<value optimized out>
> r9 =<value optimized out>
> r11 =<value optimized out>
> sc_ret =<value optimized out>
> pd = (struct pthread *) 0x0
> pid = 0
> selftid = 31433
> #1 0x0fe67e14 in *__GI_abort () at abort.c:88
> self = (void *) 0xffb8ebf8
> act = {__sigaction_handler = {sa_handler = 0xfff13c8,
> sa_sigaction = 0xfff13c8}, sa_mask = {__val = {266534912, 0, 0, 268374984,
> 14, 267945972, 3, 4290309225, 7, 267945976, 2, 4290309238, 2, 267947588,
> 1, 267945972, 3, 4290309225, 7, 267945976, 2, 4, 1685382481, 0, 0, 0, 0,
> 0, 0, 0, 0, 0}}, sa_flags = 2, sa_restorer = 0x8}
> sigs = {__val = {32, 0<repeats 31 times>}}
> #2 0x0fea77f4 in __libc_message (do_abort=2,
> fmt=0xff8714c "*** glibc detected *** %s: %s: 0x%s ***\n")
> at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
> ap = {{gpr = 5 '\005', fpr = 0 '\0', reserved = 0,
> overflow_arg_area = 0xffb8f408, reg_save_area = 0xffb8f370}}
> ap_copy = {{gpr = 2 '\002', fpr = 0 '\0', reserved = 0,
> overflow_arg_area = 0xffb8f408, reg_save_area = 0xffb8f370}}
> fd = 8
> list = (struct str_list *) 0xffb8ed80
> nlist =<value optimized out>
> cp =<value optimized out>
> written =<value optimized out>
> #3 0x0feb1ab4 in _int_free (av=0xffb14a8, mem=<value optimized out>)
> at malloc.c:5768
> buf = "10075000"
> cp =<value optimized out>
> p = (mchunkptr) 0x10075000
> size = 8192
> nextchunk = (mchunkptr) 0x10077000
> nextsize = 8192
> prevsize =<value optimized out>
> bck =<value optimized out>
> fwd =<value optimized out>
> errstr =<value optimized out>
> #4 0x0feb5b68 in *__GI___libc_free (mem=0x10076000) at malloc.c:3545
> __futex =<value optimized out>
> ar_ptr = (mstate) 0xffb14a8
> p =<value optimized out>
> hook =<value optimized out>
> #5 0x1000d1e8 in ocfs2_free (ptr=0xffb8f4d0) at memory.c:65
> pp = (void **) 0xffb8f4d0
> #6 0x10002acc in traverse_inode (ofs=0x10060018, inode=25190403)
> at o2image.c:332
> super = (struct ocfs2_super_block *) 0x100610c0
> ost = (struct ocfs2_image_state *) 0x10060160
> di = (struct ocfs2_dinode *) 0x10076000
> ret = 0
> dump_type = 0
> buf = 0x10076000 "INODE01"
> slot = 2
> #7 0x10002008 in traverse_group_desc (ofs=0x10060018, grp=0x10074000,
> dump_type=2, bpc=1) at o2image.c:76
> ret = 0
> blkno = 25190403
> i = 2
> #8 0x100023dc in traverse_chains (ofs=0x10060018, cl=0x100720c0, dump_type=2)
> at o2image.c:159
> grp = (struct ocfs2_group_desc *) 0x10074000
> rec = (struct ocfs2_chain_rec *) 0x100720d0
> ret = 0
> buf = 0x10074000 "GROUP01"
> blkno = 25190401
> i = 0
> #9 0x10002954 in traverse_inode (ofs=0x10060018, inode=17) at o2image.c:295
> super = (struct ocfs2_super_block *) 0x100610c0
> ost = (struct ocfs2_image_state *) 0x10060160
> di = (struct ocfs2_dinode *) 0x10072000
> ret = 0
> dump_type = 2
> buf = 0x10072000 "INODE01"
> slot = 2
> #10 0x10002008 in traverse_group_desc (ofs=0x10060018, grp=0x10070000,
> dump_type=2, bpc=1) at o2image.c:76
> ret = 0
> blkno = 17
> i = 13
> #11 0x100023dc in traverse_chains (ofs=0x10060018, cl=0x1006e0c0, dump_type=2)
> at o2image.c:159
> grp = (struct ocfs2_group_desc *) 0x10070000
> rec = (struct ocfs2_chain_rec *) 0x1006e0d0
> ret = 0
> buf = 0x10070000 "GROUP01"
> blkno = 4
> i = 0
> #12 0x10002954 in traverse_inode (ofs=0x10060018, inode=8) at o2image.c:295
> super = (struct ocfs2_super_block *) 0x100610c0
> ost = (struct ocfs2_image_state *) 0x10060160
> di = (struct ocfs2_dinode *) 0x1006e000
> ret = 0
> dump_type = 2
> buf = 0x1006e000 "INODE01"
> slot = 2
> #13 0x10003d0c in scan_raw_disk (ofs=0x10060018) at o2image.c:640
> ost = (struct ocfs2_image_state *) 0x10060160
> bits_set = 0
> ret = 0
> i = 262144
> j = 0
> #14 0x100045d0 in main (argc=3, argv=0xffb8fa14) at o2image.c:785
> ofs = (ocfs2_filesys *) 0x10060018
> ret = 0
> src_file = 0xffb8fb69 "/dev/mapper/mpath0"
> dest_file = 0xffb8fb7c "/files_shared/u02.o2image"
> open_flags = 0
> raw_flag = 0
> install_flag = 0
> interactive = 0
> fd = 1
> c = -1
>
> Probably this is what you wanted. If you need anything else, please let me know.
>
> Thanks,
>
> George
>
> ________________________________________
> From: Betzos Giorgos
> Sent: Monday, 19 September 2011 2:33 ??
> To: sunil.mushran at oracle.com
> CC: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] Linux kernel crash due to ocfs2
>
> Maybe you are missing the debuginfo packages. I installed the kernel,
> glibc, ocfs2 and ocfs-tools packages and I got the following (never mind
> the new core file):
>
> # gdb /sbin/o2image core.19917
> GNU gdb Red Hat Linux (6.5-37.el5rh)
> Copyright (C) 2006 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB. Type "show warranty" for details.
> This GDB was configured as "ppc64-redhat-linux-gnu"...Using host libthread_db library "/lib64/libthread_db.so.1".
>
>
> warning: Can't read pathname for load map: Input/output error.
> Reading symbols from /lib/libglib-2.0.so.0...done.
> Loaded symbols for /lib/libglib-2.0.so.0
> Reading symbols from /lib/libcom_err.so.2...done.
> Loaded symbols for /lib/libcom_err.so.2
> Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.5.so.debug...done.
> done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib/librt.so.1...Reading symbols from /usr/lib/debug/lib/librt-2.5.so.debug...done.
> done.
> Loaded symbols for /lib/librt.so.1
> Reading symbols from /lib/ld.so.1...Reading symbols from /usr/lib/debug/lib/ld-2.5.so.debug...done.
> done.
> Loaded symbols for /lib/ld.so.1
> Reading symbols from /lib/libpthread.so.0...Reading symbols from /usr/lib/debug/lib/libpthread-2.5.so.debug...done.
> done.
> Loaded symbols for /lib/libpthread.so.0
> Core was generated by `o2image /dev/mapper/mpath0 /files_shared/u02.o2image'.
> Program terminated with signal 6, Aborted.
> #0 0x0fe66110 in *__GI_raise (sig=6)
> at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> 64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) thread apply all bt full
>
> Thread 1 (process 19917):
> #0 0x0fe66110 in *__GI_raise (sig=6)
> at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> r4 =<value optimized out>
> r7 =<value optimized out>
> r12 =<value optimized out>
> r5 = 6
> r8 =<value optimized out>
> r10 =<value optimized out>
> r0 = 250
> r3 =<value optimized out>
> r6 =<value optimized out>
> r9 =<value optimized out>
> r11 =<value optimized out>
> sc_ret =<value optimized out>
> pd = (struct pthread *) 0x0
> pid = 0
> selftid = 19917
> #1 0x0fe67e14 in *__GI_abort () at abort.c:88
> self = (void *) 0xffaaec68
> act = {__sigaction_handler = {sa_handler = 0xfff13c8,
> sa_sigaction = 0xfff13c8}, sa_mask = {__val = {266534912, 0, 0, 268374984,
> 14, 267945972, 3, 4289391833, 7, 267945976, 2, 4289391846, 2, 267947588,
> 1, 267945972, 3, 4289391833, 7, 267945976, 2, 0<repeats 11 times>}},
> sa_flags = 2, sa_restorer = 0x4}
> sigs = {__val = {32, 0<repeats 31 times>}}
> #2 0x0fea77f4 in __libc_message (do_abort=2,
> fmt=0xff8714c "*** glibc detected *** %s: %s: 0x%s ***\n")
> at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
> ap = {{gpr = 5 '\005', fpr = 0 '\0', reserved = 0,
> overflow_arg_area = 0xffaaf478, reg_save_area = 0xffaaf3e0}}
> ap_copy = {{gpr = 2 '\002', fpr = 0 '\0', reserved = 0,
> overflow_arg_area = 0xffaaf478, reg_save_area = 0xffaaf3e0}}
> fd = 4
> list = (struct str_list *) 0xffaaedf0
> nlist =<value optimized out>
> cp =<value optimized out>
> written =<value optimized out>
> #3 0x0feb1ab4 in _int_free (av=0xffb14a8, mem=<value optimized out>)
> at malloc.c:5768
> buf = "10045000"
> cp =<value optimized out>
> p = (mchunkptr) 0x10045000
> size = 8192
> nextchunk = (mchunkptr) 0x10047000
> nextsize = 8192
> prevsize =<value optimized out>
> bck =<value optimized out>
> fwd =<value optimized out>
> errstr =<value optimized out>
> #4 0x0feb5b68 in *__GI___libc_free (mem=0x10046000) at malloc.c:3545
> __futex =<value optimized out>
> ar_ptr = (mstate) 0xffb14a8
> p =<value optimized out>
> hook =<value optimized out>
> #5 0x10007bb0 in ocfs2_free (ptr=0xffaaf530) at memory.c:65
> No locals.
> #6 0x10002748 in traverse_inode ()
> No symbol table info available.
> #7 0x10001f50 in traverse_group_desc ()
> No symbol table info available.
> #8 0x10002334 in traverse_chains ()
> No symbol table info available.
> #9 0x100026a0 in traverse_inode ()
> No symbol table info available.
> #10 0x10001f50 in traverse_group_desc ()
> No symbol table info available.
> #11 0x10002334 in traverse_chains ()
> No symbol table info available.
> #12 0x100026a0 in traverse_inode ()
> No symbol table info available.
> #13 0x1000358c in scan_raw_disk ()
> No symbol table info available.
> #14 0x10003e28 in main ()
> No symbol table info available.
> #15 0x0fe4dc60 in generic_start_main (main=0x10003a88<main>, argc=3,
> ubp_av=0xffaafa74, auxvec=0xffaafaec, init=<value optimized out>,
> fini=<value optimized out>, rtld_fini=<value optimized out>,
> stack_end=<value optimized out>) at ../csu/libc-start.c:231
> self = (struct pthread *) 0x0
> result =<value optimized out>
> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {-8358199, 0, 265182513, 0,
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 268234624, -5572000, -5571980, 3, 0,
> 268367568, 268107764, 0, 570426402, 0<repeats 37 times>, 268372748,
> -5572296, -5572120, 0, 267932808, -5572100, 0, -1173947391, -5572288,
> 0, 268372748, 268369884, -5572320, -5572128, 268219208, 0, 0, 0,
> 265177660, 174420993, 10485760, 0, 0, -2147483648, 1,
> 0<repeats 17 times>, 570426402, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0},
> mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0xffafff4, 0xffaafaec},
> data = {prev = 0x0, cleanup = 0x0, canceltype = 268107764}}}
> #16 0x0fe4dea0 in __libc_start_main (argc=3, ubp_av=0xffaafa74,
> ubp_ev=<value optimized out>, auxvec=0xffaafaec,
> rtld_fini=0xffcef80<_dl_fini>, stinfo=0x1001a0c8,
> stack_on_entry=0xffaafa60)
> at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:122
> No locals.
> #17 0x00000000 in ?? ()
> No symbol table info available.
>
> However, I don't know which package contains the traverse_* functions.
> If you know, please let me know so I can install it.
>
> Also, the last time the servers crashed we fsck'ed the filesystem
> without any problem. But, we have added/deleted a lot of space and files
> since then.
>
> Thanks,
>
> George
>
>
>
> On Fri, 2011-09-16 at 11:00 -0700, Sunil Mushran wrote:
>> I got it. But I still don't see the symbols. Maybe we are corrupting the stack.
>> Maybe this is ppc specific. Do you have an x86/x86_64 box that can access
>> the same volume? If so I could give you a drop of the same for that arch.
>>
>> Also, have you run fsck on this volume before? One reason o2image could
>> fail is if there is a bad block pointer. While it is supposed to handle all such
>> cases, it is known to miss some cases.
>>
>> On 09/16/2011 12:06 AM, Betzos Giorgos wrote:
>>> Please try http://portal-md.glk.gr/ocfs2/core.32578.bz2
>>>
>>> Please let me know, in case you have any problem downloading it.
>>>
>>> Thanks,
>>>
>>> George
>>>
>>> On Thu, 2011-09-15 at 09:45 -0700, Sunil Mushran wrote:
>>>> I was hoping to get a readable stack. Please could you provide a link to
>>>> the coredump.
>>>>
>>>> On 09/15/2011 02:51 AM, Betzos Giorgos wrote:
>>>>> Hello,
>>>>>
>>>>> I am sorry for the delay in responding. Unfortunately, it faulted again.
>>>>>
>>>>> Here is the log, although my email client folds the Memory Map lines.
>>>>> The core file is available.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> George
>>>>>
>>>>> # ./o2image.ppc.dbg /dev/mapper/mpath0 /files_shared/u02.o2image
>>>>> *** glibc detected *** ./o2image.ppc.dbg: corrupted double-linked list: 0x10075000 ***
>>>>> ======= Backtrace: =========
>>>>> /lib/libc.so.6[0xfeb1ab4]
>>>>> /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
>>>>> ./o2image.ppc.dbg[0x1000d098]
>>>>> ./o2image.ppc.dbg[0x1000297c]
>>>>> ./o2image.ppc.dbg[0x10001eb8]
>>>>> ./o2image.ppc.dbg[0x1000228c]
>>>>> ./o2image.ppc.dbg[0x10002804]
>>>>> ./o2image.ppc.dbg[0x10001eb8]
>>>>> ./o2image.ppc.dbg[0x1000228c]
>>>>> ./o2image.ppc.dbg[0x10002804]
>>>>> ./o2image.ppc.dbg[0x10003bbc]
>>>>> ./o2image.ppc.dbg[0x10004480]
>>>>> /lib/libc.so.6[0xfe4dc60]
>>>>> /lib/libc.so.6[0xfe4dea0]
>>>>> ======= Memory map: ========
>>>>> 00100000-00120000 r-xp 00100000 00:00 0 [vdso]
>>>>> 0f430000-0f440000 r-xp 00000000 08:13 180307 /lib/libcom_err.so.2.1
>>>>> 0f440000-0f450000 rw-p 00000000 08:13 180307 /lib/libcom_err.so.2.1
>>>>> 0f900000-0f9c0000 r-xp 00000000 08:13 180293 /lib/libglib-2.0.so.0.1200.3
>>>>> 0f9c0000-0f9d0000 rw-p 000b0000 08:13 180293 /lib/libglib-2.0.so.0.1200.3
>>>>> 0fa40000-0fa50000 r-xp 00000000 08:13 180292 /lib/librt-2.5.so
>>>>> 0fa50000-0fa60000 r--p 00000000 08:13 180292 /lib/librt-2.5.so
>>>>> 0fa60000-0fa70000 rw-p 00010000 08:13 180292 /lib/librt-2.5.so
>>>>> 0fce0000-0fd00000 r-xp 00000000 08:13 180291 /lib/libpthread-2.5.so
>>>>> 0fd00000-0fd10000 r--p 00010000 08:13 180291 /lib/libpthread-2.5.so
>>>>> 0fd10000-0fd20000 rw-p 00020000 08:13 180291 /lib/libpthread-2.5.so
>>>>> 0fe30000-0ffa0000 r-xp 00000000 08:13 180288 /lib/libc-2.5.so
>>>>> 0ffa0000-0ffb0000 r--p 00160000 08:13 180288 /lib/libc-2.5.so
>>>>> 0ffb0000-0ffc0000 rw-p 00170000 08:13 180288 /lib/libc-2.5.so
>>>>> 0ffc0000-0ffe0000 r-xp 00000000 08:13 180287 /lib/ld-2.5.so
>>>>> 0ffe0000-0fff0000 r--p 00010000 08:13 180287 /lib/ld-2.5.so
>>>>> 0fff0000-10000000 rw-p 00020000 08:13 180287 /lib/ld-2.5.so
>>>>> 10000000-10050000 r-xp 00000000 08:13 7487795 /root/o2image.ppc.dbg
>>>>> 10050000-10060000 rw-p 00040000 08:13 7487795 /root/o2image.ppc.dbg
>>>>> 10060000-10090000 rwxp 10060000 00:00 0 [heap]
>>>>> f7680000-f7ff0000 rw-p f7680000 00:00 0
>>>>> ff9a0000-ffaf0000 rw-p ff9a0000 00:00 0 [stack]
>>>>> Aborted (core dumped)
>>>>>
>>>>>
>>>>> On Thu, 2011-09-08 at 12:10 -0700, Sunil Mushran wrote:
>>>>>> http://oss.oracle.com/~smushran/o2image.ppc.dbg
>>>>>>
>>>>>> Use the above executable. Hoping it won't fault. But if it does
>>>>>> email me the backtrace. That trace will be readable as the exec
>>>>>> has debugging symbols enabled.
>>>>>>
>>>>>> On 09/07/2011 11:24 PM, Betzos Giorgos wrote:
>>>>>>> # rpm -q ocfs2-tools
>>>>>>> ocfs2-tools-1.4.4-1.el5.ppc
>>>>>>>
>>>>>>> On Wed, 2011-09-07 at 09:13 -0700, Sunil Mushran wrote:
>>>>>>>> version of ocfs2-tools?
>>>>>>>>
>>>>>>>> On 09/07/2011 09:10 AM, Betzos Giorgos wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I tried what you suggested but here is what I got:
>>>>>>>>>
>>>>>>>>> # o2image /dev/mapper/mpath0 /files_shared/u02.o2image
>>>>>>>>> *** glibc detected *** o2image: corrupted double-linked list: 0x10045000 ***
>>>>>>>>> ======= Backtrace: =========
>>>>>>>>> /lib/libc.so.6[0xfeb1ab4]
>>>>>>>>> /lib/libc.so.6(cfree+0xc8)[0xfeb5b68]
>>>>>>>>> o2image[0x10007bb0]
>>>>>>>>> o2image[0x10002748]
>>>>>>>>> o2image[0x10001f50]
>>>>>>>>> o2image[0x10002334]
>>>>>>>>> o2image[0x100026a0]
>>>>>>>>> o2image[0x10001f50]
>>>>>>>>> o2image[0x10002334]
>>>>>>>>> o2image[0x100026a0]
>>>>>>>>> o2image[0x1000358c]
>>>>>>>>> o2image[0x10003e28]
>>>>>>>>> /lib/libc.so.6[0xfe4dc60]
>>>>>>>>> /lib/libc.so.6[0xfe4dea0]
>>>>>>>>> ======= Memory map: ========
>>>>>>>>> 00100000-00120000 r-xp 00100000 00:00 0 [vdso]
>>>>>>>>> 0f550000-0f560000 r-xp 00000000 08:13 2881590 /lib/libcom_err.so.2.1
>>>>>>>>> 0f560000-0f570000 rw-p 00000000 08:13 2881590 /lib/libcom_err.so.2.1
>>>>>>>>> 0f900000-0f9c0000 r-xp 00000000 08:13 2881576 /lib/libglib-2.0.so.0.1200.3
>>>>>>>>> 0f9c0000-0f9d0000 rw-p 000b0000 08:13 2881576 /lib/libglib-2.0.so.0.1200.3
>>>>>>>>> 0fa40000-0fa50000 r-xp 00000000 08:13 2881575 /lib/librt-2.5.so
>>>>>>>>> 0fa50000-0fa60000 r--p 00000000 08:13 2881575 /lib/librt-2.5.so
>>>>>>>>> 0fa60000-0fa70000 rw-p 00010000 08:13 2881575 /lib/librt-2.5.so
>>>>>>>>> 0fce0000-0fd00000 r-xp 00000000 08:13 2881574 /lib/libpthread-2.5.so
>>>>>>>>> 0fd00000-0fd10000 r--p 00010000 08:13 2881574 /lib/libpthread-2.5.so
>>>>>>>>> 0fd10000-0fd20000 rw-p 00020000 08:13 2881574 /lib/libpthread-2.5.so
>>>>>>>>> 0fe30000-0ffa0000 r-xp 00000000 08:13 2881571 /lib/libc-2.5.so
>>>>>>>>> 0ffa0000-0ffb0000 r--p 00160000 08:13 2881571 /lib/libc-2.5.so
>>>>>>>>> 0ffb0000-0ffc0000 rw-p 00170000 08:13 2881571 /lib/libc-2.5.so
>>>>>>>>> 0ffc0000-0ffe0000 r-xp 00000000 08:13 2881570 /lib/ld-2.5.so
>>>>>>>>> 0ffe0000-0fff0000 r--p 00010000 08:13 2881570 /lib/ld-2.5.so
>>>>>>>>> 0fff0000-10000000 rw-p 00020000 08:13 2881570 /lib/ld-2.5.so
>>>>>>>>> 10000000-10020000 r-xp 00000000 08:13 15058799 /sbin/o2image
>>>>>>>>> 10020000-10030000 rw-p 00010000 08:13 15058799 /sbin/o2image
>>>>>>>>> 10030000-10060000 rwxp 10030000 00:00 0 [heap]
>>>>>>>>> f7680000-f7ff0000 rw-p f7680000 00:00 0
>>>>>>>>> ffc60000-ffdb0000 rw-p ffc60000 00:00 0 [stack]
>>>>>>>>> Aborted (core dumped)
>>>>>>>>>
>>>>>>>>> I have the core file, if you need it.
>>>>>>>>>
>>>>>>>>> Here is some information about the fs in question.
>>>>>>>>> It is used to store Oracle Archive Logs and also to store the rman backup of the DB.
>>>>>>>>> In the last crash case the fs became full while rman was running. Maybe we can estimate from
>>>>>>>>> this the size of the write in that particular case. Oracle DB rman backup files are from 7 to 11 GB.
>>>>>>>>> Maybe Oracle DataGuard was also running on the same fs.
>>>>>>>>> After the crash, when we rebooted the servers, they would crash again. We then noticed that
>>>>>>>>> the fs was full and we removed some unneeded files.
>>>>>>>>>
>>>>>>>>> The system has crashed a couple more times when the above conditions may not have been the same.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> George
>>>>>>>>>
>>>>>>>>> ________________________________________
>>>>>>>>> From: Sunil Mushran
>>>>>>>>> Sent: Friday, September 02, 2011 8:24 PM
>>>>>>>>> To: Betzos Giorgos
>>>>>>>>> Cc: ocfs2-users at oss.oracle.com
>>>>>>>>> Subject: Re: [Ocfs2-users] Linux kernel crash due to ocfs2
>>>>>>>>>
>>>>>>>>> Can you provide me with the o2image? It includes the entire fs metadata.
>>>>>>>>> The size of the image file depends on the number of files/dirs.
>>>>>>>>>
>>>>>>>>> # o2image /dev/sdX /path/to/image/file
>>>>>>>>>
>>>>>>>>> So the error is clear. We have underestimated the amount of credits
>>>>>>>>> (num of blocks that need to be dirtied in that transaction). This is the most
>>>>>>>>> common write path in the fs and thus hit heavily. So I am surprised by this.
>>>>>>>>>
>>>>>>>>> One way to fix it is by reproducing it in-house. And having the image will allow
>>>>>>>>> us to mount the fs and reproduce the issue. Do you know the size of the write?
>>>>>>>>>
>>>>>>>>> On 09/02/2011 07:23 AM, Betzos Giorgos wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> we have a pair of IBM P570 servers running RHEL5.2
>>>>>>>>>> kernel 2.6.18-92.el5.ppc64
>>>>>>>>>> We have Oracle RAC on ocfs2 storage
>>>>>>>>>> ocfs2 is 1.4.7-1 for the above kernel (downloaded from oracle oss site)
>>>>>>>>>>
>>>>>>>>>> Recently both servers have been crashing with the following error:
>>>>>>>>>>
>>>>>>>>>> Assertion failure in journal_dirty_metadata() at
>>>>>>>>>> fs/jbd/transaction.c:1130: "handle->h_buffer_credits > 0"
>>>>>>>>>> kernel BUG in journal_dirty_metadata at fs/jbd/transaction.c:1130!
>>>>>>>>>>
>>>>>>>>>> We get some kind of kernel debug prompt.
>>>>>>>>>>
>>>>>>>>>> the stack is as follows:
>>>>>>>>>>
>>>>>>>>>> .ocfs2_journal_dirty+0x78/0x13c [ocfs2]
>>>>>>>>>> .ocfs2_search_chain+0x131c/0x165c [ocfs2]
>>>>>>>>>> .ocfs2_claim_suballoc_bits+0xadc/0xd94 [ocfs2]
>>>>>>>>>> .__ocfs2_claim_clusters+0x1b0/0x348 [ocfs2]
>>>>>>>>>> .ocf2_do_extend_allocation+0x1f8/0x5b4 [ocfs2]
>>>>>>>>>> .ocfs2_write_cluster_by_desc+0x128/0x850 [ocfs2]
>>>>>>>>>> .ocfs2_write_begin_nolock+0xdc0/0xfbc [ocfs2]
>>>>>>>>>> .ocfs2_write_begin+0x124/0x224 [ocfs2]
>>>>>>>>>> .ocfs2_file_aio_write+0x6a4/0xb40 [ocfs2]
>>>>>>>>>> .aio_pwrite+0x50/0xb4
>>>>>>>>>> .aio_run_iocb+0x140/0x214
>>>>>>>>>> .io_submit_one+0x2fc/0x3a8
>>>>>>>>>> .sys_io_submit+0xd0/0x17c
>>>>>>>>>> syscall_exit+0x0/0x40
>>>>>>>>>>
>>>>>>>>>> In the last crash case, the file system was full.
>>>>>>>>>>
>>>>>>>>>> Any clues?
>>>>>>>>>>
>>>>>>>>>> There seems to have been an ocfs2 kernel patch some time ago for the 2.6.20.2
>>>>>>>>>> kernel that fixed some journal credits updates.
>>>>>>>>>>
>>>>>>>>>> Is this another bug?
>>>>>>>>>>
>>>>>>>>>> Any help will be greatly appreciated, because this is a production
>>>>>>>>>> system.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> George
>