Hi Eric,
I found the code below from
archive.ubuntu.com/ubuntu/pool/main/l/linux/fs/ocfs2/suballoc.c
2410         if (status < 0) {
2411                 mlog_errno(status);
2412                 goto bail;
2413         }
2414
2415         if (undo_fn) {
2416                 jbd_lock_bh_state(group_bh);
2417                 undo_bg = (struct ocfs2_group_desc *)
2418                         bh2jh(group_bh)->b_committed_data;
2419                 BUG_ON(!undo_bg);
2420         }
2421
2422         tmp = num_bits;
2423         while(tmp--) {
2424                 ocfs2_clear_bit((bit_off + tmp),
2425                                 (unsigned long *) bg->bg_bitmap);
2426                 if (undo_fn)
2427                         undo_fn(bit_off + tmp,
2428                                 (unsigned long *) undo_bg->bg_bitmap);
2429         }
2430         le16_add_cpu(&bg->bg_free_bits_count, num_bits);
2431         if (le16_to_cpu(bg->bg_free_bits_count) > le16_to_cpu(bg->bg_bits)) {
2432                 ocfs2_error(alloc_inode->i_sb, "Group descriptor # %llu has bit"
2433                             " count %u but claims %u are freed. num_bits %d",
2434                             (unsigned long long)le64_to_cpu(bg->bg_blkno),
2435                             le16_to_cpu(bg->bg_bits),
2436                             le16_to_cpu(bg->bg_free_bits_count), num_bits);
2437                 return -EROFS;
2438         }
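
For anyone repeating this lookup, here is a minimal sketch of pulling the file and line number out of the "kernel BUG at" message and printing the lines around it. The helper names (parse_bug_location, source_context) are hypothetical, and the sketch assumes you have a source tree matching the running kernel; with a different kernel version the line number will point at different code.

```python
import re

# Matches e.g. "kernel BUG at /build/.../fs/ocfs2/suballoc.c:2419!"
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")


def parse_bug_location(log_line):
    """Return (source_path, line_number) from a 'kernel BUG at' message, or None."""
    match = BUG_RE.search(log_line)
    if match is None:
        return None
    return match.group("path"), int(match.group("line"))


def source_context(lines, line_no, radius=5):
    """Return (number, text) pairs for the lines around line_no (1-indexed)."""
    first = max(1, line_no - radius)
    last = min(len(lines), line_no + radius)
    return [(n, lines[n - 1]) for n in range(first, last + 1)]


log_line = (
    "Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at "
    "/build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!"
)
path, line_no = parse_bug_location(log_line)

# The build prefix is specific to the build machine; keep only the part
# inside the kernel tree, i.e. everything from "fs/" onwards.
rel_path = path[path.index("/fs/") + 1:]
print(rel_path, line_no)
```

With a matching tree checked out, source_context(open(rel_path).read().splitlines(), line_no) would return the numbered lines around the BUG_ON(), which is the same excerpt as pasted above.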
On Wed, Sep 14, 2016 at 10:13 AM, Eric Ren <zren at suse.com> wrote:
> Hi,
>
> On 09/14/2016 02:30 PM, Ishmael Tsoaela wrote:
>>
>> Hi Eric,
>>
>> Could you paste the code context around this line?
>> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at
>> /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
>
> This message is very important because it shows exactly which line of the
> source code directly triggers this BUG() output. What I would like you to
> do is paste the code around line 2419 of suballoc.c, so that I can locate
> the BUG() in my local tree, because the code at line 2419 differs between
> kernel versions.
>>
>>
>> Apologies, I tried to understand this but failed.
>>
>>
>> root at nodeB:~# echo w > /proc/sysrq-trigger
>> root at nodeB:~#
>>
>> The node rebooted and the mount points are accessible from all 3 nodes. I am
>> not sure why, but it seems it will be difficult to figure out what went wrong
>> with ocfs2 without proper knowledge, so let me not waste any of your time.
>> Let me figure out `crash`[1][2] or gdb, then hopefully when it happens next
>> time I will have a much better understanding.
>
> OK, good luck!
>
>
> Eric
>>
>>
>> On Tue, Sep 13, 2016 at 11:44 AM, Eric Ren <zren at suse.com> wrote:
>>>
>>> On 09/13/2016 05:01 PM, Ishmael Tsoaela wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>> Sorry, here are the other 2 syslogs, if you need them, and the debug output.
>>>
>>> According to the logs, nodeB should be the first node that hit the
>>> problem.
>>>
>>> Could you paste the code context around this line?
>>> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at
>>> /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
>>>>
>>>> The request in the snip attached just hangs
>>>
>>> NodeB should have taken this exclusive cluster lock, so any commands
>>> trying to access that file will hang.
>>>
>>> Could you provide the output of `echo w > /proc/sysrq-trigger`? An OCFS2
>>> issue is not easy to debug if the developer cannot reproduce it locally,
>>> and this is the case here. BTW, you can narrow it down with `crash`[1][2]
>>> or gdb if you have some knowledge of kernel internals.
>>>
>>> [1] http://www.dedoimedo.com/computers/crash-analyze.html
>>> [2] https://people.redhat.com/anderson/crash_whitepaper/
>>>
>>> Eric
>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 13, 2016 at 10:37 AM, Ishmael Tsoaela <ishmaelt3 at gmail.com> wrote:
>>>>>
>>>>> Thanks for the response
>>>>>
>>>>>
>>>>> 1. the disk is a shared ceph rbd device
>>>>>
>>>>> #rbd showmapped
>>>>> id pool image snap device
>>>>> 1 vmimages block_vmimages - /dev/rbd1
>>>>>
>>>>>
>>>>> 2. ocfs2 has been working well for 2 months now, with a reboot 12 days
>>>>> ago
>>>>>
>>>>> 3. 3 ceph nodes all have the rbd image mapped and ocfs2 mounted
>>>>>
>>>>> commands used
>>>>>
>>>>> #sudo rbd map block_vmimages --pool vmimages --name
>>>>>
>>>>> #sudo mount /dev/rbd/vmimages/block_vmimages /mnt/vmimages/
>>>>> /dev/rbd1
>>>>>
>>>>> 4.
>>>>> root at nodeC:~# sudo debugfs.ocfs2 -R stats /dev/rbd1
>>>>> Revision: 0.90
>>>>> Mount Count: 0 Max Mount Count: 20
>>>>> State: 0 Errors: 0
>>>>> Check Interval: 0 Last Check: Tue Aug 2 15:41:12 2016
>>>>> Creator OS: 0
>>>>> Feature Compat: 3 backup-super strict-journal-super
>>>>> Feature Incompat: 592 sparse inline-data xattr
>>>>> Tunefs Incomplete: 0
>>>>> Feature RO compat: 1 unwritten
>>>>> Root Blknum: 5 System Dir Blknum: 6
>>>>> First Cluster Group Blknum: 3
>>>>> Block Size Bits: 12 Cluster Size Bits: 12
>>>>> Max Node Slots: 16
>>>>> Extended Attributes Inline Size: 256
>>>>> Label:
>>>>> UUID: 238F878003E7455FA5B01CC884D1047F
>>>>> Hash: 919897149 (0x36d4843d)
>>>>> DX Seed[0]: 0x00000000
>>>>> DX Seed[1]: 0x00000000
>>>>> DX Seed[2]: 0x00000000
>>>>> Cluster stack: classic o2cb
>>>>> Inode: 2 Mode: 00 Generation: 1754092981 (0x688d55b5)
>>>>> FS Generation: 1754092981 (0x688d55b5)
>>>>> CRC32: 00000000 ECC: 0000
>>>>> Type: Unknown Attr: 0x0 Flags: Valid System Superblock
>>>>> Dynamic Features: (0x0)
>>>>> User: 0 (root) Group: 0 (root) Size: 0
>>>>> Links: 0 Clusters: 640000000
>>>>> ctime: 0x57a0a2f8 -- Tue Aug 2 15:41:12 2016
>>>>> atime: 0x0 -- Thu Jan 1 02:00:00 1970
>>>>> mtime: 0x57a0a2f8 -- Tue Aug 2 15:41:12 2016
>>>>> dtime: 0x0 -- Thu Jan 1 02:00:00 1970
>>>>> ctime_nsec: 0x00000000 -- 0
>>>>> atime_nsec: 0x00000000 -- 0
>>>>> mtime_nsec: 0x00000000 -- 0
>>>>> Refcount Block: 0
>>>>> Last Extblk: 0 Orphan Slot: 0
>>>>> Sub Alloc Slot: Global Sub Alloc Bit: 65535
>>>>>
>>>>>
>>>>>
>>>>> thanks for the assistance
>>>>>
>>>>>
>>>>> On Tue, Sep 13, 2016 at 10:23 AM, Eric Ren <zren at suse.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 09/13/2016 03:16 PM, Ishmael Tsoaela wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have an ocfs2 mount point on 3 ceph cluster nodes, and suddenly I
>>>>>>> cannot read from or write to the mount point, although the cluster is
>>>>>>> clean and showing no errors.
>>>>>>
>>>>>> 1. What is your ocfs2 shared disk? I mean, is it a shared disk exported
>>>>>> by an iSCSI target, or a ceph rbd device?
>>>>>> 2. Did you check whether ocfs2 worked well before any read/write? And how?
>>>>>> 3. Could you elaborate in more detail on how the ceph nodes use ocfs2?
>>>>>> 4. Please provide the output of:
>>>>>> #sudo debugfs.ocfs2 -R stats /dev/sda
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Are there any other logs I can check?
>>>>>>
>>>>>> All log messages should go to /var/log/messages. Could you attach the
>>>>>> whole log file?
>>>>>>
>>>>>> Eric
>>>>>>>
>>>>>>>
>>>>>>> There are some log entries in kern.log:
>>>>>>>
>>>>>>>
>>>>>>> kern.log
>>>>>>>
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.345504] invalid opcode: 0000 [#1] SMP
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.370081] Modules linked in: vhost_net vhost macvtap macvlan ocfs2 quota_tree rbd libceph ipmi_si mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables dell_rbu ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc binfmt_misc ipmi_devintf kvm_amd dcdbas kvm input_leds joydev amd64_edac_mod crct10dif_pclmul edac_core shpchp i2c_piix4 fam15h_power crc32_pclmul edac_mce_amd ipmi_ssif k10temp aesni_intel aes_x86_64 lrw gf128mul 8250_fintek glue_helper acpi_power_meter mac_hid serio_raw ablk_helper cryptd ipmi_msghandler xfs libcrc32c lp parport ixgbe dca hid_generic uas usbhid vxlan usb_storage ip6_udp_tunnel hid udp_tunnel ptp psmouse bnx2 pps_core megaraid_sas mdio [last unloaded: ipmi_si]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.898986] CPU: 10 PID: 65016 Comm: cp Not tainted 4.2.0-27-generic #32~14.04.1-Ubuntu
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.012469] Hardware name: Dell Inc. PowerEdge R515/0RMRF7, BIOS 2.0.2 10/22/2012
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.134659] task: ffff880a61dca940 ti: ffff88084a5ac000 task.ti: ffff88084a5ac000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.265260] RIP: 0010:[<ffffffffc062026b>] [<ffffffffc062026b>] _ocfs2_free_suballoc_bits+0x4db/0x4e0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.406559] RSP: 0018:ffff88084a5af798 EFLAGS: 00010246
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.479958] RAX: 0000000000000000 RBX: ffff881acebcb000 RCX: ffff881fcd372e00
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.630768] RDX: ffff881fd0d4dc30 RSI: ffff88197e351bc8 RDI: ffff880fd127b2b0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.789688] RBP: ffff88084a5af818 R08: 0000000000000002 R09: 0000000000007e00
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.950053] R10: ffff880d39a21020 R11: ffff88084a5af550 R12: 00000000000000fa
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.113014] R13: 0000000000005ab1 R14: 0000000000000000 R15: ffff880fb2d43000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.276484] FS: 00007fcc68373840(0000) GS:ffff881fdde80000(0000) knlGS:0000000000000000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.440016] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.521496] CR2: 00005647b2ee6d80 CR3: 0000000198b93000 CR4: 00000000000406e0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.681357] Stack:
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.758498] 0000000000000000 ffff880fd127b2e8 ffff881fc6655f08 00005bab00000000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.913655] ffff881fd0c51d80 ffff88197e351bc8 ffff880fd127b330 ffff880e9eaa6000
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.068609] ffff88197e351bc8 ffffffff817ba6d6 0000000000000001 000000001ac592b1
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.223347] Call Trace:
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.298560] [<ffffffff817ba6d6>] ? mutex_lock+0x16/0x37
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.374183] [<ffffffffc0621bca>] _ocfs2_free_clusters+0xea/0x200 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.449628] [<ffffffffc061ecb0>] ? ocfs2_put_slot+0xe0/0xe0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.523971] [<ffffffffc061ecb0>] ? ocfs2_put_slot+0xe0/0xe0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.595803] [<ffffffffc06234e5>] ocfs2_free_clusters+0x15/0x20 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.666614] [<ffffffffc05d6037>] __ocfs2_flush_truncate_log+0x247/0x560 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.806017] [<ffffffffc05d25a6>] ? ocfs2_num_free_extents+0x56/0x120 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.946141] [<ffffffffc05db258>] ocfs2_remove_btree_range+0x4e8/0x760 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.086490] [<ffffffffc05dc720>] ocfs2_commit_truncate+0x180/0x590 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.158189] [<ffffffffc06022b0>] ? ocfs2_allocate_extend_trans+0x130/0x130 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.297235] [<ffffffffc05f7e2c>] ocfs2_truncate_file+0x39c/0x610 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.368060] [<ffffffffc05fe650>] ? ocfs2_read_inode_block+0x10/0x20 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.505117] [<ffffffffc05fa2d7>] ocfs2_setattr+0x4b7/0xa50 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.574617] [<ffffffffc064c4fd>] ? ocfs2_xattr_get+0x9d/0x130 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.643722] [<ffffffff8120705e>] notify_change+0x1ae/0x380
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.712037] [<ffffffff811e8436>] do_truncate+0x66/0xa0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.778685] [<ffffffff811f8527>] path_openat+0x277/0x1330
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.845776] [<ffffffffc05f2bed>] ? __ocfs2_cluster_unlock.isra.36+0x7d/0xb0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.977677] [<ffffffff811fae8a>] do_filp_open+0x7a/0xd0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.043693] [<ffffffff811f9f8f>] ? getname_flags+0x4f/0x1f0
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.108385] [<ffffffff81208006>] ? __alloc_fd+0x46/0x110
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.171504] [<ffffffff811ea509>] do_sys_open+0x129/0x260
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.232889] [<ffffffff811ea65e>] SyS_open+0x1e/0x20
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.294292] [<ffffffff817bc3b2>] entry_SYSCALL_64_fastpath+0x16/0x75
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.356257] Code: 65 c0 48 c7 c6 e0 44 65 c0 41 b6 e2 48 8d 5d c8 48 8b 78 28 44 89 24 24 31 c0 49 c7 c4 e2 ff ff ff e8 9a 8d 01 00 e9 c4 fd ff ff <0f> 0b 0f 0b 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 89 cf b9 01
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.549534] RIP [<ffffffffc062026b>] _ocfs2_free_suballoc_bits+0x4db/0x4e0 [ocfs2]
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.681076] RSP <ffff88084a5af798>
>>>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.834529] ---[ end trace 5f4b84ac539ed56c ]---
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Ocfs2-users mailing list
>>>>>>> Ocfs2-users at oss.oracle.com
>>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>>>
>