Hi Eric,
> Could you paste the code context around this line?
> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at
> /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
Apologies, I tried to find and understand that code context but failed.
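In case it is useful, this is how I plan to pull up that context myself on Ubuntu (the package and directory names below are my guess for this 4.2.0 kernel, and deb-src entries need to be enabled in sources.list first):

#apt-get source linux-image-$(uname -r)
#cd linux-lts-wily-4.2.0/
#sed -n '2405,2425p' fs/ocfs2/suballoc.c

That should print the BUG_ON() around suballoc.c:2419 that fired here.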
root@nodeB:~# echo w > /proc/sysrq-trigger
root@nodeB:~#
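If I understand sysrq correctly, the 'w' trigger dumps the blocked (D-state) tasks into the kernel ring buffer rather than to stdout, so the empty prompt above is expected; next time I will capture it with something like:

#echo w > /proc/sysrq-trigger
#dmesg | tail -n 200 > sysrq-w.txt

and attach the resulting file.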
The node has rebooted and the mount points are accessible from all 3 nodes
again; I am not sure why. It seems it will be difficult to figure out what
went wrong with ocfs2 without proper kernel knowledge, so rather than waste
your time, let me first get familiar with `crash` [1][2] and gdb; hopefully
when it happens next time I will understand it much better.
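For reference, here is the rough sequence I have in mind (only a sketch; the package names and vmcore path are my assumptions for Ubuntu 14.04, and kdump has to be set up before the next crash):

#sudo apt-get install linux-crashdump crash
(reboot so the crashkernel memory reservation takes effect)
After the next BUG, a vmcore should land under /var/crash, then:
#sudo crash /usr/lib/debug/boot/vmlinux-$(uname -r) /var/crash/<timestamp>/dump.<timestamp>
crash> log    (kernel ring buffer from the time of the crash)
crash> bt     (backtrace of the crashing task)

The debug vmlinux would come from the kernel's -dbgsym ddeb package, if I read the docs right.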
On Tue, Sep 13, 2016 at 11:44 AM, Eric Ren <zren at suse.com> wrote:
> On 09/13/2016 05:01 PM, Ishmael Tsoaela wrote:
>>
>> Hi Eric,
>>
>> Sorry. Here are the other 2 syslogs, in case you need them, and the debug output.
>
> According to the logs, nodeB should be the first one that hit the problem.
>
> Could you paste the code context around this line?
> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at
> /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
>>
>> The request in the attached snip just hangs
>
> NodeB should have taken this exclusive cluster lock, so any commands trying
> to access that file will hang.
>
> Could you provide the output of `echo w > /proc/sysrq-trigger`? An OCFS2
> issue is not easy to debug if the developer cannot reproduce it locally,
> and this is such a case. BTW, you can narrow it down with `crash` [1][2]
> or gdb if you have some knowledge of kernel internals.
>
> [1] http://www.dedoimedo.com/computers/crash-analyze.html
> [2] https://people.redhat.com/anderson/crash_whitepaper/
>
> Eric
>
>>
>> On Tue, Sep 13, 2016 at 10:37 AM, Ishmael Tsoaela <ishmaelt3 at gmail.com>
>> wrote:
>>>
>>> Thanks for the response
>>>
>>>
>>> 1. the disk is a shared ceph rbd device
>>>
>>> #rbd showmapped
>>> id pool image snap device
>>> 1 vmimages block_vmimages - /dev/rbd1
>>>
>>>
>>> 2. ocfs2 has been working well for 2 months now, with a reboot 12 days ago
>>>
>>> 3. all 3 ceph nodes have the rbd image mapped and ocfs2 mounted
>>>
>>> commands used
>>>
>>> #sudo rbd map block_vmimages --pool vmimages --name
>>>
>>> #sudo mount /dev/rbd/vmimages/block_vmimages /mnt/vmimages/
>>> (/dev/rbd/vmimages/block_vmimages resolves to /dev/rbd1)
>>>
>>> 4.
>>> root@nodeC:~# sudo debugfs.ocfs2 -R stats /dev/rbd1
>>> Revision: 0.90
>>> Mount Count: 0 Max Mount Count: 20
>>> State: 0 Errors: 0
>>> Check Interval: 0 Last Check: Tue Aug 2 15:41:12 2016
>>> Creator OS: 0
>>> Feature Compat: 3 backup-super strict-journal-super
>>> Feature Incompat: 592 sparse inline-data xattr
>>> Tunefs Incomplete: 0
>>> Feature RO compat: 1 unwritten
>>> Root Blknum: 5 System Dir Blknum: 6
>>> First Cluster Group Blknum: 3
>>> Block Size Bits: 12 Cluster Size Bits: 12
>>> Max Node Slots: 16
>>> Extended Attributes Inline Size: 256
>>> Label:
>>> UUID: 238F878003E7455FA5B01CC884D1047F
>>> Hash: 919897149 (0x36d4843d)
>>> DX Seed[0]: 0x00000000
>>> DX Seed[1]: 0x00000000
>>> DX Seed[2]: 0x00000000
>>> Cluster stack: classic o2cb
>>> Inode: 2 Mode: 00 Generation: 1754092981 (0x688d55b5)
>>> FS Generation: 1754092981 (0x688d55b5)
>>> CRC32: 00000000 ECC: 0000
>>> Type: Unknown Attr: 0x0 Flags: Valid System Superblock
>>> Dynamic Features: (0x0)
>>> User: 0 (root) Group: 0 (root) Size: 0
>>> Links: 0 Clusters: 640000000
>>> ctime: 0x57a0a2f8 -- Tue Aug 2 15:41:12 2016
>>> atime: 0x0 -- Thu Jan 1 02:00:00 1970
>>> mtime: 0x57a0a2f8 -- Tue Aug 2 15:41:12 2016
>>> dtime: 0x0 -- Thu Jan 1 02:00:00 1970
>>> ctime_nsec: 0x00000000 -- 0
>>> atime_nsec: 0x00000000 -- 0
>>> mtime_nsec: 0x00000000 -- 0
>>> Refcount Block: 0
>>> Last Extblk: 0 Orphan Slot: 0
>>> Sub Alloc Slot: Global Sub Alloc Bit: 65535
>>>
>>> thanks for the assistance
>>>
>>>
>>> On Tue, Sep 13, 2016 at 10:23 AM, Eric Ren <zren at suse.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 09/13/2016 03:16 PM, Ishmael Tsoaela wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I have an ocfs2 mount point on 3 ceph cluster nodes, and suddenly I
>>>>> cannot read from or write to the mount point, although the cluster is
>>>>> clean and showing no errors.
>>>>
>>>> 1. What is your ocfs2 shared disk? I mean, is it a shared disk
>>>> exported by an iscsi target, or a ceph rbd device?
>>>> 2. Did you check that ocfs2 worked well before this read/write, and how?
>>>> 3. Could you elaborate in more detail on how the ceph nodes use ocfs2?
>>>> 4. Please provide the output of:
>>>> #sudo debugfs.ocfs2 -R stats /dev/sda
>>>>>
>>>>> Are there any other logs I can check?
>>>>
>>>> All log messages should go to /var/log/messages. Could you attach the
>>>> whole log file?
>>>>
>>>> Eric
>>>>>
>>>>>
>>>>> There are some logs in kern.log about this:
>>>>>
>>>>>
>>>>> kern.log
>>>>>
>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.300882] kernel BUG at
>>>>> /build/linux-lts-wily-Vv6Eyd/linux-lts-wily-4.2.0/fs/ocfs2/suballoc.c:2419!
>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.345504] invalid opcode: 0000 [#1] SMP
>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.370081] Modules linked in:
>>>>> vhost_net vhost macvtap macvlan ocfs2 quota_tree rbd libceph ipmi_si
>>>>> mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase
>>>>> xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
>>>>> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
>>>>> xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp
>>>>> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
>>>>> ip_tables x_tables dell_rbu ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
>>>>> ocfs2_nodemanager ocfs2_stackglue configfs bridge stp llc binfmt_misc
>>>>> ipmi_devintf kvm_amd dcdbas kvm input_leds joydev amd64_edac_mod
>>>>> crct10dif_pclmul edac_core shpchp i2c_piix4 fam15h_power crc32_pclmul
>>>>> edac_mce_amd ipmi_ssif k10temp aesni_intel aes_x86_64 lrw gf128mul
>>>>> 8250_fintek glue_helper acpi_power_meter mac_hid serio_raw ablk_helper
>>>>> cryptd ipmi_msghandler xfs libcrc32c lp parport ixgbe dca hid_generic
>>>>> uas usbhid vxlan usb_storage ip6_udp_tunnel hid udp_tunnel ptp psmouse
>>>>> bnx2 pps_core megaraid_sas mdio [last unloaded: ipmi_si]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104431.898986] CPU: 10 PID: 65016
>>>>> Comm: cp Not tainted 4.2.0-27-generic #32~14.04.1-Ubuntu
>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.012469] Hardware name: Dell
>>>>> Inc. PowerEdge R515/0RMRF7, BIOS 2.0.2 10/22/2012
>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.134659] task: ffff880a61dca940
>>>>> ti: ffff88084a5ac000 task.ti: ffff88084a5ac000
>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.265260] RIP:
>>>>> 0010:[<ffffffffc062026b>] [<ffffffffc062026b>]
>>>>> _ocfs2_free_suballoc_bits+0x4db/0x4e0 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.406559] RSP:
>>>>> 0018:ffff88084a5af798 EFLAGS: 00010246
>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.479958] RAX: 0000000000000000
>>>>> RBX: ffff881acebcb000 RCX: ffff881fcd372e00
>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.630768] RDX: ffff881fd0d4dc30
>>>>> RSI: ffff88197e351bc8 RDI: ffff880fd127b2b0
>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.789688] RBP: ffff88084a5af818
>>>>> R08: 0000000000000002 R09: 0000000000007e00
>>>>> Sep 13 08:10:18 nodeB kernel: [1104432.950053] R10: ffff880d39a21020
>>>>> R11: ffff88084a5af550 R12: 00000000000000fa
>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.113014] R13: 0000000000005ab1
>>>>> R14: 0000000000000000 R15: ffff880fb2d43000
>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.276484] FS:
>>>>> 00007fcc68373840(0000) GS:ffff881fdde80000(0000)
>>>>> knlGS:0000000000000000
>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.440016] CS: 0010 DS: 0000 ES:
>>>>> 0000 CR0: 000000008005003b
>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.521496] CR2: 00005647b2ee6d80
>>>>> CR3: 0000000198b93000 CR4: 00000000000406e0
>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.681357] Stack:
>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.758498] 0000000000000000
>>>>> ffff880fd127b2e8 ffff881fc6655f08 00005bab00000000
>>>>> Sep 13 08:10:18 nodeB kernel: [1104433.913655] ffff881fd0c51d80
>>>>> ffff88197e351bc8 ffff880fd127b330 ffff880e9eaa6000
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.068609] ffff88197e351bc8
>>>>> ffffffff817ba6d6 0000000000000001 000000001ac592b1
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.223347] Call Trace:
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.298560] [<ffffffff817ba6d6>] ?
>>>>> mutex_lock+0x16/0x37
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.374183] [<ffffffffc0621bca>]
>>>>> _ocfs2_free_clusters+0xea/0x200 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.449628] [<ffffffffc061ecb0>] ?
>>>>> ocfs2_put_slot+0xe0/0xe0 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.523971] [<ffffffffc061ecb0>] ?
>>>>> ocfs2_put_slot+0xe0/0xe0 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.595803] [<ffffffffc06234e5>]
>>>>> ocfs2_free_clusters+0x15/0x20 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.666614] [<ffffffffc05d6037>]
>>>>> __ocfs2_flush_truncate_log+0x247/0x560 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.806017] [<ffffffffc05d25a6>] ?
>>>>> ocfs2_num_free_extents+0x56/0x120 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104434.946141] [<ffffffffc05db258>]
>>>>> ocfs2_remove_btree_range+0x4e8/0x760 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.086490] [<ffffffffc05dc720>]
>>>>> ocfs2_commit_truncate+0x180/0x590 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.158189] [<ffffffffc06022b0>] ?
>>>>> ocfs2_allocate_extend_trans+0x130/0x130 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.297235] [<ffffffffc05f7e2c>]
>>>>> ocfs2_truncate_file+0x39c/0x610 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.368060] [<ffffffffc05fe650>] ?
>>>>> ocfs2_read_inode_block+0x10/0x20 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.505117] [<ffffffffc05fa2d7>]
>>>>> ocfs2_setattr+0x4b7/0xa50 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.574617] [<ffffffffc064c4fd>] ?
>>>>> ocfs2_xattr_get+0x9d/0x130 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.643722] [<ffffffff8120705e>]
>>>>> notify_change+0x1ae/0x380
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.712037] [<ffffffff811e8436>]
>>>>> do_truncate+0x66/0xa0
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.778685] [<ffffffff811f8527>]
>>>>> path_openat+0x277/0x1330
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.845776] [<ffffffffc05f2bed>] ?
>>>>> __ocfs2_cluster_unlock.isra.36+0x7d/0xb0 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104435.977677] [<ffffffff811fae8a>]
>>>>> do_filp_open+0x7a/0xd0
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.043693] [<ffffffff811f9f8f>] ?
>>>>> getname_flags+0x4f/0x1f0
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.108385] [<ffffffff81208006>] ?
>>>>> __alloc_fd+0x46/0x110
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.171504] [<ffffffff811ea509>]
>>>>> do_sys_open+0x129/0x260
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.232889] [<ffffffff811ea65e>]
>>>>> SyS_open+0x1e/0x20
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.294292] [<ffffffff817bc3b2>]
>>>>> entry_SYSCALL_64_fastpath+0x16/0x75
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.356257] Code: 65 c0 48 c7 c6 e0
>>>>> 44 65 c0 41 b6 e2 48 8d 5d c8 48 8b 78 28 44 89 24 24 31 c0 49 c7 c4
>>>>> e2 ff ff ff e8 9a 8d 01 00 e9 c4 fd ff ff <0f> 0b 0f 0b 90 0f 1f 44 00
>>>>> 00 55 48 89 e5 41 57 41 89 cf b9 01
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.549534] RIP [<ffffffffc062026b>]
>>>>> _ocfs2_free_suballoc_bits+0x4db/0x4e0 [ocfs2]
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.681076] RSP <ffff88084a5af798>
>>>>> Sep 13 08:10:18 nodeB kernel: [1104436.834529] ---[ end trace 5f4b84ac539ed56c ]---
>>>>>
>>>>> _______________________________________________
>>>>> Ocfs2-users mailing list
>>>>> Ocfs2-users at oss.oracle.com
>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>
>