Hi,

It looks like the following patch addresses this issue:
commit 3d46a44a0c01b15d385ccaae24b56f619613c256
Author: Tariq Saeed <tariq.x.saeed at oracle.com>
Date: Fri Sep 4 15:44:31 2015 -0700
ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed()
PID: 614 TASK: ffff882a739da580 CPU: 3 COMMAND: "ocfs2dc"
#0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
#1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
#2 [ffff882ecc375af0] oops_end at ffffffff815091d8
#3 [ffff882ecc375b20] die at ffffffff8101868b
#4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
#5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
#6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
[exception RIP: ocfs2_ci_checkpointed+208]
RIP: ffffffffa0a7e940 RSP: ffff882ecc375cf0 RFLAGS: 00010002
RAX: 0000000000000001 RBX: 000000000000654b RCX: ffff8812dc83f1f8
RDX: 00000000000017d9 RSI: ffff8812dc83f1f8 RDI: ffffffffa0b2c318
RBP: ffff882ecc375d20 R8: ffff882ef6ecfa60 R9: ffff88301f272200
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffffffffff
R13: ffff8812dc83f4f0 R14: 0000000000000000 R15: ffff8812dc83f1f8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
#8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
#9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]
#10 [ffff882ecc375e18] ocfs2_downconvert_thread_do_work at ffffffffa0a85445 [ocfs2]
#11 [ffff882ecc375e68] ocfs2_downconvert_thread at ffffffffa0a854de [ocfs2]
#12 [ffff882ecc375ee8] kthread at ffffffff81090da7
#13 [ffff882ecc375f48] kernel_thread_helper at ffffffff81511884
The assert is tripped because the transaction is not checkpointed and the
lock level is PR.
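[For context, the assertion that fires is in ocfs2_ci_checkpointed() in
fs/ocfs2/dlmglue.c. A rough sketch of the function as found in kernels of
this era -- paraphrased from memory, so details may differ by version:]

static int ocfs2_ci_checkpointed(struct ocfs2_caching_info *ci,
                                 struct ocfs2_lock_res *lockres,
                                 int new_level)
{
        int checkpointed = ocfs2_ci_fully_checkpointed(ci);

        /* The meta lock is only ever downconverted to NL or PR here. */
        BUG_ON(new_level != DLM_LOCK_NL && new_level != DLM_LOCK_PR);

        /* The assert that trips: an unflushed (not yet checkpointed)
         * transaction is only legal while we still hold the lock in EX.
         * In this crash the lock is already at PR, so this fires. */
        BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed);

        if (checkpointed)
                return 1;

        /* Not checkpointed yet: kick the journal and try again later. */
        ocfs2_start_checkpoint(OCFS2_SB(ocfs2_metadata_cache_get_super(ci)));
        return 0;
}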
Some time ago, a chmod command had been executed. As a result, the following
call chain left the inode cluster lock in PR state, later on causing the
assert.
system_call_fastpath
  -> my_chmod
     -> sys_chmod
        -> sys_fchmodat
           -> notify_change
              -> ocfs2_setattr
                 -> posix_acl_chmod
                    -> ocfs2_iop_set_acl
                       -> ocfs2_set_acl
                          -> ocfs2_acl_set_mode
Here is how.
1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
1120 {
1247 ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
..
1258 if (!status && attr->ia_valid & ATTR_MODE) {
1259 status = posix_acl_chmod(inode, inode->i_mode);
519 posix_acl_chmod(struct inode *inode, umode_t mode)
520 {
..
539 ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);
287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
288 {
289 return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);
224 int ocfs2_set_acl(handle_t *handle,
225 struct inode *inode, ...
231 {
..
252 ret = ocfs2_acl_set_mode(inode, di_bh,
253 handle, mode);
168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
170 {
183     if (handle == NULL) {
            >>> BUG: inode lock not held in ex at this point <<<
184         handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
185                                    OCFS2_INODE_UPDATE_CREDITS);
At ocfs2_setattr.#1247 we unlock, and at #1259 we call posix_acl_chmod. When
we reach ocfs2_acl_set_mode.#183 and start the transaction, the inode cluster
lock is not held in EX mode (it should be). How could this have happened?
We are the lock master and were holding the lock in EX, but released it at
ocfs2_setattr.#1247. Note that there are no holders of this lock at that
point. Another node needs the lock in PR, so we downconvert from EX to PR.
The inode lock is therefore at PR when we start the transaction at
ocfs2_acl_set_mode.#184, and the transaction stays in core (not flushed to
disk). Now another node wants the lock in EX, so the downconvert thread gets
kicked (the one that tripped the assert above); it finds an unflushed
transaction, but the lock is not at EX (it is at PR). Had the lock been at
EX, it would have flushed the transaction (ocfs2_ci_checkpointed ->
ocfs2_start_checkpoint) before downconverting to NULL for the request.
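[The downconvert thread reaches that assertion through
ocfs2_check_meta_downconvert(), roughly this thin wrapper in
fs/ocfs2/dlmglue.c -- again sketched from memory, details may differ:]

/* Called by the downconvert thread to decide whether the meta lock may
 * be downconverted: only once the journal is checkpointed for this
 * inode's metadata. */
static int ocfs2_check_meta_downconvert(struct ocfs2_lock_res *lockres,
                                        int new_level)
{
        struct inode *inode = ocfs2_lock_res_inode(lockres);

        return ocfs2_ci_checkpointed(INODE_CACHE(inode), lockres, new_level);
}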
ocfs2_setattr must not drop the inode lock (EX) in this code path. If it
drops it and takes it again before the transaction, say in ocfs2_set_acl,
another cluster node can get in between, execute another setattr, and
overwrite the one in progress on this node, resulting in a mode/ACL/size
combination that is a mix of the two.
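[Accordingly, the fix keeps the EX inode cluster lock held across the ACL
update instead of dropping and re-taking it. A sketch of the approach in
ocfs2_setattr() as I recall the patch -- paraphrased, not the literal diff;
the inode_locked flag and the ocfs2_acl_chmod() helper reflect my memory of
the commit and may differ in detail:]

        status = ocfs2_inode_lock(inode, &bh, 1);
        if (status < 0)
                goto bail_unlock_rw;
        inode_locked = 1;        /* remember that we hold the lock in EX */
        ...
bail_unlock:
        if (status) {            /* on error, unlock early as before */
                ocfs2_inode_unlock(inode, 1);
                inode_locked = 0;
        }
        ...
        if (!status && attr->ia_valid & ATTR_MODE) {
                /* do the ACL update while still holding the EX lock,
                 * reusing the buffer_head we already hold a ref on */
                status = ocfs2_acl_chmod(inode, bh);
                if (status < 0)
                        mlog_errno(status);
        }
        if (inode_locked)        /* drop EX only after the transaction */
                ocfs2_inode_unlock(inode, 1);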
Orabug: 20189959
Signed-off-by: Tariq Saeed <tariq.x.saeed at oracle.com>
Reviewed-by: Mark Fasheh <mfasheh at suse.de>
Cc: Joel Becker <jlbec at evilplan.org>
Cc: Joseph Qi <joseph.qi at huawei.com>
Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
Thanks,
Eric
On 12/19/2016 07:55 PM, Joseph Cluney wrote:
> Hi,
>
> I have a 3 node cluster running Oracle VM server release 3.4.1.
> We are having occasional instances of a node disappearing; the only
> indications in the logs are the eviction and recovery messages.
>
> Typical dmesg output:
>
> [3312552.140992] o2net: Connection to node cs-apps-node3 (num 2) at 192.168.200.3:7777 has been idle for 60.114 secs.
> [3312612.171263] o2net: Connection to node cs-apps-node3 (num 2) at 192.168.200.3:7777 has been idle for 60.32 secs.
> [3312613.015723] o2net: No longer connected to node cs-apps-node3 (num 2) at 192.168.200.3:7777
> [3312613.015783] (dlm_thread,5404,8):dlm_send_proxy_ast_msg:486 ERROR: 0004FB0000050000BC74A5DD09C33779: res M0000000000000000000207c84b1f72, error -112 send AST to node 2
> [3312613.015877] o2cb: o2dlm has evicted node 2 from domain 0004FB00000500007A4523B76F1051D5
> [3312613.016354] o2cb: o2dlm has evicted node 2 from domain 0004FB0000050000BC74A5DD09C33779
> [3312613.016652] (dlm_thread,5404,8):dlm_flush_asts:609 ERROR: status = -112
> [3312613.018071] o2cb: o2dlm has evicted node 2 from domain ovm
> [3312613.106251] o2dlm: Begin recovery on domain ovm for node 2
> [3312613.107043] o2dlm: Node 0 (me) is the Recovery Master for the dead node 2 in domain ovm
> [3312614.039237] o2dlm: Waiting on the recovery of node 2 in domain 0004FB0000050000BC74A5DD09C33779
> [3312614.050222] o2dlm: Waiting on the recovery of node 2 in domain 0004FB00000500007A4523B76F1051D5
> [3312614.982240] o2cb: o2dlm has evicted node 2 from domain 0004FB00000500007A4523B76F1051D5
> [3312615.913169] o2dlm: Begin recovery on domain 0004FB0000050000BC74A5DD09C33779 for node 2
> [3312615.913901] o2dlm: Node 0 (me) is the Recovery Master for the dead node 2 in domain 0004FB0000050000BC74A5DD09C33779
> [3312615.976172] o2dlm: Begin recovery on domain 0004FB00000500007A4523B76F1051D5 for node 2
> [3312615.976179] o2dlm: Node 1 (he) is the Recovery Master for the dead node 2 in domain 0004FB00000500007A4523B76F1051D5
> [3312615.976182] o2dlm: End recovery on domain 0004FB00000500007A4523B76F1051D5
> [3312618.106659] o2dlm: End recovery on domain ovm
> [3312620.914435] o2dlm: End recovery on domain 0004FB0000050000BC74A5DD09C33779
> [3312620.983993] ocfs2: Begin replay journal (node 2, slot 1) on device (7,0)
> [3312621.077920] ocfs2: End replay journal (node 2, slot 1) on device (7,0)
> [3312621.113673] ocfs2: Beginning quota recovery on device (7,0) for slot 1
> [3312625.040978] ocfs2: Begin replay journal (node 2, slot 0) on device (251,18)
> [3312627.147646] ocfs2: End replay journal (node 2, slot 0) on device (251,18)
> [3312627.187473] ocfs2: Beginning quota recovery on device (251,18) for slot 0
> [3312643.137839] ocfs2: Finishing quota recovery on device (251,18) for slot 0
> [3312643.139285] ocfs2: Finishing quota recovery on device (7,0) for slot 1
> [3312891.479712] o2net: Accepted connection from node cs-apps-node3 (num 2) at 192.168.200.3:7777
> [3312899.822790] o2dlm: Node 2 joins domain 0004FB00000500007A4523B76F1051D5 ( 0 1 2 ) 3 nodes
> [3312900.046000] o2dlm: Node 2 joins domain 0004FB0000050000BC74A5DD09C33779 ( 0 1 2 ) 3 nodes
> [3312901.134648] o2dlm: Node 2 joins domain ovm ( 0 1 2 ) 3 nodes
>
> I have set up netconsole to capture the output from the failing node.
>
> The node that fails in the cluster is performing a reflink operation on VM
> files for snapshots, and the crash happens during this snapshot window.
> Note that this does not happen every time the snapshots are taken.
>
> Two nodes are 10Gb-capable and have multipath links to the storage device;
> when these links are activated the crashes are more frequent.
>
> Only the node that is doing the snapshot (reflink) is affected.
>
> Netconsole output:
> [1382139.668117] ------------[ cut here ]------------
> [1382139.668172] kernel BUG at fs/ocfs2/dlmglue.c:3647!
> [1382139.668687] invalid opcode: 0000 [#1] SMP
> [1382139.669185] Modules linked in: netconsole tun ocfs2 jbd2 nfsv3 nfs_acl
> rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd sunrpc grace
> xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn
> xenfs xen_privcmd ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
> ocfs2_stackglue configfs bnx2fc fcoe libfcoe libfc scsi_transport_fc 8021q
> mrp garp dm_round_robin bridge stp llc bonding ib_iser rdma_cm ib_cm iw_cm
> ib_sa ib_mad ib_core ib_addr iscsi_tcp dm_multipath ipmi_devintf ipmi_si
> ipmi_msghandler iTCO_wdt iTCO_vendor_support dcdbas pcspkr serio_raw
> lpc_ich mfd_core ioatdma dca i7core_edac edac_core sg ext3 jbd mbcache
> sr_mod cdrom sd_mod wmi pata_acpi ata_generic ata_piix megaraid_sas
> crc32c_intel bnx2 be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi
> ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi
> scsi_transport_iscsi usb_storage mgag200 ttm drm_kms_helper drm
> i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea dm_mirror
> dm_region_hash dm_log dm_mod [last unloaded: netconsole]
> [1382139.675600] CPU: 15 PID: 4240 Comm: ocfs2dc Tainted: G W 4.1.12-32.1.3.el6uek.x86_64 #2
> [1382139.676458] Hardware name: Dell Inc. PowerEdge R710/00W9X3, BIOS 6.4.0 07/23/2013
> [1382139.677348] task: ffff8800b6648e00 ti: ffff8800b6640000 task.ti: ffff8800b6640000
> [1382139.678224] RIP: e030:[<ffffffffa0abf42f>] [<ffffffffa0abf42f>] ocfs2_ci_checkpointed+0xbf/0xd0 [ocfs2]
> [1382139.679202] RSP: e02b:ffff8800b6643ca8 EFLAGS: 00010093
> [1382139.680185] RAX: 0000000000008652 RBX: ffff8800023156c8 RCX: 0000000000000005
> [1382139.681180] RDX: 0000000000008652 RSI: ffff8800023153b8 RDI: ffffffffa0b659d8
> [1382139.682203] RBP: ffff8800b6643cd8 R08: ffff8800023153c8 R09: 0000000000000000
> [1382139.683324] R10: 0000000000007ff0 R11: 0000000000000002 R12: ffff8800023153b8
> [1382139.684381] R13: 0000000000000000 R14: 0000000000001f2a R15: 0000000000001f2a
> [1382139.685455] FS: 00007fe6eb8667c0(0000) GS:ffff8801307c0000(0000) knlGS:0000000000000000
> [1382139.686585] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [1382139.687701] CR2: 00007f510cad0000 CR3: 00000000bdaf0000 CR4: 0000000000002660
> [1382139.688977] Stack:
> [1382139.690147] ffff8800b6643cd8 ffff8800023153b8 00000000000002d0 ffff880002315424
> [1382139.691562] 0000000000000000 ffff8800b6643da8 ffff8800b6643ce8 ffffffffa0abf489
> [1382139.692889] ffff8800b6643d88 ffffffffa0ac2dca ffff8801307ca210 ffff8801307caa10
> [1382139.694167] Call Trace:
> [1382139.695462] [<ffffffffa0abf489>] ocfs2_check_meta_downconvert+0x29/0x40 [ocfs2]
> [1382139.696808] [<ffffffffa0ac2dca>] ocfs2_unblock_lock+0xca/0x750 [ocfs2]
> [1382139.698166] [<ffffffff81012982>] ? __switch_to+0x212/0x5b0
> [1382139.699548] [<ffffffffa0ac35fa>] ocfs2_process_blocked_lock+0x1aa/0x270 [ocfs2]
> [1382139.701015] [<ffffffffa0ac3b92>] ocfs2_downconvert_thread_do_work+0xb2/0xe0 [ocfs2]
> [1382139.702491] [<ffffffffa0ac3c36>] ocfs2_downconvert_thread+0x76/0x180 [ocfs2]
> [1382139.704059] [<ffffffff810c3b10>] ? wait_woken+0x90/0x90
> [1382139.705537] [<ffffffffa0ac3bc0>] ? ocfs2_downconvert_thread_do_work+0xe0/0xe0 [ocfs2]
> [1382139.707051] [<ffffffffa0ac3bc0>] ? ocfs2_downconvert_thread_do_work+0xe0/0xe0 [ocfs2]
> [1382139.708558] [<ffffffff810a012e>] kthread+0xce/0xf0
> [1382139.710075] [<ffffffff810a0060>] ? kthread_freezable_should_stop+0x70/0x70
> [1382139.711622] [<ffffffff816b7e22>] ret_from_fork+0x42/0x70
> [1382139.713182] [<ffffffff810a0060>] ? kthread_freezable_should_stop+0x70/0x70
> [1382139.714769] Code: 48 89 df e8 34 a6 05 00 48 8b b8 40 03 00 00 31 c9 ba 01 00 00 00 be 03 00 00 00 48 81 c7 68 01 00 00 e8 a5 49 60 e0 31 c0 eb ae <0f> 0b eb fe 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5
> [1382139.718379] RIP [<ffffffffa0abf42f>] ocfs2_ci_checkpointed+0xbf/0xd0 [ocfs2]
> [1382139.720221] RSP <ffff8800b6643ca8>
> [1382139.728533] ------------[ cut here ]------------
> [1382139.730240] kernel BUG at arch/x86/mm/pageattr.c:214!
> [1382139.731902] invalid opcode: 0000 [#2] SMP
> [1382139.733542] Modules linked in: netconsole tun ocfs2 jbd2 nfsv3 nfs_acl
> rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd sunrpc grace
> xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn
> xenfs xen_privcmd ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager
> ocfs2_stackglue configfs bnx2fc fcoe libfcoe libfc scsi_transport_fc 8021q
> mrp garp dm_round_robin bridge stp llc bonding ib_iser rdma_cm ib_cm iw_cm
> ib_sa ib_mad ib_core ib_addr iscsi_tcp dm_multipath ipmi_devintf ipmi_si
> ipmi_msghandler iTCO_wdt iTCO_vendor_support dcdbas pcspkr serio_raw
> lpc_ich mfd_core ioatdma dca i7core_edac edac_core sg ext3 jbd mbcache
> sr_mod cdrom sd_mod wmi pata_acpi ata_generic ata_piix megaraid_sas
> crc32c_intel bnx2 be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi
> ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi
> scsi_transport_iscsi usb_storage mgag200 ttm drm_kms_helper drm
> i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea dm_mirror
> dm_region_hash dm_log dm_mod [last unloaded: netconsole]
> [1382139.749332] CPU: 15 PID: 4240 Comm: ocfs2dc Tainted: G W 4.1.12-32.1.3.el6uek.x86_64 #2
> [1382139.751099] Hardware name: Dell Inc. PowerEdge R710/00W9X3, BIOS 6.4.0 07/23/2013
> [1382139.752908] task: ffff8800b6648e00 ti: ffff8800b6640000 task.ti: ffff8800b6640000
> [1382139.754653] RIP: e030:[<ffffffff8106cf02>] [<ffffffff8106cf02>] change_page_attr_set_clr+0x4e2/0x520
> [1382139.756420] RSP: e02b:ffff8800b6642da8 EFLAGS: 00010046
> [1382139.758187] RAX: 201008001fc900f5 RBX: 0000000000000200 RCX: 0000000000000000
> [1382139.759913] RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000080000000
> [1382139.761592] RBP: ffff8800b6642e58 R08: 801000006ff01067 R09: 000000000006ff01
> [1382139.763221] R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000000005
> [1382139.764843] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
> [1382139.766467] FS: 00007fe6eb8667c0(0000) GS:ffff8801307c0000(0000) knlGS:0000000000000000
> [1382139.768089] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [1382139.769697] CR2: 00007f510cad0000 CR3: 0000000001a62000 CR4: 0000000000002660
> [1382139.771320] Stack:
> [1382139.772907] ffff880000000000 0000000000000000 0000000000000010 0000000000000001
> [1382139.774646] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [1382139.776372] 0000000000000010 0000000000000000 0000000500000001 000000000006ff01
> [1382139.777968] Call Trace:
> [1382139.779530] [<ffffffff8106d21c>] _set_pages_array+0xec/0x140
> [1382139.781092] [<ffffffff8106d283>] set_pages_array_wc+0x13/0x20
> [1382139.782669] [<ffffffffa010f54e>] ttm_set_pages_caching+0x4e/0x80 [ttm]
> [1382139.784258] [<ffffffffa010fb76>] ttm_alloc_new_pages+0xb6/0x180 [ttm]
> [1382139.785824] [<ffffffffa010fcdc>] ttm_page_pool_fill_locked.clone.1+0x9c/0x140 [ttm]
> [1382139.787394] [<ffffffffa010fdc4>] ttm_page_pool_get_pages.clone.2+0x44/0xf0 [ttm]
> [1382139.788970] [<ffffffffa010ff13>] ttm_get_pages.clone.0+0xa3/0x200 [ttm]
> [1382139.790551] [<ffffffffa01100fc>] ttm_pool_populate+0x8c/0xf0 [ttm]
> [1382139.792114] [<ffffffffa010c2b4>] ? ttm_mem_reg_ioremap+0x64/0x100 [ttm]
> [1382139.793795] [<ffffffffa012deee>] mgag200_ttm_tt_populate+0xe/0x10 [mgag200]
> [1382139.795373] [<ffffffffa010c8c2>] ttm_bo_move_memcpy+0x4f2/0x540 [ttm]
> [1382139.796935] [<ffffffffa01081ac>] ? ttm_tt_init+0x8c/0xb0 [ttm]
> [1382139.798498] [<ffffffff811c321e>] ? __vmalloc_node+0x3e/0x40
> [1382139.800052] [<ffffffffa012de18>] mgag200_bo_move+0x18/0x20 [mgag200]
> [1382139.801607] [<ffffffffa0108ce9>] ttm_bo_handle_move_mem+0x299/0x650 [ttm]
> [1382139.803099] [<ffffffff811c2600>] ? vunmap_page_range+0x100/0x180
> [1382139.804590] [<ffffffffa010af6e>] ttm_bo_validate+0x1de/0x1f0 [ttm]
> [1382139.806058] [<ffffffff8106a3aa>] ? iounmap+0x8a/0xd0
> [1382139.807541] [<ffffffffa012dfe3>] mgag200_bo_push_sysram+0x83/0xd0 [mgag200]
> [1382139.809092] [<ffffffffa012af3f>] mga_crtc_do_set_base.clone.0+0x7f/0x1e0 [mgag200]
> [1382139.810654] [<ffffffff816b72e0>] ? _raw_spin_unlock_irqrestore+0x20/0x50
> [1382139.812146] [<ffffffffa012bbe1>] mga_crtc_mode_set+0xb41/0xe60 [mgag200]
> [1382139.813585] [<ffffffffa00da9a9>] drm_crtc_helper_set_mode+0x389/0x5d0 [drm_kms_helper]
> [1382139.815096] [<ffffffffa00dba59>] drm_crtc_helper_set_config+0x879/0xae0 [drm_kms_helper]
> [1382139.816468] [<ffffffff816b72e0>] ? _raw_spin_unlock_irqrestore+0x20/0x50
> [1382139.817845] [<ffffffffa0078576>] drm_mode_set_config_internal+0x66/0x110 [drm]
> [1382139.819177] [<ffffffffa00e6a21>] drm_fb_helper_pan_display+0x111/0x160 [drm_kms_helper]
> [1382139.820487] [<ffffffff8138abd6>] fb_pan_display+0xe6/0x140
> [1382139.821761] [<ffffffff81384d4a>] bit_update_start+0x2a/0x60
> [1382139.822981] [<ffffffff813810d6>] fbcon_switch+0x386/0x530
> [1382139.824198] [<ffffffff8140f5a9>] redraw_screen+0x189/0x230
> [1382139.825370] [<ffffffff81380c9a>] fbcon_blank+0x21a/0x2d0
> [1382139.826642] [<ffffffff816b72e0>] ? _raw_spin_unlock_irqrestore+0x20/0x50
> [1382139.827813] [<ffffffff810d736c>] ? vprintk_emit+0x2ac/0x520
> [1382139.828933] [<ffffffff810e8091>] ? internal_add_timer+0x91/0xc0
> [1382139.830070] [<ffffffff816b72e0>] ? _raw_spin_unlock_irqrestore+0x20/0x50
> [1382139.831192] [<ffffffff810ea06e>] ? mod_timer+0xfe/0x1d0
> [1382139.832306] [<ffffffff81410676>] do_unblank_screen+0xb6/0x1f0
> [1382139.833438] [<ffffffff814107c0>] unblank_screen+0x10/0x20
> [1382139.834562] [<ffffffff8131f7bd>] bust_spinlocks+0x1d/0x40
> [1382139.835731] [<ffffffff81017643>] oops_end+0x43/0x120
> [1382139.836953] [<ffffffff81017c4b>] die+0x5b/0x90
> [1382139.838020] [<ffffffff810140f9>] do_trap+0x169/0x170
> [1382139.839090] [<ffffffff810a0ef2>] ? __atomic_notifier_call_chain+0x12/0x20
> [1382139.840144] [<ffffffff81014af9>] do_error_trap+0xd9/0x180
> [1382139.841221] [<ffffffffa0abf42f>] ? ocfs2_ci_checkpointed+0xbf/0xd0 [ocfs2]
> [1382139.842344] [<ffffffff816b72e0>] ? _raw_spin_unlock_irqrestore+0x20/0x50
> [1382139.843520] [<ffffffff810c3e23>] ? __wake_up+0x53/0x70
> [1382139.844581] [<ffffffff81014cb0>] do_invalid_op+0x20/0x30
> [1382139.845662] [<ffffffff816b93be>] invalid_op+0x1e/0x30
> [1382139.846734] [<ffffffffa0abf42f>] ? ocfs2_ci_checkpointed+0xbf/0xd0 [ocfs2]
> [1382139.847826] [<ffffffffa0abf3cd>] ? ocfs2_ci_checkpointed+0x5d/0xd0 [ocfs2]
> [1382139.848992] [<ffffffffa0abf489>] ocfs2_check_meta_downconvert+0x29/0x40 [ocfs2]
> [1382139.850223] [<ffffffffa0ac2dca>] ocfs2_unblock_lock+0xca/0x750 [ocfs2]
> [1382139.851302] [<ffffffff81012982>] ? __switch_to+0x212/0x5b0
> [1382139.852384] [<ffffffffa0ac35fa>] ocfs2_process_blocked_lock+0x1aa/0x270 [ocfs2]
> [1382139.853480] [<ffffffffa0ac3b92>] ocfs2_downconvert_thread_do_work+0xb2/0xe0 [ocfs2]
> [1382139.854580] [<ffffffffa0ac3c36>] ocfs2_downconvert_thread+0x76/0x180 [ocfs2]
> [1382139.855649] [<ffffffff810c3b10>] ? wait_woken+0x90/0x90
> [1382139.856730] [<ffffffffa0ac3bc0>] ? ocfs2_downconvert_thread_do_work+0xe0/0xe0 [ocfs2]
> [1382139.857835] [<ffffffffa0ac3bc0>] ? ocfs2_downconvert_thread_do_work+0xe0/0xe0 [ocfs2]
> [1382139.858899] [<ffffffff810a012e>] kthread+0xce/0xf0
> [1382139.860106] [<ffffffff810a0060>] ? kthread_freezable_should_stop+0x70/0x70
> [1382139.861175] [<ffffffff816b7e22>] ret_from_fork+0x42/0x70
> [1382139.862236] [<ffffffff810a0060>] ? kthread_freezable_should_stop+0x70/0x70
> [1382139.863303] Code: 58 ff ff ff e9 e2 fb ff ff be b8 00 00 00 48 c7 c7 f8 56 95 81 e8 bf 35 01 00 e9 03 ff ff ff 0f 0b eb fe 0f 0b 0f 1f 40 00 eb fa <0f> 0b eb fe 48 89 ca 48 c1 e9 05 45 31 ed 48 c1 ea 03 48 89 d0
> [1382139.865840] RIP [<ffffffff8106cf02>] change_page_attr_set_clr+0x4e2/0x520
> [1382139.866978] RSP <ffff8800b6642da8>
> [1382139.868102] ---[ end trace c97984eb9f0c5481 ]---
> [1382139.877139] Kernel panic - not syncing: Fatal exception
> [1382139.878299] Kernel Offset: disabled
> [1382139.879386] drm_kms_helper: panic occurred, switching back to text console
>
>
> If anyone has experienced this issue and could give any pointers on what
> to try next, it would be much appreciated.
>
> Kind Regards
>
> Joe Cluney
>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-users