Hello,
So we have finally created the new cluster, 3 identical KVM VMs:
-8 vCPUs
-10GB RAM per node
-Kernel custom 4.13.2OCFS
-All 3 VMs run on a single Dell host server which has more than enough
resources, so the network connection between the VMs should not be an
issue yet (we will move them to separate physical servers once they
become rock solid)
It ran fine for 9 days, until today one of the webservers crashed on
OCFS2 again.
Here is the picture of the crashed server:
https://ibb.co/kxSqLm
And the log from the other nodes:
Oct 27 13:11:06 webserver2 kernel: [789844.406061] o2net: Connection to
node webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.688
secs.
Oct 27 13:11:36 webserver2 kernel: [789875.125863] o2net: Connection to
node webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.720
secs.
Oct 27 13:11:40 webserver2 kernel: [789878.935510] o2net: No longer
connected to node webserver3 (num 2) at 10.0.0.247:7777
Oct 27 13:11:40 webserver2 kernel: [789878.935924] o2cb: o2dlm has
evicted node 2 from domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:40 webserver2 kernel: [789879.050040] o2cb: o2dlm has
evicted node 2 from domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:41 webserver2 kernel: [789880.245846] o2dlm: Begin recovery
on domain 428503AACBAA492D84DFA48C5CF305B4 for node 2
Oct 27 13:11:41 webserver2 kernel: [789880.246863] o2dlm: Node 1 (me) is
the Recovery Master for the dead node 2 in domain
428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:41 webserver2 kernel: [789880.325817] o2dlm: End recovery
on domain 428503AACBAA492D84DFA48C5CF305B4
Oct 27 13:11:42 webserver2 kernel: [789880.501802] o2dlm: Begin recovery
on domain E6CEF44C077640538468D6FCD1E27C5F for node 2
Oct 27 13:11:42 webserver2 kernel: [789880.502841] o2dlm: Node 1 (me) is
the Recovery Master for the dead node 2 in domain
E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:47 webserver2 kernel: [789885.629843] o2dlm: End recovery
on domain E6CEF44C077640538468D6FCD1E27C5F
Oct 27 13:11:47 webserver2 kernel: [789885.684062] ocfs2: Begin replay
journal (node 2, slot 1) on device (254,64)
Oct 27 13:11:47 webserver2 kernel: [789885.707354] ocfs2: End replay
journal (node 2, slot 1) on device (254,64)
Oct 27 13:11:47 webserver2 kernel: [789885.737907] ocfs2: Beginning
quota recovery on device (254,64) for slot 1
Oct 27 13:11:47 webserver2 kernel: [789885.757285] ocfs2: Finishing
quota recovery on device (254,64) for slot 1
Oct 27 13:19:40 webserver2 kernel: [790358.453142] php-fpm7.0 D
0 8659 8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.453145] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.453153] ?
__schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.453155] ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.453158] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.453160] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.453164] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.453165] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.453167] ?
down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.453170] ?
path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.453227] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.453230] ?
do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.453232] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.453234] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.453236] ?
__check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.453238] ?
__alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.453240] ?
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.453241] ?
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.453243] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.455597] php-fpm7.0 D
0 8662 8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.455624] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.455628] ?
__schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.455630] ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.455632] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.455634] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.455637] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.455639] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.455640] ?
down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.455642] ?
path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.455678] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.455680] ?
do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.455682] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.455696] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.455698] ?
__check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.455700] ?
__alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.455702] ?
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.455704] ?
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.455706] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.458274] php-fpm7.0 D
0 8700 8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.458277] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.458280] ?
__schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.458282] ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.458284] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.458286] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.458289] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.458290] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.458292] ?
down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.458294] ?
path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.458330] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.458332] ?
do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.458334] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.458336] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.458337] ?
__check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.458339] ?
__alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.458341] ?
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.458342] ?
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.458344] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:19:40 webserver2 kernel: [790358.461224] php-fpm7.0 D
0 8703 8654 0x00000000
Oct 27 13:19:40 webserver2 kernel: [790358.461226] Call Trace:
Oct 27 13:19:40 webserver2 kernel: [790358.461230] ?
__schedule+0x3c8/0x860
Oct 27 13:19:40 webserver2 kernel: [790358.461233] ? schedule+0x32/0x80
Oct 27 13:19:40 webserver2 kernel: [790358.461235] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:19:40 webserver2 kernel: [790358.461237] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.461239] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.461241] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:19:40 webserver2 kernel: [790358.461243] ?
down_write+0x29/0x40
Oct 27 13:19:40 webserver2 kernel: [790358.461245] ?
path_openat+0x3dc/0x1440
Oct 27 13:19:40 webserver2 kernel: [790358.461280] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:19:40 webserver2 kernel: [790358.461282] ?
do_filp_open+0x99/0x110
Oct 27 13:19:40 webserver2 kernel: [790358.461284] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:19:40 webserver2 kernel: [790358.461286] ? dput+0x2f/0x1f0
Oct 27 13:19:40 webserver2 kernel: [790358.461287] ?
__check_object_size+0xb3/0x190
Oct 27 13:19:40 webserver2 kernel: [790358.461289] ?
__alloc_fd+0x44/0x170
Oct 27 13:19:40 webserver2 kernel: [790358.461291] ?
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.461292] ?
do_sys_open+0x12e/0x210
Oct 27 13:19:40 webserver2 kernel: [790358.461294] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.282565] php-fpm7.0 D
0 8659 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.282568] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.282580] ?
__schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.282583] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.282587] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.282590] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282594] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282596] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282598] ?
down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.282601] ?
path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.282692] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.282695] ?
do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.282698] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.282700] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282702] ?
__check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.282705] ?
__alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.282707] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282709] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282711] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.282852] php-fpm7.0 D
0 8661 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.282854] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.282857] ?
__schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.282859] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.282861] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.282862] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282865] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282867] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.282869] ?
down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.282871] ?
path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.282895] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.282897] ?
do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.282899] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.282901] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.282903] ?
__check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.282904] ?
__alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.282906] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282907] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.282909] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.283060] php-fpm7.0 D
0 8662 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.283062] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.283065] ?
__schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.283067] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.283069] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.283071] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.283073] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.283077] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.283079] ?
down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.283081] ?
path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.283109] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.283111] ?
do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.283113] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.283114] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.283116] ?
__check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.283118] ?
__alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.283119] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.283121] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.283122] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.284496] php-fpm7.0 D
0 8700 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.284499] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.284503] ?
__schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.284505] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.284507] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.284509] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.284512] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.284514] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.284516] ?
down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.284518] ?
path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.284557] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.284559] ?
do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.284561] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.284563] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.284565] ?
__check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.284566] ?
__alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.284568] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.284569] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.284571] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:21:40 webserver2 kernel: [790479.288370] php-fpm7.0 D
0 8703 8654 0x00000000
Oct 27 13:21:40 webserver2 kernel: [790479.288372] Call Trace:
Oct 27 13:21:40 webserver2 kernel: [790479.288377] ?
__schedule+0x3c8/0x860
Oct 27 13:21:40 webserver2 kernel: [790479.288380] ? schedule+0x32/0x80
Oct 27 13:21:40 webserver2 kernel: [790479.288382] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:21:40 webserver2 kernel: [790479.288384] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.288387] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.288389] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:21:40 webserver2 kernel: [790479.288392] ?
down_write+0x29/0x40
Oct 27 13:21:40 webserver2 kernel: [790479.288394] ?
path_openat+0x3dc/0x1440
Oct 27 13:21:40 webserver2 kernel: [790479.288433] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:21:40 webserver2 kernel: [790479.288436] ?
do_filp_open+0x99/0x110
Oct 27 13:21:40 webserver2 kernel: [790479.288439] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:21:40 webserver2 kernel: [790479.288440] ? dput+0x2f/0x1f0
Oct 27 13:21:40 webserver2 kernel: [790479.288442] ?
__check_object_size+0xb3/0x190
Oct 27 13:21:40 webserver2 kernel: [790479.288445] ?
__alloc_fd+0x44/0x170
Oct 27 13:21:40 webserver2 kernel: [790479.288447] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.288449] ?
do_sys_open+0x12e/0x210
Oct 27 13:21:40 webserver2 kernel: [790479.288450] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
Oct 27 13:23:41 webserver2 kernel: [790600.113898] php-fpm7.0 D
0 8659 8654 0x00000000
Oct 27 13:23:41 webserver2 kernel: [790600.113901] Call Trace:
Oct 27 13:23:41 webserver2 kernel: [790600.113912] ?
__schedule+0x3c8/0x860
Oct 27 13:23:41 webserver2 kernel: [790600.113915] ? schedule+0x32/0x80
Oct 27 13:23:41 webserver2 kernel: [790600.113918] ?
rwsem_down_write_failed+0x232/0x410
Oct 27 13:23:41 webserver2 kernel: [790600.113922] ? dput+0x2f/0x1f0
Oct 27 13:23:41 webserver2 kernel: [790600.113926] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:23:41 webserver2 kernel: [790600.113928] ?
call_rwsem_down_write_failed+0x13/0x20
Oct 27 13:23:41 webserver2 kernel: [790600.113929] ?
down_write+0x29/0x40
Oct 27 13:23:41 webserver2 kernel: [790600.113933] ?
path_openat+0x3dc/0x1440
Oct 27 13:23:41 webserver2 kernel: [790600.114006] ?
ocfs2_mark_lockres_freeing+0x17d/0x240 [ocfs2]
Oct 27 13:23:41 webserver2 kernel: [790600.114008] ?
do_filp_open+0x99/0x110
Oct 27 13:23:41 webserver2 kernel: [790600.114012] ?
kmem_cache_alloc+0x11a/0x5a0
Oct 27 13:23:41 webserver2 kernel: [790600.114013] ? dput+0x2f/0x1f0
Oct 27 13:23:41 webserver2 kernel: [790600.114016] ?
__check_object_size+0xb3/0x190
Oct 27 13:23:41 webserver2 kernel: [790600.114019] ?
__alloc_fd+0x44/0x170
Oct 27 13:23:41 webserver2 kernel: [790600.114021] ?
do_sys_open+0x12e/0x210
Oct 27 13:23:41 webserver2 kernel: [790600.114023] ?
do_sys_open+0x12e/0x210
Oct 27 13:23:41 webserver2 kernel: [790600.114025] ?
entry_SYSCALL_64_fastpath+0x1e/0xa9
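
The hung-task reports above repeat for several of the same php-fpm PIDs at 13:19:40, 13:21:40 and 13:23:41, which is easy to miss in the wrapped output. As a reading aid, here is a minimal Python sketch that groups the "D" state task headers by PID. It assumes the log lines look exactly like the excerpt above (including the mail-client line wrapping); the function names `unwrap` and `blocked_tasks` are made up for this illustration and are not part of any OCFS2 tooling.

```python
import re
from collections import defaultdict

# Syslog prefix as seen in the excerpt, e.g.
# "Oct 27 13:19:40 webserver2 kernel: [790358.453142] <message>"
PREFIX = re.compile(
    r"^[A-Z][a-z]{2}\s+\d+\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:"
    r"\s+\[\s*\d+\.\d+\]\s*(.*)$"
)
# Task header as seen in the excerpt: "<comm> D <stack> <pid> <ppid> <flags>"
TASK = re.compile(r"^(\S+)\s+D\s+\d+\s+(\d+)\s+(\d+)\s+0x[0-9a-f]+$")


def unwrap(lines):
    """Re-join records that were wrapped onto a following line."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if PREFIX.match(line) or not records:
            records.append(line)          # a new timestamped record
        else:
            records[-1] += " " + line     # continuation of the previous one
    return records


def blocked_tasks(lines):
    """Return {pid: (command, [timestamps])} for every D-state task header."""
    seen = defaultdict(lambda: [None, []])
    for record in unwrap(lines):
        m = PREFIX.match(record)
        if not m:
            continue
        stamp, message = m.groups()
        t = TASK.match(message.strip())
        if t:
            comm, pid, _ppid = t.groups()
            seen[int(pid)][0] = comm
            seen[int(pid)][1].append(stamp)
    return {pid: (comm, stamps) for pid, (comm, stamps) in seen.items()}
```

Feeding it the saved log, for example `blocked_tasks(open("kern.log"))` (the file name is only an example), gives one entry per stuck process with the times at which the kernel reported it, so it is easier to see whether the same processes stay blocked across reports.
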
Any clues what is causing this?
Thanks!
On 2017-09-29 08:46, Gang He wrote:
> Hello netbsd,
>
> Could you work out a way to trigger this crash in a normal
> ocfs2 cluster, e.g. steps to reproduce or a shell script?
>
> Thanks
> Gang
>
>> Hello,
>>
>> Find the full log below:
>>
>> https://paste.ubuntu.com/25625787/
>>
>> The VM was restarted at 9:27 and there has been no problem since
>> then. We are rsyncing about 2TB of data (a lot of small files)
>> between 2 OCFS2 shares on the same VM:
>>
>>
>> /dev/vdc 4.8T 2.8T 2.1T 58% /mnt/s1
>> /dev/vdf 4.8T 985G 3.9T 21% /mnt/s2
>>
>> rsync -av --numeric-ids --delete /mnt/s1/ /mnt/s2/
>>
>>
>> On 2017-09-27 10:53, Gang He wrote:
>>> Hello netbsd,
>>>
>>> The ocfs2 project is still being developed by us (from SUSE, Huawei,
>>> Oracle, H3C, etc.).
>>> If you encounter a problem, please send mail to the ocfs2-devel
>>> mailing list; we usually watch that list for ocfs2 kernel related
>>> issues.
>>>
>>>> Hello All,
>>>>
>>>> I wrote earlier about our OCFS2 crash issue in KVM due to a bug in
>>>> the SMP code.
>>>>
>>>> For this we came up with a solution:
>>>>
>>>> Instead of using multiple vcpus
>>>> <vcpu placement='static'>8</vcpu>
>>>>
>>>> we use a single one with multiple cores:
>>>> <topology sockets='8' cores='8' threads='1'/>
>>>>
>>>> And applying key tuning options to sysctl.conf:
>>>>
>>>> vm.min_free_kbytes=131072
>>>> vm.zone_reclaim_mode=1
>>>>
>>>> This seemed to help: the fs did not crash right away when we were
>>>> hammering it with apache benchmarks with 10000 requests. However,
>>>> last night I started a large rsync operation from a 5TB OCFS2 FS
>>>> mounted in the VM to another OCFS2 FS mounted in the same VM and
>>>> ended up with:
>>>>
>>>> https://ibb.co/gFeGg5
>>>>
>>> From the kernel crash backtrace, the problem appears to be that
>>> taking too long to acquire a spin_lock triggers an NMI.
>>> Could you give detailed steps to reproduce it? We want to reproduce
>>> this issue locally and then try to fix it.
>>>
>>>
>>> Thanks
>>> Gang
>>>
>>>>
>>>> After trying a lot of different kernels starting from the 3.x
>>>> series, we are now using the latest 4.13.2 kernel with the default
>>>> configuration, but these issues are still present. Is the OCFS2
>>>> project still being developed?
>>>> With this crashing and unreliability it cannot be used in
>>>> production unless you put in place a bunch of safeguards to reset
>>>> the whole virtual machine when it crashes.
>>>>
>>>> Thanks
>>>>
>>>> _______________________________________________
>>>> Ocfs2-users mailing list
>>>> Ocfs2-users at oss.oracle.com
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users