Joseph,
Do you see anything in the logs below that looks like a kernel issue? After a
certain point in time, no more DLM logs appear.
Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmconvert_remote:270
type=0, convert_type=-1, busy=0
Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmconvert_remote:275
bailing out early since res is RECOVERING on secondary queue
Dec 27 22:21:22 integ-hm8 kernel: (ocfs2rec,68213,10):dlmlock:652 retrying
convert with migration/recovery/in-progress
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,17):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,17):dlm_send_mig_lockres_msg:1138
A895BC216BE641A8A7E20AA89D57E051:M000000000000008ec64cbc00000000: sending mig
lockres (recovery) to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138
A895BC216BE641A8A7E20AA89D57E051:O000000000000006e7df72000000000: sending mig
lockres (recovery) to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138
A895BC216BE641A8A7E20AA89D57E051:M000000000000006e7df72600000000: sending mig
lockres (recovery) to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138
A895BC216BE641A8A7E20AA89D57E051:O000000000000006e7df72600000000: sending mig
lockres (recovery) to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,1):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,1):dlm_send_mig_lockres_msg:1138
A895BC216BE641A8A7E20AA89D57E051:M000000000000006e7df72900000000: sending mig
lockres (recovery) to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138
A895BC216BE641A8A7E20AA89D57E051:O000000000000006e7df72900000000: sending mig
lockres (recovery) to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138
A895BC216BE641A8A7E20AA89D57E051:M000000000000006e7df72c00000000: sending mig
lockres (recovery) to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,9):dlm_send_mig_lockres_msg:1138
A895BC216BE641A8A7E20AA89D57E051:O000000000000006e7df72c00000000: sending mig
lockres (recovery) to 2
Dec 27 22:21:22 integ-hm8 kernel:
(kworker/u192:1,56020,9):dlm_send_one_lockres:1289 sending to 2
Dec 27 22:55:52 integ-hm8 rsyslogd: [origin software="rsyslogd"
swVersion="7.4.7" x-pid="1331"
x-info="http://www.rsyslog.com"] exiting on signal 15.
Dec 27 23:05:25 integ-hm8 rsyslogd: [origin software="rsyslogd"
swVersion="7.4.7" x-pid="1346"
x-info="http://www.rsyslog.com"] start
Dec 27 23:05:08 integ-hm8 journal: Runtime journal is using 8.0M (max 4.0G,
leaving 4.0G of free 251.9G, current limit 4.0G).
Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpuset
Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpu
Dec 27 23:05:08 integ-hm8 kernel: Initializing cgroup subsys cpuacct
Dec 27 23:05:08 integ-hm8 kernel: Linux version 3.10.91 (root at integ-hm8) (gcc
version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Oct 29 11:52:34 IST
2015
Dec 27 23:05:08 integ-hm8 kernel: Command line: BOOT_IMAGE=/vmlinuz-3.10.91
root=UUID=20d0873d-8f3e-455a-ba67-f8f336b5e9a7 ro crashkernel=auto rhgb quiet
LANG=en_US.UTF-8 systemd.debug
Dec 27 23:05:08 integ-hm8 kernel: e820: BIOS-provided physical RAM map:
Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem
0x0000000000000000-0x000000000009bfff] usable
Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem
0x0000000000100000-0x00000000bd2cffff] usable
Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem
0x00000000bd2d0000-0x00000000bd2fbfff] reserved
Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem
0x00000000bd2fc000-0x00000000bd35afff] ACPI data
Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem
0x00000000bd35b000-0x00000000bfffffff] reserved
Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem
0x00000000e0000000-0x00000000efffffff] reserved
Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem
0x00000000fe000000-0x00000000ffffffff] reserved
Dec 27 23:05:08 integ-hm8 kernel: BIOS-e820: [mem
0x0000000100000000-0x000000803fffffff] usable
Dec 27 23:05:08 integ-hm8 kernel: NX (Execute Disable) protection: active
Dec 27 23:05:08 integ-hm8 kernel: SMBIOS 2.7 present.
Dec 27 23:05:08 integ-hm8 kernel: No AGP bridge found
Dec 27 23:05:08 integ-hm8 kernel: e820: last_pfn = 0x8040000 max_arch_pfn =
0x400000000
Dec 27 23:05:08 integ-hm8 kernel: x86 PAT enabled: cpu 0, old 0x7040600070406,
new 0x7010600070106
Dec 27 23:05:08 integ-hm8 kernel: total RAM covered: 524288M
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 64K
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 128K
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 256K
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 512K
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 1M num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 2M num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 4M num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 8M num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 16M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 32M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 64M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 128M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 256M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 512M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 64K chunk_size: 1G num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: *BAD*gran_size: 64K chunk_size: 2G num_reg:
10 lose cover RAM: -1G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 128K
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 256K
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 512K
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 1M num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 2M num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 4M num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 8M num_reg:
10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 16M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 32M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 64M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 128M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 256M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 512M
num_reg: 10 lose cover RAM: 0G
Dec 27 23:05:08 integ-hm8 kernel: gran_size: 128K chunk_size: 1G num_reg:
10 lose cover RAM: 0G
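The lockres names in the recovery messages above follow a fixed layout, so they can be decoded by hand. This is a minimal sketch. It assumes the usual OCFS2 layout for inode-style lock names: a type character, a 6-character pad, a 16-hex-digit block number and an 8-hex-digit generation, 31 characters in total, which matches the "len 31" in the trace further down. `decode_lockres` is a hypothetical helper. Its type-letter table covers only a few letters and is an assumption, so check it against fs/ocfs2/ocfs2_lockid.h for your kernel. Dentry ('N') names use a partly binary layout and are rejected.

```shell
# decode_lockres NAME
# Split an inode-style OCFS2 lock name into type letter, lock kind, block
# number (decimal) and generation. Assumed layout (31 characters):
#   <type char><6-char pad><16 hex digits blkno><8 hex digits generation>
# Variables are global because POSIX sh has no "local".
decode_lockres() {
    lr_name=$1
    if [ "${#lr_name}" -ne 31 ]; then
        echo "unsupported lock name: $lr_name" >&2
        return 1
    fi
    lr_type=$(printf '%s' "$lr_name" | cut -c1)
    lr_blkno=$(printf '%s' "$lr_name" | cut -c8-23)
    lr_gen=$(printf '%s' "$lr_name" | cut -c24-31)
    # Partial, assumed mapping of type letters to lock kinds.
    case $lr_type in
        M) lr_kind=meta ;;
        O) lr_kind=open ;;
        W) lr_kind=rw ;;
        S) lr_kind=super ;;
        *) lr_kind=unknown ;;
    esac
    # printf %d accepts a 0x-prefixed argument and prints it in decimal.
    printf '%s %s blkno=%d gen=0x%s\n' "$lr_type" "$lr_kind" "0x$lr_blkno" "$lr_gen"
}
```

For example, `decode_lockres M000000000000008ec64cbc00000000` prints `M meta blkno=2395360444 gen=0x00000000`. The M and O names in this log come in pairs with the same block number (for example 0x6e7df720). These are the meta and open locks of the same inode being sent to node 2 during recovery.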
Regards
Prabu
---- On Mon, 28 Dec 2015 15:18:02 +0530 gjprabu <gjprabu at zohocorp.com> wrote ----
Yes, all 5 nodes hung. After a restart, everything was fine.
Regards
Prabu
---- On Mon, 28 Dec 2015 15:00:57 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
So which process hangs, and which lockres is it waiting for?
From the log I cannot get that information.
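One way to answer both questions on a node that is still hung, before rebooting it, is to list the tasks stuck in uninterruptible sleep (state D) and read their kernel stacks. This is a minimal sketch. It assumes procps `ps`, root access, and a kernel that exposes /proc/PID/stack. `d_state_pids` and `dump_hung_tasks` are hypothetical helper names.

```shell
# d_state_pids: read "PID STAT COMMAND" lines on stdin and print the PID of
# every task whose state starts with D (uninterruptible sleep).
d_state_pids() {
    awk '$2 ~ /^D/ { print $1 }'
}

# dump_hung_tasks: print each D-state task with its kernel stack.
# Needs root; /proc/PID/stack may be missing on some kernel builds.
dump_hung_tasks() {
    ps -eo pid=,stat=,comm= | d_state_pids | while read -r pid; do
        echo "== PID $pid ($(cat /proc/"$pid"/comm 2>/dev/null))"
        cat /proc/"$pid"/stack 2>/dev/null
    done
}
```

Run `dump_hung_tasks` on each hung node. If /proc/PID/stack is not available, `echo w > /proc/sysrq-trigger` (with sysrq enabled) writes the stacks of all blocked tasks to the kernel log instead. The stacks show where each task is sleeping. The lock it waits on can then be looked for with debugfs.ocfs2's lock-listing commands, such as fs_locks.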
On 2015/12/28 16:46, gjprabu wrote:
> Hi Joseph,
>
> We are facing the same issue again. Please find below the logs from when the
> problem occurred.
>
> Dec 27 21:45:44 integ-hm5 kernel:
(dlm_thread,46268,24):dlm_update_lvb:206 getting lvb from lockres for master
node
> Dec 27 21:45:44 integ-hm5 kernel:
(dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres
M000000000000008478220200000000, action 1, unlock 0, level -1 => 3
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__ocfs2_cluster_lock:1465 lockres N000000008340963d, convert from
-1 to 3
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_get_lock_resource:724 get lockres N000000008340963d (len 31)
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__dlm_lookup_lockres_full:198 N000000008340963d
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663
A895BC216BE641A8A7E20AA89D57E051: res N000000008340963d, inflight++: now 1,
dlm_lockres_grab_inflight_ref()
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:690 type=3,
flags = 0x0
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock:691 creating
lock: lock=ffff8801824b4500 res=ffff88265dbf2bc0
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:131
type=3
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,0):dlmlock_master:148 I
can grant this lock right away
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__dlm_dirty_lockres:483 A895BC216BE641A8A7E20AA89D57E051: res
N000000008340963d
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_lockres_drop_inflight_ref:684
A895BC216BE641A8A7E20AA89D57E051: res N000000008340963d, inflight--: now 0,
dlmlock()
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__dlm_dirty_lockres:483 A895BC216BE641A8A7E20AA89D57E051: res
N000000008340963d
> Dec 27 21:45:44 integ-hm5 kernel:
(dlm_thread,46268,24):dlm_flush_asts:541 A895BC216BE641A8A7E20AA89D57E051: res
N000000008340963d, Flush AST for lock 5:441609912, type 3, node 5
> Dec 27 21:45:44 integ-hm5 kernel:
(dlm_thread,46268,24):dlm_do_local_ast:232 A895BC216BE641A8A7E20AA89D57E051: res
N000000008340963d, lock 5:441609912, Local AST
> Dec 27 21:45:44 integ-hm5 kernel:
(dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres
N000000008340963d, action 1, unlock 0, level -1 => 3
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__ocfs2_cluster_lock:1465 lockres
O000000000000008478220400000000, convert from -1 to 3
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_get_lock_resource:724 get lockres
O000000000000008478220400000000 (len 31)
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__dlm_lookup_lockres_full:198 O000000000000008478220400000000
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_get_lock_resource:778 allocating a new resource
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__dlm_lookup_lockres_full:198 O000000000000008478220400000000
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_get_lock_resource:789 no lockres found, allocated our own:
ffff880717e38780
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__dlm_insert_lockres:187 A895BC216BE641A8A7E20AA89D57E051: Hash
res O000000000000008478220400000000
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):__dlm_lockres_grab_inflight_ref:663
A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000,
inflight++: now 1, dlm_get_lock_resource()
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_do_master_request:1364 node 1 not master, response=NO
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_do_master_request:1364 node 2 not master, response=NO
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_do_master_request:1364 node 3 not master, response=NO
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_do_master_request:1364 node 4 not master, response=NO
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_wait_for_lock_mastery:1122 about to master
O000000000000008478220400000000 here, this=5
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,0):dlm_do_assert_master:1668 sending assert master to 1
(O000000000000008478220400000000)
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_do_assert_master:1668 sending assert master to 2
(O000000000000008478220400000000)
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_do_assert_master:1668 sending assert master to 3
(O000000000000008478220400000000)
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_do_assert_master:1668 sending assert master to 4
(O000000000000008478220400000000)
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_get_lock_resource:968 A895BC216BE641A8A7E20AA89D57E051: res
O000000000000008478220400000000, Mastered by 5
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlm_mle_release:436
Releasing mle for O000000000000008478220400000000, type 1
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock:690 type=3,
flags = 0x0
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock:691 creating
lock: lock=ffff8801824b4680 res=ffff880717e38780
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock_master:131
type=3
> Dec 27 21:45:44 integ-hm5 kernel: (nvfs,91539,4):dlmlock_master:148 I
can grant this lock right away
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):__dlm_dirty_lockres:483 A895BC216BE641A8A7E20AA89D57E051: res
O000000000000008478220400000000
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_lockres_drop_inflight_ref:684
A895BC216BE641A8A7E20AA89D57E051: res O000000000000008478220400000000,
inflight--: now 0, dlmlock()
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):__dlm_dirty_lockres:483 A895BC216BE641A8A7E20AA89D57E051: res
O000000000000008478220400000000
> Dec 27 21:45:44 integ-hm5 kernel:
(dlm_thread,46268,24):dlm_flush_asts:541 A895BC216BE641A8A7E20AA89D57E051: res
O000000000000008478220400000000, Flush AST for lock 5:441609913, type 3, node 5
> Dec 27 21:45:44 integ-hm5 kernel:
(dlm_thread,46268,24):dlm_do_local_ast:232 A895BC216BE641A8A7E20AA89D57E051: res
O000000000000008478220400000000, lock 5:441609913, Local AST
> Dec 27 21:45:44 integ-hm5 kernel:
(dlm_thread,46268,24):ocfs2_locking_ast:1076 AST fired for lockres
O000000000000008478220400000000, action 1, unlock 0, level -1 => 3
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):__ocfs2_cluster_lock:1465 lockres
M000000000000008478220400000000, convert from -1 to 3
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_get_lock_resource:724 get lockres
M000000000000008478220400000000 (len 31)
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):__dlm_lookup_lockres_full:198 M000000000000008478220400000000
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_get_lock_resource:778 allocating a new resource
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):__dlm_lookup_lockres_full:198 M000000000000008478220400000000
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_get_lock_resource:789 no lockres found, allocated our own:
ffff8803ba843e80
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):__dlm_insert_lockres:187 A895BC216BE641A8A7E20AA89D57E051: Hash
res M000000000000008478220400000000
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):__dlm_lockres_grab_inflight_ref:663
A895BC216BE641A8A7E20AA89D57E051: res M000000000000008478220400000000,
inflight++: now 1, dlm_get_lock_resource()
> Dec 27 21:45:44 integ-hm5 kernel:
(nvfs,91539,4):dlm_do_master_request:1364 node 1 not master, response=NO
>
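In this trace, node 5 sends a master request to nodes 1 to 4. All four answer NO, and node 5 then asserts mastery itself ("Mastered by 5"). On a long trace, it helps to extract just those outcomes. This is a minimal sketch. It assumes one message per line in the log file and the exact wording printed by this 3.10.91 build. `mastery_summary` is a hypothetical name.

```shell
# mastery_summary: read kernel log lines on stdin and print "LOCKRES NODE"
# for every "res <name>, Mastered by <node>" message.
mastery_summary() {
    sed -n 's/.*res \([^,]*\), Mastered by \([0-9][0-9]*\).*/\1 \2/p'
}
```

For example, `mastery_summary < /var/log/messages` prints one pair per mastered resource, such as `O000000000000008478220400000000 5`.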
> Regards
> Prabu GJ
>
>
>
> ---- On Wed, 23 Dec 2015 10:05:10 +0530 gjprabu <gjprabu at zohocorp.com> wrote ----
>
> Ok, thanks
>
>
> ---- On Wed, 23 Dec 2015 09:08:13 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
>
>
>
> I don't think this is related to packet size.
> Once it is reproduced, you can share the messages and I will do my best to
> look at them when I am free.
>
> On 2015/12/23 10:45, gjprabu wrote:
> > Hi Joseph,
> >
> > I have enabled the requested switches. Will the DLM log capture enough to
> > analyze further? Also, do we need to enable any network-side setting to
> > allow the maximum packet size?
> >
> > debugfs.ocfs2 -l
> > DLM allow
> > MSG off
> > TCP off
> > CONN off
> > VOTE off
> > DLM_DOMAIN off
> > HB_BIO off
> > BASTS allow
> > DLMFS off
> > ERROR allow
> > DLM_MASTER off
> > KTHREAD off
> > NOTICE allow
> > QUORUM off
> > SOCKET off
> > DLM_GLUE off
> > DLM_THREAD off
> > DLM_RECOVERY allow
> > HEARTBEAT off
> > CLUSTER off
> >
> > Regards
> > Prabu
> >
> >
> > ---- On Wed, 23 Dec 2015 07:51:38 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
> >
> > Please also switch on BASTS and DLM_RECOVERY.
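Switch the bits on one at a time, using the same `debugfs.ocfs2 -l BIT allow` form used elsewhere in this thread. Then check the listing on every node, so that no node silently keeps running with a bit still off. This is a minimal sketch. `enable_bits` and `missing_bits` are hypothetical helpers, and `missing_bits` assumes the two-column `NAME state` listing shown in this thread.

```shell
# enable_bits BIT...: switch each trace bit on, one debugfs.ocfs2 call per bit.
enable_bits() {
    for bit in "$@"; do
        debugfs.ocfs2 -l "$bit" allow || return 1
    done
}

# missing_bits BIT...: read `debugfs.ocfs2 -l` output on stdin and print each
# requested bit whose state is not "allow". Empty output means all are on.
missing_bits() {
    awk -v want="$*" '
        BEGIN { n = split(want, bits, " ") }
        { state[$1] = $2 }
        END {
            for (i = 1; i <= n; i++)
                if (state[bits[i]] != "allow") print bits[i]
        }'
}
```

For example, `enable_bits DLM BASTS DLM_RECOVERY && debugfs.ocfs2 -l | missing_bits DLM BASTS DLM_RECOVERY` prints nothing when all three bits are allowed. If a bit is still off, it prints that bit's name.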
> >
> > On 2015/12/23 10:11, gjprabu wrote:
> > > HI Joseph,
> > >
> > > Our current setup has the settings below, and DLM is now allowed
> > > (DLM allow). Do you suggest any other options to get more logs?
> > >
> > > debugfs.ocfs2 -l
> > > DLM off ( DLM allow)
> > > MSG off
> > > TCP off
> > > CONN off
> > > VOTE off
> > > DLM_DOMAIN off
> > > HB_BIO off
> > > BASTS off
> > > DLMFS off
> > > ERROR allow
> > > DLM_MASTER off
> > > KTHREAD off
> > > NOTICE allow
> > > QUORUM off
> > > SOCKET off
> > > DLM_GLUE off
> > > DLM_THREAD off
> > > DLM_RECOVERY off
> > > HEARTBEAT off
> > > CLUSTER off
> > >
> > > Regards
> > > Prabu
> > >
> > >
> > >
> > > ---- On Wed, 23 Dec 2015 07:30:54 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
> > >
> > > So you mean the four nodes were rebooted manually? If so, you must
> > > analyze the messages from before the reboot.
> > > If there are not enough messages, you can switch on more of them. IMO,
> > > most hang problems are caused by DLM bugs, so I suggest switching on the
> > > DLM-related logs and reproducing the problem.
> > > You can use debugfs.ocfs2 -l to show all message switches and switch on
> > > the ones you want. For example,
> > > # debugfs.ocfs2 -l DLM allow
> > >
> > > Thanks,
> > > Joseph
> > >
> > > On 2015/12/22 21:47, gjprabu wrote:
> > > > Hi Joseph,
> > > >
> > > > We are facing an OCFS2 server hang problem frequently: suddenly 4
> > > > nodes go into a hung state, all except 1 node. After a reboot everything
> > > > comes back to normal, and this has happened many times. Is there any way
> > > > to debug and fix this issue?
> > > >
> > > > Regards
> > > > Prabu
> > > >
> > > >
> > > > ---- On Tue, 22 Dec 2015 16:30:52 +0530 Joseph Qi <joseph.qi at huawei.com> wrote ----
> > > >
> > > > Hi Prabu,
> > > > From the log you provided, I can only see that node 5 lost its
> > > > connection to nodes 2, 3, 1 and 4. It seems that something went wrong
> > > > on those four nodes, and node 5 ran recovery for them. After that, the
> > > > four nodes joined again.
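The sequence Joseph describes can be pulled out of the kernel log mechanically: evictions, recovery by node 5, then the nodes rejoining. This is a minimal sketch. It assumes the o2dlm message wording shown in the quoted log, with one message per line. `o2dlm_events` is a hypothetical helper.

```shell
# o2dlm_events: read kernel log lines on stdin and print one line per
# membership event:
#   EVICT <node> | RECOVER <node> | JOIN <node> <size> | LEAVE <node> <size>
# where <size> is the domain size reported at the end of the message.
o2dlm_events() {
    awk '
        /o2dlm has evicted node/ {
            for (i = 1; i <= NF; i++) if ($i == "node") print "EVICT", $(i + 1)
        }
        /o2dlm: Begin recovery on domain/ { print "RECOVER", $NF }
        /o2dlm: Node [0-9]+ (joins|leaves) domain/ {
            for (i = 1; i <= NF; i++)
                if ($i == "Node") {
                    ev = ($(i + 2) == "joins") ? "JOIN" : "LEAVE"
                    print ev, $(i + 1), $(NF - 1)
                }
        }'
}
```

On node 5, `dmesg | o2dlm_events` would list the evictions, the recoveries node 5 ran, and the later join and leave events, with the domain size after each one.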
> > > >
> > > > On 2015/12/22 16:23, gjprabu wrote:
> > > > > Hi,
> > > > >
> > > > > Can anybody please help me with this issue?
> > > > >
> > > > > Regards
> > > > > Prabu
> > > > >
> > > > > ---- On Mon, 21 Dec 2015 15:16:49 +0530 gjprabu <gjprabu at zohocorp.com> wrote ----
> > > > >
> > > > > Dear Team,
> > > > >
> > > > > OCFS2 clients hang often and become unusable. Please find the logs
> > > > > below. Kindly provide a solution; it will be highly appreciated.
> > > > >
> > > > >
> > > > > [3659684.042530] o2dlm: Node 4
joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
> > > > >
> > > > > [3992993.101490]
(kworker/u192:1,63211,24):dlm_create_lock_handler:515 ERROR: dlm status =
DLM_IVLOCKID
> > > > > [3993002.193285]
(kworker/u192:1,63211,24):dlm_deref_lockres_handler:2267 ERROR:
A895BC216BE641A8A7E20AA89D57E051:M0000000000000062d2dcd000000000: bad lockres
name
> > > > > [3993032.457220]
(kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -112 when
sending message 502 (key 0xc3460ae7) to node 2
> > > > > [3993062.547989]
(kworker/u192:0,67418,11):dlm_do_assert_master:1680 ERROR: Error -107 when
sending message 502 (key 0xc3460ae7) to node 2
> > > > > [3993064.860776]
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when
sending message 502 (key 0xc3460ae7) to node 2
> > > > > [3993064.860804] o2cb: o2dlm has
evicted node 2 from domain A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993073.280062] o2dlm: Begin
recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 2
> > > > > [3993094.623695]
(dlm_thread,46268,8):dlm_send_proxy_ast_msg:484 ERROR:
A895BC216BE641A8A7E20AA89D57E051: res S000000000000000000000200000000, error
-112 send AST to node 4
> > > > > [3993094.624281]
(dlm_thread,46268,8):dlm_flush_asts:605 ERROR: status = -112
> > > > > [3993094.687668]
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -112 when
sending message 502 (key 0xc3460ae7) to node 3
> > > > > [3993094.815662]
(dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -112 when
sending message 514 (key 0xc3460ae7) to node 1
> > > > > [3993094.816118]
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -112
> > > > > [3993124.778525]
(dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when
sending message 514 (key 0xc3460ae7) to node 3
> > > > > [3993124.779032]
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107
> > > > > [3993133.332516] o2cb: o2dlm has
evicted node 3 from domain A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993139.915122] o2cb: o2dlm has
evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993147.071956] o2cb: o2dlm has
evicted node 4 from domain A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993147.071968]
(dlm_reco_thread,46269,7):dlm_do_master_requery:1666 ERROR: Error -107 when
sending message 514 (key 0xc3460ae7) to node 4
> > > > > [3993147.071975]
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when
sending message 502 (key 0xc3460ae7) to node 4
> > > > > [3993147.071997]
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when
sending message 502 (key 0xc3460ae7) to node 4
> > > > > [3993147.072001]
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when
sending message 502 (key 0xc3460ae7) to node 4
> > > > > [3993147.072005]
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when
sending message 502 (key 0xc3460ae7) to node 4
> > > > > [3993147.072009]
(kworker/u192:0,67418,15):dlm_do_assert_master:1680 ERROR: Error -107 when
sending message 502 (key 0xc3460ae7) to node 4
> > > > > [3993147.075019]
(dlm_reco_thread,46269,7):dlm_pre_master_reco_lockres:2166 ERROR: status = -107
> > > > > [3993147.075353]
(dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 1 went down!
> > > > > [3993147.075701]
(dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > [3993147.076001]
(dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down!
> > > > > [3993147.076329]
(dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > [3993147.076634]
(dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > > > [3993147.076968]
(dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > [3993147.077275]
(dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 1
> > > > > [3993147.077591]
(dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 3 up while
restarting
> > > > > [3993147.077594]
(dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > > > [3993155.171570]
(dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 3 went down!
> > > > > [3993155.171874]
(dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > [3993155.172150]
(dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > > > [3993155.172446]
(dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > [3993155.172719]
(dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 3
> > > > > [3993155.173001]
(dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1229 node 4 up while
restarting
> > > > > [3993155.173003]
(dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > > > [3993155.173283]
(dlm_reco_thread,46269,7):dlm_do_master_request:1347 ERROR: link to 4 went down!
> > > > > [3993155.173581]
(dlm_reco_thread,46269,7):dlm_get_lock_resource:932 ERROR: status = -107
> > > > > [3993155.173858]
(dlm_reco_thread,46269,7):dlm_restart_lock_mastery:1236 ERROR: node down! 4
> > > > > [3993155.174135]
(dlm_reco_thread,46269,7):dlm_wait_for_lock_mastery:1053 ERROR: status = -11
> > > > > [3993155.174458] o2dlm: Node 5 (me)
is the Recovery Master for the dead node 2 in domain
A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993158.361220] o2dlm: End
recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993158.361228] o2dlm: Begin
recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
> > > > > [3993158.361305] o2dlm: Node 5 (me)
is the Recovery Master for the dead node 1 in domain
A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993161.833543] o2dlm: End
recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993161.833551] o2dlm: Begin
recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 3
> > > > > [3993161.833620] o2dlm: Node 5 (me)
is the Recovery Master for the dead node 3 in domain
A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993165.188817] o2dlm: End
recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993165.188826] o2dlm: Begin
recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 4
> > > > > [3993165.188907] o2dlm: Node 5 (me)
is the Recovery Master for the dead node 4 in domain
A895BC216BE641A8A7E20AA89D57E051
> > > > > [3993168.551610] o2dlm: End
recovery on domain A895BC216BE641A8A7E20AA89D57E051
> > > > >
> > > > > [3996486.869628] o2dlm: Node 4
joins domain A895BC216BE641A8A7E20AA89D57E051 ( 4 5 ) 2 nodes
> > > > > [3996778.703664] o2dlm: Node 4
leaves domain A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes
> > > > > [3997012.295536] o2dlm: Node 2
joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 5 ) 2 nodes
> > > > > [3997099.498157] o2dlm: Node 4
joins domain A895BC216BE641A8A7E20AA89D57E051 ( 2 4 5 ) 3 nodes
> > > > > [3997783.633140] o2dlm: Node 1
joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 4 5 ) 4 nodes
> > > > > [3997864.039868] o2dlm: Node 3
joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
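The negative numbers in these messages are Linux errno values: -107 is ENOTCONN (transport endpoint is not connected), -112 is EHOSTDOWN (host is down), and -11 is EAGAIN. Tallying the failed sends per error and peer node shows which links failed first. This is a minimal sketch. It assumes the "Error -N when sending message ... to node X" wording above. `send_errors` is a hypothetical helper, and its errno table covers only the two send errors seen here.

```shell
# send_errors: read kernel log lines on stdin and count "Error -N when
# sending message ... to node X" failures per (errno, peer node), printed
# in order of first appearance.
send_errors() {
    awk '
        BEGIN { name["-107"] = "ENOTCONN"; name["-112"] = "EHOSTDOWN" }
        /Error -[0-9]+ when sending message/ {
            for (i = 1; i <= NF; i++) if ($i == "Error") err = $(i + 1)
            key = err " " $NF
            if (!(key in count)) order[++n] = key
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++) {
                split(order[i], f, " ")
                e = (f[1] in name) ? name[f[1]] : f[1]
                print e, "to node", f[2] ":", count[order[i]]
            }
        }'
}
```

Run against this log, the first lines of output would show EHOSTDOWN and then ENOTCONN towards node 2. This is consistent with node 2 being the first node evicted.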
> > > > >
> > > > > Regards
> > > > > Prabu
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > Ocfs2-users mailing list
> > > > > Ocfs2-users at oss.oracle.com
> > > > > https://oss.oracle.com/mailman/listinfo/ocfs2-users
> > > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
>
>
>