I have a two-node ocfs2 cluster, and in the /etc/ocfs2/cluster.conf file, node_count is 0 rather than 2. Is this necessarily a wrong config, and how would it affect the cluster?

Thanks.
Hai Tao
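For anyone comparing notes, a minimal way to see what the running cluster actually registered, as opposed to what the file says, is to look at the o2cb configfs tree. This is only a sketch: the paths assume the default configfs mount point and a cluster named "ocfs2"; adjust both to your setup.

    # node_count as written in the config file
    grep -A 2 '^cluster:' /etc/ocfs2/cluster.conf

    # nodes the running o2cb cluster has actually registered, and their numbers
    ls /sys/kernel/config/cluster/ocfs2/node/
    cat /sys/kernel/config/cluster/ocfs2/node/*/num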
I have a two-node ocfs2 cluster, and I disabled the heartbeat nic with "ifdown eth1". I got the following weird logs on both nodes:

Sep 7 10:45:49 dbtest-01 kernel: o2net: connection to node dbtest-02 (num 1) at 10.194.59.65:7777 has been idle for 30.0 seconds, shutting it down.
Sep 7 10:45:49 dbtest-01 kernel: (swapper,0,3):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1315417519.185025 now 1315417549.183798 dr 1315417519.185016 adv 1315417519.185032:1315417519.185032 func (b9bb7168:504) 1315417518.872227:1315417518.872268)
Sep 7 10:45:49 dbtest-01 kernel: o2net: no longer connected to node dbtest-02 (num 1) at 10.194.59.65:7777
Sep 7 10:45:49 dbtest-01 kernel: (dlm_thread,3781,2):dlm_send_proxy_ast_msg:457 ERROR: status = -112
Sep 7 10:45:49 dbtest-01 kernel: (oracle,26129,1):dlm_do_master_request:1334 ERROR: link to 1 went down!
Sep 7 10:45:49 dbtest-01 kernel: (oracle,26129,1):dlm_get_lock_resource:917 ERROR: status = -112
Sep 7 10:45:49 dbtest-01 kernel: (dlm_thread,4256,1):dlm_send_proxy_ast_msg:457 ERROR: status = -112
Sep 7 10:45:49 dbtest-01 kernel: (dlm_thread,4256,1):dlm_flush_asts:604 ERROR: status = -112
Sep 7 10:45:49 dbtest-01 kernel: (dlm_thread,3781,2):dlm_flush_asts:604 ERROR: status = -112
Sep 7 10:46:19 dbtest-01 kernel: (o2net,3736,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Sep 7 10:46:19 dbtest-01 kernel: o2net: accepted connection from node dbtest-02 (num 1) at 10.194.59.65:7777
Sep 7 10:48:37 dbtest-01 kernel: INFO: task events/0:10 blocked for more than 120 seconds.
Sep 7 10:48:37 dbtest-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 7 10:48:37 dbtest-01 kernel: events/0 D ffff810001004420 0 10 1 11 9 (L-TLB)
Sep 7 10:48:37 dbtest-01 kernel: ffff81083ffedc80 0000000000000046 ffffffff80333680 0000000000000001
Sep 7 10:48:37 dbtest-01 kernel: 0000000000000400 000000000000000a ffff81083ffe1820 ffffffff80309b60
Sep 7 10:48:37 dbtest-01 kernel: 0030b62498ce7b3f 000000000000416b ffff81083ffe1a08 0000000000000000
Sep 7 10:48:37 dbtest-01 kernel: Call Trace:
Sep 7 10:48:37 dbtest-01 kernel: Call Trace:
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff80064167>] wait_for_completion+0x79/0xa2
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff884e64b7>] :ocfs2:ocfs2_wait_for_mask+0xd/0x19
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff884e78d8>] :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff885013e5>] :ocfs2:ocfs2_orphan_scan_work+0x0/0x83
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff884ed1e4>] :ocfs2:ocfs2_orphan_scan_lock+0x55/0x84
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff884fc59b>] :ocfs2:ocfs2_queue_orphan_scan+0x32/0x147
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff885013ff>] :ocfs2:ocfs2_orphan_scan_work+0x1a/0x83
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8004dc37>] run_workqueue+0x94/0xe4
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8004a472>] worker_thread+0x0/0x122
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8004a562>] worker_thread+0xf0/0x122
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8008e16d>] default_wake_function+0x0/0xe
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff80032bdc>] kthread+0xfe/0x132
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8005efb1>] child_rip+0xa/0x11
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff80032ade>] kthread+0x0/0x132
Sep 7 10:48:37 dbtest-01 kernel: [<ffffffff8005efa7>] child_rip+0x0/0x11
Does anyone know why this happened?

Thanks.
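A small aside that may help others reading this trace: the repeated "status = -112" in the dlm messages is a negated Linux errno value. A quick way to decode it from a shell (any box with Python will do):

    # errno 112 on Linux is EHOSTDOWN ("Host is down"), i.e. the peer on the
    # cluster interconnect is considered unreachable -- consistent with the
    # heartbeat nic having been ifdown'ed
    python -c 'import errno, os; print(errno.errorcode[112]); print(os.strerror(112))'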
Is the ocfs2 heartbeat transferred over the network, or is it just updating a file on the shared disk? If the heartbeat is lost, what should happen? And what if only one node is writing while the other is idle -- will that still cause any file system issue?

Thanks.
Hai Tao

From: taoh666 at hotmail.com
To: ocfs2-users at oss.oracle.com
Date: Sat, 10 Sep 2011 00:50:23 -0700
Subject: [Ocfs2-users] disable heartbeat nic caused ocfs2 errors

I have a two-node ocfs2 cluster, and I disabled the heartbeat nic with "ifdown eth1". I got the following weird logs on both nodes:

[...]

Does anyone know why this happened?

Thanks.
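One way to see the disk side of the heartbeat for yourself: each mounted ocfs2 volume registers a heartbeat region with o2cb, and both the region and the nodes using the volume can be listed from userspace. This is only a sketch; the configfs path assumes a cluster named "ocfs2", and the exact output of the tools varies by ocfs2-tools version.

    # list ocfs2 devices and the nodes currently mounting each one
    mounted.ocfs2 -f

    # heartbeat regions (one per mounted volume, keyed by UUID) registered
    # with the running cluster
    ls /sys/kernel/config/cluster/ocfs2/heartbeat/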
Sunil Mushran
2011-Sep-12 18:01 UTC
[Ocfs2-users] disable heartbeat nic caused ocfs2 errors
ocfs2 uses disk heartbeat to detect node liveness. It uses net heartbeat to detect link liveness. Both need to operate for the cluster to function. If the network link between two nodes snaps, then one of the two nodes is fenced. The stack below indicates that the two nodes are not able to communicate. The two nodes are waiting on the quorum to fence one of the nodes. It appears you have upped the disk heartbeat timeout to more than 2 minutes. I would imagine one of the nodes reset after that timeout.

On 09/10/2011 08:54 PM, Hai Tao wrote:
> Is the ocfs2 heartbeat transferred over the network, or is it just updating a file on the shared disk?
>
> If the heartbeat is lost, what should happen? And what if only one node is writing while the other is idle -- will that still cause any file system issue?
>
> Thanks.
> Hai Tao
>
> From: taoh666 at hotmail.com
> To: ocfs2-users at oss.oracle.com
> Date: Sat, 10 Sep 2011 00:50:23 -0700
> Subject: [Ocfs2-users] disable heartbeat nic caused ocfs2 errors
>
> I have a two-node ocfs2 cluster, and I disabled the heartbeat nic with "ifdown eth1". I got the following weird logs on both nodes:
>
> [...]
>
> Does anyone know why this happened?
>
> Thanks.
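For anyone wanting to verify the timeouts Sunil mentions, a minimal sketch follows. The /etc/sysconfig path is the EL-style layout (Debian-based installs use /etc/default/o2cb), and the configfs paths assume a cluster named "ocfs2"; adjust to your setup.

    # timeouts as configured for the o2cb init script
    grep -E 'O2CB_(HEARTBEAT_THRESHOLD|IDLE_TIMEOUT_MS)' /etc/sysconfig/o2cb

    # values the running cluster is actually using
    cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold   # disk heartbeat threshold
    cat /sys/kernel/config/cluster/ocfs2/idle_timeout_ms            # network idle timeout (ms)

The effective disk-heartbeat timeout is roughly (dead_threshold - 1) * 2 seconds, so the common default of 31 gives about 60 seconds; a node that hangs for well over 2 minutes before being fenced suggests a threshold raised somewhere above 61.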
It is a wrong config.

On 09/09/2011 10:15 PM, Hai Tao wrote:
> I have a two-node ocfs2 cluster, and in the /etc/ocfs2/cluster.conf file, node_count is 0 rather than 2. Is this necessarily a wrong config, and how would it affect the cluster?
>
> Thanks.
> Hai Tao
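For reference, a corrected two-node cluster.conf carries node_count = 2 in the cluster stanza. The sketch below reuses the node names and the one address visible in this thread; the cluster name "ocfs2", the node numbers, and the dbtest-01 address are assumptions to replace with your own values. Keep in mind the file is whitespace-sensitive (stanza headers flush left, attributes indented) and must be identical on both nodes.

    node:
            ip_port = 7777
            ip_address = <ip of dbtest-01>
            number = 0
            name = dbtest-01
            cluster = ocfs2

    node:
            ip_port = 7777
            ip_address = 10.194.59.65
            number = 1
            name = dbtest-02
            cluster = ocfs2

    cluster:
            node_count = 2
            name = ocfs2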