I am trying to copy a single 42 GB file from an ext3 file system to an ocfs2 file system on node 1. The ocfs2 file system hangs on all nodes during or after the cp. /p0ebsdb/u13 is an ocfs2 mount point shared with 2 other nodes (3-node RAC).

The following is the unix copy command:

[root@b30svrxp-ebsdb1 migrate]# time cp aexp02.dmp /p0ebsdb/u13/junk

real 17m49.351s
user 0m0.392s
sys 1m49.065s

The following is dmesg on node1:

ocfs2_dlm: Nodes in domain ("A2AECED66891407D915CBF282A9E9299"): 0 1 2
o2net: connection to node b30svrxp-ebsdb2.ameripride.com (num 1) at 192.168.3.70:7777 has been idle for 10.0 seconds, shutting it down.
(0,3):o2net_idle_timer:1418 here are some times that might help debug the situation: (tmr 1184814613.883032 now 1184814623.882842 dr 1184814613.883028 adv 1184814613.883033:1184814613.883033 func (2b61f804:504) 1184814613.882900:1184814613.882904)
o2net: no longer connected to node b30svrxp-ebsdb2.ameripride.com (num 1) at 192.168.3.70:7777
(6047,3):dlm_send_proxy_ast_msg:459 ERROR: status = -107
(6047,3):dlm_flush_asts:600 ERROR: status = -107
(20810,0):dlm_do_master_request:1418 ERROR: link to 1 went down!
(20810,0):dlm_get_lock_resource:995 ERROR: status = -107

The following is dmesg on node2:

(26243,1):dlm_send_remote_convert_request:398 ERROR: status = -107
(26243,1):dlm_wait_for_node_death:365 9EA98E20F6E44FF7B7A89789976C1E32: waiting 5000ms for notification of death of node 0
(7427,0):dlm_send_remote_convert_request:398 ERROR: status = -107
(7427,0):dlm_wait_for_node_death:365 75990178D36942BFA473A2AE4149690C: waiting 5000ms for notification of death of node 0

The following is dmesg on node3:

mtrr: type mismatch for d8000000,2000000 old: uncachable new: write-combining
adl_trace[9860]: segfault at 000000000000000c rip 0000000040002462 rsp 0000007fbfffe3e0 error 4

Any clue? And thanks in advance
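For readers following the node1 log, here is a minimal sketch that pulls the "tmr" and "now" timestamps out of the o2net_idle_timer line and computes the gap between them. The interpretation that "tmr" is when the idle timer was armed and "now" is when it fired is an assumption, and the function name idle_gap_seconds is made up for illustration. The result lines up with the "idle for 10.0 seconds" message just before it.

```python
import re
from decimal import Decimal

# The o2net_idle_timer line from the node1 dmesg output, copied verbatim.
IDLE_TIMER_LINE = (
    "(0,3):o2net_idle_timer:1418 here are some times that might help debug "
    "the situation: (tmr 1184814613.883032 now 1184814623.882842 "
    "dr 1184814613.883028 adv 1184814613.883033:1184814613.883033 "
    "func (2b61f804:504) 1184814613.882900:1184814613.882904)"
)

# Matches the "tmr <secs> now <secs>" pair inside the line.
TMR_NOW = re.compile(r"\btmr (\d+\.\d+) now (\d+\.\d+)")


def idle_gap_seconds(line: str) -> Decimal:
    """Return now - tmr in seconds (hypothetical helper, not part of ocfs2)."""
    match = TMR_NOW.search(line)
    if match is None:
        raise ValueError("no 'tmr ... now ...' pair found in line")
    tmr, now = (Decimal(value) for value in match.groups())
    return now - tmr


if __name__ == "__main__":
    # Assumed reading: time between arming the idle timer and it firing.
    print(f"idle gap: {idle_gap_seconds(IDLE_TIMER_LINE)} s")
```

Decimal is used instead of float so the microsecond digits in the log survive the subtraction unchanged.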
Sunil Mushran
2007-Jul-19 10:46 UTC
[Ocfs2-users] ocfs2 file system hang during copy files
The default disk heartbeat timeouts are way too low. In short, the buffered write flush is probably flooding the device and delaying the heartbeat io.

For more, refer:
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#HEARTBEAT

If you are on 1.2.5, then also refer:
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT

Zosen Wang wrote:
> I am trying to copy a single 42 gb file from ext3 file system to ocfs2
> file system on node 1. The ocfs2 file system hang on all nodes
> after/during the cp. The /p0ebsdb/u13 is an ocfs2 mount point shared
> with other 2 nodes (3 nodes rac).
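On the disk heartbeat timeout mentioned in the reply: as far as I recall, the HEARTBEAT section of the FAQ expresses the timeout through the O2CB_HEARTBEAT_THRESHOLD setting, with threshold = (timeout in seconds / 2) + 1. The sketch below only does that arithmetic. The 2-second heartbeat interval and the formula are taken from my reading of the FAQ and should be checked against it for your version; the function names and the sample timeouts are made up for illustration and are not recommendations.

```python
# Assumption: o2cb writes a disk heartbeat every 2 seconds, and the
# threshold counts heartbeat iterations (per my reading of the OCFS2 FAQ).
HEARTBEAT_INTERVAL_SECS = 2


def threshold_for_timeout(timeout_secs: int) -> int:
    """O2CB_HEARTBEAT_THRESHOLD value for a desired timeout in seconds."""
    if timeout_secs <= 0 or timeout_secs % HEARTBEAT_INTERVAL_SECS != 0:
        raise ValueError("timeout must be a positive multiple of the interval")
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


def timeout_for_threshold(threshold: int) -> int:
    """Inverse mapping: seconds of missed heartbeats a threshold tolerates."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


if __name__ == "__main__":
    # Illustrative timeouts only; pick a value suited to your storage.
    for secs in (12, 60, 120):
        print(f"{secs:>3} s -> O2CB_HEARTBEAT_THRESHOLD={threshold_for_timeout(secs)}")
```

Whatever value is chosen has to be the same on every node in the cluster, and the cluster stack needs to be restarted for it to take effect, so verify the procedure in the FAQ before changing it.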