Henri Cook
2008-Sep-07 00:00 UTC
[Ocfs-users] Hard system restart when DRBD connection fails while in use
Hi all,

I have two nodes (A and B) running a DRBD device with an OCFS2 file system, mounted on /shared, in a Primary-Primary DRBD configuration.

If I start, say, an FTP file transfer to my DRBD /shared directory on node A and then reboot node B while the transfer is in progress, node A stops at about the same time that DRBD notices the connection with node B has been lost, crippling both machines for the time it takes to reboot. If the drive is inactive (i.e. nothing is being written to it), this does not occur.

My question is: could the OCFS2 tools be the source of these reboots? Is any such default action configured? If so, how would I go about investigating or altering it? There are no log entries about the reboot to speak of.

The OS is Ubuntu Hardy (Server) 8.04 with ocfs2-tools 1.3.9-0ubuntu1.

Thanks in advance,
Henri
Henri Cook
2008-Sep-07 09:40 UTC
[Ocfs-users] Hard system restart when DRBD connection fails while in use
Please, this is quite urgent for me; sorry to be a pain.

It appears that OCFS2 is a very likely suspect in causing these reboots: they only occur when the shared DRBD device is mounted on both nodes (which is the default behaviour). If I unmount it on node B before rebooting that node, the reboot does not occur.

There are no error messages from OCFS2 to speak of. Where can I see or configure these heartbeat options?

I'm trying to get a faux-serial console attached, as I've read there's a historic issue where it doesn't even write to the log files but only to the screen.

Thanks,
Henri
Sunil Mushran
2008-Sep-08 00:43 UTC
[Ocfs-users] Hard system restart when DRBD connection fails while in use
Repeat the test. This time, run the following on Node A after you have killed Node B.

$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

If we are lucky, we'll get to see where that process is waiting.
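The full process list can be long, so it may help to narrow it to processes in uninterruptible sleep, which is where a task blocked on I/O would usually show up. The following is only a sketch, not part of the original suggestion: the helper name `filter_blocked` is hypothetical, and it assumes the stuck process reports a STAT value beginning with `D`.

```shell
# Keep the header line plus any process whose STAT column (field 2)
# starts with "D", i.e. uninterruptible sleep. filter_blocked is a
# hypothetical helper name used only for this sketch.
filter_blocked() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Same ps invocation as above; naming the wchan column with a long
# header widens it so the wait channel is not truncated.
ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | filter_blocked
```

If nothing besides the header is printed, the unfiltered output is still worth posting, since the process may be waiting in a different state.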