Hi,

I have been working around the issue of Node fence in case of a heartbeat failure / Network timeout. I modified o2quo_fence_self() in quorum.c to make all ocfs2 filesystems RO, when tested it worked like a charm, and the filesystems were made RO, but I am not able to umount the filesystem or stop O2CB service.

Is there any way by which I could ask O2CB to abort heartbeat and treat the filesystem as LOCAL instead of GLOBAL?

The following is the code change that I made.

**************************************************
static void make_fs_RO(struct super_block *sb, void *arg)
{
	struct ocfs2_super *osb = OCFS2_SB(sb);

	sb->s_flags |= MS_RDONLY;
	ocfs2_set_osb_flag(osb, OCFS2_OSB_ERROR_FS);
	ocfs2_set_ro_flag(osb, *(int *)arg);
}

/* this is horribly heavy-handed. It should instead flip the file
 * system RO and call some userspace script. */
static void o2quo_fence_self(void)
{
	*...*

	case O2NM_FENCE_RESET:
		printk(KERN_ERR "*** Hard failure in O2CB, all ocfs2 "
		       "filesystems made RO ***\n");

		/* Iterate through all ocfs2 super blocks and make each of
		   them RO */
		fs_type = get_fs_type("ocfs2");
		if (fs_type)
			iterate_supers_type(fs_type, make_fs_RO, &hard_reset);

		break;
	*...*
}
***************************************************************

The error from kern.log:

======================================
May 31 16:08:18 localhost kernel: [ 5434.076126] (kworker/u:2,577,3):dlm_send_remote_convert_request:395 ERROR: Error -107 when sending message 504 (key 0xcfe4a084) to node 0
May 31 16:08:18 localhost kernel: [ 5434.076178] o2dlm: Waiting on the death of node 0 in domain A4E98618A3744717A65AF04E943D035A
======================================

Any pointers would be much appreciated.

Thanks,

Vineeth
The reason nodes are fenced during network failures is that we need to guarantee that no I/O is going to happen from the fenced node. If you just change the fs to read-only, we still cannot guarantee that there is no in-flight I/O from this node left over from previous writes.

On 05/31/2013 08:33 AM, Vineeth Thampi wrote:
> Hi,
>
> I have been working around the issue of Node fence in case of a
> heartbeat failure / Network timeout. I modified o2quo_fence_self() in
> quorum.c to make all ocfs2 filesystems RO, when tested it worked like
> a charm, and the filesystems were made RO, but I am not able to umount
> the filesystem or stop O2CB service.
> [...]
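
The point about in-flight I/O can be illustrated with a small user-space toy model. This is only a sketch under stated assumptions: it is not OCFS2 or kernel code, and every name in it (`toy_node`, `toy_submit_write`, `toy_writeback`, `toy_demo`) is hypothetical. It models a node that refuses new writes once a read-only flag is set, while writes that were queued earlier are still flushed to shared storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_QUEUE_MAX 8

/* Hypothetical toy model of one cluster node; not OCFS2 code. */
struct toy_node {
	bool read_only;             /* stands in for flipping the fs RO */
	int queued[TOY_QUEUE_MAX];  /* writes accepted but not yet on disk */
	int nr_queued;
	int nr_on_disk;             /* writes that reached shared storage */
};

/* New writes are refused once the node is read-only (or the queue is full). */
static bool toy_submit_write(struct toy_node *n, int block)
{
	if (n->read_only || n->nr_queued == TOY_QUEUE_MAX)
		return false;
	n->queued[n->nr_queued++] = block;
	return true;
}

/* Writeback never looks at read_only: already-queued I/O still goes out. */
static int toy_writeback(struct toy_node *n)
{
	int flushed = n->nr_queued;

	assert(flushed >= 0 && flushed <= TOY_QUEUE_MAX);
	n->nr_on_disk += flushed;
	n->nr_queued = 0;
	return flushed;
}

/* Queue two writes, flip RO, then show what still reaches the disk. */
int toy_demo(void)
{
	struct toy_node n = { 0 };
	bool accepted;
	int late;

	toy_submit_write(&n, 10);
	toy_submit_write(&n, 11);

	n.read_only = true;                   /* "fence" by going read-only */
	accepted = toy_submit_write(&n, 12);  /* refused, as intended */
	late = toy_writeback(&n);             /* earlier writes still land */

	printf("new write accepted after RO: %d\n", accepted);
	printf("writes reaching disk after RO: %d\n", late);
	return late;
}
```

Calling `toy_demo()` from a `main()` shows the new write being refused while the two earlier writes still reach the disk after the flag flip. In this toy model, that late writeback is the case a read-only flag alone does not cover.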
On 2013/6/1 1:09, Srinivas Eeda wrote:
> The reason nodes are fenced during network failures is because we need
> to guarantee that no i/o's are going to happen from this fenced node.
> If you just change the fs to read-only we still cannot guarantee that
> there are no inflight-io's from this node from previous writes.

I agree. Setting ocfs2 to read-only only prevents I/O from user-space applications. Data already in kernel caches, for example the page cache, or writes currently in progress may still be written to the SAN. The best way is to use SCSI-3 Persistent Group Reservation to fence the node.