thr3ads.net - Ocfs2 devel - [Ocfs2-devel] Bug 48 "[kernel 2.6 porting] System halt during reboot after mount an OCFS volume." in bugzilla is fixed. [Mar 2004]

If this information is useful, please help other people find it:
Share via:

Sonic Zhang

2004-Mar-23 21:05 UTC

[Ocfs2-devel] Bug 48 "[kernel 2.6 porting] System halt during reboot after mount an OCFS volume." in bugzilla is fixed.

Hi all,

I successfully root cause and fix bug 48 "[kernel 2.6 porting] System halt 
during reboot after mount an OCFS volume.".

In current OCFS v2 driver, ocfs_volume_thread, ocfs_recv_thread and 
ocfs_commit_thread are assumed to be terminated by the ocfs_dismount_volume 
routine. But, if the system reboots, all processes and kernel threads will 
receive signal SIGTERM before ocfs_dismount_volume routine is called.  

These kernel threads don't exit correctly. For example, they don't know
they
should exit loop after received signal SIGTERM and clear their task_struct 
pointers in ocfs_super to indiate their status. That's the cause of the
system
halt in ocfs_dismount_volume routine when system reboots.

I attach a patch to fix this bug. Please review. 

Thank you

This patch is against svn version 807.
----------------------------------------------------------------
--- ocfs2.old/src/journal.c	2004-03-22 16:02:55.000000000 +0800
+++ ocfs2/src/journal.c	2004-03-22 16:09:57.000000000 +0800
@@ -1034,12 +1034,13 @@
 	/* The OCFS_JOURNAL_IN_SHUTDOWN will signal to commit_cache to not
 	 * drop the trans_lock (which we want to hold until we
 	 * completely destroy the journal. */
-	if (osb->commit && osb->commit->c_task) {
-		/* Wait for the commit thread */
-		LOG_TRACE_STR ("Waiting for ocfs2commit to exit....");
-		send_sig (SIGINT, osb->commit->c_task, 0);
-		wait_for_completion(&osb->commit->c_complete);
-		osb->commit->c_task = NULL;
+	if (osb->commit) {
+		if(osb->commit->c_task) {
+			/* Wait for the commit thread */
+			LOG_TRACE_STR ("Waiting for ocfs2commit to exit....");
+			send_sig (SIGINT, osb->commit->c_task, 0);
+			wait_for_completion(&osb->commit->c_complete);
+		}
 		ocfs_free(osb->commit);
 	}
 	
@@ -1808,7 +1809,7 @@
 			break;
 	}
 
-
+	commit->c_task = NULL;
 
         /* Flush all scheduled tasks */
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)


--- ocfs2.old/src/nm.c.old	2004-03-23 17:09:29.000000000 +0800
+++ ocfs2/src/nm.c	2004-03-24 10:18:35.000000000 +0800
@@ -118,6 +118,8 @@
 		OcfsIpcCtxt.recv_sock = NULL;
 	}
 
+	OcfsIpcCtxt.task = NULL;
+
 	/* signal main thread of ipcdlm's exit */
 	complete (&(OcfsIpcCtxt.complete));
 
@@ -249,6 +251,8 @@
 	__u64 cfg_seq_num;
 	int which, pruned, prune_iters = 0;
 	struct buffer_head *bh = NULL;
+	int signr;
+	siginfo_t info;
 
 	LOG_ENTRY ();
 
@@ -258,6 +262,7 @@
 
 	sprintf (proc, "ocfs2nm-%d", osb->osb_id);
 	ocfs_daemonize (proc, strlen(proc));
+	allow_signal(SIGTERM);
 
 	osb->dlm_task = current;
 
@@ -437,7 +442,11 @@
 			osb->hbt = 50 + j;
 		}
 		set_current_state (TASK_INTERRUPTIBLE);
-		schedule_timeout (osb->hbt - j);
+		if( schedule_timeout (osb->hbt - j) < osb->hbt -j ) {
+			signr = dequeue_signal_lock(current, &current->blocked, &info);
+			if(signr == SIGTERM)
+				OcfsGlobalCtxt.flags |= OCFS_FLAG_SHUTDOWN_VOL_THREAD;
+		}
 	}
 
         /* Flush all scheduled tasks */
@@ -447,6 +456,8 @@
         flush_scheduled_tasks ();
 #endif
 
+	osb->dlm_task = NULL;
+
 	complete (&(osb->dlm_complete));
 eek:
 	LOG_EXIT_LONG (0);

Sonic Zhang

2004-Mar-24 01:37 UTC

head link

[Ocfs2-devel] Bug 48 "[kernel 2.6 porting] System halt during reboot after mount an OCFS volume." in bugzilla is fixed.

Hi all,

I successfully root cause and fix bug 48 "[kernel 2.6 porting] System halt
during reboot after mount an OCFS volume.".

In current OCFS v2 driver, ocfs_volume_thread, ocfs_recv_thread and
ocfs_commit_thread are assumed to be terminated by the ocfs_dismount_volume
routine. But, if the system reboots, all processes and kernel threads will
receive signal SIGTERM before ocfs_dismount_volume routine is called. 

These kernel threads don't exit correctly. For example, they don't know 
they
should exit loop after received signal SIGTERM and clear their task_struct
pointers in ocfs_super to indiate their status. That's the cause of the 
system
halt in ocfs_dismount_volume routine when system reboots.

I attach a patch to fix this bug. Please review.

Thank you

This patch is against svn version 807.
----------------------------------------------------------------
--- ocfs2.old/src/journal.c    2004-03-22 16:02:55.000000000 +0800
+++ ocfs2/src/journal.c    2004-03-22 16:09:57.000000000 +0800
@@ -1034,12 +1034,13 @@
    /* The OCFS_JOURNAL_IN_SHUTDOWN will signal to commit_cache to not
     * drop the trans_lock (which we want to hold until we
     * completely destroy the journal. */
-    if (osb->commit && osb->commit->c_task) {
-        /* Wait for the commit thread */
-        LOG_TRACE_STR ("Waiting for ocfs2commit to exit....");
-        send_sig (SIGINT, osb->commit->c_task, 0);
-        wait_for_completion(&osb->commit->c_complete);
-        osb->commit->c_task = NULL;
+    if (osb->commit) {
+        if(osb->commit->c_task) {
+            /* Wait for the commit thread */
+            LOG_TRACE_STR ("Waiting for ocfs2commit to exit....");
+            send_sig (SIGINT, osb->commit->c_task, 0);
+            wait_for_completion(&osb->commit->c_complete);
+        }
        ocfs_free(osb->commit);
    }
    
@@ -1808,7 +1809,7 @@
            break;
    }

-
+    commit->c_task = NULL;

        /* Flush all scheduled tasks */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)


--- ocfs2.old/src/nm.c.old    2004-03-23 17:09:29.000000000 +0800
+++ ocfs2/src/nm.c    2004-03-24 10:18:35.000000000 +0800
@@ -118,6 +118,8 @@
        OcfsIpcCtxt.recv_sock = NULL;
    }

+    OcfsIpcCtxt.task = NULL;
+
    /* signal main thread of ipcdlm's exit */
    complete (&(OcfsIpcCtxt.complete));

@@ -249,6 +251,8 @@
    __u64 cfg_seq_num;
    int which, pruned, prune_iters = 0;
    struct buffer_head *bh = NULL;
+    int signr;
+    siginfo_t info;

    LOG_ENTRY ();

@@ -258,6 +262,7 @@

    sprintf (proc, "ocfs2nm-%d", osb->osb_id);
    ocfs_daemonize (proc, strlen(proc));
+    allow_signal(SIGTERM);

    osb->dlm_task = current;

@@ -437,7 +442,11 @@
            osb->hbt = 50 + j;
        }
        set_current_state (TASK_INTERRUPTIBLE);
-        schedule_timeout (osb->hbt - j);
+        if( schedule_timeout (osb->hbt - j) < osb->hbt -j ) {
+            signr = dequeue_signal_lock(current, &current->blocked,
&info);
+            if(signr == SIGTERM)
+                OcfsGlobalCtxt.flags |= OCFS_FLAG_SHUTDOWN_VOL_THREAD;
+        }
    }

        /* Flush all scheduled tasks */
@@ -447,6 +456,8 @@
        flush_scheduled_tasks ();
#endif

+    osb->dlm_task = NULL;
+
    complete (&(osb->dlm_complete));
eek:
    LOG_EXIT_LONG (0);

Mark Fasheh

2004-Mar-24 18:18 UTC

head link

[Ocfs2-devel] Bug 48 "[kernel 2.6 porting] System halt during reboot after mount an OCFS volume." in bugzilla is fixed.

On Wed, Mar 24, 2004 at 11:05:38AM +0800, Sonic Zhang
wrote:> Hi all,
> 
> I successfully root cause and fix bug 48 "[kernel 2.6 porting] System
halt
> during reboot after mount an OCFS volume.".
> 
> In current OCFS v2 driver, ocfs_volume_thread, ocfs_recv_thread and 
> ocfs_commit_thread are assumed to be terminated by the ocfs_dismount_volume
> routine. But, if the system reboots, all processes and kernel threads will 
> receive signal SIGTERM before ocfs_dismount_volume routine is called.  
> 
> These kernel threads don't exit correctly. For example, they don't
know
> they should exit loop after received signal SIGTERM and clear their 
> task_struct pointers in ocfs_super to indiate their status. That's the 
> cause of the system halt in ocfs_dismount_volume routine when system 
> reboots.You're correct that some of these threads aren't handling certain
signals
properly, but I think you're going about it the wrong way. We *want* these
threads to stay alive until the actual umount process begins. In fact, if
you kill the commit_cache thread early, and the volume is still busy (with
metadata updates), you could miss your last flush! This could result in
metadata corruption!

What we really want is that a thread ignores all signals except for a small
set, which it only acts on if a certain flag is set. For the commit thread,
this would be OCFS_JOURNAL_IN_SHUTDOWN. NM thread, I believe already checks
the osb shutdown flags, and doesn't need to be changed. Can you try this
patch instead and tell me if it fixes your problem?

Basically in this patch, commit_thread just checks to make sure that the
volume is being unmounted before setting the "finish" flag. Otherwise,
it
ignores the signal.

I'm a tad green to the in-kernel signal API's so I'm hoping others
will
review what I did and let me know if I'm wrong :)
	--Mark

--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh@oracle.com

Index: journal.c
==================================================================--- journal.c
(revision 806)
+++ journal.c	(working copy)
@@ -1745,6 +1745,7 @@ int ocfs_commit_thread(void *arg)
 	ocfs_commit_task *commit = osb->commit;
 	char name[16];
 	ocfs_journal * journal = &osb->journal;
+	siginfo_t info;
 
 	sprintf (name, "ocfs2cmt-%d", osb->osb_id);
 	ocfs_daemonize (name, strlen(name));
@@ -1765,7 +1766,16 @@ int ocfs_commit_thread(void *arg)
 				LOG_TRACE_STR("FLUSH_EVENT: timed out");
 				break;
 			case -EINTR:
-				finish = 1;
+				/* journal shutdown has asked me to do
+				 * one last commit cache and then exit */
+				if (journal->state == OCFS_JOURNAL_IN_SHUTDOWN)
+					finish = 1;
+				if (signal_pending(current)) {
+					spin_lock_irq(&current->sigmask_lock);
+					/* ignore the actual signal */
+					dequeue_signal(&current->blocked, &info);
+					spin_unlock_irq(&current->sigmask_lock);
+				}
 				LOG_TRACE_STR("FLUSH_EVENT: interrupted");
 				break;
 			case 0:
@@ -1778,7 +1788,7 @@ int ocfs_commit_thread(void *arg)
 
 		if ((OcfsGlobalCtxt.flags & OCFS_FLAG_SHUTDOWN_VOL_THREAD) ||
 		    (osb->osb_flags & OCFS_OSB_FLAGS_BEING_DISMOUNTED))
-			break;
+			finish = 1;
 
 		//if (!osb->needs_flush && status != 0)
 		//	continue;
@@ -1788,18 +1798,13 @@ int ocfs_commit_thread(void *arg)
 
 		if (down_trylock(&osb->trans_lock) != 0) {
 			LOG_TRACE_ARGS("commit thread: trylock failed, miss=%d\n",
misses);
-			if (++misses < OCFS_COMMIT_MISS_MAX)
+			if (++misses < OCFS_COMMIT_MISS_MAX && finish == 0)
 				continue;
 			LOG_TRACE_ARGS("commit thread: about to down\n");
 			down(&osb->trans_lock);
 			misses = 0;
 		}
 
-		/* journal shutdown has asked me to do one last commit cache */
-		/* this commit cache will leave trans lock held! */
-		if (journal->state == OCFS_JOURNAL_IN_SHUTDOWN)
-			finish = 1;
-
 		status = ocfs_commit_cache (osb, false);
 		if (status < 0)
 			LOG_ERROR_STATUS(status);

Ocfs2 devel - Mar 2004 - Bug 48 "[kernel 2.6 porting] System halt during reboot after mount an OCFS volume." in bugzilla is fixed.

[Ocfs2-devel] Bug 48 "[kernel 2.6 porting] System halt during reboot after mount an OCFS volume." in bugzilla is fixed.

[Ocfs2-devel] Bug 48 "[kernel 2.6 porting] System halt during reboot after mount an OCFS volume." in bugzilla is fixed.

[Ocfs2-devel] Bug 48 "[kernel 2.6 porting] System halt during reboot after mount an OCFS volume." in bugzilla is fixed.