Four patches for review. The second patch is a backport of the userspace
clusterstack changes that are already in mainline; it is needed so that the
third patch, which fixes a mount/recovery hang, can be applied.

Sunil
Sunil Mushran
2008-Jul-07 20:26 UTC
[Ocfs2-devel] [PATCH 1/4] ocfs2: Silence an error message in ocfs2_file_aio_read()
This patch silences an EINVAL error message in ocfs2_file_aio_read() that
is always due to a user error.

Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com>
---
 fs/ocfs2/file.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index d4cf056..79c4ee1 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2450,7 +2450,7 @@ static ssize_t __ocfs2_file_aio_read(struct kiocb *iocb,
 	ret = kapi_generic_file_aio_read(iocb, iov, nr_segs, iocb->ki_pos);

 	if (ret == -EINVAL)
-		mlog(ML_ERROR, "generic_file_aio_read returned -EINVAL\n");
+		mlog(0, "generic_file_aio_read returned -EINVAL\n");

 	/* buffered aio wouldn't have proper lock coverage today */
 	BUG_ON(ret == -EIOCBQUEUED && !(filp->f_flags & O_DIRECT));
--
1.5.4.3
Sunil Mushran
2008-Jul-07 20:26 UTC
[Ocfs2-devel] [PATCH 2/4] ocfs2: Change the recovery map to an array of node numbers.
From: Joel Becker <joel.becker at oracle.com>

Mainline commit 553abd046af609191a91af7289d87d477adc659f

The old recovery map was a bitmap of node numbers. This was sufficient
for the maximum node number of 254. Going forward, we want node numbers
to be UINT32. Thus, we need a new recovery map.

Note that we can't keep track of slots here. We must write down the
node number to recover *before* we get the locks needed to convert a
node number into a slot number.

The recovery map is now an array of unsigned ints, max_slots in size.
It moves to journal.c with the rest of recovery.

Because it needs to be initialized, we move all of recovery initialization
into a new function, ocfs2_recovery_init(). This actually cleans up
ocfs2_initialize_super() a little as well. Following on, recovery cleanup
becomes part of ocfs2_recovery_exit().

A number of node map functions are rendered obsolete and are removed.

Finally, waiting on recovery is wrapped in a function rather than naked
checks on the recovery_event. This is a cleanup from Mark.

Signed-off-by: Joel Becker <joel.becker at oracle.com>
Signed-off-by: Mark Fasheh <mfasheh at suse.com>
---
 fs/ocfs2/dlmglue.c   |    6 +-
 fs/ocfs2/heartbeat.c |  111 ------------------------------
 fs/ocfs2/heartbeat.h |   14 ----
 fs/ocfs2/journal.c   |  181 +++++++++++++++++++++++++++++++++++++++++++++----
 fs/ocfs2/journal.h   |    4 +
 fs/ocfs2/ocfs2.h     |    3 +-
 fs/ocfs2/super.c     |   33 ++-------
 7 files changed, 182 insertions(+), 170 deletions(-)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index d7ae5cd..adb1d9d 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1950,8 +1950,7 @@ int ocfs2_inode_lock_full(struct inode *inode,
 		goto local;

 	if (!(arg_flags & OCFS2_META_LOCK_RECOVERY))
-		wait_event(osb->recovery_event,
-			   ocfs2_node_map_is_empty(osb, &osb->recovery_map));
+		ocfs2_wait_for_recovery(osb);

 	lockres = &OCFS2_I(inode)->ip_inode_lockres;
 	level = ex ? LKM_EXMODE : LKM_PRMODE;
@@ -1974,8 +1973,7 @@ int ocfs2_inode_lock_full(struct inode *inode,
 	 * committed to owning this lock so we don't allow signals to
 	 * abort the operation. */
 	if (!(arg_flags & OCFS2_META_LOCK_RECOVERY))
-		wait_event(osb->recovery_event,
-			   ocfs2_node_map_is_empty(osb, &osb->recovery_map));
+		ocfs2_wait_for_recovery(osb);

 local:
 	/*
diff --git a/fs/ocfs2/heartbeat.c b/fs/ocfs2/heartbeat.c
index 0758daf..80de239 100644
--- a/fs/ocfs2/heartbeat.c
+++ b/fs/ocfs2/heartbeat.c
@@ -48,7 +48,6 @@ static inline void __ocfs2_node_map_set_bit(struct ocfs2_node_map *map,
 					    int bit);
 static inline void __ocfs2_node_map_clear_bit(struct ocfs2_node_map *map,
 					      int bit);
-static inline int __ocfs2_node_map_is_empty(struct ocfs2_node_map *map);

 /* special case -1 for now
  * TODO: should *really* make sure the calling func never passes -1!!  */
@@ -62,7 +61,6 @@ static void ocfs2_node_map_init(struct ocfs2_node_map *map)
 void ocfs2_init_node_maps(struct ocfs2_super *osb)
 {
 	spin_lock_init(&osb->node_map_lock);
-	ocfs2_node_map_init(&osb->recovery_map);
 	ocfs2_node_map_init(&osb->osb_recovering_orphan_dirs);
 }

@@ -192,112 +190,3 @@ int ocfs2_node_map_test_bit(struct ocfs2_super *osb,

 	return ret;
 }
-
-static inline int __ocfs2_node_map_is_empty(struct ocfs2_node_map *map)
-{
-	int bit;
-	bit = find_next_bit(map->map, map->num_nodes, 0);
-	if (bit < map->num_nodes)
-		return 0;
-	return 1;
-}
-
-int ocfs2_node_map_is_empty(struct ocfs2_super *osb,
-			    struct ocfs2_node_map *map)
-{
-	int ret;
-	BUG_ON(map->num_nodes == 0);
-	spin_lock(&osb->node_map_lock);
-	ret = __ocfs2_node_map_is_empty(map);
-	spin_unlock(&osb->node_map_lock);
-	return ret;
-}
-
-#if 0
-
-static void __ocfs2_node_map_dup(struct ocfs2_node_map *target,
-				 struct ocfs2_node_map *from)
-{
-	BUG_ON(from->num_nodes == 0);
-	ocfs2_node_map_init(target);
-	__ocfs2_node_map_set(target, from);
-}
-
-/* returns 1 if bit is the only bit set in target, 0 otherwise */
-int ocfs2_node_map_is_only(struct ocfs2_super *osb,
-			   struct ocfs2_node_map *target,
-			   int bit)
-{
-	struct ocfs2_node_map temp;
-	int ret;
-
-	spin_lock(&osb->node_map_lock);
-	__ocfs2_node_map_dup(&temp, target);
-	__ocfs2_node_map_clear_bit(&temp, bit);
-	ret = __ocfs2_node_map_is_empty(&temp);
-	spin_unlock(&osb->node_map_lock);
-
-	return ret;
-}
-
-static void __ocfs2_node_map_set(struct ocfs2_node_map *target,
-				 struct ocfs2_node_map *from)
-{
-	int num_longs, i;
-
-	BUG_ON(target->num_nodes != from->num_nodes);
-	BUG_ON(target->num_nodes == 0);
-
-	num_longs = BITS_TO_LONGS(target->num_nodes);
-	for (i = 0; i < num_longs; i++)
-		target->map[i] = from->map[i];
-}
-
-#endif /*  0  */
-
-/* Returns whether the recovery bit was actually set - it may not be
- * if a node is still marked as needing recovery */
-int ocfs2_recovery_map_set(struct ocfs2_super *osb,
-			   int num)
-{
-	int set = 0;
-
-	spin_lock(&osb->node_map_lock);
-
-	if (!test_bit(num, osb->recovery_map.map)) {
-		__ocfs2_node_map_set_bit(&osb->recovery_map, num);
-		set = 1;
-	}
-
-	spin_unlock(&osb->node_map_lock);
-
-	return set;
-}
-
-void ocfs2_recovery_map_clear(struct ocfs2_super *osb,
-			      int num)
-{
-	ocfs2_node_map_clear_bit(osb, &osb->recovery_map, num);
-}
-
-int ocfs2_node_map_iterate(struct ocfs2_super *osb,
-			   struct ocfs2_node_map *map,
-			   int idx)
-{
-	int i = idx;
-
-	idx = O2NM_INVALID_NODE_NUM;
-	spin_lock(&osb->node_map_lock);
-	if ((i != O2NM_INVALID_NODE_NUM) &&
-	    (i >= 0) &&
-	    (i < map->num_nodes)) {
-		while(i < map->num_nodes) {
-			if (test_bit(i, map->map)) {
-				idx = i;
-				break;
-			}
-			i++;
-		}
-	}
-	spin_unlock(&osb->node_map_lock);
-	return idx;
-}
diff --git a/fs/ocfs2/heartbeat.h b/fs/ocfs2/heartbeat.h
index eac63ae..98d8ffc 100644
--- a/fs/ocfs2/heartbeat.h
+++ b/fs/ocfs2/heartbeat.h
@@ -33,8 +33,6 @@ void ocfs2_stop_heartbeat(struct ocfs2_super *osb);

 /* node map functions - used to keep track of mounted and in-recovery
  * nodes. */
-int ocfs2_node_map_is_empty(struct ocfs2_super *osb,
-			    struct ocfs2_node_map *map);
 void ocfs2_node_map_set_bit(struct ocfs2_super *osb,
 			    struct ocfs2_node_map *map,
 			    int bit);
@@ -44,17 +42,5 @@ void ocfs2_node_map_clear_bit(struct ocfs2_super *osb,
 int ocfs2_node_map_test_bit(struct ocfs2_super *osb,
 			    struct ocfs2_node_map *map,
 			    int bit);
-int ocfs2_node_map_iterate(struct ocfs2_super *osb,
-			   struct ocfs2_node_map *map,
-			   int idx);
-static inline int ocfs2_node_map_first_set_bit(struct ocfs2_super *osb,
-					       struct ocfs2_node_map *map)
-{
-	return ocfs2_node_map_iterate(osb, map, 0);
-}
-int ocfs2_recovery_map_set(struct ocfs2_super *osb,
-			   int num);
-void ocfs2_recovery_map_clear(struct ocfs2_super *osb,
-			      int num);

 #endif /* OCFS2_HEARTBEAT_H */
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 741a4e2..c39eb12 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -64,6 +64,137 @@ static int ocfs2_recover_orphans(struct ocfs2_super *osb,
 				 int slot);
 static int ocfs2_commit_thread(void *arg);

+
+/*
+ * The recovery_list is a simple linked list of node numbers to recover.
+ * It is protected by the recovery_lock.
+ */
+
+struct ocfs2_recovery_map {
+	int rm_used;
+	unsigned int *rm_entries;
+};
+
+int ocfs2_recovery_init(struct ocfs2_super *osb)
+{
+	struct ocfs2_recovery_map *rm;
+
+	mutex_init(&osb->recovery_lock);
+	osb->disable_recovery = 0;
+	osb->recovery_thread_task = NULL;
+	init_waitqueue_head(&osb->recovery_event);
+
+	rm = kzalloc(sizeof(struct ocfs2_recovery_map) +
+		     osb->max_slots * sizeof(unsigned int),
+		     GFP_KERNEL);
+	if (!rm) {
+		mlog_errno(-ENOMEM);
+		return -ENOMEM;
+	}
+
+	rm->rm_entries = (unsigned int *)((char *)rm +
+					  sizeof(struct ocfs2_recovery_map));
+	osb->recovery_map = rm;
+
+	return 0;
+}
+
+/* we can't grab the goofy sem lock from inside wait_event, so we use
+ * memory barriers to make sure that we'll see the null task before
+ * being woken up */
+static int ocfs2_recovery_thread_running(struct ocfs2_super *osb)
+{
+	mb();
+	return osb->recovery_thread_task != NULL;
+}
+
+void ocfs2_recovery_exit(struct ocfs2_super *osb)
+{
+	struct ocfs2_recovery_map *rm;
+
+	/* disable any new recovery threads and wait for any currently
+	 * running ones to exit. Do this before setting the vol_state. */
+	mutex_lock(&osb->recovery_lock);
+	osb->disable_recovery = 1;
+	mutex_unlock(&osb->recovery_lock);
+	wait_event(osb->recovery_event, !ocfs2_recovery_thread_running(osb));
+
+	/* At this point, we know that no more recovery threads can be
+	 * launched, so wait for any recovery completion work to
+	 * complete. */
+	flush_workqueue(ocfs2_wq);
+
+	/*
+	 * Now that recovery is shut down, and the osb is about to be
+	 * freed, the osb_lock is not taken here.
+	 */
+	rm = osb->recovery_map;
+	/* XXX: Should we bug if there are dirty entries? */
+
+	kfree(rm);
+}
+
+static int __ocfs2_recovery_map_test(struct ocfs2_super *osb,
+				     unsigned int node_num)
+{
+	int i;
+	struct ocfs2_recovery_map *rm = osb->recovery_map;
+
+	assert_spin_locked(&osb->osb_lock);
+
+	for (i = 0; i < rm->rm_used; i++) {
+		if (rm->rm_entries[i] == node_num)
+			return 1;
+	}
+
+	return 0;
+}
+
+/* Behaves like test-and-set.  Returns the previous value */
+static int ocfs2_recovery_map_set(struct ocfs2_super *osb,
+				  unsigned int node_num)
+{
+	struct ocfs2_recovery_map *rm = osb->recovery_map;
+
+	spin_lock(&osb->osb_lock);
+	if (__ocfs2_recovery_map_test(osb, node_num)) {
+		spin_unlock(&osb->osb_lock);
+		return 1;
+	}
+
+	/* XXX: Can this be exploited? Not from o2dlm... */
+	BUG_ON(rm->rm_used >= osb->max_slots);
+
+	rm->rm_entries[rm->rm_used] = node_num;
+	rm->rm_used++;
+	spin_unlock(&osb->osb_lock);
+
+	return 0;
+}
+
+static void ocfs2_recovery_map_clear(struct ocfs2_super *osb,
+				     unsigned int node_num)
+{
+	int i;
+	struct ocfs2_recovery_map *rm = osb->recovery_map;
+
+	spin_lock(&osb->osb_lock);
+
+	for (i = 0; i < rm->rm_used; i++) {
+		if (rm->rm_entries[i] == node_num)
+			break;
+	}
+
+	if (i < rm->rm_used) {
+		/* XXX: be careful with the pointer math */
+		memmove(&(rm->rm_entries[i]), &(rm->rm_entries[i + 1]),
+			(rm->rm_used - i - 1) * sizeof(unsigned int));
+		rm->rm_used--;
+	}
+
+	spin_unlock(&osb->osb_lock);
+}
+
 static int ocfs2_commit_cache(struct ocfs2_super *osb)
 {
 	int status = 0;
@@ -654,6 +785,23 @@ bail:
 	return status;
 }

+static int ocfs2_recovery_completed(struct ocfs2_super *osb)
+{
+	int empty;
+	struct ocfs2_recovery_map *rm = osb->recovery_map;
+
+	spin_lock(&osb->osb_lock);
+	empty = (rm->rm_used == 0);
+	spin_unlock(&osb->osb_lock);
+
+	return empty;
+}
+
+void ocfs2_wait_for_recovery(struct ocfs2_super *osb)
+{
+	wait_event(osb->recovery_event, ocfs2_recovery_completed(osb));
+}
+
 /*
  * JBD Might read a cached version of another nodes journal file. We
  * don't want this as this file changes often and we get no
@@ -852,6 +1000,7 @@ static int __ocfs2_recovery_thread(void *arg)
 {
 	int status, node_num;
 	struct ocfs2_super *osb = arg;
+	struct ocfs2_recovery_map *rm = osb->recovery_map;

 	mlog_entry_void();

@@ -867,26 +1016,29 @@ restart:
 		goto bail;
 	}

-	while(!ocfs2_node_map_is_empty(osb, &osb->recovery_map)) {
-		node_num = ocfs2_node_map_first_set_bit(osb,
-							&osb->recovery_map);
-		if (node_num == O2NM_INVALID_NODE_NUM) {
-			mlog(0, "Out of nodes to recover.\n");
-			break;
-		}
+	spin_lock(&osb->osb_lock);
+	while (rm->rm_used) {
+		/* It's always safe to remove entry zero, as we won't
+		 * clear it until ocfs2_recover_node() has succeeded. */
+		node_num = rm->rm_entries[0];
+		spin_unlock(&osb->osb_lock);

 		status = ocfs2_recover_node(osb, node_num);
-		if (status < 0) {
+		if (!status) {
+			ocfs2_recovery_map_clear(osb, node_num);
+		} else {
 			mlog(ML_ERROR,
 			     "Error %d recovering node %d on device (%u,%u)!\n",
 			     status, node_num,
 			     MAJOR(osb->sb->s_dev), MINOR(osb->sb->s_dev));
 			mlog(ML_ERROR, "Volume requires unmount.\n");
-			continue;
 		}

-		ocfs2_recovery_map_clear(osb, node_num);
+		spin_lock(&osb->osb_lock);
 	}
+	spin_unlock(&osb->osb_lock);
+	mlog(0, "All nodes recovered\n");
+
 	ocfs2_super_unlock(osb, 1);

 	/* We always run recovery on our own orphan dir - the dead
@@ -897,8 +1049,7 @@ restart:

 bail:
 	mutex_lock(&osb->recovery_lock);
-	if (!status &&
-	    !ocfs2_node_map_is_empty(osb, &osb->recovery_map)) {
+	if (!status && !ocfs2_recovery_completed(osb)) {
 		mutex_unlock(&osb->recovery_lock);
 		goto restart;
 	}
@@ -928,8 +1079,8 @@ void ocfs2_recovery_thread(struct ocfs2_super *osb, int node_num)

 	/* People waiting on recovery will wait on
 	 * the recovery map to empty. */
-	if (!ocfs2_recovery_map_set(osb, node_num))
-		mlog(0, "node %d already be in recovery.\n", node_num);
+	if (ocfs2_recovery_map_set(osb, node_num))
+		mlog(0, "node %d already in recovery map.\n", node_num);

 	mlog(0, "starting recovery thread...\n");

@@ -1202,7 +1353,7 @@ int ocfs2_mark_dead_nodes(struct ocfs2_super *osb)
 			continue;

 		node_num = si->si_global_node_nums[i];
-		if (ocfs2_node_map_test_bit(osb, &osb->recovery_map, node_num))
+		if (__ocfs2_recovery_map_test(osb, node_num))
 			continue;
 		spin_unlock(&si->si_lock);
diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h
index 52f02fe..986f71f 100644
--- a/fs/ocfs2/journal.h
+++ b/fs/ocfs2/journal.h
@@ -134,6 +134,10 @@ static inline void ocfs2_inode_set_new(struct ocfs2_super *osb,

 /* Exported only for the journal struct init code in super.c. Do not call. */
 void ocfs2_complete_recovery(kapi_work_struct_t *work);
+void ocfs2_wait_for_recovery(struct ocfs2_super *osb);
+
+int ocfs2_recovery_init(struct ocfs2_super *osb);
+void ocfs2_recovery_exit(struct ocfs2_super *osb);

 /*
  * Journal Control:
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 9546470..5214c00 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -183,6 +183,7 @@ enum ocfs2_mount_options
 #define OCFS2_DEFAULT_ATIME_QUANTUM	60

 struct ocfs2_journal;
+struct ocfs2_recovery_map;
 struct ocfs2_super
 {
 	struct task_struct *commit_task;
@@ -194,7 +195,6 @@ struct ocfs2_super
 	struct ocfs2_slot_info *slot_info;

 	spinlock_t node_map_lock;
-	struct ocfs2_node_map recovery_map;

 	u64 root_blkno;
 	u64 system_dir_blkno;
@@ -232,6 +232,7 @@ struct ocfs2_super
 	atomic_t vol_state;
 	struct mutex recovery_lock;
+	struct ocfs2_recovery_map *recovery_map;
 	struct task_struct *recovery_thread_task;
 	int disable_recovery;
 	wait_queue_head_t checkpoint_event;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index ff31722..fcb7502 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1296,15 +1296,6 @@ leave:
 	return status;
 }

-/* we can't grab the goofy sem lock from inside wait_event, so we use
- * memory barriers to make sure that we'll see the null task before
- * being woken up */
-static int ocfs2_recovery_thread_running(struct ocfs2_super *osb)
-{
-	mb();
-	return osb->recovery_thread_task != NULL;
-}
-
 static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)
 {
 	int tmp;
@@ -1321,17 +1312,8 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)

 	ocfs2_truncate_log_shutdown(osb);

-	/* disable any new recovery threads and wait for any currently
-	 * running ones to exit. Do this before setting the vol_state. */
-	mutex_lock(&osb->recovery_lock);
-	osb->disable_recovery = 1;
-	mutex_unlock(&osb->recovery_lock);
-	wait_event(osb->recovery_event, !ocfs2_recovery_thread_running(osb));
-
-	/* At this point, we know that no more recovery threads can be
-	 * launched, so wait for any recovery completion work to
-	 * complete. */
-	flush_workqueue(ocfs2_wq);
+	/* This will disable recovery and flush any recovery work. */
+	ocfs2_recovery_exit(osb);

 	ocfs2_journal_shutdown(osb);

@@ -1440,7 +1422,6 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	osb->s_sectsize_bits = blksize_bits(sector_size);
 	BUG_ON(!osb->s_sectsize_bits);

-	init_waitqueue_head(&osb->recovery_event);
 	spin_lock_init(&osb->dc_task_lock);
 	init_waitqueue_head(&osb->dc_event);
 	osb->dc_work_sequence = 0;
@@ -1461,10 +1442,12 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	snprintf(osb->dev_str, sizeof(osb->dev_str), "%u,%u",
 		 MAJOR(osb->sb->s_dev), MINOR(osb->sb->s_dev));

-	mutex_init(&osb->recovery_lock);
-
-	osb->disable_recovery = 0;
-	osb->recovery_thread_task = NULL;
+	status = ocfs2_recovery_init(osb);
+	if (status) {
+		mlog(ML_ERROR, "Unable to initialize recovery state\n");
+		mlog_errno(status);
+		goto bail;
+	}

 	init_waitqueue_head(&osb->checkpoint_event);
 	atomic_set(&osb->needs_checkpoint, 0);
--
1.5.4.3
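[Editor's note] As a review aid, here is a small userspace model of the
array-based recovery map semantics the patch introduces. A pthread mutex
stands in for osb->osb_lock and MAX_SLOTS for osb->max_slots; all names
are illustrative assumptions, not the kernel's, and this is a sketch of
the semantics rather than the actual ocfs2 code.

/* Userspace model of the array-based recovery map: membership test,
 * test-and-set style insert, and memmove-compacting removal. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_SLOTS 8

struct recovery_map {
	pthread_mutex_t lock;          /* stands in for osb_lock */
	int used;                      /* number of live entries */
	unsigned int entries[MAX_SLOTS];
};

/* Returns 1 if node_num is already in the map.  Caller holds the lock. */
static int map_test(struct recovery_map *rm, unsigned int node_num)
{
	int i;

	for (i = 0; i < rm->used; i++)
		if (rm->entries[i] == node_num)
			return 1;
	return 0;
}

/* Behaves like test-and-set: returns the previous membership state. */
static int map_set(struct recovery_map *rm, unsigned int node_num)
{
	pthread_mutex_lock(&rm->lock);
	if (map_test(rm, node_num)) {
		pthread_mutex_unlock(&rm->lock);
		return 1;
	}
	rm->entries[rm->used++] = node_num;
	pthread_mutex_unlock(&rm->lock);
	return 0;
}

/* Removes node_num by shifting the tail of the array down one slot. */
static void map_clear(struct recovery_map *rm, unsigned int node_num)
{
	int i;

	pthread_mutex_lock(&rm->lock);
	for (i = 0; i < rm->used; i++)
		if (rm->entries[i] == node_num)
			break;
	if (i < rm->used) {
		memmove(&rm->entries[i], &rm->entries[i + 1],
			(rm->used - i - 1) * sizeof(unsigned int));
		rm->used--;
	}
	pthread_mutex_unlock(&rm->lock);
}

int main(void)
{
	struct recovery_map rm = { .lock = PTHREAD_MUTEX_INITIALIZER };

	printf("set 7 -> was set? %d\n", map_set(&rm, 7));   /* 0 */
	printf("set 7 -> was set? %d\n", map_set(&rm, 7));   /* 1 */
	map_clear(&rm, 7);
	printf("used after clear: %d\n", rm.used);           /* 0, single-threaded here */
	return 0;
}

Note how the array bounds the map at max_slots entries rather than at the
255-node bitmap limit, which is the whole point of the change.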
Sunil Mushran
2008-Jul-07 20:26 UTC
[Ocfs2-devel] [PATCH 3/4] ocfs2: Fixes tiny race between mount and recovery
This patch fixes a tiny race between recovery and mount that could otherwise
lead to a hang.

During mount, the node takes an EX on the superblock and the journal/slot
locks, and, if needed, recovers the slot it is assigned. However, it only
marks the other dirty slots and leaves their recovery to the recovery thread
that does so after the mount process has completed.

The last step of the mount process calls for it to drop the EX on the
superblock lock. The node holds onto the same lock level on its journal/slot
lock.

The recovery thread then picks up the EX on the superblock lock and then
takes the same lock for the journal/slot it is recovering.

This exposes a tiny race, as another node mounting the volume could race the
recovery thread for the EX on the superblock lock.

If that happens, that node could be assigned a dirty slot which it would
recover. It too will then drop the EX on the superblock lock but hold onto
the same lock level on its journal/slot lock.

Now when the recovery thread on the first node gets back the EX on the
superblock lock, it will promptly attempt to get the EX on the journal/slot
lock of the node it thinks is dirty, but that slot has since not only been
recovered but also been locked by another node.

Hang.

The fix here is to make the journal/slot EX lock request during recovery a
trylock operation. If the trylock fails, it indicates that that slot is in
use and thus already recovered.

This bug appears to have been inadvertently introduced during the
mount/umount vote removal by mainline commit
34d024f84345807bf44163fac84e921513dde323. In the older voting scheme, the
voting thread would remove the node being mounted from the recovery map.

Mark Fasheh <mfasheh at suse.com> wrote the initial draft of this fix.

Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com>
---
 fs/ocfs2/journal.c |   80 +++++++++++++++++++++++++++++++++++++++++----------
 1 files changed, 64 insertions(+), 16 deletions(-)

diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index c39eb12..3b14d1f 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -28,6 +28,7 @@
 #include <linux/slab.h>
 #include <linux/highmem.h>
 #include <linux/kthread.h>
+#include <linux/delay.h>

 #define MLOG_MASK_PREFIX ML_JOURNAL
 #include <cluster/masklog.h>
@@ -70,9 +71,14 @@ static int ocfs2_commit_thread(void *arg);
  * It is protected by the recovery_lock.
  */

+struct ocfs2_recovery_item {
+	unsigned int ri_node_num;
+	unsigned int ri_count;
+};
+
 struct ocfs2_recovery_map {
 	int rm_used;
-	unsigned int *rm_entries;
+	struct ocfs2_recovery_item *rm_entries;
 };

 int ocfs2_recovery_init(struct ocfs2_super *osb)
@@ -85,15 +91,15 @@ int ocfs2_recovery_init(struct ocfs2_super *osb)
 	init_waitqueue_head(&osb->recovery_event);

 	rm = kzalloc(sizeof(struct ocfs2_recovery_map) +
-		     osb->max_slots * sizeof(unsigned int),
+		     osb->max_slots * sizeof(struct ocfs2_recovery_item),
 		     GFP_KERNEL);
 	if (!rm) {
 		mlog_errno(-ENOMEM);
 		return -ENOMEM;
 	}

-	rm->rm_entries = (unsigned int *)((char *)rm +
-					  sizeof(struct ocfs2_recovery_map));
+	rm->rm_entries = (struct ocfs2_recovery_item *)
+			((char *)rm + sizeof(struct ocfs2_recovery_map));
 	osb->recovery_map = rm;

 	return 0;
@@ -143,7 +149,7 @@ static int __ocfs2_recovery_map_test(struct ocfs2_super *osb,
 	assert_spin_locked(&osb->osb_lock);

 	for (i = 0; i < rm->rm_used; i++) {
-		if (rm->rm_entries[i] == node_num)
+		if (rm->rm_entries[i].ri_node_num == node_num)
 			return 1;
 	}

@@ -155,6 +161,7 @@ static int ocfs2_recovery_map_set(struct ocfs2_super *osb,
 				  unsigned int node_num)
 {
 	struct ocfs2_recovery_map *rm = osb->recovery_map;
+	int i;

 	spin_lock(&osb->osb_lock);
 	if (__ocfs2_recovery_map_test(osb, node_num)) {
@@ -165,8 +172,17 @@ static int ocfs2_recovery_map_set(struct ocfs2_super *osb,
 	/* XXX: Can this be exploited? Not from o2dlm... */
 	BUG_ON(rm->rm_used >= osb->max_slots);

-	rm->rm_entries[rm->rm_used] = node_num;
+	for (i = 0; i < rm->rm_used; i++) {
+		if (rm->rm_entries[i].ri_node_num == node_num) {
+			rm->rm_entries[i].ri_count++;
+			goto out;
+		}
+	}
+
+	rm->rm_entries[rm->rm_used].ri_node_num = node_num;
+	rm->rm_entries[rm->rm_used].ri_count = 1;
 	rm->rm_used++;
+out:
 	spin_unlock(&osb->osb_lock);

 	return 0;
@@ -181,17 +197,21 @@ static void ocfs2_recovery_map_clear(struct ocfs2_super *osb,
 	spin_lock(&osb->osb_lock);

 	for (i = 0; i < rm->rm_used; i++) {
-		if (rm->rm_entries[i] == node_num)
+		if (rm->rm_entries[i].ri_node_num == node_num) {
+			rm->rm_entries[i].ri_count--;
+			if (rm->rm_entries[i].ri_count)
+				goto out;
 			break;
+		}
 	}

 	if (i < rm->rm_used) {
 		/* XXX: be careful with the pointer math */
 		memmove(&(rm->rm_entries[i]), &(rm->rm_entries[i + 1]),
-			(rm->rm_used - i - 1) * sizeof(unsigned int));
+			(rm->rm_used - i - 1) * sizeof(struct ocfs2_recovery_item));
 		rm->rm_used--;
 	}
-
+out:
 	spin_unlock(&osb->osb_lock);
 }

@@ -1020,7 +1040,7 @@ restart:
 	while (rm->rm_used) {
 		/* It's always safe to remove entry zero, as we won't
 		 * clear it until ocfs2_recover_node() has succeeded. */
-		node_num = rm->rm_entries[0];
+		node_num = rm->rm_entries[0].ri_node_num;
 		spin_unlock(&osb->osb_lock);

 		status = ocfs2_recover_node(osb, node_num);
@@ -1105,7 +1125,8 @@ out:
  * clean.  Will only replay if the journal inode is marked dirty. */
 static int ocfs2_replay_journal(struct ocfs2_super *osb,
 				int node_num,
-				int slot_num)
+				int slot_num,
+				int *busy)
 {
 	int status;
 	int got_lock = 0;
@@ -1115,6 +1136,8 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
 	journal_t *journal = NULL;
 	struct buffer_head *bh = NULL;

+	*busy = 0;
+
 	inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE,
 					    slot_num);
 	if (inode == NULL) {
@@ -1131,8 +1154,23 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
 	}
 	SET_INODE_JOURNAL(inode);

-	status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY);
+	status = ocfs2_inode_lock_full(inode, &bh, 1,
+			OCFS2_META_LOCK_RECOVERY|OCFS2_META_LOCK_NOQUEUE);
 	if (status < 0) {
+		/*
+		 * As fs recovery is asynchronous, there is a small possibility
+		 * that the slot being recovered has been assigned to another
+		 * node that happened to be mounting the volume at that same
+		 * time. If so, the journal replay will be handled by that node.
+		 * A failed trylock on the journal indicates that the slot has
+		 * been assigned to another node.
+		 */
+		if (status == -EAGAIN) {
+			*busy = 1;
+			status = 0;
+			goto done;
+		}
+
 		mlog(0, "status returned from ocfs2_inode_lock=%d\n", status);
 		if (status != -ERESTARTSYS)
 			mlog(ML_ERROR, "Could not lock journal!\n");
@@ -1232,7 +1270,7 @@ done:
 static int ocfs2_recover_node(struct ocfs2_super *osb,
 			      int node_num)
 {
-	int status = 0;
+	int status = 0, busy = 0;
 	int slot_num;
 	struct ocfs2_slot_info *si = osb->slot_info;
 	struct ocfs2_dinode *la_copy = NULL;
@@ -1256,11 +1294,16 @@ static int ocfs2_recover_node(struct ocfs2_super *osb,

 	mlog(0, "node %d was using slot %d\n", node_num, slot_num);

-	status = ocfs2_replay_journal(osb, node_num, slot_num);
+	status = ocfs2_replay_journal(osb, node_num, slot_num, &busy);
 	if (status < 0) {
 		mlog_errno(status);
 		goto done;
 	}
+	if (busy) {
+		mlog(0, "Skipping recovery for slot %u (node %u) "
+		     "as another node has recovered it\n", slot_num, node_num);
+		goto done;
+	}

 	/* Stamp a clean local alloc file AFTER recovering the journal... */
 	status = ocfs2_begin_local_alloc_recovery(osb, slot_num, &la_copy);
@@ -1339,7 +1382,7 @@ bail:
  * slot info struct has been updated from disk. */
 int ocfs2_mark_dead_nodes(struct ocfs2_super *osb)
 {
-	int status, i, node_num;
+	int status, i, node_num, inreco;
 	struct ocfs2_slot_info *si = osb->slot_info;

 	/* This is called with the super block cluster lock, so we
@@ -1353,8 +1396,13 @@ int ocfs2_mark_dead_nodes(struct ocfs2_super *osb)
 			continue;

 		node_num = si->si_global_node_nums[i];
-		if (__ocfs2_recovery_map_test(osb, node_num))
+
+		spin_lock(&osb->osb_lock);
+		inreco = __ocfs2_recovery_map_test(osb, node_num);
+		spin_unlock(&osb->osb_lock);
+		if (inreco)
 			continue;
+
 		spin_unlock(&si->si_lock);

 		/* Ok, we have a slot occupied by another node which
--
1.5.4.3
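[Editor's note] To illustrate the trylock idea in isolation, here is a
userspace sketch in which pthread_mutex_trylock stands in for the
OCFS2_META_LOCK_NOQUEUE cluster lock request, and an EBUSY return plays
the role of -EAGAIN. All names are illustrative assumptions; this models
the decision logic only, not the actual cluster locking.

/* Sketch: a failed trylock on the journal/slot lock is read as "another
 * mounted node owns this slot, so it has already been recovered". */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t journal_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 on success; *busy is set if the slot is held elsewhere. */
static int replay_journal(int slot_num, int *busy)
{
	*busy = 0;

	if (pthread_mutex_trylock(&journal_lock) == EBUSY) {
		/* Analogue of -EAGAIN from the NOQUEUE lock request:
		 * the slot is in use, so skip the replay. */
		*busy = 1;
		return 0;
	}

	printf("replaying journal for slot %d\n", slot_num);
	pthread_mutex_unlock(&journal_lock);
	return 0;
}

int main(void)
{
	int busy;

	replay_journal(0, &busy);
	printf("slot 0 busy? %d\n", busy);       /* 0: we replayed it */

	pthread_mutex_lock(&journal_lock);       /* another "node" mounts a slot */
	replay_journal(1, &busy);
	printf("slot 1 busy? %d\n", busy);       /* 1: skip, owner will replay */
	pthread_mutex_unlock(&journal_lock);
	return 0;
}

As the follow-up discussion below points out, the subtlety is that a
failed trylock can also mean the dlm has not yet evicted the dead node,
which is why the thread later moves toward a generation-counter check.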
Sunil Mushran
2008-Jul-07 20:26 UTC
[Ocfs2-devel] [PATCH 4/4] ocfs2/dlm: Fixes oops in dlm_new_lockres()
This patch fixes a race that can result in an oops while adding a lockres
to the dlm lockres tracking list.

Bug introduced by mainline commit 29576f8bb54045be944ba809d4fca1ad77c94165.

Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com>
---
 fs/ocfs2/dlm/dlmmaster.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/ocfs2/dlm/dlmmaster.c b/fs/ocfs2/dlm/dlmmaster.c
index 24fb8e4..9d3472e 100644
--- a/fs/ocfs2/dlm/dlmmaster.c
+++ b/fs/ocfs2/dlm/dlmmaster.c
@@ -606,7 +606,9 @@ static void dlm_init_lockres(struct dlm_ctxt *dlm,
 	res->last_used = 0;

+	spin_lock(&dlm->spinlock);
 	list_add_tail(&res->tracking, &dlm->tracking_list);
+	spin_unlock(&dlm->spinlock);

 	memset(res->lvb, 0, DLM_LVB_LEN);
 	memset(res->refmap, 0, sizeof(res->refmap));
--
1.5.4.3
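[Editor's note] The race being fixed is the classic unsynchronized
list-insert problem. Below is a userspace sketch in which a pthread mutex
plays the role of dlm->spinlock; remove the lock/unlock pair and the two
inserters can interleave their pointer updates, losing nodes or corrupting
the list, which is the kind of corruption behind the oops. All names are
illustrative.

/* Two threads inserting into a shared list; the mutex keeps the
 * head-pointer updates atomic with respect to each other. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
	struct node *next;
};

static struct node *head;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void *inserter(void *arg)
{
	for (int i = 0; i < 100000; i++) {
		struct node *n = malloc(sizeof(*n));

		pthread_mutex_lock(&list_lock);   /* analogue of dlm->spinlock */
		n->next = head;
		head = n;
		pthread_mutex_unlock(&list_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;
	long count = 0;

	pthread_create(&a, NULL, inserter, NULL);
	pthread_create(&b, NULL, inserter, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	for (struct node *n = head; n; n = n->next)
		count++;
	printf("expected 200000, got %ld\n", count);  /* always 200000 with the lock */
	return 0;
}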
Sunil Mushran
2008-Jul-08 17:08 UTC
[Ocfs2-devel] [PATCH 3/4] ocfs2: Fixes tiny race between mount and recovery
Instead of using i_generation, we could use ij_pad, renamed to something
appropriate.

Sunil Mushran wrote:
> ok.. going back to the drawing board. I am putting in writing the
> scheme the two of us thrashed out here. Please review.
>
> Problem:
> How does a recovery thread know that the slot it is trying to recover
> has already been mounted by another node (and thus already recovered)?
>
> Our first solution was to trylock the journal lock, with the assumption
> that if it failed, the slot was recovered by another node.
>
> As Joel pointed out, the problem here is that since the fs is informed
> of node death before the dlm, a failed trylock on the journal could
> just mean the dlm has not yet registered the node death. Skipping
> journal replay in those circumstances would be asking for trouble.
> So we have to make a blocking request... but if we do that blindly,
> we could hang, as another node could have mounted that slot.
>
> Possible Solution:
> Use the i_generation in the journal inode as a recovery counter.
> Meaning, we increment it each time we recover the journal. That
> is, from the mount thread in ocfs2_journal_load() and from the recovery
> thread in ocfs2_replay_journal(). In both places, we hold the
> superblock lock, making the writes safe.
>
> The mount thread will read the generation numbers and save them. In the
> recovery thread, after taking the superblock lock, the thread reads
> the journal inode. If it finds the generation has changed, it can
> safely skip the recovery. If not, it goes ahead with the blocking lock.
>
> As all nodes get the down event, they all go through the recovery
> process, if only to detect that no recovery is required. We will use
> that to keep the recovery generation for all journal slots up to date.
>
> Sunil
>
> Joel Becker wrote:
>> On Mon, Jul 07, 2008 at 01:26:19PM -0700, Sunil Mushran wrote:
>>
>>> This patch fixes a tiny race between recovery and mount that could
>>> otherwise lead to a hang.
>>
>> First off, does this still exist in mainline?
>>
>> <snip>
>>
>>> The last step of the mount process calls for it to drop the EX on
>>> the superblock lock. The node holds onto the same lock level on its
>>> journal/slot lock.
>>>
>>> The recovery thread then picks up the EX on the superblock lock and
>>> then takes the same lock for the journal/slot it is recovering.
>>>
>>> This exposes a tiny race as another node mounting the volume could
>>> race the recovery thread for the EX on the superblock lock.
>>>
>>> If that happens, that node could be assigned a dirty slot which it
>>> would recover. It too will then drop the EX on the superblock lock
>>> but hold onto the same lock level on its journal/slot lock.
>>>
>>> Now when the recovery thread on the first node gets back the EX on
>>> the superblock lock, it will promptly attempt to get the EX on the
>>> journal/slot lock of the node it thinks is dirty, but that slot has
>>> since not only been recovered but also been locked by another node.
>>>
>>> Hang.
>>>
>>> The fix here is to make the journal/slot EX lock request during
>>> recovery a trylock operation. If the trylock fails, it indicates
>>> that that slot is in use and thus recovered.
>>
>> How do we distinguish this from a node that hasn't been evicted
>> yet? If you recall, we first tried to eliminate the recovery map
>> completely, having recovery just trylock every journal to find the
>> dead node. But if the node hasn't been evicted, the trylock fails.
>> Isn't this the same thing?
>>
>> Let me try putting it another way. The recovery thread just
>> sees entries. It doesn't know whether they came from mount or from a
>> dead node event. So, we have an entry. Sure, it could be an entry that
>> was locked during mount, then unlocked, and now we're trying to relock.
>> The only way the trylock fails is if someone else recovered it. That's
>> safe. If it came from another path, though, we expect to sleep on that
>> lock - that's how we know the dlm has evicted it, because the lock
>> blocks until then. How do we determine this is the case and sleep on
>> it? Otherwise, our trylock fails, we skip it, then the dlm evicts it,
>> and no one has recovered it. We then continue assuming the journal was
>> replayed, which it was not.
>>
>> I'm missing something, right?
>>
>> Joel
Joel Becker
2008-Jul-08 18:17 UTC
[Ocfs2-devel] [PATCH 3/4] ocfs2: Fixes tiny race between mount and recovery
On Tue, Jul 08, 2008 at 10:08:38AM -0700, Sunil Mushran wrote:
> Instead of using generation, we could use ij_pad. Renamed to
> something appropriate.

I like that, as it is specific to the journal part and leaves
i_generation alone. Can we see ourselves needing any other use of that
word?

Joel

> Sunil Mushran wrote:
>> Possible Solution:
>> Use the i_generation in the journal inode as a recovery counter.
>> Meaning, we increment it each time we recover the journal. That
>> is, from the mount thread in ocfs2_journal_load() and from the recovery
>> thread in ocfs2_replay_journal(). In both places, we hold the
>> superblock lock, making the writes safe.

--
"When ideas fail, words come in very handy."
        - Goethe

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
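[Editor's note] For readers following the thread, here is a minimal
userspace sketch of the recovery-generation scheme being proposed. The
per-slot counter stands in for the repurposed journal-inode pad field
(the ij_pad rename discussed above); the single-process model, the
function names, and the in-memory "on-disk" array are all illustrative
assumptions about a proposal under discussion, not committed ocfs2 code.

/* Model: mount snapshots each slot's recovery generation; the recovery
 * thread re-reads it under the superblock lock and skips slots whose
 * generation has moved, since another node already replayed them. */
#include <stdio.h>

#define MAX_SLOTS 4

/* "On-disk" recovery generation for each journal slot. */
static unsigned int slot_generation[MAX_SLOTS];

/* Snapshot taken by the mount thread while it holds the superblock EX. */
static unsigned int saved_generation[MAX_SLOTS];

static void mount_snapshot(void)
{
	for (int i = 0; i < MAX_SLOTS; i++)
		saved_generation[i] = slot_generation[i];
}

/* Called with the superblock lock held.  If the generation moved since
 * mount, another node has already replayed this journal. */
static void recover_slot(int slot)
{
	if (slot_generation[slot] != saved_generation[slot]) {
		printf("slot %d already recovered elsewhere, skipping\n", slot);
		return;
	}

	/* Generation unchanged: safe to take the blocking journal lock,
	 * replay, and bump the counter on success. */
	printf("replaying journal for slot %d\n", slot);
	slot_generation[slot]++;
	saved_generation[slot] = slot_generation[slot];
}

int main(void)
{
	mount_snapshot();

	recover_slot(0);            /* we win the race: replay and bump */

	slot_generation[1]++;       /* another node replayed slot 1 */
	recover_slot(1);            /* generation moved: safely skip */

	return 0;
}

This addresses Joel's objection to the bare trylock: a stale generation
means the slot genuinely still needs the blocking lock (the dlm may not
have evicted the dead node yet), while a changed generation is positive
evidence that the replay already happened.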