thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support [Jul 2022]

Heming Zhao

2022-Jul-30 01:14 UTC

[Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support

This patch series re-enables the ocfs2 non-clustered mount feature.

The earlier commit c80af0c250c8 (Revert "ocfs2: mount shared volume
without ha stack") reverted Gang's non-clustered mount patch. This
series brings that feature back.

The key difference between local mount and non-clustered mount:
the local mount feature (tunefs.ocfs2 --fs-features=[no]local) cannot
be switched on or off without the ha stack, while the non-clustered
mount feature runs entirely without the ha stack.

To avoid data corruption when non-clustered and clustered mounts
happen at the same time, this series also introduces an MMP feature.
The MMP (Multiple Mount Protection) idea comes from ext4 MMP
(fs/ext4/mmp.c), which protects a filesystem from being mounted more
than once. Because ocfs2 is a clustered filesystem, and to stay
compatible with the existing slotmap feature, I made some
optimizations and modifications when porting ext4 MMP to ocfs2.

The related userspace code for MMP support has been sent to GitHub
for review:
- https://github.com/markfasheh/ocfs2-tools/pull/58

Enabling MMP with ocfs2-tools and checking its status:

```
# enable MMP
tunefs.ocfs2 --fs-features=mmp /dev/vdb

# check the command result
tunefs.ocfs2 -Q "%H\n" /dev/vdb | grep MMP

# activate MMP with a nocluster mount
mount -t ocfs2 -o nocluster /dev/vdb /mnt

# check slotmap info
echo slotmap | PAGER=cat debugfs.ocfs2 /dev/vdb
```

=== below are test cases for patches ===
<1> non-clustered mount vs local mount

1.1 tunefs.ocfs2 can't convert local/nolocal mount without ha stack.

```
(on ha stack env)
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=local /dev/vdb  (<== success)
tunefs.ocfs2 --fs-features=nolocal /dev/vdb  (<== success)
(on another node without ha stack)
tunefs.ocfs2 --fs-features=local /dev/vdb  (<== failure)
```

1.2 non-cluster feature can run without ha stack.
```
(on ha stack env)
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
(on another node without ha stack)
mount -t ocfs2 -o nocluster /dev/vdb /mnt  (<== success)
```


<2> do clustered & non-clustered mount on same node

2.1  non-clustered mount => clustered mount

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
mount -t ocfs2 /dev/vdb /mnt               (<=== failure)
```

2.2 clustered mount => non-clustered mount

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt  (<=== failure)
```

<3> one node does clustered mount, another does non-clustered mount

test rule: a clustered mount and a non-clustered mount cannot exist at
the same time.

3.1 clustered mount @node1 => [no]clustered mount @node2

```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt

node2:
mount -t ocfs2 /dev/vdb /mnt              (<== success)
umount /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure)
```

3.2 enable mmp, repeat case 3.1

```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb   (<== enable mmp)
mount -t ocfs2 /dev/vdb /mnt

node2:
mount -t ocfs2 /dev/vdb /mnt              (<== wait ~22s [*] for mmp, then success)
umount /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure)
```

[*] 22s:
(OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1) seconds, waited twice (two calls
to schedule_timeout_interruptible): (5 * 2 + 1) * 2 = 22.

3.3 noclustered mount @node1 => [no]clustered mount @node2

```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt

node2:
mount -t ocfs2 /dev/vdb /mnt              (<== failure)
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success, without mmp enabled)
umount /mnt               (<== will ZERO out slotmap area while node1 is still mounted)
```

3.4 enable mmp, repeat case 3.3.

```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb   (<== enable mmp)
mount -t ocfs2 -o nocluster /dev/vdb /mnt

node2:
mount -t ocfs2 /dev/vdb /mnt              (<== failure)
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure, denied by mmp)
```

<4> simulate mounting after machine crash

info:
- all steps below are run on one node
- address 287387648 is the '//slot_map' extent address.
- test the rule: if the last mount was not unmounted (e.g. after a
  crash), the next mount MUST use the same mount type.

4.0 how to calculate '//slot_map' extent address

```
# PAGER=cat debugfs.ocfs2 -R "stats" /dev/vdb | grep "Block Size Bits"
        Block Size Bits: 12   Cluster Size Bits: 12

# PAGER=cat debugfs.ocfs2 -R "stat //slot_map" /dev/vdb | grep -A1 "Block#"
        ## Offset        Clusters       Block#          Flags
        0  0             1              70163           0x0
```

70163 * (1<<12) = 70163 * 4096 = 287387648


4.1 clustered mount => crash => non-clustered mount fails => clean
slotmap => non-clustered mount succeeds

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt
dd if=/dev/vdb bs=1 count=32 skip=287387648 of=/root/slotmap.cluster.mnted  (<== backup slot info)
umount /mnt
dd if=/root/slotmap.cluster.mnted of=/dev/vdb seek=287387648 bs=1 count=32 (<== overwrite)

mount -t ocfs2 -o nocluster /dev/vdb /mnt   <== failure
mount -t ocfs2 /dev/vdb /mnt && umount /mnt <== clean slot 0
mount -t ocfs2 -o nocluster /dev/vdb /mnt   <== success
```

4.2  non-clustered mount => crash => clustered mount fails => clean
slotmap => clustered mount succeeds

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
dd if=/dev/vdb bs=1 count=32 skip=287387648 of=/root/slotmap.nocluster.mnted
umount /mnt
dd if=/root/slotmap.nocluster.mnted of=/dev/vdb seek=287387648 bs=1 count=32

mount -t ocfs2 /dev/vdb /mnt   <== failure
mount -t ocfs2 -o nocluster /dev/vdb /mnt && umount /mnt <== clean slot 0
mount -t ocfs2 /dev/vdb /mnt   <== success
```

<5> MMP test

5.1 node1 noclustered mount => node 2 noclustered mount

disable mmp
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt

node2:
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success)
```

enable mmp
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt

node2:
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== wait ~12s[*], failure by mmp)
```

[*] 12s:
the mount sleeps (OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1) seconds, then
detects that mmp_seq has changed and fails.

5.2 node1 clustered mount => node 2 clustered mount

see case 3.2

5.3 node1 noclustered mount => node 2 noclustered mount

see case 3.4

5.4 remount test

5.4.1 non-clustered mount (run commands on same node)

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb

mount -t ocfs2 -o nocluster /dev/vdb /mnt
ps axj | grep kmmpd                            (<== will show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== show 'OCFS2_MMP_SEQ')

mount -o remount,ro,nocluster /dev/vdb /mnt    (<== kmmpd will stop)
ps axj | grep kmmpd  (<== won't show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== show 'OCFS2_MMP_SEQ_CLEAN')

mount -o remount,rw,nocluster /dev/vdb /mnt    (<== kmmpd will start)
ps axj | grep kmmpd  (<== will show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== show 'OCFS2_MMP_SEQ')
```

5.4.2 clustered mount

```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb

mount -t ocfs2 /dev/vdb /mnt                   (<== clustered mount won't create kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== show 'OCFS2_VALID_CLUSTER')

mount -o remount,ro /dev/vdb /mnt
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== show 'OCFS2_VALID_CLUSTER')

mount -o remount,rw /dev/vdb /mnt              (<== waits ~22s for mmp to start)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb  (<== show 'OCFS2_VALID_CLUSTER')
```

Heming Zhao (4):
  ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
  ocfs2: add mlog ML_WARNING support
  re-enable "ocfs2: mount shared volume without ha stack"
  ocfs2: introduce ext4 MMP feature

 fs/ocfs2/cluster/masklog.c |   3 +
 fs/ocfs2/cluster/masklog.h |   9 +-
 fs/ocfs2/dlmglue.c         |   3 +
 fs/ocfs2/ocfs2.h           |   6 +-
 fs/ocfs2/ocfs2_fs.h        |  13 +-
 fs/ocfs2/slot_map.c        | 479 +++++++++++++++++++++++++++++++++++--
 fs/ocfs2/slot_map.h        |   3 +
 fs/ocfs2/super.c           |  42 +++-
 8 files changed, 527 insertions(+), 31 deletions(-)

-- 
2.37.1

Heming Zhao

2022-Jul-30 01:14 UTC

[Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown

In local mount mode, no dlm resource is initialized. If
ocfs2_mount_volume() fails in ocfs2_find_slot(), the error handling
path calls ocfs2_dlm_shutdown(), which then runs the dlm resource
cleanup and triggers a kernel crash.

Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/dlmglue.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 801e60bab955..1438ac14940b 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
 void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
 			int hangup_pending)
 {
+	if (ocfs2_mount_local(osb))
+		return;
+
 	ocfs2_drop_osb_locks(osb);
 
 	/*
-- 
2.37.1

Heming Zhao

2022-Jul-30 01:14 UTC

[Ocfs2-devel] [PATCH 2/4] ocfs2: add mlog ML_WARNING support

This commit adds a new message type, ML_WARNING, to the ocfs2 mlog
facility.

Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/cluster/masklog.c | 3 +++
 fs/ocfs2/cluster/masklog.h | 9 +++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/cluster/masklog.c b/fs/ocfs2/cluster/masklog.c
index 563881ddbf00..bac3488e8002 100644
--- a/fs/ocfs2/cluster/masklog.c
+++ b/fs/ocfs2/cluster/masklog.c
@@ -63,6 +63,9 @@ void __mlog_printk(const u64 *mask, const char *func, int line,
 	if (*mask & ML_ERROR) {
 		level = KERN_ERR;
 		prefix = "ERROR: ";
+	} else if (*mask & ML_WARNING) {
+		level = KERN_WARNING;
+		prefix = "WARNING: ";
 	} else if (*mask & ML_NOTICE) {
 		level = KERN_NOTICE;
 	} else {
diff --git a/fs/ocfs2/cluster/masklog.h b/fs/ocfs2/cluster/masklog.h
index b73fc42e46ff..d0bc4fe8cf3d 100644
--- a/fs/ocfs2/cluster/masklog.h
+++ b/fs/ocfs2/cluster/masklog.h
@@ -86,10 +86,11 @@
 
 /* bits that are infrequently given and frequently matched in the high word */
 #define ML_ERROR	0x1000000000000000ULL /* sent to KERN_ERR */
-#define ML_NOTICE	0x2000000000000000ULL /* setn to KERN_NOTICE */
-#define ML_KTHREAD	0x4000000000000000ULL /* kernel thread activity */
+#define ML_NOTICE	0x2000000000000000ULL /* sent to KERN_NOTICE */
+#define ML_WARNING	0x4000000000000000ULL /* sent to KERN_WARNING */
+#define ML_KTHREAD	0x8000000000000000ULL /* kernel thread activity */
 
-#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_NOTICE)
+#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_WARNING|ML_NOTICE)
 #ifndef MLOG_MASK_PREFIX
 #define MLOG_MASK_PREFIX 0
 #endif
@@ -102,7 +103,7 @@
 #if defined(CONFIG_OCFS2_DEBUG_MASKLOG)
 #define ML_ALLOWED_BITS ~0
 #else
-#define ML_ALLOWED_BITS (ML_ERROR|ML_NOTICE)
+#define ML_ALLOWED_BITS (ML_ERROR|ML_WARNING|ML_NOTICE)
 #endif
 
 #define MLOG_MAX_BITS 64
-- 
2.37.1

Heming Zhao

2022-Jul-30 01:14 UTC

[Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"

The key difference between local mount and non-clustered mount:
the local mount feature (tunefs.ocfs2 --fs-features=[no]local) cannot
be switched on or off without the ha stack, while the non-clustered
mount feature runs entirely without the ha stack.

Commit 912f655d78c5 ("ocfs2: mount shared volume without ha stack")
had a bug, so commit c80af0c250c8f8a3c978aa5aafbe9c39b336b813 reverted
it.

Here is an explanation of the issue described in commit c80af0c250c8.

In Junxiao's call trace, the 'if' condition in
__ocfs2_find_empty_slot() is wrong: sl_node_num can legitimately be 0
in an o2cb environment.

With the current information, the trigger flow (based on 912f655d78c5)
is:
1>
nodeA mounts with 'node_num = 0' and succeeds.
At this point the slotmap extent block contains es_valid:1 &
es_node_num:0 for nodeA,
and ocfs2_update_disk_slot() writes the slotmap info back to disk.

2>
Then nodeB mounts with 'node_num = 1'.
This time osb->node_num is 1 (set by the config file) and
osb->preferred_slot is OCFS2_INVALID_SLOT (set by ocfs2_parse_options).

ocfs2_find_slot
 + ocfs2_update_slot_info //read slotmap info from disk
 |  + set si->si_slots[0].es_valid = 1 & si->si_slots[0].sl_node_num = 0
 |
 + __ocfs2_node_num_to_slot //will return -ENOENT.
 + __ocfs2_find_empty_slot
    + if ((preferred >= 0) && (preferred < si->si_num_slots))
    |  this 'if' is not entered because preferred is OCFS2_INVALID_SLOT
    |
    + 'for(i = 0; i < si->si_num_slots; i++)' stops at slot 0.
    |  'si->si_slots[0].sl_node_num' is false (0), which triggers the
    |  'break' condition.
    |
    + return slot 0.
       nodeB then grabs nodeA's journal dlm lock, which triggers a hang.

How is this bug fixed?

This commit re-enables 912f655d78c5; the next commit (add MMP support)
fixes the issue.

Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/ocfs2.h    |  4 +++-
 fs/ocfs2/slot_map.c | 46 ++++++++++++++++++++++++++-------------------
 fs/ocfs2/super.c    | 21 +++++++++++++++++++++
 3 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 740b64238312..337527571461 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -277,6 +277,7 @@ enum ocfs2_mount_options
 	OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15,  /* Journal Async Commit */
 	OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */
 	OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */
+	OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */
 };
 
 #define OCFS2_OSB_SOFT_RO	0x0001
@@ -672,7 +673,8 @@ static inline int ocfs2_cluster_o2cb_global_heartbeat(struct ocfs2_super *osb)
 
 static inline int ocfs2_mount_local(struct ocfs2_super *osb)
 {
-	return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT);
+	return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)
+		|| (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER));
 }
 
 static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb)
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index da7718cef735..0b0ae3ebb0cf 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -252,14 +252,16 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
 	int i, ret = -ENOSPC;
 
 	if ((preferred >= 0) && (preferred < si->si_num_slots)) {
-		if (!si->si_slots[preferred].sl_valid) {
+		if (!si->si_slots[preferred].sl_valid ||
+		    !si->si_slots[preferred].sl_node_num) {
 			ret = preferred;
 			goto out;
 		}
 	}
 
 	for(i = 0; i < si->si_num_slots; i++) {
-		if (!si->si_slots[i].sl_valid) {
+		if (!si->si_slots[i].sl_valid ||
+		    !si->si_slots[i].sl_node_num) {
 			ret = i;
 			break;
 		}
@@ -454,24 +456,30 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
 	spin_lock(&osb->osb_lock);
 	ocfs2_update_slot_info(si);
 
-	/* search for ourselves first and take the slot if it already
-	 * exists. Perhaps we need to mark this in a variable for our
-	 * own journal recovery? Possibly not, though we certainly
-	 * need to warn to the user */
-	slot = __ocfs2_node_num_to_slot(si, osb->node_num);
-	if (slot < 0) {
-		/* if no slot yet, then just take 1st available
-		 * one. */
-		slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
+	if (ocfs2_mount_local(osb))
+		/* use slot 0 directly in local mode */
+		slot = 0;
+	else {
+		/* search for ourselves first and take the slot if it already
+		 * exists. Perhaps we need to mark this in a variable for our
+		 * own journal recovery? Possibly not, though we certainly
+		 * need to warn to the user */
+		slot = __ocfs2_node_num_to_slot(si, osb->node_num);
 		if (slot < 0) {
-			spin_unlock(&osb->osb_lock);
-			mlog(ML_ERROR, "no free slots available!\n");
-			status = -EINVAL;
-			goto bail;
-		}
-	} else
-		printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already "
-		       "allocated to this node!\n", slot, osb->dev_str);
+			/* if no slot yet, then just take 1st available
+			 * one. */
+			slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
+			if (slot < 0) {
+				spin_unlock(&osb->osb_lock);
+				mlog(ML_ERROR, "no free slots available!\n");
+				status = -EINVAL;
+				goto bail;
+			}
+		} else
+			printk(KERN_INFO "ocfs2: Slot %d on device (%s) was "
+			       "already allocated to this node!\n",
+			       slot, osb->dev_str);
+	}
 
 	ocfs2_set_slot(si, slot, osb->node_num);
 	osb->slot_num = slot;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 438be028935d..f7298816d8d9 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -172,6 +172,7 @@ enum {
 	Opt_dir_resv_level,
 	Opt_journal_async_commit,
 	Opt_err_cont,
+	Opt_nocluster,
 	Opt_err,
 };
 
@@ -205,6 +206,7 @@ static const match_table_t tokens = {
 	{Opt_dir_resv_level, "dir_resv_level=%u"},
 	{Opt_journal_async_commit, "journal_async_commit"},
 	{Opt_err_cont, "errors=continue"},
+	{Opt_nocluster, "nocluster"},
 	{Opt_err, NULL}
 };
 
@@ -616,6 +618,13 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 		goto out;
 	}
 
+	tmp = OCFS2_MOUNT_NOCLUSTER;
+	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
+		ret = -EINVAL;
+		mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
+		goto out;
+	}
+
 	tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL |
 		OCFS2_MOUNT_HB_NONE;
 	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
@@ -856,6 +865,7 @@ static int ocfs2_verify_userspace_stack(struct ocfs2_super *osb,
 	}
 
 	if (ocfs2_userspace_stack(osb) &&
+	    !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
 	    strncmp(osb->osb_cluster_stack, mopt->cluster_stack,
 		    OCFS2_STACK_LABEL_LEN)) {
 		mlog(ML_ERROR,
@@ -1127,6 +1137,11 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
 	       osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ? "writeback" :
 	       "ordered");
 
+	if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
+	   !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT))
+		printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted "
+		       "without cluster aware mode.\n", osb->dev_str);
+
 	atomic_set(&osb->vol_state, VOLUME_MOUNTED);
 	wake_up(&osb->osb_mount_event);
 
@@ -1437,6 +1452,9 @@ static int ocfs2_parse_options(struct super_block *sb,
 		case Opt_journal_async_commit:
 			mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT;
 			break;
+		case Opt_nocluster:
+			mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER;
+			break;
 		default:
 			mlog(ML_ERROR,
 			     "Unrecognized mount option \"%s\" "
@@ -1548,6 +1566,9 @@ static int ocfs2_show_options(struct seq_file *s, struct dentry *root)
 	if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT)
 		seq_printf(s, ",journal_async_commit");
 
+	if (opts & OCFS2_MOUNT_NOCLUSTER)
+		seq_printf(s, ",nocluster");
+
 	return 0;
 }
 
-- 
2.37.1

Heming Zhao

2022-Jul-30 01:14 UTC

[Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature

MMP (multiple mount protection) gives a filesystem the ability to
prevent itself from being mounted multiple times.

To avoid data corruption when non-clustered and/or clustered mounts
happen at the same time, this commit introduces the MMP feature. The
MMP idea comes from the ext4 MMP code (fs/ext4/mmp.c). Because ocfs2 is
a clustered filesystem, and to stay compatible with the existing
slotmap feature, I made some optimizations and modifications when
porting from ext4 to ocfs2.

Optimization:
MMP has a kthread, kmmpd-<dev>, which is created only in non-clustered
mode.

We set a rule:
If the last mount was not unmounted (e.g. after a crash), the next
mount MUST use the same mount type.

Finally, this commit also fixes the issue mentioned in commit
c80af0c250c8 (Revert "ocfs2: mount shared volume without ha stack").

Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/ocfs2.h    |   2 +
 fs/ocfs2/ocfs2_fs.h |  13 +-
 fs/ocfs2/slot_map.c | 459 ++++++++++++++++++++++++++++++++++++++++++--
 fs/ocfs2/slot_map.h |   3 +
 fs/ocfs2/super.c    |  23 ++-
 5 files changed, 479 insertions(+), 21 deletions(-)

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 337527571461..37a7c5855d07 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -337,6 +337,8 @@ struct ocfs2_super
 	unsigned int node_num;
 	int slot_num;
 	int preferred_slot;
+	u16 mmp_update_interval;
+	struct task_struct *mmp_task;
 	int s_sectsize_bits;
 	int s_clustersize;
 	int s_clustersize_bits;
diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 638d875eccc7..015672f75563 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -87,7 +87,8 @@
 					 | OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE \
 					 | OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG	\
 					 | OCFS2_FEATURE_INCOMPAT_CLUSTERINFO \
-					 | OCFS2_FEATURE_INCOMPAT_APPEND_DIO)
+					 | OCFS2_FEATURE_INCOMPAT_APPEND_DIO \
+					 | OCFS2_FEATURE_INCOMPAT_MMP)
 #define OCFS2_FEATURE_RO_COMPAT_SUPP	(OCFS2_FEATURE_RO_COMPAT_UNWRITTEN \
 					 | OCFS2_FEATURE_RO_COMPAT_USRQUOTA \
 					 | OCFS2_FEATURE_RO_COMPAT_GRPQUOTA)
@@ -167,6 +168,11 @@
  */
 #define OCFS2_FEATURE_INCOMPAT_APPEND_DIO	0x8000
 
+/*
+ * Multiple mount protection
+ */
+#define OCFS2_FEATURE_INCOMPAT_MMP	0x10000
+
 /*
  * backup superblock flag is used to indicate that this volume
  * has backup superblocks.
@@ -535,8 +541,7 @@ struct ocfs2_slot_map {
 };
 
 struct ocfs2_extended_slot {
-/*00*/	__u8	es_valid;
-	__u8	es_reserved1[3];
+/*00*/	__le32	es_valid;
 	__le32	es_node_num;
 /*08*/
 };
@@ -611,7 +616,7 @@ struct ocfs2_super_block {
 						     INCOMPAT flag set. */
 /*B8*/	__le16 s_xattr_inline_size;	/* extended attribute inline size
 					   for this fs*/
-	__le16 s_reserved0;
+	__le16 s_mmp_update_interval; /* # seconds to wait in MMP checking */
 	__le32 s_dx_seed[3];		/* seed[0-2] for dx dir hash.
 					 * s_uuid_hash serves as seed[3]. */
 /*C0*/  __le64 s_reserved2[15];		/* Fill out superblock */
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index 0b0ae3ebb0cf..86a21140ead6 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -8,6 +8,8 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/highmem.h>
+#include <linux/random.h>
+#include <linux/kthread.h>
 
 #include <cluster/masklog.h>
 
@@ -24,9 +26,48 @@
 
 #include "buffer_head_io.h"
 
+/*
+ * This structure will be used for multiple mount protection. It will be
+ * written into the '//slot_map' field in the system dir.
+ * Programs that check MMP should assume that if SEQ_FSCK (or any unknown
+ * code above SEQ_MAX) is present then it is NOT safe to use the filesystem.
+ */
+#define OCFS2_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define OCFS2_MMP_SEQ_FSCK  0xE24D4D50U /* mmp_seq value when being fscked */
+#define OCFS2_MMP_SEQ_MAX   0xE24D4D4FU /* maximum valid mmp_seq value */
+#define OCFS2_MMP_SEQ_INIT  0x0         /* mmp_seq init value */
+#define OCFS2_VALID_CLUSTER   0xE24D4D55U /* value for clustered mount
+											   under MMP disabled */
+#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* value for noclustered mount
+											   under MMP disabled */
+
+#define OCFS2_SLOT_INFO_OLD_VALID   1 /* use for old slot info */
+
+/*
+ * Check interval multiplier
+ * The MMP block is written every update interval and initially checked every
+ * update interval x the multiplier (the value is then adapted based on the
+ * write latency). The reason is that writes can be delayed under load and we
+ * don't want readers to incorrectly assume that the filesystem is no longer
+ * in use.
+ */
+#define OCFS2_MMP_CHECK_MULT		2UL
+
+/*
+ * Minimum interval for MMP checking in seconds.
+ */
+#define OCFS2_MMP_MIN_CHECK_INTERVAL	5UL
+
+/*
+ * Maximum interval for MMP checking in seconds.
+ */
+#define OCFS2_MMP_MAX_CHECK_INTERVAL	300UL
 
 struct ocfs2_slot {
-	int sl_valid;
+	union {
+		unsigned int sl_valid;
+		unsigned int mmp_seq;
+	};
 	unsigned int sl_node_num;
 };
 
@@ -52,11 +93,11 @@ static void ocfs2_invalidate_slot(struct ocfs2_slot_info *si,
 }
 
 static void ocfs2_set_slot(struct ocfs2_slot_info *si,
-			   int slot_num, unsigned int node_num)
+			   int slot_num, unsigned int node_num, unsigned int valid)
 {
 	BUG_ON((slot_num < 0) || (slot_num >= si->si_num_slots));
 
-	si->si_slots[slot_num].sl_valid = 1;
+	si->si_slots[slot_num].sl_valid = valid;
 	si->si_slots[slot_num].sl_node_num = node_num;
 }
 
@@ -75,7 +116,8 @@ static void ocfs2_update_slot_info_extended(struct ocfs2_slot_info *si)
 		     i++, slotno++) {
 			if (se->se_slots[i].es_valid)
 				ocfs2_set_slot(si, slotno,
-					       le32_to_cpu(se->se_slots[i].es_node_num));
+					       le32_to_cpu(se->se_slots[i].es_node_num),
+					       le32_to_cpu(se->se_slots[i].es_valid));
 			else
 				ocfs2_invalidate_slot(si, slotno);
 		}
@@ -97,7 +139,8 @@ static void ocfs2_update_slot_info_old(struct ocfs2_slot_info *si)
 		if (le16_to_cpu(sm->sm_slots[i]) == (u16)OCFS2_INVALID_SLOT)
 			ocfs2_invalidate_slot(si, i);
 		else
-			ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]));
+			ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]),
+						OCFS2_SLOT_INFO_OLD_VALID);
 	}
 }
 
@@ -252,16 +295,14 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
 	int i, ret = -ENOSPC;
 
 	if ((preferred >= 0) && (preferred < si->si_num_slots)) {
-		if (!si->si_slots[preferred].sl_valid ||
-		    !si->si_slots[preferred].sl_node_num) {
+		if (!si->si_slots[preferred].sl_valid) {
 			ret = preferred;
 			goto out;
 		}
 	}
 
 	for(i = 0; i < si->si_num_slots; i++) {
-		if (!si->si_slots[i].sl_valid ||
-		    !si->si_slots[i].sl_node_num) {
+		if (!si->si_slots[i].sl_valid) {
 			ret = i;
 			break;
 		}
@@ -270,6 +311,43 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
 	return ret;
 }
 
+/* Return first used slot.
+ * -ENOENT means all slots are clean, ->sl_valid should be
+ * OCFS2_MMP_SEQ_CLEAN or ZERO */
+static int __ocfs2_find_used_slot(struct ocfs2_slot_info *si)
+{
+	int i, ret = -ENOENT, valid;
+
+	for (i = 0; i < si->si_num_slots; i++) {
+		valid = si->si_slots[i].sl_valid;
+		if (valid == 0 || valid == OCFS2_MMP_SEQ_CLEAN)
+			continue;
+		if (valid <= OCFS2_MMP_SEQ_MAX ||
+			valid == OCFS2_MMP_SEQ_FSCK ||
+			valid == OCFS2_VALID_CLUSTER ||
+			valid == OCFS2_VALID_NOCLUSTER) {
+			ret = i;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static int __ocfs2_find_expected_slot(struct ocfs2_slot_info *si,
+								unsigned int expected)
+{
+	int i;
+
+	for (i = 0; i < si->si_num_slots; i++) {
+		if (si->si_slots[i].sl_valid == expected) {
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 int ocfs2_node_num_to_slot(struct ocfs2_super *osb, unsigned int node_num)
 {
 	int slot;
@@ -445,21 +523,357 @@ void ocfs2_free_slot_info(struct ocfs2_super *osb)
 	__ocfs2_free_slot_info(si);
 }
 
+/*
+ * Get a random new sequence number but make sure it is not greater than
+ * OCFS2_MMP_SEQ_MAX.
+ */
+static unsigned int mmp_new_seq(void)
+{
+	u32 new_seq;
+
+	do {
+		new_seq = prandom_u32();
+	} while (new_seq > OCFS2_MMP_SEQ_MAX);
+
+	if (new_seq == 0)
+		return 1;
+	else
+		return new_seq;
+}
+
+/*
+ * kmmpd will update the MMP sequence every mmp_update_interval seconds
+ */
+static int kmmpd(void *data)
+{
+	struct ocfs2_super *osb = data;
+	struct super_block *sb = osb->sb;
+	struct ocfs2_slot_info *si = osb->slot_info;
+	int slot = osb->slot_num;
+	u32 seq, mmp_seq;
+	unsigned long failed_writes = 0;
+	u16 mmp_update_interval = osb->mmp_update_interval;
+	unsigned int mmp_check_interval;
+	unsigned long last_update_time;
+	unsigned long diff;
+	int retval = 0;
+
+	if (!ocfs2_mount_local(osb)) {
+		mlog(ML_ERROR, "kmmpd thread only works for local mount mode.\n");
+		goto wait_to_exit;
+	}
+
+	retval = ocfs2_refresh_slot_info(osb);
+	seq = si->si_slots[slot].mmp_seq;
+
+	/*
+	 * Start with the higher mmp_check_interval and reduce it if
+	 * the MMP block is being updated on time.
+	 */
+	mmp_check_interval = max(OCFS2_MMP_CHECK_MULT * mmp_update_interval,
+				 OCFS2_MMP_MIN_CHECK_INTERVAL);
+
+	while (!kthread_should_stop() && !sb_rdonly(sb)) {
+		if (!OCFS2_HAS_INCOMPAT_FEATURE(sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+			mlog(ML_WARNING, "kmmpd being stopped since MMP feature"
+				     " has been disabled.");
+			goto wait_to_exit;
+		}
+		if (++seq > OCFS2_MMP_SEQ_MAX)
+			seq = 1;
+
+		spin_lock(&osb->osb_lock);
+		si->si_slots[slot].mmp_seq = mmp_seq = seq;
+		spin_unlock(&osb->osb_lock);
+
+		last_update_time = jiffies;
+		retval = ocfs2_update_disk_slot(osb, si, slot);
+
+		/*
+		 * Don't spew too many error messages. Print one every
+		 * (s_mmp_update_interval * 60) seconds.
+		 */
+		if (retval) {
+			if ((failed_writes % 60) == 0) {
+				ocfs2_error(sb, "Error writing to MMP block");
+			}
+			failed_writes++;
+		}
+
+		diff = jiffies - last_update_time;
+		if (diff < mmp_update_interval * HZ)
+			schedule_timeout_interruptible(mmp_update_interval *
+						       HZ - diff);
+
+		/*
+		 * We need to make sure that more than mmp_check_interval
+		 * seconds have not passed since writing. If that has happened
+		 * we need to check if the MMP block is as we left it.
+		 */
+		diff = jiffies - last_update_time;
+		if (diff > mmp_check_interval * HZ) {
+			retval = ocfs2_refresh_slot_info(osb);
+			if (retval) {
+				ocfs2_error(sb, "error reading MMP data: %d", retval);
+				goto wait_to_exit;
+			}
+
+			if (si->si_slots[slot].mmp_seq != mmp_seq) {
+				ocfs2_error(sb, "Error while updating MMP info. "
+					     "The filesystem seems to have been"
+					     " multiply mounted.");
+				retval = -EBUSY;
+				goto wait_to_exit;
+			}
+		}
+
+		 /*
+		 * Adjust the mmp_check_interval depending on how much time
+		 * it took for the MMP block to be written.
+		 */
+		mmp_check_interval = max(min(OCFS2_MMP_CHECK_MULT * diff / HZ,
+					     OCFS2_MMP_MAX_CHECK_INTERVAL),
+					     OCFS2_MMP_MIN_CHECK_INTERVAL);
+	}
+
+	/*
+	 * Unmount seems to be clean.
+	 */
+	spin_lock(&osb->osb_lock);
+	si->si_slots[slot].mmp_seq = OCFS2_MMP_SEQ_CLEAN;
+	spin_unlock(&osb->osb_lock);
+
+	retval = ocfs2_update_disk_slot(osb, si, 0);
+
+wait_to_exit:
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!kthread_should_stop())
+			schedule();
+	}
+	set_current_state(TASK_RUNNING);
+	return retval;
+}
+
+void ocfs2_stop_mmpd(struct ocfs2_super *osb)
+{
+	if (osb->mmp_task) {
+		kthread_stop(osb->mmp_task);
+		osb->mmp_task = NULL;
+	}
+}
+
+/*
+ * Protect the filesystem from being mounted more than once.
+ *
+ * This function was inspired by ext4 MMP feature. Because HA stack
+ * helps ocfs2 to manage nodes join/leave, so we only focus on MMP
+ * under nocluster mode.
+ * Note that ocfs2 only uses slot 0 in nocluster mode.
+ *
+ * es_valid:
+ *  0: not available
+ *  1: valid, cluster mode
+ *  2: valid, nocluster mode
+ *
+ * parameters:
+ *  osb: the struct ocfs2_super
+ *  noclustered: under noclustered mount
+ *  slot: prefer slot number
+ */
+int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered)
+{
+	struct buffer_head *bh = NULL;
+	u32 seq;
+	struct ocfs2_slot_info *si = osb->slot_info;
+	unsigned int mmp_check_interval = osb->mmp_update_interval;
+	unsigned int wait_time = 0;
+	int retval = 0;
+	int slot = osb->slot_num;
+
+	if (!ocfs2_uses_extended_slot_map(osb)) {
+		mlog(ML_WARNING, "MMP only works on extended slot map.\n");
+		retval = -EINVAL;
+		goto bail;
+	}
+
+	retval = ocfs2_refresh_slot_info(osb);
+	if (retval)
+		goto bail;
+
+	if (mmp_check_interval < OCFS2_MMP_MIN_CHECK_INTERVAL)
+		mmp_check_interval = OCFS2_MMP_MIN_CHECK_INTERVAL;
+
+	spin_lock(&osb->osb_lock);
+	seq = si->si_slots[slot].mmp_seq;
+
+	if (__ocfs2_find_used_slot(si) == -ENOENT)
+		goto skip;
+
+	/* TODO ocfs2-tools need to support this flag */
+	if (__ocfs2_find_expected_slot(si, OCFS2_MMP_SEQ_FSCK)) {
+		mlog(ML_NOTICE, "fsck is running on the filesystem");
+		spin_unlock(&osb->osb_lock);
+		retval = -EBUSY;
+		goto bail;
+	}
+	spin_unlock(&osb->osb_lock);
+
+	wait_time = min(mmp_check_interval * 2 + 1, mmp_check_interval + 60);
+
+	/* Print MMP interval if more than 20 secs. */
+	if (wait_time > OCFS2_MMP_MIN_CHECK_INTERVAL * 4)
+		mlog(ML_WARNING, "MMP interval %u higher than expected, please"
+			     " wait.\n", wait_time * 2);
+
+	if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
+		mlog(ML_WARNING, "MMP startup interrupted, failing mount.\n");
+		retval = -EPERM;
+		goto bail;
+	}
+
+	retval = ocfs2_refresh_slot_info(osb);
+	if (retval)
+		goto bail;
+	if (seq != si->si_slots[slot].mmp_seq) {
+		mlog(ML_ERROR, "Device is already active on another node.\n");
+		retval = -EPERM;
+		goto bail;
+	}
+
+	spin_lock(&osb->osb_lock);
+skip:
+	/*
+	 * write a new random sequence number.
+	 */
+	seq = mmp_new_seq();
+	mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq);
+	ocfs2_set_slot(si, slot, osb->node_num, seq);
+	spin_unlock(&osb->osb_lock);
+
+	ocfs2_update_disk_slot_extended(si, slot, &bh);
+	mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq);
+	retval = ocfs2_write_block(osb, bh, INODE_CACHE(si->si_inode));
+	if (retval < 0) {
+		mlog_errno(retval);
+		goto bail;
+	}
+	mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x wait_time: %u\n", seq, si->si_slots[slot].mmp_seq, wait_time);
+
+	/*
+	 * wait for MMP interval and check mmp_seq.
+	 */
+	if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
+		mlog(ML_WARNING, "MMP startup interrupted, failing mount.\n");
+		retval = -EPERM;
+		goto bail;
+	}
+
+	retval = ocfs2_refresh_slot_info(osb);
+	if (retval)
+		goto bail;
+
+	mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq);
+	if (seq != si->si_slots[slot].mmp_seq) {
+		mlog(ML_ERROR, "Update seq failed, device is already active on another node.\n");
+		retval = -EPERM;
+		goto bail;
+	}
+
+	/*
+	 * There are two reasons we don't create kmmpd on clustered mount:
+	 * - ocfs2 needs to grab osb->osb_lock to modify/access osb->si.
+	 * - For huge number nodes cluster, nodes update same sector
+	 *   of '//slot_map' will cause IO performance issue.
+	 *
+	 * Then there has another question:
+	 * On clustered mount, MMP seq won't update, and MMP how to
+	 * handle a noclustered mount when there already exist
+	 * clustered mount.
+	 * The answer is the rule mentioned in ocfs2_find_slot().
+	 */
+	if (!noclustered) {
+		spin_lock(&osb->osb_lock);
+		ocfs2_set_slot(si, slot, osb->node_num, OCFS2_VALID_CLUSTER);
+		spin_unlock(&osb->osb_lock);
+
+		ocfs2_update_disk_slot_extended(si, slot, &bh);
+		retval = ocfs2_write_block(osb, bh, INODE_CACHE(si->si_inode));
+		goto bail;
+	}
+
+	/*
+	 * Start a kernel thread to update the MMP block periodically.
+	 */
+	osb->mmp_task = kthread_run(kmmpd, osb, "kmmpd-%s", osb->sb->s_id);
+	if (IS_ERR(osb->mmp_task)) {
+		osb->mmp_task = NULL;
+		mlog(ML_WARNING, "Unable to create kmmpd thread for %s.",
+			     osb->sb->s_id);
+		retval = -EPERM;
+		goto bail;
+	}
+
+bail:
+	return retval;
+}
+
+static void show_conflict_mnt_msg(int clustered)
+{
+	const char *exist = clustered ? "non-clustered" : "clustered";
+
+	mlog(ML_ERROR, "Found %s mount info!", exist);
+	mlog(ML_ERROR, "Please clean %s slotmap info for mounting.\n", exist);
+	mlog(ML_ERROR, "eg. remount then unmount with %s mode\n", exist);
+}
+
+/*
+ * Even under readonly mode, we write slot info on disk.
+ * The logic is correct: if not change slot info on readonly
+ * mode, in cluster env, later mount from another node
+ * may reuse the same slot, deadlock happen!
+ */
 int ocfs2_find_slot(struct ocfs2_super *osb)
 {
-	int status;
+	int status = -EPERM;
 	int slot;
+	int noclustered = 0;
 	struct ocfs2_slot_info *si;
 
 	si = osb->slot_info;
 
 	spin_lock(&osb->osb_lock);
 	ocfs2_update_slot_info(si);
+	slot = __ocfs2_find_used_slot(si);
+	if (slot == 0 &&
+		((si->si_slots[0].sl_valid == OCFS2_VALID_NOCLUSTER) ||
+		 (si->si_slots[0].sl_valid < OCFS2_MMP_SEQ_MAX)))
+		noclustered = 1;
 
-	if (ocfs2_mount_local(osb))
-		/* use slot 0 directly in local mode */
-		slot = 0;
-	else {
+	/*
+	 * We set a rule:
+	 * If last mount didn't do unmount, (eg: crash), the next mount
+	 * MUST be same mount type.
+	 */
+	if (ocfs2_mount_local(osb)) {
+		/* empty slotmap, or device didn't unmount from last time */
+		if ((slot == -ENOENT) || noclustered) {
+			/* use slot 0 directly in local mode */
+			slot = 0;
+			noclustered = 1;
+		} else {
+			spin_unlock(&osb->osb_lock);
+			show_conflict_mnt_msg(0);
+			status = -EINVAL;
+			goto bail;
+		}
+	} else {
+		if (noclustered) {
+			spin_unlock(&osb->osb_lock);
+			show_conflict_mnt_msg(1);
+			status = -EINVAL;
+			goto bail;
+		}
 		/* search for ourselves first and take the slot if it already
 		 * exists. Perhaps we need to mark this in a variable for our
 		 * own journal recovery? Possibly not, though we certainly
@@ -481,7 +895,21 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
 			       slot, osb->dev_str);
 	}
 
-	ocfs2_set_slot(si, slot, osb->node_num);
+	if (OCFS2_HAS_INCOMPAT_FEATURE(osb->sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+		osb->slot_num = slot;
+		spin_unlock(&osb->osb_lock);
+		status = ocfs2_multi_mount_protect(osb, noclustered);
+		if (status < 0) {
+			mlog(ML_ERROR, "MMP failed to start.\n");
+			goto mmp_fail;
+		}
+
+		trace_ocfs2_find_slot(osb->slot_num);
+		return status;
+	}
+
+	ocfs2_set_slot(si, slot, osb->node_num, noclustered ?
+			OCFS2_VALID_NOCLUSTER : OCFS2_VALID_CLUSTER);
 	osb->slot_num = slot;
 	spin_unlock(&osb->osb_lock);
 
@@ -490,6 +918,7 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
 	status = ocfs2_update_disk_slot(osb, si, osb->slot_num);
 	if (status < 0) {
 		mlog_errno(status);
+mmp_fail:
 		/*
 		 * if write block failed, invalidate slot to avoid overwrite
 		 * slot during dismount in case another node rightly has mounted
diff --git a/fs/ocfs2/slot_map.h b/fs/ocfs2/slot_map.h
index a43644570b53..d4d147b0c190 100644
--- a/fs/ocfs2/slot_map.h
+++ b/fs/ocfs2/slot_map.h
@@ -25,4 +25,7 @@ int ocfs2_slot_to_node_num_locked(struct ocfs2_super *osb, int slot_num,
 
 int ocfs2_clear_slot(struct ocfs2_super *osb, int slot_num);
 
+int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered);
+void ocfs2_stop_mmpd(struct ocfs2_super *osb);
+
 #endif
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index f7298816d8d9..b0e76b06efc3 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -609,6 +609,7 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 	struct mount_options parsed_options;
 	struct ocfs2_super *osb = OCFS2_SB(sb);
 	u32 tmp;
+	int noclustered;
 
 	sync_filesystem(sb);
 
@@ -619,7 +620,8 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 	}
 
 	tmp = OCFS2_MOUNT_NOCLUSTER;
-	if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
+	noclustered = osb->s_mount_opt & tmp;
+	if (noclustered != (parsed_options.mount_opt & tmp)) {
 		ret = -EINVAL;
 		mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
 		goto out;
@@ -686,10 +688,20 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 			}
 			sb->s_flags &= ~SB_RDONLY;
 			osb->osb_flags &= ~OCFS2_OSB_SOFT_RO;
+			if (OCFS2_HAS_INCOMPAT_FEATURE(sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+				spin_unlock(&osb->osb_lock);
+				if (ocfs2_multi_mount_protect(osb, noclustered)) {
+					mlog(ML_ERROR, "started MMP failed.\n");
+					ocfs2_stop_mmpd(osb);
+					ret = -EROFS;
+					goto unlocked_osb;
+				}
+			}
 		}
 		trace_ocfs2_remount(sb->s_flags, osb->osb_flags, *flags);
 unlock_osb:
 		spin_unlock(&osb->osb_lock);
+unlocked_osb:
 		/* Enable quota accounting after remounting RW */
 		if (!ret && !(*flags & SB_RDONLY)) {
 			if (sb_any_quota_suspended(sb))
@@ -722,6 +734,8 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
 		sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
 			((osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) ?
 							SB_POSIXACL : 0);
+		if (sb_rdonly(osb->sb))
+			ocfs2_stop_mmpd(osb);
 	}
 out:
 	return ret;
@@ -1833,7 +1847,7 @@ static int ocfs2_mount_volume(struct super_block *sb)
 	status = ocfs2_init_local_system_inodes(osb);
 	if (status < 0) {
 		mlog_errno(status);
-		goto out_super_lock;
+		goto out_find_slot;
 	}
 
 	status = ocfs2_check_volume(osb);
@@ -1858,6 +1872,8 @@ static int ocfs2_mount_volume(struct super_block *sb)
 	/* before journal shutdown, we should release slot_info */
 	ocfs2_free_slot_info(osb);
 	ocfs2_journal_shutdown(osb);
+out_find_slot:
+	ocfs2_stop_mmpd(osb);
 out_super_lock:
 	ocfs2_super_unlock(osb, 1);
 out_dlm:
@@ -1878,6 +1894,8 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)
 	osb = OCFS2_SB(sb);
 	BUG_ON(!osb);
 
+	ocfs2_stop_mmpd(osb);
+
 	/* Remove file check sysfs related directores/files,
 	 * and wait for the pending file check operations */
 	ocfs2_filecheck_remove_sysfs(osb);
@@ -2086,6 +2104,7 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	snprintf(osb->dev_str, sizeof(osb->dev_str), "%u,%u",
 		 MAJOR(osb->sb->s_dev), MINOR(osb->sb->s_dev));
 
+	osb->mmp_update_interval = le16_to_cpu(di->id2.i_super.s_mmp_update_interval);
 	osb->max_slots = le16_to_cpu(di->id2.i_super.s_max_slots);
 	if (osb->max_slots > OCFS2_MAX_SLOTS || osb->max_slots == 0) {
 		mlog(ML_ERROR, "Invalid number of node slots (%u)\n",
-- 
2.37.1
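
For readers following the timing logic in kmmpd() and ocfs2_multi_mount_protect() above, here is a minimal userspace sketch of the two calculations. It works in whole seconds rather than jiffies. The constant values (5, 300, 2) are assumptions borrowed from ext4's MMP defaults; the patch's own OCFS2_MMP_* definitions are not shown in this part of the thread. Both function names are hypothetical.

```c
#include <assert.h>

/* Assumed values, modelled on ext4's MMP defaults (not taken from this patch). */
#define MMP_MIN_CHECK_INTERVAL 5u    /* seconds */
#define MMP_MAX_CHECK_INTERVAL 300u  /* seconds */
#define MMP_CHECK_MULT         2u

/*
 * Hypothetical helper: next check interval, given how many seconds the
 * last MMP write took. Mirrors max(min(MULT * t, MAX), MIN).
 */
unsigned int mmp_next_check_interval(unsigned long write_secs)
{
	unsigned long v = MMP_CHECK_MULT * write_secs;

	assert(MMP_MIN_CHECK_INTERVAL <= MMP_MAX_CHECK_INTERVAL);
	if (v > MMP_MAX_CHECK_INTERVAL)
		v = MMP_MAX_CHECK_INTERVAL;
	if (v < MMP_MIN_CHECK_INTERVAL)
		v = MMP_MIN_CHECK_INTERVAL;
	return (unsigned int)v;
}

/*
 * Hypothetical helper: how long a mount waits before re-reading the
 * sequence number. Mirrors min(interval * 2 + 1, interval + 60).
 */
unsigned int mmp_startup_wait(unsigned int check_interval)
{
	unsigned int doubled = check_interval * 2 + 1;
	unsigned int capped = check_interval + 60;

	return doubled < capped ? doubled : capped;
}
```

Under these assumed constants, a 5-second check interval gives a startup wait of 11 seconds, and the `+ 60` cap only takes over once the interval exceeds 59 seconds.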

Joseph Qi

2022-Aug-08 06:51 UTC


[Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown

On 7/30/22 9:14 AM, Heming Zhao wrote:
> On local mount mode, there is no dlm resource initialized. If
> ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling
> flow will call ocfs2_dlm_shutdown(), then does dlm resource
> cleanup job, which will trigger kernel crash.
> 
> Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before
> return error")
Should be put on the same line.
> Signed-off-by: Heming Zhao <heming.zhao at suse.com>
> ---
>  fs/ocfs2/dlmglue.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 801e60bab955..1438ac14940b 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
>  void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
>  			int hangup_pending)
>  {
> +	if (ocfs2_mount_local(osb))
> +		return;
> +
IMO, we still have to do part of the ocfs2_dlm_shutdown() work, such as
ocfs2_lock_res_free(), which removes the lockres from d_lockres_tracking,
where it was added by ocfs2_xxx_lock_res_init().

Before commit 0737e01de9c4, it seems this issue also existed, since
osb->cconn is already set under local mount mode.

Thanks,
Joseph
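
The concern above can be illustrated with a toy userspace model. This is only a sketch: `struct fake_osb`, its fields, the function names and the count of three lock resources are all hypothetical stand-ins, not the kernel's data structures. It assumes, as the reply states, that lock resources are registered for tracking at init time even on a local mount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ocfs2_super; not the kernel layout. */
struct fake_osb {
	bool local_mount;     /* true for a local (no DLM) mount */
	int tracked_lockres;  /* entries on the debug tracking list */
	bool dlm_joined;      /* true once a DLM domain was joined */
};

/* Assumed number of lock resources registered at init time. */
#define FAKE_INIT_LOCKRES 3

void fake_dlm_init(struct fake_osb *osb, bool local)
{
	assert(osb != NULL);
	osb->local_mount = local;
	osb->dlm_joined = !local;
	/* Assumption: lock resources are tracked in both mount modes. */
	osb->tracked_lockres = FAKE_INIT_LOCKRES;
}

/* Shape of the posted change: return before any cleanup runs. */
void fake_dlm_shutdown_early_return(struct fake_osb *osb)
{
	if (osb->local_mount)
		return;
	osb->tracked_lockres = 0;
	osb->dlm_joined = false;
}

/* Shape of the suggestion: always untrack, skip only the DLM part. */
void fake_dlm_shutdown_partial(struct fake_osb *osb)
{
	osb->tracked_lockres = 0;
	if (osb->local_mount)
		return;
	osb->dlm_joined = false;
}
```

In this model, a local mount followed by the early-return shutdown leaves `tracked_lockres` at 3, while the partial variant brings it back to 0.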

Joseph Qi

2022-Aug-08 08:19 UTC


[Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature

On 7/30/22 9:14 AM, Heming Zhao wrote:
> MMP (multiple mount protection) gives filesystem ability to prevent
> from being mounted multiple times.
> 
> For avoiding data corruption when non-clustered and/or clustered mount
> are happening at same time, this commit introduced MMP feature. MMP
> idea is from ext4 MMP (fs/ext4/mmp.c) code. For ocfs2 is a clustered
> fs and also for compatible with existing slotmap feature, I did some
> optimization and modification when porting from ext4 to ocfs2.
> 
> For optimization:
> mmp has a kthread kmmpd-<dev>, which is only created in non-clustered
> mode.
> 
> We set a rule:
> If last mount didn't do unmount, (eg: crash), the next mount MUST be
> same mount type.
> 
> At last, this commit also fix commit c80af0c250c8 ("Revert "ocfs2:
> mount shared volume without ha stack") mentioned issue.
I suggest we re-split this series (especially patches 3 and 4), rather
than reviving a buggy commit first and then fixing it in a later commit.

Thanks,
Joseph
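
For reference, the mount-type rule quoted above ("the next mount MUST be same mount type") can be sketched as a small decision function. This is a simplified userspace model of the checks in the patched ocfs2_find_slot(): the enum, its values and the function name are hypothetical, and the slotmap state is reduced to three cases instead of being read from sl_valid.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what the previous mount left in the slotmap. */
enum slotmap_state {
	SLOTMAP_EMPTY,      /* clean: no slot in use */
	SLOTMAP_NOCLUSTER,  /* slot 0 still marked by a nocluster mount */
	SLOTMAP_CLUSTER     /* some slot still marked by a clustered mount */
};

/*
 * Returns 0 if the requested mount type may proceed, -EINVAL if it
 * conflicts with stale slotmap info left by the other mount type.
 */
int check_mount_type(bool want_nocluster, enum slotmap_state state)
{
	assert(state >= SLOTMAP_EMPTY && state <= SLOTMAP_CLUSTER);

	if (want_nocluster)
		return state == SLOTMAP_CLUSTER ? -EINVAL : 0;

	return state == SLOTMAP_NOCLUSTER ? -EINVAL : 0;
}
```

In this model an empty slotmap accepts either mount type, and a stale entry only blocks the opposite type, which matches the hint printed by show_conflict_mnt_msg() to remount and unmount in the original mode first.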
