Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support
This patch series re-enables the ocfs2 non-clustered mount feature.
Commit c80af0c250c8 (Revert "ocfs2: mount shared volume without ha
stack") reverted Gang's non-clustered mount patch; this series brings
the feature back.
The key difference between local mount and non-clustered mount:
the local mount feature (tunefs.ocfs2 --fs-features=[no]local) cannot
be converted without the ha stack, whereas the non-clustered mount
feature runs entirely without the ha stack.
To avoid data corruption when non-clustered and clustered mounts
happen at the same time, this series also introduces an MMP feature.
The MMP (Multiple Mount Protection) idea comes from ext4 MMP
(fs/ext4/mmp.c), which protects a filesystem from being mounted more
than once. Because ocfs2 is a clustered fs, and to stay compatible with
the existing slotmap feature, I made some optimizations and
modifications when porting MMP from ext4 to ocfs2.
The related userspace code supporting MMP has been sent to github
for review:
- https://github.com/markfasheh/ocfs2-tools/pull/58
Enabling MMP with ocfs2-tools and checking its status:
```
# enable MMP
tunefs.ocfs2 --fs-features=mmp /dev/vdb
# check the command result
tunefs.ocfs2 -Q "%H\n" /dev/vdb | grep MMP
# activate MMP with a nocluster mount
mount -t ocfs2 -o nocluster /dev/vdb /mnt
# check slotmap info
echo slotmap | PAGER=cat debugfs.ocfs2 /dev/vdb
```
=== below are test cases for patches ===
<1> non-clustered mount vs local mount
1.1 tunefs.ocfs2 can't convert local/nolocal mount without ha stack.
```
(on ha stack env)
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=local /dev/vdb (<== success)
tunefs.ocfs2 --fs-features=nolocal /dev/vdb (<== success)
(on another node without ha stack)
tunefs.ocfs2 --fs-features=local /dev/vdb (<== failure)
```
1.2 non-cluster feature can run without ha stack.
```
(on ha stack env)
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
(on another node without ha stack)
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success)
```
<2> do clustered & non-clustered mount on same node
2.1 non-clustered mount => clustered mount
```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
mount -t ocfs2 /dev/vdb /mnt (<=== failure)
```
2.2 clustered mount => non-clustered mount
```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<=== failure)
```
<3> one node does clustered mount, another does non-clustered mount
test rule: a clustered mount and a non-clustered mount cannot exist at
the same time.
3.1 clustered mount @node1 => [no]clustered mount @node2
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt
node2:
mount -t ocfs2 /dev/vdb /mnt (<== success)
umount /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure)
```
3.2 enable mmp, repeat case 3.1
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb (<== enable mmp)
mount -t ocfs2 /dev/vdb /mnt
node2:
mount -t ocfs2 /dev/vdb /mnt (<== wait ~22s [*] for mmp,
then success)
umount /mnt
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure)
```
[*] 22s:
(OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1) * 2 = (5 * 2 + 1) * 2 = 22
seconds, i.e. two calls to schedule_timeout_interruptible() of 11
seconds each.
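The 22-second figure can be reproduced with a small C sketch. It mirrors the wait_time expression from ocfs2_multi_mount_protect() in patch 4 and assumes the on-disk update interval is at or below OCFS2_MMP_MIN_CHECK_INTERVAL, so the minimum applies. The helper names mmp_wait_time() and mmp_total_wait() are illustrative only and do not exist in the kernel.

```c
/* Value copied from patch 4 (fs/ocfs2/slot_map.c). */
#define OCFS2_MMP_MIN_CHECK_INTERVAL 5UL

/*
 * Hypothetical helper mirroring the patch's expression:
 *   wait_time = min(interval * 2 + 1, interval + 60)
 * after the interval has been raised to the minimum.
 */
static unsigned long mmp_wait_time(unsigned long check_interval)
{
	unsigned long doubled, capped;

	if (check_interval < OCFS2_MMP_MIN_CHECK_INTERVAL)
		check_interval = OCFS2_MMP_MIN_CHECK_INTERVAL;

	doubled = check_interval * 2 + 1;
	capped = check_interval + 60;
	return doubled < capped ? doubled : capped;
}

/* Hypothetical helper: total delay when the mount path sleeps 'waits' times. */
static unsigned long mmp_total_wait(unsigned long check_interval,
				    unsigned int waits)
{
	return mmp_wait_time(check_interval) * waits;
}
```

With the minimum interval, one sleep is 11 seconds; a mount that sleeps once before writing its sequence number and once after it therefore takes about 22 seconds.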
3.3 noclustered mount @node1 => [no]clustered mount @node2
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
node2:
mount -t ocfs2 /dev/vdb /mnt (<== failure)
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success, without mmp
enable)
umount /mnt (<== will ZERO out slotmap area while node1
still mounting)
```
3.4 enable mmp, repeat case 3.3.
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb (<== enable mmp)
mount -t ocfs2 -o nocluster /dev/vdb /mnt
node2:
mount -t ocfs2 /dev/vdb /mnt (<== failure)
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure, denied by mmp)
```
<4> simulate mounting after machine crash
info:
- all steps below are run on one node
- address 287387648 is the '//slot_map' extent address.
- the rule under test: if the last mount was not unmounted (e.g. after a
crash), the next mount MUST be of the same mount type.
4.0 how to calculate '//slot_map' extent address
```
# PAGER=cat debugfs.ocfs2 -R "stats" /dev/vdb | grep "Block Size Bits"
Block Size Bits: 12 Cluster Size Bits: 12
# PAGER=cat debugfs.ocfs2 -R "stat //slot_map" /dev/vdb | grep -A1 "Block#"
## Offset Clusters Block# Flags
0 0 1 70163 0x0
```
70163 * (1<<12) = 70163 * 4096 = 287387648
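The same calculation as a minimal C sketch. block_to_byte_offset() is a hypothetical helper written for this illustration; it is not part of ocfs2-tools or the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a block number into a byte offset on the
 * device, given the "Block Size Bits" value reported by debugfs.ocfs2.
 */
static uint64_t block_to_byte_offset(uint64_t blkno, unsigned int blocksize_bits)
{
	return blkno << blocksize_bits;
}
```

For the values shown above, block_to_byte_offset(70163, 12) yields 287387648, which is the skip/seek value used by the dd commands in the cases below.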
4.1 clustered mount => crash => non-clustered mount fails => clean
slotmap => non-clustered mount succeeds
```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 /dev/vdb /mnt
dd if=/dev/vdb bs=1 count=32 skip=287387648 of=/root/slotmap.cluster.mnted (<== backup slot info)
umount /mnt
dd if=/root/slotmap.cluster.mnted of=/dev/vdb seek=287387648 bs=1 count=32 (<== overwrite)
mount -t ocfs2 -o nocluster /dev/vdb /mnt <== failure
mount -t ocfs2 /dev/vdb /mnt && umount /mnt <== clean slot 0
mount -t ocfs2 -o nocluster /dev/vdb /mnt <== success
```
4.2 non-clustered mount => crash => clustered mount fails => clean
slotmap => clustered mount succeeds
```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
dd if=/dev/vdb bs=1 count=32 skip=287387648 of=/root/slotmap.nocluster.mnted
umount /mnt
dd if=/root/slotmap.nocluster.mnted of=/dev/vdb seek=287387648 bs=1 count=32
mount -t ocfs2 /dev/vdb /mnt <== failure
mount -t ocfs2 -o nocluster /dev/vdb /mnt && umount /mnt <== clean slot 0
mount -t ocfs2 /dev/vdb /mnt <== success
```
<5> MMP test
5.1 node1 noclustered mount => node 2 noclustered mount
disable mmp
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
node2:
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success)
```
enable mmp
```
node1:
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
node2:
mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== wait ~12s[*], failure by
mmp)
```
[*] 12s:
node2 sleeps (OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1) = 11 seconds, then
detects that mmp_seq has changed and fails the mount.
5.2 node1 clustered mount => node 2 clustered mount
see case 3.2
5.3 node1 noclustered mount => node 2 noclustered mount
see case 3.4
5.4 remount test
5.4.1 non-clustered mount (run commands on same node)
```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb
mount -t ocfs2 -o nocluster /dev/vdb /mnt
ps axj | grep kmmpd (<== will show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show
'OCFS2_MMP_SEQ')
mount -o remount,ro,nocluster /dev/vdb /mnt (<== kmmpd will stop)
ps axj | grep kmmpd (<== won't show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show
'OCFS2_MMP_SEQ_CLEAN')
mount -o remount,rw,nocluster /dev/vdb /mnt (<== kmmpd will start)
ps axj | grep kmmpd (<== will show kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show
'OCFS2_MMP_SEQ')
```
5.4.2 clustered mount
```
mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb
tunefs.ocfs2 --fs-features=mmp /dev/vdb
mount -t ocfs2 /dev/vdb /mnt (<== clustered mount
won't create kmmpd)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show
'OCFS2_VALID_CLUSTER')
mount -o remount,ro /dev/vdb /mnt
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show
'OCFS2_VALID_CLUSTER')
mount -o remount,rw /dev/vdb /mnt (<== wait for ~22s by mmp
start)
PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show
'OCFS2_VALID_CLUSTER')
```
Heming Zhao (4):
ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
ocfs2: add mlog ML_WARNING support
re-enable "ocfs2: mount shared volume without ha stack"
ocfs2: introduce ext4 MMP feature
fs/ocfs2/cluster/masklog.c | 3 +
fs/ocfs2/cluster/masklog.h | 9 +-
fs/ocfs2/dlmglue.c | 3 +
fs/ocfs2/ocfs2.h | 6 +-
fs/ocfs2/ocfs2_fs.h | 13 +-
fs/ocfs2/slot_map.c | 479 +++++++++++++++++++++++++++++++++++--
fs/ocfs2/slot_map.h | 3 +
fs/ocfs2/super.c | 42 +++-
8 files changed, 527 insertions(+), 31 deletions(-)
--
2.37.1
Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
In local mount mode, no dlm resource is initialized. If
ocfs2_mount_volume() fails in ocfs2_find_slot(), the error handling
path calls ocfs2_dlm_shutdown(), which then tries to clean up dlm
resources that were never set up and triggers a kernel crash.
Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before return error")
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
fs/ocfs2/dlmglue.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 801e60bab955..1438ac14940b 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
int hangup_pending)
{
+ if (ocfs2_mount_local(osb))
+ return;
+
ocfs2_drop_osb_locks(osb);
/*
--
2.37.1
Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 2/4] ocfs2: add mlog ML_WARNING support
This commit adds a new mlog message type, ML_WARNING, to ocfs2. Messages
with this mask are logged at KERN_WARNING with a "WARNING: " prefix.
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
fs/ocfs2/cluster/masklog.c | 3 +++
fs/ocfs2/cluster/masklog.h | 9 +++++----
2 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/ocfs2/cluster/masklog.c b/fs/ocfs2/cluster/masklog.c
index 563881ddbf00..bac3488e8002 100644
--- a/fs/ocfs2/cluster/masklog.c
+++ b/fs/ocfs2/cluster/masklog.c
@@ -63,6 +63,9 @@ void __mlog_printk(const u64 *mask, const char *func, int line,
if (*mask & ML_ERROR) {
level = KERN_ERR;
prefix = "ERROR: ";
+ } else if (*mask & ML_WARNING) {
+ level = KERN_WARNING;
+ prefix = "WARNING: ";
} else if (*mask & ML_NOTICE) {
level = KERN_NOTICE;
} else {
diff --git a/fs/ocfs2/cluster/masklog.h b/fs/ocfs2/cluster/masklog.h
index b73fc42e46ff..d0bc4fe8cf3d 100644
--- a/fs/ocfs2/cluster/masklog.h
+++ b/fs/ocfs2/cluster/masklog.h
@@ -86,10 +86,11 @@
/* bits that are infrequently given and frequently matched in the high word */
#define ML_ERROR 0x1000000000000000ULL /* sent to KERN_ERR */
-#define ML_NOTICE 0x2000000000000000ULL /* setn to KERN_NOTICE */
-#define ML_KTHREAD 0x4000000000000000ULL /* kernel thread activity */
+#define ML_NOTICE 0x2000000000000000ULL /* sent to KERN_NOTICE */
+#define ML_WARNING 0x4000000000000000ULL /* sent to KERN_WARNING */
+#define ML_KTHREAD 0x8000000000000000ULL /* kernel thread activity */
-#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_NOTICE)
+#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_WARNING|ML_NOTICE)
#ifndef MLOG_MASK_PREFIX
#define MLOG_MASK_PREFIX 0
#endif
@@ -102,7 +103,7 @@
#if defined(CONFIG_OCFS2_DEBUG_MASKLOG)
#define ML_ALLOWED_BITS ~0
#else
-#define ML_ALLOWED_BITS (ML_ERROR|ML_NOTICE)
+#define ML_ALLOWED_BITS (ML_ERROR|ML_WARNING|ML_NOTICE)
#endif
#define MLOG_MAX_BITS 64
--
2.37.1
Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
The key difference between local mount and non-clustered mount:
the local mount feature (tunefs.ocfs2 --fs-features=[no]local) cannot
be converted without the ha stack, whereas the non-clustered mount
feature runs entirely without the ha stack.
Commit 912f655d78c5 ("ocfs2: mount shared volume without ha stack")
had a bug, and commit c80af0c250c8f8a3c978aa5aafbe9c39b336b813 reverted
it.
Here is an explanation of the issue described in commit c80af0c250c8.
Per Junxiao's call trace, the 'if' condition in
__ocfs2_find_empty_slot() is wrong: sl_node_num can legitimately be 0
in an o2cb environment.
With the current information, the trigger flow (based on 912f655d78c5) is:
1>
nodeA mounts with 'node_num = 0', and the mount succeeds.
At this point the slotmap extent block contains es_valid:1 &
es_node_num:0 for nodeA, and ocfs2_update_disk_slot() writes the
slotmap info back to disk.
2>
nodeB then mounts with 'node_num = 1'.
This time osb->node_num is 1 (set by the config file) and
osb->preferred_slot is OCFS2_INVALID_SLOT (set by ocfs2_parse_options).
ocfs2_find_slot
+ ocfs2_update_slot_info //read slotmap info from disk
| + set si->si_slots[0].es_valid = 1 & si->si_slots[0].sl_node_num = 0
|
+ __ocfs2_node_num_to_slot //will return -ENOENT.
+ __ocfs2_find_empty_slot
+ if ((preferred >= 0) && (preferred < si->si_num_slots))
| this 'if' is not entered because preferred is OCFS2_INVALID_SLOT
|
+ 'for(i = 0; i < si->si_num_slots; i++)' stops at slot 0:
| 'si->si_slots[0].sl_node_num' is 0 (false), which satisfies the
| 'break' condition.
|
+ return slot 0.
This lets nodeB grab nodeA's journal dlm lock, which then triggers a hang.
How is this bug handled?
This commit re-enables 912f655d78c5; the next commit (add MMP support)
fixes the issue.
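The slot-selection flaw traced above can be shown in isolation with a simplified C model. This is a sketch, not the kernel code: struct slot and both function names are hypothetical, the preferred-slot branch is omitted, and only the two predicates are taken from the diffs in this series (the "valid or node_num" test from 912f655d78c5 and the valid-only test that the MMP patch restores).

```c
#include <errno.h>

/* Simplified stand-in for the in-memory slot entry. */
struct slot {
	int sl_valid;
	unsigned int sl_node_num;
};

/* Scan as in 912f655d78c5: a slot with node_num 0 also counts as empty. */
static int find_empty_slot_flawed(const struct slot *slots, int num_slots)
{
	for (int i = 0; i < num_slots; i++) {
		if (!slots[i].sl_valid || !slots[i].sl_node_num)
			return i;
	}
	return -ENOSPC;
}

/* Scan that only trusts the valid marker. */
static int find_empty_slot_valid_only(const struct slot *slots, int num_slots)
{
	for (int i = 0; i < num_slots; i++) {
		if (!slots[i].sl_valid)
			return i;
	}
	return -ENOSPC;
}
```

With slot 0 held by a node whose node_num is 0 (sl_valid = 1, sl_node_num = 0) and slot 1 free, the first scan returns slot 0 and hands nodeB a slot that nodeA still owns, while the second scan returns slot 1.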
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
fs/ocfs2/ocfs2.h | 4 +++-
fs/ocfs2/slot_map.c | 46 ++++++++++++++++++++++++++-------------------
fs/ocfs2/super.c | 21 +++++++++++++++++++++
3 files changed, 51 insertions(+), 20 deletions(-)
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 740b64238312..337527571461 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -277,6 +277,7 @@ enum ocfs2_mount_options
OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15, /* Journal Async Commit */
OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */
OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */
+ OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */
};
#define OCFS2_OSB_SOFT_RO 0x0001
@@ -672,7 +673,8 @@ static inline int ocfs2_cluster_o2cb_global_heartbeat(struct ocfs2_super *osb)
static inline int ocfs2_mount_local(struct ocfs2_super *osb)
{
- return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT);
+ return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)
+ || (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER));
}
static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb)
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index da7718cef735..0b0ae3ebb0cf 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -252,14 +252,16 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
int i, ret = -ENOSPC;
if ((preferred >= 0) && (preferred < si->si_num_slots)) {
- if (!si->si_slots[preferred].sl_valid) {
+ if (!si->si_slots[preferred].sl_valid ||
+ !si->si_slots[preferred].sl_node_num) {
ret = preferred;
goto out;
}
}
for(i = 0; i < si->si_num_slots; i++) {
- if (!si->si_slots[i].sl_valid) {
+ if (!si->si_slots[i].sl_valid ||
+ !si->si_slots[i].sl_node_num) {
ret = i;
break;
}
@@ -454,24 +456,30 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
spin_lock(&osb->osb_lock);
ocfs2_update_slot_info(si);
- /* search for ourselves first and take the slot if it already
- * exists. Perhaps we need to mark this in a variable for our
- * own journal recovery? Possibly not, though we certainly
- * need to warn to the user */
- slot = __ocfs2_node_num_to_slot(si, osb->node_num);
- if (slot < 0) {
- /* if no slot yet, then just take 1st available
- * one. */
- slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
+ if (ocfs2_mount_local(osb))
+ /* use slot 0 directly in local mode */
+ slot = 0;
+ else {
+ /* search for ourselves first and take the slot if it already
+ * exists. Perhaps we need to mark this in a variable for our
+ * own journal recovery? Possibly not, though we certainly
+ * need to warn to the user */
+ slot = __ocfs2_node_num_to_slot(si, osb->node_num);
if (slot < 0) {
- spin_unlock(&osb->osb_lock);
- mlog(ML_ERROR, "no free slots available!\n");
- status = -EINVAL;
- goto bail;
- }
- } else
- printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already "
- "allocated to this node!\n", slot, osb->dev_str);
+ /* if no slot yet, then just take 1st available
+ * one. */
+ slot = __ocfs2_find_empty_slot(si, osb->preferred_slot);
+ if (slot < 0) {
+ spin_unlock(&osb->osb_lock);
+ mlog(ML_ERROR, "no free slots available!\n");
+ status = -EINVAL;
+ goto bail;
+ }
+ } else
+ printk(KERN_INFO "ocfs2: Slot %d on device (%s) was "
+ "already allocated to this node!\n",
+ slot, osb->dev_str);
+ }
ocfs2_set_slot(si, slot, osb->node_num);
osb->slot_num = slot;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 438be028935d..f7298816d8d9 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -172,6 +172,7 @@ enum {
Opt_dir_resv_level,
Opt_journal_async_commit,
Opt_err_cont,
+ Opt_nocluster,
Opt_err,
};
@@ -205,6 +206,7 @@ static const match_table_t tokens = {
{Opt_dir_resv_level, "dir_resv_level=%u"},
{Opt_journal_async_commit, "journal_async_commit"},
{Opt_err_cont, "errors=continue"},
+ {Opt_nocluster, "nocluster"},
{Opt_err, NULL}
};
@@ -616,6 +618,13 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
goto out;
}
+ tmp = OCFS2_MOUNT_NOCLUSTER;
+ if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
+ ret = -EINVAL;
+ mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
+ goto out;
+ }
+
tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL |
OCFS2_MOUNT_HB_NONE;
if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
@@ -856,6 +865,7 @@ static int ocfs2_verify_userspace_stack(struct ocfs2_super *osb,
}
if (ocfs2_userspace_stack(osb) &&
+ !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
strncmp(osb->osb_cluster_stack, mopt->cluster_stack,
OCFS2_STACK_LABEL_LEN)) {
mlog(ML_ERROR,
@@ -1127,6 +1137,11 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent)
osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ?
"writeback" :
"ordered");
+ if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) &&
+ !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT))
+ printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted "
+ "without cluster aware mode.\n", osb->dev_str);
+
atomic_set(&osb->vol_state, VOLUME_MOUNTED);
wake_up(&osb->osb_mount_event);
@@ -1437,6 +1452,9 @@ static int ocfs2_parse_options(struct super_block *sb,
case Opt_journal_async_commit:
mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT;
break;
+ case Opt_nocluster:
+ mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER;
+ break;
default:
mlog(ML_ERROR,
"Unrecognized mount option \"%s\" "
@@ -1548,6 +1566,9 @@ static int ocfs2_show_options(struct seq_file *s, struct dentry *root)
if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT)
seq_printf(s, ",journal_async_commit");
+ if (opts & OCFS2_MOUNT_NOCLUSTER)
+ seq_printf(s, ",nocluster");
+
return 0;
}
--
2.37.1
Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
MMP (multiple mount protection) lets a filesystem prevent itself from
being mounted multiple times.
To avoid data corruption when non-clustered and/or clustered mounts
happen at the same time, this commit introduces an MMP feature. The
idea comes from the ext4 MMP code (fs/ext4/mmp.c). Because ocfs2 is a
clustered fs, and to stay compatible with the existing slotmap feature,
I made some optimizations and modifications when porting from ext4 to
ocfs2.
As an optimization:
MMP has a kthread, kmmpd-<dev>, which is only created in non-clustered
mode.
We set a rule:
If the last mount was not unmounted (e.g. after a crash), the next
mount MUST be of the same mount type.
Finally, this commit also fixes the issue mentioned in commit
c80af0c250c8 (Revert "ocfs2: mount shared volume without ha stack").
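How a leftover slot value can tell the next mount what the previous mount was is sketched below in C. The constants are copied from this patch and the classification follows __ocfs2_find_used_slot(); the enum, classify_slot() and same_type_rule() are hypothetical names for this illustration, and the rule function only covers the two explicit mount-type markers, since the patch's own check in ocfs2_find_slot() is not reproduced here.

```c
#include <stdint.h>

/* Values copied from the patch (fs/ocfs2/slot_map.c). */
#define OCFS2_MMP_SEQ_CLEAN   0xFF4D4D50U
#define OCFS2_MMP_SEQ_FSCK    0xE24D4D50U
#define OCFS2_MMP_SEQ_MAX     0xE24D4D4FU
#define OCFS2_VALID_CLUSTER   0xE24D4D55U
#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU

/* Hypothetical classification of a slot's valid/mmp_seq field. */
enum slot_state {
	SLOT_FREE,        /* zero or cleanly unmounted */
	SLOT_SEQ_IN_USE,  /* a sequence number (or the legacy value 1) */
	SLOT_FSCK,        /* fsck marker */
	SLOT_CLUSTER,     /* clustered-mount marker */
	SLOT_NOCLUSTER,   /* non-clustered-mount marker */
	SLOT_UNKNOWN      /* anything else above OCFS2_MMP_SEQ_MAX */
};

static enum slot_state classify_slot(uint32_t valid)
{
	if (valid == 0 || valid == OCFS2_MMP_SEQ_CLEAN)
		return SLOT_FREE;
	if (valid <= OCFS2_MMP_SEQ_MAX)
		return SLOT_SEQ_IN_USE;
	if (valid == OCFS2_MMP_SEQ_FSCK)
		return SLOT_FSCK;
	if (valid == OCFS2_VALID_CLUSTER)
		return SLOT_CLUSTER;
	if (valid == OCFS2_VALID_NOCLUSTER)
		return SLOT_NOCLUSTER;
	return SLOT_UNKNOWN;
}

/*
 * Hypothetical reading of the "same mount type" rule for the two explicit
 * markers only: returns 1 if the requested mount type matches the leftover
 * marker, 0 if it does not, and -1 if the value is not one of the markers.
 */
static int same_type_rule(uint32_t leftover, int noclustered)
{
	switch (classify_slot(leftover)) {
	case SLOT_CLUSTER:
		return !noclustered;
	case SLOT_NOCLUSTER:
		return noclustered != 0;
	default:
		return -1;
	}
}
```

Under this reading, a slot left at OCFS2_VALID_CLUSTER after a crash would reject a nocluster mount until a clustered mount and unmount has cleaned the slot, which matches test case 4.1 in the cover letter.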
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
fs/ocfs2/ocfs2.h | 2 +
fs/ocfs2/ocfs2_fs.h | 13 +-
fs/ocfs2/slot_map.c | 459 ++++++++++++++++++++++++++++++++++++++++++--
fs/ocfs2/slot_map.h | 3 +
fs/ocfs2/super.c | 23 ++-
5 files changed, 479 insertions(+), 21 deletions(-)
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 337527571461..37a7c5855d07 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -337,6 +337,8 @@ struct ocfs2_super
unsigned int node_num;
int slot_num;
int preferred_slot;
+ u16 mmp_update_interval;
+ struct task_struct *mmp_task;
int s_sectsize_bits;
int s_clustersize;
int s_clustersize_bits;
diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
index 638d875eccc7..015672f75563 100644
--- a/fs/ocfs2/ocfs2_fs.h
+++ b/fs/ocfs2/ocfs2_fs.h
@@ -87,7 +87,8 @@
| OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE \
| OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG \
| OCFS2_FEATURE_INCOMPAT_CLUSTERINFO \
- | OCFS2_FEATURE_INCOMPAT_APPEND_DIO)
+ | OCFS2_FEATURE_INCOMPAT_APPEND_DIO \
+ | OCFS2_FEATURE_INCOMPAT_MMP)
#define OCFS2_FEATURE_RO_COMPAT_SUPP (OCFS2_FEATURE_RO_COMPAT_UNWRITTEN \
| OCFS2_FEATURE_RO_COMPAT_USRQUOTA \
| OCFS2_FEATURE_RO_COMPAT_GRPQUOTA)
@@ -167,6 +168,11 @@
*/
#define OCFS2_FEATURE_INCOMPAT_APPEND_DIO 0x8000
+/*
+ * Multiple mount protection
+ */
+#define OCFS2_FEATURE_INCOMPAT_MMP 0x10000
+
/*
* backup superblock flag is used to indicate that this volume
* has backup superblocks.
@@ -535,8 +541,7 @@ struct ocfs2_slot_map {
};
struct ocfs2_extended_slot {
-/*00*/ __u8 es_valid;
- __u8 es_reserved1[3];
+/*00*/ __le32 es_valid;
__le32 es_node_num;
/*08*/
};
@@ -611,7 +616,7 @@ struct ocfs2_super_block {
INCOMPAT flag set. */
/*B8*/ __le16 s_xattr_inline_size; /* extended attribute inline size for this fs*/
- __le16 s_reserved0;
+ __le16 s_mmp_update_interval; /* # seconds to wait in MMP checking */
__le32 s_dx_seed[3]; /* seed[0-2] for dx dir hash.
* s_uuid_hash serves as seed[3]. */
/*C0*/ __le64 s_reserved2[15]; /* Fill out superblock */
diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c
index 0b0ae3ebb0cf..86a21140ead6 100644
--- a/fs/ocfs2/slot_map.c
+++ b/fs/ocfs2/slot_map.c
@@ -8,6 +8,8 @@
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/highmem.h>
+#include <linux/random.h>
+#include <linux/kthread.h>
#include <cluster/masklog.h>
@@ -24,9 +26,48 @@
#include "buffer_head_io.h"
+/*
+ * This structure will be used for multiple mount protection. It will be
+ * written into the '//slot_map' field in the system dir.
+ * Programs that check MMP should assume that if SEQ_FSCK (or any unknown
+ * code above SEQ_MAX) is present then it is NOT safe to use the filesystem.
+ */
+#define OCFS2_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */
+#define OCFS2_MMP_SEQ_FSCK 0xE24D4D50U /* mmp_seq value when being fscked */
+#define OCFS2_MMP_SEQ_MAX 0xE24D4D4FU /* maximum valid mmp_seq value */
+#define OCFS2_MMP_SEQ_INIT 0x0 /* mmp_seq init value */
+#define OCFS2_VALID_CLUSTER 0xE24D4D55U /* value for clustered mount
+ under MMP disabled */
+#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* value for noclustered mount
+ under MMP disabled */
+
+#define OCFS2_SLOT_INFO_OLD_VALID 1 /* use for old slot info */
+
+/*
+ * Check interval multiplier
+ * The MMP block is written every update interval and initially checked every
+ * update interval x the multiplier (the value is then adapted based on the
+ * write latency). The reason is that writes can be delayed under load and we
+ * don't want readers to incorrectly assume that the filesystem is no longer
+ * in use.
+ */
+#define OCFS2_MMP_CHECK_MULT 2UL
+
+/*
+ * Minimum interval for MMP checking in seconds.
+ */
+#define OCFS2_MMP_MIN_CHECK_INTERVAL 5UL
+
+/*
+ * Maximum interval for MMP checking in seconds.
+ */
+#define OCFS2_MMP_MAX_CHECK_INTERVAL 300UL
struct ocfs2_slot {
- int sl_valid;
+ union {
+ unsigned int sl_valid;
+ unsigned int mmp_seq;
+ };
unsigned int sl_node_num;
};
@@ -52,11 +93,11 @@ static void ocfs2_invalidate_slot(struct ocfs2_slot_info *si,
}
static void ocfs2_set_slot(struct ocfs2_slot_info *si,
- int slot_num, unsigned int node_num)
+ int slot_num, unsigned int node_num, unsigned int valid)
{
BUG_ON((slot_num < 0) || (slot_num >= si->si_num_slots));
- si->si_slots[slot_num].sl_valid = 1;
+ si->si_slots[slot_num].sl_valid = valid;
si->si_slots[slot_num].sl_node_num = node_num;
}
@@ -75,7 +116,8 @@ static void ocfs2_update_slot_info_extended(struct ocfs2_slot_info *si)
i++, slotno++) {
if (se->se_slots[i].es_valid)
ocfs2_set_slot(si, slotno,
- le32_to_cpu(se->se_slots[i].es_node_num));
+ le32_to_cpu(se->se_slots[i].es_node_num),
+ le32_to_cpu(se->se_slots[i].es_valid));
else
ocfs2_invalidate_slot(si, slotno);
}
@@ -97,7 +139,8 @@ static void ocfs2_update_slot_info_old(struct ocfs2_slot_info *si)
if (le16_to_cpu(sm->sm_slots[i]) == (u16)OCFS2_INVALID_SLOT)
ocfs2_invalidate_slot(si, i);
else
- ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]));
+ ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]),
+ OCFS2_SLOT_INFO_OLD_VALID);
}
}
@@ -252,16 +295,14 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
int i, ret = -ENOSPC;
if ((preferred >= 0) && (preferred < si->si_num_slots)) {
- if (!si->si_slots[preferred].sl_valid ||
- !si->si_slots[preferred].sl_node_num) {
+ if (!si->si_slots[preferred].sl_valid) {
ret = preferred;
goto out;
}
}
for(i = 0; i < si->si_num_slots; i++) {
- if (!si->si_slots[i].sl_valid ||
- !si->si_slots[i].sl_node_num) {
+ if (!si->si_slots[i].sl_valid) {
ret = i;
break;
}
@@ -270,6 +311,43 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si,
return ret;
}
+/* Return first used slot.
+ * -ENOENT means all slots are clean, ->sl_valid should be
+ * OCFS2_MMP_SEQ_CLEAN or ZERO */
+static int __ocfs2_find_used_slot(struct ocfs2_slot_info *si)
+{
+ int i, ret = -ENOENT, valid;
+
+ for (i = 0; i < si->si_num_slots; i++) {
+ valid = si->si_slots[i].sl_valid;
+ if (valid == 0 || valid == OCFS2_MMP_SEQ_CLEAN)
+ continue;
+ if (valid <= OCFS2_MMP_SEQ_MAX ||
+ valid == OCFS2_MMP_SEQ_FSCK ||
+ valid == OCFS2_VALID_CLUSTER ||
+ valid == OCFS2_VALID_NOCLUSTER) {
+ ret = i;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+static int __ocfs2_find_expected_slot(struct ocfs2_slot_info *si,
+ unsigned int expected)
+{
+ int i;
+
+ for (i = 0; i < si->si_num_slots; i++) {
+ if (si->si_slots[i].sl_valid == expected) {
+ return 1;
+ }
+ }
+
+ return 0;
+}
+
int ocfs2_node_num_to_slot(struct ocfs2_super *osb, unsigned int node_num)
{
int slot;
@@ -445,21 +523,357 @@ void ocfs2_free_slot_info(struct ocfs2_super *osb)
__ocfs2_free_slot_info(si);
}
+/*
+ * Get a random new sequence number but make sure it is not greater than
+ * OCFS2_MMP_SEQ_MAX.
+ */
+static unsigned int mmp_new_seq(void)
+{
+ u32 new_seq;
+
+ do {
+ new_seq = prandom_u32();
+ } while (new_seq > OCFS2_MMP_SEQ_MAX);
+
+ if (new_seq == 0)
+ return 1;
+ else
+ return new_seq;
+}
+
+/*
+ * kmmpd will update the MMP sequence every mmp_update_interval seconds
+ */
+static int kmmpd(void *data)
+{
+ struct ocfs2_super *osb = data;
+ struct super_block *sb = osb->sb;
+ struct ocfs2_slot_info *si = osb->slot_info;
+ int slot = osb->slot_num;
+ u32 seq, mmp_seq;
+ unsigned long failed_writes = 0;
+ u16 mmp_update_interval = osb->mmp_update_interval;
+ unsigned int mmp_check_interval;
+ unsigned long last_update_time;
+ unsigned long diff;
+ int retval = 0;
+
+ if (!ocfs2_mount_local(osb)) {
+ mlog(ML_ERROR, "kmmpd thread only works for local mount mode.\n");
+ goto wait_to_exit;
+ }
+
+ retval = ocfs2_refresh_slot_info(osb);
+ seq = si->si_slots[slot].mmp_seq;
+
+ /*
+ * Start with the higher mmp_check_interval and reduce it if
+ * the MMP block is being updated on time.
+ */
+ mmp_check_interval = max(OCFS2_MMP_CHECK_MULT * mmp_update_interval,
+ OCFS2_MMP_MIN_CHECK_INTERVAL);
+
+ while (!kthread_should_stop() && !sb_rdonly(sb)) {
+ if (!OCFS2_HAS_INCOMPAT_FEATURE(sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+ mlog(ML_WARNING, "kmmpd being stopped since MMP feature"
+ " has been disabled.");
+ goto wait_to_exit;
+ }
+ if (++seq > OCFS2_MMP_SEQ_MAX)
+ seq = 1;
+
+ spin_lock(&osb->osb_lock);
+ si->si_slots[slot].mmp_seq = mmp_seq = seq;
+ spin_unlock(&osb->osb_lock);
+
+ last_update_time = jiffies;
+ retval = ocfs2_update_disk_slot(osb, si, slot);
+
+ /*
+ * Don't spew too many error messages. Print one every
+ * (s_mmp_update_interval * 60) seconds.
+ */
+ if (retval) {
+ if ((failed_writes % 60) == 0) {
+ ocfs2_error(sb, "Error writing to MMP block");
+ }
+ failed_writes++;
+ }
+
+ diff = jiffies - last_update_time;
+ if (diff < mmp_update_interval * HZ)
+ schedule_timeout_interruptible(mmp_update_interval *
+ HZ - diff);
+
+ /*
+ * We need to make sure that more than mmp_check_interval
+ * seconds have not passed since writing. If that has happened
+ * we need to check if the MMP block is as we left it.
+ */
+ diff = jiffies - last_update_time;
+ if (diff > mmp_check_interval * HZ) {
+ retval = ocfs2_refresh_slot_info(osb);
+ if (retval) {
+ ocfs2_error(sb, "error reading MMP data: %d", retval);
+ goto wait_to_exit;
+ }
+
+ if (si->si_slots[slot].mmp_seq != mmp_seq) {
+ ocfs2_error(sb, "Error while updating MMP info. "
+ "The filesystem seems to have been"
+ " multiply mounted.");
+ retval = -EBUSY;
+ goto wait_to_exit;
+ }
+ }
+
+ /*
+ * Adjust the mmp_check_interval depending on how much time
+ * it took for the MMP block to be written.
+ */
+ mmp_check_interval = max(min(OCFS2_MMP_CHECK_MULT * diff / HZ,
+ OCFS2_MMP_MAX_CHECK_INTERVAL),
+ OCFS2_MMP_MIN_CHECK_INTERVAL);
+ }
+
+ /*
+ * Unmount seems to be clean.
+ */
+ spin_lock(&osb->osb_lock);
+ si->si_slots[slot].mmp_seq = OCFS2_MMP_SEQ_CLEAN;
+ spin_unlock(&osb->osb_lock);
+
+ retval = ocfs2_update_disk_slot(osb, si, 0);
+
+wait_to_exit:
+ while (!kthread_should_stop()) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (!kthread_should_stop())
+ schedule();
+ }
+ set_current_state(TASK_RUNNING);
+ return retval;
+}
+
+void ocfs2_stop_mmpd(struct ocfs2_super *osb)
+{
+ if (osb->mmp_task) {
+ kthread_stop(osb->mmp_task);
+ osb->mmp_task = NULL;
+ }
+}
+
+/*
+ * Protect the filesystem from being mounted more than once.
+ *
+ * This function was inspired by ext4 MMP feature. Because HA stack
+ * helps ocfs2 to manage nodes join/leave, so we only focus on MMP
+ * under nocluster mode.
+ * Also note that ocfs2 only uses slot 0 in nocluster mode.
+ *
+ * es_valid:
+ * 0: not available
+ * 1: valid, cluster mode
+ * 2: valid, nocluster mode
+ *
+ * parameters:
+ * osb: the struct ocfs2_super
+ * noclustered: under noclustered mount
+ * slot: prefer slot number
+ */
+int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered)
+{
+ struct buffer_head *bh = NULL;
+ u32 seq;
+ struct ocfs2_slot_info *si = osb->slot_info;
+ unsigned int mmp_check_interval = osb->mmp_update_interval;
+ unsigned int wait_time = 0;
+ int retval = 0;
+ int slot = osb->slot_num;
+
+ if (!ocfs2_uses_extended_slot_map(osb)) {
+ mlog(ML_WARNING, "MMP only works on extended slot map.\n");
+ retval = -EINVAL;
+ goto bail;
+ }
+
+ retval = ocfs2_refresh_slot_info(osb);
+ if (retval)
+ goto bail;
+
+ if (mmp_check_interval < OCFS2_MMP_MIN_CHECK_INTERVAL)
+ mmp_check_interval = OCFS2_MMP_MIN_CHECK_INTERVAL;
+
+ spin_lock(&osb->osb_lock);
+ seq = si->si_slots[slot].mmp_seq;
+
+ if (__ocfs2_find_used_slot(si) == -ENOENT)
+ goto skip;
+
+ /* TODO ocfs2-tools need to support this flag */
+ if (__ocfs2_find_expected_slot(si, OCFS2_MMP_SEQ_FSCK)) {
+ mlog(ML_NOTICE, "fsck is running on the filesystem");
+ spin_unlock(&osb->osb_lock);
+ retval = -EBUSY;
+ goto bail;
+ }
+ spin_unlock(&osb->osb_lock);
+
+ wait_time = min(mmp_check_interval * 2 + 1, mmp_check_interval + 60);
+
+ /* Print MMP interval if more than 20 secs. */
+ if (wait_time > OCFS2_MMP_MIN_CHECK_INTERVAL * 4)
+ mlog(ML_WARNING, "MMP interval %u higher than expected, please"
+ " wait.\n", wait_time * 2);
+
+ if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
+ mlog(ML_WARNING, "MMP startup interrupted, failing mount.\n");
+ retval = -EPERM;
+ goto bail;
+ }
+
+ retval = ocfs2_refresh_slot_info(osb);
+ if (retval)
+ goto bail;
+ if (seq != si->si_slots[slot].mmp_seq) {
+ mlog(ML_ERROR, "Device is already active on another node.\n");
+ retval = -EPERM;
+ goto bail;
+ }
+
+ spin_lock(&osb->osb_lock);
+skip:
+ /*
+ * write a new random sequence number.
+ */
+ seq = mmp_new_seq();
+ mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq,
+ si->si_slots[slot].mmp_seq);
+ ocfs2_set_slot(si, slot, osb->node_num, seq);
+ spin_unlock(&osb->osb_lock);
+
+ ocfs2_update_disk_slot_extended(si, slot, &bh);
+ mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq,
+ si->si_slots[slot].mmp_seq);
+ retval = ocfs2_write_block(osb, bh, INODE_CACHE(si->si_inode));
+ if (retval < 0) {
+ mlog_errno(retval);
+ goto bail;
+ }
+ mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x wait_time: %u\n", seq,
+ si->si_slots[slot].mmp_seq, wait_time);
+
+ /*
+ * wait for MMP interval and check mmp_seq.
+ */
+ if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
+ mlog(ML_WARNING, "MMP startup interrupted, failing mount.\n");
+ retval = -EPERM;
+ goto bail;
+ }
+
+ retval = ocfs2_refresh_slot_info(osb);
+ if (retval)
+ goto bail;
+
+ mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq,
+ si->si_slots[slot].mmp_seq);
+ if (seq != si->si_slots[slot].mmp_seq) {
+ mlog(ML_ERROR, "Update seq failed, device is already active on another node.\n");
+ retval = -EPERM;
+ goto bail;
+ }
+
+ /*
+ * There are two reasons we don't create kmmpd on a clustered mount:
+ * - ocfs2 needs to grab osb->osb_lock to modify/access osb->si.
+ * - In a cluster with a huge number of nodes, all nodes updating the
+ * same sector of '//slot_map' would cause an IO performance issue.
+ *
+ * This raises another question:
+ * On a clustered mount the MMP seq is not updated, so how does MMP
+ * handle a noclustered mount when a clustered mount already exists?
+ * The answer is the rule described in ocfs2_find_slot().
+ */
+ if (!noclustered) {
+ spin_lock(&osb->osb_lock);
+ ocfs2_set_slot(si, slot, osb->node_num, OCFS2_VALID_CLUSTER);
+ spin_unlock(&osb->osb_lock);
+
+ ocfs2_update_disk_slot_extended(si, slot, &bh);
+ retval = ocfs2_write_block(osb, bh, INODE_CACHE(si->si_inode));
+ goto bail;
+ }
+
+ /*
+ * Start a kernel thread to update the MMP block periodically.
+ */
+ osb->mmp_task = kthread_run(kmmpd, osb, "kmmpd-%s",
+ osb->sb->s_id);
+ if (IS_ERR(osb->mmp_task)) {
+ osb->mmp_task = NULL;
+ mlog(ML_WARNING, "Unable to create kmmpd thread for %s.\n",
+ osb->sb->s_id);
+ retval = -EPERM;
+ goto bail;
+ }
+
+bail:
+ return retval;
+}
+
+static void show_conflict_mnt_msg(int clustered)
+{
+ const char *exist = clustered ? "non-clustered" : "clustered";
+
+ mlog(ML_ERROR, "Found %s mount info!\n", exist);
+ mlog(ML_ERROR, "Please clean %s slotmap info for mounting.\n",
+ exist);
+ mlog(ML_ERROR, "e.g. mount and then unmount in %s mode\n", exist);
+}
+
+/*
+ * Even in read-only mode, we write slot info to disk.
+ * This is intentional: if slot info were not updated on a read-only
+ * mount, a later mount from another node in a cluster environment
+ * could reuse the same slot and deadlock.
+ */
int ocfs2_find_slot(struct ocfs2_super *osb)
{
- int status;
+ int status = -EPERM;
int slot;
+ int noclustered = 0;
struct ocfs2_slot_info *si;
si = osb->slot_info;
spin_lock(&osb->osb_lock);
ocfs2_update_slot_info(si);
+ slot = __ocfs2_find_used_slot(si);
+ if (slot == 0 &&
+ ((si->si_slots[0].sl_valid == OCFS2_VALID_NOCLUSTER) ||
+ (si->si_slots[0].sl_valid < OCFS2_MMP_SEQ_MAX)))
+ noclustered = 1;
- if (ocfs2_mount_local(osb))
- /* use slot 0 directly in local mode */
- slot = 0;
- else {
+ /*
+ * We set a rule:
+ * If the last mount was not cleanly unmounted (e.g. after a crash),
+ * the next mount MUST be of the same mount type.
+ */
+ if (ocfs2_mount_local(osb)) {
+ /* empty slotmap, or device didn't unmount from last time */
+ if ((slot == -ENOENT) || noclustered) {
+ /* use slot 0 directly in local mode */
+ slot = 0;
+ noclustered = 1;
+ } else {
+ spin_unlock(&osb->osb_lock);
+ show_conflict_mnt_msg(0);
+ status = -EINVAL;
+ goto bail;
+ }
+ } else {
+ if (noclustered) {
+ spin_unlock(&osb->osb_lock);
+ show_conflict_mnt_msg(1);
+ status = -EINVAL;
+ goto bail;
+ }
/* search for ourselves first and take the slot if it already
* exists. Perhaps we need to mark this in a variable for our
* own journal recovery? Possibly not, though we certainly
@@ -481,7 +895,21 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
slot, osb->dev_str);
}
- ocfs2_set_slot(si, slot, osb->node_num);
+ if (OCFS2_HAS_INCOMPAT_FEATURE(osb->sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+ osb->slot_num = slot;
+ spin_unlock(&osb->osb_lock);
+ status = ocfs2_multi_mount_protect(osb, noclustered);
+ if (status < 0) {
+ mlog(ML_ERROR, "MMP failed to start.\n");
+ goto mmp_fail;
+ }
+
+ trace_ocfs2_find_slot(osb->slot_num);
+ return status;
+ }
+
+ ocfs2_set_slot(si, slot, osb->node_num, noclustered ?
+ OCFS2_VALID_NOCLUSTER : OCFS2_VALID_CLUSTER);
osb->slot_num = slot;
spin_unlock(&osb->osb_lock);
@@ -490,6 +918,7 @@ int ocfs2_find_slot(struct ocfs2_super *osb)
status = ocfs2_update_disk_slot(osb, si, osb->slot_num);
if (status < 0) {
mlog_errno(status);
+mmp_fail:
/*
* if write block failed, invalidate slot to avoid overwrite
* slot during dismount in case another node rightly has mounted
diff --git a/fs/ocfs2/slot_map.h b/fs/ocfs2/slot_map.h
index a43644570b53..d4d147b0c190 100644
--- a/fs/ocfs2/slot_map.h
+++ b/fs/ocfs2/slot_map.h
@@ -25,4 +25,7 @@ int ocfs2_slot_to_node_num_locked(struct ocfs2_super *osb, int slot_num,
int ocfs2_clear_slot(struct ocfs2_super *osb, int slot_num);
+int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered);
+void ocfs2_stop_mmpd(struct ocfs2_super *osb);
+
#endif
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index f7298816d8d9..b0e76b06efc3 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -609,6 +609,7 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
struct mount_options parsed_options;
struct ocfs2_super *osb = OCFS2_SB(sb);
u32 tmp;
+ int noclustered;
sync_filesystem(sb);
@@ -619,7 +620,8 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
}
tmp = OCFS2_MOUNT_NOCLUSTER;
- if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) {
+ noclustered = osb->s_mount_opt & tmp;
+ if (noclustered != (parsed_options.mount_opt & tmp)) {
ret = -EINVAL;
mlog(ML_ERROR, "Cannot change nocluster option on remount\n");
goto out;
@@ -686,10 +688,20 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
}
sb->s_flags &= ~SB_RDONLY;
osb->osb_flags &= ~OCFS2_OSB_SOFT_RO;
+ if (OCFS2_HAS_INCOMPAT_FEATURE(sb, OCFS2_FEATURE_INCOMPAT_MMP)) {
+ spin_unlock(&osb->osb_lock);
+ if (ocfs2_multi_mount_protect(osb, noclustered)) {
+ mlog(ML_ERROR, "MMP failed to start.\n");
+ ocfs2_stop_mmpd(osb);
+ ret = -EROFS;
+ goto unlocked_osb;
+ }
+ /* Retake the lock dropped above; unlock_osb: releases it again. */
+ spin_lock(&osb->osb_lock);
+ }
}
trace_ocfs2_remount(sb->s_flags, osb->osb_flags, *flags);
unlock_osb:
spin_unlock(&osb->osb_lock);
+unlocked_osb:
/* Enable quota accounting after remounting RW */
if (!ret && !(*flags & SB_RDONLY)) {
if (sb_any_quota_suspended(sb))
@@ -722,6 +734,8 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data)
sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
((osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) ?
SB_POSIXACL : 0);
+ if (sb_rdonly(osb->sb))
+ ocfs2_stop_mmpd(osb);
}
out:
return ret;
@@ -1833,7 +1847,7 @@ static int ocfs2_mount_volume(struct super_block *sb)
status = ocfs2_init_local_system_inodes(osb);
if (status < 0) {
mlog_errno(status);
- goto out_super_lock;
+ goto out_find_slot;
}
status = ocfs2_check_volume(osb);
@@ -1858,6 +1872,8 @@ static int ocfs2_mount_volume(struct super_block *sb)
/* before journal shutdown, we should release slot_info */
ocfs2_free_slot_info(osb);
ocfs2_journal_shutdown(osb);
+out_find_slot:
+ ocfs2_stop_mmpd(osb);
out_super_lock:
ocfs2_super_unlock(osb, 1);
out_dlm:
@@ -1878,6 +1894,8 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err)
osb = OCFS2_SB(sb);
BUG_ON(!osb);
+ ocfs2_stop_mmpd(osb);
+
/* Remove file check sysfs related directores/files,
* and wait for the pending file check operations */
ocfs2_filecheck_remove_sysfs(osb);
@@ -2086,6 +2104,7 @@ static int ocfs2_initialize_super(struct super_block *sb,
snprintf(osb->dev_str, sizeof(osb->dev_str), "%u,%u",
MAJOR(osb->sb->s_dev), MINOR(osb->sb->s_dev));
+ osb->mmp_update_interval =
+ le16_to_cpu(di->id2.i_super.s_mmp_update_interval);
osb->max_slots = le16_to_cpu(di->id2.i_super.s_max_slots);
if (osb->max_slots > OCFS2_MAX_SLOTS || osb->max_slots == 0) {
mlog(ML_ERROR, "Invalid number of node slots (%u)\n",
--
2.37.1
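
The mount-time handshake in ocfs2_multi_mount_protect() above can be summarized with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: MMP_MIN_CHECK_INTERVAL is assumed to be 5 seconds (the ext4 value; the ocfs2 constant is not shown in this hunk), and mmp_handshake(), mmp_wait_fn and the two wait helpers are hypothetical names introduced for illustration. The fsck check and the kmmpd thread are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value (ext4 uses 5 s); the ocfs2 constant is not shown here. */
#define MMP_MIN_CHECK_INTERVAL 5u

enum { MMP_OK = 0, MMP_BUSY = -1 };

/*
 * Hypothetical hook standing in for schedule_timeout_interruptible()
 * followed by a slot map refresh. A test may change *disk_seq here to
 * mimic another node writing the slot while we sleep.
 */
typedef void (*mmp_wait_fn)(uint32_t *disk_seq, unsigned int secs);

/* Same arithmetic as the patch: clamp, then min(2 * i + 1, i + 60). */
static unsigned int mmp_wait_time(unsigned int interval)
{
	unsigned int a, b;

	if (interval < MMP_MIN_CHECK_INTERVAL)
		interval = MMP_MIN_CHECK_INTERVAL;
	a = interval * 2 + 1;
	b = interval + 60;
	return a < b ? a : b;
}

static int mmp_handshake(uint32_t *disk_seq, bool slotmap_empty,
			 uint32_t new_seq, unsigned int interval,
			 mmp_wait_fn wait)
{
	/* Stays 0 on the empty-slotmap ("skip") path, as in the patch. */
	unsigned int wait_time = 0;
	uint32_t seen;

	assert(disk_seq != NULL && wait != NULL);
	seen = *disk_seq;

	if (!slotmap_empty) {
		/* Phase 1: watch the on-disk sequence for one window. */
		wait_time = mmp_wait_time(interval);
		wait(disk_seq, wait_time);
		if (*disk_seq != seen)
			return MMP_BUSY; /* someone else is updating it */
	}

	/* Phase 2: claim the slot, wait, confirm nobody overwrote us. */
	*disk_seq = new_seq;
	wait(disk_seq, wait_time);
	if (*disk_seq != new_seq)
		return MMP_BUSY;
	return MMP_OK;
}

/* Demo wait hook: no other node touches the slot. */
static void wait_quiet(uint32_t *disk_seq, unsigned int secs)
{
	(void)disk_seq;
	(void)secs;
}

/* Demo wait hook: another node bumps the sequence while we sleep. */
static void wait_with_peer(uint32_t *disk_seq, unsigned int secs)
{
	(void)secs;
	*disk_seq += 1;
}
```

Calling mmp_handshake() with wait_quiet models an uncontested mount, while wait_with_peer models a second node that is still updating the slot. The model also shows that on the empty-slotmap path the second wait has a zero timeout, because wait_time is only computed on the non-skip path.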
Joseph Qi
2022-Aug-08 06:51 UTC
[Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
On 7/30/22 9:14 AM, Heming Zhao wrote:
> On local mount mode, there is no dlm resource initalized. If
> ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling
> flow will call ocfs2_dlm_shutdown(), then does dlm resource
> cleanup job, which will trigger kernel crash.
>
> Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before
> return error")

The Fixes tag should be kept on a single line.

> Signed-off-by: Heming Zhao <heming.zhao at suse.com>
> ---
> fs/ocfs2/dlmglue.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> index 801e60bab955..1438ac14940b 100644
> --- a/fs/ocfs2/dlmglue.c
> +++ b/fs/ocfs2/dlmglue.c
> @@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb)
> void ocfs2_dlm_shutdown(struct ocfs2_super *osb,
> int hangup_pending)
> {
> + if (ocfs2_mount_local(osb))
> + return;
> +

IMO, we still have to do part of the ocfs2_dlm_shutdown() work, such as
ocfs2_lock_res_free(), which removes the lockres that
ocfs2_xxx_lock_res_init() added to d_lockres_tracking.
It seems this issue also existed before commit 0737e01de9c4, since
osb->cconn is already set under local mount mode.

Thanks,
Joseph
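
The concern in this reply can be illustrated with a toy userspace model. It is a sketch only: every toy_* name is hypothetical, the list merely stands in for d_lockres_tracking, and the real ocfs2 structures and teardown steps are not reproduced. It shows why an unconditional early return in the shutdown path leaves entries on the tracking list that the init path had added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a lock resource linked on a tracking list. */
struct toy_lockres {
	struct toy_lockres *prev, *next;
};

/* Toy stand-in for the d_lockres_tracking list head. */
struct toy_tracking {
	struct toy_lockres head;
};

static void toy_tracking_init(struct toy_tracking *t)
{
	t->head.prev = t->head.next = &t->head;
}

/* Models ocfs2_xxx_lock_res_init(): links the lockres on the list. */
static void toy_lock_res_init(struct toy_tracking *t, struct toy_lockres *res)
{
	res->prev = t->head.prev;
	res->next = &t->head;
	t->head.prev->next = res;
	t->head.prev = res;
}

/* Models ocfs2_lock_res_free(): unlinks the lockres from the list. */
static void toy_lock_res_free(struct toy_lockres *res)
{
	res->prev->next = res->next;
	res->next->prev = res->prev;
	res->prev = res->next = res;
}

static size_t toy_tracking_count(const struct toy_tracking *t)
{
	const struct toy_lockres *p;
	size_t n = 0;

	for (p = t->head.next; p != &t->head; p = p->next)
		n++;
	return n;
}

/*
 * early_return == true models the patch as posted: a local mount
 * returns before any cleanup. early_return == false models the
 * suggestion: skip only the DLM-specific work, still free the lockres.
 */
static void toy_dlm_shutdown(struct toy_lockres *res, size_t n,
			     bool local_mount, bool early_return)
{
	size_t i;

	if (local_mount && early_return)
		return;

	/* DLM-only teardown would be skipped here for a local mount. */

	for (i = 0; i < n; i++)
		toy_lock_res_free(&res[i]);
}
```

With early_return set, the lock resources stay on the tracking list after shutdown; without it, the list is emptied even for a local mount.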
Joseph Qi
2022-Aug-08 08:19 UTC
[Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
On 7/30/22 9:14 AM, Heming Zhao wrote:
> MMP (multiple mount protection) gives filesystem ability to prevent
> from being mounted multiple times.
>
> For avoiding data corruption when non-clustered and/or clustered mount
> are happening at same time, this commit introduced MMP feature. MMP
> idea is from ext4 MMP (fs/ext4/mmp.c) code. For ocfs2 is a clustered
> fs and also for compatible with existing slotmap feature, I did some
> optimization and modification when porting from ext4 to ocfs2.
>
> For optimization:
> mmp has a kthread kmmpd-<dev>, which is only created in non-clustered
> mode.
>
> We set a rule:
> If last mount didn't do unmount, (eg: crash), the next mount MUST be
> same mount type.
>
> At last, this commit also fix commit c80af0c250c8 ("Revert "ocfs2:
> mount shared volume without ha stack") mentioned issue.

I suggest we re-split this series (especially patches 3 and 4), rather
than first reviving a buggy commit and then fixing it in a later commit.

Thanks,
Joseph