Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 0/4] re-enable non-clustered mount & add MMP support
This serial patches re-enable ocfs2 non-clustered mount feature. the previous patch c80af0c250c8 (Revert "ocfs2: mount shared volume without ha stack") revert Gang's non-clustered mount patch. This serial patches re-enable ocfs2 non-clustered mount. the key different between local mount and non-clustered mount: local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do convert job without ha stack. non-clustered mount feature can run totally without ha stack. For avoiding data corruption when non-clustered & clustered mount are happening at same time, this serial patches also introduces MMP feature. MMP (Multiple Mount Protection) idea got from ext4 MMP (fs/ext4/mmp.c) which protects fs from being mounted more than once. For ocfs2 is a clustered fs and also for compatible with existing slotmap feature, I did some optimization and modification when porting from ext4 MMP to ocfs2. The related userspace code for supporting MMP had been sent to github for reviewing: - https://github.com/markfasheh/ocfs2-tools/pull/58 ocfs2-tools enable MMP and check status: ``` # enable MMP tunefs.ocfs2 --fs-feature=mmp /dev/vdb # check the command result tunefs.ocfs2 -Q "%H\n" /dev/vdb | grep MMP # active MMP on nocluster mount mount -t ocfs2 -o nocluster /dev/vdb /mnt # check slotmap info # echo slotmap | PAGER=cat debugfs.ocfs2 /dev/vdb ``` === below are test cases for patches === <1> non-clustered mount vs local mount 1.1 tunefs.ocfs2 can't convert local/nolocal mount without ha stack. ``` (on ha stack env) mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb tunefs.ocfs2 --fs-features=local /dev/vdb (<== success) tunefs.ocfs2 --fs-features=nolocal /dev/vdb (<== success) (on another node without ha stack) tunefs.ocfs2 --fs-features=local /dev/vdb (<== failure) ``` 1.2 non-cluster feature can run without ha stack. ``` (on ha stack env) mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb (on another node without ha stack) mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success) ``` <2> do clustered & non-clustered mount on same node 2.1 non-clustered mount => clustered mount ``` mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb mount -t ocfs2 -o nocluster /dev/vdb /mnt mount -t ocfs2 /dev/vdb /mnt (<=== failure) ``` 2.2 clustered mount => non-clustered mount ``` mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb mount -t ocfs2 /dev/vdb /mnt mount -t ocfs2 -o nocluster /dev/vdb /mnt (<=== failure) ``` <3> one node does clustered mount, another does non-clustered mount test rule: clustered mount and non-clustered mount can not exist at same time. 3.1 clustered mount @node1 => [no]clustered mount @node2 ``` node1: mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb mount -t ocfs2 /dev/vdb /mnt node2: mount -t ocfs2 /dev/vdb /mnt (<== success) umount /mnt mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure) ``` 3.2 enable mmp, repeate 3.1 case ``` node1: mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb tunefs.ocfs2 --fs-features=mmp /dev/vdb (<== enable mmp) mount -t ocfs2 /dev/vdb /mnt node2: mount -t ocfs2 /dev/vdb /mnt (<== wait ~22s [*] for mmp, then success) umount /mnt mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure) ``` [*] 22s: (OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1) * 2 times (calling schedule_timeout_interruptible) 3.3 noclustered mount @node1 => [no]clustered mount @node2 ``` node1: mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb mount -t ocfs2 -o nocluster /dev/vdb /mnt node2: mount -t ocfs2 /dev/vdb /mnt (<== failure) mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success, without mmp enable) umount /mnt (<== will ZERO out slotmap area while node1 still mounting) ``` 3.4 enable mmp, repeate 3.3 case. ``` node1: mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb tunefs.ocfs2 --fs-features=mmp /dev/vdb (<== enable mmp) mount -t ocfs2 -o nocluster /dev/vdb /mnt node2: mount -t ocfs2 /dev/vdb /mnt (<== failure) mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== failure, denied by mmp) ``` <4> simulate mounting after machine crash info: - below all steps do on one node - address 287387648 is the '//slot_map' extent address. - test the rule: If last mount didn't do unmount, (eg: crash), the next mount MUST be same mount type. 4.0 how to calculate '//slot_map' extent address ``` # PAGER=cat debugfs.ocfs2 -R "stats" /dev/vdb | grep "Block Size Bits" Block Size Bits: 12 Cluster Size Bits: 12 # PAGER=cat debugfs.ocfs2 -R "stat //slot_map" /dev/vdb | grep -A1 # "Block#" ## Offset Clusters Block# Flags 0 0 1 70163 0x0 ``` 70163 * (1<<12) = 70163 * 4096 = 287387648 4.1 clustered mount => crash => non-clustered mount fails => clean slotmap => non-clustered mount succeeds ``` mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb mount -t ocfs2 /dev/vdb /mnt dd if=/dev/vdb bs=1 count=32 skip=287387648 of=/root/slotmap.cluster.mnted (<== backup slot info) umount /mnt dd if=/root/slotmap.cluster.mnted of=/dev/vdb seek=287387648 bs=1 count=32 (<== overwrite) mount -t ocfs2 -o nocluster /dev/vdb /mnt <== failure mount -t ocfs2 /dev/vdb /mnt && umount /mnt <== clean slot 0 mount -t ocfs2 -o nocluster /dev/vdb /mnt <== success ``` 4.2 non-clustered mount => crash => clustered mount fails => clean slotmap => clustered mount succeeds ``` mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb mount -t ocfs2 -o nocluster /dev/vdb /mnt dd if=/dev/vdb bs=1 count=32 skip=287387648 of=/root/slotmap.nocluster.mnted umount /mnt dd if=/root/slotmap.nocluster.mnted of=/dev/vdb seek=287387648 bs=1 count=32 mount -t ocfs2 /dev/vdb /mnt <== failure mount -t ocfs2 -o nocluster /dev/vdb /mnt && umount /mnt <== clean slot 0 mount -t ocfs2 /dev/vdb /mnt <== success ``` <5> MMP test 5.1 node1 noclustered mount => node 2 noclustered mount disable mmp ``` node1: mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb mount -t ocfs2 -o nocluster /dev/vdb /mnt node2: mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== success) ``` enable mmp ``` node1: mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb tunefs.ocfs2 --fs-features=mmp /dev/vdb mount -t ocfs2 -o nocluster /dev/vdb /mnt node2: mount -t ocfs2 -o nocluster /dev/vdb /mnt (<== wait ~12s[*], failure by mmp) ``` [*] 12s: sleep (OCFS2_MMP_MIN_CHECK_INTERVAL * 2 + 1) then detect mmp_seq was changed, then failed. 5.2 node1 clustered mount => node 2 clustered mount see case 3.2 5.3 node1 noclustered mount => node 2 noclustered mount see case 3.4 5.4 remount test 5.4.1 non-clustered mount (run commands on same node) ``` mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb tunefs.ocfs2 --fs-features=mmp /dev/vdb mount -t ocfs2 -o nocluster /dev/vdb /mnt ps axj | grep kmmpd (<== will show kmmpd) PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show 'OCFS2_MMP_SEQ') mount -o remount,ro,nocluster /dev/vdb /mnt (<== kmmpd will stop) ps axj | grep kmmpd (<== won't show kmmpd) PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show 'OCFS2_MMP_SEQ_CLEAN') mount -o remount,rw,nocluster /dev/vdb /mnt (<== kmmpd will start) ps axj | grep kmmpd (<== will show kmmpd) PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show 'OCFS2_MMP_SEQ') ``` 5.4.2 clustered mount ``` mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=hacluster -N 4 /dev/vdb tunefs.ocfs2 --fs-features=mmp /dev/vdb mount -t ocfs2 /dev/vdb /mnt (<== clustered mount won't create kmmpd) PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show 'OCFS2_VALID_CLUSTER') mount -o remount,ro /dev/vdb /mnt PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show 'OCFS2_VALID_CLUSTER') mount -o remount,rw /dev/vdb /mnt (<== wait for ~22s by mmp start) PAGER=cat debugfs.ocfs2 -R "slotmap" /dev/vdb (<== show 'OCFS2_VALID_CLUSTER') ``` Heming Zhao (4): ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown ocfs2: add mlog ML_WARNING support re-enable "ocfs2: mount shared volume without ha stack" ocfs2: introduce ext4 MMP feature fs/ocfs2/cluster/masklog.c | 3 + fs/ocfs2/cluster/masklog.h | 9 +- fs/ocfs2/dlmglue.c | 3 + fs/ocfs2/ocfs2.h | 6 +- fs/ocfs2/ocfs2_fs.h | 13 +- fs/ocfs2/slot_map.c | 479 +++++++++++++++++++++++++++++++++++-- fs/ocfs2/slot_map.h | 3 + fs/ocfs2/super.c | 42 +++- 8 files changed, 527 insertions(+), 31 deletions(-) -- 2.37.1
Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
On local mount mode, there is no dlm resource initalized. If ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling flow will call ocfs2_dlm_shutdown(), then does dlm resource cleanup job, which will trigger kernel crash. Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before return error") Signed-off-by: Heming Zhao <heming.zhao at suse.com> --- fs/ocfs2/dlmglue.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 801e60bab955..1438ac14940b 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb) void ocfs2_dlm_shutdown(struct ocfs2_super *osb, int hangup_pending) { + if (ocfs2_mount_local(osb)) + return; + ocfs2_drop_osb_locks(osb); /* -- 2.37.1
Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 2/4] ocfs2: add mlog ML_WARNING support
This commit gives new message type for ocfs2. Signed-off-by: Heming Zhao <heming.zhao at suse.com> --- fs/ocfs2/cluster/masklog.c | 3 +++ fs/ocfs2/cluster/masklog.h | 9 +++++---- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/fs/ocfs2/cluster/masklog.c b/fs/ocfs2/cluster/masklog.c index 563881ddbf00..bac3488e8002 100644 --- a/fs/ocfs2/cluster/masklog.c +++ b/fs/ocfs2/cluster/masklog.c @@ -63,6 +63,9 @@ void __mlog_printk(const u64 *mask, const char *func, int line, if (*mask & ML_ERROR) { level = KERN_ERR; prefix = "ERROR: "; + } else if (*mask & ML_WARNING) { + level = KERN_WARNING; + prefix = "WARNING: "; } else if (*mask & ML_NOTICE) { level = KERN_NOTICE; } else { diff --git a/fs/ocfs2/cluster/masklog.h b/fs/ocfs2/cluster/masklog.h index b73fc42e46ff..d0bc4fe8cf3d 100644 --- a/fs/ocfs2/cluster/masklog.h +++ b/fs/ocfs2/cluster/masklog.h @@ -86,10 +86,11 @@ /* bits that are infrequently given and frequently matched in the high word */ #define ML_ERROR 0x1000000000000000ULL /* sent to KERN_ERR */ -#define ML_NOTICE 0x2000000000000000ULL /* setn to KERN_NOTICE */ -#define ML_KTHREAD 0x4000000000000000ULL /* kernel thread activity */ +#define ML_NOTICE 0x2000000000000000ULL /* sent to KERN_NOTICE */ +#define ML_WARNING 0x4000000000000000ULL /* sent to KERN_WARNING */ +#define ML_KTHREAD 0x8000000000000000ULL /* kernel thread activity */ -#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_NOTICE) +#define MLOG_INITIAL_AND_MASK (ML_ERROR|ML_WARNING|ML_NOTICE) #ifndef MLOG_MASK_PREFIX #define MLOG_MASK_PREFIX 0 #endif @@ -102,7 +103,7 @@ #if defined(CONFIG_OCFS2_DEBUG_MASKLOG) #define ML_ALLOWED_BITS ~0 #else -#define ML_ALLOWED_BITS (ML_ERROR|ML_NOTICE) +#define ML_ALLOWED_BITS (ML_ERROR|ML_WARNING|ML_NOTICE) #endif #define MLOG_MAX_BITS 64 -- 2.37.1
Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 3/4] re-enable "ocfs2: mount shared volume without ha stack"
the key different between local mount and non-clustered mount: local mount feature (tunefs.ocfs2 --fs-features=[no]local) can't do convert job without ha stack. non-clustered mount feature can run totally without ha stack. commit 912f655d78c5 ("ocfs2: mount shared volume without ha stack") had bug, then commit c80af0c250c8f8a3c978aa5aafbe9c39b336b813 reverted it. Let's give some explain for the issue mentioned by commit c80af0c250c8. Under Junxiao's call trace, in __ocfs2_find_empty_slot(), the 'if' accessment is wrong. sl_node_num could be 0 at o2cb env. with current information, the trigger flow (base on 912f655d78c5): 1> nodeA with 'node_num = 0' for mounting. it will succeed. at this time, slotmap extent block will contains es_valid:1 & es_node_num:0 for nodeA then ocfs2_update_disk_slot() will write back slotmap info to disk. 2> then, nodeB with 'node_num = 1' for mounting this time, osb->node_num is 1 (set by config file), osb->preferred is OCFS2_INVALID_SLOT (set by ocfs2_parse_options). ocfs2_find_slot + ocfs2_update_slot_info //read slotmap info from disk | + set si->si_slots[0].es_valid = 1 & si->si_slots[0].sl_node_num = 0 | + __ocfs2_node_num_to_slot //will return -ENOENT. + __ocfs2_find_empty_slot + if ((preferred >= 0) && (preferred < si->si_num_slots)) | fails enter this 'if' for preferred value is OCFS2_INVALID_SLOT | + 'for(i = 0; i < si->si_num_slots; i++)' search slot 0 successfully. | 'si->si_slots[0].sl_node_num' is false. trigger 'break' condition. | + return slot 0. it will cause nodeB grab nodeA journal dlm lock, then trigger hung. How to do for this bug? This commit re-enabled 912f655d78c5, next commit (add MMP support) will fix the issue. Signed-off-by: Heming Zhao <heming.zhao at suse.com> --- fs/ocfs2/ocfs2.h | 4 +++- fs/ocfs2/slot_map.c | 46 ++++++++++++++++++++++++++------------------- fs/ocfs2/super.c | 21 +++++++++++++++++++++ 3 files changed, 51 insertions(+), 20 deletions(-) diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 740b64238312..337527571461 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -277,6 +277,7 @@ enum ocfs2_mount_options OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT = 1 << 15, /* Journal Async Commit */ OCFS2_MOUNT_ERRORS_CONT = 1 << 16, /* Return EIO to the calling process on error */ OCFS2_MOUNT_ERRORS_ROFS = 1 << 17, /* Change filesystem to read-only on error */ + OCFS2_MOUNT_NOCLUSTER = 1 << 18, /* No cluster aware filesystem mount */ }; #define OCFS2_OSB_SOFT_RO 0x0001 @@ -672,7 +673,8 @@ static inline int ocfs2_cluster_o2cb_global_heartbeat(struct ocfs2_super *osb) static inline int ocfs2_mount_local(struct ocfs2_super *osb) { - return (osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT); + return ((osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT) + || (osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER)); } static inline int ocfs2_uses_extended_slot_map(struct ocfs2_super *osb) diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c index da7718cef735..0b0ae3ebb0cf 100644 --- a/fs/ocfs2/slot_map.c +++ b/fs/ocfs2/slot_map.c @@ -252,14 +252,16 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si, int i, ret = -ENOSPC; if ((preferred >= 0) && (preferred < si->si_num_slots)) { - if (!si->si_slots[preferred].sl_valid) { + if (!si->si_slots[preferred].sl_valid || + !si->si_slots[preferred].sl_node_num) { ret = preferred; goto out; } } for(i = 0; i < si->si_num_slots; i++) { - if (!si->si_slots[i].sl_valid) { + if (!si->si_slots[i].sl_valid || + !si->si_slots[i].sl_node_num) { ret = i; break; } @@ -454,24 +456,30 @@ int ocfs2_find_slot(struct ocfs2_super *osb) spin_lock(&osb->osb_lock); ocfs2_update_slot_info(si); - /* search for ourselves first and take the slot if it already - * exists. Perhaps we need to mark this in a variable for our - * own journal recovery? Possibly not, though we certainly - * need to warn to the user */ - slot = __ocfs2_node_num_to_slot(si, osb->node_num); - if (slot < 0) { - /* if no slot yet, then just take 1st available - * one. */ - slot = __ocfs2_find_empty_slot(si, osb->preferred_slot); + if (ocfs2_mount_local(osb)) + /* use slot 0 directly in local mode */ + slot = 0; + else { + /* search for ourselves first and take the slot if it already + * exists. Perhaps we need to mark this in a variable for our + * own journal recovery? Possibly not, though we certainly + * need to warn to the user */ + slot = __ocfs2_node_num_to_slot(si, osb->node_num); if (slot < 0) { - spin_unlock(&osb->osb_lock); - mlog(ML_ERROR, "no free slots available!\n"); - status = -EINVAL; - goto bail; - } - } else - printk(KERN_INFO "ocfs2: Slot %d on device (%s) was already " - "allocated to this node!\n", slot, osb->dev_str); + /* if no slot yet, then just take 1st available + * one. */ + slot = __ocfs2_find_empty_slot(si, osb->preferred_slot); + if (slot < 0) { + spin_unlock(&osb->osb_lock); + mlog(ML_ERROR, "no free slots available!\n"); + status = -EINVAL; + goto bail; + } + } else + printk(KERN_INFO "ocfs2: Slot %d on device (%s) was " + "already allocated to this node!\n", + slot, osb->dev_str); + } ocfs2_set_slot(si, slot, osb->node_num); osb->slot_num = slot; diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 438be028935d..f7298816d8d9 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -172,6 +172,7 @@ enum { Opt_dir_resv_level, Opt_journal_async_commit, Opt_err_cont, + Opt_nocluster, Opt_err, }; @@ -205,6 +206,7 @@ static const match_table_t tokens = { {Opt_dir_resv_level, "dir_resv_level=%u"}, {Opt_journal_async_commit, "journal_async_commit"}, {Opt_err_cont, "errors=continue"}, + {Opt_nocluster, "nocluster"}, {Opt_err, NULL} }; @@ -616,6 +618,13 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data) goto out; } + tmp = OCFS2_MOUNT_NOCLUSTER; + if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) { + ret = -EINVAL; + mlog(ML_ERROR, "Cannot change nocluster option on remount\n"); + goto out; + } + tmp = OCFS2_MOUNT_HB_LOCAL | OCFS2_MOUNT_HB_GLOBAL | OCFS2_MOUNT_HB_NONE; if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) { @@ -856,6 +865,7 @@ static int ocfs2_verify_userspace_stack(struct ocfs2_super *osb, } if (ocfs2_userspace_stack(osb) && + !(osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) && strncmp(osb->osb_cluster_stack, mopt->cluster_stack, OCFS2_STACK_LABEL_LEN)) { mlog(ML_ERROR, @@ -1127,6 +1137,11 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent) osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK ? "writeback" : "ordered"); + if ((osb->s_mount_opt & OCFS2_MOUNT_NOCLUSTER) && + !(osb->s_feature_incompat & OCFS2_FEATURE_INCOMPAT_LOCAL_MOUNT)) + printk(KERN_NOTICE "ocfs2: The shared device (%s) is mounted " + "without cluster aware mode.\n", osb->dev_str); + atomic_set(&osb->vol_state, VOLUME_MOUNTED); wake_up(&osb->osb_mount_event); @@ -1437,6 +1452,9 @@ static int ocfs2_parse_options(struct super_block *sb, case Opt_journal_async_commit: mopt->mount_opt |= OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT; break; + case Opt_nocluster: + mopt->mount_opt |= OCFS2_MOUNT_NOCLUSTER; + break; default: mlog(ML_ERROR, "Unrecognized mount option \"%s\" " @@ -1548,6 +1566,9 @@ static int ocfs2_show_options(struct seq_file *s, struct dentry *root) if (opts & OCFS2_MOUNT_JOURNAL_ASYNC_COMMIT) seq_printf(s, ",journal_async_commit"); + if (opts & OCFS2_MOUNT_NOCLUSTER) + seq_printf(s, ",nocluster"); + return 0; } -- 2.37.1
Heming Zhao
2022-Jul-30 01:14 UTC
[Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
MMP (multiple mount protection) gives filesystem ability to prevent from being mounted multiple times. For avoiding data corruption when non-clustered and/or clustered mount are happening at same time, this commit introduced MMP feature. MMP idea is from ext4 MMP (fs/ext4/mmp.c) code. For ocfs2 is a clustered fs and also for compatible with existing slotmap feature, I did some optimization and modification when porting from ext4 to ocfs2. For optimization: mmp has a kthread kmmpd-<dev>, which is only created in non-clustered mode. We set a rule: If last mount didn't do unmount, (eg: crash), the next mount MUST be same mount type. At last, this commit also fix commit c80af0c250c8 ("Revert "ocfs2: mount shared volume without ha stack") mentioned issue. Signed-off-by: Heming Zhao <heming.zhao at suse.com> --- fs/ocfs2/ocfs2.h | 2 + fs/ocfs2/ocfs2_fs.h | 13 +- fs/ocfs2/slot_map.c | 459 ++++++++++++++++++++++++++++++++++++++++++-- fs/ocfs2/slot_map.h | 3 + fs/ocfs2/super.c | 23 ++- 5 files changed, 479 insertions(+), 21 deletions(-) diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 337527571461..37a7c5855d07 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -337,6 +337,8 @@ struct ocfs2_super unsigned int node_num; int slot_num; int preferred_slot; + u16 mmp_update_interval; + struct task_struct *mmp_task; int s_sectsize_bits; int s_clustersize; int s_clustersize_bits; diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h index 638d875eccc7..015672f75563 100644 --- a/fs/ocfs2/ocfs2_fs.h +++ b/fs/ocfs2/ocfs2_fs.h @@ -87,7 +87,8 @@ | OCFS2_FEATURE_INCOMPAT_REFCOUNT_TREE \ | OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG \ | OCFS2_FEATURE_INCOMPAT_CLUSTERINFO \ - | OCFS2_FEATURE_INCOMPAT_APPEND_DIO) + | OCFS2_FEATURE_INCOMPAT_APPEND_DIO \ + | OCFS2_FEATURE_INCOMPAT_MMP) #define OCFS2_FEATURE_RO_COMPAT_SUPP (OCFS2_FEATURE_RO_COMPAT_UNWRITTEN \ | OCFS2_FEATURE_RO_COMPAT_USRQUOTA \ | OCFS2_FEATURE_RO_COMPAT_GRPQUOTA) @@ -167,6 +168,11 @@ */ #define OCFS2_FEATURE_INCOMPAT_APPEND_DIO 0x8000 +/* + * Multiple mount protection + */ +#define OCFS2_FEATURE_INCOMPAT_MMP 0x10000 + /* * backup superblock flag is used to indicate that this volume * has backup superblocks. @@ -535,8 +541,7 @@ struct ocfs2_slot_map { }; struct ocfs2_extended_slot { -/*00*/ __u8 es_valid; - __u8 es_reserved1[3]; +/*00*/ __le32 es_valid; __le32 es_node_num; /*08*/ }; @@ -611,7 +616,7 @@ struct ocfs2_super_block { INCOMPAT flag set. */ /*B8*/ __le16 s_xattr_inline_size; /* extended attribute inline size for this fs*/ - __le16 s_reserved0; + __le16 s_mmp_update_interval; /* # seconds to wait in MMP checking */ __le32 s_dx_seed[3]; /* seed[0-2] for dx dir hash. * s_uuid_hash serves as seed[3]. */ /*C0*/ __le64 s_reserved2[15]; /* Fill out superblock */ diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c index 0b0ae3ebb0cf..86a21140ead6 100644 --- a/fs/ocfs2/slot_map.c +++ b/fs/ocfs2/slot_map.c @@ -8,6 +8,8 @@ #include <linux/types.h> #include <linux/slab.h> #include <linux/highmem.h> +#include <linux/random.h> +#include <linux/kthread.h> #include <cluster/masklog.h> @@ -24,9 +26,48 @@ #include "buffer_head_io.h" +/* + * This structure will be used for multiple mount protection. It will be + * written into the '//slot_map' field in the system dir. + * Programs that check MMP should assume that if SEQ_FSCK (or any unknown + * code above SEQ_MAX) is present then it is NOT safe to use the filesystem. + */ +#define OCFS2_MMP_SEQ_CLEAN 0xFF4D4D50U /* mmp_seq value for clean unmount */ +#define OCFS2_MMP_SEQ_FSCK 0xE24D4D50U /* mmp_seq value when being fscked */ +#define OCFS2_MMP_SEQ_MAX 0xE24D4D4FU /* maximum valid mmp_seq value */ +#define OCFS2_MMP_SEQ_INIT 0x0 /* mmp_seq init value */ +#define OCFS2_VALID_CLUSTER 0xE24D4D55U /* value for clustered mount + under MMP disabled */ +#define OCFS2_VALID_NOCLUSTER 0xE24D4D5AU /* value for noclustered mount + under MMP disabled */ + +#define OCFS2_SLOT_INFO_OLD_VALID 1 /* use for old slot info */ + +/* + * Check interval multiplier + * The MMP block is written every update interval and initially checked every + * update interval x the multiplier (the value is then adapted based on the + * write latency). The reason is that writes can be delayed under load and we + * don't want readers to incorrectly assume that the filesystem is no longer + * in use. + */ +#define OCFS2_MMP_CHECK_MULT 2UL + +/* + * Minimum interval for MMP checking in seconds. + */ +#define OCFS2_MMP_MIN_CHECK_INTERVAL 5UL + +/* + * Maximum interval for MMP checking in seconds. + */ +#define OCFS2_MMP_MAX_CHECK_INTERVAL 300UL struct ocfs2_slot { - int sl_valid; + union { + unsigned int sl_valid; + unsigned int mmp_seq; + }; unsigned int sl_node_num; }; @@ -52,11 +93,11 @@ static void ocfs2_invalidate_slot(struct ocfs2_slot_info *si, } static void ocfs2_set_slot(struct ocfs2_slot_info *si, - int slot_num, unsigned int node_num) + int slot_num, unsigned int node_num, unsigned int valid) { BUG_ON((slot_num < 0) || (slot_num >= si->si_num_slots)); - si->si_slots[slot_num].sl_valid = 1; + si->si_slots[slot_num].sl_valid = valid; si->si_slots[slot_num].sl_node_num = node_num; } @@ -75,7 +116,8 @@ static void ocfs2_update_slot_info_extended(struct ocfs2_slot_info *si) i++, slotno++) { if (se->se_slots[i].es_valid) ocfs2_set_slot(si, slotno, - le32_to_cpu(se->se_slots[i].es_node_num)); + le32_to_cpu(se->se_slots[i].es_node_num), + le32_to_cpu(se->se_slots[i].es_valid)); else ocfs2_invalidate_slot(si, slotno); } @@ -97,7 +139,8 @@ static void ocfs2_update_slot_info_old(struct ocfs2_slot_info *si) if (le16_to_cpu(sm->sm_slots[i]) == (u16)OCFS2_INVALID_SLOT) ocfs2_invalidate_slot(si, i); else - ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i])); + ocfs2_set_slot(si, i, le16_to_cpu(sm->sm_slots[i]), + OCFS2_SLOT_INFO_OLD_VALID); } } @@ -252,16 +295,14 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si, int i, ret = -ENOSPC; if ((preferred >= 0) && (preferred < si->si_num_slots)) { - if (!si->si_slots[preferred].sl_valid || - !si->si_slots[preferred].sl_node_num) { + if (!si->si_slots[preferred].sl_valid) { ret = preferred; goto out; } } for(i = 0; i < si->si_num_slots; i++) { - if (!si->si_slots[i].sl_valid || - !si->si_slots[i].sl_node_num) { + if (!si->si_slots[i].sl_valid) { ret = i; break; } @@ -270,6 +311,43 @@ static int __ocfs2_find_empty_slot(struct ocfs2_slot_info *si, return ret; } +/* Return first used slot. + * -ENOENT means all slots are clean, ->sl_valid should be + * OCFS2_MMP_SEQ_CLEAN or ZERO */ +static int __ocfs2_find_used_slot(struct ocfs2_slot_info *si) +{ + int i, ret = -ENOENT, valid; + + for (i = 0; i < si->si_num_slots; i++) { + valid = si->si_slots[i].sl_valid; + if (valid == 0 || valid == OCFS2_MMP_SEQ_CLEAN) + continue; + if (valid <= OCFS2_MMP_SEQ_MAX || + valid == OCFS2_MMP_SEQ_FSCK || + valid == OCFS2_VALID_CLUSTER || + valid == OCFS2_VALID_NOCLUSTER) { + ret = i; + break; + } + } + + return ret; +} + +static int __ocfs2_find_expected_slot(struct ocfs2_slot_info *si, + unsigned int expected) +{ + int i; + + for (i = 0; i < si->si_num_slots; i++) { + if (si->si_slots[i].sl_valid == expected) { + return 1; + } + } + + return 0; +} + int ocfs2_node_num_to_slot(struct ocfs2_super *osb, unsigned int node_num) { int slot; @@ -445,21 +523,357 @@ void ocfs2_free_slot_info(struct ocfs2_super *osb) __ocfs2_free_slot_info(si); } +/* + * Get a random new sequence number but make sure it is not greater than + * EXT4_MMP_SEQ_MAX. + */ +static unsigned int mmp_new_seq(void) +{ + u32 new_seq; + + do { + new_seq = prandom_u32(); + } while (new_seq > OCFS2_MMP_SEQ_MAX); + + if (new_seq == 0) + return 1; + else + return new_seq; +} + +/* + * kmmpd will update the MMP sequence every mmp_update_interval seconds + */ +static int kmmpd(void *data) +{ + struct ocfs2_super *osb = data; + struct super_block *sb = osb->sb; + struct ocfs2_slot_info *si = osb->slot_info; + int slot = osb->slot_num; + u32 seq, mmp_seq; + unsigned long failed_writes = 0; + u16 mmp_update_interval = osb->mmp_update_interval; + unsigned int mmp_check_interval; + unsigned long last_update_time; + unsigned long diff; + int retval = 0; + + if (!ocfs2_mount_local(osb)) { + mlog(ML_ERROR, "kmmpd thread only works for local mount mode.\n"); + goto wait_to_exit; + } + + retval = ocfs2_refresh_slot_info(osb); + seq = si->si_slots[slot].mmp_seq; + + /* + * Start with the higher mmp_check_interval and reduce it if + * the MMP block is being updated on time. + */ + mmp_check_interval = max(OCFS2_MMP_CHECK_MULT * mmp_update_interval, + OCFS2_MMP_MIN_CHECK_INTERVAL); + + while (!kthread_should_stop() && !sb_rdonly(sb)) { + if (!OCFS2_HAS_INCOMPAT_FEATURE(sb, OCFS2_FEATURE_INCOMPAT_MMP)) { + mlog(ML_WARNING, "kmmpd being stopped since MMP feature" + " has been disabled."); + goto wait_to_exit; + } + if (++seq > OCFS2_MMP_SEQ_MAX) + seq = 1; + + spin_lock(&osb->osb_lock); + si->si_slots[slot].mmp_seq = mmp_seq = seq; + spin_unlock(&osb->osb_lock); + + last_update_time = jiffies; + retval = ocfs2_update_disk_slot(osb, si, slot); + + /* + * Don't spew too many error messages. Print one every + * (s_mmp_update_interval * 60) seconds. + */ + if (retval) { + if ((failed_writes % 60) == 0) { + ocfs2_error(sb, "Error writing to MMP block"); + } + failed_writes++; + } + + diff = jiffies - last_update_time; + if (diff < mmp_update_interval * HZ) + schedule_timeout_interruptible(mmp_update_interval * + HZ - diff); + + /* + * We need to make sure that more than mmp_check_interval + * seconds have not passed since writing. If that has happened + * we need to check if the MMP block is as we left it. + */ + diff = jiffies - last_update_time; + if (diff > mmp_check_interval * HZ) { + retval = ocfs2_refresh_slot_info(osb); + if (retval) { + ocfs2_error(sb, "error reading MMP data: %d", retval); + goto wait_to_exit; + } + + if (si->si_slots[slot].mmp_seq != mmp_seq) { + ocfs2_error(sb, "Error while updating MMP info. " + "The filesystem seems to have been" + " multiply mounted."); + retval = -EBUSY; + goto wait_to_exit; + } + } + + /* + * Adjust the mmp_check_interval depending on how much time + * it took for the MMP block to be written. + */ + mmp_check_interval = max(min(OCFS2_MMP_CHECK_MULT * diff / HZ, + OCFS2_MMP_MAX_CHECK_INTERVAL), + OCFS2_MMP_MIN_CHECK_INTERVAL); + } + + /* + * Unmount seems to be clean. + */ + spin_lock(&osb->osb_lock); + si->si_slots[slot].mmp_seq = OCFS2_MMP_SEQ_CLEAN; + spin_unlock(&osb->osb_lock); + + retval = ocfs2_update_disk_slot(osb, si, 0); + +wait_to_exit: + while (!kthread_should_stop()) { + set_current_state(TASK_INTERRUPTIBLE); + if (!kthread_should_stop()) + schedule(); + } + set_current_state(TASK_RUNNING); + return retval; +} + +void ocfs2_stop_mmpd(struct ocfs2_super *osb) +{ + if (osb->mmp_task) { + kthread_stop(osb->mmp_task); + osb->mmp_task = NULL; + } +} + +/* + * Protect the filesystem from being mounted more than once. + * + * This function was inspired by ext4 MMP feature. Because HA stack + * helps ocfs2 to manage nodes join/leave, so we only focus on MMP + * under nocluster mode. + * Another info is ocfs2 only uses slot 0 on nocuster mode. + * + * es_valid: + * 0: not available + * 1: valid, cluster mode + * 2: valid, nocluster mode + * + * parameters: + * osb: the struct ocfs2_super + * noclustered: under noclustered mount + * slot: prefer slot number + */ +int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered) +{ + struct buffer_head *bh = NULL; + u32 seq; + struct ocfs2_slot_info *si = osb->slot_info; + unsigned int mmp_check_interval = osb->mmp_update_interval; + unsigned int wait_time = 0; + int retval = 0; + int slot = osb->slot_num; + + if (!ocfs2_uses_extended_slot_map(osb)) { + mlog(ML_WARNING, "MMP only works on extended slot map.\n"); + retval = -EINVAL; + goto bail; + } + + retval = ocfs2_refresh_slot_info(osb); + if (retval) + goto bail; + + if (mmp_check_interval < OCFS2_MMP_MIN_CHECK_INTERVAL) + mmp_check_interval = OCFS2_MMP_MIN_CHECK_INTERVAL; + + spin_lock(&osb->osb_lock); + seq = si->si_slots[slot].mmp_seq; + + if (__ocfs2_find_used_slot(si) == -ENOENT) + goto skip; + + /* TODO ocfs2-tools need to support this flag */ + if (__ocfs2_find_expected_slot(si, OCFS2_MMP_SEQ_FSCK)) { + mlog(ML_NOTICE, "fsck is running on the filesystem"); + spin_unlock(&osb->osb_lock); + retval = -EBUSY; + goto bail; + } + spin_unlock(&osb->osb_lock); + + wait_time = min(mmp_check_interval * 2 + 1, mmp_check_interval + 60); + + /* Print MMP interval if more than 20 secs. */ + if (wait_time > OCFS2_MMP_MIN_CHECK_INTERVAL * 4) + mlog(ML_WARNING, "MMP interval %u higher than expected, please" + " wait.\n", wait_time * 2); + + if (schedule_timeout_interruptible(HZ * wait_time) != 0) { + mlog(ML_WARNING, "MMP startup interrupted, failing mount.\n"); + retval = -EPERM; + goto bail; + } + + retval = ocfs2_refresh_slot_info(osb); + if (retval) + goto bail; + if (seq != si->si_slots[slot].mmp_seq) { + mlog(ML_ERROR, "Device is already active on another node.\n"); + retval = -EPERM; + goto bail; + } + + spin_lock(&osb->osb_lock); +skip: + /* + * write a new random sequence number. + */ + seq = mmp_new_seq(); + mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq); + ocfs2_set_slot(si, slot, osb->node_num, seq); + spin_unlock(&osb->osb_lock); + + ocfs2_update_disk_slot_extended(si, slot, &bh); + mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq); + retval = ocfs2_write_block(osb, bh, INODE_CACHE(si->si_inode)); + if (retval < 0) { + mlog_errno(retval); + goto bail; + } + mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x wait_time: %u\n", seq, si->si_slots[slot].mmp_seq, wait_time); + + /* + * wait for MMP interval and check mmp_seq. + */ + if (schedule_timeout_interruptible(HZ * wait_time) != 0) { + mlog(ML_WARNING, "MMP startup interrupted, failing mount.\n"); + retval = -EPERM; + goto bail; + } + + retval = ocfs2_refresh_slot_info(osb); + if (retval) + goto bail; + + mlog(ML_ERROR, "seq: 0x%x mmp_seq: 0x%x\n", seq, si->si_slots[slot].mmp_seq); + if (seq != si->si_slots[slot].mmp_seq) { + mlog(ML_ERROR, "Update seq failed, device is already active on another node.\n"); + retval = -EPERM; + goto bail; + } + + /* + * There are two reasons we don't create kmmpd on clustered mount: + * - ocfs2 needs to grab osb->osb_lock to modify/access osb->si. + * - For huge number nodes cluster, nodes update same sector + * of '//slot_map' will cause IO performance issue. + * + * Then there has another question: + * On clustered mount, MMP seq won't update, and MMP how to + * handle a noclustered mount when there already exist + * clustered mount. + * The answer is the rule mentioned in ocfs2_find_slot(). + */ + if (!noclustered) { + spin_lock(&osb->osb_lock); + ocfs2_set_slot(si, slot, osb->node_num, OCFS2_VALID_CLUSTER); + spin_unlock(&osb->osb_lock); + + ocfs2_update_disk_slot_extended(si, slot, &bh); + retval = ocfs2_write_block(osb, bh, INODE_CACHE(si->si_inode)); + goto bail; + } + + /* + * Start a kernel thread to update the MMP block periodically. + */ + osb->mmp_task = kthread_run(kmmpd, osb, "kmmpd-%s", osb->sb->s_id); + if (IS_ERR(osb->mmp_task)) { + osb->mmp_task = NULL; + mlog(ML_WARNING, "Unable to create kmmpd thread for %s.", + osb->sb->s_id); + retval = -EPERM; + goto bail; + } + +bail: + return retval; +} + +static void show_conflict_mnt_msg(int clustered) +{ + const char *exist = clustered ? "non-clustered" : "clustered"; + + mlog(ML_ERROR, "Found %s mount info!", exist); + mlog(ML_ERROR, "Please clean %s slotmap info for mounting.\n", exist); + mlog(ML_ERROR, "eg. remount then unmount with %s mode\n", exist); +} + +/* + * Even under readonly mode, we write slot info on disk. + * The logic is correct: if not change slot info on readonly + * mode, in cluster env, later mount from another node + * may reuse the same slot, deadlock happen! + */ int ocfs2_find_slot(struct ocfs2_super *osb) { - int status; + int status = -EPERM; int slot; + int noclustered = 0; struct ocfs2_slot_info *si; si = osb->slot_info; spin_lock(&osb->osb_lock); ocfs2_update_slot_info(si); + slot = __ocfs2_find_used_slot(si); + if (slot == 0 && + ((si->si_slots[0].sl_valid == OCFS2_VALID_NOCLUSTER) || + (si->si_slots[0].sl_valid < OCFS2_MMP_SEQ_MAX))) + noclustered = 1; - if (ocfs2_mount_local(osb)) - /* use slot 0 directly in local mode */ - slot = 0; - else { + /* + * We set a rule: + * If last mount didn't do unmount, (eg: crash), the next mount + * MUST be same mount type. + */ + if (ocfs2_mount_local(osb)) { + /* empty slotmap, or device didn't unmount from last time */ + if ((slot == -ENOENT) || noclustered) { + /* use slot 0 directly in local mode */ + slot = 0; + noclustered = 1; + } else { + spin_unlock(&osb->osb_lock); + show_conflict_mnt_msg(0); + status = -EINVAL; + goto bail; + } + } else { + if (noclustered) { + spin_unlock(&osb->osb_lock); + show_conflict_mnt_msg(1); + status = -EINVAL; + goto bail; + } /* search for ourselves first and take the slot if it already * exists. Perhaps we need to mark this in a variable for our * own journal recovery? Possibly not, though we certainly @@ -481,7 +895,21 @@ int ocfs2_find_slot(struct ocfs2_super *osb) slot, osb->dev_str); } - ocfs2_set_slot(si, slot, osb->node_num); + if (OCFS2_HAS_INCOMPAT_FEATURE(osb->sb, OCFS2_FEATURE_INCOMPAT_MMP)) { + osb->slot_num = slot; + spin_unlock(&osb->osb_lock); + status = ocfs2_multi_mount_protect(osb, noclustered); + if (status < 0) { + mlog(ML_ERROR, "MMP failed to start.\n"); + goto mmp_fail; + } + + trace_ocfs2_find_slot(osb->slot_num); + return status; + } + + ocfs2_set_slot(si, slot, osb->node_num, noclustered ? + OCFS2_VALID_NOCLUSTER : OCFS2_VALID_CLUSTER); osb->slot_num = slot; spin_unlock(&osb->osb_lock); @@ -490,6 +918,7 @@ int ocfs2_find_slot(struct ocfs2_super *osb) status = ocfs2_update_disk_slot(osb, si, osb->slot_num); if (status < 0) { mlog_errno(status); +mmp_fail: /* * if write block failed, invalidate slot to avoid overwrite * slot during dismount in case another node rightly has mounted diff --git a/fs/ocfs2/slot_map.h b/fs/ocfs2/slot_map.h index a43644570b53..d4d147b0c190 100644 --- a/fs/ocfs2/slot_map.h +++ b/fs/ocfs2/slot_map.h @@ -25,4 +25,7 @@ int ocfs2_slot_to_node_num_locked(struct ocfs2_super *osb, int slot_num, int ocfs2_clear_slot(struct ocfs2_super *osb, int slot_num); +int ocfs2_multi_mount_protect(struct ocfs2_super *osb, int noclustered); +void ocfs2_stop_mmpd(struct ocfs2_super *osb); + #endif diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index f7298816d8d9..b0e76b06efc3 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -609,6 +609,7 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data) struct mount_options parsed_options; struct ocfs2_super *osb = OCFS2_SB(sb); u32 tmp; + int noclustered; sync_filesystem(sb); @@ -619,7 +620,8 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data) } tmp = OCFS2_MOUNT_NOCLUSTER; - if ((osb->s_mount_opt & tmp) != (parsed_options.mount_opt & tmp)) { + noclustered = osb->s_mount_opt & tmp; + if (noclustered != (parsed_options.mount_opt & tmp)) { ret = -EINVAL; mlog(ML_ERROR, "Cannot change nocluster option on remount\n"); goto out; @@ -686,10 +688,20 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data) } sb->s_flags &= ~SB_RDONLY; osb->osb_flags &= ~OCFS2_OSB_SOFT_RO; + if (OCFS2_HAS_INCOMPAT_FEATURE(sb, OCFS2_FEATURE_INCOMPAT_MMP)) { + spin_unlock(&osb->osb_lock); + if (ocfs2_multi_mount_protect(osb, noclustered)) { + mlog(ML_ERROR, "started MMP failed.\n"); + ocfs2_stop_mmpd(osb); + ret = -EROFS; + goto unlocked_osb; + } + } } trace_ocfs2_remount(sb->s_flags, osb->osb_flags, *flags); unlock_osb: spin_unlock(&osb->osb_lock); +unlocked_osb: /* Enable quota accounting after remounting RW */ if (!ret && !(*flags & SB_RDONLY)) { if (sb_any_quota_suspended(sb)) @@ -722,6 +734,8 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data) sb->s_flags = (sb->s_flags & ~SB_POSIXACL) | ((osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) ? SB_POSIXACL : 0); + if (sb_rdonly(osb->sb)) + ocfs2_stop_mmpd(osb); } out: return ret; @@ -1833,7 +1847,7 @@ static int ocfs2_mount_volume(struct super_block *sb) status = ocfs2_init_local_system_inodes(osb); if (status < 0) { mlog_errno(status); - goto out_super_lock; + goto out_find_slot; } status = ocfs2_check_volume(osb); @@ -1858,6 +1872,8 @@ static int ocfs2_mount_volume(struct super_block *sb) /* before journal shutdown, we should release slot_info */ ocfs2_free_slot_info(osb); ocfs2_journal_shutdown(osb); +out_find_slot: + ocfs2_stop_mmpd(osb); out_super_lock: ocfs2_super_unlock(osb, 1); out_dlm: @@ -1878,6 +1894,8 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err) osb = OCFS2_SB(sb); BUG_ON(!osb); + ocfs2_stop_mmpd(osb); + /* Remove file check sysfs related directores/files, * and wait for the pending file check operations */ ocfs2_filecheck_remove_sysfs(osb); @@ -2086,6 +2104,7 @@ static int ocfs2_initialize_super(struct super_block *sb, snprintf(osb->dev_str, sizeof(osb->dev_str), "%u,%u", MAJOR(osb->sb->s_dev), MINOR(osb->sb->s_dev)); + osb->mmp_update_interval = le16_to_cpu(di->id2.i_super.s_mmp_update_interval); osb->max_slots = le16_to_cpu(di->id2.i_super.s_max_slots); if (osb->max_slots > OCFS2_MAX_SLOTS || osb->max_slots == 0) { mlog(ML_ERROR, "Invalid number of node slots (%u)\n", -- 2.37.1
Joseph Qi
2022-Aug-08 06:51 UTC
[Ocfs2-devel] [PATCH 1/4] ocfs2: Fix freeing uninitialized resource on ocfs2_dlm_shutdown
On 7/30/22 9:14 AM, Heming Zhao wrote:> On local mount mode, there is no dlm resource initalized. If > ocfs2_mount_volume() fails in ocfs2_find_slot(), error handling > flow will call ocfs2_dlm_shutdown(), then does dlm resource > cleanup job, which will trigger kernel crash. > > Fixes: 0737e01de9c4 ("ocfs2: ocfs2_mount_volume does cleanup job before > return error")Should be put at the same line.> Signed-off-by: Heming Zhao <heming.zhao at suse.com> > --- > fs/ocfs2/dlmglue.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c > index 801e60bab955..1438ac14940b 100644 > --- a/fs/ocfs2/dlmglue.c > +++ b/fs/ocfs2/dlmglue.c > @@ -3385,6 +3385,9 @@ int ocfs2_dlm_init(struct ocfs2_super *osb) > void ocfs2_dlm_shutdown(struct ocfs2_super *osb, > int hangup_pending) > { > + if (ocfs2_mount_local(osb)) > + return; > +IMO, we have to do part of ocfs2_dlm_shutdown() jobs such as ocfs2_lock_res_free(), which will remove lockres from d_lockres_tracking added by ocfs2_xxx_lock_res_init(). Before commit 0737e01de9c4, it seems this issue also exists since osb->cconn is already set under local mount mode. Thanks, Joseph
Joseph Qi
2022-Aug-08 08:19 UTC
[Ocfs2-devel] [PATCH 4/4] ocfs2: introduce ext4 MMP feature
On 7/30/22 9:14 AM, Heming Zhao wrote:> MMP (multiple mount protection) gives filesystem ability to prevent > from being mounted multiple times. > > For avoiding data corruption when non-clustered and/or clustered mount > are happening at same time, this commit introduced MMP feature. MMP > idea is from ext4 MMP (fs/ext4/mmp.c) code. For ocfs2 is a clustered > fs and also for compatible with existing slotmap feature, I did some > optimization and modification when porting from ext4 to ocfs2. > > For optimization: > mmp has a kthread kmmpd-<dev>, which is only created in non-clustered > mode. > > We set a rule: > If last mount didn't do unmount, (eg: crash), the next mount MUST be > same mount type. > > At last, this commit also fix commit c80af0c250c8 ("Revert "ocfs2: > mount shared volume without ha stack") mentioned issue.I suggest we re-split this series (especially patch 3 and 4), but not revive a buggy commit first and then another commit fixing it BTW. Thanks, Joseph