thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [PATCH] Revert "ocfs2: mount shared volume without ha stack" [Jun 2022]

Junxiao Bi

2022-Jun-07 02:21 UTC

[Ocfs2-devel] [PATCH] Revert "ocfs2: mount shared volume without ha stack"

> On Jun 6, 2022, at 7:07, Joseph Qi <joseph.qi at linux.alibaba.com> wrote:
>
>> On 6/7/22 7:50 AM, heming.zhao at suse.com wrote:
>>> On 6/7/22 00:15, Junxiao Bi wrote:
>>>> On 6/5/22 7:08 PM, heming.zhao at suse.com wrote:
>>> 
>>>> Hello Junxiao,
>>>>
>>>> First of all, let's get on the same page about your patch.
>>>> There are two features: 'local mount' & 'nocluster mount'.
>>>> I mistakenly wrote local-mount in some places in previous mails.
>>>> This patch reverts commit 912f655d78c5d4, which is related to 'nocluster mount'.
>>>> 
>>>> 
>>>> On 6/5/22 00:19, Junxiao Bi wrote:
>>>>> 
>>>>>> On Jun 4, 2022, at 1:45, heming.zhao at suse.com wrote:
>>>>>>
>>>>>> Hello Junxiao,
>>>>>> 
>>>>>>> On 6/4/22 06:28, Junxiao Bi via Ocfs2-devel wrote:
>>>>>>> This reverts commit 912f655d78c5d4ad05eac287f23a435924df7144.
>>>>>>> This commit introduced a regression that can cause mount to hang.
>>>>>>> The changes in __ocfs2_find_empty_slot mean that any node with a
>>>>>>> non-zero node number can grab the slot that was already taken by
>>>>>>> node 0, so node 1 will access the same journal as node 0. When it
>>>>>>> tries to grab the journal cluster lock, it will hang because the lock
>>>>>>> was already acquired by node 0.
>>>>>>> It's very easy to reproduce this: in one cluster, mount node 0 first,
>>>>>>> then node 1, and you will see the following call trace from node 1.
>>>>>> From your description, it looks like your env mixed local-mount & clustered-mount.
>>>>> No, only cluster mount.
>>>>>> Could you share your test/reproduction steps?
>>>>>> And which HA stack do you use, pcmk or o2cb?
>>>>>>
>>>>>> I failed to reproduce it. My test steps (with the pcmk stack):
>>>>>> ```
>>>>>> node1:
>>>>>> mount -t ocfs2 /dev/vdd /mnt
>>>>>> 
>>>>>> node2:
>>>>>> for i in {1..100}; do
>>>>>> echo "mount <$i>"; mount -t ocfs2 /dev/vdd /mnt;
>>>>>> sleep 3;
>>>>>> echo "umount"; umount /mnt;
>>>>>> done
>>>>>> ```
>>>>>> 
>>>>> Try setting one node to node number 0 and mount it there first. I used the o2cb stack.
>>>> Could you show more test info/steps? I can't follow your meaning.
>>>> How do you set up a node with a fixed node number?
>>>> As I understand it, under a pcmk env the first mounted node will automatically get node
>>>> number 1 (or any value greater than 0), and there is no place to set the node number
>>>> by hand. It's very likely you mixed nocluster & cluster mounts.
>>>> If my suspicion is right (mixed mount), your use case is wrong.
>>> 
>>> Did you check my last mail? I already said I didn't do a mixed mount, only cluster mount.
>>
>> I carefully read every word of your mails. We are in different worlds (pcmk vs o2cb).
>> In a pcmk env, the slot number is always greater than 0. (I also maintain cluster-md at SUSE;
>> see slot_number in drivers/md/md-cluster.c, where the number is never zero.)
>> 
>>> 
>>> There is a configuration file for o2cb; you can just set the node number to 0. Please check
>>> https://docs.oracle.com/en/operating-systems/oracle-linux/7/fsadmin/ol7-ocfs2.html#ol7-config-file-ocfs2
>> 
>> Thank you for sharing. I will read & learn it.
>> 
>>> 
>>>> 
>>>>>> This local mount feature helps SUSE customers maintain ocfs2 partitions; it's useful.
>>>>>> I want to find out whether there is an ideal way to fix the hang issue.
>>>>>>
>>>>>>> [13148.735424] INFO: task mount.ocfs2:53045 blocked for more than 122 seconds.
>>>>>>> [13148.739691]       Not tainted 5.15.0-2148.0.4.el8uek.mountracev2.x86_64 #2
>>>>>>> [13148.742560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>>>> [13148.745846] task:mount.ocfs2     state:D stack:    0 pid:53045 ppid: 53044 flags:0x00004000
>>>>>>> [13148.749354] Call Trace:
>>>>>>> ...
>>>>>>> To fix it, we can just fix __ocfs2_find_empty_slot. But the original commit
>>>>>>> introduced the feature to mount ocfs2 locally even when it is cluster based,
>>>>>>> which is very dangerous. It can easily cause serious data corruption;
>>>>>>> there is no way to stop other nodes mounting the fs and corrupting it.
>>>>>> I can't follow your meaning. When users want to use the local mount feature, they MUST know
>>>>>> what they are doing, and how to use it.
>>>>> I can't agree with you. There is no mechanism to make sure customers will follow that; you can't expect customers to understand the tech well or even read the doc.
>>>> Yes, no one reads docs by default.
>>>> 
>>>> Currently, mounting with the 'nocluster' option shows a special warning to the user:
>>>> 
>>>> ```
>>>> # mount -t ocfs2 -o nocluster /dev/vdd /mnt
>>>> Warning: to mount a clustered volume without the cluster stack.
>>>> Please make sure you only mount the file system from one node.
>>>> Otherwise, the file system may be damaged.
>>>> Proceed (y/N):
>>>> ```
>>>> 
>>>>> It's not the case that you don't have a choice. Setting up the cluster stack is the way to stop customers doing something bad. I believe you have to educate customers to understand this is the cost of guarding data security; otherwise, when something bad happens, they will lose important data, maybe with no way to recover.
>>>> This feature is not enabled by default, and also shows enough info/warnings before executing.
>>>> Let me give another (maybe awkward) example:
>>>> a nocluster mount is like executing the command 'rm -rf /'. Do you think we should
>>>> tell/educate customers not to execute it?
>>>
>>> That's totally outside the domain of ocfs2; it's not an ocfs2 developer's job to tell customers not to do that.
>>>
>>> Here you provided an ocfs2 feature that can easily corrupt ocfs2.
>> 
>> First, this hang issue or any other related issue can be fixed. I have already
>> described a method in my previous mail (use es_reserved1[0] of ocfs2_extended_slot).
>>
>> Second, this feature was merged two years ago, and only you have reported a hang issue.
>> Our customers have also used it for two years, with no bugs reported by them. That means,
>> at least, this feature works fine with the pcmk stack.
>> 
>>> 
>>> As an ocfs2 developer, you should make sure ocfs2 is not corrupted even if a customer does something bad. That's why mkfs.ocfs2/fsck.ocfs2 check whether the ocfs2 volume is mounted in the cluster before changing anything.
>>
>> fsck.ocfs2 with '-F' can work in a noclustered env.
>> In 2005, commit 44c97d6ce8baeb4a6c37712d4c22d0702ebf7714 introduced this feature.
>> This year, commit 7085e9177adc7197250d872c50a05dfc9c531bdc enhanced it,
>> which lets fsck.ocfs2 work fully in a noclustered env.
>> (mkfs.ocfs2 can also work in local mount mode, which is another story.)
>> 
>>> 
>>>> 
>>>> The nocluster mount feature was designed to resolve a real-world customer pain point:
>>>> the SUSE HA stack uses pacemaker+corosync+fsdlm+ocfs2, which is complicated and inconvenient
>>>> to set up, and requires installing dozens of related packages.
>>>>
>>>> The main use case of the nocluster feature:
>>>> a customer wants to avoid setting up the HA stack, but wants to check an ocfs2 volume
>>>> or back up a volume.
>>> That doesn't mean you have to do this in the kernel. If customers find it painful to set up the HA stack, you should develop some script/app to make it easy.
>>
>> It's not a simple job to develop some script/app to help set up the HA stack.
>> Both SUSE and Red Hat have special teams to do this. At SUSE, this team has worked on it for many years.
>> If anyone could create an easy/powerful automatic HA setup tool, they could even found a company
>> to sell this software.
>> 
>>>> 
>>>> In my opinion, we should make ocfs2 more powerful and include more useful features for users.
>>>> If there are problems related to a new feature, we should do our best to fix it, not revert it.
>>>
>>> Only good/safe features; I don't think this one qualifies. Also, no one gave a Reviewed-by to this commit; I am not sure how it was merged.
>>
>> I have shared my idea about how to fix this hang issue. It's not a big bug.
>> More useful features could attract more users, which will make the ocfs2 community more powerful.
>> 
>>> 
>>> Joseph, what's your call on this?
>> 
>> Me too; I'll wait for maintainer feedback.
>> 
> 
> Seems I am missing some mails from this thread.
> The 'nocluster' mount was introduced by Gang and I think it has real
> user scenarios. Since node 0 is commonly used in
> o2cb, I am curious why there has been no bug report before.
> So let's try to fix the regression first.

A real user case doesn't mean this has to be done in the kernel. This sounds like doing something in the kernel to work around an issue that can be handled from user space.
I didn't see a Reviewed-by for the patch; how did it get merged?

Thanks,
Junxiao

>
> Thanks,
> Joseph
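
For readers on the pcmk stack who want to try the o2cb reproduction described above (one node configured with node number 0, mounted first), the following is a rough sketch of what such a configuration could look like. It only writes an o2cb-style file to a scratch path for review. The cluster name, node names, addresses and the `CONF_OUT` variable are placeholders invented here, and the stanza layout is an assumption that should be checked against the Oracle documentation linked in the thread.

```shell
#!/usr/bin/env bash
# Sketch only: write an o2cb-style cluster.conf to a scratch file for review.
# Nothing here touches /etc/ocfs2/cluster.conf.

conf="${CONF_OUT:-$(mktemp)}"   # CONF_OUT is a hypothetical override variable
cluster="democluster"            # placeholder cluster name

# emit_node NUMBER NAME IP: print one node stanza (tab-indented parameters).
emit_node() {
  printf 'node:\n'
  printf '\tnumber = %s\n' "$1"
  printf '\tcluster = %s\n' "$cluster"
  printf '\tip_port = 7777\n'
  printf '\tip_address = %s\n' "$3"
  printf '\tname = %s\n\n' "$2"
}

{
  emit_node 0 node0 192.0.2.10   # the node that mounts first in the report
  emit_node 1 node1 192.0.2.11   # the node whose mount is reported to hang
  printf 'cluster:\n'
  printf '\tname = %s\n' "$cluster"
  printf '\theartbeat_mode = local\n'
  printf '\tnode_count = 2\n'
} > "$conf"

echo "wrote $conf"
```

With a file like this in place on both nodes, the reported sequence would be the plain `mount -t ocfs2 /dev/vdd /mnt` on node 0 first and then on node 1.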

heming.zhao at suse.com

2022-Jun-07 02:38 UTC

[Ocfs2-devel] [PATCH] Revert "ocfs2: mount shared volume without ha stack"

On 6/7/22 10:21, Junxiao Bi wrote:
>
>> On Jun 6, 2022, at 7:07, Joseph Qi <joseph.qi at linux.alibaba.com> wrote:
>>
>> Seems I am missing some mails from this thread.
>> The 'nocluster' mount was introduced by Gang and I think it has real
>> user scenarios. Since node 0 is commonly used in
>> o2cb, I am curious why there has been no bug report before.
>> So let's try to fix the regression first.
> A real user case doesn't mean this has to be done in the kernel. This sounds like doing something in the kernel to work around an issue that can be handled from user space.
> I didn't see a Reviewed-by for the patch; how did it get merged?
> 
Gang left SUSE some time ago and is busy with his new job.
As I vaguely remember, he said this commit was approved & merged directly by Andrew Morton.
Gang contributed to the ocfs2 community for many years and established his competence
with the other maintainers & reviewers.

If Junxiao dislikes this feature and doesn't want to fix it as a bug,
I am willing to file a patch.

/Heming
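
The regression discussed in this thread can be pictured with a small toy model. This is not the kernel's `__ocfs2_find_empty_slot`; it is a minimal sketch that assumes the problem has the shape Junxiao describes, namely that a slot whose recorded node number is 0 is treated as free even though it is in use. All function and variable names below are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy model of a slot map; NOT the kernel code. Each slot has a "valid" flag
# and the node number that currently holds it.
slot_valid=(0 0 0 0)
slot_node=(0 0 0 0)
LAST_SLOT=-1

reset_slots() {
  slot_valid=(0 0 0 0)
  slot_node=(0 0 0 0)
  LAST_SLOT=-1
}

# find_empty_slot MODE: print the first slot considered free.
#   strict:  a slot is free only when its valid flag is clear.
#   lenient: a slot is also treated as free when its node number is 0
#            (the assumed shape of the regression).
find_empty_slot() {
  local mode=$1 i
  for i in "${!slot_valid[@]}"; do
    if (( slot_valid[i] == 0 )); then
      echo "$i"; return 0
    fi
    if [[ $mode == lenient ]] && (( slot_node[i] == 0 )); then
      echo "$i"; return 0
    fi
  done
  return 1
}

# take_slot MODE NODE_NUM: claim a slot and record its index in LAST_SLOT.
take_slot() {
  local slot
  slot=$(find_empty_slot "$1") || return 1
  slot_valid[slot]=1
  slot_node[slot]=$2
  LAST_SLOT=$slot
}

for mode in strict lenient; do
  reset_slots
  take_slot "$mode" 0; first=$LAST_SLOT
  take_slot "$mode" 1; second=$LAST_SLOT
  echo "$mode: node 0 -> slot $first, node 1 -> slot $second"
done
```

In the strict variant the two nodes end up in different slots. In the lenient variant node 1 is handed the slot node 0 already holds, which matches the report of both nodes going after the same journal. It would also explain why a pcmk setup, where node numbers are said to start above 0, never hits the problem.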
