thr3ads.net - Gluster users - [Gluster-users] POSIX locks and disconnections between clients and bricks [Mar 2019]

Raghavendra Gowdappa

2019-Mar-27 10:52 UTC

[Gluster-users] POSIX locks and disconnections between clients and bricks

On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <jahernan at redhat.com>
wrote:
> Hi Raghavendra,
>
> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa <rgowdapp at redhat.com>
> wrote:
>
>> All,
>>
>> GlusterFS cleans up POSIX locks held on an fd when the client/mount
>> through which those locks are held disconnects from the bricks/server. This
>> keeps GlusterFS from running into a stale-lock problem later (e.g., if the
>> application unlocks while the connection is still down). However, it also
>> means the lock is no longer exclusive, as other applications/clients can
>> acquire the same lock. To communicate that the locks are no longer valid,
>> we are planning to mark the fd (which has POSIX locks) bad on a disconnect,
>> so that any future operations on that fd will fail, forcing the application
>> to re-open the fd and re-acquire the locks it needs [1].
>>
>
> Wouldn't it be better to retake the locks when the brick is reconnected,
> if the lock is still in use?
>
There is also a possibility that clients may never reconnect. That's the
primary reason why bricks assume the worst (the client will not reconnect)
and clean up the locks.

> BTW, the referenced bug is not public. Should we open another bug to track
> this ?
>
I've just opened up the comment to give enough context. I'll open a bug
upstream too.

>
>
>>
>> Note that with AFR/replicate in the picture, we can prevent errors from
>> reaching the application as long as a quorum number of children "never
>> ever" lost their connection with the bricks after the locks were acquired.
>> I am using the term "never ever" because locks are not healed after
>> re-connection, and hence the first disconnect would have marked the fd bad,
>> and the fd remains so even after re-connection happens. So, it's not just a
>> quorum number of children "currently online", but a quorum number of
>> children "never having disconnected from the bricks after the locks were
>> acquired".
>>
>
> I think this requisite is not feasible. In a distributed file system,
> sooner or later all bricks will be disconnected. It could be because of
> failures or because an upgrade is done, but it will happen.
>
> The difference here is how long fds are kept open. If applications open
> and close files frequently enough (i.e. the fd is not kept open longer
> than it takes to have more than quorum bricks disconnected), then there's
> no problem. The problem can only appear in applications that keep files
> open for a long time and also use POSIX locks. In this case, the only good
> solution I see is to retake the locks on brick reconnection.
>
Agreed. But lock-healing should be done only by HA layers like AFR/EC, as
only they know whether enough bricks stayed online to have prevented any
conflicting lock. Protocol/client itself doesn't have enough information to
do that. If it's a plain distribute volume, I don't see a way to heal locks
without losing the exclusivity of the locks.
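
The quorum argument above can be sketched as a toy check. This is only an
illustration, not AFR/EC code: `quorum` and `safe_to_heal` are hypothetical
names, and quorum is assumed to be a simple strict majority of the bricks
that hold a copy of the file.

```python
def quorum(total_bricks):
    """Smallest strict majority of bricks (assumed quorum rule)."""
    return total_bricks // 2 + 1


def safe_to_heal(total_bricks, bricks_still_locked):
    """Return True if healing a lost lock cannot break exclusivity.

    bricks_still_locked is the set of bricks on which this client kept
    its lock for the whole time. If that set is a majority, no other
    client could have been granted a conflicting lock on a majority.
    """
    return len(bricks_still_locked) >= quorum(total_bricks)


# Replica 3, one brick dropped: the lock survived on a majority.
print(safe_to_heal(3, {"brick2", "brick3"}))

# Plain distribute: the file lives on one brick, so a disconnect leaves
# no brick that kept the lock and nothing proves it stayed exclusive.
print(safe_to_heal(1, set()))
```

In this model a layer that only sees a single connection (like
protocol/client) never has the `bricks_still_locked` view, which is why the
decision is left to the replication/dispersal layer.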

What I proposed is a short-term solution. The mid- to long-term solution
should be a lock-healing feature implemented in AFR/EC. In fact, I had this
conversation with +Karampuri, Pranith <pkarampu at redhat.com> before
posting this message to the mailing list.

>
>> However, this use case is not affected if the application doesn't acquire
>> any POSIX locks. So, I am interested in knowing:
>> * Do your use cases use POSIX locks?
>> * Is it feasible for your application to re-open fds and re-acquire locks
>> on seeing EBADFD errors?
>>
>
> I think that many applications are not prepared to handle that.
>
I suspected that too, and in fact I'm not too happy with the solution. But
I went ahead with this mail as I heard that implementing lock-heal in AFR
will take time, and hence there are no alternative short-term solutions.
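
For reference, a minimal sketch of what "re-open the fd and re-acquire
locks" could look like on the application side. Everything here is an
assumption for illustration: `LockedFile` is a hypothetical helper, the lock
is a whole-file exclusive lock taken with Python's `fcntl.lockf`, and EBADF
is used as a fallback on platforms that do not define EBADFD.

```python
import errno
import fcntl
import os

# EBADFD is Linux-specific; fall back to EBADF elsewhere (assumption).
BAD_FD_ERRNOS = {errno.EBADF, getattr(errno, "EBADFD", errno.EBADF)}


class LockedFile:
    """Hypothetical helper: holds an exclusive POSIX lock on a file and
    re-opens and re-locks the file when the fd is reported as bad."""

    def __init__(self, path):
        self.path = path
        self.fd = None
        self.reopen_count = 0
        self._open_and_lock()

    def _open_and_lock(self):
        self.fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o644)
        fcntl.lockf(self.fd, fcntl.LOCK_EX)  # blocks until granted

    def _reopen(self):
        try:
            os.close(self.fd)
        except OSError:
            pass  # the old fd may already be unusable
        self.reopen_count += 1
        self._open_and_lock()

    def run(self, operation, max_retries=1):
        """Call operation(fd); on a bad-fd error, re-open, re-lock, retry."""
        for attempt in range(max_retries + 1):
            try:
                return operation(self.fd)
            except OSError as exc:
                if exc.errno not in BAD_FD_ERRNOS or attempt == max_retries:
                    raise
                # Another client may have held the lock in between, so the
                # caller must re-validate anything read under the old lock.
                self._reopen()

    def close(self):
        os.close(self.fd)
```

Usage would be along the lines of
`LockedFile("/mnt/gluster/data").run(lambda fd: os.write(fd, b"x"))`, where
the mount path is made up. The retry alone is not enough for correctness:
once the fd has gone bad, the lock was not exclusive for some time, so any
state cached under the old lock has to be re-read.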

> Xavi
>
>
>>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>
>> regards,
>> Raghavendra
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>

Raghavendra Gowdappa

2019-Mar-27 10:54 UTC

[Gluster-users] POSIX locks and disconnections between clients and bricks

On Wed, Mar 27, 2019 at 4:22 PM Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:
> I suspected that too, and in fact I'm not too happy with the solution. But
> I went ahead with this mail as I heard that implementing lock-heal in AFR
> will take time, and hence there are no alternative short-term solutions.
>
Also, failing loudly is preferable to silently dropping locks.


Xavi Hernandez

2019-Mar-27 11:43 UTC

[Gluster-users] POSIX locks and disconnections between clients and bricks

On Wed, Mar 27, 2019 at 11:52 AM Raghavendra Gowdappa <rgowdapp at redhat.com>
wrote:
> There is also a possibility that clients may never reconnect. That's the
> primary reason why bricks assume the worst (the client will not reconnect)
> and clean up the locks.
>
True, so it's fine to clean up the locks. I'm not saying that locks
shouldn't be released on disconnect. The assumption is that if the client
has really died, it will also disconnect from the other bricks, which will
release the locks. So, eventually, another client will have enough quorum
to attempt a lock that will succeed. In other words, if a client gets
disconnected from too many bricks simultaneously (loses quorum), then that
client can be considered bad and can return errors to the application.
This should also cause the locks on the remaining connected bricks to be
released.

On the other hand, if the disconnection is very short and the client has
not died, it will still hold the lock on enough bricks (it has quorum) to
prevent other clients from successfully acquiring a lock. In this case, when
the brick reconnects, all existing locks should be reacquired to restore the
original state before the disconnection.
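
The two cases above can be played through in a small simulation. This is a
toy model under stated assumptions, not GlusterFS code: `Brick` and `Client`
are hypothetical classes, there is a single whole-file lock, quorum is a
strict majority, and a brick drops a client's lock as soon as that client
disconnects from it.

```python
class Brick:
    def __init__(self, name):
        self.name = name
        self.owner = None  # name of the client holding the lock, if any

    def try_lock(self, client_name):
        if self.owner in (None, client_name):
            self.owner = client_name
            return True
        return False

    def release(self, client_name):
        if self.owner == client_name:
            self.owner = None


class Client:
    def __init__(self, name, bricks):
        self.name = name
        self.bricks = {b.name: b for b in bricks}
        self.connected = set(self.bricks)
        self.locked = set()       # bricks on which we hold the lock
        self.lock_in_use = False  # the application still wants the lock
        self.bad = False          # set once quorum of locks is lost

    def _quorum(self):
        return len(self.bricks) // 2 + 1

    def _release_all(self):
        for brick_name in self.locked:
            self.bricks[brick_name].release(self.name)
        self.locked.clear()

    def lock(self):
        if self.bad:
            return False
        for brick_name in self.connected:
            if self.bricks[brick_name].try_lock(self.name):
                self.locked.add(brick_name)
        if len(self.locked) < self._quorum():
            self._release_all()  # could not reach quorum: back off
            return False
        self.lock_in_use = True
        return True

    def disconnect(self, brick_name):
        self.connected.discard(brick_name)
        self.bricks[brick_name].release(self.name)  # brick-side cleanup
        self.locked.discard(brick_name)
        if self.lock_in_use and len(self.locked) < self._quorum():
            # Lost quorum: fail loudly and drop the remaining locks.
            self.bad = True
            self.lock_in_use = False
            self._release_all()

    def reconnect(self, brick_name):
        self.connected.add(brick_name)
        if self.lock_in_use and not self.bad:
            # Short outage with quorum kept: retake the lock.
            if self.bricks[brick_name].try_lock(self.name):
                self.locked.add(brick_name)
```

With three bricks and two clients A and B: after A locks and loses one
brick, B's lock attempt fails because A still holds two of three; when the
brick comes back, A retakes its lock there. If A instead loses two bricks,
it is marked bad, its last lock is released, and B can then lock.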

> Agreed. But lock-healing should be done only by HA layers like AFR/EC, as
> only they know whether enough bricks stayed online to have prevented any
> conflicting lock. Protocol/client itself doesn't have enough information to
> do that. If it's a plain distribute volume, I don't see a way to heal locks
> without losing the exclusivity of the locks.
>
Healing of locks acquired while a brick was disconnected needs to be
handled by AFR/EC. However, locks already present at the moment of
disconnection could be recovered by the client xlator itself, as long as the
file has not been closed (which the client xlator already knows).

Xavi

