Xavi Hernandez
2019-Mar-27 07:25 UTC
[Gluster-users] POSIX locks and disconnections between clients and bricks
Hi Raghavendra,

On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa
<rgowdapp at redhat.com> wrote:

> All,
>
> Glusterfs cleans up POSIX locks held on an fd when the client/mount
> through which those locks are held disconnects from bricks/server.
> This helps Glusterfs to not run into a stale-lock problem later (for
> example, if the application unlocks while the connection was still
> down). However, this means the lock is no longer exclusive, as other
> applications/clients can acquire the same lock. To communicate that
> locks are no longer valid, we are planning to mark the fd (which has
> POSIX locks) bad on a disconnect, so that any future operations on
> that fd will fail, forcing the application to re-open the fd and
> re-acquire the locks it needs [1].

Wouldn't it be better to retake the locks when the brick is
reconnected, if the lock is still in use?

BTW, the referenced bug is not public. Should we open another bug to
track this?

> Note that with AFR/replicate in the picture we can prevent errors to
> the application as long as a quorum number of children "never ever"
> lost connection with bricks after locks have been acquired. I am using
> the term "never ever" because locks are not healed back after
> re-connection, and hence the first disconnect would've marked the fd
> bad, and the fd remains so even after re-connection happens. So, it's
> not just a quorum number of children "currently online", but a quorum
> number of children "never having disconnected from bricks after locks
> are acquired".

I think this requirement is not feasible. In a distributed file system,
sooner or later all bricks will be disconnected. It could be because of
failures or because an upgrade is done, but it will happen.

The difference here is how long fds are kept open. If applications open
and close files frequently enough (i.e. the fd is not kept open longer
than it takes for more than a quorum of bricks to have disconnected),
then there's no problem. The problem can only appear in applications
that open files for a long time and also use POSIX locks. In this case,
the only good solution I see is to retake the locks on brick
reconnection.

> However, this use case is not affected if the application doesn't
> acquire any POSIX locks. So, I am interested in knowing:
> * whether your use cases use POSIX locks?
> * whether it is feasible for your application to re-open fds and
> re-acquire locks on seeing EBADFD errors?

I think that many applications are not prepared to handle that.

Xavi

> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>
> regards,
> Raghavendra
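For illustration, here is a minimal sketch (not from this thread; the
function names are invented) of the recovery loop an application would
need to handle EBADFD (a Linux-specific errno) the way the proposal
expects: re-open the fd, re-acquire the lock, and retry. Even this toy
version shows why many applications can't easily do it.

    /* Hypothetical sketch, not from this thread: application-side
     * recovery when an fd starts returning EBADFD after its locks
     * were lost. All function names here are invented. */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Take an exclusive whole-file POSIX lock, waiting if necessary. */
    static int lock_whole_file(int fd)
    {
        struct flock fl = {
            .l_type   = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 0,      /* 0 means "to end of file" */
        };
        return fcntl(fd, F_SETLKW, &fl);
    }

    /* Write through the fd; on EBADFD, re-open, re-lock, retry once. */
    ssize_t write_with_relock(int *fd, const char *path,
                              const void *buf, size_t len)
    {
        ssize_t ret = write(*fd, buf, len);
        if (ret < 0 && errno == EBADFD) {
            close(*fd);
            *fd = open(path, O_RDWR);
            if (*fd < 0 || lock_whole_file(*fd) < 0)
                return -1;
            /* The hard part: another client may have held the lock
             * and changed the file while ours was lost, and the
             * re-opened fd starts at offset 0. The application must
             * revalidate its state here before retrying. */
            ret = write(*fd, buf, len);
        }
        return ret;
    }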
Soumya Koduri
2019-Mar-27 09:53 UTC
[Gluster-users] POSIX locks and disconnections between clients and bricks
On 3/27/19 12:55 PM, Xavi Hernandez wrote:
> Hi Raghavendra,
>
> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa
> <rgowdapp at redhat.com> wrote:
>
>> All,
>>
>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>> through which those locks are held disconnects from bricks/server.
>> This helps Glusterfs to not run into a stale-lock problem later (for
>> example, if the application unlocks while the connection was still
>> down). However, this means the lock is no longer exclusive, as other
>> applications/clients can acquire the same lock. To communicate that
>> locks are no longer valid, we are planning to mark the fd (which has
>> POSIX locks) bad on a disconnect, so that any future operations on
>> that fd will fail, forcing the application to re-open the fd and
>> re-acquire the locks it needs [1].
>
> Wouldn't it be better to retake the locks when the brick is
> reconnected, if the lock is still in use?
>
> BTW, the referenced bug is not public. Should we open another bug to
> track this?
>
>> Note that with AFR/replicate in the picture we can prevent errors to
>> the application as long as a quorum number of children "never ever"
>> lost connection with bricks after locks have been acquired. I am
>> using the term "never ever" because locks are not healed back after
>> re-connection, and hence the first disconnect would've marked the fd
>> bad, and the fd remains so even after re-connection happens. So, it's
>> not just a quorum number of children "currently online", but a quorum
>> number of children "never having disconnected from bricks after locks
>> are acquired".
>
> I think this requirement is not feasible. In a distributed file
> system, sooner or later all bricks will be disconnected. It could be
> because of failures or because an upgrade is done, but it will happen.
>
> The difference here is how long fds are kept open. If applications
> open and close files frequently enough (i.e. the fd is not kept open
> longer than it takes for more than a quorum of bricks to have
> disconnected), then there's no problem. The problem can only appear in
> applications that open files for a long time and also use POSIX locks.
> In this case, the only good solution I see is to retake the locks on
> brick reconnection.
>
>> However, this use case is not affected if the application doesn't
>> acquire any POSIX locks. So, I am interested in knowing:
>> * whether your use cases use POSIX locks?
>> * whether it is feasible for your application to re-open fds and
>> re-acquire locks on seeing EBADFD errors?
>
> I think that many applications are not prepared to handle that.

+1 to all the points mentioned by Xavi.

This has been a day-1 issue for all the applications that use locks
(like NFS-Ganesha and Samba). Not many applications re-open fds and
re-acquire locks; on receiving EBADFD, that error is most likely
propagated to the application clients.

I agree with Xavi that it's better to heal/re-acquire the locks on
brick reconnect, before the brick accepts any fresh requests. I also
suggest making this healing mechanism generic enough (if possible) to
heal any server-side state (like upcall, leases, etc.).
Thanks,
Soumya
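To make the suggested heal/re-acquire path concrete, here is a toy
model (this is not Gluster's client code; the struct names and entry
points are invented, and real healing would replay locks over the
protocol rather than via a local fcntl): the client remembers every
granted lock and replays the list on reconnect, marking the fd bad only
when a replay actually conflicts.

    /* Hypothetical sketch, not Gluster code: the bookkeeping that
     * client-side lock healing needs. */
    #include <fcntl.h>
    #include <stdbool.h>
    #include <stdlib.h>

    struct held_lock {
        struct flock      fl;    /* the lock exactly as granted */
        struct held_lock *next;
    };

    struct tracked_fd {
        int               fd;
        bool              bad;   /* set only when healing fails */
        struct held_lock *locks;
    };

    /* Record a lock after the server grants it. */
    int remember_lock(struct tracked_fd *tfd, const struct flock *fl)
    {
        struct held_lock *hl = malloc(sizeof(*hl));
        if (!hl)
            return -1;
        hl->fl = *fl;
        hl->next = tfd->locks;
        tfd->locks = hl;
        return 0;
    }

    /* On reconnect, re-acquire everything we held. If any lock is now
     * owned by someone else (F_SETLK fails), exclusivity was lost and
     * the fd must be marked bad, as in the original proposal. */
    void heal_locks_on_reconnect(struct tracked_fd *tfd)
    {
        for (struct held_lock *hl = tfd->locks; hl; hl = hl->next) {
            struct flock fl = hl->fl;
            if (fcntl(tfd->fd, F_SETLK, &fl) < 0) {
                tfd->bad = true; /* a conflicting owner appeared */
                return;
            }
        }
    }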
Raghavendra Gowdappa
2019-Mar-27 10:52 UTC
[Gluster-users] POSIX locks and disconnections between clients and bricks
On Wed, Mar 27, 2019 at 12:56 PM Xavi Hernandez <jahernan at redhat.com>
wrote:

> Hi Raghavendra,
>
> On Wed, Mar 27, 2019 at 2:49 AM Raghavendra Gowdappa
> <rgowdapp at redhat.com> wrote:
>
>> All,
>>
>> Glusterfs cleans up POSIX locks held on an fd when the client/mount
>> through which those locks are held disconnects from bricks/server.
>> This helps Glusterfs to not run into a stale-lock problem later (for
>> example, if the application unlocks while the connection was still
>> down). However, this means the lock is no longer exclusive, as other
>> applications/clients can acquire the same lock. To communicate that
>> locks are no longer valid, we are planning to mark the fd (which has
>> POSIX locks) bad on a disconnect, so that any future operations on
>> that fd will fail, forcing the application to re-open the fd and
>> re-acquire the locks it needs [1].
>
> Wouldn't it be better to retake the locks when the brick is
> reconnected, if the lock is still in use?

There is also a possibility that clients may never reconnect. That's
the primary reason why bricks assume the worst (the client will not
reconnect) and clean up the locks.

> BTW, the referenced bug is not public. Should we open another bug to
> track this?

I've just opened up the comment to give enough context. I'll open a bug
upstream too.

>> Note that with AFR/replicate in the picture we can prevent errors to
>> the application as long as a quorum number of children "never ever"
>> lost connection with bricks after locks have been acquired. I am
>> using the term "never ever" because locks are not healed back after
>> re-connection, and hence the first disconnect would've marked the fd
>> bad, and the fd remains so even after re-connection happens. So, it's
>> not just a quorum number of children "currently online", but a quorum
>> number of children "never having disconnected from bricks after locks
>> are acquired".
>
> I think this requirement is not feasible. In a distributed file
> system, sooner or later all bricks will be disconnected. It could be
> because of failures or because an upgrade is done, but it will happen.
>
> The difference here is how long fds are kept open. If applications
> open and close files frequently enough (i.e. the fd is not kept open
> longer than it takes for more than a quorum of bricks to have
> disconnected), then there's no problem. The problem can only appear in
> applications that open files for a long time and also use POSIX locks.
> In this case, the only good solution I see is to retake the locks on
> brick reconnection.

Agree. But lock healing should be done only by HA layers like AFR/EC,
as only they know whether there are enough online bricks to have
prevented any conflicting lock. Protocol/client itself doesn't have
enough information to do that. If it's plain distribute, I don't see a
way to heal locks without losing the property of exclusivity of locks.

What I proposed is a short-term solution. The mid-to-long-term solution
should be a lock-healing feature implemented in AFR/EC. In fact, I had
this conversation with +Karampuri, Pranith <pkarampu at redhat.com>
before posting this message to the ML.

>> However, this use case is not affected if the application doesn't
>> acquire any POSIX locks. So, I am interested in knowing:
>> * whether your use cases use POSIX locks?
>> * whether it is feasible for your application to re-open fds and
>> re-acquire locks on seeing EBADFD errors?
>
> I think that many applications are not prepared to handle that.

I too suspected that, and in fact I'm not too happy with the solution.
But I went ahead with this mail as I heard implementing lock healing in
AFR will take time, and hence there are no alternative short-term
solutions.

> Xavi
>
>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1689375#c7
>>
>> regards,
>> Raghavendra
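A rough sketch of the quorum bookkeeping described above (hypothetical,
not AFR's actual implementation): per locked fd, track which children
have never disconnected since the locks were acquired. Reconnection
does not restore a child's bit, exactly because the brick has already
dropped the locks by then.

    /* Hypothetical sketch, not AFR code: the "never ever disconnected"
     * quorum check described in the thread. */
    #include <stdbool.h>
    #include <stdint.h>

    struct locked_fd_state {
        uint32_t intact;     /* bit i set: child i never disconnected
                                since this fd acquired its locks */
        int      child_count;
        int      quorum;     /* e.g. child_count / 2 + 1 */
    };

    static int popcount32(uint32_t v)
    {
        int n = 0;
        for (; v; v &= v - 1)
            n++;
        return n;
    }

    /* Called when child `i` loses its brick connection. Reconnection
     * does NOT set the bit again: the brick already dropped the locks. */
    void on_child_disconnect(struct locked_fd_state *s, int i)
    {
        s->intact &= ~(1u << i);
    }

    /* The fd stays usable only while a quorum of children never lost
     * the locks; otherwise exclusivity can no longer be guaranteed. */
    bool fd_is_good(const struct locked_fd_state *s)
    {
        return popcount32(s->intact) >= s->quorum;
    }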