Arend-Jan Wijtzes
2009-Jun-23 09:47 UTC
[Gluster-users] bailout after period of inactivity
Hi Gluster people,

We are seeing errors when GlusterFS is accessed after a long period (days) of
inactivity (the FS is used, but not from this machine). The behavior is that the
accessing application is confronted with an error, and after some time (seconds,
not minutes) the filesystem is available again. Whatever is causing it, I don't
think the application should have to handle this.

Here's the significant portion of the logfile:

"""
2009-03-26 13:37:22 W [socket.c:1277:socket_init] trans: disabling non-blocking IO
Version      : glusterfs 2.0.0rc1 built on Jan 28 2009 16:42:26
TLA Revision : glusterfs--mainline--3.0--patch-844
Starting Time: 2009-03-26 13:37:22
Command line : glusterfs -s 10.0.0.166 /mnt/alexandria_uk/
given volfile
+-----
  1: # volumes from ukarchive0
  2: volume brick-0-0
  3:   type protocol/client
  4:   option transport-type tcp/client
  5:   option remote-host ukarchive0
  6:   option remote-subvolume brick0
  7: end-volume
  8:
  9: # volumes from ukarchive1
 10: volume brick-1-0
 11:   type protocol/client
 12:   option transport-type tcp/client
 13:   option remote-host ukarchive1
 14:   option remote-subvolume brick0
 15: end-volume
 16:
 17: # volumes from ukarchive2
 18: volume brick-2-0
 19:   type protocol/client
 20:   option transport-type tcp/client
 21:   option remote-host ukarchive2
 22:   option remote-subvolume brick0
 23: end-volume
 24:
 25: # volumes from ukarchive3
 26: volume brick-3-0
 27:   type protocol/client
 28:   option transport-type tcp/client
 29:   option remote-host ukarchive3
 30:   option remote-subvolume brick0
 31: end-volume
 32:
 33: # volumes from ukarchive4
 34: volume brick-4-0
 35:   type protocol/client
 36:   option transport-type tcp/client
 37:   option remote-host ukarchive4
 38:   option remote-subvolume brick0
 39: end-volume
 40:
 41: # volumes from ukarchive5
 42: volume brick-5-0
 43:   type protocol/client
 44:   option transport-type tcp/client
 45:   option remote-host ukarchive5
 46:   option remote-subvolume brick0
 47: end-volume
 48:
 49: volume ns0
 50:   type protocol/client
 51:   option transport-type tcp/client
 52:   option remote-host ukarchivens0
 53:   option remote-subvolume brick-namespace
 54: end-volume
 55:
 56: volume unify
 57:   type cluster/unify
 58:   option namespace ns0
 59:   option self-heal off
 60:   option scheduler rr
 61:   option rr.limits.min-free-disk 10%
 62:   option rr.refresh-interval 60
 63:   subvolumes brick-0-0 brick-1-0 brick-2-0 brick-3-0 brick-4-0 brick-5-0
 64: end-volume
+-----
2009-03-26 13:37:22 W [xlator.c:382:validate_xlator_volume_options] unify: option 'rr.refresh-interval' is deprecated, preferred is 'scheduler.refresh-interval', continuing with correction
2009-03-26 13:37:22 W [xlator.c:382:validate_xlator_volume_options] unify: option 'rr.limits.min-free-disk' is deprecated, preferred is 'scheduler.limits.min-free-disk', continuing with correction
2009-03-26 13:37:22 W [rr-options.c:179:rr_options_validate] rr: using scheduler.limits.min-free-disk = 10
2009-03-26 13:37:22 W [rr-options.c:207:rr_options_validate] rr: using scheduler.refresh-interval = 60
2009-04-10 16:18:06 E [client-protocol.c:263:call_bail] brick-0-0: activating bail-out. pending frames = 1. last sent = 2009-04-10 16:17:14. last received = 2009-03-30 03:13:43. transport-timeout = 42
2009-04-10 16:18:06 C [client-protocol.c:298:call_bail] brick-0-0: bailing transport
2009-04-10 16:18:06 E [client-protocol.c:263:call_bail] brick-1-0: activating bail-out. pending frames = 1. last sent = 2009-04-10 16:17:14. last received = 2009-03-30 03:13:43. transport-timeout = 42
2009-04-10 16:18:06 C [client-protocol.c:298:call_bail] brick-1-0: bailing transport
2009-04-10 16:18:06 E [saved-frames.c:148:saved_frames_unwind] brick-0-0: forced unwinding frame type(1) op(STATFS)
2009-04-10 16:18:06 E [client-protocol.c:263:call_bail] brick-3-0: activating bail-out. pending frames = 1. last sent = 2009-04-10 16:17:14. last received = 2009-03-30 03:13:43. transport-timeout = 42
2009-04-10 16:18:06 C [client-protocol.c:298:call_bail] brick-3-0: bailing transport
2009-04-10 16:18:06 E [saved-frames.c:148:saved_frames_unwind] brick-1-0: forced unwinding frame type(1) op(STATFS)
2009-04-10 16:18:06 E [client-protocol.c:263:call_bail] brick-2-0: activating bail-out. pending frames = 1. last sent = 2009-04-10 16:17:14. last received = 2009-03-30 03:13:43. transport-timeout = 42
2009-04-10 16:18:06 C [client-protocol.c:298:call_bail] brick-2-0: bailing transport
2009-04-10 16:18:06 E [saved-frames.c:148:saved_frames_unwind] brick-3-0: forced unwinding frame type(1) op(STATFS)
2009-04-10 16:18:06 E [client-protocol.c:263:call_bail] brick-4-0: activating bail-out. pending frames = 1. last sent = 2009-04-10 16:17:14. last received = 2009-03-30 03:13:43. transport-timeout = 42
2009-04-10 16:18:06 E [socket.c:104:__socket_rwv] brick-2-0: readv failed (Connection reset by peer)
2009-04-10 16:18:06 C [client-protocol.c:298:call_bail] brick-4-0: bailing transport
2009-04-10 16:18:06 E [socket.c:566:socket_proto_state_machine] brick-2-0: socket read failed (Connection reset by peer) in state 1 (10.0.0.169:6996)
2009-04-10 16:18:06 E [client-protocol.c:263:call_bail] brick-5-0: activating bail-out. pending frames = 1. last sent = 2009-04-10 16:17:14. last received = 2009-03-30 03:13:43. transport-timeout = 42
2009-04-10 16:18:06 E [saved-frames.c:148:saved_frames_unwind] brick-2-0: forced unwinding frame type(1) op(STATFS)
2009-04-10 16:18:06 C [client-protocol.c:298:call_bail] brick-5-0: bailing transport
2009-04-10 16:18:06 E [saved-frames.c:148:saved_frames_unwind] brick-4-0: forced unwinding frame type(1) op(STATFS)
2009-04-10 16:18:06 E [saved-frames.c:148:saved_frames_unwind] brick-5-0: forced unwinding frame type(1) op(STATFS)
2009-04-10 16:18:06 E [fuse-bridge.c:1907:fuse_statfs_cbk] glusterfs-fuse: 33700443: ERR => -1 (Transport endpoint is not connected)
"""

--
Arend-Jan Wijtzes -- Wiseguys -- www.wise-guys.nl
----- "Arend-Jan Wijtzes" <ajwytzes at wise-guys.nl> wrote:> Hi Gluster people, > > We are seeing errors when GlusterFS is being accessed after a long > period (days) of inactivity (the FS is used but not from this > machine).The error is not related to the inactivity. Take a look at these lines of the log file:> 2009-04-10 16:18:06 E [client-protocol.c:263:call_bail] brick-0-0: > activating bail-out. pending frames = 1. last sent = 2009-04-10 > 16:17:14. last received = 2009-03-30 03:13:43. transport-timeout = 42 > 2009-04-10 16:18:06 C [client-protocol.c:298:call_bail] brick-0-0: > bailing transportA request was sent at 16:17:14 but no reply has been received even at 16:18:06 (= 52 seconds). Since the transport timeout is set to 42 seconds, the request has been aborted. There was probably some kind of network issue which caused the reply to not arrive. Vikas -- Engineer - http://gluster.com/ A: Because it messes up the way people read text. Q: Why is a top-posting such a bad thing? --
Stephan von Krawczynski
2009-Jun-23 11:14 UTC
[Gluster-users] bailout after period of inactivity
On Tue, 23 Jun 2009 11:47:54 +0200
Arend-Jan Wijtzes <ajwytzes at wise-guys.nl> wrote:

> Hi Gluster people,
>
> We are seeing errors when GlusterFS is being accessed after a long
> period (days) of inactivity (the FS is used but not from this machine).
>
> The behavior is that the accessing application is confronted with an
> error and after some time (seconds, not minutes) the filesystem
> is available again.
>
> Whatever is causing it, I don't think the application should be bothered
> to handle this. Here's the significant portion of the logfile.

Hello,

I can confirm that kind of behaviour. In my testbed, starting servers and clients and leaving them alone for a day shows that the mounted trees are unavailable and hung after that. Stopping the client process and restarting it makes the trees work again.

--
Regards,
Stephan
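For reference, the restart described above amounts to unmounting and remounting the client. The exact commands depend on the setup; based on the command line shown in the log earlier in this thread, a rough sketch would be:

# unmount the hung GlusterFS client mount (mount point taken from the log above)
umount /mnt/alexandria_uk
# if the mount point is stuck, a lazy or FUSE unmount may be needed instead:
#   umount -l /mnt/alexandria_uk
#   fusermount -u /mnt/alexandria_uk
# remount using the same volfile server as in the original command line
glusterfs -s 10.0.0.166 /mnt/alexandria_uk/

This is only a workaround; it clears the hung mount but does not explain why the client stopped getting replies.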