Vijay Bellur
2015-Mar-08 00:29 UTC
[Gluster-users] Gluster errors create zombie processes [LOGS ATTACHED]
On 03/07/2015 06:20 PM, Przemysław Mroczek wrote:
> Hi guys,
>
> We have a Rails app that uses Gluster as its distributed file system.
> The Gluster servers are hosted independently by another company as part
> of a deal; we have no control over them and connect to them with the
> Gluster native client.
>
> We tried to resolve this issue with help from the admins of the company
> hosting our Gluster servers, but they say it is a client issue. We have
> run out of ideas about how that is possible, since we are not doing
> anything special here.
>
> Information about the independent Gluster servers:
> - Version: 3.6.0.42.1
> - They are running Red Hat
> - They are an enterprise, so they always use older versions
>
> Our servers:
> - System version: Ubuntu 14.04
> - Gluster client version: 3.6.2
>
> The exact problem is that errors in Gluster often (a couple of times a
> week) cause processes to become zombies. It happens with our
> application server (unicorn), nginx, and our crawling script, which
> runs as a daemon.
>
> Our fstab file:
>
> 10.10.11.17:/drslk-prod /mnt/storage glusterfs defaults,_netdev,nobootwait,fetch-attempts=10 0 0
> 10.10.11.17:/drslk-backup /mnt/backup glusterfs defaults,_netdev,nobootwait,fetch-attempts=10 0 0
>
> Logs from gluster:
>
> [2015-02-18 12:36:12.375695] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x186)[0x7fb41ddeada6] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb41dbc1c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb41dbc1d8e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x82)[0x7fb41dbc3602] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fb41dbc3d98] ))))) 0-drslk-prod-client-10: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-02-18 12:36:12.361489 (xid=0x5d475da)
> [2015-02-18 12:36:12.375765] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10: remote operation failed: Transport endpoint is not connected. Path: /system/posts/00/00/71/77/59.jpg (2ad81c2b-a141-478d-9dd4-253345edbceb)
> [2015-02-18 12:36:12.376288] E [rpc-clnt.c:362:saved_frames_unwind] (--> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x186)[0x7fb41ddeada6] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fb41dbc1c7e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fb41dbc1d8e] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x82)[0x7fb41dbc3602] (--> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x48)[0x7fb41dbc3d98] ))))) 0-drslk-prod-client-10: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2015-02-18 12:36:12.361858 (xid=0x5d475db)
> [2015-02-18 12:36:12.376355] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10: remote operation failed: Transport endpoint is not connected. Path: /system/posts/00/00/08 (f5c33a99-719e-4ea2-ad1f-33b893af103d)
> [2015-02-18 12:36:12.376711] I [socket.c:3292:socket_submit_request] 0-drslk-prod-client-10: not connected (priv->connected = 0)
> [2015-02-18 12:36:12.376749] W [rpc-clnt.c:1562:rpc_clnt_submit] 0-drslk-prod-client-10: failed to submit rpc-request (XID: 0x5d475dc Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (drslk-prod-client-10)
> [2015-02-18 12:36:12.376814] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10: remote operation failed: Transport endpoint is not connected. Path: (null) (00000000-0000-0000-0000-000000000000)
> [2015-02-18 12:36:12.376829] I [client.c:2215:client_rpc_notify] 0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client process will keep trying to connect to glusterd until brick's port is available
> [2015-02-18 12:36:12.376834] W [rpc-clnt.c:1562:rpc_clnt_submit] 0-drslk-prod-client-10: failed to submit rpc-request (XID: 0x5d475dd Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (drslk-prod-client-10)
> [2015-02-18 12:36:12.376906] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10: remote operation failed: Transport endpoint is not connected. Path: (null) (00000000-0000-0000-0000-000000000000)
> [2015-02-18 12:36:12.376931] E [socket.c:2267:socket_connect_finish] 0-drslk-prod-client-10: connection to 10.10.11.23:24007 failed (Connection refused)
> [2015-02-18 12:36:12.379296] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10: remote operation failed: Transport endpoint is not connected. Path: (null) (00000000-0000-0000-0000-000000000000)
> [2015-02-18 12:36:12.379700] W [client-rpc-fops.c:2766:client3_3_lookup_cbk] 0-drslk-prod-client-10: remote operation failed: Transport endpoint is not connected. Path: (null) (00000000-0000-0000-0000-000000000000)
> [2015-02-18 13:10:52.759736] E [client-handshake.c:1496:client_query_portmap_cbk] 0-drslk-prod-client-10: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
> [2015-02-18 13:10:52.759796] I [client.c:2215:client_rpc_notify] 0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client process will keep trying to connect to glusterd until brick's port is available
> [2015-02-18 13:11:02.897307] I [rpc-clnt.c:1761:rpc_clnt_reconfig] 0-drslk-prod-client-10: changing port to 49349 (from 0)
> [2015-02-18 13:11:02.898097] I [client-handshake.c:1413:select_server_supported_programs] 0-drslk-prod-client-10: Using Program GlusterFS 3.3, Num (1298437), Version (330)
> [2015-02-18 13:11:02.898446] I [client-handshake.c:1200:client_setvolume_cbk] 0-drslk-prod-client-10: Connected to drslk-prod-client-10, attached to remote volume '/GLUSTERFS/drslk-prod'.
> [2015-02-18 13:11:02.898460] I [client-handshake.c:1210:client_setvolume_cbk] 0-drslk-prod-client-10: Server and Client lk-version numbers are not same, reopening the fds

Can you provide the gluster volume configuration details?

It does look like frame-timeout for the volume has been set to 60. Is
there any specific reason? Normally altering the frame-timeout is not
recommended.

-Vijay
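For anyone reading a client log like the one quoted above, a rough way to see how long a brick connection stayed down is to pair each "disconnected from" message with the next "Connected to" message for the same translator. The following is only a minimal Python sketch under that assumption; the function name outage_windows, the regular expression, and the sample lines (copied from the log above) are illustrative and not part of any Gluster tooling.

```python
import re
from datetime import datetime

# Assumed line layout, based on the quoted log:
# "[timestamp] LEVEL [file:line:function] translator: message"
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+\[(?P<src>[^\]]+)\]\s+"
    r"(?P<xl>\S+):\s+(?P<msg>.*)$"
)


def outage_windows(lines):
    """Return (translator, down_at, up_at, seconds) for each outage found."""
    down_since = {}  # translator name -> time of the first disconnect seen
    windows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # e.g. backtrace entries, which use a different layout
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        xl, msg = m["xl"], m["msg"]
        if msg.startswith("disconnected from"):
            # Keep the earliest disconnect; repeated messages do not reset it.
            down_since.setdefault(xl, ts)
        elif msg.startswith("Connected to") and xl in down_since:
            start = down_since.pop(xl)
            windows.append((xl, start, ts, (ts - start).total_seconds()))
    return windows


SAMPLE = [
    "[2015-02-18 12:36:12.376829] I [client.c:2215:client_rpc_notify] "
    "0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client "
    "process will keep trying to connect to glusterd until brick's port is available",
    "[2015-02-18 13:10:52.759796] I [client.c:2215:client_rpc_notify] "
    "0-drslk-prod-client-10: disconnected from drslk-prod-client-10. Client "
    "process will keep trying to connect to glusterd until brick's port is available",
    "[2015-02-18 13:11:02.898446] I [client-handshake.c:1200:client_setvolume_cbk] "
    "0-drslk-prod-client-10: Connected to drslk-prod-client-10, attached to "
    "remote volume '/GLUSTERFS/drslk-prod'.",
]

if __name__ == "__main__":
    for xl, start, end, seconds in outage_windows(SAMPLE):
        print(f"{xl}: down from {start} to {end} ({seconds:.0f} s)")
```

On the three sample lines this reports one window for 0-drslk-prod-client-10, from 12:36:12 to 13:11:02, which is roughly 35 minutes during which that brick was unreachable from the client.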
Przemysław Mroczek
2015-Mar-08 13:36 UTC
[Gluster-users] Gluster errors create zombie processes [LOGS ATTACHED]
I don't have the volfiles; they are not on our machines. As I said
previously, we have no control over the Gluster servers. I did see a
graph in the logs that looks similar to a volume file, so I will paste
it here, but we really have no influence over it. We are just using the
client to connect to Gluster servers that we do not control.

volume drslk-prod-client-0
    type protocol/client
    option ping-timeout 20
    option remote-host brick13.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-1
    type protocol/client
    option ping-timeout 20
    option remote-host brick14.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-2
    type protocol/client
    option ping-timeout 20
    option remote-host brick15.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-0
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-0 drslk-prod-client-1 drslk-prod-client-2
end-volume

volume drslk-prod-client-3
    type protocol/client
    option ping-timeout 20
    option remote-host brick16.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-4
    type protocol/client
    option ping-timeout 20
    option remote-host brick17.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-5
    type protocol/client
    option ping-timeout 20
    option remote-host brick18.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-1
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-3 drslk-prod-client-4 drslk-prod-client-5
end-volume

volume drslk-prod-client-6
    type protocol/client
    option ping-timeout 20
    option remote-host brick19.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-7
    type protocol/client
    option ping-timeout 20
    option remote-host brick20.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-8
    type protocol/client
    option ping-timeout 20
    option remote-host brick21.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-2
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-6 drslk-prod-client-7 drslk-prod-client-8
end-volume

volume drslk-prod-client-9
    type protocol/client
    option ping-timeout 20
    option remote-host brick22.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-10
    type protocol/client
    option ping-timeout 20
    option remote-host brick23.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-client-11
    type protocol/client
    option ping-timeout 20
    option remote-host brick24.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-3
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-9 drslk-prod-client-10 drslk-prod-client-11
end-volume

volume drslk-prod-dht
    type cluster/distribute
    option min-free-disk 10%
    option readdir-optimize on
    subvolumes drslk-prod-replicate-0 drslk-prod-replicate-1 drslk-prod-replicate-2 drslk-prod-replicate-3
end-volume

volume drslk-prod-write-behind
    type performance/write-behind
    option cache-size 1MB
    subvolumes drslk-prod-dht
end-volume

volume drslk-prod-read-ahead
    type performance/read-ahead
    subvolumes drslk-prod-write-behind
end-volume

volume drslk-prod-readdir-ahead
    type performance/readdir-ahead
    subvolumes drslk-prod-read-ahead
end-volume

volume drslk-prod-io-cache
    type performance/io-cache
    option cache-timeout 60
    option cache-size 512MB
    subvolumes drslk-prod-readdir-ahead
end-volume

volume drslk-prod-quick-read
    type performance/quick-read
    option cache-size 512MB
    subvolumes drslk-prod-io-cache
end-volume

volume drslk-prod-md-cache
    type performance/md-cache
    subvolumes drslk-prod-quick-read
end-volume

volume drslk-prod
    type debug/io-stats
    option latency-measurement off
    option count-fop-hits off
    subvolumes drslk-prod-md-cache
end-volume

volume meta-autoload
    type meta
    subvolumes drslk-prod
end-volume

Btw, do you think that the different versions of the Gluster client and
Gluster server could be an issue here?

2015-03-08 1:29 GMT+01:00 Vijay Bellur <vbellur at redhat.com>:
> Can you provide the gluster volume configuration details?
>
> It does look like frame-timeout for the volume has been set to 60. Is
> there any specific reason? Normally altering the frame-timeout is not
> recommended.
>
> -Vijay
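Since the only copy of the configuration available on the client side is the graph printed in the client log, a small script can pull the timeout settings out of it without server access. This is a minimal Python sketch, assuming the graph has already been cleaned of log line numbers and follows the "volume / type / option / subvolumes / end-volume" layout shown above; parse_volfile and client_timeouts are hypothetical helper names, not Gluster APIs.

```python
def parse_volfile(text):
    """Parse volfile text into {volume: {"type", "options", "subvolumes"}}."""
    volumes = {}
    current = None  # the volume block being read, if any
    for raw in text.splitlines():
        words = raw.split()
        if not words:
            continue
        key = words[0]
        if key == "volume" and len(words) > 1:
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[words[1]] = current
        elif key == "end-volume":
            current = None
        elif current is None:
            continue  # ignore anything outside a volume block
        elif key == "type" and len(words) > 1:
            current["type"] = words[1]
        elif key == "option" and len(words) > 2:
            current["options"][words[1]] = " ".join(words[2:])
        elif key == "subvolumes":
            current["subvolumes"] = words[1:]
    return volumes


def client_timeouts(volumes):
    """Map each protocol/client volume to (ping-timeout, frame-timeout)."""
    return {
        name: (vol["options"].get("ping-timeout"),
               vol["options"].get("frame-timeout"))
        for name, vol in volumes.items()
        if vol["type"] == "protocol/client"
    }


# Two blocks taken from the graph in this message, as sample input.
SAMPLE = """\
volume drslk-prod-client-10
    type protocol/client
    option ping-timeout 20
    option remote-host brick23.gluster.iadm
    option remote-subvolume /GLUSTERFS/drslk-prod
    option transport-type socket
    option frame-timeout 60
    option send-gids true
end-volume

volume drslk-prod-replicate-3
    type cluster/replicate
    option read-hash-mode 2
    option data-self-heal-window-size 128
    option quorum-type auto
    subvolumes drslk-prod-client-9 drslk-prod-client-10 drslk-prod-client-11
end-volume
"""

if __name__ == "__main__":
    for name, (ping, frame) in client_timeouts(parse_volfile(SAMPLE)).items():
        print(f"{name}: ping-timeout={ping} frame-timeout={frame}")
```

Run against the full graph, this would list all twelve client volumes with ping-timeout 20 and frame-timeout 60, which is the setting being asked about in this thread.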