Dietmar Putz
2017-Jun-28 12:42 UTC
[Gluster-users] setting gfid on .trashcan/... failed - total outage
Hello,

recently we twice had a partial gluster outage followed by a total outage of all four nodes. Looking into the gluster mailing list I found a very similar case in
http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html
but I'm not sure whether this issue is fixed...

Although this outage happened on glusterfs 3.7.18, which has received no updates since ~3.7.20, I would kindly ask whether this issue is known to be fixed in 3.8 or 3.10?
Unfortunately I did not find any corresponding information in the release notes...

best regards
Dietmar


The partial outage started as shown below; the very first entries occurred in the brick logs:

gl-master-04, brick1-mvol1.log :

[2017-06-23 16:35:11.373471] E [MSGID: 113020] [posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on /brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511 failed
[2017-06-23 16:35:11.392540] E [posix.c:3188:_fill_writev_xdata] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7f4f8c2aaa0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7f4f8caec62f] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7f4f8caec406] ) 0-mvol1-posix: fd: 0x7f4ef434225c inode: 0x7f4ef430bd6cgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
...

gl-master-04 : etc-glusterfs-glusterd.vol.log

[2017-06-23 16:35:18.872346] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65533
[2017-06-23 16:35:18.872421] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

gl-master-04 : glustershd.log

[2017-06-23 16:35:42.536840] E [MSGID: 108006] [afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-06-23 16:35:51.702413] E [socket.c:2292:socket_connect_finish] 0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)


gl-master-03, brick1-mvol1.log :

[2017-06-23 16:35:11.399769] E [MSGID: 113020] [posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on /brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511 failed
[2017-06-23 16:35:11.418559] E [posix.c:3188:_fill_writev_xdata] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7ff517087a0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7ff5178c962f] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7ff5178c9406] ) 0-mvol1-posix: fd: 0x7ff4c814a43c inode: 0x7ff4c82e1b5cgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
...

gl-master-03 : etc-glusterfs-glusterd.vol.log

[2017-06-23 16:35:19.879140] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
[2017-06-23 16:35:19.879201] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-23 16:35:19.879300] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
[2017-06-23 16:35:19.879314] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-23 16:35:19.879845] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
[2017-06-23 16:35:19.879859] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-23 16:35:42.538727] W [socket.c:596:__socket_rwv] 0-management: readv on /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket failed (No data available)
[2017-06-23 16:35:42.543486] I [MSGID: 106005] [glusterd-handler.c:5037:__glusterd_brick_rpc_notify] 0-management: Brick gl-master-03-int:/brick1/mvol1 has disconnected from glusterd.

gl-master-03 : glustershd.log

[2017-06-23 16:35:42.537752] E [MSGID: 108006] [afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-06-23 16:35:52.011016] E [socket.c:2292:socket_connect_finish] 0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)
[2017-06-23 16:35:53.010620] E [socket.c:2292:socket_connect_finish] 0-mvol1-client-2: connection to 10.0.1.154:49152 failed (Connection refused)


About 73 minutes later the remaining replicated pair was affected by the outage:

gl-master-02, brick1-mvol1.log :

[2017-06-23 17:48:30.093526] E [MSGID: 113018] [posix.c:2766:posix_create] 0-mvol1-posix: pre-operation lstat on parent /brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
[2017-06-23 17:48:30.093591] E [MSGID: 113018] [posix.c:1447:posix_mkdir] 0-mvol1-posix: pre-operation lstat on parent /brick1/mvol1/.trashcan//2290 failed [No such file or directory]
[2017-06-23 17:48:30.093636] E [MSGID: 113027] [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/ failed [File exists]
[2017-06-23 17:48:30.093670] E [MSGID: 113027] [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/.trashcan failed [File exists]
[2017-06-23 17:48:30.093701] E [MSGID: 113027] [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/.trashcan/ failed [File exists]
[2017-06-23 17:48:30.113559] E [MSGID: 113001] [posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on /brick1/mvol1/.trashcan//2290 failed [No such file or directory]
[2017-06-23 17:48:30.113630] E [MSGID: 113027] [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/.trashcan//2290 failed [File exists]
[2017-06-23 17:48:30.163155] E [MSGID: 113001] [posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on /brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
[2017-06-23 17:48:30.163282] E [MSGID: 113001] [posix.c:2832:posix_create] 0-mvol1-posix: setting xattrs on /brick1/mvol1/.trashcan//2290/uploads/170623_TVM_News.mp4_2017-06-23_174830 failed [No such file or directory]
[2017-06-23 17:48:30.165617] E [posix.c:3188:_fill_writev_xdata] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7f4ec77d9a0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7f4ecc1c162f] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7f4ecc1c1406] ) 0-mvol1-posix: fd: 0x7f4e70429b6c inode: 0x7f4e7041f9acgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]


The file mentioned in the brick log was still present in its original directory but not in the corresponding trashcan directory:

[ 14:29:29 ] - root@gl-master-01 /var/log/glusterfs $ls -lh /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT*
-rw-r--r-- 1 2001 2001 386M Mar 31 13:00 /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT.mp4
-rw-r--r-- 1 2001 2001 386M Jun  2 13:09 /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT_AT.mp4
[ 15:08:53 ] - root@gl-master-01 /var/log/glusterfs $

[ 15:11:04 ] - root@gl-master-01 /var/log/glusterfs $ls -lh /sdn/.trashcan/2290/uploads/170221_Sendung_Lieberum_01_AT*
[ 15:11:10 ] - root@gl-master-01 /var/log/glusterfs $


Some further information: the OS is Ubuntu 16.04.2 LTS, volume info below:

[ 11:31:53 ] - root@gl-master-03 ~ $gluster volume info mvol1

Volume Name: mvol1
Type: Distributed-Replicate
Volume ID: 2f5de6e4-66de-40a7-9f24-4762aad3ca96
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gl-master-01-int:/brick1/mvol1
Brick2: gl-master-02-int:/brick1/mvol1
Brick3: gl-master-03-int:/brick1/mvol1
Brick4: gl-master-04-int:/brick1/mvol1
Options Reconfigured:
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
nfs.disable: off
diagnostics.client-log-level: ERROR
changelog.changelog: on
performance.cache-refresh-timeout: 32
cluster.min-free-disk: 200GB
network.ping-timeout: 5
performance.io-thread-count: 64
performance.cache-size: 8GB
performance.readdir-ahead: on
features.trash: off
features.trash-max-filesize: 1GB
[ 11:31:56 ] - root@gl-master-03 ~ $


Host : gl-master-01
-rw-r----- 1 root root 232M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-02
-rw-r----- 1 root root 226M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-03
-rw-r----- 1 root root 254M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-04
-rw-r----- 1 root root 239M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------

--
Dietmar Putz
3Q GmbH
Wetzlarer Str. 86
D-14482 Potsdam

Telefax: +49 (0)331 / 2797 866 - 1
Telefon: +49 (0)331 / 2797 866 - 8
Mobile: +49 171 / 90 160 39
Mail: dietmar.putz at 3qsdn.com
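Since the excerpts above come from several nodes and several log files, it can help to merge them into one timeline to see which node logged the first error. The following is only a minimal sketch: the regular expression is derived from the log lines quoted in this message, and the names `parse_line`, `timeline`, and `Entry` are hypothetical helpers, not part of any GlusterFS tooling.

```python
import re
from datetime import datetime
from typing import Dict, List, NamedTuple, Optional

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official format definition):
#   [timestamp] LEVEL [MSGID: n] [file:line:function] message
# The "[MSGID: n]" part is optional, as in the rpcsvc.c lines.
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,6})\] "
    r"(?P<level>[A-Z]) "
    r"(?:\[MSGID: (?P<msgid>\d+)\] )?"
    r"\[(?P<where>[^\]]+)\] "
    r"(?P<text>.*)$"
)


class Entry(NamedTuple):
    when: datetime
    node: str
    level: str
    msgid: Optional[int]
    where: str  # e.g. "posix.c:2839:posix_create"
    text: str


def parse_line(node: str, line: str) -> Optional[Entry]:
    """Parse one log line; return None for lines that do not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    msgid = int(m.group("msgid")) if m.group("msgid") else None
    when = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return Entry(when, node, m.group("level"), msgid,
                 m.group("where"), m.group("text"))


def timeline(logs_by_node: Dict[str, List[str]]) -> List[Entry]:
    """Merge the parsed lines of all nodes and sort them by timestamp."""
    entries = []
    for node, lines in logs_by_node.items():
        for line in lines:
            entry = parse_line(node, line)
            if entry is not None:
                entries.append(entry)
    return sorted(entries, key=lambda e: e.when)
```

Fed with the two "setting gfid" lines from gl-master-03 and gl-master-04, `timeline` puts the gl-master-04 entry first, matching the order of the timestamps quoted above. Clock skew between nodes is not accounted for.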
Anoop C S
2017-Jun-29 08:48 UTC
[Gluster-users] setting gfid on .trashcan/... failed - total outage
On Wed, 2017-06-28 at 14:42 +0200, Dietmar Putz wrote:
> Hello,
>
> [...] Looking into the gluster mailing list I found a very similar case in
> http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html

If you are talking about a crash happening on the bricks, were you able to find any backtraces in any of the brick logs?

> [...]
> performance.readdir-ahead: on
> features.trash: off

mvol1 has the trash feature disabled, so you should not be seeing the above-mentioned errors in the brick logs any more.

> features.trash-max-filesize: 1GB
> [...]
> Host : gl-master-04
> -rw-r----- 1 root root 239M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
> -----------------------------------------------------

If these are the core files dumped by the brick crash, can you please load one into gdb as follows and paste the backtrace printed by the `bt` command?

$ gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash

(gdb) bt
Dietmar Putz
2017-Jun-29 15:13 UTC
[Gluster-users] setting gfid on .trashcan/... failed - total outage
Hello Anoop,

thank you for your reply. My answers are inline.

best regards
Dietmar

On 29.06.2017 10:48, Anoop C S wrote:
> On Wed, 2017-06-28 at 14:42 +0200, Dietmar Putz wrote:
>> [...] Looking into the gluster mailing list I found a very similar case in
>> http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html
> If you are talking about a crash happening on bricks, were you able to find any backtraces from any
> of the brick logs?

Yes, the crash happened on the bricks. I followed the hints in the similar case mentioned above, but unfortunately I did not find a backtrace in any of the brick logs.

>> [...]
>> features.trash: off
> mvol1 has disabled the trash feature. So you should not be seeing the above mentioned errors in
> brick logs further.

Yes, right after the second outage we decided to disable the trash feature.

>> [...]
> If these are the core files dumped due to brick crash, can you please attach it to gdb as follows
> and paste the backtrace by executing the `bt` command within it.
>
> $ gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash
>
> (gdb) bt

Unfortunately there is another problem: even though the filename ends with 'crash' and its creation time matches the time of the error, the file _usr_sbin_glusterfsd.0.crash is not recognised as a core dump. I currently don't know how to handle this. I tried several things without success, so I am adding the 'head' of the file:

[ 14:47:37 ] - root@gl-master-03 ~ $gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
...
"/var/crash/_usr_sbin_glusterfsd.0.crash" is not a core dump: File format not recognised
(gdb)

[ 14:48:30 ] - root@gl-master-03 ~ $file /var/crash/_usr_sbin_glusterfsd.0.crash
/var/crash/_usr_sbin_glusterfsd.0.crash: ASCII text, with very long lines

[ 14:48:37 ] - root@gl-master-03 ~ $head /var/crash/_usr_sbin_glusterfsd.0.crash
ProblemType: Crash
Architecture: amd64
Date: Fri Jun 23 16:35:13 2017
DistroRelease: Ubuntu 16.04
ExecutablePath: /usr/sbin/glusterfsd
ExecutableTimestamp: 1481112595
ProcCmdline: /usr/sbin/glusterfsd -s gl-master-03-int --volfile-id mvol1.gl-master-03-int.brick1-mvol1 -p /var/lib/glusterd/vols/mvol1/run/gl-master-03-int-brick1-mvol1.pid -S /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket --brick-name /brick1/mvol1 -l /var/log/glusterfs/bricks/brick1-mvol1.log --xlator-option *-posix.glusterd-uuid=056fb1db-9a49-422d-81fb-94e1881313fd --brick-port 49152 --xlator-option mvol1-server.listen-port=49152
ProcCwd: /
ProcEnviron:
 LANGUAGE=en_GB:en
[ 14:48:52 ] - root@gl-master-03 ~ $

--
Dietmar Putz
3Q GmbH
Wetzlarer Str. 86
D-14482 Potsdam

Telefax: +49 (0)331 / 2797 866 - 1
Telefon: +49 (0)331 / 2797 866 - 8
Mobile: +49 171 / 90 160 39
Mail: dietmar.putz at 3qsdn.com
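The `file` result ("ASCII text") and the `ProblemType: Crash` header suggest that the file is an Ubuntu apport crash report that wraps the core, rather than a raw core file, which would explain why gdb rejects it. That is an inference from the output above, not something confirmed in this thread. On Ubuntu such reports are normally unpacked with the `apport-unpack` tool, after which gdb can be pointed at the extracted core. Before doing that, the text header can be inspected cheaply. The sketch below assumes the usual report layout (one `Key: value` per line, continuation lines indented by one space, binary fields announced as `Key: base64`); `read_report_header` and `looks_like_apport_crash` are hypothetical helper names.

```python
from typing import Dict, Iterable


def read_report_header(lines: Iterable[str]) -> Dict[str, str]:
    """Collect the leading text fields of a crash report.

    Stops at the first binary field (value "base64") so that a large
    embedded payload is never read into memory.
    """
    fields: Dict[str, str] = {}
    key = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(" "):
            # Continuation line of a multi-line value such as ProcEnviron.
            if key is not None:
                fields[key] += ("\n" if fields[key] else "") + line[1:]
            continue
        name, sep, value = line.partition(":")
        if not sep:
            break  # not a "Key: value" line, so stop parsing
        value = value.strip()
        if value == "base64":
            break  # assumed marker for an embedded binary payload
        key = name
        fields[key] = value
    return fields


def looks_like_apport_crash(fields: Dict[str, str]) -> bool:
    """Heuristic: a crash report names its type and the crashed binary."""
    return fields.get("ProblemType") == "Crash" and "ExecutablePath" in fields
```

Passing an open file object for the report works, because the function consumes lines lazily and stops before the binary part. If the check succeeds, unpacking the report into a scratch directory and loading the extracted core into gdb would be the next thing to try.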