thr3ads.net - Gluster users - [Gluster-users] setting gfid on .trashcan/... failed


Dietmar Putz

2017-Jun-28 12:42 UTC

[Gluster-users] setting gfid on .trashcan/... failed - total outage

Hello,

Recently we twice had a partial Gluster outage followed by a total
outage of all four nodes. Searching the Gluster mailing list, I found
a very similar case in
http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html
but I'm not sure whether this issue has been fixed.

Even though this outage happened on GlusterFS 3.7.18, and the 3.7 series
has received no further updates since ~3.7.20, I would kindly ask
whether this issue is known to be fixed in 3.8 or 3.10.
Unfortunately, I did not find any corresponding information in the
release notes.

best regards
Dietmar


The partial outage started as shown below; the very first entries
appeared in the brick logs:

gl-master-04, brick1-mvol1.log :

[2017-06-23 16:35:11.373471] E [MSGID: 113020] [posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on /brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511 failed
[2017-06-23 16:35:11.392540] E [posix.c:3188:_fill_writev_xdata] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7f4f8c2aaa0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7f4f8caec62f] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7f4f8caec406] ) 0-mvol1-posix: fd: 0x7f4ef434225c inode: 0x7f4ef430bd6cgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
...


gl-master-04 : etc-glusterfs-glusterd.vol.log

[2017-06-23 16:35:18.872346] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65533
[2017-06-23 16:35:18.872421] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

gl-master-04 : glustershd.log

[2017-06-23 16:35:42.536840] E [MSGID: 108006] [afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-06-23 16:35:51.702413] E [socket.c:2292:socket_connect_finish] 0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)



gl-master-03, brick1-mvol1.log :

[2017-06-23 16:35:11.399769] E [MSGID: 113020] [posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on /brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511 failed
[2017-06-23 16:35:11.418559] E [posix.c:3188:_fill_writev_xdata] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7ff517087a0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7ff5178c962f] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7ff5178c9406] ) 0-mvol1-posix: fd: 0x7ff4c814a43c inode: 0x7ff4c82e1b5cgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
...


gl-master-03 : etc-glusterfs-glusterd.vol.log

[2017-06-23 16:35:19.879140] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
[2017-06-23 16:35:19.879201] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-23 16:35:19.879300] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
[2017-06-23 16:35:19.879314] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-23 16:35:19.879845] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
[2017-06-23 16:35:19.879859] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2017-06-23 16:35:42.538727] W [socket.c:596:__socket_rwv] 0-management: readv on /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket failed (No data available)
[2017-06-23 16:35:42.543486] I [MSGID: 106005] [glusterd-handler.c:5037:__glusterd_brick_rpc_notify] 0-management: Brick gl-master-03-int:/brick1/mvol1 has disconnected from glusterd.


gl-master-03 : glustershd.log

[2017-06-23 16:35:42.537752] E [MSGID: 108006] [afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-06-23 16:35:52.011016] E [socket.c:2292:socket_connect_finish] 0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)
[2017-06-23 16:35:53.010620] E [socket.c:2292:socket_connect_finish] 0-mvol1-client-2: connection to 10.0.1.154:49152 failed (Connection refused)
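
The entries above share one layout: a timestamp, a one-letter severity, an optional `[MSGID: n]`, the source location `[file.c:line:function]`, then the translator name and message. A minimal Python sketch for pulling those fields out (for example, to line up events across the four nodes) might look like this. The regular expression is inferred only from the entries quoted here, not from GlusterFS documentation, and `parse_line`/`LogEntry` are hypothetical names.

```python
import re
import uuid
from dataclasses import dataclass
from typing import Optional

# Layout inferred from the entries quoted above (not from GlusterFS docs):
#   [timestamp] LEVEL [MSGID: n] [file.c:line:function] <message>
# The [MSGID: n] part is optional; e.g. the posix.c:3188 entries lack it.
LOG_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"(?:\[MSGID: (?P<msgid>\d+)\]\s+)?"
    r"\[(?P<src>[^\]:]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)
GFID_RE = re.compile(
    r"gfid:([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


@dataclass
class LogEntry:
    timestamp: str
    level: str            # one letter: E, W, I, ...
    msgid: Optional[int]  # None when the entry carries no MSGID
    source: str           # "file.c:line"
    function: str
    message: str          # translator name, optional call trace, text

    def has_null_gfid(self) -> bool:
        """True if the message mentions an all-zero gfid."""
        m = GFID_RE.search(self.message)
        return m is not None and uuid.UUID(m.group(1)).int == 0


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one unwrapped log line; return None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(
        timestamp=m["ts"],
        level=m["level"],
        msgid=int(m["msgid"]) if m["msgid"] else None,
        source=f"{m['src']}:{m['line']}",
        function=m["func"],
        message=m["message"],
    )


# Example: the _fill_writev_xdata entry from gl-master-04 (call trace shortened).
entry = parse_line(
    "[2017-06-23 16:35:11.392540] E [posix.c:3188:_fill_writev_xdata] (--> trace) "
    "0-mvol1-posix: fd: 0x7f4ef434225c inode: "
    "0x7f4ef430bd6cgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]"
)
print(entry.function, entry.msgid, entry.has_null_gfid())
```

Keeping the message as one free-text field avoids guessing at the structure of the `(-->...)` call traces, which vary between entries.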



About 73 minutes later, the remaining replica pair was affected by the
outage:

gl-master-02, brick1-mvol1.log :

[2017-06-23 17:48:30.093526] E [MSGID: 113018] [posix.c:2766:posix_create] 0-mvol1-posix: pre-operation lstat on parent /brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
[2017-06-23 17:48:30.093591] E [MSGID: 113018] [posix.c:1447:posix_mkdir] 0-mvol1-posix: pre-operation lstat on parent /brick1/mvol1/.trashcan//2290 failed [No such file or directory]
[2017-06-23 17:48:30.093636] E [MSGID: 113027] [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/ failed [File exists]
[2017-06-23 17:48:30.093670] E [MSGID: 113027] [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/.trashcan failed [File exists]
[2017-06-23 17:48:30.093701] E [MSGID: 113027] [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/.trashcan/ failed [File exists]
[2017-06-23 17:48:30.113559] E [MSGID: 113001] [posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on /brick1/mvol1/.trashcan//2290 failed [No such file or directory]
[2017-06-23 17:48:30.113630] E [MSGID: 113027] [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/.trashcan//2290 failed [File exists]
[2017-06-23 17:48:30.163155] E [MSGID: 113001] [posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on /brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
[2017-06-23 17:48:30.163282] E [MSGID: 113001] [posix.c:2832:posix_create] 0-mvol1-posix: setting xattrs on /brick1/mvol1/.trashcan//2290/uploads/170623_TVM_News.mp4_2017-06-23_174830 failed  [No such file or directory]
[2017-06-23 17:48:30.165617] E [posix.c:3188:_fill_writev_xdata] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7f4ec77d9a0b] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7f4ecc1c162f] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7f4ecc1c1406] ) 0-mvol1-posix: fd: 0x7f4e70429b6c inode: 0x7f4e7041f9acgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
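
The "about 73 minutes" can be checked against the first error logged on each replica pair (16:35:11.373471 on gl-master-04, 17:48:30.093526 on gl-master-02). A small sketch, assuming both timestamps come from clocks set to the same timezone; `minutes_between` is a hypothetical helper:

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # e.g. 2017-06-23 16:35:11.373471


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two log timestamps taken from the same clock/timezone."""
    delta = datetime.strptime(end, LOG_TS_FORMAT) - datetime.strptime(start, LOG_TS_FORMAT)
    return delta.total_seconds() / 60


# First error on gl-master-04 (replicate-1) vs. first error on gl-master-02.
gap = minutes_between("2017-06-23 16:35:11.373471", "2017-06-23 17:48:30.093526")
print(f"{gap:.1f} minutes")  # prints "73.3 minutes"
```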


The file named in the brick log was still present in its original
directory, but not in the corresponding trashcan directory:


[ 14:29:29 ] - root at gl-master-01  /var/log/glusterfs $ls -lh /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT*
-rw-r--r-- 1 2001 2001 386M Mar 31 13:00 /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT.mp4
-rw-r--r-- 1 2001 2001 386M Jun  2 13:09 /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT_AT.mp4
[ 15:08:53 ] - root at gl-master-01  /var/log/glusterfs $


[ 15:11:04 ] - root at gl-master-01  /var/log/glusterfs $ls -lh /sdn/.trashcan/2290/uploads/170221_Sendung_Lieberum_01_AT*
[ 15:11:10 ] - root at gl-master-01  /var/log/glusterfs $
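
Judging by the file names in the brick logs, the trash translator places files under `.trashcan/` with the original relative path plus a `_YYYY-MM-DD_HHMMSS` suffix. Under that assumption, a brick-side trashcan path can be mapped back to the original file on the client mount, which is the pairing the two `ls` checks above rely on. The mount point `/sdn` is taken from those `ls` commands; `original_path` is a hypothetical helper, not part of any GlusterFS tool.

```python
import re
from pathlib import PurePosixPath

# Suffix seen on the trashcan entries in the brick logs, e.g.
#   170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511
TRASH_SUFFIX_RE = re.compile(r"_\d{4}-\d{2}-\d{2}_\d{6}$")


def original_path(brick_trash_path: str, brick_root: str, mount_root: str) -> str:
    """Map a brick-side .trashcan path to the original file's path on the client mount.

    Hypothetical helper: assumes the trash layout inferred from the logs above.
    PurePosixPath collapses the '//' that appears in the logged paths.
    """
    rel = PurePosixPath(brick_trash_path).relative_to(brick_root)
    if len(rel.parts) < 2 or rel.parts[0] != ".trashcan":
        raise ValueError(f"not a file under {brick_root}/.trashcan: {brick_trash_path}")
    *dirs, name = rel.parts[1:]
    return str(PurePosixPath(mount_root, *dirs, TRASH_SUFFIX_RE.sub("", name)))


print(original_path(
    "/brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511",
    "/brick1/mvol1",
    "/sdn",
))  # prints /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT.mp4
```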


Some further information: the OS is Ubuntu 16.04.2 LTS; the volume info
is below:

[ 11:31:53 ] - root at gl-master-03  ~ $gluster volume info mvol1

Volume Name: mvol1
Type: Distributed-Replicate
Volume ID: 2f5de6e4-66de-40a7-9f24-4762aad3ca96
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gl-master-01-int:/brick1/mvol1
Brick2: gl-master-02-int:/brick1/mvol1
Brick3: gl-master-03-int:/brick1/mvol1
Brick4: gl-master-04-int:/brick1/mvol1
Options Reconfigured:
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
nfs.disable: off
diagnostics.client-log-level: ERROR
changelog.changelog: on
performance.cache-refresh-timeout: 32
cluster.min-free-disk: 200GB
network.ping-timeout: 5
performance.io-thread-count: 64
performance.cache-size: 8GB
performance.readdir-ahead: on
features.trash: off
features.trash-max-filesize: 1GB
[ 11:31:56 ] - root at gl-master-03  ~ $
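
With `Number of Bricks: 2 x 2 = 4`, consecutive bricks in the listed order form the replica sets, and the client translators (`mvol1-client-N`) are numbered from 0 in the same order. Assuming that usual Gluster convention, `mvol1-replicate-1` consists of gl-master-03 and gl-master-04 (clients 2 and 3), which matches the pair that failed first, while gl-master-01 and gl-master-02 form the pair that failed about 73 minutes later. A small sketch of that mapping; the function names are hypothetical:

```python
from typing import List, Tuple


def replica_sets(bricks: List[str], replica_count: int) -> List[Tuple[str, ...]]:
    """Group bricks into replica sets: consecutive bricks, in `gluster volume info` order."""
    if replica_count < 1 or len(bricks) % replica_count:
        raise ValueError("brick count must be a multiple of the replica count")
    return [tuple(bricks[i:i + replica_count]) for i in range(0, len(bricks), replica_count)]


def replicate_index_of_client(client_index: int, replica_count: int) -> int:
    """Which <vol>-replicate-K a <vol>-client-N belongs to (clients numbered from 0)."""
    return client_index // replica_count


bricks = [f"gl-master-0{n}-int:/brick1/mvol1" for n in range(1, 5)]
for k, members in enumerate(replica_sets(bricks, 2)):
    print(f"mvol1-replicate-{k}: {', '.join(members)}")

# mvol1-client-2 and mvol1-client-3 (refused connections in glustershd.log)
# both belong to replicate-1.
print(replicate_index_of_client(2, 2), replicate_index_of_client(3, 2))
```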


Host : gl-master-01
-rw-r----- 1 root root 232M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-02
-rw-r----- 1 root root 226M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-03
-rw-r----- 1 root root 254M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-04
-rw-r----- 1 root root 239M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------

-- 

Dietmar Putz
3Q GmbH
Wetzlarer Str. 86
D-14482 Potsdam
  
Telefax:  +49 (0)331 / 2797 866 - 1
Telefon:  +49 (0)331 / 2797 866 - 8
Mobile:   +49 171 / 90 160 39
Mail:     dietmar.putz at 3qsdn.com

Anoop C S

2017-Jun-29 08:48 UTC


On Wed, 2017-06-28 at 14:42 +0200, Dietmar Putz wrote:
> Hello,
> 
> recently we had two times a partial gluster outage followed by a total
> outage of all four nodes. Looking into the gluster mailing list i found
> a very similar case in
> http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html
If you are talking about a crash happening on the bricks, were you able to find
any backtraces in any of the brick logs?
> [...]
> features.trash: off
mvol1 has disabled the trash feature, so you should no longer see the
above-mentioned errors in the brick logs.
> [...]
> Host : gl-master-01
> -rw-r----- 1 root root 232M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
> -----------------------------------------------------
> Host : gl-master-02
> -rw-r----- 1 root root 226M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
> -----------------------------------------------------
> Host : gl-master-03
> -rw-r----- 1 root root 254M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
> -----------------------------------------------------
> Host : gl-master-04
> -rw-r----- 1 root root 239M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
> -----------------------------------------------------
If these are the core files dumped due to the brick crash, can you please load
them in gdb as follows and paste the backtrace produced by the `bt` command?

$ gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash

(gdb) bt

Dietmar Putz

2017-Jun-29 15:13 UTC


Hello Anoop,

Thank you for your reply.

My answers are inline below.

best regards

Dietmar


On 29.06.2017 10:48, Anoop C S wrote:
> On Wed, 2017-06-28 at 14:42 +0200, Dietmar Putz wrote:
>> Hello,
>>
>> recently we had two times a partial gluster outage followed by a total
>> outage of all four nodes. Looking into the gluster mailing list i found
>> a very similar case in
>> http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html
> If you are talking about a crash happening on bricks, were you able to find
> any backtraces from any of the brick logs?
Yes, the crash happened on the bricks.
I followed the hints in the similar case mentioned above, but unfortunately
I did not find any backtrace in any of the brick logs.

>> [...]
>> features.trash: off
> mvol1 has disabled the trash feature. So you should not be seeing the above
> mentioned errors in brick logs further.
Yes, right after the second outage we decided to disable the trash
feature.
>> [...]
>> Host : gl-master-01
>> -rw-r----- 1 root root 232M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
>> -----------------------------------------------------
>> Host : gl-master-02
>> -rw-r----- 1 root root 226M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
>> -----------------------------------------------------
>> Host : gl-master-03
>> -rw-r----- 1 root root 254M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
>> -----------------------------------------------------
>> Host : gl-master-04
>> -rw-r----- 1 root root 239M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
>> -----------------------------------------------------
> If these are the core files dumped due to brick crash, can you please attach it to gdb as follows
> and paste the backtrace by executing the `bt` command within it.
>
> $ gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash
>
> (gdb) bt
Unfortunately there is another problem: even though the file name ends in
'crash' and its timestamp matches the time of the error, gdb does not
recognize _usr_sbin_glusterfsd.0.crash as a core dump.
I currently don't know how to handle this and have tried several things
without success, so I am adding the 'head' of the file below.

[ 14:47:37 ] - root at gl-master-03  ~ $gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
...
"/var/crash/_usr_sbin_glusterfsd.0.crash" is not a core dump: File format not recognised
(gdb)

[ 14:48:30 ] - root at gl-master-03  ~ $file /var/crash/_usr_sbin_glusterfsd.0.crash
/var/crash/_usr_sbin_glusterfsd.0.crash: ASCII text, with very long lines
[ 14:48:37 ] - root at gl-master-03  ~ $head /var/crash/_usr_sbin_glusterfsd.0.crash
ProblemType: Crash
Architecture: amd64
Date: Fri Jun 23 16:35:13 2017
DistroRelease: Ubuntu 16.04
ExecutablePath: /usr/sbin/glusterfsd
ExecutableTimestamp: 1481112595
ProcCmdline: /usr/sbin/glusterfsd -s gl-master-03-int --volfile-id mvol1.gl-master-03-int.brick1-mvol1 -p /var/lib/glusterd/vols/mvol1/run/gl-master-03-int-brick1-mvol1.pid -S /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket --brick-name /brick1/mvol1 -l /var/log/glusterfs/bricks/brick1-mvol1.log --xlator-option *-posix.glusterd-uuid=056fb1db-9a49-422d-81fb-94e1881313fd --brick-port 49152 --xlator-option mvol1-server.listen-port=49152
ProcCwd: /
ProcEnviron:
  LANGUAGE=en_GB:en
[ 14:48:52 ] - root at gl-master-03  ~ $
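
The `file` output and the `ProblemType: Crash` header suggest that this is an Ubuntu apport crash report, a text file that typically embeds the core dump as an encoded `CoreDump` field, rather than a raw ELF core file, which would explain why gdb rejects it. Ubuntu's `apport-unpack` tool is the usual way to extract such a report into a directory, after which gdb can be pointed at the extracted `CoreDump` file. The sketch below only checks for the ELF magic bytes and parses the report's `Key: value` header; the function names are hypothetical and the format details are inferred from the `head` output above.

```python
from typing import Dict, Optional

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file, including core dumps


def is_elf_file(path: str) -> bool:
    """Return True if the file starts with the ELF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


def parse_apport_header(text: str) -> Dict[str, str]:
    """Parse apport-style 'Key: value' lines.

    Lines that start with whitespace are treated as continuations of the
    previous key (e.g. the entries under ProcEnviron).
    """
    fields: Dict[str, str] = {}
    key: Optional[str] = None
    for line in text.splitlines():
        if line[:1].isspace() and key is not None:
            cont = line.strip()
            fields[key] = f"{fields[key]}\n{cont}" if fields[key] else cont
        elif ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields


header = """ProblemType: Crash
Architecture: amd64
Date: Fri Jun 23 16:35:13 2017
ExecutablePath: /usr/sbin/glusterfsd
ProcEnviron:
  LANGUAGE=en_GB:en
"""
report = parse_apport_header(header)
print(report["ProblemType"], report["ExecutablePath"])
```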

