Jon Swanson
2010-Jul-21 08:44 UTC
[Gluster-users] 3.0.5 client crash - afr_set_split_brain
Seeing a glusterfs client die oddly.

--Setup--
Client:
Fedora 12 2.6.32.16-141.fc12.x86_64
# rpm -qa |egrep 'fuse|glust'
fuse-2.8.4-1.fc12.x86_64
glusterfs-client-3.0.5-1.fc11.x86_64
fuse-libs-2.8.4-1.fc12.x86_64
glusterfs-common-3.0.5-1.fc11.x86_64

Servers - 6 nodes with a 3 x distribute:
Fedora 12 2.6.32.9-70.fc12.x86_64
[root@pdbsearch11.pm.prod ~]# rpm -qa | grep glust
glusterfs-common-3.0.5-1.fc11.x86_64
glusterfs-server-3.0.5-1.fc11.x86_64

Process:
1. Client copies a large number of files to the gluster mount
2. Client does a recursive listing of all files copied (ls -R)
3. The recursive listing reaches a file whose checksum does not match for some reason (see the log snippet below)
4. Client dies horribly, and the mount point becomes invalid with the following error:
   gluster-mount/file: Transport endpoint is not connected

I've tried to keep the snippets below as brief as possible. If you think the volume definition files would help, let me know and I'll be happy to post those here as well.

Any help or suggestions are most welcome.

Thanks!
---

This is the corresponding snippet from 'tail -f gluster-mount.log':

> [2010-07-21 16:34:48] N [client-protocol.c:6288:client_setvolume_cbk] pdbindex2-1: Connected to 192.168.201.88:6996, attached to remote volume 'brick'.
> [2010-07-21 16:35:33] E [afr.c:107:afr_set_split_brain] mirror-0: invalid argument: inode
> [2010-07-21 16:35:33] E [afr-self-heal-algorithm.c:768:sh_diff_checksum_cbk] mirror-0: checksum on /index.201007211105.deploy/file failed on subvolume indexcopy-0 (File descriptor in bad state)
> [2010-07-21 16:35:33] E [afr-self-heal-algorithm.c:768:sh_diff_checksum_cbk] mirror-0: checksum on /index.201007211105.deploy/file failed on subvolume indexcopy-1 (File descriptor in bad state)
> pending frames:
> frame : type(1) op(LOOKUP)
> frame : type(1) op(LOOKUP)
> frame : type(1) op(LOOKUP)
>
> patchset: v3.0.5
> signal received: 11
> time of crash: 2010-07-21 16:35:33
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.0.5
> /lib64/libc.so.6(+0x32740)[0x7fa9c949b740]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4b2ea)[0x7fa9c85ff2ea]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4b557)[0x7fa9c85ff557]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(+0x4be10)[0x7fa9c85ffe10]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_algo_diff+0x196)[0x7fa9c85fffc2]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_sync_prepare+0x256)[0x7fa9c85e9a91]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_fix+0x5db)[0x7fa9c85ea078]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/replicate.so(afr_sh_data_fstat_cbk+0x167)[0x7fa9c85ea34e]
> /usr/lib64/glusterfs/3.0.5/xlator/cluster/distribute.so(dht_attr_cbk+0x238)[0x7fa9c8820e08]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(client_fstat_cbk+0x178)[0x7fa9c8a59868]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(protocol_client_interpret+0x1df)[0x7fa9c8a60274]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(protocol_client_pollin+0xc6)[0x7fa9c8a60ff5]
> /usr/lib64/glusterfs/3.0.5/xlator/protocol/client.so(notify+0x158)[0x7fa9c8a6154d]
> /usr/lib64/libglusterfs.so.0(xlator_notify+0xd8)[0x7fa9c9c1b639]
> /usr/lib64/glusterfs/3.0.5/transport/socket.so(socket_event_poll_in+0x46)[0x7fa9c6f59249]
> /usr/lib64/glusterfs/3.0.5/transport/socket.so(socket_event_handler+0xc4)[0x7fa9c6f5957c]
> /usr/lib64/libglusterfs.so.0(+0x3eefc)[0x7fa9c9c40efc]
> /usr/lib64/libglusterfs.so.0(+0x3f0ee)[0x7fa9c9c410ee]
> /usr/lib64/libglusterfs.so.0(event_dispatch+0x74)[0x7fa9c9c4140d]
> /usr/sbin/glusterfs(main+0xf53)[0x406187]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fa9c9487b1d]
> /usr/sbin/glusterfs[0x402679]
> ---------

If we look at the respective files, their checksums are fine:

> [16:40] ~> for i in `seq 10 15`; do echo -n "search$i: "; ssh search$i md5sum /data/export/index.201007211105.deploy/file; done
> search10: md5sum: /data/export/index.201007211105.deploy/file: No such file or directory
> search11: 8605b1467bece54ed7ccd13e086ee299 /data/export/index.201007211105.deploy/file
> search12: md5sum: /data/export/index.201007211105.deploy/file: No such file or directory
> search13: md5sum: /data/export/index.201007211105.deploy/file: No such file or directory
> search14: 8605b1467bece54ed7ccd13e086ee299 /data/export/index.201007211105.deploy/file
> search15: md5sum: /data/export/index.201007211105.deploy/file: No such file or directory

If we look at the extended attributes, however, we notice that 'trusted.posix.gen' differs between the two copies:

> for i in `seq 10 15`; do echo -n "search$i: "; ssh pdbsearch$i getfattr -d -m - /data/export/index.201007211105.deploy/file; done
> search10: getfattr: /data/export/index.201007211105.deploy/file: No such file or directory
> search11: getfattr: Removing leading '/' from absolute path names
> # file: data/export/index.201007211105.deploy/file
> security.selinux="unconfined_u:object_r:default_t:s0
> trusted.afr.indexcopy-0=0sAAAAAQAAAAAAAAAA
> trusted.afr.indexcopy-1=0sAAAAAQAAAAAAAAAA
> trusted.posix.gen=0sTEFukQAAAEY
>
> search12: getfattr: /data/export/index.201007211105.deploy/file: No such file or directory
> search13: getfattr: /data/export/index.201007211105.deploy/file: No such file or directory
> search14: getfattr: Removing leading '/' from absolute path names
> # file: data/export/index.201007211105.deploy/file
> security.selinux="unconfined_u:object_r:default_t:s0
> trusted.afr.indexcopy-0=0sAAAAAQAAAAAAAAAA
> trusted.afr.indexcopy-1=0sAAAAAQAAAAAAAAAA
> trusted.posix.gen=0sTEaPaAAAAAI
>
> search15: getfattr: /data/export/index.201007211105.deploy/file: No such file or directory
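
For anyone comparing the attributes by hand: getfattr prints binary values as base64 with a '0s' prefix, so they can be decoded to raw bytes. The sketch below is only an illustration and is not part of the original report. The helper names decode_xattr and afr_pending are made up here, and reading trusted.afr.* as three big-endian 32-bit pending counters (data, metadata, entry) is an assumption about the AFR changelog format that the thread does not confirm. The layout of trusted.posix.gen is not stated anywhere in the thread, so its bytes are only printed as hex.

```python
import base64
import struct


def decode_xattr(value: str) -> bytes:
    """Decode a getfattr value that carries the '0s' (base64) prefix."""
    if not value.startswith("0s"):
        raise ValueError("expected a base64 value with the '0s' prefix")
    b64 = value[2:]
    # Re-add '=' padding, which appears to be missing from the archived output.
    b64 += "=" * (-len(b64) % 4)
    return base64.b64decode(b64)


def afr_pending(value: str) -> tuple:
    """ASSUMPTION: three big-endian 32-bit counters (data, metadata, entry)."""
    return struct.unpack(">III", decode_xattr(value))


# Values copied from the getfattr output in the report.
print("trusted.afr.indexcopy-*:", afr_pending("0sAAAAAQAAAAAAAAAA"))
for host, gen in (("search11", "0sTEFukQAAAEY"), ("search14", "0sTEaPaAAAAAI")):
    print(host, "trusted.posix.gen:", decode_xattr(gen).hex())
```

Under that assumed layout, both trusted.afr values decode to (1, 0, 0). The two trusted.posix.gen values decode to 4c416e9100000046 on search11 and 4c468f6800000002 on search14, which shows the difference as raw bytes without saying what it means.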
Anush Shetty
2010-Jul-21 10:25 UTC
[Gluster-users] 3.0.5 client crash - afr_set_split_brain
Hi Jon,

Thanks for reporting. I have logged a bug for this. You can follow it up here:
bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=1188

- Anush