A Ghoshal
2015-Jan-26 16:25 UTC
[Gluster-users] Pretty much any operation related to Gluster mounted fs hangs for a while
I am plagued with something of this sort, too!

What I mostly see when I explore these things is that

A) it's a split-brain.
B) the split-brain is because the gfid's on the two replicas are at odds.

You could check that out by
1. On each server, first 'cd' to where your brick is mounted.
2. getfattr -m . -d -e hex templates/assets/prod/temporary/13/user_1339200.png

You will see a trusted.gfid kind of extended attribute. If it's not the same on both servers, there's a problem.

Thanks,
Anirban

-----Tiago Santos <tiago at musthavemenus.com> wrote: -----
======================
To: gluster-users at gluster.org
From: Tiago Santos <tiago at musthavemenus.com>
Date: 01/26/2015 09:38PM
Subject: [Gluster-users] Pretty much any operation related to Gluster mounted fs hangs for a while
======================

Hey guys,

I'm experiencing this weird case for pretty much any command (ls, df, find, etc.) I try to run against a Gluster client filesystem.

So you can see what I'm talking about, here is a simple test I just ran:

    root@web3:~# date; time ls -ltrh /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png
    Mon Jan 26 07:00:27 PST 2015
    -rwx---r-- 1 mhmadmin mhmadmin 61K Jan 22 14:37 /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png

    real 0m*33.651s*
    user 0m0.001s
    sys 0m0.004s

    root@web3:~# date; time ls -ltrh /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png
    Mon Jan 26 07:01:03 PST 2015
    ls: cannot access /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png: *Input/output error*

    real *1m40.241s*
    user 0m0.000s
    sys 0m0.003s

    root@web3:~# date; time ls -ltrh /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png
    Mon Jan 26 07:02:51 PST 2015
    ls: cannot access /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png: *Input/output error*

    real *0m12.834s*
    user 0m0.000s
    sys 0m0.003s

    root@web3:~# date; time ls -ltrh /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png
    Mon Jan 26 07:03:10 PST 2015
    -rwx---r-- 1 mhmadmin mhmadmin 61K Jan 22 14:37 /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png

    real *2m10.150s*
    user 0m0.000s
    sys 0m0.005s

Sometimes the command succeeds but takes a really long time for such a simple operation (this is a 61K file); sometimes I get the Input/output error.

The important thing to mention is that this behavior happens almost all the time. I can quickly reproduce it if asked.

This is a 2-node Gluster setup. Both VMs act as client and server (sorry if I'm not using the correct Gluster naming; I've been getting to know it for a few weeks now).

More info:

    # gluster --version
    glusterfs 3.5.3 built on Nov 18 2014 03:53:25
    Repository revision: git://git.gluster.com/glusterfs.git

    # df -Th
    Filesystem                        Type            Size  Used Avail Use% Mounted on
    /dev/mapper/data_vg-data_lv       ext4           1007G  506G  451G  53% /export/images1-1
    images1.mydomain.com:/site-images fuse.glusterfs 1007G  506G  451G  53% /var/www/site-images

    # uname -a
    Linux web3 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

    # gluster volume info
    Volume Name: site-images
    Type: Replicate
    Volume ID: 68bca3c9-210c-45a9-b2bc-6a0e2ee630bb
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: images1.mydomain.com:/export/images1-1/brick
    Brick2: images2.mydomain.com:/export/images2-1/brick

Would anyone help me identify what is going on here?

Thanks in advance!
--
*Tiago Santos*
MustHaveMenus.com
_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users
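As a rough illustration of the gfid check suggested in this message, the sketch below pulls the trusted.gfid value out of a getfattr dump so that the values from the two servers can be compared. The function names gfid_from_dump and same_gfid are made up for this example and are not part of Gluster. The sketch assumes the dump has the usual "name=0x..." line format produced by getfattr with -d -e hex, and that getfattr is run inside the brick directory rather than on the client mount, as step 1 of the suggestion says.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of Gluster.

# Read `getfattr -m . -d -e hex <file>` output on stdin and print
# only the hex value of the trusted.gfid attribute, if present.
gfid_from_dump() {
    sed -n 's/^trusted\.gfid=//p'
}

# Succeed only when both gfid values are non-empty and identical.
# An empty value means the attribute could not be read at all.
same_gfid() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}
```

On the server holding Brick1 this might be used as "cd /export/images1-1/brick && getfattr -m . -d -e hex templates/assets/prod/temporary/13/user_1339200.png | gfid_from_dump", and likewise under /export/images2-1/brick on the other server. The two printed values can then be passed to same_gfid; a mismatch would point to the gfid problem described in this message.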
Tiago Santos
2015-Jan-26 18:50 UTC
[Gluster-users] Pretty much any operation related to Gluster mounted fs hangs for a while
Thanks for your input, Anirban.

I ran the commands on both servers, with the following results:

    root@web3:/var/www/site-images# time getfattr -m . -d -e hex templates/assets/prod/temporary/13/user_1339200.png

    real 0m34.524s
    user 0m0.004s
    sys 0m0.000s

    root@web4:/var/www/site-images# time getfattr -m . -d -e hex templates/assets/prod/temporary/13/user_1339200.png
    getfattr: templates/assets/prod/temporary/13/user_1339200.png: Input/output error

    real 0m11.315s
    user 0m0.001s
    sys 0m0.003s

    root@web4:/var/www/site-images# ls templates/assets/prod/temporary/13/user_1339200.png
    ls: cannot access templates/assets/prod/temporary/13/user_1339200.png: Input/output error

Not sure if this elucidates the issue.

Also, I saw a zillion entries like these in /var/log/gluster.log:

    [2015-01-26 17:35:39.973268] W [client-rpc-fops.c:2779:client3_3_lookup_cbk] 0-site-images-client-1: remote operation failed: Transport endpoint is not connected. Path: /templates/apache/template/prod/facebook/9616964 (00000000-0000-0000-0000-000000000000)
    [2015-01-26 17:35:39.973435] W [client-rpc-fops.c:2779:client3_3_lookup_cbk] 0-site-images-client-1: remote operation failed: Transport endpoint is not connected. Path: /templates/apache/template/prod/facebook/9594915 (00000000-0000-0000-0000-000000000000)
    [2015-01-26 17:35:39.973571] W [client-rpc-fops.c:2779:client3_3_lookup_cbk] 0-site-images-client-1: remote operation failed: Transport endpoint is not connected. Path: /templates/apache/template/prod/facebook/9681971 (00000000-0000-0000-0000-000000000000)
    [2015-01-26 17:35:39.973686] W [client-rpc-fops.c:2779:client3_3_lookup_cbk] 0-site-images-client-1: remote operation failed: Transport endpoint is not connected. Path: /templates/apache/template/prod/facebook/19615 (00000000-0000-0000-0000-000000000000)
    [2015-01-26 17:35:39.973802] W [client-rpc-fops.c:2779:client3_3_lookup_cbk] 0-site-images-client-1: remote operation failed: Transport endpoint is not connected. Path: /templates/apache/template/prod/facebook/130392 (00000000-0000-0000-0000-000000000000)

I talked with some guys at #gluster who pointed out that it could be a network issue. I'm still looking into it, but since the issue also happens locally (within the same server), would that still be a valid point?

Also, less often, I see entries like these:

    [2015-01-26 17:41:25.956418] E [afr-self-heal-common.c:1615:afr_sh_common_lookup_cbk] 0-site-images-replicate-0: Conflicting entries for /webhost/sites/clipart/assets/apache/images/graphics/215126/image1.png
    [2015-01-26 17:41:26.588753] E [afr-self-heal-common.c:1615:afr_sh_common_lookup_cbk] 0-site-images-replicate-0: Conflicting entries for /webhost/sites/clipart/assets/apache/images/graphics/215126/image1.png

Are those a definitive indication of a split-brain? Or just something usual until self-heal takes care of recently updated files?

On Mon, Jan 26, 2015 at 2:25 PM, A Ghoshal <a.ghoshal at tcs.com> wrote:

> I am plagued with something of this sort, too!
>
> What I mostly see when I explore these things is that
>
> A) it's a split-brain.
> B) the split-brain is because the gfid's on the two replicas are at odds.
>
> You could check that out by
> 1. On each server, first 'cd' to where your brick is mounted.
> 2. getfattr -m . -d -e hex
> templates/assets/prod/temporary/13/user_1339200.png
>
> You will see a trusted.gfid kind of extended attribute. If it's not the
> same on both servers, there's a problem.
>
> Thanks,
> Anirban

Regards,
--
*Tiago Santos*
MustHaveMenus.com
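To see which connection the "Transport endpoint is not connected" warnings refer to, the log can be tallied by the translator name that appears as the fifth whitespace-separated field of each entry. The following is only a sketch: count_disconnects is a made-up helper name, and it assumes each log entry sits on a single line in the format quoted in this message.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; not part of Gluster.

# Count "Transport endpoint is not connected" log entries per
# translator name (field 5, e.g. "0-site-images-client-1:").
# Reads the named log files, or stdin when no file is given.
count_disconnects() {
    awk '/Transport endpoint is not connected/ {
             name = $5
             sub(/:$/, "", name)   # drop the trailing colon
             n[name]++
         }
         END { for (k in n) print n[k], k }' "$@"
}
```

It could be run as "count_disconnects /var/log/gluster.log". In the entries quoted in this message, every such warning names 0-site-images-client-1. If client subvolumes are numbered from zero in brick order (an assumption worth confirming against the volume's client volfile), that would be the connection to Brick2, and a mount on one server would still depend on reaching the brick on the other server.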