Ashish Pandey
2016-Apr-04 10:11 UTC
[Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd
Hi Chen,

As I suspected, there are many blocked calls for inodelk in sm11/mnt-disk1-mainvol.31115.dump.1459760675.

============================================
[xlator.features.locks.mainvol-locks.inode]
path=/home/analyzer/softs/bin/GenomeAnalysisTK.jar
mandatory=0
inodelk-count=4
lock-dump.domain.domain=mainvol-disperse-0:self-heal
lock-dump.domain.domain=mainvol-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=dc2d3dfcc57f0000, client=0x7ff03435d5f0, connection-id=sm12-8063-2016/04/01-07:51:46:892384-mainvol-client-0-0-0, blocked at 2016-04-01 16:52:58, granted at 2016-04-01 16:52:58
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=1414371e1a7f0000, client=0x7ff034204490, connection-id=hw10-17315-2016/04/01-07:51:44:421807-mainvol-client-0-0-0, blocked at 2016-04-01 16:58:51
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=a8eb14cd9b7f0000, client=0x7ff01400dbd0, connection-id=sm14-879-2016/04/01-07:51:56:133106-mainvol-client-0-0-0, blocked at 2016-04-01 17:03:41
inodelk.inodelk[3](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=b41a0482867f0000, client=0x7ff01800e670, connection-id=sm15-30906-2016/04/01-07:51:45:711474-mainvol-client-0-0-0, blocked at 2016-04-01 17:05:09
============================================

This could be the cause of the hang.

Possible workaround: if there is no IO going on for this volume, we can restart the volume with "gluster v start <volume-name> force". This will restart the nfs process too, which will release the locks, and we should come out of this issue.

Ashish

----- Original Message -----
From: "Chen Chen" <chenchen at smartquerier.com>
To: "Ashish Pandey" <aspandey at redhat.com>
Cc: gluster-users at gluster.org
Sent: Monday, April 4, 2016 2:56:37 PM
Subject: Re: [Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd

Hi Ashish,

Yes, I only uploaded the directory of one node (sm11). All nodes are showing the same kind of errors at more or less the same time.

I'm sending the info for the other 5 nodes. Logs of all bricks (except the "dead" 1x2) are also appended. One of the nodes (sm16) refused to let me ssh into it; volume status said it is still alive and showmount on it is working too.

The node "hw10" works as a pure NFS server and doesn't have any bricks.

The dump file and logs are again in my Dropbox (3.8M)
https://dl.dropboxusercontent.com/u/56671522/statedump.tar.xz

Best wishes,
Chen

On 4/4/2016 4:27 PM, Ashish Pandey wrote:
>
> Hi Chen,
>
> By looking at the log in mnt-disk1-mainvol.log I suspect this hang is because of inode lock contention.
> I think the logs provided are for one brick only.
> To make sure of it, we would require statedumps for all the brick processes and nfs.
>
> For bricks: gluster volume statedump <volname>
> For nfs server: gluster volume statedump <volname> nfs
>
> The directory where statedump files are created can be found with the 'gluster --print-statedumpdir' command.
> If not present, create this directory.
>
> Logs for all the bricks are also required.
> You should try to restart the volume, which could resolve this hang if it is caused by an inode lock.
>
> gluster volume start <volname> force
>
> Ashish
>
> ----- Original Message -----
> From: "Chen Chen" <chenchen at smartquerier.com>
> To: "Ashish Pandey" <aspandey at redhat.com>
> Cc: gluster-users at gluster.org
> Sent: Sunday, April 3, 2016 2:13:22 PM
> Subject: Re: [Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd
>
> Hi Ashish Pandey,
>
> After some investigation I updated the servers from 3.7.6 to 3.7.9. I also switched from native fuse to NFS mounts (which boosted performance a lot when I tested) on April 1st.
>
> Then after two days of running, the cluster appeared to be locked: "ls" hangs, there is no network usage, and volume profile showed no r/w activity on the bricks. "dmesg" showed the NFS mount went dead within 12 hrs (Apr 2 01:13), but "showmount" and "volume status" said the NFS server is responding and all bricks are alive.
>
> I'm not sure what happened (glustershd.log and nfs.log didn't show anything interesting), so I dumped the whole log folder instead. It was a bit too large (5MB, filled with Errors and Warnings) and my mail was rejected multiple times by the mailing list. I can only attach a snapshot of all logs. You can grab the full version at
> https://dl.dropboxusercontent.com/u/56671522/glusterfs.tar.xz instead.
>
> The volume profile info is also attached. Hope it helps.
>
> Best wishes,
> Chen
>
> On 3/27/2016 2:38 AM, Ashish Pandey wrote:
>> Hi Chen,
>>
>> Could you please send us the following logs -
>> 1 - brick logs - under /var/log/messages/brick/
>> 2 - mount logs
>>
>> Also some information about what kind of IO was happening (read, write, unlink, rename on different mounts) would help us understand this issue in a better way.
>>
>> ---
>> Ashish
>>
>> ----- Original Message -----
>> From: "Chen Chen" <chenchen at smartquerier.com>
>> To: gluster-users at gluster.org
>> Sent: Friday, March 25, 2016 8:59:04 AM
>> Subject: [Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd
>>
>> Hi Everyone,
>>
>> I have a "2 x (4 + 2) = 12" Distributed-Disperse volume. After upgrading to 3.7.8 I noticed the volume is frequently out of service. The glustershd.log is flooded with:
>>
>> [ec-combine.c:866:ec_combine_check] 0-mainvol-disperse-1: Mismatching xdata in answers of 'LOOKUP'
>> [ec-common.c:116:ec_check_status] 0-mainvol-disperse-1: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1E, bad=21)
>> [ec-common.c:71:ec_heal_report] 0-mainvol-disperse-1: Heal failed [Invalid argument]
>> [ec-combine.c:206:ec_iatt_combine] 0-mainvol-disperse-0: Failed to combine iatt (inode: xxx, links: 1-1, uid: 1000-1000, gid: 1000-1000, rdev: 0-0, size: xxx-xxx, mode: 100600-100600)
>>
>> in the normal working state, and sometimes 1000+ lines of:
>>
>> [client-rpc-fops.c:466:client3_3_open_cbk] 0-mainvol-client-7: remote operation failed. Path: <gfid:xxxx> (xxxx) [Too many open files]
>>
>> after which the brick went offline. "top open" showed "Max open fds: 899195".
>>
>> Can anyone suggest what happened, and what I should do? I was trying to deal with a terrible IOPS problem, but things got even worse.
>>
>> Each server has 2 x E5-2630v3 (32 threads/server) and 32GB RAM. Additional info is in the attachments. Many thanks.
>>
>> Sincerely yours,
>> Chen
>>
>

--
Chen Chen
Shanghai SmartQuerier Biotechnology Co., Ltd.
Add: Room 410, 781 Cai Lun Road, China (Shanghai) Pilot Free Trade Zone, Shanghai 201203, P. R. China
Mob: +86 15221885893
Email: chenchen at smartquerier.com
Web: www.smartquerier.com
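[Editor's note] For readers following along, the statedump-and-restart workflow Ashish describes condenses to roughly the following sketch. The volume name "mainvol" is taken from this thread; /var/run/gluster is only the usual default dump location, so trust whatever --print-statedumpdir reports on your install.

# Find where statedump files will be written (create the directory if missing)
gluster --print-statedumpdir

# Dump the state of all brick processes, then of the Gluster NFS server
gluster volume statedump mainvol
gluster volume statedump mainvol nfs

# Look for blocked inodelk entries in the fresh dumps
grep -A2 'inodelk-count' /var/run/gluster/*.dump.*

# Only if no IO is in flight: force-restart the volume to drop stale locks
gluster volume start mainvol force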
Chen Chen
2016-Apr-04 11:44 UTC
[Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd
Hi Ashish,

There was heavy IO load on the cluster when it got locked down. I fear the processes waiting for IO will all crash.

Furthermore, both "start force" and "stop" told me "Error : Request timed out". I'm not sure if that was caused by the semi-dead node. I'll hard-reset that node tomorrow and see if it helps.

Besides, what caused the lock and how can I avoid it? Any advice is appreciated.

Best wishes,
Chen

On 4/4/2016 6:11 PM, Ashish Pandey wrote:
> Hi Chen,
>
> As I suspected, there are many blocked calls for inodelk in sm11/mnt-disk1-mainvol.31115.dump.1459760675.
>
> This could be the cause of the hang.
> Possible workaround: if there is no IO going on for this volume, we can restart the volume with "gluster v start <volume-name> force". This will restart the nfs process too, which will release the locks, and we should come out of this issue.
>
> Ashish

--
Chen Chen
Shanghai SmartQuerier Biotechnology Co., Ltd.
Add: Room 410, 781 Cai Lun Road, China (Shanghai) Pilot Free Trade Zone, Shanghai 201203, P. R. China
Mob: +86 15221885893
Email: chenchen at smartquerier.com
Web: www.smartquerier.com
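[Editor's note] If a forced restart keeps timing out, one less disruptive avenue is clearing the stale inode locks directly. The sketch below assumes the clear-locks command behaves as documented in this 3.7.x CLI and uses the stuck path from the earlier statedump purely as an example.

# Re-confirm the blocked path and lock kind from a fresh statedump first
gluster volume statedump mainvol

# Clear blocked inode locks on the stuck file; the 0,0-0 range mirrors the
# start=0, len=0 (whole-file) locks shown in the dump
gluster volume clear-locks mainvol /home/analyzer/softs/bin/GenomeAnalysisTK.jar kind blocked inode 0,0-0

# Take another statedump to verify the BLOCKED entries are gone
gluster volume statedump mainvol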
Chen Chen
2016-Apr-07 03:21 UTC
[Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd
Hi Ashish,

I experienced another inode lock today:

===============================================================
[xlator.features.locks.mainvol-locks.inode]
path=/home/analyzer/workdir/NTD/bam/A1649.bam
mandatory=0
inodelk-count=3
lock-dump.domain.domain=mainvol-disperse-0:self-heal
lock-dump.domain.domain=mainvol-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=547a5f1e967f0000, client=0x7f03ac0037c0, connection-id=hw10-27049-2016/04/05-06:37:22:818320-mainvol-client-0-0-0, granted at 2016-04-06 09:11:14
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=b046611e967f0000, client=0x7f03ac0037c0, connection-id=hw10-27049-2016/04/05-06:37:22:818320-mainvol-client-0-0-0, blocked at 2016-04-06 09:11:14
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=5433d109967f0000, client=0x7f03ac0037c0, connection-id=hw10-27049-2016/04/05-06:37:22:818320-mainvol-client-0-0-0, blocked at 2016-04-06 09:11:34
--
[xlator.features.locks.mainvol-locks.inode]
path=/home/analyzer/workdir/NTD/bam/A1588.bam
mandatory=0
inodelk-count=3
lock-dump.domain.domain=mainvol-disperse-0:self-heal
lock-dump.domain.domain=mainvol-disperse-0
inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=b8185e1e967f0000, client=0x7f03ac0037c0, connection-id=hw10-27049-2016/04/05-06:37:22:818320-mainvol-client-0-0-0, granted at 2016-04-06 09:11:14
inodelk.inodelk[1](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 18446744073709551610, owner=d44e5b1e967f0000, client=0x7f03ac0037c0, connection-id=hw10-27049-2016/04/05-06:37:22:818320-mainvol-client-0-0-0, blocked at 2016-04-06 09:11:14
inodelk.inodelk[2](BLOCKED)=type=WRITE, whence=0, start=0, len=0, pid = 1, owner=3c15d109967f0000, client=0x7f03ac0037c0, connection-id=hw10-27049-2016/04/05-06:37:22:818320-mainvol-client-0-0-0, blocked at 2016-04-06 09:11:34
===============================================================

Could it be caused by the increased "nfs.outstanding-rpc-limit"? The volume info is attached. I'll decrease it back to the default of 16 to test.

Best wishes,
Chen

On 4/4/2016 6:11 PM, Ashish Pandey wrote:
> Hi Chen,
>
> As I suspected, there are many blocked calls for inodelk in sm11/mnt-disk1-mainvol.31115.dump.1459760675.
>
> This could be the cause of the hang.
> Possible workaround: if there is no IO going on for this volume, we can restart the volume with "gluster v start <volume-name> force". This will restart the nfs process too, which will release the locks, and we should come out of this issue.
>
> Ashish
--
Chen Chen
Shanghai SmartQuerier Biotechnology Co., Ltd.
Add: Room 410, 781 Cai Lun Road, China (Shanghai) Pilot Free Trade Zone, Shanghai 201203, P. R. China
Mob: +86 15221885893
Email: chenchen at smartquerier.com
Web: www.smartquerier.com
-------------- next part --------------
Volume Name: mainvol
Type: Distributed-Disperse
Volume ID: 2e190c59-9e28-43a5-b22a-24f75e9a580b
Status: Started
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: sm11:/mnt/disk1/mainvol
Brick2: sm12:/mnt/disk1/mainvol
Brick3: sm13:/mnt/disk1/mainvol
Brick4: sm14:/mnt/disk2/mainvol
Brick5: sm15:/mnt/disk2/mainvol
Brick6: sm16:/mnt/disk2/mainvol
Brick7: sm11:/mnt/disk2/mainvol
Brick8: sm12:/mnt/disk2/mainvol
Brick9: sm13:/mnt/disk2/mainvol
Brick10: sm14:/mnt/disk1/mainvol
Brick11: sm15:/mnt/disk1/mainvol
Brick12: sm16:/mnt/disk1/mainvol
Options Reconfigured:
nfs.outstanding-rpc-limit: 64
nfs.rpc-auth-allow: 172.168.135.*,127.0.0.1,::1
network.remote-dio: false
performance.io-cache: true
performance.cache-size: 16GB
auth.allow: 172.16.135.*
performance.readdir-ahead: on
client.event-threads: 8
server.event-threads: 8
performance.io-thread-count: 32
performance.write-behind-window-size: 4MB
nfs.disable: false
diagnostics.client-log-level: WARNING
diagnostics.brick-log-level: WARNING
cluster.lookup-optimize: on
cluster.readdir-optimize: on
server.outstanding-rpc-limit: 256
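[Editor's note] Reverting the option Chen suspects is a per-volume change; a minimal sketch, using the option name from the attached volume info and assuming "gluster volume get" is available in this 3.7.x build:

# Show the current override (the attachment has it raised to 64)
gluster volume get mainvol nfs.outstanding-rpc-limit

# Drop it back to the default of 16 for the test Chen describes
gluster volume set mainvol nfs.outstanding-rpc-limit 16

# Or remove the override entirely and fall back to the built-in default
gluster volume reset mainvol nfs.outstanding-rpc-limit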
Chen Chen
2016-Apr-13 09:56 UTC
[Gluster-users] Need some help on Mismatching xdata / Failed combine iatt / Too many fd
Hi Ashish and other Gluster Users,

When I put a heavy IO load onto my cluster (an rsync operation, ~600MB/s), one of the nodes instantly got inode-locked and took down the whole cluster. I've already turned on "features.lock-heal" but it didn't help.

My clients use a round-robin tactic to mount the servers, hoping to spread the pressure evenly. Could the lock-up be caused by a race between the NFS servers on different nodes? Should I instead set up a dedicated NFS server with lots of memory, no bricks, and multiple Ethernet links?

I really appreciate any help from you guys.

Best wishes,
Chen

PS. I don't know why the native fuse client is 5 times slower than good old NFSv3.

On 4/4/2016 6:11 PM, Ashish Pandey wrote:
> Hi Chen,
>
> As I suspected, there are many blocked calls for inodelk in sm11/mnt-disk1-mainvol.31115.dump.1459760675.
>
> This could be the cause of the hang.
> Possible workaround: if there is no IO going on for this volume, we can restart the volume with "gluster v start <volume-name> force". This will restart the nfs process too, which will release the locks, and we should come out of this issue.
>
> Ashish

--
Chen Chen
Shanghai SmartQuerier Biotechnology Co., Ltd.
Add: 3F, 1278 Keyuan Road, Shanghai 201203, P. R. China
Mob: +86 15221885893
Email: chenchen at smartquerier.com
Web: www.smartquerier.com
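[Editor's note] For anyone comparing the two access paths discussed above, the mounts differ only in the client stack. A minimal sketch, assuming the server names from this thread and an arbitrary /mnt/mainvol mount point; Gluster's built-in NFS server speaks NFSv3 over TCP only, hence the pinned options:

# Native fuse mount: the client talks to every brick directly
mount -t glusterfs sm11:/mainvol /mnt/mainvol

# Gluster NFS mount: all IO funnels through the one NFS server that is mounted
mount -t nfs -o vers=3,proto=tcp sm11:/mainvol /mnt/mainvol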