Xavier Hernandez
2017-Jan-20 07:41 UTC
[Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
Hi Ram,

On 20/01/17 08:02, Ankireddypalle Reddy wrote:
> Ashish,
>
> Thanks for looking into the issue. In the given example the size/version matches for the file on the glusterfs4 and glusterfs5 nodes. The file is empty on glusterfs6. Now what happens if glusterfs5 goes down? Though the SLA factor of 2 is met, I will still not be able to access the data.

True, but having a brick with inconsistent data is the same as having it down. You would have lost 2 bricks out of 3. The problem is how to detect the inconsistent data, what is causing it, and why self-heal (apparently) is not healing it.

> The problem is that writes did not fail for the file to indicate the issue to the application.

That's the expected behavior. Since we have redundancy 1, a loss or failure of a single fragment is hidden from the application. However, this triggers internal procedures to repair the problem in the background. This also seems to not be working.

There are two issues to identify here:

1. Why the write fails in the first place
2. Why self-heal is unable to repair it

Probably the root cause is the same for both problems, but I'm not sure.

For 1 there should be some warning or error in the mount log, since one of the bricks is reporting an error, even if ec is able to report success to the application later.

For 2 the analysis could be more complex, but most probably there should be some warning or error message in the mount log and/or the self-heal log of one of the servers.

Xavi
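A rough sketch of what that detection could look like from the command line (volume name taken from the output quoted below; substitute the suspect file's path):

  # on any server: does self-heal even have the file queued?
  gluster volume heal glusterfsProd info

  # on each server: dump only the ec xattrs of the file on its brick
  getfattr -d -e hex -m trusted.ec. /ws/disk1/ws_brick/<path to file>

If the xattrs disagree across bricks but heal info reports nothing pending, that by itself points at issue 2.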
> It's not that we are encountering the issue for every file on the mount point. The issue happens randomly for different files.
>
> [root@glusterfs4 glusterfs]# gluster volume info
>
> Volume Name: glusterfsProd
> Type: Distributed-Disperse
> Volume ID: 622aa3ee-958f-485f-b3c1-fb0f6c8db34c
> Status: Started
> Number of Bricks: 12 x (2 + 1) = 36
> Transport-type: tcp
> Bricks:
> Brick1: glusterfs4sds:/ws/disk1/ws_brick
> Brick2: glusterfs5sds:/ws/disk1/ws_brick
> Brick3: glusterfs6sds:/ws/disk1/ws_brick
> Brick4: glusterfs4sds:/ws/disk10/ws_brick
> Brick5: glusterfs5sds:/ws/disk10/ws_brick
> Brick6: glusterfs6sds:/ws/disk10/ws_brick
> Brick7: glusterfs4sds:/ws/disk11/ws_brick
> Brick8: glusterfs5sds:/ws/disk11/ws_brick
> Brick9: glusterfs6sds:/ws/disk11/ws_brick
> Brick10: glusterfs4sds:/ws/disk2/ws_brick
> Brick11: glusterfs5sds:/ws/disk2/ws_brick
> Brick12: glusterfs6sds:/ws/disk2/ws_brick
> Brick13: glusterfs4sds:/ws/disk3/ws_brick
> Brick14: glusterfs5sds:/ws/disk3/ws_brick
> Brick15: glusterfs6sds:/ws/disk3/ws_brick
> Brick16: glusterfs4sds:/ws/disk4/ws_brick
> Brick17: glusterfs5sds:/ws/disk4/ws_brick
> Brick18: glusterfs6sds:/ws/disk4/ws_brick
> Brick19: glusterfs4sds:/ws/disk5/ws_brick
> Brick20: glusterfs5sds:/ws/disk5/ws_brick
> Brick21: glusterfs6sds:/ws/disk5/ws_brick
> Brick22: glusterfs4sds:/ws/disk6/ws_brick
> Brick23: glusterfs5sds:/ws/disk6/ws_brick
> Brick24: glusterfs6sds:/ws/disk6/ws_brick
> Brick25: glusterfs4sds:/ws/disk7/ws_brick
> Brick26: glusterfs5sds:/ws/disk7/ws_brick
> Brick27: glusterfs6sds:/ws/disk7/ws_brick
> Brick28: glusterfs4sds:/ws/disk8/ws_brick
> Brick29: glusterfs5sds:/ws/disk8/ws_brick
> Brick30: glusterfs6sds:/ws/disk8/ws_brick
> Brick31: glusterfs4sds:/ws/disk9/ws_brick
> Brick32: glusterfs5sds:/ws/disk9/ws_brick
> Brick33: glusterfs6sds:/ws/disk9/ws_brick
> Brick34: glusterfs4sds:/ws/disk12/ws_brick
> Brick35: glusterfs5sds:/ws/disk12/ws_brick
> Brick36: glusterfs6sds:/ws/disk12/ws_brick
> Options Reconfigured:
> storage.build-pgfid: on
> performance.readdir-ahead: on
> nfs.export-dirs: off
> nfs.export-volumes: off
> nfs.disable: on
> auth.allow: glusterfs4sds,glusterfs5sds,glusterfs6sds
> diagnostics.client-log-level: INFO
> [root@glusterfs4 glusterfs]#
>
> Thanks and Regards,
> Ram
>
> From: Ashish Pandey [mailto:aspandey@redhat.com]
> Sent: Thursday, January 19, 2017 10:36 PM
> To: Ankireddypalle Reddy
> Cc: Xavier Hernandez; gluster-users@gluster.org; Gluster Devel (gluster-devel@gluster.org)
> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>
> Ram,
>
> I don't understand what you mean by saying "redundancy factor of 2 is met in a 3:1 disperse volume". You have given the xattrs of only 3 bricks. The two sentences above and the getxattr output contradict each other.
>
> In the given scenario, if you have a (2+1) ec configuration and 2 bricks have the same size and version, then there should not be any problem accessing this file. Run heal and the 3rd fragment will also be healthy.
>
> I think there has been a major gap in providing complete and correct information about the volume and all the logs and activities. Could you please provide the following -
>
> 1 - gluster v info - please give us the output of this command
> 2 - Let's consider only one file which you are not able to access and find out the reason.
> 3 - Try to create and write some files on the mount point and see if there is any issue with new file creation; if yes, why? Provide logs.
>
> Specific but enough information is required to find the root cause of the issue.
>
> Ashish
>
> ------------------------------------------------------------------------
>
> From: "Ankireddypalle Reddy" <areddy@commvault.com>
> To: "Ankireddypalle Reddy" <areddy@commvault.com>, "Xavier Hernandez" <xhernandez@datalab.es>
> Cc: gluster-users@gluster.org, "Gluster Devel (gluster-devel@gluster.org)" <gluster-devel@gluster.org>
> Sent: Friday, January 20, 2017 3:20:46 AM
> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>
> Xavi,
> Finally I am able to locate the files for which we are seeing the mismatch.
>
> [2017-01-19 21:20:10.002737] W [MSGID: 122056] [ec-combine.c:875:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP' for f5e61224-17f6-4803-9f75-dd74cd79e248
> [2017-01-19 21:20:10.002776] W [MSGID: 122006] [ec-combine.c:207:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to combine iatt (inode: 11490333518038950472-11490333518038950472, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 4800512-4734976, mode: 100775-100775) for f5e61224-17f6-4803-9f75-dd74cd79e248
>
> GFID f5e61224-17f6-4803-9f75-dd74cd79e248 maps to the file /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
>
> The size field differs between the bricks of this sub volume.
>
> [root@glusterfs4 glusterfs]# getfattr -d -e hex -m . /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
> trusted.ec.size=0x00000000000e0000
> trusted.ec.version=0x00000000000000080000000000000008
>
> [root@glusterfs5 ws]# getfattr -d -e hex -m . /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
> trusted.ec.size=0x00000000000e0000
> trusted.ec.version=0x00000000000000080000000000000008
>
> [root@glusterfs6 glusterfs]# getfattr -d -e hex -m . /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
> getfattr: Removing leading '/' from absolute path names
> trusted.ec.size=0x0000000000000000
> trusted.ec.version=0x00000000000000000000000000000008
>
> On glusterfs6 writes seem to have failed and the file appears to be an empty file.
> [root@glusterfs6 glusterfs]# stat /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
>   File: '/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158'
>   Size: 0          Blocks: 0          IO Block: 4096   regular empty file
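For anyone decoding these hex xattrs: assuming trusted.ec.version packs the data version in the first 8 bytes and the metadata version in the last 8 (which matches how the versions are read further down this thread), plain printf translates them:

  printf '%d\n' 0x00000000000e0000   # trusted.ec.size on glusterfs4/5 -> 917504
  # glusterfs6: data version 0, metadata version 8, i.e. no data write ever landed there
  printf 'data=%d meta=%d\n' 0x0000000000000000 0x0000000000000008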
> This looks like a major issue. Since we are left with no redundancy for these files, if the brick on the other node fails then we will not be able to access this data, though the redundancy factor of 2 is met in a 3:1 disperse volume.
>
> Thanks and Regards,
> Ram
>
> -----Original Message-----
> From: gluster-users-bounces@gluster.org [mailto:gluster-users-bounces@gluster.org] On Behalf Of Ankireddypalle Reddy
> Sent: Monday, January 16, 2017 9:41 AM
> To: Xavier Hernandez
> Cc: gluster-users@gluster.org; Gluster Devel (gluster-devel@gluster.org)
> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>
> Xavi,
> Thanks. I will start by tracking the delta, if any, in the return codes from the bricks for writes in EC. I have a feeling that if we could get to the bottom of this issue then the EIO errors could ultimately be avoided. Please note that while we are testing all this on a standby setup, we continue to encounter the EIO errors on our production setup, which is running 3.7.18.
>
> Thanks and Regards,
> Ram
>
> -----Original Message-----
> From: Xavier Hernandez [mailto:xhernandez@datalab.es]
> Sent: Monday, January 16, 2017 6:54 AM
> To: Ankireddypalle Reddy
> Cc: gluster-users@gluster.org; Gluster Devel (gluster-devel@gluster.org)
> Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume
>
> Hi Ram,
>
> On 16/01/17 12:33, Ankireddypalle Reddy wrote:
>> Xavi,
>> Thanks. Is there any other way to map from GFID to path?
>
> The only way I know is to search all files from the bricks and look up the trusted.gfid xattr.
>
>> I will look for a way to share the TRACE logs. An easier way might be to add some extra logging. I could do that if you could let me know the functions in which you are interested.
>
> The problem is that I don't know where the problem is. One possibility could be to track all return values from all bricks for all writes and then identify which ones belong to an inconsistent file.
>
> But if this doesn't reveal anything interesting we'll need to look at some other place. And this can be very tedious and slow.
>
> Anyway, what we are looking at now is not the source of an EIO, since there are two bricks with a consistent state and the file should be perfectly readable and writable. It's true that there's some problem here and it could derive in EIO if one of the healthy bricks degrades, but at least this file shouldn't be giving EIO errors for now.
>
> Xavi
>
>> Sent from my iPhone
>>
>>> On Jan 16, 2017, at 6:23 AM, Xavier Hernandez <xhernandez@datalab.es> wrote:
>>>
>>> Hi Ram,
>>>
>>>> On 13/01/17 18:41, Ankireddypalle Reddy wrote:
>>>> Xavi,
>>>> I enabled TRACE logging. The log grew up to 120GB and I could not make much out of it. Then I started logging the GFID in the code where we were seeing errors.
>>>>
>>>> [2017-01-13 17:02:01.761349] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bc690 ((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
>>>> [2017-01-13 17:02:01.761360] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 ((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
>>>> [2017-01-13 17:02:01.761365] W [MSGID: 122056] [ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP'
>>>> [2017-01-13 17:02:01.761405] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
>>>> [2017-01-13 17:02:01.761417] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bbb14 ((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
>>>> [2017-01-13 17:02:01.761428] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 ((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
>>>> [2017-01-13 17:02:01.761433] W [MSGID: 122056] [ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP'
>>>> [2017-01-13 17:02:01.761442] W [MSGID: 122006] [ec-combine.c:214:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to combine iatt (inode: 11275691004192850514-11275691004192850514, gfid: 60b990ed-d741-4176-9c7b-4d3a25fb8252 - 60b990ed-d741-4176-9c7b-4d3a25fb8252, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 406650880-406683648, mode: 100775-100775)
>>>>
>>>> The file for which we are seeing this error turns out to have a GFID of 60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>>
>>>> Then I tried to find the file with this GFID. It pointed me to the following path. I was expecting a real file system path, based on the following tutorial:
>>>> https://gluster.readthedocs.io/en/latest/Troubleshooting/gfid-to-path/
>>>
>>> I think this method only works if bricks have the inode cached.
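When that cache misses, a brute-force fallback is possible because, for regular files, the .glusterfs entry is a hard link to the real file. A sketch using the brick path from this thread (slow, since it scans the whole brick):

  find /ws/disk1/ws_brick \
      -samefile /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 \
      -not -path '*/.glusterfs/*'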
>>>> getfattr -n trusted.glusterfs.pathinfo -e text /mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> trusted.glusterfs.pathinfo="(<DISTRIBUTE:glusterfsProd-dht> (<EC:glusterfsProd-disperse-0> <POSIX(/ws/disk1/ws_brick):glusterfs6:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252> <POSIX(/ws/disk1/ws_brick):glusterfs5:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>))"
>>>>
>>>> Then I looked at the xattrs for these files on all 3 bricks:
>>>>
>>>> [root@glusterfs4 glusterfs]# getfattr -d -m . -e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> trusted.bit-rot.version=0x02000000000000005877a8dc00041138
>>>> trusted.ec.config=0x0000080301000200
>>>> trusted.ec.size=0x0000000000000000
>>>> trusted.ec.version=0x00000000000000000000000000002a38
>>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252
>>>>
>>>> [root@glusterfs5 bricks]# getfattr -d -m . -e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> trusted.bit-rot.version=0x02000000000000005877a8dc000c92d0
>>>> trusted.ec.config=0x0000080301000200
>>>> trusted.ec.dirty=0x00000000000000160000000000000000
>>>> trusted.ec.size=0x00000000306b0000
>>>> trusted.ec.version=0x0000000000002a380000000000002a38
>>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252
>>>>
>>>> [root@glusterfs6 ee]# getfattr -d -m . -e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> trusted.bit-rot.version=0x02000000000000005877a8dc000c9436
>>>> trusted.ec.config=0x0000080301000200
>>>> trusted.ec.dirty=0x00000000000000160000000000000000
>>>> trusted.ec.size=0x00000000306b0000
>>>> trusted.ec.version=0x0000000000002a380000000000002a38
>>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252
>>>>
>>>> It turns out that the size and version in fact do not match for one of the files.
>>>
>>> It seems as if the brick on glusterfs4 didn't receive any write requests (or they failed for some reason). Do you still have the trace log? Is there any way I could download it?
>>>
>>> Xavi
>>>
>>>> Thanks and Regards,
>>>> Ram
>>>>
>>>> -----Original Message-----
>>>> From: gluster-devel-bounces@gluster.org [mailto:gluster-devel-bounces@gluster.org] On Behalf Of Ankireddypalle Reddy
>>>> Sent: Friday, January 13, 2017 4:17 AM
>>>> To: Xavier Hernandez
>>>> Cc: gluster-users@gluster.org; Gluster Devel (gluster-devel@gluster.org)
>>>> Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume
>>>>
>>>> Xavi,
>>>> Thanks for the explanation. Will collect TRACE logs today.
>>>>
>>>> Thanks and Regards,
>>>> Ram
>>>>
>>>> Sent from my iPhone
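For reference, the TRACE level discussed throughout this thread is just the client log-level volume option; roughly, with the volume name from this setup (expect the log to grow very fast, as noted above):

  gluster volume set glusterfsProd diagnostics.client-log-level TRACE
  # reproduce the problem, then revert:
  gluster volume set glusterfsProd diagnostics.client-log-level INFO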
>>>>> On Jan 13, 2017, at 3:03 AM, Xavier Hernandez <xhernandez@datalab.es> wrote:
>>>>>
>>>>> Hi Ram,
>>>>>
>>>>>> On 12/01/17 22:14, Ankireddypalle Reddy wrote:
>>>>>> Xavi,
>>>>>> I changed the logging to log the individual bytes. Consider the following from the ws-glus.log file, where /ws/glus is the mount point.
>>>>>>
>>>>>> [2017-01-12 20:47:59.368102] I [MSGID: 109063] [dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found anomalies in /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid e694387f-dde7-410b-9562-914a994d5e85). Holes=1 overlaps=0
>>>>>> [2017-01-12 20:47:59.391218] I [MSGID: 109036] [dht-common.c:9082:dht_log_new_layout_for_dir_selfheal] 0-glusterfsProd-dht: Setting layout of /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 with [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 , Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 3221225469 , Stop: 3579139409 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: -1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: 357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: -1 , Start: 715827882 , Stop: 1073741822 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 , Stop: 2505397586 , Hash: 1 ],
>>>>>>
>>>>>> Self-heal seems to be triggered for path /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies as per DHT. It would be great if someone could explain what the anomaly could be here. The setup where we encountered this is a fairly stable setup with no brick or node failures.
>>>>>
>>>>> This is not really a self-heal, at least from the point of view of ec. This means that DHT has found a discrepancy in the layout of that directory; however, this doesn't mean any problem (notice the 'I' in the log, meaning that it's informative, not a warning nor an error).
>>>>>
>>>>> Not sure how DHT works in this case or why it finds this "anomaly", but if there aren't any previous errors before that message, it can be completely ignored.
>>>>>
>>>>> Not sure if it can be related to the option cluster.weighted-rebalance, which is enabled by default.
>>>>>
>>>>>> Then self-heal seems to have encountered the following error.
>>>>>>
>>>>>> [2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
>>>>>> [2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b649520ac ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
>>>>>> [2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0 ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
>>>>>> [2017-01-12 20:48:23.418531] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-2: Mismatching xdata in answers of 'LOOKUP'
>>>>>
>>>>> That's a real problem. Here we have two bricks that differ in the trusted.ec.version xattr. However, this xattr does not necessarily belong to the previous directory. They are unrelated messages.
>>>>>
>>>>>> In this case the glusterfsProd-disperse-2 sub volume actually consists of the following bricks: glusterfs4sds:/ws/disk11/ws_brick, glusterfs5sds:/ws/disk11/ws_brick, glusterfs6sds:/ws/disk11/ws_brick
>>>>>>
>>>>>> I went ahead and checked the value of trusted.ec.version on all 3 bricks inside this sub vol:
>>>>>>
>>>>>> [root@glusterfs6 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>>>>>> trusted.ec.version=0x0000000000000009000000000000000b
>>>>>>
>>>>>> [root@glusterfs4 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>>>>>> trusted.ec.version=0x0000000000000009000000000000000b
>>>>>>
>>>>>> [root@glusterfs5 glusterfs]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>>>>>> trusted.ec.version=0x0000000000000009000000000000000b
>>>>>>
>>>>>> The attribute value seems to be the same on all 3 bricks.
>>>>>
>>>>> That's a clear indication that the ec warning is not related to this directory, because trusted.ec.version always increases, never decreases, and the directory has a value smaller than the one that appears in the log message.
>>>>>
>>>>> If you show all dict entries in the log, it seems that it does refer to a directory, because trusted.ec.size is not present, but it must be another directory than the one you looked at. We would need to find which one is having this issue. The TRACE log would be helpful here.
>>>>>
>>>>>> Also please note that every single time the trusted.ec.version was found to mismatch, the same values are getting logged. Following are 2 more instances of the trusted.ec.version mismatch.
>>>>>>
>>>>>> [2017-01-12 20:14:25.554540] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
>>>>>> [2017-01-12 20:14:25.554588] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495a9f0 ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
>>>>>> [2017-01-12 20:14:25.554608] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495903c ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
>>>>>> [2017-01-12 20:14:25.554624] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-2: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
>>>>>> [2017-01-12 20:14:25.554632] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-2: Heal failed [Invalid argument]
>>>>>> [2017-01-12 20:14:25.555598] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
>>>>>> [2017-01-12 20:14:25.555622] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b64956c24 ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
>>>>>> [2017-01-12 20:14:25.555638] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b64964e8c ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
>>>>>
>>>>> I think that this refers to the same directory. This seems to be an attempt to heal it that has failed, so it makes sense that it finds exactly the same values.
>>>>>
>>>>>> In glustershd.log a lot of similar errors are logged.
>>>>>>
>>>>>> [2017-01-12 21:10:53.728770] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>> [2017-01-12 21:10:53.728804] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7f21694b6f50 ((trusted.ec.size:0:0:0:0:42:3a:0:0:)(trusted.ec.version:0:0:0:0:0:0:37:5f:0:0:0:0:0:0:37:5f:))
>>>>>> [2017-01-12 21:10:53.728827] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7f21694b62bc ((trusted.ec.size:0:0:0:0:0:ca:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:a1:0:0:0:0:0:0:37:5f:))
>>>>>> [2017-01-12 21:10:53.728842] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP'
>>>>>> [2017-01-12 21:10:53.728854] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-0: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>>> [2017-01-12 21:10:53.728876] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-0: Heal failed [Invalid argument]
>>>>>
>>>>> This seems to be an attempt to heal a file, but I see a lot of differences between both versions. The size on one brick is 13.238.272 bytes, but on the other brick it's 1.111.097.344 bytes. That's a huge difference.
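Those figures decode directly from the dict dumps above; a quick shell sanity check, with the hex taken from the two trusted.ec.size and trusted.ec.version values:

  printf '%d\n' 0x423a0000       # first dict's trusted.ec.size  -> 1111097344
  printf '%d\n' 0x00ca0000       # second dict's trusted.ec.size -> 13238272
  printf '%d %d\n' 0x375f 0xa1   # data versions -> 14175 and 161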
>>>>> Looking at the trusted.ec.version, I see that the 'data' version is very different (from 161 to 14.175), while the metadata version is exactly the same. This really looks like a lot of writes happened while one brick was down (or disconnected for some reason, or the writes failed for some reason). One brick has lost about 14.000 writes of ~80KB.
>>>>>
>>>>> I think the most important thing right now would be to identify which files and directories are having these problems, to be able to identify the cause. Again, the TRACE log will be really useful.
>>>>>
>>>>> Xavi
>>>>>
>>>>>> Thanks and Regards,
>>>>>> Ram
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Xavier Hernandez [mailto:xhernandez@datalab.es]
>>>>>> Sent: Thursday, January 12, 2017 6:40 AM
>>>>>> To: Ankireddypalle Reddy
>>>>>> Cc: Gluster Devel (gluster-devel@gluster.org); gluster-users@gluster.org
>>>>>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>
>>>>>> Hi Ram,
>>>>>>
>>>>>>> On 12/01/17 11:49, Ankireddypalle Reddy wrote:
>>>>>>> Xavi,
>>>>>>> As I mentioned before, the error could happen for any FOP. Will try to run with the TRACE debug level. Is there a possibility that we are checking for this attribute on a directory? Because a directory does not seem to have this attribute set.
>>>>>>
>>>>>> No, directories do not have this attribute and no one should be reading it from a directory.
>>>>>>
>>>>>>> Also, is the function that checks size and version called after it is decided that a heal should be run, or is this check the one which decides whether a heal should be run?
>>>>>>
>>>>>> Almost all checks that trigger a heal are done in the lookup fop when some discrepancy is detected.
>>>>>>
>>>>>> The function that checks size and version is called later, once a lock on the inode is acquired (even if no heal is needed). However, further failures in the processing of any fop can also trigger a self-heal.
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Ram
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Jan 12, 2017, at 2:25 AM, Xavier Hernandez <xhernandez@datalab.es> wrote:
>>>>>>>>
>>>>>>>> Hi Ram,
>>>>>>>>
>>>>>>>>> On 12/01/17 02:36, Ankireddypalle Reddy wrote:
>>>>>>>>> Xavi,
>>>>>>>>> I added some more logging information. The trusted.ec.size field values are in fact different.
>>>>>>>>> trusted.ec.size l1 = 62719407423488 l2 = 0
>>>>>>>>
>>>>>>>> That's very weird. Directories do not have this attribute. It's only present on regular files. But you said that the error happens while creating the file, so it doesn't make much sense, because file creation always sets trusted.ec.size to 0.
>>>>>>>>
>>>>>>>> Could you reproduce the problem with diagnostics.client-log-level set to TRACE and send the log to me? It will create a big log, but I'll have much more information about what's going on.
>>>>>>>>
>>>>>>>> Do you have a mixed setup with nodes of different types? For example mixed 32/64-bit architectures or different operating systems? I ask this because 62719407423488 in hex is 0x390B00000000, which has the lower 32 bits set to 0, but has garbage above that.
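That arithmetic is easy to verify from a shell:

  printf '0x%x\n' 62719407423488   # -> 0x390b00000000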
>>>>>>>>> This is a fairly static setup with no brick or node failures. Please explain why a heal is being triggered and what could have actually caused these size xattrs to differ. This is causing random I/O failures and is impacting the backup schedules.
>>>>>>>>
>>>>>>>> The launch of self-heal is normal because it has detected an inconsistency. The real problem is what originates that inconsistency.
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>>> [2017-01-12 01:19:18.256970] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-12 01:19:18.257015] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
>>>>>>>>> [2017-01-12 01:19:18.257018] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
>>>>>>>>> [2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-12 01:19:21.002056] E [dict.c:166:log_value] 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 i2 = 0 ]
>>>>>>>>> [2017-01-12 01:19:21.002064] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-12 01:19:21.209673] E [dict.c:166:log_value] 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 i2 = 0 ]
>>>>>>>>> [2017-01-12 01:19:21.209686] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-12 01:19:21.209719] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>>>>>> [2017-01-12 01:19:21.209753] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-4: Heal failed [Invalid argument]
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Ram
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Ankireddypalle Reddy
>>>>>>>>> Sent: Wednesday, January 11, 2017 9:29 AM
>>>>>>>>> To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel (gluster-devel@gluster.org); gluster-users@gluster.org
>>>>>>>>> Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>
>>>>>>>>> Xavi,
>>>>>>>>> I built a debug binary to log more information. This is what is getting logged. It looks like it is the attribute trusted.ec.size which differs among the bricks in a sub volume.
>>>>>>>>>
>>>>>>>>> In glustershd.log:
>>>>>>>>>
>>>>>>>>> [2017-01-11 14:19:45.023845] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:19:45.027736] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.027763] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:19:45.027781] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.027793] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>>>>>> [2017-01-11 14:19:45.027815] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
>>>>>>>>> [2017-01-11 14:19:45.029035] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:19:45.029057] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.029089] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:19:45.029105] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.029121] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>>>>>> [2017-01-11 14:19:45.032566] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:19:45.029138] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
>>>>>>>>> [2017-01-11 14:19:45.032585] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.032614] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:19:45.032631] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.032638] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>>>>>> [2017-01-11 14:19:45.032654] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
>>>>>>>>> [2017-01-11 14:19:45.037514] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:19:45.037536] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.037553] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:19:45.037573] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>> [2017-01-11 14:19:45.037582] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>>>>>> [2017-01-11 14:19:45.037599] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:40.001401] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-3: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>> [2017-01-11 14:20:40.001387] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-5: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>>>>>
>>>>>>>>> In the mount daemon log:
>>>>>>>>>
>>>>>>>>> [2017-01-11 14:20:17.806826] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-0: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.806847] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-0: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807076] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-1: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807099] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-1: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807286] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-10: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807298] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-10: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807409] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-11: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807420] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-11: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807448] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-4: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807462] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-4: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807539] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-2: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807550] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-2: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807723] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-3: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807739] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-3: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807785] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-5: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.807796] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-5: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.808020] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-9: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.808034] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-9: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.808054] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-6: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.808066] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-6: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.808282] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-8: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.808292] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-8: Invalid config xattr [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.809212] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-7: Invalid or corrupted config [Invalid argument]
>>>>>>>>> [2017-01-11 14:20:17.809228] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-7: Invalid config xattr [Invalid argument]
>>>>>>>>>
>>>>>>>>> [2017-01-11 14:20:17.812660] I [MSGID: 109036] [dht-common.c:8043:dht_log_new_layout_for_dir_selfheal] 2-glusterfsProd-dht: Setting layout of /Folder_01.05.2017_21.15/CV_MAGNETIC/V_31500/CHUNK_402578 with [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, Err: -1 , Start: 2147483646 , Stop: 2505397586 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 2505397587 , Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: -1 , Start: 3221225469 , Stop: 3579139409 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: -1 , Start: 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , Start: 715827882 , Stop: 1073741822 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-8, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ],
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: gluster-users-bounces@gluster.org [mailto:gluster-users-bounces@gluster.org] On Behalf Of Ankireddypalle Reddy
>>>>>>>>> Sent: Tuesday, January 10, 2017 10:09 AM
>>>>>>>>> To: Xavier Hernandez; Gluster Devel (gluster-devel@gluster.org); gluster-users@gluster.org
>>>>>>>>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>
>>>>>>>>> Xavi,
>>>>>>>>> In this case it's the file creation which failed, so I provided the xattrs of the parent.
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Ram
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez@datalab.es]
>>>>>>>>> Sent: Tuesday, January 10, 2017 9:10 AM
>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org); gluster-users@gluster.org
>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>
>>>>>>>>> Hi Ram,
>>>>>>>>>
>>>>>>>>>> On 10/01/17 14:42, Ankireddypalle Reddy wrote:
>>>>>>>>>> Attachments (2):
>>>>>>>>>> 1. ec.txt (11.50 KB)
>>>>>>>>>> 2. ws-glus.log (3.48 MB)
>>>>>>>>>>
>>>>>>>>>> Xavi,
>>>>>>>>>> We are encountering errors for different kinds of FOPS. The open failed for the following file:
>>>>>>>>>>
>>>>>>>>>> cvd_2017_01_10_02_28_26.log:98182 1f9fe 01/10 00:57:10 8414465 [MEDIAFS ] 20117519-52075477 SingleInstancer_FS::StartDataFile2: Failed to create the data file [/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062], error=0xECCC0005:{CQiFile::Open(92)} + {CQiUTFOSAPI::open(96)/ErrNo.5.(Input/output error)-Open failed, File=/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062, OperationFlag=0xC1, PermissionMode=0x1FF}
>>>>>>>>>>
>>>>>>>>>> I've attached the extended attributes for the directories /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/ and /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720 from all the bricks.
>>>>>>>>>>
>>>>>>>>>> The attributes look fine to me. I've also attached some log cuts to illustrate the problem.
>>>>>>>>>
>>>>>>>>> I need the extended attributes of the file itself, not the parent directories.
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>> Thanks and Regards,
>>>>>>>>>> Ram
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez@datalab.es]
>>>>>>>>>> Sent: Tuesday, January 10, 2017 7:53 AM
>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org); gluster-users@gluster.org
>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>
>>>>>>>>>> Hi Ram,
>>>>>>>>>>
>>>>>>>>>> the error is caused by an extended attribute that does not match on all 3 bricks of the disperse set. The most probable value is trusted.ec.version, but it could be others.
>>>>>>>>>>
>>>>>>>>>> At first sight, I don't see any change from 3.7.8 that could have caused this. I'll check again.
>>>>>>>>>>
>>>>>>>>>> What kind of operations are you doing? This can help me narrow the search.
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>> On 10/01/17 13:43, Ankireddypalle Reddy wrote:
>>>>>>>>>>> Xavi,
>>>>>>>>>>> Thanks. If you could please explain what to look for in the extended attributes, then I will check and let you know if I find anything suspicious. Also, we noticed that some of these operations would succeed if retried. Do you know of any communication-related errors that are being reported/triaged?
>>>>>>>>>>>
>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>> Ram
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez@datalab.es]
>>>>>>>>>>> Sent: Tuesday, January 10, 2017 7:23 AM
>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org); gluster-users@gluster.org
>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>
>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/17 13:14, Ankireddypalle Reddy wrote:
>>>>>>>>>>>> Attachment (1):
>>>>>>>>>>>> 1. ecxattrs.txt (5.92 KB)
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi,
>>>>>>>>>>>> Please find attached the extended attributes for a directory from all the bricks. The free space check failed for this with error number EIO.
>>>>>>>>>>>
>>>>>>>>>>> What do you mean? What operation have you made to check the free space on that directory?
>>>>>>>>>>>
>>>>>>>>>>> If it's a recursive check, I need the extended attributes from the exact file that triggers the EIO. The attached attributes seem consistent and that directory shouldn't cause any problem. Does an 'ls' on that directory fail or does it show the contents?
>>>>>>>>>>>
>>>>>>>>>>> Xavi
>>>>>>>>>>>
>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>> Ram
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez@datalab.es]
>>>>>>>>>>>> Sent: Tuesday, January 10, 2017 6:45 AM
>>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org); gluster-users@gluster.org
>>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>>
>>>>>>>>>>>> can you execute the following command on all bricks on a file that is giving EIO?
>>>>>>>>>>>>
>>>>>>>>>>>> getfattr -m. -e hex -d <path to file in brick>
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/17 12:41, Ankireddypalle Reddy wrote:
>>>>>>>>>>>>> Xavi,
>>>>>>>>>>>>> We have been running 3.7.8 on these servers. We upgraded to 3.7.18 yesterday. We upgraded all the servers at the same time. The volume was brought down during the upgrade.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>> Ram
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez@datalab.es]
>>>>>>>>>>>>> Sent: Tuesday, January 10, 2017 6:35 AM
>>>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel@gluster.org); gluster-users@gluster.org
>>>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>>>
>>>>>>>>>>>>> how did you upgrade gluster? From which version?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did you upgrade one server at a time and wait until self-heal finished before upgrading the next server?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Xavi
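For completeness, the usual way to gate such a rolling upgrade on heal completion is to poll heal info between servers; a rough sketch with the volume name from this setup:

  # after upgrading each server, wait until every brick reports 0 entries:
  gluster volume heal StoragePool info | grep 'Number of entries'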
>>>>>>>>>>>>> >>>>>>>>>>>>> Xavi >>>>>>>>>>>>> >>>>>>>>>>>>>> On 10/01/17 11:39, Ankireddypalle Reddy wrote: >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> We upgraded to GlusterFS 3.7.18 yesterday. We see lot of >>>>>>>>>>>>>> failures in our applications. Most of the errors are EIO. The >>>>>>>>>>>>>> following log lines are commonly seen in the logs: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] >>>>>>>>>>>>>> 0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'" >>>>>>>>>>>>>> repeated 2 times between [2017-01-10 02:46:25.069809] and >>>>>>>>>>>>>> [2017-01-10 02:46:25.069835] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:25.069852] W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5: >>>>>>>>>>>>>> Mismatching xdata in answers of 'LOOKUP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] >>>>>>>>>>>>>> 0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'" >>>>>>>>>>>>>> repeated 2 times between [2017-01-10 02:46:25.069852] and >>>>>>>>>>>>>> [2017-01-10 02:46:25.069873] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:25.069910] W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Mismatching xdata in answers of 'LOOKUP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> ... >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.520774] I [MSGID: 109036] >>>>>>>>>>>>>> [dht-common.c:9076:dht_log_new_layout_for_dir_selfheal] >>>>>>>>>>>>>> 0-StoragePool-dht: Setting layout of >>>>>>>>>>>>>> /Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 >>>>>>>>>>>>>> with >>>>>>>>>>>>>> [Subvol_name: StoragePool-disperse-0, Err: -1 , Start: >>>>>>>>>>>>>> 3221225466 , >>>>>>>>>>>>>> Stop: 3758096376 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-1, >>>>>>>>>> Err: >>>>>>>>>>>>>> -1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: >>>>>>>>>>>>>> 1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: >>>>>>>>>>>>>> 536870911 , >>>>>>>>>>>>>> Stop: 1073741821 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-4, >>>>>>>>>> Err: >>>>>>>>>>>>>> -1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: >>>>>>>>>>>>>> 2147483643 , >>>>>>>>>>>>>> Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start: >>>>>>>>>>>>>> 2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: >>>>>>>>>>>>>> 3221225465 , >>>>>>>>>>>>>> Hash: 1 ], >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.522841] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-3: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.522841] and [2017-01-10 02:46:26.522894] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.522898] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3: >>>>>>>>>>>>>> 
Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523115] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-6: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.523115] and [2017-01-10 02:46:26.523143] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523147] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523302] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-2: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.523302] and [2017-01-10 02:46:26.523324] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523328] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2: >>>>>>>>>>>>>> Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [root at glusterfs3 Log_Files]# gluster --version >>>>>>>>>>>>>> >>>>>>>>>>>>>> glusterfs 3.7.18 built on Dec 8 2016 06:34:26 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [root at glusterfs3 Log_Files]# gluster volume info >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Volume Name: StoragePool >>>>>>>>>>>>>> >>>>>>>>>>>>>> Type: Distributed-Disperse >>>>>>>>>>>>>> >>>>>>>>>>>>>> Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f >>>>>>>>>>>>>> >>>>>>>>>>>>>> Status: Started >>>>>>>>>>>>>> >>>>>>>>>>>>>> Number of Bricks: 8 x (2 + 1) = 24 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Transport-type: tcp >>>>>>>>>>>>>> >>>>>>>>>>>>>> Bricks: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick1: glusterfs1sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick2: glusterfs2sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick3: glusterfs3sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick4: glusterfs1sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick5: glusterfs2sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick6: glusterfs3sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick7: glusterfs1sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick8: glusterfs2sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick9: glusterfs3sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick10: glusterfs1sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick11: glusterfs2sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick12: glusterfs3sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick13: glusterfs1sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick14: glusterfs2sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick15: glusterfs3sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick16: glusterfs1sds:/ws/disk6/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick17: glusterfs2sds:/ws/disk6/ws_brick 
>>>>>>>>>>>>>> Brick18: glusterfs3sds:/ws/disk6/ws_brick
>>>>>>>>>>>>>> Brick19: glusterfs1sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick20: glusterfs2sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick21: glusterfs3sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick22: glusterfs1sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>> Brick23: glusterfs2sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>> Brick24: glusterfs3sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>>>>>> diagnostics.client-log-level: INFO
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>>> Ram
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Gluster-devel mailing list
>>>>>>>>>>>>>> Gluster-devel at gluster.org
>>>>>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
Ankireddypalle Reddy
2017-Jan-20 07:55 UTC
[Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
Xavi,
          Thanks. Please let me know the functions that we need to track for any inconsistencies in the return codes from multiple bricks for issue 1 ("Why the write fails in the first place"). I will start doing that.

Thanks and Regards,
Ram

-----Original Message-----
From: Xavier Hernandez [mailto:xhernandez at datalab.es]
Sent: Friday, January 20, 2017 2:41 AM
To: Ankireddypalle Reddy; Ashish Pandey
Cc: gluster-users at gluster.org; Gluster Devel (gluster-devel at gluster.org)
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

On 20/01/17 08:02, Ankireddypalle Reddy wrote:
> Ashish,
>
>              Thanks for looking in to the issue. In the given
> example the size/version matches for file on glusterfs4 and glusterfs5
> nodes. The file is empty on glusterfs6. Now what happens if glusterfs5
> goes down. Though the SLA factor of 2 is met still I will not be able
> to access the data.

True, but having a brick with inconsistent data is the same as having it down. You would have lost 2 bricks out of 3. The problem is how to detect the inconsistent data, what is causing it and why self-heal (apparently) is not healing it.

> The problem is that writes did not fail for the file
> to indicate the issue to the application.

That's the expected behavior. Since we have redundancy 1, a loss or failure on a single fragment is hidden from the application. However this triggers internal procedures to repair the problem on the background. This also seems to not be working.

There are two issues to identify here:

1. Why the write fails in first place
2. Why self-heal is unable to repair it

Probably the root cause is the same for both problems, but I'm not sure.

For 1 there should be some warning or error in the mount log, since one of the bricks is reporting an error, even if ec is able to report success to the application later.

For 2 the analysis could be more complex, but most probably there should be some warning or error message in the mount log and/or self-heal log of one of the servers.

Xavi

> It's not that we are
> encountering issue for every file on the mount point. The issue happens
> randomly for different files.
>
> [root at glusterfs4 glusterfs]# gluster volume info
>
> Volume Name: glusterfsProd
> Type: Distributed-Disperse
> Volume ID: 622aa3ee-958f-485f-b3c1-fb0f6c8db34c
> Status: Started
> Number of Bricks: 12 x (2 + 1) = 36
> Transport-type: tcp
> Bricks:
> Brick1: glusterfs4sds:/ws/disk1/ws_brick
> Brick2: glusterfs5sds:/ws/disk1/ws_brick
> Brick3: glusterfs6sds:/ws/disk1/ws_brick
> Brick4: glusterfs4sds:/ws/disk10/ws_brick
> Brick5: glusterfs5sds:/ws/disk10/ws_brick
> Brick6: glusterfs6sds:/ws/disk10/ws_brick
> Brick7: glusterfs4sds:/ws/disk11/ws_brick
> Brick8: glusterfs5sds:/ws/disk11/ws_brick
> Brick9: glusterfs6sds:/ws/disk11/ws_brick
> Brick10: glusterfs4sds:/ws/disk2/ws_brick
> Brick11: glusterfs5sds:/ws/disk2/ws_brick
> Brick12: glusterfs6sds:/ws/disk2/ws_brick
> Brick13: glusterfs4sds:/ws/disk3/ws_brick
> Brick14: glusterfs5sds:/ws/disk3/ws_brick
> Brick15: glusterfs6sds:/ws/disk3/ws_brick
> Brick16: glusterfs4sds:/ws/disk4/ws_brick
> Brick17: glusterfs5sds:/ws/disk4/ws_brick
> Brick18: glusterfs6sds:/ws/disk4/ws_brick
> Brick19: glusterfs4sds:/ws/disk5/ws_brick
> Brick20: glusterfs5sds:/ws/disk5/ws_brick
> Brick21: glusterfs6sds:/ws/disk5/ws_brick
> Brick22: glusterfs4sds:/ws/disk6/ws_brick
> Brick23: glusterfs5sds:/ws/disk6/ws_brick
> Brick24: glusterfs6sds:/ws/disk6/ws_brick
> Brick25: glusterfs4sds:/ws/disk7/ws_brick
> Brick26: glusterfs5sds:/ws/disk7/ws_brick
> Brick27: glusterfs6sds:/ws/disk7/ws_brick
> Brick28: glusterfs4sds:/ws/disk8/ws_brick
> Brick29: glusterfs5sds:/ws/disk8/ws_brick
> Brick30: glusterfs6sds:/ws/disk8/ws_brick
> Brick31: glusterfs4sds:/ws/disk9/ws_brick
> Brick32: glusterfs5sds:/ws/disk9/ws_brick
> Brick33: glusterfs6sds:/ws/disk9/ws_brick
> Brick34: glusterfs4sds:/ws/disk12/ws_brick
> Brick35: glusterfs5sds:/ws/disk12/ws_brick
> Brick36: glusterfs6sds:/ws/disk12/ws_brick
> Options Reconfigured:
> storage.build-pgfid: on
> performance.readdir-ahead: on
> nfs.export-dirs: off
> nfs.export-volumes: off
> nfs.disable: on
> auth.allow: glusterfs4sds,glusterfs5sds,glusterfs6sds
> diagnostics.client-log-level: INFO
> [root at glusterfs4 glusterfs]#
>
> Thanks and Regards,
> Ram
>
> From: Ashish Pandey [mailto:aspandey at redhat.com]
> Sent: Thursday, January 19, 2017 10:36 PM
> To: Ankireddypalle Reddy
> Cc: Xavier Hernandez; gluster-users at gluster.org; Gluster Devel
> (gluster-devel at gluster.org)
> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in
> disperse volume
>
> Ram,
>
> I don't understand what you mean by saying "the redundancy factor of 2
> is met in a 3:1 disperse volume". You have given the xattrs of only 3
> bricks. The two sentences above and the getxattr output contradict each
> other.
>
> In the given scenario, if you have a (2+1) ec configuration and 2 bricks
> have the same size and version, then there should not be any problem
> accessing this file. Run heal and the 3rd fragment will also be healthy.
>
> I think there has been a major gap in providing complete and correct
> information about the volume and all the logs and activities.
> Could you please provide the following -
>
> 1 - gluster v info - please give us the output of this command
>
> 2 - Let's consider only one file which you are not able to access and
> find out the reason.
>
> 3 - Try to create and write some files on the mount point and see if
> there is any issue with new file creation; if yes, provide logs showing
> why (a reproduction sketch follows this list).
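A minimal reproduction sketch for point 3. The mount point /ws/glus and the mount-log name ws-glus.log are taken from later messages in this thread; the test file names and the log location under /var/log/glusterfs are assumptions:

    # write a batch of test files on the mount and report any failure
    for i in $(seq 1 100); do
      dd if=/dev/zero of=/ws/glus/ec-write-test.$i bs=1M count=8 2>/dev/null \
        || echo "write $i failed"
    done
    # then look for warnings/errors logged by ec around the same time
    grep -E '\] (W|E) \[' /var/log/glusterfs/ws-glus.log | tail -20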
> > > > Specific but enough information is required to find the root cause of > the issue. > > > > Ashish > > > > > > > > ------------------------------------------------------------------------ > > *From: *"Ankireddypalle Reddy" <areddy at commvault.com > <mailto:areddy at commvault.com>> > *To: *"Ankireddypalle Reddy" <areddy at commvault.com > <mailto:areddy at commvault.com>>, "Xavier Hernandez" > <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> > *Cc: *gluster-users at gluster.org <mailto:gluster-users at gluster.org>, > "Gluster Devel (gluster-devel at gluster.org > <mailto:gluster-devel at gluster.org>)" <gluster-devel at gluster.org > <mailto:gluster-devel at gluster.org>> > *Sent: *Friday, January 20, 2017 3:20:46 AM > *Subject: *Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in > disperse volume > > > > Xavi, > Finally I am able to locate the files for which we are > seeing the mismatch. > > > > [2017-01-19 21:20:10.002737] W [MSGID: 122056] > [ec-combine.c:875:ec_combine_check] 0-glusterfsProd-disperse-0: > Mismatching xdata in answers of 'LOOKUP' for > f5e61224-17f6-4803-9f75-dd74cd79e248 > [2017-01-19 21:20:10.002776] W [MSGID: 122006] > [ec-combine.c:207:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to > combine iatt (inode: 11490333518038950472-11490333518038950472, links: > 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 4800512-4734976, mode: > 100775-100775) for f5e61224-17f6-4803-9f75-dd74cd79e248 > > > > GFID f5e61224-17f6-4803-9f75-dd74cd79e248 maps to file > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > > > > The size field differs between the bricks of this sub volume. > > > > [root at glusterfs4 glusterfs]# getfattr -d -e hex -m . > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > trusted.ec.size=0x00000000000e0000 > trusted.ec.version=0x00000000000000080000000000000008 > > > > [root at glusterfs5 ws]# getfattr -d -e hex -m . > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > trusted.ec.size=0x00000000000e0000 > trusted.ec.version=0x00000000000000080000000000000008 > > > > [root at glusterfs6 glusterfs]# getfattr -d -e hex -m . > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > getfattr: Removing leading '/' from absolute path names > trusted.ec.size=0x0000000000000000 > trusted.ec.version=0x00000000000000000000000000000008 > > > > On glusterfs6 writes seem to have failed and the file appears to be an > empty file. > [root at glusterfs6 glusterfs]# stat > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > File: > '/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158' > Size: 0 Blocks: 0 IO Block: 4096 regular > empty file > > > > This looks like a major issue. Since we are left with no redundancy for > these files and if the brick on other node fails then we will not be > able to access this data though the redundancy factor of 2 is met in a > 3:1 disperse volume. 
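The hex values above can be decoded directly in the shell. A small sketch (the path is the one quoted above; trusted.ec.version packs two 64-bit big-endian counters, the data version followed by the metadata version, which is how the two halves are read later in this thread):

    f=/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158
    s=$(getfattr --only-values -e hex -n trusted.ec.size "$f")
    v=$(getfattr --only-values -e hex -n trusted.ec.version "$f")
    echo "size: $((s)) bytes"                   # 917504 on glusterfs4/5, 0 on glusterfs6
    echo "data version: $((16#${v:2:16}))"      # 8 on glusterfs4/5, 0 on glusterfs6
    echo "metadata version: $((16#${v:18:16}))" # 8 on all three bricks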
> > > > Thanks and Regards, > Ram > > > > -----Original Message----- > From: gluster-users-bounces at gluster.org > <mailto:gluster-users-bounces at gluster.org> > [mailto:gluster-users-bounces at gluster.org] On Behalf Of Ankireddypalle Reddy > Sent: Monday, January 16, 2017 9:41 AM > To: Xavier Hernandez > Cc: gluster-users at gluster.org <mailto:gluster-users at gluster.org>; > Gluster Devel (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>) > Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in > disperse volume > > > > Xavi, > Thanks. I will start by tracking the delta if any in return > codes from the bricks for writes in EC. I have a feeling that if we > could get to the bottom of this issue then the EIO errors could be > ultimately avoided. Please note that while we are testing all these on > a standby setup we continue to encounter the EIO errors on our > production setup which is running @ 3.7.18. > > > > Thanks and Regards, > Ram > > > > -----Original Message----- > From: Xavier Hernandez [mailto:xhernandez at datalab.es] > Sent: Monday, January 16, 2017 6:54 AM > To: Ankireddypalle Reddy > Cc: gluster-users at gluster.org <mailto:gluster-users at gluster.org>; > Gluster Devel (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>) > Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in > disperse volume > > > > Hi Ram, > > > > On 16/01/17 12:33, Ankireddypalle Reddy wrote: >> Xavi, >> Thanks. Is there any other way to map from GFID to path. > > > > The only way I know is to search all files from bricks and lookup for > the trusted.gfid xattr. > > > >> I will look for a way to share the TRACE logs. Easier way might be to add some extra logging. I could do that if you could let me know functions in which you are interested.. > > > > The problem is that I don't know where the problem is. One possibility > could be to track all return values from all bricks for all writes and > then identify which ones belong to an inconsistent file. > > > > But if this doesn't reveal anything interesting we'll need to look at > some other place. And this can be very tedious and slow. > > > > Anyway, what we are looking now is not the source of an EIO, since there > are two bricks with consistent state and the file should be perfectly > readable and writable. It's true that there's some problem here and it > could derive in EIO if one of the healthy bricks degrades, but at least > this file shouldn't be giving EIO errors for now. > > > > Xavi > > > >> >> Sent on from my iPhone >> >>> On Jan 16, 2017, at 6:23 AM, Xavier Hernandez <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote: >>> >>> Hi Ram, >>> >>>> On 13/01/17 18:41, Ankireddypalle Reddy wrote: >>>> Xavi, >>>> I enabled TRACE logging. The log grew up to 120GB and could not make much out of it. Then I started logging GFID in the code where we were seeing errors. 
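On the GFID-to-path step that comes up just below: besides the pathinfo xattr, the fragment behind a GFID can usually be recovered directly on a brick through the .glusterfs entry, which for regular files is a hardlink to the real file. A sketch, using the GFID and brick root quoted below:

    gfid=60b990ed-d741-4176-9c7b-4d3a25fb8252
    brick=/ws/disk1/ws_brick
    # skip .glusterfs itself; print every path sharing the gfid file's inode
    find "$brick" -path "$brick/.glusterfs" -prune -o \
         -samefile "$brick/.glusterfs/${gfid:0:2}/${gfid:2:2}/$gfid" -print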
>>>> [2017-01-13 17:02:01.761349] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bc690 ((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
>>>> [2017-01-13 17:02:01.761360] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 ((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
>>>> [2017-01-13 17:02:01.761365] W [MSGID: 122056] [ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP'
>>>> [2017-01-13 17:02:01.761405] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
>>>> [2017-01-13 17:02:01.761417] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bbb14 ((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:))
>>>> [2017-01-13 17:02:01.761428] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 ((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:))
>>>> [2017-01-13 17:02:01.761433] W [MSGID: 122056] [ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP'
>>>> [2017-01-13 17:02:01.761442] W [MSGID: 122006] [ec-combine.c:214:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to combine iatt (inode: 11275691004192850514-11275691004192850514, gfid: 60b990ed-d741-4176-9c7b-4d3a25fb8252 - 60b990ed-d741-4176-9c7b-4d3a25fb8252, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 406650880-406683648, mode: 100775-100775)
>>>>
>>>> The file for which we are seeing this error turns out to have a GFID of 60b990ed-d741-4176-9c7b-4d3a25fb8252.
>>>>
>>>> I then tried to find the file with this GFID. It pointed me to the following path; I was expecting a real file system path, per the following tutorial:
>>>> https://gluster.readthedocs.io/en/latest/Troubleshooting/gfid-to-path/
>>>
>>> I think this method only works if bricks have the inode cached.
>>>
>>>>
>>>> getfattr -n trusted.glusterfs.pathinfo -e text /mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> trusted.glusterfs.pathinfo="(<DISTRIBUTE:glusterfsProd-dht> (<EC:glusterfsProd-disperse-0> <POSIX(/ws/disk1/ws_brick):glusterfs6:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252> <POSIX(/ws/disk1/ws_brick):glusterfs5:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>))"
>>>>
>>>> Then I looked at the xattrs of these files on all 3 bricks:
>>>>
>>>> [root at glusterfs4 glusterfs]# getfattr -d -m . -e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252
>>>> trusted.bit-rot.version=0x02000000000000005877a8dc00041138
>>>> trusted.ec.config=0x0000080301000200
>>>> trusted.ec.size=0x0000000000000000
>>>> trusted.ec.version=0x00000000000000000000000000002a38
>>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252
>>>>
>>>> [root at glusterfs5 bricks]# getfattr -d -m .
-e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> trusted.bit-rot.version=0x02000000000000005877a8dc000c92d0 >>>> trusted.ec.config=0x0000080301000200 >>>> trusted.ec.dirty=0x00000000000000160000000000000000 >>>> trusted.ec.size=0x00000000306b0000 >>>> trusted.ec.version=0x0000000000002a380000000000002a38 >>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252 >>>> >>>> [root at glusterfs6 ee]# getfattr -d -m . -e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> trusted.bit-rot.version=0x02000000000000005877a8dc000c9436 >>>> trusted.ec.config=0x0000080301000200 >>>> trusted.ec.dirty=0x00000000000000160000000000000000 >>>> trusted.ec.size=0x00000000306b0000 >>>> trusted.ec.version=0x0000000000002a380000000000002a38 >>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252 >>>> >>>> It turns out that the size and version in fact does not match for one of the files. >>> >>> It seems as if the brick on glusterfs4 didn't receive any write request (or they failed for some reason). Do you still have the trace log ? is there any way I could download it ? >>> >>> Xavi >>> >>>> >>>> Thanks and Regards, >>>> Ram >>>> >>>> -----Original Message----- >>>> From: gluster-devel-bounces at gluster.org > <mailto:gluster-devel-bounces at gluster.org> > [mailto:gluster-devel-bounces at gluster.org] On Behalf Of Ankireddypalle Reddy >>>> Sent: Friday, January 13, 2017 4:17 AM >>>> To: Xavier Hernandez >>>> Cc: gluster-users at gluster.org <mailto:gluster-users at gluster.org>; Gluster > Devel (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>) >>>> Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume >>>> >>>> Xavi, >>>> Thanks for explanation. Will collect TRACE logs today. >>>> >>>> Thanks and Regards, >>>> Ram >>>> >>>> Sent from my iPhone >>>> >>>>> On Jan 13, 2017, at 3:03 AM, Xavier Hernandez <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote: >>>>> >>>>> Hi Ram, >>>>> >>>>>> On 12/01/17 22:14, Ankireddypalle Reddy wrote: >>>>>> Xavi, >>>>>> I changed the logging to log the individual bytes. Consider the following from ws-glus.log file where /ws/glus is the mount point. >>>>>> >>>>>> [2017-01-12 20:47:59.368102] I [MSGID: 109063] >>>>>> [dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found >>>>>> anomalies in >>>>>> /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid >>>>>> e694387f-dde7-410b-9562-914a994d5e85). 
Holes=1 overlaps=0 >>>>>> [2017-01-12 20:47:59.391218] I [MSGID: 109036] >>>>>> [dht-common.c:9082:dht_log_new_layout_for_dir_selfheal] >>>>>> 0-glusterfsProd-dht: Setting layout of >>>>>> /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 with >>>>>> [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 , >>>>>> Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, >>>>>> Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], >>>>>> [Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 3221225469 >>>>>> , Stop: 3579139409 , Hash: 1 ], [Subvol_name: >>>>>> glusterfsProd-disperse-11, Err: -1 , Start: 3579139410 , Stop: >>>>>> 3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: >>>>>> -1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: >>>>>> glusterfsProd-disperse-3, Err: -1 , Start: 0 , Stop: 357913940 , >>>>>> Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: >>>>>> 357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name: >>>>>> glusterfsProd-disperse-5, Err: -1 , Start: 715827882 , Stop: >>>>>> 1073741822 , Hash >>>> : 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: > glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 , Stop: 2505397586 > , Hash: 1 ], >>>>>> >>>>>> Self-heal seems to be triggered for path /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies as per DHT. It would be great if someone could explain what could be the anomaly here. The setup where we encountered this is a fairly stable setup with no brick failures or no node failures. >>>>> >>>>> This is not really a self-heal, at least from the point of view of ec. This means that DHT has found a discrepancy in the layout of that directory, however this doesn't mean any problem (notice the 'I' in the log, meaning that it's informative, not a warning nor error). >>>>> >>>>> Not sure how DHT works in this case or why it finds this "anomaly", but if there aren't any previous errors before that message, it can be completely ignored. >>>>> >>>>> Not sure if it can be related to option cluster.weighted-rebalance that is enabled by default. >>>>> >>>>>> >>>>>> Then Self-heal seems to have encountered the following error. >>>>>> >>>>>> [2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp] >>>>>> 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two >>>>>> dicts (16, 16) >>>>>> [2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b649520ac >>>>>> ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted >>>>>> .ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:)) >>>>>> [2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0 >>>>>> ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted >>>>>> .ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:)) >>>>>> [2017-01-12 20:48:23.418531] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-2: Mismatching xdata in answers of 'LOOKUP' >>>>> >>>>> That's a real problem. Here we have two bricks that differ in the trusted.ec.version xattr. However this xattr not necessarily belongs to the previous directory. They are unrelated messages. 
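The brick-by-brick check that follows can be run in one loop from any node (hostnames and path as quoted in this thread; passwordless ssh as root to the peers is an assumption):

    p=/ws/disk11/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
    for h in glusterfs4sds glusterfs5sds glusterfs6sds; do
      printf '%s: ' "$h"
      ssh "$h" getfattr --only-values -e hex -n trusted.ec.version "$p"
      echo
    done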
>>>>> >>>>>> >>>>>> In this case glusterfsProd-disperse-2 sub volume actually consists of the following bricks. >>>>>> glusterfs4sds:/ws/disk11/ws_brick, glusterfs5sds: >>>>>> /ws/disk11/ws_brick, glusterfs6sds: /ws/disk11/ws_brick >>>>>> >>>>>> I went ahead and checked the value of trusted.ec.version on all the 3 bricks inside this sub vol: >>>>>> >>>>>> [root at glusterfs6 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> trusted.ec.version=0x0000000000000009000000000000000b >>>>>> >>>>>> [root at glusterfs4 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> trusted.ec.version=0x0000000000000009000000000000000b >>>>>> >>>>>> [root at glusterfs5 glusterfs]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> trusted.ec.version=0x0000000000000009000000000000000b >>>>>> >>>>>> The attribute value seems to be same on all the 3 bricks. >>>>> >>>>> That's a clear indication that the ec warning is not related to this directory because trusted.ec.version always increases, never decreases, and the directory has a value smaller that the one that appears in the log message. >>>>> >>>>> If you show all dict entries in the log, it seems that it does refer to a directory because trusted.ec.size is not present, but it must be another directory than the one you looked at. We would need to find which one is having this issue. The TRACE log would be helpful here. >>>>> >>>>> >>>>>> >>>>>> Also please note that every single time that the trusted.ec.version was found to mismatch the same values are getting logged. Following are 2 more instances of trusted.ec.version mismatch. 
>>>>>> >>>>>> [2017-01-12 20:14:25.554540] I [dict.c:166:key_value_cmp] >>>>>> 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two >>>>>> dicts (16, 16) >>>>>> [2017-01-12 20:14:25.554588] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b6495a9f0 >>>>>> ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0: >>>>>> 0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0: >>>>>> 0:0:e:)) >>>>>> [2017-01-12 20:14:25.554608] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b6495903c >>>>>> ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0: >>>>>> 0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0: >>>>>> 0:0:e:)) >>>>>> [2017-01-12 20:14:25.554624] W [MSGID: 122053] >>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-2: >>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>> good=3, bad=4) >>>>>> [2017-01-12 20:14:25.554632] W [MSGID: 122002] >>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-2: Heal >>>>>> failed [Invalid argument] >>>>>> [2017-01-12 20:14:25.555598] I [dict.c:166:key_value_cmp] >>>>>> 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two >>>>>> dicts (16, 16) >>>>>> [2017-01-12 20:14:25.555622] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b64956c24 >>>>>> ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0: >>>>>> 0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0: >>>>>> 0:0:e:)) >>>>>> [2017-01-12 20:14:25.555638] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b64964e8c >>>>>> ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0: >>>>>> 0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0: >>>>>> 0:0:e:)) >>>>> >>>>> >>>>> I think that this refers to the same directory. This seems an attempt to heal it that has failed. So it makes sense that it finds exactly the same values. >>>>> >>>>>> >>>>>> >>>>>> In glustershd.log lot of similar errors are logged. >>>>>> >>>>>> [2017-01-12 21:10:53.728770] I [dict.c:166:key_value_cmp] >>>>>> 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two >>>>>> dicts (8, 8) >>>>>> [2017-01-12 21:10:53.728804] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-0: dict=0x7f21694b6f50 >>>>>> ((trusted.ec.size:0:0:0:0:42:3a:0:0:)(trusted.ec.version:0:0:0:0:0:0: >>>>>> 37:5f:0:0:0:0:0:0:37:5f:)) >>>>>> [2017-01-12 21:10:53.728827] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-0: dict=0x7f21694b62bc >>>>>> ((trusted.ec.size:0:0:0:0:0:ca:0:0:)(trusted.ec.version:0:0:0:0:0:0:0 >>>>>> :a1:0:0:0:0:0:0:37:5f:)) >>>>>> [2017-01-12 21:10:53.728842] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP' >>>>>> [2017-01-12 21:10:53.728854] W [MSGID: 122053] >>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-0: >>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>> good=6, bad=1) >>>>>> [2017-01-12 21:10:53.728876] W [MSGID: 122002] >>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-0: Heal >>>>>> failed [Invalid argument] >>>>> >>>>> This seems an attempt to heal a file, but I see a lot of differences between both versions. The size on one brick is 13.238.272 bytes, but on the other brick it's 1.111.097.344 bytes. That's a huge difference. 
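The size bytes in the two dict dumps above are big-endian, so the mismatch can be verified with plain shell arithmetic (a quick check of the figures cited above, not new data):

    echo $((16#423a0000))   # trusted.ec.size 0:0:0:0:42:3a:0:0 -> 1111097344 bytes (~1.03 GiB)
    echo $((16#00ca0000))   # trusted.ec.size 0:0:0:0:0:ca:0:0  -> 13238272 bytes (~12.6 MiB)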
>>>>> Looking at the trusted.ec.version, I see that the 'data' version is very different (from 161 to 14.175), however the metadata version is exactly the same. This really seems like a lot of writes happened while one brick was down (or disconnected for some reason, or the writes failed for some reason). One brick has lost about 14.000 writes of ~80KB.
>>>>>
>>>>> I think the most important thing right now would be to identify which files and directories are having these problems to be able to identify the cause. Again, the TRACE log will be really useful.
>>>>>
>>>>> Xavi
>>>>>
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Ram
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>> Sent: Thursday, January 12, 2017 6:40 AM
>>>>>> To: Ankireddypalle Reddy
>>>>>> Cc: Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>
>>>>>> Hi Ram,
>>>>>>
>>>>>>> On 12/01/17 11:49, Ankireddypalle Reddy wrote:
>>>>>>> Xavi,
>>>>>>> As I mentioned before, the error could happen for any FOP. Will try to run with TRACE debug level. Is there a possibility that we are checking for this attribute on a directory? A directory does not seem to have this attribute set.
>>>>>>
>>>>>> No, directories do not have this attribute and no one should be reading it from a directory.
>>>>>>
>>>>>>> Also, is the function that checks size and version called after it is decided that a heal should be run, or is this check the one which decides whether a heal should be run?
>>>>>>
>>>>>> Almost all checks that trigger a heal are done in the lookup fop when some discrepancy is detected.
>>>>>>
>>>>>> The function that checks size and version is called later once a lock on the inode is acquired (even if no heal is needed). However, further failures in the processing of any fop can also trigger a self-heal.
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Ram
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Jan 12, 2017, at 2:25 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>>>>>>
>>>>>>>> Hi Ram,
>>>>>>>>
>>>>>>>>> On 12/01/17 02:36, Ankireddypalle Reddy wrote:
>>>>>>>>> Xavi,
>>>>>>>>> I added some more logging information. The trusted.ec.size field values are in fact different.
>>>>>>>>> trusted.ec.size l1 = 62719407423488 l2 = 0
>>>>>>>>
>>>>>>>> That's very weird. Directories do not have this attribute. It's only present on regular files. But you said that the error happens while creating the file, so it doesn't make much sense because file creation always sets trusted.ec.size to 0.
>>>>>>>>
>>>>>>>> Could you reproduce the problem with diagnostics.client-log-level set to TRACE and send the log to me? It will create a big log, but I'll have much more information about what's going on.
>>>>>>>>
>>>>>>>> Do you have a mixed setup with nodes of different types? For example mixed 32/64 bit architectures or different operating systems? I ask this because 62719407423488 in hex is 0x390B00000000, which has the lower 32 bits set to 0, but has garbage above that.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This is a fairly static setup with no brick/node failure. Please explain why a heal is being triggered and what could have actually caused these size xattrs to differ.
This is causing random I/O failures and is impacting the backup schedules. >>>>>>>> >>>>>>>> The launch of self-heal is normal because it has detected an inconsistency. The real problem is what originates that inconsistency. >>>>>>>> >>>>>>>> Xavi >>>>>>>> >>>>>>>>> >>>>>>>>> [ 2017-01-12 01:19:18.256970] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-12 01:19:18.257015] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=3, bad=4) >>>>>>>>> [2017-01-12 01:19:18.257018] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-12 01:19:21.002056] E [dict.c:166:log_value] >>>>>>>>> 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 >>>>>>>>> l2 = 0 i1 = 0 i2 = 0 ] >>>>>>>>> [2017-01-12 01:19:21.002064] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-12 01:19:21.209673] E [dict.c:166:log_value] >>>>>>>>> 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 >>>>>>>>> l2 = 0 i1 = 0 i2 = 0 ] >>>>>>>>> [2017-01-12 01:19:21.209686] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-12 01:19:21.209719] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-12 01:19:21.209753] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-4: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> >>>>>>>>> Thanks and Regards, >>>>>>>>> Ram >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: Ankireddypalle Reddy >>>>>>>>> Sent: Wednesday, January 11, 2017 9:29 AM >>>>>>>>> To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel >>>>>>>>> (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>); > gluster-users at gluster.org <mailto:gluster-users at gluster.org> >>>>>>>>> Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in >>>>>>>>> disperse volume >>>>>>>>> >>>>>>>>> Xavi, >>>>>>>>> I built a debug binary to log more information. This is what is getting logged. Looks like it is the attribute trusted.ec.size which is different among the bricks in a sub volume. 
>>>>>>>>> >>>>>>>>> In glustershd.log : >>>>>>>>> >>>>>>>>> [2017-01-11 14:19:45.023845] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching iatt in answers of 'GF_FOP_LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.027736] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.027763] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.027781] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.027793] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-11 14:19:45.027815] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-11 14:19:45.029035] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.029057] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.029089] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.029105] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.029121] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-11 14:19:45.032566] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.029138] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-11 14:19:45.032585] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.032614] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.032631] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.032638] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-11 14:19:45.032654] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-11 14:19:45.037514] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) 
>>>>>>>>> [2017-01-11 14:19:45.037536] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.037553] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.037573] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.037582] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-11 14:19:45.037599] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-11 14:20:40.001401] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-3: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:20:40.001387] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-5: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> >>>>>>>>> In the mount daemon log: >>>>>>>>> >>>>>>>>> [2017-01-11 14:20:17.806826] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-0: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.806847] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-0: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807076] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-1: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807099] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-1: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807286] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-10: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807298] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-10: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807409] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-11: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807420] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-11: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807448] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-4: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807462] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-4: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807539] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-2: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807550] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-2: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807723] E [MSGID: 
122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-3: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807739] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-3: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807785] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-5: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807796] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-5: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808020] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-9: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808034] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-9: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808054] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-6: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808066] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-6: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808282] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-8: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808292] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-8: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.809212] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-7: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.809228] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-7: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> >>>>>>>>> [2017-01-11 14:20:17.812660] I [MSGID: 109036] >>>>>>>>> [dht-common.c:8043:dht_log_new_layout_for_dir_selfheal] >>>>>>>>> 2-glusterfsProd-dht: Setting layout of >>>>>>>>> /Folder_01.05.2017_21.15/CV_MAGNETIC/V_31500/CHUNK_402578 with >>>>>>>>> [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: >>>>>>>>> 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: >>>>>>>>> glusterfsProd-disperse-1, Err: -1 , Start: 2147483646 , Stop: >>>>>>>>> 2505397586 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-10, >>>>>>>>> Err: -1 , Start: 2505397587 , Stop: 2863311527 , Hash: 1 ], >>>>>>>>> [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: >>>>>>>>> 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: >>>>>>>>> glusterfsProd-disperse-2, Err: -1 , Start: 3221225469 , Stop: >>>>>>>>> 3579139409 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: >>>>>>>>> -1 , Start: 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: >>>>>>>>> glusterfsProd-disperse-4, Err: -1 , Start: 3937053351 , Stop: >>>>>>>>> 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: >>>>>>>>> -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: >>>>>>>>> glusterfsProd-disperse-6, Err: -1 , Start: 357913941 , Stop: >>>>>>>>> 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: >>>>>>>>> -1 , Start: 715827882 , 
Stop: 1073741822 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-8, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ],
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Ankireddypalle Reddy
>>>>>>>>> Sent: Tuesday, January 10, 2017 10:09 AM
>>>>>>>>> To: Xavier Hernandez; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>
>>>>>>>>> Xavi,
>>>>>>>>> In this case it's the file creation which failed. So I provided the xattrs of the parent.
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Ram
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>> Sent: Tuesday, January 10, 2017 9:10 AM
>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>
>>>>>>>>> Hi Ram,
>>>>>>>>>
>>>>>>>>>> On 10/01/17 14:42, Ankireddypalle Reddy wrote:
>>>>>>>>>> Attachments (2):
>>>>>>>>>>
>>>>>>>>>> 1. ec.txt (11.50 KB)
>>>>>>>>>> 2. ws-glus.log (3.48 MB)
>>>>>>>>>>
>>>>>>>>>> Xavi,
>>>>>>>>>> We are encountering errors for different kinds of FOPS.
>>>>>>>>>> The open failed for the following file:
>>>>>>>>>>
>>>>>>>>>> cvd_2017_01_10_02_28_26.log:98182 1f9fe 01/10 00:57:10 8414465 [MEDIAFS ] 20117519-52075477 SingleInstancer_FS::StartDataFile2: Failed to create the data file [/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062], error=0xECCC0005:{CQiFile::Open(92)} + {CQiUTFOSAPI::open(96)/ErrNo.5.(Input/output error)-Open failed, File=/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062, OperationFlag=0xC1, PermissionMode=0x1FF}
>>>>>>>>>>
>>>>>>>>>> I've attached the extended attributes for the directories
>>>>>>>>>> /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/
>>>>>>>>>> and
>>>>>>>>>> /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720
>>>>>>>>>> from all the bricks.
>>>>>>>>>>
>>>>>>>>>> The attributes look fine to me. I've also attached some log cuts to illustrate the problem.
>>>>>>>>>
>>>>>>>>> I need the extended attributes of the file itself, not the parent directories.
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks and Regards,
>>>>>>>>>> Ram
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>>> Sent: Tuesday, January 10, 2017 7:53 AM
>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>
>>>>>>>>>> Hi Ram,
>>>>>>>>>>
>>>>>>>>>> the error is caused by an extended attribute that does not match on all
>>>>>>>>>> 3 bricks of the disperse set. Most probable value is
>>>>>>>>>> trusted.ec.version, but could be others.
>>>>>>>>>>
>>>>>>>>>> At first sight, I don't see any change from 3.7.8 that could have
>>>>>>>>>> caused this. I'll check again.
>>>>>>>>>>
>>>>>>>>>> What kind of operations are you doing ? this can help me narrow the search.
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>> On 10/01/17 13:43, Ankireddypalle Reddy wrote:
>>>>>>>>>>> Xavi,
>>>>>>>>>>> Thanks. If you could please explain what to look for in the
>>>>>>>>>>> extended attributes then I will check and let you know if I find
>>>>>>>>>>> anything suspicious. Also we noticed that some of these
>>>>>>>>>>> operations would succeed if retried. Do you know of any
>>>>>>>>>>> communication-related errors that are being reported/triaged?
>>>>>>>>>>>
>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>> Ram
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>>>> Sent: Tuesday, January 10, 2017 7:23 AM
>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>
>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/17 13:14, Ankireddypalle Reddy wrote:
>>>>>>>>>>>> Attachment (1):
>>>>>>>>>>>>
>>>>>>>>>>>> 1. ecxattrs.txt (5.92 KB)
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi,
>>>>>>>>>>>> Please find attached the extended attributes for a
>>>>>>>>>>>> directory from all the bricks. Free space check failed for this
>>>>>>>>>>>> with error number EIO.
>>>>>>>>>>>
>>>>>>>>>>> What do you mean ? what operation have you made to check the free
>>>>>>>>>>> space on that directory ?
>>>>>>>>>>>
>>>>>>>>>>> If it's a recursive check, I need the extended attributes from the
>>>>>>>>>>> exact file that triggers the EIO. The attached attributes seem
>>>>>>>>>>> consistent and that directory shouldn't cause any problem. Does an 'ls'
>>>>>>>>>>> on that directory fail or does it show the contents ?
>>>>>>>>>>>
>>>>>>>>>>> Xavi
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>> Ram
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>>>>> Sent: Tuesday, January 10, 2017 6:45 AM
>>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>>
>>>>>>>>>>>> can you execute the following command on all bricks on a file
>>>>>>>>>>>> that is giving EIO ?
>>>>>>>>>>>>
>>>>>>>>>>>> getfattr -m. -e hex -d <path to file in brick>
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/17 12:41, Ankireddypalle Reddy wrote:
>>>>>>>>>>>>> Xavi,
>>>>>>>>>>>>> We have been running 3.7.8 on these servers. We upgraded
>>>>>>>>>>>>> to 3.7.18 yesterday. We upgraded all the servers at a time.
>>>>>>>>>>>>> The volume was brought down during upgrade.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>> Ram
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>>>>>> Sent: Tuesday, January 10, 2017 6:35 AM
>>>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>>>
>>>>>>>>>>>>> how did you upgrade gluster ? from which version ?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did you upgrade one server at a time and waited until self-heal
>>>>>>>>>>>>> finished before upgrading the next server ?
>>>>>>>>>>>>> >>>>>>>>>>>>> Xavi >>>>>>>>>>>>> >>>>>>>>>>>>>> On 10/01/17 11:39, Ankireddypalle Reddy wrote: >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> We upgraded to GlusterFS 3.7.18 yesterday. We see lot of >>>>>>>>>>>>>> failures in our applications. Most of the errors are EIO. The >>>>>>>>>>>>>> following log lines are commonly seen in the logs: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] >>>>>>>>>>>>>> 0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'" >>>>>>>>>>>>>> repeated 2 times between [2017-01-10 02:46:25.069809] and >>>>>>>>>>>>>> [2017-01-10 02:46:25.069835] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:25.069852] W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5: >>>>>>>>>>>>>> Mismatching xdata in answers of 'LOOKUP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] >>>>>>>>>>>>>> 0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'" >>>>>>>>>>>>>> repeated 2 times between [2017-01-10 02:46:25.069852] and >>>>>>>>>>>>>> [2017-01-10 02:46:25.069873] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:25.069910] W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Mismatching xdata in answers of 'LOOKUP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> ... >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.520774] I [MSGID: 109036] >>>>>>>>>>>>>> [dht-common.c:9076:dht_log_new_layout_for_dir_selfheal] >>>>>>>>>>>>>> 0-StoragePool-dht: Setting layout of >>>>>>>>>>>>>> /Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 >>>>>>>>>>>>>> with >>>>>>>>>>>>>> [Subvol_name: StoragePool-disperse-0, Err: -1 , Start: >>>>>>>>>>>>>> 3221225466 , >>>>>>>>>>>>>> Stop: 3758096376 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-1, >>>>>>>>>> Err: >>>>>>>>>>>>>> -1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: >>>>>>>>>>>>>> 1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: >>>>>>>>>>>>>> 536870911 , >>>>>>>>>>>>>> Stop: 1073741821 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-4, >>>>>>>>>> Err: >>>>>>>>>>>>>> -1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: >>>>>>>>>>>>>> 2147483643 , >>>>>>>>>>>>>> Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start: >>>>>>>>>>>>>> 2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: >>>>>>>>>>>>>> 3221225465 , >>>>>>>>>>>>>> Hash: 1 ], >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.522841] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-3: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.522841] and [2017-01-10 02:46:26.522894] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.522898] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3: >>>>>>>>>>>>>> 
Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523115] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-6: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.523115] and [2017-01-10 02:46:26.523143] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523147] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523302] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-2: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.523302] and [2017-01-10 02:46:26.523324] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523328] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2: >>>>>>>>>>>>>> Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [root at glusterfs3 Log_Files]# gluster --version >>>>>>>>>>>>>> >>>>>>>>>>>>>> glusterfs 3.7.18 built on Dec 8 2016 06:34:26 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [root at glusterfs3 Log_Files]# gluster volume info >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Volume Name: StoragePool >>>>>>>>>>>>>> >>>>>>>>>>>>>> Type: Distributed-Disperse >>>>>>>>>>>>>> >>>>>>>>>>>>>> Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f >>>>>>>>>>>>>> >>>>>>>>>>>>>> Status: Started >>>>>>>>>>>>>> >>>>>>>>>>>>>> Number of Bricks: 8 x (2 + 1) = 24 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Transport-type: tcp >>>>>>>>>>>>>> >>>>>>>>>>>>>> Bricks: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick1: glusterfs1sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick2: glusterfs2sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick3: glusterfs3sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick4: glusterfs1sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick5: glusterfs2sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick6: glusterfs3sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick7: glusterfs1sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick8: glusterfs2sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick9: glusterfs3sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick10: glusterfs1sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick11: glusterfs2sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick12: glusterfs3sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick13: glusterfs1sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick14: glusterfs2sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick15: glusterfs3sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick16: glusterfs1sds:/ws/disk6/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick17: glusterfs2sds:/ws/disk6/ws_brick 
>>>>>>>>>>>>>> Brick18: glusterfs3sds:/ws/disk6/ws_brick
>>>>>>>>>>>>>> Brick19: glusterfs1sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick20: glusterfs2sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick21: glusterfs3sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick22: glusterfs1sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>> Brick23: glusterfs2sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>> Brick24: glusterfs3sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>>>>>> diagnostics.client-log-level: INFO
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>>> Ram
If you have received the message by mistake, please >>>>>>>>>>> advise the sender by reply >>>>>>>>>> email and delete the message. Thank you." >>>>>>>>>>> **************************************************************** >>>>>>>>>>> ** >>>>>>>>>>> *** >>>>>>>>>>> * >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ***************************Legal >>>>>>>>>> Disclaimer*************************** >>>>>>>>>> "This communication may contain confidential and privileged >>>>>>>>>> material for the sole use of the intended recipient. Any >>>>>>>>>> unauthorized review, use or distribution by others is strictly >>>>>>>>>> prohibited. If you have received the message by mistake, please >>>>>>>>>> advise the sender by reply email and delete the message. Thank you." >>>>>>>>>> ***************************************************************** >>>>>>>>>> ** >>>>>>>>>> *** >>>>>>>>> >>>>>>>>> ***************************Legal >>>>>>>>> Disclaimer*************************** >>>>>>>>> "This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message by mistake, please advise the sender by reply email and delete the message. Thank you." >>>>>>>>> ****************************************************************** >>>>>>>>> ** >>>>>>>>> ** >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Gluster-users mailing list >>>>>>>>> Gluster-users at gluster.org <mailto:Gluster-users at gluster.org> >>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-users >>>>>>>>> ***************************Legal >>>>>>>>> Disclaimer*************************** >>>>>>>>> "This communication may contain confidential and privileged >>>>>>>>> material for the sole use of the intended recipient. Any >>>>>>>>> unauthorized review, use or distribution by others is strictly >>>>>>>>> prohibited. If you have received the message by mistake, please advise the sender by reply email and delete the message. Thank you." >>>>>>>>> ****************************************************************** >>>>>>>>> ** >>>>>>>>> ** >>>>>>>>> >>>>>>>> >>>>>>> ***************************Legal >>>>>>> Disclaimer*************************** >>>>>>> "This communication may contain confidential and privileged material >>>>>>> for the sole use of the intended recipient. Any unauthorized review, >>>>>>> use or distribution by others is strictly prohibited. If you have >>>>>>> received the message by mistake, please advise the sender by reply email and delete the message. Thank you." >>>>>>> ******************************************************************** >>>>>>> ** >>>>>>> >>>>>> >>>>>> ***************************Legal >>>>>> Disclaimer*************************** >>>>>> "This communication may contain confidential and privileged material >>>>>> for the sole use of the intended recipient. Any unauthorized review, >>>>>> use or distribution by others is strictly prohibited. If you have >>>>>> received the message by mistake, please advise the sender by reply email and delete the message. Thank you." >>>>>> ********************************************************************* >>>>>> * >>>>>> >>>>>> >>>>> >>>> ***************************Legal Disclaimer*************************** >>>> "This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. 
If you have received the message by mistake, please advise the sender by reply email and delete the message. Thank you." >>>> ********************************************************************** >>>> >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org> >>>> http://www.gluster.org/mailman/listinfo/gluster-devel >>>> ***************************Legal Disclaimer*************************** >>>> "This communication may contain confidential and privileged material for the >>>> sole use of the intended recipient. Any unauthorized review, use or distribution >>>> by others is strictly prohibited. If you have received the message by mistake, >>>> please advise the sender by reply email and delete the message. Thank you." >>>> ********************************************************************** >>>> >>> >> ***************************Legal Disclaimer*************************** >> "This communication may contain confidential and privileged material for the >> sole use of the intended recipient. Any unauthorized review, use or distribution >> by others is strictly prohibited. If you have received the message by mistake, >> please advise the sender by reply email and delete the message. Thank you." >> ********************************************************************** >> > > > > ***************************Legal Disclaimer*************************** > "This communication may contain confidential and privileged material for the > sole use of the intended recipient. Any unauthorized review, use or > distribution > by others is strictly prohibited. If you have received the message by > mistake, > please advise the sender by reply email and delete the message. Thank you." > ********************************************************************** > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > http://lists.gluster.org/mailman/listinfo/gluster-users > ***************************Legal Disclaimer*************************** > "This communication may contain confidential and privileged material for the > sole use of the intended recipient. Any unauthorized review, use or > distribution > by others is strictly prohibited. If you have received the message by > mistake, > please advise the sender by reply email and delete the message. Thank you." > ********************************************************************** > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > http://lists.gluster.org/mailman/listinfo/gluster-users > > > > ***************************Legal Disclaimer*************************** > "This communication may contain confidential and privileged material for the > sole use of the intended recipient. Any unauthorized review, use or > distribution > by others is strictly prohibited. If you have received the message by > mistake, > please advise the sender by reply email and delete the message. Thank you." > *************************************************************************************************Legal Disclaimer*************************** "This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message by mistake, please advise the sender by reply email and delete the message. Thank you." 
**********************************************************************
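The mismatch warnings quoted above all follow fixed patterns, so it is easy to gauge how widespread the problem is before chasing individual files. A minimal sketch, assuming the default log directory /var/log/glusterfs and a FUSE mount log named ws-glus.log as in this thread (adjust both names to your setup):

    # count EC mismatch and heal-failure events per disperse subvolume
    grep -hE "Mismatching (xdata|dictionary|iatt) in answers|Heal failed|is different in two" \
        /var/log/glusterfs/ws-glus.log /var/log/glusterfs/glustershd.log \
      | grep -oE '[[:alnum:]]+-disperse-[0-9]+' \
      | sort | uniq -c | sort -rn

A subvolume that dominates the counts is the natural first candidate for the brick-connection and xattr checks discussed in the replies below.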
Ashish Pandey
2017-Jan-20 07:56 UTC
[Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
Now, as Xavi has explained the 2 probable main issues, we can start to debug them one by one.

Problem 2 -
- Check if shd is running or not: "gluster v status <vol name>"
- See if you can find this file in "gluster v heal <vol name> info"
- Run heal manually using "gluster v heal <vol name>" or "gluster v heal <vol name> full"
- Check if you are still seeing the file in heal info or not. Also, check the xattrs of the specific file.
- If you are still seeing different versions and sizes on one brick, that means heal has not happened and, above all, this is the most important issue.
- You should also check if the brick connections are fine or not.

For problem 1 -
- You have to depend on logs and see why a particular file has not been written on a brick. Check all the logs: mount logs, shd logs and also brick logs. It may be that at that point of time some connection issue was triggered, and that could have been logged in the brick logs.

Ashish

----- Original Message -----

From: "Xavier Hernandez" <xhernandez at datalab.es>
To: "Ankireddypalle Reddy" <areddy at commvault.com>, "Ashish Pandey" <aspandey at redhat.com>
Cc: gluster-users at gluster.org, "Gluster Devel (gluster-devel at gluster.org)" <gluster-devel at gluster.org>
Sent: Friday, January 20, 2017 1:11:24 PM
Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume

Hi Ram,

On 20/01/17 08:02, Ankireddypalle Reddy wrote:
> Ashish,
>
> Thanks for looking in to the issue. In the given
> example the size/version matches for file on glusterfs4 and glusterfs5
> nodes. The file is empty on glusterfs6. Now what happens if glusterfs5
> goes down. Though the SLA factor of 2 is met still I will not be able
> to access the data.

True, but having a brick with inconsistent data is the same as having it
down. You would have lost 2 bricks out of 3.

The problem is how to detect the inconsistent data, what is causing it
and why self-heal (apparently) is not healing it.

> The problem is that writes did not fail for the file
> to indicate the issue to the application.

That's the expected behavior. Since we have redundancy 1, a loss or
failure on a single fragment is hidden from the application. However
this triggers internal procedures to repair the problem on the
background. This also seems to not be working.

There are two issues to identify here:

1. Why the write fails in first place
2. Why self-heal is unable to repair it

Probably the root cause is the same for both problems, but I'm not sure.

For 1 there should be some warning or error in the mount log, since one
of the bricks is reporting an error, even if ec is able to report
success to the application later.

For 2 the analysis could be more complex, but most probably there should
be some warning or error message in the mount log and/or self-heal log
of one of the servers.

Xavi

> It's not that we are
> encountering issue for every file on the mount point. The issue happens
> randomly for different files.
> > > > [root at glusterfs4 glusterfs]# gluster volume info > > > > Volume Name: glusterfsProd > > Type: Distributed-Disperse > > Volume ID: 622aa3ee-958f-485f-b3c1-fb0f6c8db34c > > Status: Started > > Number of Bricks: 12 x (2 + 1) = 36 > > Transport-type: tcp > > Bricks: > > Brick1: glusterfs4sds:/ws/disk1/ws_brick > > Brick2: glusterfs5sds:/ws/disk1/ws_brick > > Brick3: glusterfs6sds:/ws/disk1/ws_brick > > Brick4: glusterfs4sds:/ws/disk10/ws_brick > > Brick5: glusterfs5sds:/ws/disk10/ws_brick > > Brick6: glusterfs6sds:/ws/disk10/ws_brick > > Brick7: glusterfs4sds:/ws/disk11/ws_brick > > Brick8: glusterfs5sds:/ws/disk11/ws_brick > > Brick9: glusterfs6sds:/ws/disk11/ws_brick > > Brick10: glusterfs4sds:/ws/disk2/ws_brick > > Brick11: glusterfs5sds:/ws/disk2/ws_brick > > Brick12: glusterfs6sds:/ws/disk2/ws_brick > > Brick13: glusterfs4sds:/ws/disk3/ws_brick > > Brick14: glusterfs5sds:/ws/disk3/ws_brick > > Brick15: glusterfs6sds:/ws/disk3/ws_brick > > Brick16: glusterfs4sds:/ws/disk4/ws_brick > > Brick17: glusterfs5sds:/ws/disk4/ws_brick > > Brick18: glusterfs6sds:/ws/disk4/ws_brick > > Brick19: glusterfs4sds:/ws/disk5/ws_brick > > Brick20: glusterfs5sds:/ws/disk5/ws_brick > > Brick21: glusterfs6sds:/ws/disk5/ws_brick > > Brick22: glusterfs4sds:/ws/disk6/ws_brick > > Brick23: glusterfs5sds:/ws/disk6/ws_brick > > Brick24: glusterfs6sds:/ws/disk6/ws_brick > > Brick25: glusterfs4sds:/ws/disk7/ws_brick > > Brick26: glusterfs5sds:/ws/disk7/ws_brick > > Brick27: glusterfs6sds:/ws/disk7/ws_brick > > Brick28: glusterfs4sds:/ws/disk8/ws_brick > > Brick29: glusterfs5sds:/ws/disk8/ws_brick > > Brick30: glusterfs6sds:/ws/disk8/ws_brick > > Brick31: glusterfs4sds:/ws/disk9/ws_brick > > Brick32: glusterfs5sds:/ws/disk9/ws_brick > > Brick33: glusterfs6sds:/ws/disk9/ws_brick > > Brick34: glusterfs4sds:/ws/disk12/ws_brick > > Brick35: glusterfs5sds:/ws/disk12/ws_brick > > Brick36: glusterfs6sds:/ws/disk12/ws_brick > > Options Reconfigured: > > storage.build-pgfid: on > > performance.readdir-ahead: on > > nfs.export-dirs: off > > nfs.export-volumes: off > > nfs.disable: on > > auth.allow: glusterfs4sds,glusterfs5sds,glusterfs6sds > > diagnostics.client-log-level: INFO > > [root at glusterfs4 glusterfs]# > > > > Thanks and Regards, > > Ram > > *From:*Ashish Pandey [mailto:aspandey at redhat.com] > *Sent:* Thursday, January 19, 2017 10:36 PM > *To:* Ankireddypalle Reddy > *Cc:* Xavier Hernandez; gluster-users at gluster.org; Gluster Devel > (gluster-devel at gluster.org) > *Subject:* Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in > disperse volume > > > > Ram, > > > > I don't understand what do you mean by saying "redundancy factor of 2 is > met in a 3:1 disperse volume". > You have given the xattr's of only 3 bricks. > > All the above 2 sentence and output of getxattr contradicts each other. > > In the given scenario, if you have (2+1) ec configuration, and 2 bricks > are having same size and version, then there should not be > > any problem to access this file. Run heal and 3rd fragment will also > bee healthy. > > > > I think there has been major gap in providing the complete and correct > information about the volume and all the logs and activities. > Could you please provide the following - > > 1 - gluster v info - please give us the output of this command > > 2 - Let's consider only one file which you are not able to access and > find out the reason. > > 3 - Try to create and write some files on mount point and see if there > is any issue with new file creation if yes, why, provide logs. 
> > > > Specific but enough information is required to find the root cause of > the issue. > > > > Ashish > > > > > > > > ------------------------------------------------------------------------ > > *From: *"Ankireddypalle Reddy" <areddy at commvault.com > <mailto:areddy at commvault.com>> > *To: *"Ankireddypalle Reddy" <areddy at commvault.com > <mailto:areddy at commvault.com>>, "Xavier Hernandez" > <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> > *Cc: *gluster-users at gluster.org <mailto:gluster-users at gluster.org>, > "Gluster Devel (gluster-devel at gluster.org > <mailto:gluster-devel at gluster.org>)" <gluster-devel at gluster.org > <mailto:gluster-devel at gluster.org>> > *Sent: *Friday, January 20, 2017 3:20:46 AM > *Subject: *Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in > disperse volume > > > > Xavi, > Finally I am able to locate the files for which we are > seeing the mismatch. > > > > [2017-01-19 21:20:10.002737] W [MSGID: 122056] > [ec-combine.c:875:ec_combine_check] 0-glusterfsProd-disperse-0: > Mismatching xdata in answers of 'LOOKUP' for > f5e61224-17f6-4803-9f75-dd74cd79e248 > [2017-01-19 21:20:10.002776] W [MSGID: 122006] > [ec-combine.c:207:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to > combine iatt (inode: 11490333518038950472-11490333518038950472, links: > 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 4800512-4734976, mode: > 100775-100775) for f5e61224-17f6-4803-9f75-dd74cd79e248 > > > > GFID f5e61224-17f6-4803-9f75-dd74cd79e248 maps to file > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > > > > The size field differs between the bricks of this sub volume. > > > > [root at glusterfs4 glusterfs]# getfattr -d -e hex -m . > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > trusted.ec.size=0x00000000000e0000 > trusted.ec.version=0x00000000000000080000000000000008 > > > > [root at glusterfs5 ws]# getfattr -d -e hex -m . > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > trusted.ec.size=0x00000000000e0000 > trusted.ec.version=0x00000000000000080000000000000008 > > > > [root at glusterfs6 glusterfs]# getfattr -d -e hex -m . > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > getfattr: Removing leading '/' from absolute path names > trusted.ec.size=0x0000000000000000 > trusted.ec.version=0x00000000000000000000000000000008 > > > > On glusterfs6 writes seem to have failed and the file appears to be an > empty file. > [root at glusterfs6 glusterfs]# stat > /ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158 > File: > '/ws/disk1/ws_brick/Folder_01.05.2017_21.15/CV_MAGNETIC/V_32410/CHUNK_424795/SFILE_CONTAINER_158' > Size: 0 Blocks: 0 IO Block: 4096 regular > empty file > > > > This looks like a major issue. Since we are left with no redundancy for > these files and if the brick on other node fails then we will not be > able to access this data though the redundancy factor of 2 is met in a > 3:1 disperse volume. 
> > > > Thanks and Regards, > Ram > > > > -----Original Message----- > From: gluster-users-bounces at gluster.org > <mailto:gluster-users-bounces at gluster.org> > [mailto:gluster-users-bounces at gluster.org] On Behalf Of Ankireddypalle Reddy > Sent: Monday, January 16, 2017 9:41 AM > To: Xavier Hernandez > Cc: gluster-users at gluster.org <mailto:gluster-users at gluster.org>; > Gluster Devel (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>) > Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in > disperse volume > > > > Xavi, > Thanks. I will start by tracking the delta if any in return > codes from the bricks for writes in EC. I have a feeling that if we > could get to the bottom of this issue then the EIO errors could be > ultimately avoided. Please note that while we are testing all these on > a standby setup we continue to encounter the EIO errors on our > production setup which is running @ 3.7.18. > > > > Thanks and Regards, > Ram > > > > -----Original Message----- > From: Xavier Hernandez [mailto:xhernandez at datalab.es] > Sent: Monday, January 16, 2017 6:54 AM > To: Ankireddypalle Reddy > Cc: gluster-users at gluster.org <mailto:gluster-users at gluster.org>; > Gluster Devel (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>) > Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in > disperse volume > > > > Hi Ram, > > > > On 16/01/17 12:33, Ankireddypalle Reddy wrote: >> Xavi, >> Thanks. Is there any other way to map from GFID to path. > > > > The only way I know is to search all files from bricks and lookup for > the trusted.gfid xattr. > > > >> I will look for a way to share the TRACE logs. Easier way might be to add some extra logging. I could do that if you could let me know functions in which you are interested.. > > > > The problem is that I don't know where the problem is. One possibility > could be to track all return values from all bricks for all writes and > then identify which ones belong to an inconsistent file. > > > > But if this doesn't reveal anything interesting we'll need to look at > some other place. And this can be very tedious and slow. > > > > Anyway, what we are looking now is not the source of an EIO, since there > are two bricks with consistent state and the file should be perfectly > readable and writable. It's true that there's some problem here and it > could derive in EIO if one of the healthy bricks degrades, but at least > this file shouldn't be giving EIO errors for now. > > > > Xavi > > > >> >> Sent on from my iPhone >> >>> On Jan 16, 2017, at 6:23 AM, Xavier Hernandez <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote: >>> >>> Hi Ram, >>> >>>> On 13/01/17 18:41, Ankireddypalle Reddy wrote: >>>> Xavi, >>>> I enabled TRACE logging. The log grew up to 120GB and could not make much out of it. Then I started logging GFID in the code where we were seeing errors. 
>>>> >>>> [2017-01-13 17:02:01.761349] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bc690 ((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:)) >>>> [2017-01-13 17:02:01.761360] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 ((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:)) >>>> [2017-01-13 17:02:01.761365] W [MSGID: 122056] [ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-13 17:02:01.761405] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8) >>>> [2017-01-13 17:02:01.761417] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bbb14 ((trusted.ec.size:0:0:0:0:30:6b:0:0:)(trusted.ec.version:0:0:0:0:0:0:2a:38:0:0:0:0:0:0:2a:38:)) >>>> [2017-01-13 17:02:01.761428] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7fa6706bed64 ((trusted.ec.size:0:0:0:0:0:0:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:0:0:0:0:0:0:0:2a:38:)) >>>> [2017-01-13 17:02:01.761433] W [MSGID: 122056] [ec-combine.c:881:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-13 17:02:01.761442] W [MSGID: 122006] [ec-combine.c:214:ec_iatt_combine] 0-glusterfsProd-disperse-0: Failed to combine iatt (inode: 11275691004192850514-11275691004192850514, gfid: 60b990ed-d741-4176-9c7b-4d3a25fb8252 - 60b990ed-d741-4176-9c7b-4d3a25fb8252, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0,size: 406650880-406683648, > mode: 100775-100775) >>>> >>>> The file for which we are seeing this error turns out to be having a GFID of 60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> >>>> Then I tried looking for find out the file with this GFID. It pointed me to following path. I was expecting a real file system path from the following turorial: >>>> https://gluster.readthedocs.io/en/latest/Troubleshooting/gfid-to-path/ >>> >>> I think this method only works if bricks have the inode cached. >>> >>>> >>>> getfattr -n trusted.glusterfs.pathinfo -e text /mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: mnt/gfid/.gfid/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> trusted.glusterfs.pathinfo="(<DISTRIBUTE:glusterfsProd-dht> (<EC:glusterfsProd-disperse-0> <POSIX(/ws/disk1/ws_brick):glusterfs6:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252> <POSIX(/ws/disk1/ws_brick):glusterfs5:/ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252>))" >>>> >>>> Then I looked for the xatttrs for these files from all the 3 bricks >>>> >>>> [root at glusterfs4 glusterfs]# getfattr -d -m . -e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> trusted.bit-rot.version=0x02000000000000005877a8dc00041138 >>>> trusted.ec.config=0x0000080301000200 >>>> trusted.ec.size=0x0000000000000000 >>>> trusted.ec.version=0x00000000000000000000000000002a38 >>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252 >>>> >>>> [root at glusterfs5 bricks]# getfattr -d -m . 
-e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> trusted.bit-rot.version=0x02000000000000005877a8dc000c92d0 >>>> trusted.ec.config=0x0000080301000200 >>>> trusted.ec.dirty=0x00000000000000160000000000000000 >>>> trusted.ec.size=0x00000000306b0000 >>>> trusted.ec.version=0x0000000000002a380000000000002a38 >>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252 >>>> >>>> [root at glusterfs6 ee]# getfattr -d -m . -e hex /ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> getfattr: Removing leading '/' from absolute path names >>>> # file: ws/disk1/ws_brick/.glusterfs/60/b9/60b990ed-d741-4176-9c7b-4d3a25fb8252 >>>> trusted.bit-rot.version=0x02000000000000005877a8dc000c9436 >>>> trusted.ec.config=0x0000080301000200 >>>> trusted.ec.dirty=0x00000000000000160000000000000000 >>>> trusted.ec.size=0x00000000306b0000 >>>> trusted.ec.version=0x0000000000002a380000000000002a38 >>>> trusted.gfid=0x60b990edd74141769c7b4d3a25fb8252 >>>> >>>> It turns out that the size and version in fact does not match for one of the files. >>> >>> It seems as if the brick on glusterfs4 didn't receive any write request (or they failed for some reason). Do you still have the trace log ? is there any way I could download it ? >>> >>> Xavi >>> >>>> >>>> Thanks and Regards, >>>> Ram >>>> >>>> -----Original Message----- >>>> From: gluster-devel-bounces at gluster.org > <mailto:gluster-devel-bounces at gluster.org> > [mailto:gluster-devel-bounces at gluster.org] On Behalf Of Ankireddypalle Reddy >>>> Sent: Friday, January 13, 2017 4:17 AM >>>> To: Xavier Hernandez >>>> Cc: gluster-users at gluster.org <mailto:gluster-users at gluster.org>; Gluster > Devel (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>) >>>> Subject: Re: [Gluster-devel] [Gluster-users] Lot of EIO errors in disperse volume >>>> >>>> Xavi, >>>> Thanks for explanation. Will collect TRACE logs today. >>>> >>>> Thanks and Regards, >>>> Ram >>>> >>>> Sent from my iPhone >>>> >>>>> On Jan 13, 2017, at 3:03 AM, Xavier Hernandez <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote: >>>>> >>>>> Hi Ram, >>>>> >>>>>> On 12/01/17 22:14, Ankireddypalle Reddy wrote: >>>>>> Xavi, >>>>>> I changed the logging to log the individual bytes. Consider the following from ws-glus.log file where /ws/glus is the mount point. >>>>>> >>>>>> [2017-01-12 20:47:59.368102] I [MSGID: 109063] >>>>>> [dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found >>>>>> anomalies in >>>>>> /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid = >>>>>> e694387f-dde7-410b-9562-914a994d5e85). 
Holes=1 overlaps=0 >>>>>> [2017-01-12 20:47:59.391218] I [MSGID: 109036] >>>>>> [dht-common.c:9082:dht_log_new_layout_for_dir_selfheal] >>>>>> 0-glusterfsProd-dht: Setting layout of >>>>>> /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 with >>>>>> [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 , >>>>>> Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, >>>>>> Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], >>>>>> [Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 3221225469 >>>>>> , Stop: 3579139409 , Hash: 1 ], [Subvol_name: >>>>>> glusterfsProd-disperse-11, Err: -1 , Start: 3579139410 , Stop: >>>>>> 3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: >>>>>> -1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: >>>>>> glusterfsProd-disperse-3, Err: -1 , Start: 0 , Stop: 357913940 , >>>>>> Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: >>>>>> 357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name: >>>>>> glusterfsProd-disperse-5, Err: -1 , Start: 715827882 , Stop: >>>>>> 1073741822 , Hash >>>> : 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: > glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 , Stop: 2505397586 > , Hash: 1 ], >>>>>> >>>>>> Self-heal seems to be triggered for path /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies as per DHT. It would be great if someone could explain what could be the anomaly here. The setup where we encountered this is a fairly stable setup with no brick failures or no node failures. >>>>> >>>>> This is not really a self-heal, at least from the point of view of ec. This means that DHT has found a discrepancy in the layout of that directory, however this doesn't mean any problem (notice the 'I' in the log, meaning that it's informative, not a warning nor error). >>>>> >>>>> Not sure how DHT works in this case or why it finds this "anomaly", but if there aren't any previous errors before that message, it can be completely ignored. >>>>> >>>>> Not sure if it can be related to option cluster.weighted-rebalance that is enabled by default. >>>>> >>>>>> >>>>>> Then Self-heal seems to have encountered the following error. >>>>>> >>>>>> [2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp] >>>>>> 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two >>>>>> dicts (16, 16) >>>>>> [2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b649520ac >>>>>> ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted >>>>>> .ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:)) >>>>>> [2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0 >>>>>> ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted >>>>>> .ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:)) >>>>>> [2017-01-12 20:48:23.418531] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-2: Mismatching xdata in answers of 'LOOKUP' >>>>> >>>>> That's a real problem. Here we have two bricks that differ in the trusted.ec.version xattr. However this xattr not necessarily belongs to the previous directory. They are unrelated messages. 
>>>>> >>>>>> >>>>>> In this case glusterfsProd-disperse-2 sub volume actually consists of the following bricks. >>>>>> glusterfs4sds:/ws/disk11/ws_brick, glusterfs5sds: >>>>>> /ws/disk11/ws_brick, glusterfs6sds: /ws/disk11/ws_brick >>>>>> >>>>>> I went ahead and checked the value of trusted.ec.version on all the 3 bricks inside this sub vol: >>>>>> >>>>>> [root at glusterfs6 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> trusted.ec.version=0x0000000000000009000000000000000b >>>>>> >>>>>> [root at glusterfs4 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> trusted.ec.version=0x0000000000000009000000000000000b >>>>>> >>>>>> [root at glusterfs5 glusterfs]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 >>>>>> trusted.ec.version=0x0000000000000009000000000000000b >>>>>> >>>>>> The attribute value seems to be same on all the 3 bricks. >>>>> >>>>> That's a clear indication that the ec warning is not related to this directory because trusted.ec.version always increases, never decreases, and the directory has a value smaller that the one that appears in the log message. >>>>> >>>>> If you show all dict entries in the log, it seems that it does refer to a directory because trusted.ec.size is not present, but it must be another directory than the one you looked at. We would need to find which one is having this issue. The TRACE log would be helpful here. >>>>> >>>>> >>>>>> >>>>>> Also please note that every single time that the trusted.ec.version was found to mismatch the same values are getting logged. Following are 2 more instances of trusted.ec.version mismatch. 
>>>>>> >>>>>> [2017-01-12 20:14:25.554540] I [dict.c:166:key_value_cmp] >>>>>> 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two >>>>>> dicts (16, 16) >>>>>> [2017-01-12 20:14:25.554588] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b6495a9f0 >>>>>> ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0: >>>>>> 0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0: >>>>>> 0:0:e:)) >>>>>> [2017-01-12 20:14:25.554608] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b6495903c >>>>>> ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0: >>>>>> 0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0: >>>>>> 0:0:e:)) >>>>>> [2017-01-12 20:14:25.554624] W [MSGID: 122053] >>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-2: >>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>> good=3, bad=4) >>>>>> [2017-01-12 20:14:25.554632] W [MSGID: 122002] >>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-2: Heal >>>>>> failed [Invalid argument] >>>>>> [2017-01-12 20:14:25.555598] I [dict.c:166:key_value_cmp] >>>>>> 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two >>>>>> dicts (16, 16) >>>>>> [2017-01-12 20:14:25.555622] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b64956c24 >>>>>> ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0: >>>>>> 0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0: >>>>>> 0:0:e:)) >>>>>> [2017-01-12 20:14:25.555638] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-2: dict=0x7f0b64964e8c >>>>>> ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0: >>>>>> 0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0: >>>>>> 0:0:e:)) >>>>> >>>>> >>>>> I think that this refers to the same directory. This seems an attempt to heal it that has failed. So it makes sense that it finds exactly the same values. >>>>> >>>>>> >>>>>> >>>>>> In glustershd.log lot of similar errors are logged. >>>>>> >>>>>> [2017-01-12 21:10:53.728770] I [dict.c:166:key_value_cmp] >>>>>> 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two >>>>>> dicts (8, 8) >>>>>> [2017-01-12 21:10:53.728804] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-0: dict=0x7f21694b6f50 >>>>>> ((trusted.ec.size:0:0:0:0:42:3a:0:0:)(trusted.ec.version:0:0:0:0:0:0: >>>>>> 37:5f:0:0:0:0:0:0:37:5f:)) >>>>>> [2017-01-12 21:10:53.728827] I [dict.c:3065:dict_dump_to_log] >>>>>> 0-glusterfsProd-disperse-0: dict=0x7f21694b62bc >>>>>> ((trusted.ec.size:0:0:0:0:0:ca:0:0:)(trusted.ec.version:0:0:0:0:0:0:0 >>>>>> :a1:0:0:0:0:0:0:37:5f:)) >>>>>> [2017-01-12 21:10:53.728842] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP' >>>>>> [2017-01-12 21:10:53.728854] W [MSGID: 122053] >>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-0: >>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>> good=6, bad=1) >>>>>> [2017-01-12 21:10:53.728876] W [MSGID: 122002] >>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-0: Heal >>>>>> failed [Invalid argument] >>>>> >>>>> This seems an attempt to heal a file, but I see a lot of differences between both versions. The size on one brick is 13.238.272 bytes, but on the other brick it's 1.111.097.344 bytes. That's a huge difference. 
>>>>> >>>>> Looking at the trusted.ec.version, I see that the 'data' version is very different (from 161 to 14.175), however the metadata version is exactly the same. This really seems like a lot of writes while one brick was down (or disconnected for some reason, or writes failed for some reason). One brick has lost about 14.000 > writes of ~80KB. >>>>> >>>>> I think the most important thing right now would be to identify which files and directories are having these problems to be able to identify the cause. Again, the TRACE log will be really useful. >>>>> >>>>> Xavi >>>>> >>>>>> >>>>>> Thanks and Regards, >>>>>> Ram >>>>>> >>>>>> -----Original Message----- >>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es] >>>>>> Sent: Thursday, January 12, 2017 6:40 AM >>>>>> To: Ankireddypalle Reddy >>>>>> Cc: Gluster Devel (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>); >>>>>> gluster-users at gluster.org <mailto:gluster-users at gluster.org> >>>>>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in >>>>>> disperse volume >>>>>> >>>>>> Hi Ram, >>>>>> >>>>>> >>>>>>> On 12/01/17 11:49, Ankireddypalle Reddy wrote: >>>>>>> Xavi, >>>>>>> As I mentioned before the error could happen for any FOP. Will try to run with TRACE debug level. Is there a possibility that we are checking for this attribute on a directory, because a directory does not seem to be having this attribute set. >>>>>> >>>>>> No, directories do not have this attribute and no one should be reading it from a directory. >>>>>> >>>>>>> Also is the function to check size and version called after it is decided that heal should be run or is this check is the one which decides whether a heal should be run. >>>>>> >>>>>> Almost all checks that trigger a heal are done in the lookup fop when some discrepancy is detected. >>>>>> >>>>>> The function that checks size and version is called later once a lock on the inode is acquired (even if no heal is needed). However further failures in the processing of any fop can also trigger a self-heal. >>>>>> >>>>>> Xavi >>>>>> >>>>>>> >>>>>>> Thanks and Regards, >>>>>>> Ram >>>>>>> >>>>>>> Sent from my iPhone >>>>>>> >>>>>>>> On Jan 12, 2017, at 2:25 AM, Xavier Hernandez <xhernandez at datalab.es <mailto:xhernandez at datalab.es>> wrote: >>>>>>>> >>>>>>>> Hi Ram, >>>>>>>> >>>>>>>>> On 12/01/17 02:36, Ankireddypalle Reddy wrote: >>>>>>>>> Xavi, >>>>>>>>> I added some more logging information. The trusted.ec.size field values are in fact different. >>>>>>>>> trusted.ec.size l1 = 62719407423488 l2 = 0 >>>>>>>> >>>>>>>> That's very weird. Directories do not have this attribute. It's only present on regular files. But you said that the error happens while creating the file, so it doesn't make much sense because file creation always sets trusted.ec.size to 0. >>>>>>>> >>>>>>>> Could you reproduce the problem with diagnostics.client-log-level set to TRACE and send the log to me ? it will create a big log, but I'll have much more information about what's going on. >>>>>>>> >>>>>>>> Do you have a mixed setup with nodes of different types ? for example mixed 32/64 bits architectures or different operating systems ? I ask this because 62719407423488 in hex is 0x390B00000000, which has the lower 32 bits set to 0, but has garbage above that. >>>>>>>> >>>>>>>>> >>>>>>>>> This is a fairly static setup with no brick/ node failure. Please explain why is that a heal is being triggered and what could have acutually caused these size xattrs to differ. 
This is causing random I/O failures and is impacting the backup schedules. >>>>>>>> >>>>>>>> The launch of self-heal is normal because it has detected an inconsistency. The real problem is what originates that inconsistency. >>>>>>>> >>>>>>>> Xavi >>>>>>>> >>>>>>>>> >>>>>>>>> [ 2017-01-12 01:19:18.256970] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-12 01:19:18.257015] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=3, bad=4) >>>>>>>>> [2017-01-12 01:19:18.257018] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-12 01:19:21.002056] E [dict.c:166:log_value] >>>>>>>>> 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 >>>>>>>>> l2 = 0 i1 = 0 i2 = 0 ] >>>>>>>>> [2017-01-12 01:19:21.002064] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-12 01:19:21.209673] E [dict.c:166:log_value] >>>>>>>>> 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 >>>>>>>>> l2 = 0 i1 = 0 i2 = 0 ] >>>>>>>>> [2017-01-12 01:19:21.209686] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-12 01:19:21.209719] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-12 01:19:21.209753] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-4: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> >>>>>>>>> Thanks and Regards, >>>>>>>>> Ram >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: Ankireddypalle Reddy >>>>>>>>> Sent: Wednesday, January 11, 2017 9:29 AM >>>>>>>>> To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel >>>>>>>>> (gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>); > gluster-users at gluster.org <mailto:gluster-users at gluster.org> >>>>>>>>> Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in >>>>>>>>> disperse volume >>>>>>>>> >>>>>>>>> Xavi, >>>>>>>>> I built a debug binary to log more information. This is what is getting logged. Looks like it is the attribute trusted.ec.size which is different among the bricks in a sub volume. 
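Xavi's 32/64-bit suspicion about the bogus size is straightforward to check; a quick verification of the l1 value quoted in the log lines above (plain Python, same numbers as in the exchange):

    # The suspicious trusted.ec.size value (l1) from the exchange above.
    v = 62719407423488
    print(hex(v))           # 0x390b00000000
    print(v & 0xffffffff)   # 0     -> the lower 32 bits are all zero
    print(v >> 32)          # 14603 -> 0x390b of garbage in the upper 32 bits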
>>>>>>>>> >>>>>>>>> In glustershd.log : >>>>>>>>> >>>>>>>>> [2017-01-11 14:19:45.023845] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching iatt in answers of 'GF_FOP_LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.027736] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.027763] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.027781] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.027793] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-11 14:19:45.027815] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-11 14:19:45.029035] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.029057] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.029089] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.029105] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.029121] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-11 14:19:45.032566] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.029138] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-11 14:19:45.032585] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.032614] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.032631] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.032638] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-11 14:19:45.032654] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-11 14:19:45.037514] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) 
>>>>>>>>> [2017-01-11 14:19:45.037536] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.037553] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:19:45.037573] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> [2017-01-11 14:19:45.037582] W [MSGID: 122053] >>>>>>>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>>>>>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>>>>>>> good=6, bad=1) >>>>>>>>> [2017-01-11 14:19:45.037599] W [MSGID: 122002] >>>>>>>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>>>>>>> failed [Invalid argument] >>>>>>>>> [2017-01-11 14:20:40.001401] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-3: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> [2017-01-11 14:20:40.001387] E [dict.c:166:key_value_cmp] >>>>>>>>> 0-glusterfsProd-disperse-5: 'trusted.ec.size' is different in two >>>>>>>>> dicts (8, 8) >>>>>>>>> >>>>>>>>> In the mount daemon log: >>>>>>>>> >>>>>>>>> [2017-01-11 14:20:17.806826] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-0: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.806847] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-0: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807076] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-1: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807099] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-1: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807286] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-10: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807298] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-10: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807409] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-11: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807420] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-11: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807448] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-4: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807462] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-4: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807539] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-2: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807550] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-2: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807723] E [MSGID: 
122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-3: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807739] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-3: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807785] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-5: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.807796] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-5: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808020] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-9: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808034] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-9: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808054] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-6: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808066] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-6: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808282] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-8: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.808292] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-8: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.809212] E [MSGID: 122001] >>>>>>>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-7: >>>>>>>>> Invalid or corrupted config [Invalid argument] >>>>>>>>> [2017-01-11 14:20:17.809228] E [MSGID: 122066] >>>>>>>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-7: >>>>>>>>> Invalid config xattr [Invalid argument] >>>>>>>>> >>>>>>>>> [2017-01-11 14:20:17.812660] I [MSGID: 109036] >>>>>>>>> [dht-common.c:8043:dht_log_new_layout_for_dir_selfheal] >>>>>>>>> 2-glusterfsProd-dht: Setting layout of >>>>>>>>> /Folder_01.05.2017_21.15/CV_MAGNETIC/V_31500/CHUNK_402578 with >>>>>>>>> [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: >>>>>>>>> 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: >>>>>>>>> glusterfsProd-disperse-1, Err: -1 , Start: 2147483646 , Stop: >>>>>>>>> 2505397586 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-10, >>>>>>>>> Err: -1 , Start: 2505397587 , Stop: 2863311527 , Hash: 1 ], >>>>>>>>> [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: >>>>>>>>> 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: >>>>>>>>> glusterfsProd-disperse-2, Err: -1 , Start: 3221225469 , Stop: >>>>>>>>> 3579139409 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: >>>>>>>>> -1 , Start: 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: >>>>>>>>> glusterfsProd-disperse-4, Err: -1 , Start: 3937053351 , Stop: >>>>>>>>> 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: >>>>>>>>> -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: >>>>>>>>> glusterfsProd-disperse-6, Err: -1 , Start: 357913941 , Stop: >>>>>>>>> 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: >>>>>>>>> -1 , Start: 715827882 , 
Stop: 1073741822 , Hash: 1 ], [Subvol_name:
>>>>>>>>> glusterfsProd-disperse-8, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ],
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Ankireddypalle Reddy
>>>>>>>>> Sent: Tuesday, January 10, 2017 10:09 AM
>>>>>>>>> To: Xavier Hernandez; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>
>>>>>>>>> Xavi,
>>>>>>>>> In this case it's the file creation which failed, so I provided the xattrs of the parent.
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Ram
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>> Sent: Tuesday, January 10, 2017 9:10 AM
>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>
>>>>>>>>> Hi Ram,
>>>>>>>>>
>>>>>>>>>> On 10/01/17 14:42, Ankireddypalle Reddy wrote:
>>>>>>>>>> Attachments (2): ec.txt (11.50 KB), ws-glus.log (3.48 MB)
>>>>>>>>>>
>>>>>>>>>> Xavi,
>>>>>>>>>> We are encountering errors for different kinds of FOPS.
>>>>>>>>>> The open failed for the following file:
>>>>>>>>>>
>>>>>>>>>> cvd_2017_01_10_02_28_26.log:98182 1f9fe 01/10 00:57:10 8414465 [MEDIAFS ] 20117519-52075477 SingleInstancer_FS::StartDataFile2: Failed to create the data file [/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062], error=0xECCC0005:{CQiFile::Open(92)} + {CQiUTFOSAPI::open(96)/ErrNo.5.(Input/output error)-Open failed, File=/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062, OperationFlag=0xC1, PermissionMode=0x1FF}
>>>>>>>>>>
>>>>>>>>>> I've attached the extended attributes for the directories /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/ and /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720 from all the bricks.
>>>>>>>>>>
>>>>>>>>>> The attributes look fine to me. I've also attached some log cuts to illustrate the problem.
>>>>>>>>>
>>>>>>>>> I need the extended attributes of the file itself, not the parent directories.
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks and Regards,
>>>>>>>>>> Ram
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>>> Sent: Tuesday, January 10, 2017 7:53 AM
>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>
>>>>>>>>>> Hi Ram,
>>>>>>>>>>
>>>>>>>>>> the error is caused by an extended attribute that does not match on all 3 bricks of the disperse set. The most probable value is trusted.ec.version, but it could be others.
>>>>>>>>>>
>>>>>>>>>> At first sight, I don't see any change from 3.7.8 that could have caused this. I'll check again.
>>>>>>>>>>
>>>>>>>>>> What kind of operations are you doing? This can help me narrow the search.
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>> On 10/01/17 13:43, Ankireddypalle Reddy wrote:
>>>>>>>>>>> Xavi,
>>>>>>>>>>> Thanks. If you could please explain what to look for in the extended attributes, then I will check and let you know if I find anything suspicious. Also, we noticed that some of these operations would succeed if retried. Do you know of any communication-related errors that are being reported/triaged?
>>>>>>>>>>>
>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>> Ram
>>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>>>> Sent: Tuesday, January 10, 2017 7:23 AM
>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>
>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>
>>>>>>>>>>>> On 10/01/17 13:14, Ankireddypalle Reddy wrote:
>>>>>>>>>>>> Attachment (1): ecxattrs.txt (5.92 KB)
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi,
>>>>>>>>>>>> Please find attached the extended attributes for a directory from all the bricks. The free space check failed for this with error number EIO.
>>>>>>>>>>>
>>>>>>>>>>> What do you mean? What operation have you made to check the free space on that directory?
>>>>>>>>>>>
>>>>>>>>>>> If it's a recursive check, I need the extended attributes from the exact file that triggers the EIO. The attached attributes seem consistent and that directory shouldn't cause any problem. Does an 'ls' on that directory fail or does it show the contents?
>>>>>>>>>>>
>>>>>>>>>>> Xavi
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>> Ram
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>>>>> Sent: Tuesday, January 10, 2017 6:45 AM
>>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>>
>>>>>>>>>>>> can you execute the following command on all bricks on a file that is giving EIO?
>>>>>>>>>>>>
>>>>>>>>>>>> getfattr -m. -e hex -d <path to file in brick>
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/17 12:41, Ankireddypalle Reddy wrote:
>>>>>>>>>>>>> Xavi,
>>>>>>>>>>>>> We have been running 3.7.8 on these servers. We upgraded to 3.7.18 yesterday. We upgraded all the servers at the same time. The volume was brought down during the upgrade.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>> Ram
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>>>>>> Sent: Tuesday, January 10, 2017 6:35 AM
>>>>>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Ram,
>>>>>>>>>>>>>
>>>>>>>>>>>>> how did you upgrade gluster? From which version?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Did you upgrade one server at a time and wait until self-heal finished before upgrading the next server?
>>>>>>>>>>>>> >>>>>>>>>>>>> Xavi >>>>>>>>>>>>> >>>>>>>>>>>>>> On 10/01/17 11:39, Ankireddypalle Reddy wrote: >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> We upgraded to GlusterFS 3.7.18 yesterday. We see lot of >>>>>>>>>>>>>> failures in our applications. Most of the errors are EIO. The >>>>>>>>>>>>>> following log lines are commonly seen in the logs: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] >>>>>>>>>>>>>> 0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'" >>>>>>>>>>>>>> repeated 2 times between [2017-01-10 02:46:25.069809] and >>>>>>>>>>>>>> [2017-01-10 02:46:25.069835] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:25.069852] W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5: >>>>>>>>>>>>>> Mismatching xdata in answers of 'LOOKUP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] >>>>>>>>>>>>>> 0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'" >>>>>>>>>>>>>> repeated 2 times between [2017-01-10 02:46:25.069852] and >>>>>>>>>>>>>> [2017-01-10 02:46:25.069873] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:25.069910] W [MSGID: 122056] >>>>>>>>>>>>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Mismatching xdata in answers of 'LOOKUP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> ... >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.520774] I [MSGID: 109036] >>>>>>>>>>>>>> [dht-common.c:9076:dht_log_new_layout_for_dir_selfheal] >>>>>>>>>>>>>> 0-StoragePool-dht: Setting layout of >>>>>>>>>>>>>> /Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 >>>>>>>>>>>>>> with >>>>>>>>>>>>>> [Subvol_name: StoragePool-disperse-0, Err: -1 , Start: >>>>>>>>>>>>>> 3221225466 , >>>>>>>>>>>>>> Stop: 3758096376 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-1, >>>>>>>>>> Err: >>>>>>>>>>>>>> -1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: >>>>>>>>>>>>>> 1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: >>>>>>>>>>>>>> 536870911 , >>>>>>>>>>>>>> Stop: 1073741821 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-4, >>>>>>>>>> Err: >>>>>>>>>>>>>> -1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: >>>>>>>>>>>>>> 2147483643 , >>>>>>>>>>>>>> Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start: >>>>>>>>>>>>>> 2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name: >>>>>>>>>>>>>> StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: >>>>>>>>>>>>>> 3221225465 , >>>>>>>>>>>>>> Hash: 1 ], >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.522841] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-3: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.522841] and [2017-01-10 02:46:26.522894] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.522898] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3: >>>>>>>>>>>>>> 
Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523115] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-6: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.523115] and [2017-01-10 02:46:26.523143] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523147] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6: >>>>>>>>>>>>>> Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523302] N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2: >>>>>>>>>>>>>> Mismatching dictionary in answers of 'GF_FOP_XATTROP' >>>>>>>>>>>>>> >>>>>>>>>>>>>> The message "N [MSGID: 122031] >>>>>>>>>>>>>> [ec-generic.c:1130:ec_combine_xattrop] >>>>>>>>>>>>>> 0-StoragePool-disperse-2: Mismatching dictionary in answers >>>>>>>>>>>>>> of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 >>>>>>>>>>>>>> 02:46:26.523302] and [2017-01-10 02:46:26.523324] >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2017-01-10 02:46:26.523328] W [MSGID: 122040] >>>>>>>>>>>>>> [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2: >>>>>>>>>>>>>> Failed to get size and version [Input/output error] >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [root at glusterfs3 Log_Files]# gluster --version >>>>>>>>>>>>>> >>>>>>>>>>>>>> glusterfs 3.7.18 built on Dec 8 2016 06:34:26 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [root at glusterfs3 Log_Files]# gluster volume info >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Volume Name: StoragePool >>>>>>>>>>>>>> >>>>>>>>>>>>>> Type: Distributed-Disperse >>>>>>>>>>>>>> >>>>>>>>>>>>>> Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f >>>>>>>>>>>>>> >>>>>>>>>>>>>> Status: Started >>>>>>>>>>>>>> >>>>>>>>>>>>>> Number of Bricks: 8 x (2 + 1) = 24 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Transport-type: tcp >>>>>>>>>>>>>> >>>>>>>>>>>>>> Bricks: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick1: glusterfs1sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick2: glusterfs2sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick3: glusterfs3sds:/ws/disk1/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick4: glusterfs1sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick5: glusterfs2sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick6: glusterfs3sds:/ws/disk2/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick7: glusterfs1sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick8: glusterfs2sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick9: glusterfs3sds:/ws/disk3/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick10: glusterfs1sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick11: glusterfs2sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick12: glusterfs3sds:/ws/disk4/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick13: glusterfs1sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick14: glusterfs2sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick15: glusterfs3sds:/ws/disk5/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick16: glusterfs1sds:/ws/disk6/ws_brick >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick17: glusterfs2sds:/ws/disk6/ws_brick 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brick18: glusterfs3sds:/ws/disk6/ws_brick
>>>>>>>>>>>>>> Brick19: glusterfs1sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick20: glusterfs2sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick21: glusterfs3sds:/ws/disk7/ws_brick
>>>>>>>>>>>>>> Brick22: glusterfs1sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>> Brick23: glusterfs2sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>> Brick24: glusterfs3sds:/ws/disk8/ws_brick
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>>>>>> diagnostics.client-log-level: INFO
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>>>>> Ram
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Gluster-devel mailing list
>>>>>>>>>>>>>> Gluster-devel at gluster.org
>>>>>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
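As a closing note on Xavi's getfattr request earlier in the thread: when comparing a failing file across bricks, the same check can be scripted. A minimal sketch in Python, run on one storage node; the brick roots and the brick-relative path below are placeholders to substitute, and since each subvolume's bricks live on different servers here, it would be run once per server:

    import os

    # Placeholder brick roots and file path; substitute the real ones.
    bricks = ['/ws/disk1/ws_brick', '/ws/disk2/ws_brick', '/ws/disk3/ws_brick']
    relpath = 'Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062'

    # Dump the EC xattrs of the same file on every brick, like
    # 'getfattr -m. -e hex -d <path>' does, so that mismatching
    # trusted.ec.* values stand out side by side.
    for brick in bricks:
        path = os.path.join(brick, relpath)
        for key in ('trusted.ec.config', 'trusted.ec.size', 'trusted.ec.version'):
            try:
                value = os.getxattr(path, key).hex()
            except OSError as exc:
                value = 'unreadable ({})'.format(exc)
            print('{}  {} = {}'.format(brick, key, value))

Any brick whose trusted.ec.version or trusted.ec.size differs from the other two is the inconsistent one that ec reports as 'bad' in the messages above.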