Xavier Hernandez
2017-Jan-13 08:01 UTC
[Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
Hi Ram,

On 12/01/17 22:14, Ankireddypalle Reddy wrote:
> Xavi,
> I changed the logging to log the individual bytes. Consider the following from ws-glus.log file where /ws/glus is the mount point.
>
> [2017-01-12 20:47:59.368102] I [MSGID: 109063] [dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found anomalies in /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid = e694387f-dde7-410b-9562-914a994d5e85). Holes=1 overlaps=0
> [2017-01-12 20:47:59.391218] I [MSGID: 109036] [dht-common.c:9082:dht_log_new_layout_for_dir_selfheal] 0-glusterfsProd-dht: Setting layout of /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 with [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 , Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 3221225469 , Stop: 3579139409 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: -1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: 357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: -1 , Start: 715827882 , Stop: 1073741822 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 , Stop: 2505397586 , Hash: 1 ],
>
> Self-heal seems to be triggered for path /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies as per DHT. It would be great if someone could explain what could be the anomaly here. The setup where we encountered this is a fairly stable setup with no brick failures or no node failures.

This is not really a self-heal, at least from the point of view of ec. It means that DHT has found a discrepancy in the layout of that directory, but that does not indicate a problem (notice the 'I' in the log, meaning that it's informational, not a warning or an error). I'm not sure how DHT works in this case or why it finds this "anomaly", but if there are no errors before that message, it can be safely ignored. I'm not sure whether it is related to the cluster.weighted-rebalance option, which is enabled by default.

> Then Self-heal seems to have encountered the following error.
>
> [2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
> [2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b649520ac ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
> [2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0 ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
> [2017-01-12 20:48:23.418531] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-2: Mismatching xdata in answers of 'LOOKUP'

That's a real problem. Here we have two bricks that differ in the trusted.ec.version xattr. However, this xattr does not necessarily belong to the previous directory. The two messages are unrelated.

> In this case glusterfsProd-disperse-2 sub volume actually consists of the following bricks.
> glusterfs4sds:/ws/disk11/ws_brick, glusterfs5sds: /ws/disk11/ws_brick, glusterfs6sds: /ws/disk11/ws_brick
>
> I went ahead and checked the value of trusted.ec.version on all the 3 bricks inside this sub vol:
>
> [root@glusterfs6 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
> trusted.ec.version=0x0000000000000009000000000000000b
>
> [root@glusterfs4 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
> trusted.ec.version=0x0000000000000009000000000000000b
>
> [root@glusterfs5 glusterfs]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
> trusted.ec.version=0x0000000000000009000000000000000b
>
> The attribute value seems to be same on all the 3 bricks.

That's a clear indication that the ec warning is not related to this directory, because trusted.ec.version always increases, never decreases, and the directory has a value smaller than the one that appears in the log message. If the log shows all dict entries, the warning does seem to refer to a directory, because trusted.ec.size is not present, but it must be a different directory from the one you looked at. We need to find which one has this issue. The TRACE log would be helpful here.

> Also please note that every single time that the trusted.ec.version was found to mismatch the same values are getting logged. Following are 2 more instances of trusted.ec.version mismatch.
>
> [2017-01-12 20:14:25.554540] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
> [2017-01-12 20:14:25.554588] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495a9f0 ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
> [2017-01-12 20:14:25.554608] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495903c ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
> [2017-01-12 20:14:25.554624] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-2: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
> [2017-01-12 20:14:25.554632] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-2: Heal failed [Invalid argument]
> [2017-01-12 20:14:25.555598] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
> [2017-01-12 20:14:25.555622] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b64956c24 ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
> [2017-01-12 20:14:25.555638] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b64964e8c ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))

I think that this refers to the same directory. It looks like an attempt to heal it that has failed, so it makes sense that it finds exactly the same values.

> In glustershd.log lot of similar errors are logged.
>
> [2017-01-12 21:10:53.728770] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
> [2017-01-12 21:10:53.728804] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7f21694b6f50 ((trusted.ec.size:0:0:0:0:42:3a:0:0:)(trusted.ec.version:0:0:0:0:0:0:37:5f:0:0:0:0:0:0:37:5f:))
> [2017-01-12 21:10:53.728827] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7f21694b62bc ((trusted.ec.size:0:0:0:0:0:ca:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:a1:0:0:0:0:0:0:37:5f:))
> [2017-01-12 21:10:53.728842] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP'
> [2017-01-12 21:10:53.728854] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-0: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
> [2017-01-12 21:10:53.728876] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-0: Heal failed [Invalid argument]

This seems to be an attempt to heal a file, but I see a lot of differences between the two versions. The size on one brick is 13,238,272 bytes, but on the other brick it's 1,111,097,344 bytes. That's a huge difference. Looking at trusted.ec.version, I see that the 'data' version is very different (161 versus 14,175), while the metadata version is exactly the same. This really looks like a lot of writes happened while one brick was down (or disconnected, or the writes failed for some reason). One brick has lost about 14,000 writes of ~80KB each.

I think the most important thing right now would be to identify which files and directories are having these problems, so that we can identify the cause. Again, the TRACE log will be really useful.
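
The figures in the last analysis can be reproduced from the two glustershd.log dict dumps. What follows is only a minimal sketch in Python, under these assumptions: the dump format is the one produced by the custom debug logging (colon-separated hex bytes with a trailing colon), trusted.ec.size is one big-endian 64-bit integer, and trusted.ec.version is two big-endian 64-bit counters with the data version first and the metadata version second. That layout is inferred from the numbers quoted in this thread. The helper names (parse_dump, decode_size, decode_version) are made up for illustration and are not part of GlusterFS.

```python
import struct


def parse_dump(field):
    """Turn a colon-separated hex dump such as '0:0:42:3a:' into bytes."""
    return bytes(int(part, 16) for part in field.split(":") if part)


def decode_size(field):
    """Assumed layout: trusted.ec.size is one big-endian uint64."""
    (size,) = struct.unpack(">Q", parse_dump(field))
    return size


def decode_version(field):
    """Assumed layout: (data version, metadata version), big-endian uint64 each."""
    data_version, metadata_version = struct.unpack(">QQ", parse_dump(field))
    return data_version, metadata_version


# Values copied from the two dict dumps of 0-glusterfsProd-disperse-0.
size_a = decode_size("0:0:0:0:42:3a:0:0:")
size_b = decode_size("0:0:0:0:0:ca:0:0:")
version_a = decode_version("0:0:0:0:0:0:37:5f:0:0:0:0:0:0:37:5f:")
version_b = decode_version("0:0:0:0:0:0:0:a1:0:0:0:0:0:0:37:5f:")

# Gap between the two answers: data-version steps and bytes per step.
missed_writes = version_a[0] - version_b[0]
bytes_per_write = (size_a - size_b) // missed_writes

print("sizes:", size_a, size_b)
print("versions (data, metadata):", version_a, version_b)
print("missed writes:", missed_writes, "avg bytes per write:", bytes_per_write)
```

Under those assumptions the gap is 14,014 data-version increments averaging a little under 80KB each, which matches the estimate of about 14,000 lost writes. The sketch does not say which brick holds the stale copy, only how far apart the two answers are.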

Xavi

> Thanks and Regards,
> Ram
>
> -----Original Message-----
> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
> Sent: Thursday, January 12, 2017 6:40 AM
> To: Ankireddypalle Reddy
> Cc: Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>
> Hi Ram,
>
> On 12/01/17 11:49, Ankireddypalle Reddy wrote:
>> Xavi,
>> As I mentioned before the error could happen for any FOP. Will try to run with TRACE debug level. Is there a possibility that we are checking for this attribute on a directory, because a directory does not seem to be having this attribute set.
>
> No, directories do not have this attribute and no one should be reading it from a directory.
>
>> Also is the function to check size and version called after it is decided that heal should be run or is this check is the one which decides whether a heal should be run.
>
> Almost all checks that trigger a heal are done in the lookup fop when some discrepancy is detected.
>
> The function that checks size and version is called later once a lock on the inode is acquired (even if no heal is needed). However further failures in the processing of any fop can also trigger a self-heal.
>
> Xavi
>
>>
>> Thanks and Regards,
>> Ram
>>
>> Sent from my iPhone
>>
>>> On Jan 12, 2017, at 2:25 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>
>>> Hi Ram,
>>>
>>>> On 12/01/17 02:36, Ankireddypalle Reddy wrote:
>>>> Xavi,
>>>> I added some more logging information. The trusted.ec.size field values are in fact different.
>>>> trusted.ec.size l1 = 62719407423488 l2 = 0
>>>
>>> That's very weird. Directories do not have this attribute. It's only present on regular files. But you said that the error happens while creating the file, so it doesn't make much sense because file creation always sets trusted.ec.size to 0.
>>> >>> Could you reproduce the problem with diagnostics.client-log-level set to TRACE and send the log to me ? it will create a big log, but I'll have much more information about what's going on. >>> >>> Do you have a mixed setup with nodes of different types ? for example mixed 32/64 bits architectures or different operating systems ? I ask this because 62719407423488 in hex is 0x390B00000000, which has the lower 32 bits set to 0, but has garbage above that. >>> >>>> >>>> This is a fairly static setup with no brick/ node failure. Please explain why is that a heal is being triggered and what could have acutually caused these size xattrs to differ. This is causing random I/O failures and is impacting the backup schedules. >>> >>> The launch of self-heal is normal because it has detected an inconsistency. The real problem is what originates that inconsistency. >>> >>> Xavi >>> >>>> >>>> [ 2017-01-12 01:19:18.256970] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-12 01:19:18.257015] W [MSGID: 122053] >>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: >>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>> good=3, bad=4) >>>> [2017-01-12 01:19:18.257018] W [MSGID: 122002] >>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal >>>> failed [Invalid argument] >>>> [2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] >>>> 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-12 01:19:21.002056] E [dict.c:166:log_value] >>>> 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 >>>> = 0 i1 = 0 i2 = 0 ] >>>> [2017-01-12 01:19:21.002064] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] >>>> 0-glusterfsProd-disperse-4: 'trusted.ec.size' is 
different in two >>>> dicts (8, 8) >>>> [2017-01-12 01:19:21.209673] E [dict.c:166:log_value] >>>> 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 >>>> = 0 i1 = 0 i2 = 0 ] >>>> [2017-01-12 01:19:21.209686] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-12 01:19:21.209719] W [MSGID: 122053] >>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: >>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>> good=6, bad=1) >>>> [2017-01-12 01:19:21.209753] W [MSGID: 122002] >>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-4: Heal >>>> failed [Invalid argument] >>>> >>>> Thanks and Regards, >>>> Ram >>>> >>>> -----Original Message----- >>>> From: Ankireddypalle Reddy >>>> Sent: Wednesday, January 11, 2017 9:29 AM >>>> To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel >>>> (gluster-devel at gluster.org); gluster-users at gluster.org >>>> Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in >>>> disperse volume >>>> >>>> Xavi, >>>> I built a debug binary to log more information. This is what is getting logged. Looks like it is the attribute trusted.ec.size which is different among the bricks in a sub volume. 
>>>> >>>> In glustershd.log : >>>> >>>> [2017-01-11 14:19:45.023845] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching iatt in answers of 'GF_FOP_LOOKUP' >>>> [2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:19:45.027736] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-11 14:19:45.027763] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:19:45.027781] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-11 14:19:45.027793] W [MSGID: 122053] >>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>> good=6, bad=1) >>>> [2017-01-11 14:19:45.027815] W [MSGID: 122002] >>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>> failed [Invalid argument] >>>> [2017-01-11 14:19:45.029035] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:19:45.029057] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-11 14:19:45.029089] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:19:45.029105] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-11 14:19:45.029121] W [MSGID: 122053] >>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: >>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>> good=6, 
bad=1) >>>> [2017-01-11 14:19:45.032566] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:19:45.029138] W [MSGID: 122002] >>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal >>>> failed [Invalid argument] >>>> [2017-01-11 14:19:45.032585] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-11 14:19:45.032614] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:19:45.032631] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-11 14:19:45.032638] W [MSGID: 122053] >>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>> good=6, bad=1) >>>> [2017-01-11 14:19:45.032654] W [MSGID: 122002] >>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>> failed [Invalid argument] >>>> [2017-01-11 14:19:45.037514] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:19:45.037536] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-11 14:19:45.037553] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:19:45.037573] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP' >>>> [2017-01-11 14:19:45.037582] W [MSGID: 122053] >>>> [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: >>>> Operation failed on some subvolumes (up=7, mask=7, remaining=0, >>>> good=6, bad=1) >>>> [2017-01-11 14:19:45.037599] 
W [MSGID: 122002] >>>> [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal >>>> failed [Invalid argument] >>>> [2017-01-11 14:20:40.001401] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-3: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> [2017-01-11 14:20:40.001387] E [dict.c:166:key_value_cmp] >>>> 0-glusterfsProd-disperse-5: 'trusted.ec.size' is different in two >>>> dicts (8, 8) >>>> >>>> In the mount daemon log: >>>> >>>> [2017-01-11 14:20:17.806826] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-0: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.806847] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-0: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.807076] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-1: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.807099] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-1: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.807286] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-10: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.807298] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-10: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.807409] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-11: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.807420] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-11: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.807448] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-4: >>>> Invalid or corrupted config [Invalid 
argument] >>>> [2017-01-11 14:20:17.807462] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-4: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.807539] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-2: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.807550] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-2: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.807723] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-3: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.807739] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-3: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.807785] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-5: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.807796] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-5: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.808020] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-9: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.808034] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-9: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.808054] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-6: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.808066] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-6: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.808282] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 
2-glusterfsProd-disperse-8: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.808292] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-8: >>>> Invalid config xattr [Invalid argument] >>>> [2017-01-11 14:20:17.809212] E [MSGID: 122001] >>>> [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-7: >>>> Invalid or corrupted config [Invalid argument] >>>> [2017-01-11 14:20:17.809228] E [MSGID: 122066] >>>> [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-7: >>>> Invalid config xattr [Invalid argument] >>>> >>>> [2017-01-11 14:20:17.812660] I [MSGID: 109036] >>>> [dht-common.c:8043:dht_log_new_layout_for_dir_selfheal] >>>> 2-glusterfsProd-dht: Setting layout of >>>> /Folder_01.05.2017_21.15/CV_MAGNETIC/V_31500/CHUNK_402578 with >>>> [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 1789569705 >>>> , Stop: 2147483645 , Hash: 1 ], [Subvol_name: >>>> glusterfsProd-disperse-1, Err: -1 , Start: 2147483646 , Stop: >>>> 2505397586 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-10, >>>> Err: -1 , Start: 2505397587 , Stop: 2863311527 , Hash: 1 ], >>>> [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 2863311528 >>>> , Stop: 3221225468 , Hash: 1 ], [Subvol_name: >>>> glusterfsProd-disperse-2, Err: -1 , Start: 3221225469 , Stop: >>>> 3579139409 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: >>>> -1 , Start: 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: >>>> glusterfsProd-disperse-4, Err: -1 , Start: 3937053351 , Stop: >>>> 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: >>>> -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: >>>> glusterfsProd-disperse-6, Err: -1 , Start: 357913941 , Stop: >>>> 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: >>>> -1 , Start: 715827882 , Stop: 1073741822 , Hash: 1 ], [Subvol_name: >>>> glusterfsProd-disperse-8, Err: -1 , Start: 1073741823 , Stop: >>>> 1431655763 , Hash: 1 ], 
[Subvol_name: glusterfsProd-disperse-9, Err: >>>> -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], >>>> >>>> >>>> -----Original Message----- >>>> From: gluster-users-bounces at gluster.org >>>> [mailto:gluster-users-bounces at gluster.org] On Behalf Of >>>> Ankireddypalle Reddy >>>> Sent: Tuesday, January 10, 2017 10:09 AM >>>> To: Xavier Hernandez; Gluster Devel (gluster-devel at gluster.org); >>>> gluster-users at gluster.org >>>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in >>>> disperse volume >>>> >>>> Xavi, >>>> In this case it's the file creation which failed. So I provided the xattrs of the parent. >>>> >>>> Thanks and Regards, >>>> Ram >>>> >>>> -----Original Message----- >>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es] >>>> Sent: Tuesday, January 10, 2017 9:10 AM >>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); >>>> gluster-users at gluster.org >>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume >>>> >>>> Hi Ram, >>>> >>>>> On 10/01/17 14:42, Ankireddypalle Reddy wrote: >>>>> Attachments (2): >>>>> >>>>> 1 >>>>> >>>>> >>>>> >>>>> ec.txt >>>>> <https://imap.commvault.com/webconsole/embedded.do?url=https://imap >>>>> .co >>>>> mmvault.com/webconsole/api/drive/publicshare/346714/file/ee2d1536c2 >>>>> dc4 >>>>> dff94afb12132b4f8f6/action/preview&downloadUrl=https://imap.commvault. >>>>> com/webconsole/api/contentstore/publicshare/346714/file/ee2d1536c2d >>>>> c4d ff94afb12132b4f8f6/action/download> >>>>> [Download] >>>>> <https://imap.commvault.com/webconsole/api/contentstore/publicshare >>>>> /34 >>>>> 6714/file/ee2d1536c2dc4dff94afb12132b4f8f6/action/download>(11.50 >>>>> KB) >>>>> >>>>> 2 >>>>> >>>>> >>>>> >>>>> ws-glus.log >>>>> <https://imap.commvault.com/webconsole/embedded.do?url=https://imap >>>>> .co >>>>> mmvault.com/webconsole/api/drive/publicshare/346714/file/cff3e0506e >>>>> 754 >>>>> b9a939db02da1cbbd58/action/preview&downloadUrl=https://imap.commvault. 
>>>>> com/webconsole/api/contentstore/publicshare/346714/file/cff3e0506e7 >>>>> 54b 9a939db02da1cbbd58/action/download> >>>>> [Download] >>>>> <https://imap.commvault.com/webconsole/api/contentstore/publicshare >>>>> /34 >>>>> 6714/file/cff3e0506e754b9a939db02da1cbbd58/action/download>(3.48 >>>>> MB) >>>>> >>>>> Xavi, >>>>> We are encountering errors for different kinds of FOPS. >>>>> The open failed for the following file: >>>>> >>>>> cvd_2017_01_10_02_28_26.log:98182 1f9fe 01/10 00:57:10 8414465 >>>>> [MEDIAFS ] 20117519-52075477 SingleInstancer_FS::StartDataFile2: >>>>> Failed to create the data file >>>>> [/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342 >>>>> 720 /SFILE_CONTAINER_062], error=0xECCC0005:{CQiFile::Open(92)} + >>>>> {CQiUTFOSAPI::open(96)/ErrNo.5.(Input/output error)-Open failed, >>>>> File=/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_5 >>>>> 134 2720/SFILE_CONTAINER_062, OperationFlag=0xC1, >>>>> PermissionMode=0x1FF} >>>>> >>>>> I've attached the extended attributes for the directories >>>>> /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/ >>>>> and >>>>> >>>>> /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_513427 >>>>> 20 >>>>> from all the bricks. >>>>> >>>>> The attributes look fine to me. I've also attached some log >>>>> cuts to illustrate the problem. >>>> >>>> I need the extended attributes of the file itself, not the parent directories. >>>> >>>> Xavi >>>> >>>>> >>>>> Thanks and Regards, >>>>> Ram >>>>> >>>>> -----Original Message----- >>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es] >>>>> Sent: Tuesday, January 10, 2017 7:53 AM >>>>> To: Ankireddypalle Reddy; Gluster Devel >>>>> (gluster-devel at gluster.org); gluster-users at gluster.org >>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume >>>>> >>>>> Hi Ram, >>>>> >>>>> the error is caused by an extended attribute that does not match on >>>>> all >>>>> 3 bricks of the disperse set. 
Most probable value is >>>>> trusted.ec.version, but could be others. >>>>> >>>>> At first sight, I don't see any change from 3.7.8 that could have >>>>> caused this. I'll check again. >>>>> >>>>> What kind of operations are you doing ? this can help me narrow the search. >>>>> >>>>> Xavi >>>>> >>>>>> On 10/01/17 13:43, Ankireddypalle Reddy wrote: >>>>>> Xavi, >>>>>> Thanks. If you could please explain what to look for in >>>>>> the >>>>> extended attributes then I will check and let you know if I find >>>>> anything suspicious. Also we noticed that some of these operations >>>>> would succeed if retried. Do you know of any communicated related >>>>> errors that are being reported/triaged. >>>>>> >>>>>> Thanks and Regards, >>>>>> Ram >>>>>> >>>>>> -----Original Message----- >>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es] >>>>>> Sent: Tuesday, January 10, 2017 7:23 AM >>>>>> To: Ankireddypalle Reddy; Gluster Devel >>>>>> (gluster-devel at gluster.org); gluster-users at gluster.org >>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume >>>>>> >>>>>> Hi Ram, >>>>>> >>>>>>> On 10/01/17 13:14, Ankireddypalle Reddy wrote: >>>>>>> Attachment (1): >>>>>>> >>>>>>> 1 >>>>>>> >>>>>>> >>>>>>> >>>>>>> ecxattrs.txt >>>>>>> <https://imap.commvault.com/webconsole/embedded.do?url=https://imap. >>>>>>> c >>>>>>> o >>>>>>> mmvault.com/webconsole/api/drive/publicshare/346714/file/1272e682 >>>>>>> 787 >>>>>>> 4 >>>>>>> 4 >>>>>>> f15bf1a54f2b31b559d/action/preview&downloadUrl=https://imap.commvault. 
>>>>>>> com/webconsole/api/contentstore/publicshare/346714/file/1272e6827 >>>>>>> 874 >>>>>>> 4 >>>>>>> f >>>>>>> 15bf1a54f2b31b559d/action/download> >>>>>>> [Download] >>>>>>> <https://imap.commvault.com/webconsole/api/contentstore/publicsha >>>>>>> re/ >>>>>>> 3 >>>>>>> 4 >>>>>>> 6714/file/1272e68278744f15bf1a54f2b31b559d/action/download>(5.92 >>>>>>> KB) >>>>>>> >>>>>>> Xavi, >>>>>>> Please find attached the extended attributes for a >>>>>>> directory from all the bricks. Free space check failed for this >>>>>>> with error number EIO. >>>>>> >>>>>> What do you mean ? what operation have you made to check the free >>>>> space on that directory ? >>>>>> >>>>>> If it's a recursive check, I need the extended attributes from the >>>>> exact file that triggers the EIO. The attached attributes seem >>>>> consistent and that directory shouldn't cause any problem. Does an 'ls' >>>>> on that directory fail or does it show the contents ? >>>>>> >>>>>> Xavi >>>>>> >>>>>>> >>>>>>> Thanks and Regards, >>>>>>> Ram >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es] >>>>>>> Sent: Tuesday, January 10, 2017 6:45 AM >>>>>>> To: Ankireddypalle Reddy; Gluster Devel >>>>>>> (gluster-devel at gluster.org); gluster-users at gluster.org >>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume >>>>>>> >>>>>>> Hi Ram, >>>>>>> >>>>>>> can you execute the following command on all bricks on a file >>>>>>> that is giving EIO ? >>>>>>> >>>>>>> getfattr -m. -e hex -d <path to file in brick> >>>>>>> >>>>>>> Xavi >>>>>>> >>>>>>>> On 10/01/17 12:41, Ankireddypalle Reddy wrote: >>>>>>>> Xavi, >>>>>>>> We have been running 3.7.8 on these servers. We >>>>>>>> upgraded >>>>>>> to 3.7.18 yesterday. We upgraded all the servers at a time. The >>>>>>> volume was brought down during upgrade. 
>>>>>>>> >>>>>>>> Thanks and Regards, >>>>>>>> Ram >>>>>>>> >>>>>>>> -----Original Message----- >>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es] >>>>>>>> Sent: Tuesday, January 10, 2017 6:35 AM >>>>>>>> To: Ankireddypalle Reddy; Gluster Devel >>>>>>>> (gluster-devel at gluster.org); gluster-users at gluster.org >>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse >>>>>>>> volume >>>>>>>> >>>>>>>> Hi Ram, >>>>>>>> >>>>>>>> how did you upgrade gluster ? from which version ? >>>>>>>> >>>>>>>> Did you upgrade one server at a time and waited until self-heal >>>>>>> finished before upgrading the next server ? >>>>>>>> >>>>>>>> Xavi >>>>>>>> >>>>>>>>> On 10/01/17 11:39, Ankireddypalle Reddy wrote: >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> We upgraded to GlusterFS 3.7.18 yesterday. We see lot of >>>>>>>>> failures in our applications. Most of the errors are EIO. The >>>>>>>>> following log lines are commonly seen in the logs: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> The message "W [MSGID: 122056] >>>>>>>>> [ec-combine.c:873:ec_combine_check] >>>>>>>>> 0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'" >>>>>>>>> repeated 2 times between [2017-01-10 02:46:25.069809] and >>>>>>>>> [2017-01-10 02:46:25.069835] >>>>>>>>> >>>>>>>>> [2017-01-10 02:46:25.069852] W [MSGID: 122056] >>>>>>>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5: >>>>>>>>> Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> >>>>>>>>> The message "W [MSGID: 122056] >>>>>>>>> [ec-combine.c:873:ec_combine_check] >>>>>>>>> 0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'" >>>>>>>>> repeated 2 times between [2017-01-10 02:46:25.069852] and >>>>>>>>> [2017-01-10 02:46:25.069873] >>>>>>>>> >>>>>>>>> [2017-01-10 02:46:25.069910] W [MSGID: 122056] >>>>>>>>> [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6: >>>>>>>>> Mismatching xdata in answers of 'LOOKUP' >>>>>>>>> >>>>>>>>> ... 
>>>>>>>>>
>>>>>>>>> [2017-01-10 02:46:26.520774] I [MSGID: 109036] [dht-common.c:9076:dht_log_new_layout_for_dir_selfheal] 0-StoragePool-dht: Setting layout of /Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with [Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 , Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err: -1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: 1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911 , Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err: -1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name: StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: 2147483643 , Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start: 2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name: StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: 3221225465 , Hash: 1 ],
>>>>>>>>>
>>>>>>>>> [2017-01-10 02:46:26.522841] N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>>>>>>>>>
>>>>>>>>> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3: Mismatching dictionary in answers of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.522841] and [2017-01-10 02:46:26.522894]
>>>>>>>>>
>>>>>>>>> [2017-01-10 02:46:26.522898] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3: Failed to get size and version [Input/output error]
>>>>>>>>>
>>>>>>>>> [2017-01-10 02:46:26.523115] N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>>>>>>>>>
>>>>>>>>> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6: Mismatching dictionary in answers of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.523115] and [2017-01-10 02:46:26.523143]
>>>>>>>>>
>>>>>>>>> [2017-01-10 02:46:26.523147] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6: Failed to get size and version [Input/output error]
>>>>>>>>>
>>>>>>>>> [2017-01-10 02:46:26.523302] N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>>>>>>>>>
>>>>>>>>> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2: Mismatching dictionary in answers of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.523302] and [2017-01-10 02:46:26.523324]
>>>>>>>>>
>>>>>>>>> [2017-01-10 02:46:26.523328] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2: Failed to get size and version [Input/output error]
>>>>>>>>>
>>>>>>>>> [root at glusterfs3 Log_Files]# gluster --version
>>>>>>>>> glusterfs 3.7.18 built on Dec 8 2016 06:34:26
>>>>>>>>>
>>>>>>>>> [root at glusterfs3 Log_Files]# gluster volume info
>>>>>>>>>
>>>>>>>>> Volume Name: StoragePool
>>>>>>>>> Type: Distributed-Disperse
>>>>>>>>> Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f
>>>>>>>>> Status: Started
>>>>>>>>> Number of Bricks: 8 x (2 + 1) = 24
>>>>>>>>> Transport-type: tcp
>>>>>>>>> Bricks:
>>>>>>>>> Brick1: glusterfs1sds:/ws/disk1/ws_brick
>>>>>>>>> Brick2: glusterfs2sds:/ws/disk1/ws_brick
>>>>>>>>> Brick3: glusterfs3sds:/ws/disk1/ws_brick
>>>>>>>>> Brick4: glusterfs1sds:/ws/disk2/ws_brick
>>>>>>>>> Brick5: glusterfs2sds:/ws/disk2/ws_brick
>>>>>>>>> Brick6: glusterfs3sds:/ws/disk2/ws_brick
>>>>>>>>> Brick7: glusterfs1sds:/ws/disk3/ws_brick
>>>>>>>>> Brick8: glusterfs2sds:/ws/disk3/ws_brick
>>>>>>>>> Brick9: glusterfs3sds:/ws/disk3/ws_brick
>>>>>>>>> Brick10: glusterfs1sds:/ws/disk4/ws_brick
>>>>>>>>> Brick11: glusterfs2sds:/ws/disk4/ws_brick
>>>>>>>>> Brick12: glusterfs3sds:/ws/disk4/ws_brick
>>>>>>>>> Brick13: glusterfs1sds:/ws/disk5/ws_brick
>>>>>>>>> Brick14: glusterfs2sds:/ws/disk5/ws_brick
>>>>>>>>> Brick15: glusterfs3sds:/ws/disk5/ws_brick
>>>>>>>>> Brick16: glusterfs1sds:/ws/disk6/ws_brick
>>>>>>>>> Brick17: glusterfs2sds:/ws/disk6/ws_brick
>>>>>>>>> Brick18: glusterfs3sds:/ws/disk6/ws_brick
>>>>>>>>> Brick19: glusterfs1sds:/ws/disk7/ws_brick
>>>>>>>>> Brick20: glusterfs2sds:/ws/disk7/ws_brick
>>>>>>>>> Brick21: glusterfs3sds:/ws/disk7/ws_brick
>>>>>>>>> Brick22: glusterfs1sds:/ws/disk8/ws_brick
>>>>>>>>> Brick23: glusterfs2sds:/ws/disk8/ws_brick
>>>>>>>>> Brick24: glusterfs3sds:/ws/disk8/ws_brick
>>>>>>>>> Options Reconfigured:
>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>> diagnostics.client-log-level: INFO
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Ram
>>>>>>>>>
>>>>>>>>> ***************************Legal Disclaimer***************************
>>>>>>>>> "This communication may contain confidential and privileged material for the sole use of the intended recipient.
Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message by mistake, please advise the sender by reply email and delete the message. Thank you."
Ankireddypalle Reddy
2017-Jan-13 09:16 UTC
[Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
Xavi,
Thanks for the explanation. I will collect TRACE logs today.
Thanks and Regards,
Ram
Sent from my iPhone
> On Jan 13, 2017, at 3:03 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>
> Hi Ram,
>
>> On 12/01/17 22:14, Ankireddypalle Reddy wrote:
>> Xavi,
>> I changed the logging to log the individual bytes. Consider the following from the ws-glus.log file, where /ws/glus is the mount point.
>>
>> [2017-01-12 20:47:59.368102] I [MSGID: 109063] [dht-layout.c:718:dht_layout_normalize] 0-glusterfsProd-dht: Found anomalies in /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 (gfid = e694387f-dde7-410b-9562-914a994d5e85). Holes=1 overlaps=0
>> [2017-01-12 20:47:59.391218] I [MSGID: 109036] [dht-common.c:9082:dht_log_new_layout_for_dir_selfheal] 0-glusterfsProd-dht: Setting layout of /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 with [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 2505397587 , Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 3221225469 , Stop: 3579139409 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: -1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: 357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: -1 , Start: 715827882 , Stop: 1073741822 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-8, Err: -1 , Start: 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 2147483646 , Stop: 2505397586 , Hash: 1 ],
>>
>> Self-heal seems to be triggered for path /Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607 due to anomalies as per DHT. It would be great if someone could explain what the anomaly could be here. The setup where we encountered this is fairly stable, with no brick or node failures.
>
> This is not really a self-heal, at least from the point of view of ec. It means that DHT has found a discrepancy in the layout of that directory; however, this doesn't indicate a problem (notice the 'I' in the log, meaning that it's informational, not a warning or an error).
>
> I'm not sure how DHT works in this case or why it finds this "anomaly", but if there are no errors before that message, it can be safely ignored.
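
As a cross-check on the "Setting layout" line quoted above, the logged ranges can be tested for gaps. The following is a minimal sketch, not DHT's own dht_layout_normalize logic: the function name find_layout_anomalies is hypothetical, and it assumes each Start/Stop pair is an inclusive range in a 32-bit hash space.

```python
def find_layout_anomalies(ranges, space_end=0xFFFFFFFF):
    """Return (holes, overlaps) for inclusive (start, stop) hash ranges.

    Simplified illustration only; this is not how DHT itself is implemented.
    """
    holes = 0
    overlaps = 0
    expected = 0  # next hash value that should be covered
    for start, stop in sorted(ranges):
        if start > expected:
            holes += 1      # uncovered gap before this range
        elif start < expected:
            overlaps += 1   # range begins inside an earlier one
        expected = max(expected, stop + 1)
    if expected <= space_end:
        holes += 1          # tail of the hash space is uncovered
    return holes, overlaps


# (Start, Stop) pairs copied from the "Setting layout" log line.
logged_layout = [
    (2505397587, 2863311527), (2863311528, 3221225468),
    (3221225469, 3579139409), (3579139410, 3937053350),
    (3937053351, 4294967295), (0, 357913940),
    (357913941, 715827881), (715827882, 1073741822),
    (1073741823, 1431655763), (1431655764, 1789569704),
    (1789569705, 2147483645), (2147483646, 2505397586),
]

print(find_layout_anomalies(logged_layout))
```

Under these assumptions the twelve ranges in the new layout tile the hash space without holes or overlaps, which fits the reading that the message is informational.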
>
> I'm not sure whether it is related to the cluster.weighted-rebalance option, which is enabled by default.
>
>>
>> Then Self-heal seems to have encountered the following error.
>>
>> [2017-01-12 20:48:23.418432] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
>> [2017-01-12 20:48:23.418496] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b649520ac ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
>> [2017-01-12 20:48:23.418519] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495b4e0 ((trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
>> [2017-01-12 20:48:23.418531] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-2: Mismatching xdata in answers of 'LOOKUP'
>
> That's a real problem. Here we have two bricks that differ in the trusted.ec.version xattr. However, this xattr does not necessarily belong to the previous directory. The two messages are unrelated.
>
>>
>> In this case the glusterfsProd-disperse-2 subvolume actually consists of the following bricks:
>> glusterfs4sds:/ws/disk11/ws_brick, glusterfs5sds:/ws/disk11/ws_brick, glusterfs6sds:/ws/disk11/ws_brick
>>
>> I went ahead and checked the value of trusted.ec.version on all 3 bricks inside this subvolume:
>>
>> [root at glusterfs6 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>> trusted.ec.version=0x0000000000000009000000000000000b
>>
>> [root at glusterfs4 ~]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>> trusted.ec.version=0x0000000000000009000000000000000b
>>
>> [root at glusterfs5 glusterfs]# getfattr -e hex -n trusted.ec.version /ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>> # file: ws/disk11/ws_brick//Folder_01.05.2017_21.15/CV_MAGNETIC/V_30970/CHUNK_390607
>> trusted.ec.version=0x0000000000000009000000000000000b
>>
>> The attribute value seems to be the same on all 3 bricks.
>
> That's a clear indication that the ec warning is not related to this directory, because trusted.ec.version always increases, never decreases, and the directory has a value smaller than the one that appears in the log message.
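
The comparison described here can be reproduced by decoding the values. This is a small sketch under two assumptions: that trusted.ec.version holds two big-endian 64-bit counters (data version first, metadata version second, which matches how the counters are discussed in this thread), and that the byte dumps in the log are colon-separated hexadecimal bytes. The helper names are hypothetical.

```python
import struct


def decode_ec_version(hex_value):
    """Split a 16-byte trusted.ec.version value into (data, metadata).

    Assumes two big-endian unsigned 64-bit counters, data version first.
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    return struct.unpack(">QQ", bytes.fromhex(digits))


def dump_to_hex(dump):
    """Convert a colon-separated byte dump from the log into a hex string."""
    return "".join(f"{int(b, 16):02x}" for b in dump.strip(":").split(":"))


# Value reported by getfattr on all three bricks of the directory.
on_disk = decode_ec_version("0x0000000000000009000000000000000b")

# The two mismatching values dumped in the log.
logged_a = decode_ec_version(dump_to_hex("0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:"))
logged_b = decode_ec_version(dump_to_hex("0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:"))

print("on disk:", on_disk)
print("logged :", logged_a, logged_b)
```

With this decoding, both counters of the on-disk value are lower than the logged ones, so the logged entry cannot be an earlier state of this directory if the counters only ever increase.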
>
> If you are showing all dict entries in the log, it seems to refer to a directory, because trusted.ec.size is not present, but it must be a different directory from the one you looked at. We would need to find which one has this issue. The TRACE log would be helpful here.
>
>
>>
>> Also please note that every single time trusted.ec.version was found to mismatch, the same values were logged. Following are 2 more instances of the trusted.ec.version mismatch.
>>
>> [2017-01-12 20:14:25.554540] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
>> [2017-01-12 20:14:25.554588] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495a9f0 ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
>> [2017-01-12 20:14:25.554608] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b6495903c ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
>> [2017-01-12 20:14:25.554624] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-2: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
>> [2017-01-12 20:14:25.554632] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-2: Heal failed [Invalid argument]
>> [2017-01-12 20:14:25.555598] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-2: 'trusted.ec.version' is different in two dicts (16, 16)
>> [2017-01-12 20:14:25.555622] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b64956c24 ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:b:0:0:0:0:0:0:0:e:))
>> [2017-01-12 20:14:25.555638] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-2: dict=0x7f0b64964e8c ((glusterfs.open-fd-count:30:0:)(trusted.glusterfs.dht:0:0:0:1:0:0:0:0:0:0:0:0:15:55:55:54:)(trusted.ec.version:0:0:0:0:0:0:0:d:0:0:0:0:0:0:0:e:))
>
>
> I think this refers to the same directory. It looks like a heal attempt that failed, so it makes sense that it finds exactly the same values.
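
The "Operation failed on some subvolumes" lines in these logs carry up/mask/good/bad fields. The sketch below reads them as hexadecimal bitmasks with one bit per brick of the disperse subvolume (bit 0 being the first brick). That interpretation is an assumption and is not stated anywhere in this thread; the helper names are hypothetical.

```python
import re


def parse_status(line):
    """Extract the numeric fields of an ec_check_status log line.

    Assumption: the values are hexadecimal bitmasks.
    """
    pairs = re.findall(r"\b(up|mask|remaining|good|bad)=([0-9A-Fa-f]+)", line)
    return {name: int(value, 16) for name, value in pairs}


def bricks_in_mask(mask, brick_count=3):
    """List the zero-based brick indexes whose bit is set in the mask."""
    return [i for i in range(brick_count) if mask & (1 << i)]


line = ("Operation failed on some subvolumes "
        "(up=7, mask=7, remaining=0, good=3, bad=4)")
status = parse_status(line)

print("good bricks:", bricks_in_mask(status["good"]))
print("bad bricks :", bricks_in_mask(status["bad"]))
```

If the assumption holds, good=3, bad=4 would point at the third brick of the 2 + 1 subvolume, and good=6, bad=1 at the first one.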
>
>>
>>
>> In glustershd.log a lot of similar errors are logged.
>>
>> [2017-01-12 21:10:53.728770] I [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-0: 'trusted.ec.size' is different in two dicts (8, 8)
>> [2017-01-12 21:10:53.728804] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7f21694b6f50 ((trusted.ec.size:0:0:0:0:42:3a:0:0:)(trusted.ec.version:0:0:0:0:0:0:37:5f:0:0:0:0:0:0:37:5f:))
>> [2017-01-12 21:10:53.728827] I [dict.c:3065:dict_dump_to_log] 0-glusterfsProd-disperse-0: dict=0x7f21694b62bc ((trusted.ec.size:0:0:0:0:0:ca:0:0:)(trusted.ec.version:0:0:0:0:0:0:0:a1:0:0:0:0:0:0:37:5f:))
>> [2017-01-12 21:10:53.728842] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-0: Mismatching xdata in answers of 'LOOKUP'
>> [2017-01-12 21:10:53.728854] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-0: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>> [2017-01-12 21:10:53.728876] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-0: Heal failed [Invalid argument]
>
> This looks like an attempt to heal a file, but I see a lot of differences between the two versions. The size on one brick is 13,238,272 bytes, but on the other brick it's 1,111,097,344 bytes. That's a huge difference.
>
> Looking at trusted.ec.version, I see that the 'data' version is very different (161 versus 14,175), while the metadata version is exactly the same. This really looks like a lot of writes happened while one brick was down (or disconnected for some reason, or writes failed for some reason). One brick has missed about 14,000 writes of ~80KB each.
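
The figures in this analysis can be recomputed from the two dict dumps in the glustershd.log excerpt. This is only an arithmetic sketch: it assumes the dumped bytes are big-endian hexadecimal and that each increment of the data version corresponds to one write, which is the reading used in the paragraph above. The variable names are illustrative.

```python
def dump_to_int(dump):
    """Interpret a colon-separated hex byte dump as a big-endian integer."""
    return int.from_bytes(
        bytes(int(b, 16) for b in dump.strip(":").split(":")), "big"
    )


# trusted.ec.size and the data half of trusted.ec.version from each dict.
size_a = dump_to_int("0:0:0:0:42:3a:0:0:")
data_version_a = dump_to_int("0:0:0:0:0:0:37:5f:")
size_b = dump_to_int("0:0:0:0:0:ca:0:0:")
data_version_b = dump_to_int("0:0:0:0:0:0:0:a1:")

missed_writes = data_version_a - data_version_b
missed_bytes = size_a - size_b
average_write = missed_bytes / missed_writes

print(f"sizes: {size_a} vs {size_b} bytes")
print(f"data versions: {data_version_a} vs {data_version_b}")
print(f"{missed_writes} writes, about {average_write:.0f} bytes each")
```

Under these assumptions the gap is 14,014 writes averaging a little over 78,000 bytes, in line with the "about 14,000 writes of ~80KB" estimate.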
>
> I think the most important thing right now is to identify which files and directories have these problems, so that we can identify the cause. Again, the TRACE log will be really useful.
>
> Xavi
>
>>
>> Thanks and Regards,
>> Ram
>>
>> -----Original Message-----
>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>> Sent: Thursday, January 12, 2017 6:40 AM
>> To: Ankireddypalle Reddy
>> Cc: Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>
>> Hi Ram,
>>
>>
>>> On 12/01/17 11:49, Ankireddypalle Reddy wrote:
>>> Xavi,
>>> As I mentioned before, the error could happen for any FOP. I will try to run with the TRACE debug level. Is there a possibility that we are checking for this attribute on a directory? A directory does not seem to have this attribute set.
>>
>> No, directories do not have this attribute and no one should be reading it from a directory.
>>
>>> Also, is the function that checks size and version called after it is decided that a heal should be run, or is this check the one that decides whether a heal should be run?
>>
>> Almost all checks that trigger a heal are done in the lookup fop when some discrepancy is detected.
>>
>> The function that checks size and version is called later, once a lock on the inode is acquired (even if no heal is needed). However, further failures in the processing of any fop can also trigger a self-heal.
>>
>> Xavi
>>
>>>
>>> Thanks and Regards,
>>> Ram
>>>
>>> Sent from my iPhone
>>>
>>>> On Jan 12, 2017, at 2:25 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>>
>>>> Hi Ram,
>>>>
>>>>> On 12/01/17 02:36, Ankireddypalle Reddy wrote:
>>>>> Xavi,
>>>>> I added some more logging information. The trusted.ec.size field values are in fact different.
>>>>> trusted.ec.size l1 = 62719407423488 l2 = 0
>>>>
>>>> That's very weird. Directories do not have this attribute. It's only present on regular files. But you said that the error happens while creating the file, so it doesn't make much sense, because file creation always sets trusted.ec.size to 0.
>>>>
>>>> Could you reproduce the problem with diagnostics.client-log-level set to TRACE and send the log to me? It will create a big log, but I'll have much more information about what's going on.
>>>>
>>>> Do you have a mixed setup with nodes of different types? For example, mixed 32/64-bit architectures or different operating systems? I ask this because 62719407423488 in hex is 0x390B00000000, which has the lower 32 bits set to 0 but has garbage above that.
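
The observation about 62719407423488 is easy to verify. A short sketch that splits the logged value into its lower and upper 32-bit halves (the function name split_halves is illustrative):

```python
def split_halves(value):
    """Return (upper 32 bits, lower 32 bits) of a 64-bit value."""
    return value >> 32, value & 0xFFFFFFFF


logged_size = 62719407423488  # l1 from the trusted.ec.size log line
upper, lower = split_halves(logged_size)

print(hex(logged_size))
print(f"upper 32 bits: {upper:#x}, lower 32 bits: {lower:#x}")
```

The lower half is zero and only the upper half carries data, which is the pattern that prompted the question about mixed 32/64-bit nodes.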
>>>>
>>>>>
>>>>> This is a fairly static setup with no brick/node failures. Please explain why a heal is being triggered and what could have actually caused these size xattrs to differ. This is causing random I/O failures and is impacting the backup schedules.
>>>>
>>>> The launch of self-heal is normal because it has detected an inconsistency. The real problem is what causes that inconsistency.
>>>>
>>>> Xavi
>>>>
>>>>>
>>>>> [2017-01-12 01:19:18.256970] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-12 01:19:18.257015] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=3, bad=4)
>>>>> [2017-01-12 01:19:18.257018] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
>>>>> [2017-01-12 01:19:21.002028] E [dict.c:197:key_value_cmp] 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-12 01:19:21.002056] E [dict.c:166:log_value] 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 i2 = 0 ]
>>>>> [2017-01-12 01:19:21.002064] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-12 01:19:21.209640] E [dict.c:197:key_value_cmp] 0-glusterfsProd-disperse-4: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-12 01:19:21.209673] E [dict.c:166:log_value] 0-glusterfsProd-disperse-4: trusted.ec.size [ l1 = 62719407423488 l2 = 0 i1 = 0 i2 = 0 ]
>>>>> [2017-01-12 01:19:21.209686] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-4: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-12 01:19:21.209719] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-4: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>> [2017-01-12 01:19:21.209753] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-4: Heal failed [Invalid argument]
>>>>>
>>>>> Thanks and Regards,
>>>>> Ram
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ankireddypalle Reddy
>>>>> Sent: Wednesday, January 11, 2017 9:29 AM
>>>>> To: Ankireddypalle Reddy; Xavier Hernandez; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>> Subject: RE: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>
>>>>> Xavi,
>>>>> I built a debug binary to log more information. This is what is getting logged. It looks like the trusted.ec.size attribute is what differs among the bricks in a subvolume.
>>>>>
>>>>> In glustershd.log :
>>>>>
>>>>> [2017-01-11 14:19:45.023845] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-glusterfsProd-disperse-8: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
>>>>> [2017-01-11 14:19:45.027718] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:19:45.027736] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-11 14:19:45.027763] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:19:45.027781] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-11 14:19:45.027793] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>> [2017-01-11 14:19:45.027815] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
>>>>> [2017-01-11 14:19:45.029035] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:19:45.029057] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-11 14:19:45.029089] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-8: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:19:45.029105] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-8: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-11 14:19:45.029121] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-8: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>> [2017-01-11 14:19:45.032566] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:19:45.029138] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-8: Heal failed [Invalid argument]
>>>>> [2017-01-11 14:19:45.032585] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-11 14:19:45.032614] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:19:45.032631] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-11 14:19:45.032638] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>> [2017-01-11 14:19:45.032654] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
>>>>> [2017-01-11 14:19:45.037514] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:19:45.037536] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-11 14:19:45.037553] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-6: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:19:45.037573] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-glusterfsProd-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>> [2017-01-11 14:19:45.037582] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-glusterfsProd-disperse-6: Operation failed on some subvolumes (up=7, mask=7, remaining=0, good=6, bad=1)
>>>>> [2017-01-11 14:19:45.037599] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-glusterfsProd-disperse-6: Heal failed [Invalid argument]
>>>>> [2017-01-11 14:20:40.001401] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-3: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>> [2017-01-11 14:20:40.001387] E [dict.c:166:key_value_cmp] 0-glusterfsProd-disperse-5: 'trusted.ec.size' is different in two dicts (8, 8)
>>>>>
>>>>> In the mount daemon log:
>>>>>
>>>>> [2017-01-11 14:20:17.806826] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-0: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.806847] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-0: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.807076] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-1: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.807099] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-1: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.807286] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-10: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.807298] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-10: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.807409] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-11: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.807420] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-11: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.807448] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-4: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.807462] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-4: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.807539] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-2: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.807550] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-2: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.807723] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-3: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.807739] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-3: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.807785] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-5: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.807796] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-5: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.808020] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-9: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.808034] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-9: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.808054] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-6: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.808066] E [MSGID: 122066] [ec-common.c:969:ec_prepare_update_cbk] 2-glusterfsProd-disperse-6: Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.808282] E [MSGID: 122001] [ec-common.c:872:ec_config_check] 2-glusterfsProd-disperse-8: Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.808292] E [MSGID: 122066]
>>>>> [ec-common.c:969:ec_prepare_update_cbk]
2-glusterfsProd-disperse-8:
>>>>> Invalid config xattr [Invalid argument]
>>>>> [2017-01-11 14:20:17.809212] E [MSGID: 122001]
>>>>> [ec-common.c:872:ec_config_check]
2-glusterfsProd-disperse-7:
>>>>> Invalid or corrupted config [Invalid argument]
>>>>> [2017-01-11 14:20:17.809228] E [MSGID: 122066]
>>>>> [ec-common.c:969:ec_prepare_update_cbk]
2-glusterfsProd-disperse-7:
>>>>> Invalid config xattr [Invalid argument]
>>>>>
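To judge whether a `trusted.ec.config` value is plausible when `ec_config_check` rejects it, it can help to split the 8-byte value into its fields. The following is a rough sketch only: the field layout (version, algorithm, GF word size, bricks, redundancy, chunk size packed into a big-endian 64-bit integer) is an assumption about the EC translator's encoding and should be checked against the source of the installed version. The sample value is hypothetical, chosen to look like a 2 + 1 configuration.

```python
def decode_ec_config(hex_value):
    """Split a trusted.ec.config hex string into its assumed fields.

    Assumed layout, most significant byte first:
    version(8) algorithm(8) gf_word_size(8) bricks(8) redundancy(8) chunk_size(24)
    """
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 8:
        raise ValueError("expected 8 bytes, got %d" % len(raw))
    data = int.from_bytes(raw, "big")
    return {
        "version": (data >> 56) & 0xFF,
        "algorithm": (data >> 48) & 0xFF,
        "gf_word_size": (data >> 40) & 0xFF,
        "bricks": (data >> 32) & 0xFF,
        "redundancy": (data >> 24) & 0xFF,
        "chunk_size": data & 0xFFFFFF,
    }


# Hypothetical value for a disperse set with 3 bricks and redundancy 1.
config = decode_ec_config("0x0000080301000200")
print(config)
```

If the decoded brick count or redundancy does not match the volume definition, that would be consistent with the "Invalid or corrupted config" messages above.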
>>>>> [2017-01-11 14:20:17.812660] I [MSGID: 109036] [dht-common.c:8043:dht_log_new_layout_for_dir_selfheal] 2-glusterfsProd-dht: Setting layout of /Folder_01.05.2017_21.15/CV_MAGNETIC/V_31500/CHUNK_402578 with [Subvol_name: glusterfsProd-disperse-0, Err: -1 , Start: 1789569705 , Stop: 2147483645 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-1, Err: -1 , Start: 2147483646 , Stop: 2505397586 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-10, Err: -1 , Start: 2505397587 , Stop: 2863311527 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-11, Err: -1 , Start: 2863311528 , Stop: 3221225468 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-2, Err: -1 , Start: 3221225469 , Stop: 3579139409 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-3, Err: -1 , Start: 3579139410 , Stop: 3937053350 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-4, Err: -1 , Start: 3937053351 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-5, Err: -1 , Start: 0 , Stop: 357913940 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-6, Err: -1 , Start: 357913941 , Stop: 715827881 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-7, Err: -1 , Start: 715827882 , Stop: 1073741822 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-8, Err: -1 , Start: 1073741823 , Stop: 1431655763 , Hash: 1 ], [Subvol_name: glusterfsProd-disperse-9, Err: -1 , Start: 1431655764 , Stop: 1789569704 , Hash: 1 ],
>>>>>
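A layout line like the one above can be checked mechanically: the Start/Stop ranges should tile the 32-bit hash space with no gaps and no overlaps. The sketch below parses such a line and reports both. It is only an illustration of the idea. The regular expression is written against the log format shown in this thread, the function names are hypothetical, and DHT's own hole/overlap accounting may differ in detail.

```python
import re

LAYOUT_RE = re.compile(
    r"Subvol_name:\s*(?P<name>\S+?),\s*Err:\s*-?\d+\s*,\s*"
    r"Start:\s*(?P<start>\d+)\s*,\s*Stop:\s*(?P<stop>\d+)"
)


def parse_layout(log_line):
    """Return a list of (start, stop, subvolume) tuples sorted by start."""
    return sorted(
        (int(m["start"]), int(m["stop"]), m["name"])
        for m in LAYOUT_RE.finditer(log_line)
    )


def find_anomalies(ranges, hash_max=0xFFFFFFFF):
    """Return (holes, overlaps) for ranges that should cover 0..hash_max."""
    holes, overlaps = [], []
    expected = 0  # next hash value that still needs an owner
    for start, stop, name in ranges:
        if start > expected:
            holes.append((expected, start - 1))
        elif start < expected:
            overlaps.append((start, min(stop, expected - 1), name))
        expected = max(expected, stop + 1)
    if expected <= hash_max:
        holes.append((expected, hash_max))
    return holes, overlaps


# Hypothetical, shortened layout with a gap between 10 and 19.
sample = (
    "[Subvol_name: vol-disperse-0, Err: -1 , Start: 0 , Stop: 9 , Hash: 1 ], "
    "[Subvol_name: vol-disperse-1, Err: -1 , Start: 20 , Stop: 4294967295 , Hash: 1 ],"
)
holes, overlaps = find_anomalies(parse_layout(sample))
print("holes:", holes, "overlaps:", overlaps)
```

Fed the twelve ranges from the log line above, the same check finds them contiguous from 0 to 4294967295.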
>>>>>
>>>>> -----Original Message-----
>>>>> From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Ankireddypalle Reddy
>>>>> Sent: Tuesday, January 10, 2017 10:09 AM
>>>>> To: Xavier Hernandez; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>> Subject: Re: [Gluster-users] [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>
>>>>> Xavi,
>>>>> In this case it was the file creation that failed, so I provided the xattrs of the parent.
>>>>>
>>>>> Thanks and Regards,
>>>>> Ram
>>>>>
>>>>> -----Original Message-----
>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>> Sent: Tuesday, January 10, 2017 9:10 AM
>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>
>>>>> Hi Ram,
>>>>>
>>>>>> On 10/01/17 14:42, Ankireddypalle Reddy wrote:
>>>>>> Attachments (2):
>>>>>>
>>>>>> 1. ec.txt (11.50 KB)
>>>>>> <https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/ee2d1536c2dc4dff94afb12132b4f8f6/action/download>
>>>>>>
>>>>>> 2. ws-glus.log (3.48 MB)
>>>>>> <https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/cff3e0506e754b9a939db02da1cbbd58/action/download>
>>>>>>
>>>>>> Xavi,
>>>>>> We are encountering errors for different kinds of FOPs.
>>>>>> The open failed for the following file:
>>>>>>
>>>>>> cvd_2017_01_10_02_28_26.log:98182 1f9fe 01/10 00:57:10 8414465 [MEDIAFS ] 20117519-52075477 SingleInstancer_FS::StartDataFile2: Failed to create the data file [/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062], error=0xECCC0005:{CQiFile::Open(92)} + {CQiUTFOSAPI::open(96)/ErrNo.5.(Input/output error)-Open failed, File=/ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720/SFILE_CONTAINER_062, OperationFlag=0xC1, PermissionMode=0x1FF}
>>>>>>
>>>>>> I've attached the extended attributes for the directories
>>>>>> /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/
>>>>>> and
>>>>>> /ws/glus/Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854974/CHUNK_51342720
>>>>>> from all the bricks.
>>>>>>
>>>>>> The attributes look fine to me. I've also attached some log cuts to illustrate the problem.
>>>>>
>>>>> I need the extended attributes of the file itself, not the parent directories.
>>>>>
>>>>> Xavi
>>>>>
>>>>>>
>>>>>> Thanks and Regards,
>>>>>> Ram
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>> Sent: Tuesday, January 10, 2017 7:53 AM
>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>
>>>>>> Hi Ram,
>>>>>>
>>>>>> The error is caused by an extended attribute that does not match on all 3 bricks of the disperse set. The most likely one is trusted.ec.version, but it could be others.
>>>>>>
>>>>>> At first sight, I don't see any change from 3.7.8 that could have caused this. I'll check again.
>>>>>>
>>>>>> What kind of operations are you doing? This can help me narrow the search.
>>>>>>
>>>>>> Xavi
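The check described in this message, looking for an extended attribute that differs between the bricks of one disperse set, can be sketched as follows. It assumes the attributes of the same file have already been collected from each brick into dictionaries. The brick names, the attribute values and the function name are hypothetical, and which attributes must be identical is for the EC developers to say; `trusted.ec.version` is simply the one named above.

```python
def find_mismatches(xattrs_by_brick, keys=None):
    """Return {attribute: {brick: value}} for attributes that differ between bricks.

    A brick that lacks the attribute entirely is reported with value None.
    """
    all_keys = set().union(*(x.keys() for x in xattrs_by_brick.values()))
    if keys is not None:
        all_keys &= set(keys)
    mismatches = {}
    for key in sorted(all_keys):
        values = {brick: x.get(key) for brick, x in xattrs_by_brick.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


# Hypothetical xattrs of one file on the three bricks of a 2 + 1 disperse set.
collected = {
    "brick1": {"trusted.ec.version": "0x00000000000000050000000000000005",
               "trusted.ec.size": "0x0000000000100000"},
    "brick2": {"trusted.ec.version": "0x00000000000000050000000000000005",
               "trusted.ec.size": "0x0000000000100000"},
    "brick3": {"trusted.ec.version": "0x00000000000000040000000000000004",
               "trusted.ec.size": "0x0000000000100000"},
}
for name, values in find_mismatches(collected).items():
    print(name, values)
```

In this made-up example only `trusted.ec.version` is reported, with brick3 holding the odd value.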
>>>>>>
>>>>>>> On 10/01/17 13:43, Ankireddypalle Reddy wrote:
>>>>>>> Xavi,
>>>>>>> Thanks. If you could please explain what to look for in the extended attributes, then I will check and let you know if I find anything suspicious. We also noticed that some of these operations would succeed if retried. Do you know of any communication-related errors that are being reported/triaged?
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Ram
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>> Sent: Tuesday, January 10, 2017 7:23 AM
>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>
>>>>>>> Hi Ram,
>>>>>>>
>>>>>>>> On 10/01/17 13:14, Ankireddypalle Reddy wrote:
>>>>>>>> Attachment (1):
>>>>>>>>
>>>>>>>> 1. ecxattrs.txt (5.92 KB)
>>>>>>>> <https://imap.commvault.com/webconsole/api/contentstore/publicshare/346714/file/1272e68278744f15bf1a54f2b31b559d/action/download>
>>>>>>>>
>>>>>>>> Xavi,
>>>>>>>> Please find attached the extended attributes for a directory from all the bricks. The free space check failed for this directory with error number EIO.
>>>>>>>
>>>>>>> What do you mean? What operation did you perform to check the free space on that directory?
>>>>>>>
>>>>>>> If it's a recursive check, I need the extended attributes from the exact file that triggers the EIO. The attached attributes seem consistent and that directory shouldn't cause any problem. Does an 'ls' on that directory fail or does it show the contents?
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks and Regards,
>>>>>>>> Ram
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>> Sent: Tuesday, January 10, 2017 6:45 AM
>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>
>>>>>>>> Hi Ram,
>>>>>>>>
>>>>>>>> Can you execute the following command on all bricks for a file that is giving EIO?
>>>>>>>>
>>>>>>>> getfattr -m. -e hex -d <path to file in brick>
>>>>>>>>
>>>>>>>> Xavi
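Once the `getfattr -m. -e hex -d` output has been gathered from each brick, it is easier to compare if parsed into a dictionary first. A minimal sketch, assuming the usual text layout of a `# file:` header followed by `name=value` lines. The function name, the sample path and the sample values are hypothetical.

```python
def parse_getfattr(text):
    """Parse `getfattr -d -e hex` text into {file_path: {xattr_name: hex_value}}."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# file: "):
            current = line[len("# file: "):]
            result[current] = {}
        elif "=" in line and current is not None:
            name, _, value = line.partition("=")
            result[current][name] = value
    return result


# Hypothetical output captured on one brick.
SAMPLE = """\
# file: ws/disk1/ws_brick/some/file
trusted.ec.config=0x0000080301000200
trusted.ec.size=0x0000000000100000
trusted.ec.version=0x00000000000000050000000000000005
"""

for path, attrs in parse_getfattr(SAMPLE).items():
    print(path)
    for name in sorted(attrs):
        print("  ", name, attrs[name])
```

Running the parser over the output from every brick gives one dictionary per brick, which can then be compared key by key.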
>>>>>>>>
>>>>>>>>> On 10/01/17 12:41, Ankireddypalle Reddy wrote:
>>>>>>>>> Xavi,
>>>>>>>>> We have been running 3.7.8 on these servers. We upgraded to 3.7.18 yesterday. We upgraded all the servers at the same time. The volume was brought down during the upgrade.
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Ram
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Xavier Hernandez [mailto:xhernandez at datalab.es]
>>>>>>>>> Sent: Tuesday, January 10, 2017 6:35 AM
>>>>>>>>> To: Ankireddypalle Reddy; Gluster Devel (gluster-devel at gluster.org); gluster-users at gluster.org
>>>>>>>>> Subject: Re: [Gluster-devel] Lot of EIO errors in disperse volume
>>>>>>>>>
>>>>>>>>> Hi Ram,
>>>>>>>>>
>>>>>>>>> How did you upgrade gluster? From which version?
>>>>>>>>>
>>>>>>>>> Did you upgrade one server at a time and wait until self-heal finished before upgrading the next server?
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>> On 10/01/17 11:39, Ankireddypalle Reddy wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We upgraded to GlusterFS 3.7.18 yesterday. We see a lot of failures in our applications. Most of the errors are EIO. The following log lines are commonly seen in the logs:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-4: Mismatching xdata in answers of 'LOOKUP'" repeated 2 times between [2017-01-10 02:46:25.069809] and [2017-01-10 02:46:25.069835]
>>>>>>>>>> [2017-01-10 02:46:25.069852] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>>> The message "W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-5: Mismatching xdata in answers of 'LOOKUP'" repeated 2 times between [2017-01-10 02:46:25.069852] and [2017-01-10 02:46:25.069873]
>>>>>>>>>> [2017-01-10 02:46:25.069910] W [MSGID: 122056] [ec-combine.c:873:ec_combine_check] 0-StoragePool-disperse-6: Mismatching xdata in answers of 'LOOKUP'
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> [2017-01-10 02:46:26.520774] I [MSGID: 109036] [dht-common.c:9076:dht_log_new_layout_for_dir_selfheal] 0-StoragePool-dht: Setting layout of /Folder_07.11.2016_23.02/CV_MAGNETIC/V_8854213/CHUNK_51334585 with [Subvol_name: StoragePool-disperse-0, Err: -1 , Start: 3221225466 , Stop: 3758096376 , Hash: 1 ], [Subvol_name: StoragePool-disperse-1, Err: -1 , Start: 3758096377 , Stop: 4294967295 , Hash: 1 ], [Subvol_name: StoragePool-disperse-2, Err: -1 , Start: 0 , Stop: 536870910 , Hash: 1 ], [Subvol_name: StoragePool-disperse-3, Err: -1 , Start: 536870911 , Stop: 1073741821 , Hash: 1 ], [Subvol_name: StoragePool-disperse-4, Err: -1 , Start: 1073741822 , Stop: 1610612732 , Hash: 1 ], [Subvol_name: StoragePool-disperse-5, Err: -1 , Start: 1610612733 , Stop: 2147483643 , Hash: 1 ], [Subvol_name: StoragePool-disperse-6, Err: -1 , Start: 2147483644 , Stop: 2684354554 , Hash: 1 ], [Subvol_name: StoragePool-disperse-7, Err: -1 , Start: 2684354555 , Stop: 3221225465 , Hash: 1 ],
>>>>>>>>>>
>>>>>>>>>> [2017-01-10 02:46:26.522841] N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>>>>>>>>>> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-3: Mismatching dictionary in answers of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.522841] and [2017-01-10 02:46:26.522894]
>>>>>>>>>> [2017-01-10 02:46:26.522898] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-3: Failed to get size and version [Input/output error]
>>>>>>>>>> [2017-01-10 02:46:26.523115] N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>>>>>>>>>> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-6: Mismatching dictionary in answers of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.523115] and [2017-01-10 02:46:26.523143]
>>>>>>>>>> [2017-01-10 02:46:26.523147] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-6: Failed to get size and version [Input/output error]
>>>>>>>>>> [2017-01-10 02:46:26.523302] N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2: Mismatching dictionary in answers of 'GF_FOP_XATTROP'
>>>>>>>>>> The message "N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] 0-StoragePool-disperse-2: Mismatching dictionary in answers of 'GF_FOP_XATTROP'" repeated 2 times between [2017-01-10 02:46:26.523302] and [2017-01-10 02:46:26.523324]
>>>>>>>>>> [2017-01-10 02:46:26.523328] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] 0-StoragePool-disperse-2: Failed to get size and version [Input/output error]
>>>>>>>>>>
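With this many similar entries, a per-subvolume tally shows quickly whether the EIO errors are confined to a few disperse sets or spread across all of them. The sketch below counts log lines whose message contains "Input/output error", grouped by the translator name. The regular expression is written against the log format seen in this thread and the function name is hypothetical. Suppressed entries summarised as 'The message "..." repeated N times' are deliberately not counted.

```python
import re
from collections import Counter

LOG_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) \[MSGID: (?P<msgid>\d+)\] "
    r"\[(?P<source>[^\]]+)\] (?P<xlator>\S+): (?P<message>.*)$"
)


def tally_by_subvolume(lines, needle="Input/output error"):
    """Count log entries containing `needle`, keyed by translator name."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match and needle in match["message"]:
            counts[match["xlator"]] += 1
    return counts


# Three entries taken from the excerpt above.
SAMPLE = [
    "[2017-01-10 02:46:26.522841] N [MSGID: 122031] [ec-generic.c:1130:ec_combine_xattrop] "
    "0-StoragePool-disperse-3: Mismatching dictionary in answers of 'GF_FOP_XATTROP'",
    "[2017-01-10 02:46:26.522898] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] "
    "0-StoragePool-disperse-3: Failed to get size and version [Input/output error]",
    "[2017-01-10 02:46:26.523147] W [MSGID: 122040] [ec-common.c:919:ec_prepare_update_cbk] "
    "0-StoragePool-disperse-6: Failed to get size and version [Input/output error]",
]

for xlator, count in sorted(tally_by_subvolume(SAMPLE).items()):
    print(xlator, count)
```

Passing a different `needle`, for example "Mismatching", tallies the other message types in the same way.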
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [root@glusterfs3 Log_Files]# gluster --version
>>>>>>>>>> glusterfs 3.7.18 built on Dec 8 2016 06:34:26
>>>>>>>>>>
>>>>>>>>>> [root@glusterfs3 Log_Files]# gluster volume info
>>>>>>>>>>
>>>>>>>>>> Volume Name: StoragePool
>>>>>>>>>> Type: Distributed-Disperse
>>>>>>>>>> Volume ID: 149e976f-4e21-451c-bf0f-f5691208531f
>>>>>>>>>> Status: Started
>>>>>>>>>> Number of Bricks: 8 x (2 + 1) = 24
>>>>>>>>>> Transport-type: tcp
>>>>>>>>>> Bricks:
>>>>>>>>>> Brick1: glusterfs1sds:/ws/disk1/ws_brick
>>>>>>>>>> Brick2: glusterfs2sds:/ws/disk1/ws_brick
>>>>>>>>>> Brick3: glusterfs3sds:/ws/disk1/ws_brick
>>>>>>>>>> Brick4: glusterfs1sds:/ws/disk2/ws_brick
>>>>>>>>>> Brick5: glusterfs2sds:/ws/disk2/ws_brick
>>>>>>>>>> Brick6: glusterfs3sds:/ws/disk2/ws_brick
>>>>>>>>>> Brick7: glusterfs1sds:/ws/disk3/ws_brick
>>>>>>>>>> Brick8: glusterfs2sds:/ws/disk3/ws_brick
>>>>>>>>>> Brick9: glusterfs3sds:/ws/disk3/ws_brick
>>>>>>>>>> Brick10: glusterfs1sds:/ws/disk4/ws_brick
>>>>>>>>>> Brick11: glusterfs2sds:/ws/disk4/ws_brick
>>>>>>>>>> Brick12: glusterfs3sds:/ws/disk4/ws_brick
>>>>>>>>>> Brick13: glusterfs1sds:/ws/disk5/ws_brick
>>>>>>>>>> Brick14: glusterfs2sds:/ws/disk5/ws_brick
>>>>>>>>>> Brick15: glusterfs3sds:/ws/disk5/ws_brick
>>>>>>>>>> Brick16: glusterfs1sds:/ws/disk6/ws_brick
>>>>>>>>>> Brick17: glusterfs2sds:/ws/disk6/ws_brick
>>>>>>>>>> Brick18: glusterfs3sds:/ws/disk6/ws_brick
>>>>>>>>>> Brick19: glusterfs1sds:/ws/disk7/ws_brick
>>>>>>>>>> Brick20: glusterfs2sds:/ws/disk7/ws_brick
>>>>>>>>>> Brick21: glusterfs3sds:/ws/disk7/ws_brick
>>>>>>>>>> Brick22: glusterfs1sds:/ws/disk8/ws_brick
>>>>>>>>>> Brick23: glusterfs2sds:/ws/disk8/ws_brick
>>>>>>>>>> Brick24: glusterfs3sds:/ws/disk8/ws_brick
>>>>>>>>>> Options Reconfigured:
>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>> diagnostics.client-log-level: INFO
>>>>>>>>>>
>>>>>>>>>>
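To find which three bricks to inspect for a given `StoragePool-disperse-N` message, the brick list can be split into the 8 sets of 2 + 1. The sketch below assumes that consecutive bricks in the `gluster volume info` listing form one disperse set and that set N corresponds to subvolume `disperse-N`. That mapping is an assumption here and is worth confirming against the volume's volfile. The function name is hypothetical.

```python
def group_bricks(bricks, data, redundancy):
    """Split an ordered brick list into disperse sets of (data + redundancy) bricks."""
    size = data + redundancy
    if len(bricks) % size != 0:
        raise ValueError("brick count %d is not a multiple of %d" % (len(bricks), size))
    return [bricks[i:i + size] for i in range(0, len(bricks), size)]


# Brick list in the order shown by `gluster volume info` above.
BRICKS = [
    "glusterfs%dsds:/ws/disk%d/ws_brick" % (host, disk)
    for disk in range(1, 9)
    for host in range(1, 4)
]

SETS = group_bricks(BRICKS, data=2, redundancy=1)
for index, members in enumerate(SETS):
    print("disperse-%d:" % index, ", ".join(members))
```

Under that assumption, a message from `StoragePool-disperse-3` would point at the three `/ws/disk4/ws_brick` bricks.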
>>>>>>>>>>
>>>>>>>>>> Thanks and Regards,
>>>>>>>>>>
>>>>>>>>>> Ram
>>>>>>>>>>
>>>>>>>>>> ***************************Legal Disclaimer***************************
>>>>>>>>>> "This communication may contain confidential and privileged material for the sole use of the intended recipient. Any unauthorized review, use or distribution by others is strictly prohibited. If you have received the message by mistake, please advise the sender by reply email and delete the message. Thank you."
>>>>>>>>>> **********************************************************************
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Gluster-devel mailing list
>>>>>>>>>> Gluster-devel at gluster.org
>>>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Gluster-users mailing list
>>>>> Gluster-users at gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>>
>>>>
>>>
>>
>>
>>
>