Pranith Kumar Karampuri
2015-Feb-05 10:30 UTC
[Gluster-users] [Gluster-devel] missing files
On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> I believe David already fixed this. I hope this is the same issue he
> told about permissions issue.
Oops, it is not. I will take a look.

Pranith
>
> Pranith
> On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
>> Is the failure repeatable ? with the same directories ?
>>
>> It's very weird that the directories appear on the volume when you do
>> an 'ls' on the bricks. Could it be that you only made a single 'ls'
>> on fuse mount which not showed the directory ? Is it possible that
>> this 'ls' triggered a self-heal that repaired the problem, whatever
>> it was, and when you did another 'ls' on the fuse mount after the
>> 'ls' on the bricks, the directories were there ?
>>
>> The first 'ls' could have healed the files, causing that the
>> following 'ls' on the bricks showed the files as if nothing were
>> damaged. If that's the case, it's possible that there were some
>> disconnections during the copy.
>>
>> Added Pranith because he knows better replication and self-heal details.
>>
>> Xavi
>>
>> On 02/04/2015 07:23 PM, David F. Robinson wrote:
>>> Distributed/replicated
>>>
>>> Volume Name: homegfs
>>> Type: Distributed-Replicate
>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
>>> Status: Started
>>> Number of Bricks: 4 x 2 = 8
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
>>> Options Reconfigured:
>>> performance.io-thread-count: 32
>>> performance.cache-size: 128MB
>>> performance.write-behind-window-size: 128MB
>>> server.allow-insecure: on
>>> network.ping-timeout: 10
>>> storage.owner-gid: 100
>>> geo-replication.indexing: off
>>> geo-replication.ignore-pid-check: on
>>> changelog.changelog: on
>>> changelog.fsync-interval: 3
>>> changelog.rollover-time: 15
>>> server.manage-gids: on
>>>
>>>
>>> ------ Original Message ------
>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>>> To: "David F. Robinson" <david.robinson at corvidtec.com>; "Benjamin Turner" <bennyturns at gmail.com>
>>> Cc: "gluster-users at gluster.org" <gluster-users at gluster.org>; "Gluster Devel" <gluster-devel at gluster.org>
>>> Sent: 2/4/2015 6:03:45 AM
>>> Subject: Re: [Gluster-devel] missing files
>>>
>>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
>>>>> Sorry. Thought about this a little more. I should have been clearer.
>>>>> The files were on both bricks of the replica, not just one side. So,
>>>>> both bricks had to have been up... The files/directories just don't
>>>>> show up on the mount.
>>>>> I was reading and saw a related bug
>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it
>>>>> suggested to run:
>>>>> find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
>>>>
>>>> This command is specific for a dispersed volume. It won't do anything
>>>> (aside from the error you are seeing) on a replicated volume.
>>>>
>>>> I think you are using a replicated volume, right ?
>>>>
>>>> In this case I'm not sure what can be happening. Is your volume a pure
>>>> replicated one or a distributed-replicated ? on a pure replicated it
>>>> doesn't make sense that some entries do not show in an 'ls' when the
>>>> file is in both replicas (at least without any error message in the
>>>> logs). On a distributed-replicated it could be caused by some problem
>>>> while combining contents of each replica set.
>>>>
>>>> What's the configuration of your volume ?
>>>>
>>>> Xavi
>>>>
>>>>>
>>>>> I get a bunch of errors for operation not supported:
>>>>> [root at gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
>>>>> find: warning: the -d option is deprecated; please use -depth instead,
>>>>> because the latter is a POSIX-compliant feature.
>>>>> wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
>>>>> wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
>>>>> wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
>>>>> wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
>>>>> wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
>>>>> wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
>>>>> wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
>>>>> wks_backup/homer_backup: trusted.ec.heal: Operation not supported
>>>>> ------ Original Message ------
>>>>> From: "Benjamin Turner" <bennyturns at gmail.com>
>>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>
>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>; "gluster-users at gluster.org" <gluster-users at gluster.org>
>>>>> Sent: 2/3/2015 7:12:34 PM
>>>>> Subject: Re: [Gluster-devel] missing files
>>>>>> It sounds to me like the files were only copied to one replica, werent
>>>>>> there for the initial for the initial ls which triggered a self heal,
>>>>>> and were there for the last ls because they were healed. Is there any
>>>>>> chance that one of the replicas was down during the rsync? It could
>>>>>> be that you lost a brick during copy or something like that. To
>>>>>> confirm I would look for disconnects in the brick logs as well as
>>>>>> checking glusterfshd.log to verify the missing files were actually
>>>>>> healed.
>>>>>>
>>>>>> -b
>>>>>>
>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
>>>>>> <david.robinson at corvidtec.com> wrote:
>>>>>>
>>>>>> I rsync'd 20-TB over to my gluster system and noticed that I had
>>>>>> some directories missing even though the rsync completed normally.
>>>>>> The rsync logs showed that the missing files were transferred.
>>>>>> I went to the bricks and did an 'ls -al
>>>>>> /data/brick*/homegfs/dir/*' the files were on the bricks. After I
>>>>>> did this 'ls', the files then showed up on the FUSE mounts.
>>>>>> 1) Why are the files hidden on the fuse mount?
>>>>>> 2) Why does the ls make them show up on the FUSE mount?
>>>>>> 3) How can I prevent this from happening again?
>>>>>> Note, I also mounted the gluster volume using NFS and saw the same
>>>>>> behavior. The files/directories were not shown until I did the
>>>>>> "ls" on the bricks.
>>>>>> David
>>>>>> ===============================
>>>>>> David F. Robinson, Ph.D.
>>>>>> President - Corvid Technologies
>>>>>> 704.799.6944 x101 [office]
>>>>>> 704.252.1310 [cell]
>>>>>> 704.799.7974 [fax]
>>>>>> David.Robinson at corvidtec.com
>>>>>> http://www.corvidtechnologies.com
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gluster-devel mailing list
>>>>>> Gluster-devel at gluster.org
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>
>>>>>
>>>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
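
For readers following the "both bricks of the replica" discussion, here is a minimal Python sketch of how the eight bricks in the quoted volume info pair up. It assumes the usual Gluster convention that consecutive bricks in the listing form a replica set (so "4 x 2 = 8" means four pairs); the function name `replica_sets` is made up for illustration and is not part of any Gluster tool.

```python
# Sketch only: groups the brick list from the quoted "volume info" output
# into replica sets, ASSUMING consecutive bricks form one replica set.

BRICKS = [
    "gfsib01a.corvidtec.com:/data/brick01a/homegfs",
    "gfsib01b.corvidtec.com:/data/brick01b/homegfs",
    "gfsib01a.corvidtec.com:/data/brick02a/homegfs",
    "gfsib01b.corvidtec.com:/data/brick02b/homegfs",
    "gfsib02a.corvidtec.com:/data/brick01a/homegfs",
    "gfsib02b.corvidtec.com:/data/brick01b/homegfs",
    "gfsib02a.corvidtec.com:/data/brick02a/homegfs",
    "gfsib02b.corvidtec.com:/data/brick02b/homegfs",
]


def replica_sets(bricks, replica_count):
    """Split the ordered brick list into chunks of replica_count bricks."""
    if replica_count <= 0 or len(bricks) % replica_count != 0:
        raise ValueError("brick count must be a multiple of replica_count")
    return [
        bricks[i:i + replica_count]
        for i in range(0, len(bricks), replica_count)
    ]


# Print each replica pair; a file should live on both bricks of exactly one pair.
for number, members in enumerate(replica_sets(BRICKS, 2), start=1):
    print(f"replica set {number}: {' <-> '.join(members)}")
```

Under that assumption, a given file should be present on both bricks of exactly one pair, which is what David reports seeing on the bricks.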
----- Original Message -----
> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> To: "Xavier Hernandez" <xhernandez at datalab.es>, "David F. Robinson" <david.robinson at corvidtec.com>, "Benjamin Turner" <bennyturns at gmail.com>
> Cc: gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Thursday, February 5, 2015 5:30:04 AM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>
>
> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > I believe David already fixed this. I hope this is the same issue he
> > told about permissions issue.
> Oops, it is not. I will take a look.

Yes David, exactly like these:

data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

You can 100% verify my theory if you can correlate the time of the disconnects with the time that the missing files were healed. Can you have a look at /var/log/glusterfs/glustershd.log? That has all of the healed files plus timestamps. If we can see a disconnect during the rsync and a self-heal of the missing file, I think we can safely assume that the disconnects may have caused this. I'll try this on my test systems. How much data did you rsync? Roughly what size of files, and what does the dir layout look like?

@Pranith - Could bricks flapping up and down during the rsync be a possible cause here? That is, the files were missing on the first ls (written to one subvol but not the other because it was down), the ls triggered SH, and that's why the files were there for the second ls?

-b
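
The correlation suggested in this message can be started with a short script. The Python sketch below pulls the timestamp and client identifier out of brick-log lines shaped like the `server_rpc_notify` lines quoted in this message, then keeps only the disconnects that fall inside a given time window (for example, the period the rsync was running). The function names and the window values are illustrative assumptions; the format of glustershd.log is not shown in this thread, so matching against heal entries is deliberately left out.

```python
import re
from datetime import datetime

# Matches the leading "[YYYY-MM-DD HH:MM:SS.ffffff]" timestamp and the client id
# that follows "disconnecting connection from", as in the quoted brick log lines.
DISCONNECT_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]"
    r".*disconnecting connection from (\S+)"
)

# Two of the lines quoted in this thread, used as sample input.
SAMPLE_LINES = [
    "data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I "
    "[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting "
    "connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-"
    "homegfs-client-2-0-0",
    "data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I "
    "[server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting "
    "connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-"
    "homegfs-client-2-0-0",
]


def parse_disconnects(lines):
    """Return (timestamp, client_id) tuples for every disconnect line found."""
    events = []
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((stamp, match.group(2)))
    return events


def disconnects_in_window(events, start, end):
    """Keep only the events whose timestamp lies in [start, end]."""
    return [event for event in events if start <= event[0] <= end]


# Hypothetical window; replace with the actual start and end of the rsync.
window_start = datetime(2015, 2, 3, 19, 0, 0)
window_end = datetime(2015, 2, 3, 19, 10, 0)

for stamp, client in disconnects_in_window(
    parse_disconnects(SAMPLE_LINES), window_start, window_end
):
    print(stamp.isoformat(), client)
```

To use it on real data, read the lines from the brick logs instead of `SAMPLE_LINES` and compare the surviving timestamps by hand against the heal entries in glustershd.log.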