----- Original Message -----
> From: "Ben Turner" <bturner at redhat.com>
> To: "David F. Robinson" <david.robinson at corvidtec.com>
> Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Xavier Hernandez" <xhernandez at datalab.es>, "Benjamin Turner" <bennyturns at gmail.com>, gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Thursday, February 5, 2015 5:22:26 PM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>
> ----- Original Message -----
> > From: "David F. Robinson" <david.robinson at corvidtec.com>
> > To: "Ben Turner" <bturner at redhat.com>
> > Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Xavier Hernandez" <xhernandez at datalab.es>, "Benjamin Turner" <bennyturns at gmail.com>, gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> > Sent: Thursday, February 5, 2015 5:01:13 PM
> > Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >
> > I'll send you the emails I sent Pranith with the logs. What causes these disconnects?
>
> Thanks David! Disconnects happen when there are interruptions in communication between peers; normally it is a ping timeout that does it. It could be anything from a flaky NW to the system being too busy to respond to the pings. My initial take is more towards the latter, as rsync is absolutely the worst use case for gluster - IIRC it writes in 4KB blocks. I try to keep my writes at least 64KB, as in my testing that is the smallest block size I can write with before perf starts to really drop off. I'll try something similar in the lab.

Ok, I do think that the files being self-healed is the RCA for what you were seeing. Let's look at one of the disconnects:

data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

And in the glustershd.log from the gfs01b_glustershd.log file:

[2015-02-03 20:55:48.001797] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
[2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
[2015-02-03 20:55:49.343093] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
[2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
[2015-02-03 20:55:51.465289] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:51.467098] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:55.258548] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
[2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
[2015-02-03 20:55:55.259980] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541

As you can see, the self-heal logs are just spammed with files being healed, and for the couple of disconnects I looked at I see self heals getting run shortly after on the bricks that were down. Now we need to find the cause of the disconnects; I am thinking that once the disconnects are resolved the files should be properly copied over without SH having to fix things. Like I said, I'll give this a go on my lab systems and see if I can repro the disconnects; I'll have time to run through it tomorrow. If in the meantime anyone else has a theory / anything to add here, it would be appreciated.

-b

> -b
>
> > David (Sent from mobile)
> >
> > ==============================
> > David F. Robinson, Ph.D.
> > President - Corvid Technologies
> > 704.799.6944 x101 [office]
> > 704.252.1310 [cell]
> > 704.799.7974 [fax]
> > David.Robinson at corvidtec.com
> > http://www.corvidtechnologies.com
> >
> > > On Feb 5, 2015, at 4:55 PM, Ben Turner <bturner at redhat.com> wrote:
> > >
> > > ----- Original Message -----
> > >> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > >> To: "Xavier Hernandez" <xhernandez at datalab.es>, "David F. Robinson" <david.robinson at corvidtec.com>, "Benjamin Turner" <bennyturns at gmail.com>
> > >> Cc: gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> > >> Sent: Thursday, February 5, 2015 5:30:04 AM
> > >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> > >>
> > >>
> > >>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > >>> I believe David already fixed this. I hope this is the same permissions issue he told us about.
> > >> Oops, it is not. I will take a look.
> > >
> > > Yes David, exactly like these:
> > >
> > > data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> > >
> > > You can 100% verify my theory if you can correlate the time on the disconnects to the time that the missing files were healed. Can you have a look at /var/log/glusterfs/glustershd.log? That has all of the healed files + timestamps; if we can see a disconnect during the rsync and a self heal of the missing file, I think we can safely assume that the disconnects may have caused this. I'll try this on my test systems - how much data did you rsync? Roughly what size of files / an idea of the dir layout?
> > >
> > > @Pranith - Could this be a possible cause here: bricks flapping up and down during the rsync caused the files to be missing on the first ls (written to 1 subvol but not the other because it was down), the ls triggered SH, and that's why the files were there for the second ls?
> > >
> > > -b
> > >
> > >> Pranith
> > >>>
> > >>> Pranith
> > >>>> On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> > >>>> Is the failure repeatable? With the same directories?
> > >>>>
> > >>>> It's very weird that the directories appear on the volume when you do an 'ls' on the bricks. Could it be that you only did a single 'ls' on the fuse mount, which did not show the directory? Is it possible that this 'ls' triggered a self-heal that repaired the problem, whatever it was, and when you did another 'ls' on the fuse mount after the 'ls' on the bricks, the directories were there?
> > >>>>
> > >>>> The first 'ls' could have healed the files, causing the following 'ls' on the bricks to show the files as if nothing were damaged. If that's the case, it's possible that there were some disconnections during the copy.
> > >>>>
> > >>>> Added Pranith because he knows the replication and self-heal details better.
> > >>>>
> > >>>> Xavi
> > >>>>
> > >>>>> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> > >>>>> Distributed/replicated
> > >>>>>
> > >>>>> Volume Name: homegfs
> > >>>>> Type: Distributed-Replicate
> > >>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> > >>>>> Status: Started
> > >>>>> Number of Bricks: 4 x 2 = 8
> > >>>>> Transport-type: tcp
> > >>>>> Bricks:
> > >>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> > >>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> > >>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> > >>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> > >>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> > >>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> > >>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> > >>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> > >>>>> Options Reconfigured:
> > >>>>> performance.io-thread-count: 32
> > >>>>> performance.cache-size: 128MB
> > >>>>> performance.write-behind-window-size: 128MB
> > >>>>> server.allow-insecure: on
> > >>>>> network.ping-timeout: 10
> > >>>>> storage.owner-gid: 100
> > >>>>> geo-replication.indexing: off
> > >>>>> geo-replication.ignore-pid-check: on
> > >>>>> changelog.changelog: on
> > >>>>> changelog.fsync-interval: 3
> > >>>>> changelog.rollover-time: 15
> > >>>>> server.manage-gids: on
> > >>>>>
> > >>>>>
> > >>>>> ------ Original Message ------
> > >>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
> > >>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>; "Benjamin Turner" <bennyturns at gmail.com>
> > >>>>> Cc: "gluster-users at gluster.org" <gluster-users at gluster.org>; "Gluster Devel" <gluster-devel at gluster.org>
> > >>>>> Sent: 2/4/2015 6:03:45 AM
> > >>>>> Subject: Re: [Gluster-devel] missing files
> > >>>>>
> > >>>>>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
> > >>>>>>> Sorry. Thought about this a little more. I should have been clearer. The files were on both bricks of the replica, not just one side. So, both bricks had to have been up... The files/directories just don't show up on the mount.
> > >>>>>>> I was reading and saw a related bug (https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it suggested to run:
> > >>>>>>> find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
> > >>>>>>
> > >>>>>> This command is specific to a dispersed volume. It won't do anything (aside from the error you are seeing) on a replicated volume.
> > >>>>>>
> > >>>>>> I think you are using a replicated volume, right?
> > >>>>>>
> > >>>>>> In this case I'm not sure what can be happening. Is your volume a pure replicated one or a distributed-replicated? On a pure replicated one it doesn't make sense that some entries do not show in an 'ls' when the file is in both replicas (at least without any error message in the logs). On a distributed-replicated it could be caused by some problem while combining the contents of each replica set.
> > >>>>>>
> > >>>>>> What's the configuration of your volume?
> > >>>>>>
> > >>>>>> Xavi
> > >>>>>>
> > >>>>>>>
> > >>>>>>> I get a bunch of errors for operation not supported:
> > >>>>>>> [root at gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
> > >>>>>>> find: warning: the -d option is deprecated; please use -depth instead, because the latter is a POSIX-compliant feature.
> > >>>>>>> wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup: trusted.ec.heal: Operation not supported
> > >>>>>>> ------ Original Message ------
> > >>>>>>> From: "Benjamin Turner" <bennyturns at gmail.com <mailto:bennyturns at gmail.com>>
> > >>>>>>> To: "David F. Robinson" <david.robinson at corvidtec.com <mailto:david.robinson at corvidtec.com>>
> > >>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>>; "gluster-users at gluster.org" <gluster-users at gluster.org <mailto:gluster-users at gluster.org>>
> > >>>>>>> Sent: 2/3/2015 7:12:34 PM
> > >>>>>>> Subject: Re: [Gluster-devel] missing files
> > >>>>>>>> It sounds to me like the files were only copied to one replica, weren't there for the initial ls which triggered a self heal, and were there for the last ls because they were healed. Is there any chance that one of the replicas was down during the rsync? It could be that you lost a brick during the copy or something like that. To confirm, I would look for disconnects in the brick logs as well as checking glustershd.log to verify the missing files were actually healed.
> > >>>>>>>>
> > >>>>>>>> -b
> > >>>>>>>>
> > >>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson <david.robinson at corvidtec.com <mailto:david.robinson at corvidtec.com>> wrote:
> > >>>>>>>>
> > >>>>>>>> I rsync'd 20-TB over to my gluster system and noticed that I had some directories missing even though the rsync completed normally. The rsync logs showed that the missing files were transferred.
> > >>>>>>>> I went to the bricks and did an 'ls -al /data/brick*/homegfs/dir/*' and the files were on the bricks. After I did this 'ls', the files then showed up on the FUSE mounts.
> > >>>>>>>> 1) Why are the files hidden on the fuse mount?
> > >>>>>>>> 2) Why does the ls make them show up on the FUSE mount?
> > >>>>>>>> 3) How can I prevent this from happening again?
> > >>>>>>>> Note, I also mounted the gluster volume using NFS and saw the same behavior. The files/directories were not shown until I did the "ls" on the bricks.
> > >>>>>>>> David
> > >>>>>>>> ==============================
> > >>>>>>>> David F. Robinson, Ph.D.
> > >>>>>>>> President - Corvid Technologies
> > >>>>>>>> 704.799.6944 x101 <tel:704.799.6944%20x101> [office]
> > >>>>>>>> 704.252.1310 <tel:704.252.1310> [cell]
> > >>>>>>>> 704.799.7974 <tel:704.799.7974> [fax]
> > >>>>>>>> David.Robinson at corvidtec.com <mailto:David.Robinson at corvidtec.com>
> > >>>>>>>> http://www.corvidtechnologies.com <http://www.corvidtechnologies.com/>
> > >>>>>>>>
> > >>>>>>>> _______________________________________________
> > >>>>>>>> Gluster-devel mailing list
> > >>>>>>>> Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> > >>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> _______________________________________________
> > >>>>>>> Gluster-devel mailing list
> > >>>>>>> Gluster-devel at gluster.org
> > >>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>>
> > >>> _______________________________________________
> > >>> Gluster-users mailing list
> > >>> Gluster-users at gluster.org
> > >>> http://www.gluster.org/mailman/listinfo/gluster-users
> > >>
> > >> _______________________________________________
> > >> Gluster-users mailing list
> > >> Gluster-users at gluster.org
> > >> http://www.gluster.org/mailman/listinfo/gluster-users
> > >>
> > >
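For anyone wanting to do the timestamp correlation Ben suggests above, a minimal shell sketch along these lines should be close. The volume and log file names are taken from this thread, but the exact paths under /var/log/glusterfs and the awk field positions are assumptions based on the log lines quoted above, so adjust them for your nodes:

#!/bin/bash
# Rough correlation helper: print brick-side disconnect times next to
# self-heal completion times so they can be compared by eye.
# Paths are assumptions based on the file names mentioned in this thread.

BRICK_LOG=${1:-/var/log/glusterfs/bricks/data-brick02a-homegfs.log}
SHD_LOG=${2:-/var/log/glusterfs/glustershd.log}

echo "=== disconnects in $BRICK_LOG ==="
# e.g. "[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] ... disconnecting connection from <client-id>"
grep 'disconnecting connection' "$BRICK_LOG" | awk '{print $1, $2, $NF}'

echo "=== self-heal completions in $SHD_LOG ==="
# e.g. "[... 20:55:49.341996] I [...afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on <gfid>. source=1 sinks=0"
grep 'Completed' "$SHD_LOG" | grep 'selfheal on' | awk '{print $1, $2, $7, $10}'

Lining the two outputs up by timestamp is enough to eyeball whether each batch of heals follows a disconnect.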
Should I run my rsync with --block-size set to something other than the default? Is there an optimal value? I think 128k is the max, from my quick search. Didn't dig into it thoroughly though.

David (Sent from mobile)

==============================
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
David.Robinson at corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner <bturner at redhat.com> wrote:
> [...]
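On the --block-size question above: as far as I can tell, rsync's -B/--block-size only tunes the checksum block size used by the delta-transfer algorithm, not the size of the writes it issues to the destination, so it is probably not the knob that matters here (a hedged note, not verified against this rsync version). To measure the write-size sensitivity Ben describes, a quick dd sweep over a few block sizes against the FUSE mount is one way to check; the mount point below is hypothetical:

#!/bin/bash
# Write ~256 MiB at several block sizes to a gluster FUSE mount and report
# the throughput line that dd prints, to see where small writes fall off.
# /mnt/homegfs is a hypothetical mount point - adjust for your setup.

MOUNT=/mnt/homegfs
declare -A COUNTS=( [4K]=65536 [64K]=4096 [128K]=2048 [1M]=256 )

for bs in 4K 64K 128K 1M; do
    echo "=== bs=$bs ==="
    dd if=/dev/zero of="$MOUNT/ddtest_$bs" bs=$bs count=${COUNTS[$bs]} conv=fsync 2>&1 | tail -n1
    rm -f "$MOUNT/ddtest_$bs"
done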
Isn't rsync what geo-rep uses?

David (Sent from mobile)

==============================
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
David.Robinson at corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner <bturner at redhat.com> wrote:
> [...]
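As a general cross-check after a large copy like the 20-TB rsync discussed in this thread, the heal-info query shows whether any entries are still pending heal. The volume name comes from the `gluster volume info` output quoted above, and the exact output format varies between gluster releases, so treat this as a sketch rather than version-specific guidance:

# List entries still pending heal on each brick of the homegfs volume.
gluster volume heal homegfs info

# Show entries flagged as split-brain; available on the 3.x releases of this
# era, though subcommand support varies by version (treat as an assumption).
gluster volume heal homegfs info split-brain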