----- Original Message -----
> From: "Ben Turner" <bturner at redhat.com>
> To: "David F. Robinson" <david.robinson at corvidtec.com>
> Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Xavier Hernandez"
> <xhernandez at datalab.es>, "Benjamin Turner" <bennyturns at gmail.com>,
> gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Thursday, February 5, 2015 5:22:26 PM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>
> ----- Original Message -----
> > From: "David F. Robinson" <david.robinson at corvidtec.com>
> > To: "Ben Turner" <bturner at redhat.com>
> > Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Xavier Hernandez"
> > <xhernandez at datalab.es>, "Benjamin Turner" <bennyturns at gmail.com>,
> > gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> > Sent: Thursday, February 5, 2015 5:01:13 PM
> > Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >
> > I'll send you the emails I sent Pranith with the logs. What causes these
> > disconnects?
>
> Thanks David!  Disconnects happen when there is an interruption in
> communication between peers; normally it's a ping timeout.  It could be
> anything from a flaky NW to the system being too busy to respond to the
> pings.  My initial take leans more towards the latter, as rsync is
> absolutely the worst use case for gluster - IIRC it writes in 4KB blocks.
> I try to keep my writes at least 64KB, as in my testing that is the
> smallest block size I can write with before perf really starts to drop
> off.  I'll try something similar in the lab.
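To see the block size effect on your end, a quick dd comparison along these
lines should work - the mount point below is just an example path, use
wherever homegfs is fuse mounted on your side:

# 1GB written in 4KB blocks vs in 64KB blocks to the fuse mount (example path)
dd if=/dev/zero of=/mnt/homegfs/ddtest.4k bs=4K count=262144 conv=fsync
dd if=/dev/zero of=/mnt/homegfs/ddtest.64k bs=64K count=16384 conv=fsync
rm -f /mnt/homegfs/ddtest.4k /mnt/homegfs/ddtest.64k

The throughput dd reports for the 64KB run should be noticeably higher than
for the 4KB run.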
Ok, I do think that the files being self healed is the RCA for what you were
seeing.  Let's look at one of the disconnects:
data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
And in the glustershd.log from gfs01b (the gfs01b_glustershd.log file):
[2015-02-03 20:55:48.001797] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
[2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
[2015-02-03 20:55:49.343093] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
[2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
[2015-02-03 20:55:51.465289] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:51.467098] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:55.258548] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
[2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
[2015-02-03 20:55:55.259980] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
As you can see, the self heal logs are just spammed with files being healed,
and for the couple of disconnects I looked at I see self heals getting run
shortly after on the bricks that were down.  Now we need to find the cause of
the disconnects; I am thinking that once the disconnects are resolved the
files should be properly copied over without SH having to fix things.  Like I
said, I'll give this a go on my lab systems and see if I can repro the
disconnects - I'll have time to run through it tomorrow.  If in the meantime
anyone else has a theory / anything to add here, it would be appreciated.
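
If you want to do the correlation on your side in the meantime, something
like this should line the timestamps up - the brick log path is a guess based
on your log names, adjust as needed:

# timestamps of disconnects seen by the bricks (path assumed, adjust to your layout)
grep "disconnecting connection" /var/log/glusterfs/bricks/data-brick*-homegfs.log

# timestamps of completed self heals on the replica
grep "Completed .* selfheal" /var/log/glusterfs/glustershd.log

If the heals consistently land a minute or two after a disconnect, like in
the snippets above, that pretty much confirms the theory.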
-b
> -b
>
> > David (Sent from mobile)
> >
> > ==============================
> > David F. Robinson, Ph.D.
> > President - Corvid Technologies
> > 704.799.6944 x101 [office]
> > 704.252.1310 [cell]
> > 704.799.7974 [fax]
> > David.Robinson at corvidtec.com
> > http://www.corvidtechnologies.com
> >
> > > On Feb 5, 2015, at 4:55 PM, Ben Turner <bturner at redhat.com> wrote:
> > >
> > > ----- Original Message -----
> > >> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > >> To: "Xavier Hernandez" <xhernandez at datalab.es>, "David F. Robinson"
> > >> <david.robinson at corvidtec.com>, "Benjamin Turner" <bennyturns at gmail.com>
> > >> Cc: gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> > >> Sent: Thursday, February 5, 2015 5:30:04 AM
> > >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> > >>
> > >>
> > >>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > >>> I believe David already fixed this. I hope this is the same
> > >>> permissions issue he told us about.
> > >> Oops, it is not. I will take a look.
> > >
> > > Yes David, exactly like these:
> > >
> > > data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> > >
> > > You can 100% verify my theory if you can correlate the time of the
> > > disconnects to the time that the missing files were healed. Can you
> > > have a look at /var/log/glusterfs/glustershd.log? That has all of the
> > > healed files + timestamps. If we can see a disconnect during the rsync
> > > and a self heal of the missing file, I think we can safely assume that
> > > the disconnects may have caused this. I'll try this on my test systems
> > > - how much data did you rsync? What size-ish of files / an idea of the
> > > dir layout?
> > >
> > > @Pranith - Could bricks flapping up and down during the rsync be a
> > > possible cause here: the files are missing on the first ls (written to
> > > one subvol but not the other because it was down), the ls triggers SH,
> > > and that's why the files are there for the second ls?
> > >
> > > -b
> > >
> > >
> > >> Pranith
> > >>>
> > >>> Pranith
> > >>>> On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> > >>>> Is the failure repeatable? With the same directories?
> > >>>>
> > >>>> It's very weird that the directories appear on the volume when you
> > >>>> do an 'ls' on the bricks. Could it be that you only made a single
> > >>>> 'ls' on the fuse mount, which did not show the directory? Is it
> > >>>> possible that this 'ls' triggered a self-heal that repaired the
> > >>>> problem, whatever it was, and when you did another 'ls' on the fuse
> > >>>> mount after the 'ls' on the bricks, the directories were there?
> > >>>>
> > >>>> The first 'ls' could have healed the files, so that the following
> > >>>> 'ls' on the bricks showed the files as if nothing were damaged. If
> > >>>> that's the case, it's possible that there were some disconnections
> > >>>> during the copy.
> > >>>>
> > >>>> Added Pranith because he knows the replication and self-heal
> > >>>> details better.
> > >>>>
> > >>>> Xavi
> > >>>>
> > >>>>> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> > >>>>> Distributed/replicated
> > >>>>>
> > >>>>> Volume Name: homegfs
> > >>>>> Type: Distributed-Replicate
> > >>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> > >>>>> Status: Started
> > >>>>> Number of Bricks: 4 x 2 = 8
> > >>>>> Transport-type: tcp
> > >>>>> Bricks:
> > >>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> > >>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> > >>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> > >>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> > >>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> > >>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> > >>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> > >>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> > >>>>> Options Reconfigured:
> > >>>>> performance.io-thread-count: 32
> > >>>>> performance.cache-size: 128MB
> > >>>>> performance.write-behind-window-size: 128MB
> > >>>>> server.allow-insecure: on
> > >>>>> network.ping-timeout: 10
> > >>>>> storage.owner-gid: 100
> > >>>>> geo-replication.indexing: off
> > >>>>> geo-replication.ignore-pid-check: on
> > >>>>> changelog.changelog: on
> > >>>>> changelog.fsync-interval: 3
> > >>>>> changelog.rollover-time: 15
> > >>>>> server.manage-gids: on
> > >>>>>
> > >>>>>
> > >>>>> ------ Original Message ------
> > >>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
> > >>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>;
> > >>>>> "Benjamin Turner" <bennyturns at gmail.com>
> > >>>>> Cc: "gluster-users at gluster.org" <gluster-users at gluster.org>;
> > >>>>> "Gluster Devel" <gluster-devel at gluster.org>
> > >>>>> Sent: 2/4/2015 6:03:45 AM
> > >>>>> Subject: Re: [Gluster-devel] missing files
> > >>>>>
> > >>>>>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
> > >>>>>>> Sorry. Thought about this a little more. I should have been
> > >>>>>>> clearer. The files were on both bricks of the replica, not just
> > >>>>>>> one side. So, both bricks had to have been up... The
> > >>>>>>> files/directories just don't show up on the mount.
> > >>>>>>> I was reading and saw a related bug
> > >>>>>>> (https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it
> > >>>>>>> suggested to run:
> > >>>>>>> find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
> > >>>>>>
> > >>>>>> This command is specific to a dispersed volume. It won't do
> > >>>>>> anything (aside from the error you are seeing) on a replicated
> > >>>>>> volume.
> > >>>>>>
> > >>>>>> I think you are using a replicated volume, right?
> > >>>>>>
> > >>>>>> In this case I'm not sure what can be happening. Is your volume a
> > >>>>>> pure replicated one or a distributed-replicated? On a pure
> > >>>>>> replicated one it doesn't make sense that some entries do not show
> > >>>>>> in an 'ls' when the file is in both replicas (at least without any
> > >>>>>> error message in the logs). On a distributed-replicated one it
> > >>>>>> could be caused by some problem while combining the contents of
> > >>>>>> each replica set.
> > >>>>>>
> > >>>>>> What's the configuration of your volume?
> > >>>>>>
> > >>>>>> Xavi
> > >>>>>>
> > >>>>>>>
> > >>>>>>> I get a bunch of errors for operation not supported:
> > >>>>>>> [root at gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
> > >>>>>>> find: warning: the -d option is deprecated; please use -depth
> > >>>>>>> instead, because the latter is a POSIX-compliant feature.
> > >>>>>>> wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup: trusted.ec.heal: Operation not supported
> > >>>>>>> ------ Original Message ------
> > >>>>>>> From: "Benjamin Turner" <bennyturns at gmail.com>
> > >>>>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>
> > >>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org>;
> > >>>>>>> "gluster-users at gluster.org" <gluster-users at gluster.org>
> > >>>>>>> Sent: 2/3/2015 7:12:34 PM
> > >>>>>>> Subject: Re: [Gluster-devel] missing files
> > >>>>>>>> It sounds to me like the files were only copied to one replica,
> > >>>>>>>> weren't there for the initial ls which triggered a self heal,
> > >>>>>>>> and were there for the last ls because they were healed. Is
> > >>>>>>>> there any chance that one of the replicas was down during the
> > >>>>>>>> rsync? It could be that you lost a brick during the copy or
> > >>>>>>>> something like that. To confirm I would look for disconnects in
> > >>>>>>>> the brick logs as well as checking glustershd.log to verify the
> > >>>>>>>> missing files were actually healed.
> > >>>>>>>>
> > >>>>>>>> -b
> > >>>>>>>>
> > >>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
> > >>>>>>>> <david.robinson at corvidtec.com> wrote:
> > >>>>>>>>
> > >>>>>>>> I rsync'd 20 TB over to my gluster system and noticed that I
> > >>>>>>>> had some directories missing even though the rsync completed
> > >>>>>>>> normally. The rsync logs showed that the missing files were
> > >>>>>>>> transferred. I went to the bricks and did an
> > >>>>>>>> 'ls -al /data/brick*/homegfs/dir/*' and the files were on the
> > >>>>>>>> bricks. After I did this 'ls', the files then showed up on the
> > >>>>>>>> FUSE mounts.
> > >>>>>>>> 1) Why are the files hidden on the FUSE mount?
> > >>>>>>>> 2) Why does the ls make them show up on the FUSE mount?
> > >>>>>>>> 3) How can I prevent this from happening again?
> > >>>>>>>> Note, I also mounted the gluster volume using NFS and saw the
> > >>>>>>>> same behavior. The files/directories were not shown until I did
> > >>>>>>>> the "ls" on the bricks.
> > >>>>>>>> David
> > >>>>>>>> ==============================
> > >>>>>>>> David F. Robinson, Ph.D.
> > >>>>>>>> President - Corvid Technologies
> > >>>>>>>> 704.799.6944 x101 [office]
> > >>>>>>>> 704.252.1310 [cell]
> > >>>>>>>> 704.799.7974 [fax]
> > >>>>>>>> David.Robinson at corvidtec.com
> > >>>>>>>> http://www.corvidtechnologies.com
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> _______________________________________________
> > >>>>>>>> Gluster-devel mailing list
> > >>>>>>>> Gluster-devel at gluster.org
> > >>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> _______________________________________________
> > >>>>>>> Gluster-devel mailing list
> > >>>>>>> Gluster-devel at gluster.org
> > >>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>>
> > >>> _______________________________________________
> > >>> Gluster-users mailing list
> > >>> Gluster-users at gluster.org
> > >>> http://www.gluster.org/mailman/listinfo/gluster-users
> > >>
> > >> _______________________________________________
> > >> Gluster-users mailing list
> > >> Gluster-users at gluster.org
> > >> http://www.gluster.org/mailman/listinfo/gluster-users
> > >>
> >
>