----- Original Message -----
> From: "Ben Turner" <bturner at redhat.com>
> To: "David F. Robinson" <david.robinson at corvidtec.com>
> Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Xavier Hernandez" <xhernandez at datalab.es>, "Benjamin Turner" <bennyturns at gmail.com>, gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Thursday, February 5, 2015 5:22:26 PM
> Subject: Re: [Gluster-users] [Gluster-devel] missing files
>
> ----- Original Message -----
> > From: "David F. Robinson" <david.robinson at corvidtec.com>
> > To: "Ben Turner" <bturner at redhat.com>
> > Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Xavier Hernandez" <xhernandez at datalab.es>, "Benjamin Turner" <bennyturns at gmail.com>, gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> > Sent: Thursday, February 5, 2015 5:01:13 PM
> > Subject: Re: [Gluster-users] [Gluster-devel] missing files
> >
> > I'll send you the emails I sent Pranith with the logs. What causes these disconnects?
>
> Thanks David! Disconnects happen when there are interruptions in communication between peers; normally it is a ping timeout that does it. It could be anything from a flaky NW to the system being too busy to respond to the pings. My initial take is more towards the latter, as rsync is absolutely the worst use case for gluster - IIRC it writes in 4KB blocks. I try to keep my writes at least 64KB, as in my testing that is the smallest block size I can write with before perf starts to really drop off. I'll try something similar in the lab.

Ok, I do think that the files being self-healed is the RCA for what you were seeing. Let's look at one of the disconnects:

data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1

And in the glustershd.log from the gfs01b_glustershd.log file:

[2015-02-03 20:55:48.001797] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448
[2015-02-03 20:55:49.341996] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 6c79a368-edaa-432b-bef9-ec690ab42448. source=1 sinks=0
[2015-02-03 20:55:49.343093] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69
[2015-02-03 20:55:50.463652] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 792cb0d6-9290-4447-8cd7-2b2d7a116a69. source=1 sinks=0
[2015-02-03 20:55:51.465289] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:51.466515] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:51.467098] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c
[2015-02-03 20:55:55.257808] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on 403e661a-1c27-4e79-9867-c0572aba2b3c. source=1 sinks=0
[2015-02-03 20:55:55.258548] I [afr-self-heal-metadata.c:54:__afr_selfheal_metadata_do] 0-homegfs-replicate-0: performing metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541
[2015-02-03 20:55:55.259367] I [afr-self-heal-common.c:476:afr_log_selfheal] 0-homegfs-replicate-0: Completed metadata selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541. source=1 sinks=0
[2015-02-03 20:55:55.259980] I [afr-self-heal-entry.c:554:afr_selfheal_entry_do] 0-homegfs-replicate-0: performing entry selfheal on c612ee2f-2fb4-4157-a9ab-5a2d5603c541

As you can see, the self-heal logs are just spammed with files being healed, and for the couple of disconnects I looked at I see self heals getting run shortly after on the bricks that were down. Now we need to find the cause of the disconnects; I am thinking that once the disconnects are resolved the files should be properly copied over without SH having to fix things. Like I said, I'll give this a go on my lab systems and see if I can repro the disconnects; I'll have time to run through it tomorrow. If in the meantime anyone else has a theory / anything to add here, it would be appreciated.

-b

> -b
>
> > David (Sent from mobile)
> >
> > ==============================
> > David F. Robinson, Ph.D.
> > President - Corvid Technologies
> > 704.799.6944 x101 [office]
> > 704.252.1310 [cell]
> > 704.799.7974 [fax]
> > David.Robinson at corvidtec.com
> > http://www.corvidtechnologies.com
> >
> > > On Feb 5, 2015, at 4:55 PM, Ben Turner <bturner at redhat.com> wrote:
> > >
> > > ----- Original Message -----
> > >> From: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> > >> To: "Xavier Hernandez" <xhernandez at datalab.es>, "David F. Robinson" <david.robinson at corvidtec.com>, "Benjamin Turner" <bennyturns at gmail.com>
> > >> Cc: gluster-users at gluster.org, "Gluster Devel" <gluster-devel at gluster.org>
> > >> Sent: Thursday, February 5, 2015 5:30:04 AM
> > >> Subject: Re: [Gluster-users] [Gluster-devel] missing files
> > >>
> > >>
> > >>> On 02/05/2015 03:48 PM, Pranith Kumar Karampuri wrote:
> > >>> I believe David already fixed this. I hope this is the same permissions issue he told us about.
> > >> Oops, it is not. I will take a look.
> > >
> > > Yes David, exactly like these:
> > >
> > > data-brick02a-homegfs.log:[2015-02-03 19:09:34.568842] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02a.corvidtec.com-18563-2015/02/03-19:07:58:519134-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:09:41.286551] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-12804-2015/02/03-19:09:38:497808-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:16:35.906412] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs02b.corvidtec.com-27190-2015/02/03-19:15:53:458467-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 19:51:22.761293] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01a.corvidtec.com-25926-2015/02/03-19:51:02:89070-homegfs-client-2-0-0
> > > data-brick02a-homegfs.log:[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] 0-homegfs-server: disconnecting connection from gfs01b.corvidtec.com-4175-2015/02/02-16:44:31:179119-homegfs-client-2-0-1
> > >
> > > You can 100% verify my theory if you can correlate the time on the disconnects to the time that the missing files were healed. Can you have a look at /var/log/glusterfs/glustershd.log? That has all of the healed files + timestamps; if we can see a disconnect during the rsync and a self heal of the missing file, I think we can safely assume that the disconnects may have caused this. I'll try this on my test systems - how much data did you rsync? Roughly what size of files / an idea of the dir layout?
> > >
> > > @Pranith - Could this be a possible cause here: bricks flapping up and down during the rsync caused the files to be missing on the first ls (written to 1 subvol but not the other because it was down), the ls triggered SH, and that's why the files were there for the second ls?
> > >
> > > -b
> > >
> > >> Pranith
> > >>>
> > >>> Pranith
> > >>>> On 02/05/2015 03:44 PM, Xavier Hernandez wrote:
> > >>>> Is the failure repeatable? With the same directories?
> > >>>>
> > >>>> It's very weird that the directories appear on the volume when you do an 'ls' on the bricks. Could it be that you only did a single 'ls' on the fuse mount, which did not show the directory? Is it possible that this 'ls' triggered a self-heal that repaired the problem, whatever it was, and when you did another 'ls' on the fuse mount after the 'ls' on the bricks, the directories were there?
> > >>>>
> > >>>> The first 'ls' could have healed the files, causing the following 'ls' on the bricks to show the files as if nothing were damaged. If that's the case, it's possible that there were some disconnections during the copy.
> > >>>>
> > >>>> Added Pranith because he knows the replication and self-heal details better.
> > >>>>
> > >>>> Xavi
> > >>>>
> > >>>>> On 02/04/2015 07:23 PM, David F. Robinson wrote:
> > >>>>> Distributed/replicated
> > >>>>>
> > >>>>> Volume Name: homegfs
> > >>>>> Type: Distributed-Replicate
> > >>>>> Volume ID: 1e32672a-f1b7-4b58-ba94-58c085e59071
> > >>>>> Status: Started
> > >>>>> Number of Bricks: 4 x 2 = 8
> > >>>>> Transport-type: tcp
> > >>>>> Bricks:
> > >>>>> Brick1: gfsib01a.corvidtec.com:/data/brick01a/homegfs
> > >>>>> Brick2: gfsib01b.corvidtec.com:/data/brick01b/homegfs
> > >>>>> Brick3: gfsib01a.corvidtec.com:/data/brick02a/homegfs
> > >>>>> Brick4: gfsib01b.corvidtec.com:/data/brick02b/homegfs
> > >>>>> Brick5: gfsib02a.corvidtec.com:/data/brick01a/homegfs
> > >>>>> Brick6: gfsib02b.corvidtec.com:/data/brick01b/homegfs
> > >>>>> Brick7: gfsib02a.corvidtec.com:/data/brick02a/homegfs
> > >>>>> Brick8: gfsib02b.corvidtec.com:/data/brick02b/homegfs
> > >>>>> Options Reconfigured:
> > >>>>> performance.io-thread-count: 32
> > >>>>> performance.cache-size: 128MB
> > >>>>> performance.write-behind-window-size: 128MB
> > >>>>> server.allow-insecure: on
> > >>>>> network.ping-timeout: 10
> > >>>>> storage.owner-gid: 100
> > >>>>> geo-replication.indexing: off
> > >>>>> geo-replication.ignore-pid-check: on
> > >>>>> changelog.changelog: on
> > >>>>> changelog.fsync-interval: 3
> > >>>>> changelog.rollover-time: 15
> > >>>>> server.manage-gids: on
> > >>>>>
> > >>>>>
> > >>>>> ------ Original Message ------
> > >>>>> From: "Xavier Hernandez" <xhernandez at datalab.es>
> > >>>>> To: "David F. Robinson" <david.robinson at corvidtec.com>; "Benjamin Turner" <bennyturns at gmail.com>
> > >>>>> Cc: "gluster-users at gluster.org" <gluster-users at gluster.org>; "Gluster Devel" <gluster-devel at gluster.org>
> > >>>>> Sent: 2/4/2015 6:03:45 AM
> > >>>>> Subject: Re: [Gluster-devel] missing files
> > >>>>>
> > >>>>>>> On 02/04/2015 01:30 AM, David F. Robinson wrote:
> > >>>>>>> Sorry. Thought about this a little more. I should have been clearer. The files were on both bricks of the replica, not just one side. So, both bricks had to have been up... The files/directories just don't show up on the mount.
> > >>>>>>> I was reading and saw a related bug (https://bugzilla.redhat.com/show_bug.cgi?id=1159484). I saw it suggested to run:
> > >>>>>>> find <mount> -d -exec getfattr -h -n trusted.ec.heal {} \;
> > >>>>>>
> > >>>>>> This command is specific to a dispersed volume. It won't do anything (aside from the error you are seeing) on a replicated volume.
> > >>>>>>
> > >>>>>> I think you are using a replicated volume, right?
> > >>>>>>
> > >>>>>> In this case I'm not sure what can be happening. Is your volume a pure replicated one or a distributed-replicated? On a pure replicated one it doesn't make sense that some entries do not show in an 'ls' when the file is in both replicas (at least without any error message in the logs). On a distributed-replicated it could be caused by some problem while combining the contents of each replica set.
> > >>>>>>
> > >>>>>> What's the configuration of your volume?
> > >>>>>>
> > >>>>>> Xavi
> > >>>>>>
> > >>>>>>>
> > >>>>>>> I get a bunch of errors for operation not supported:
> > >>>>>>> [root at gfs02a homegfs]# find wks_backup -d -exec getfattr -h -n trusted.ec.heal {} \;
> > >>>>>>> find: warning: the -d option is deprecated; please use -depth instead, because the latter is a POSIX-compliant feature.
> > >>>>>>> wks_backup/homer_backup/backup: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_20.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_21.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_18.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_19.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs/2014_05_22.log: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup/logs: trusted.ec.heal: Operation not supported
> > >>>>>>> wks_backup/homer_backup: trusted.ec.heal: Operation not supported
> > >>>>>>> ------ Original Message ------
> > >>>>>>> From: "Benjamin Turner" <bennyturns at gmail.com <mailto:bennyturns at gmail.com>>
> > >>>>>>> To: "David F. Robinson" <david.robinson at corvidtec.com <mailto:david.robinson at corvidtec.com>>
> > >>>>>>> Cc: "Gluster Devel" <gluster-devel at gluster.org <mailto:gluster-devel at gluster.org>>; "gluster-users at gluster.org" <gluster-users at gluster.org <mailto:gluster-users at gluster.org>>
> > >>>>>>> Sent: 2/3/2015 7:12:34 PM
> > >>>>>>> Subject: Re: [Gluster-devel] missing files
> > >>>>>>>> It sounds to me like the files were only copied to one replica, weren't there for the initial ls which triggered a self heal, and were there for the last ls because they were healed. Is there any chance that one of the replicas was down during the rsync? It could be that you lost a brick during the copy or something like that. To confirm, I would look for disconnects in the brick logs as well as checking glustershd.log to verify the missing files were actually healed.
> > >>>>>>>>
> > >>>>>>>> -b
> > >>>>>>>>
> > >>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson <david.robinson at corvidtec.com <mailto:david.robinson at corvidtec.com>> wrote:
> > >>>>>>>>
> > >>>>>>>> I rsync'd 20-TB over to my gluster system and noticed that I had some directories missing even though the rsync completed normally. The rsync logs showed that the missing files were transferred.
> > >>>>>>>> I went to the bricks and did an 'ls -al /data/brick*/homegfs/dir/*' and the files were on the bricks. After I did this 'ls', the files then showed up on the FUSE mounts.
> > >>>>>>>> 1) Why are the files hidden on the fuse mount?
> > >>>>>>>> 2) Why does the ls make them show up on the FUSE mount?
> > >>>>>>>> 3) How can I prevent this from happening again?
> > >>>>>>>> Note, I also mounted the gluster volume using NFS and saw the same behavior. The files/directories were not shown until I did the "ls" on the bricks.
> > >>>>>>>> David
> > >>>>>>>> ==============================
> > >>>>>>>> David F. Robinson, Ph.D.
> > >>>>>>>> President - Corvid Technologies
> > >>>>>>>> 704.799.6944 x101 <tel:704.799.6944%20x101> [office]
> > >>>>>>>> 704.252.1310 <tel:704.252.1310> [cell]
> > >>>>>>>> 704.799.7974 <tel:704.799.7974> [fax]
> > >>>>>>>> David.Robinson at corvidtec.com <mailto:David.Robinson at corvidtec.com>
> > >>>>>>>> http://www.corvidtechnologies.com <http://www.corvidtechnologies.com/>
> > >>>>>>>>
> > >>>>>>>> _______________________________________________
> > >>>>>>>> Gluster-devel mailing list
> > >>>>>>>> Gluster-devel at gluster.org <mailto:Gluster-devel at gluster.org>
> > >>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> _______________________________________________
> > >>>>>>> Gluster-devel mailing list
> > >>>>>>> Gluster-devel at gluster.org
> > >>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
> > >>>
> > >>> _______________________________________________
> > >>> Gluster-users mailing list
> > >>> Gluster-users at gluster.org
> > >>> http://www.gluster.org/mailman/listinfo/gluster-users
> > >>
> > >> _______________________________________________
> > >> Gluster-users mailing list
> > >> Gluster-users at gluster.org
> > >> http://www.gluster.org/mailman/listinfo/gluster-users
> > >>
> > >
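For anyone wanting to do the timestamp correlation Ben suggests above, a minimal shell sketch along these lines should be close. The volume and log file names are taken from this thread, but the exact paths under /var/log/glusterfs and the awk field positions are assumptions based on the log lines quoted above, so adjust them for your nodes:

#!/bin/bash
# Rough correlation helper: print brick-side disconnect times next to
# self-heal completion times so they can be compared by eye.
# Paths are assumptions based on the file names mentioned in this thread.

BRICK_LOG=${1:-/var/log/glusterfs/bricks/data-brick02a-homegfs.log}
SHD_LOG=${2:-/var/log/glusterfs/glustershd.log}

echo "=== disconnects in $BRICK_LOG ==="
# e.g. "[2015-02-03 20:54:02.772180] I [server.c:518:server_rpc_notify] ... disconnecting connection from <client-id>"
grep 'disconnecting connection' "$BRICK_LOG" | awk '{print $1, $2, $NF}'

echo "=== self-heal completions in $SHD_LOG ==="
# e.g. "[... 20:55:49.341996] I [...afr_log_selfheal] 0-homegfs-replicate-0: Completed entry selfheal on <gfid>. source=1 sinks=0"
grep 'Completed' "$SHD_LOG" | grep 'selfheal on' | awk '{print $1, $2, $7, $10}'

Lining the two outputs up by timestamp is enough to eyeball whether each batch of heals follows a disconnect.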
Should I run my rsync with --block-size set to something other than the default? Is there an optimal value? I think 128k is the max, from my quick search. Didn't dig into it thoroughly though.

David (Sent from mobile)

==============================
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
David.Robinson at corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner <bturner at redhat.com> wrote:
> [...]
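On the --block-size question above: as far as I can tell, rsync's -B/--block-size only tunes the checksum block size used by the delta-transfer algorithm, not the size of the writes it issues to the destination, so it is probably not the knob that matters here (a hedged note, not verified against this rsync version). To measure the write-size sensitivity Ben describes, a quick dd sweep over a few block sizes against the FUSE mount is one way to check; the mount point below is hypothetical:

#!/bin/bash
# Write ~256 MiB at several block sizes to a gluster FUSE mount and report
# the throughput line that dd prints, to see where small writes fall off.
# /mnt/homegfs is a hypothetical mount point - adjust for your setup.

MOUNT=/mnt/homegfs
declare -A COUNTS=( [4K]=65536 [64K]=4096 [128K]=2048 [1M]=256 )

for bs in 4K 64K 128K 1M; do
    echo "=== bs=$bs ==="
    dd if=/dev/zero of="$MOUNT/ddtest_$bs" bs=$bs count=${COUNTS[$bs]} conv=fsync 2>&1 | tail -n1
    rm -f "$MOUNT/ddtest_$bs"
done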
Isn't rsync what geo-rep uses?

David (Sent from mobile)

==============================
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
David.Robinson at corvidtec.com
http://www.corvidtechnologies.com

> On Feb 5, 2015, at 5:41 PM, Ben Turner <bturner at redhat.com> wrote:
> [...]
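As a general cross-check after a large copy like the 20-TB rsync discussed in this thread, the heal-info query shows whether any entries are still pending heal. The volume name comes from the `gluster volume info` output quoted above, and the exact output format varies between gluster releases, so treat this as a sketch rather than version-specific guidance:

# List entries still pending heal on each brick of the homegfs volume.
gluster volume heal homegfs info

# Show entries flagged as split-brain; available on the 3.x releases of this
# era, though subcommand support varies by version (treat as an assumption).
gluster volume heal homegfs info split-brain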