On 02/20/2015 12:21 PM, Olav Peeters wrote:
> Let's take one file (3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd) as an
> example...
> On the 3 nodes, where all bricks are formatted as XFS and mounted in
> /export, and 272b2366-dfbf-ad47-2a0f-5d5cc40863e3 is the mount point
> of an NFS shared-storage connection from the XenServer machines:
Did I just read this correctly? Your bricks are NFS mounts? i.e.,
GlusterFS Client <-> GlusterFS Server <-> NFS <-> XFS?
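Before anything else, it would be worth confirming what actually sits
underneath those brick paths. A quick check on one node (brick path
taken from your listing below, adjust as needed):

    df -hT /export/brick13gfs01
    stat -f -c %T /export/brick13gfs01

If that reports xfs for the brick and nfs only for the XenServer-facing
mount, then I've misread you and the layering is fine.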
> [root@gluster01 ~]# find /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55 /export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Supposedly, this is the actual file.
> -rw-r--r--. 2 root root 0 Feb 18 00:51 /export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
This is not a linkfile. Note it's mode 0644. How it got there with those
permissions would be a matter of history and would require information
that's probably lost.
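For reference, a genuine DHT linkfile is a zero-byte file with mode
1000 (it shows up as ---------T in ls) and carries a
trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the
real data. A quick way to test any suspect file (brick directory and
filename here are placeholders):

    ls -la /export/brickNN/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/somefile.vhd
    getfattr --absolute-names -n trusted.glusterfs.dht.linkto \
        /export/brickNN/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/somefile.vhd

If getfattr answers "No such attribute", it is not a linkfile.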
> [root@gluster02 ~]# find /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55 /export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> [root@gluster03 ~]# find /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*' -exec ls -la {} \;
> -rw-r--r--. 2 root root 44332659200 Feb 17 23:55 /export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 2 root root 0 Feb 18 00:51 /export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Same analysis as above.
> 3 files with data, plus 2 zero-byte files with the same name
>
> Checking the zero-byte files:
> [root@gluster01 ~]# getfattr -m . -d -e hex /export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file: export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> [root@gluster03 ~]# getfattr -m . -d -e hex /export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file: export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> This is not a glusterfs link file since there is no
> "trusted.glusterfs.dht.linkto", am I correct?
You are correct.
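If you want an inventory of which zero-byte files on a brick really
are linkfiles, something along these lines should do it (read-only,
run per brick; consider it a sketch):

    find /export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3 \
        -type f -size 0 ! -path '*/.glusterfs/*' \
        -exec getfattr --absolute-names -n trusted.glusterfs.dht.linkto {} \; 2>/dev/null

Anything it prints is a real linkfile; zero-byte files without the
xattr (like the two above) produce no output.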
> And checking the "good" files:
>
> # file: export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-32=0x000000000000000000000000
> trusted.afr.sr_vol01-client-33=0x000000000000000000000000
> trusted.afr.sr_vol01-client-34=0x000000000000000000000000
> trusted.afr.sr_vol01-client-35=0x000000010000000100000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
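One thing worth noting in the xattrs of this copy: the changelog for
sr_vol01-client-35 is non-zero (0x000000010000000100000000). Assuming
the usual AFR layout of three 32-bit counters (data / metadata / entry
pending operations), that decodes to one pending data and one pending
metadata operation against that client, i.e. this copy still considers
the other one in need of healing. A quick way to decode such a value:

    x=000000010000000100000000   # the value without the 0x prefix
    echo "data=$((16#${x:0:8})) metadata=$((16#${x:8:8})) entry=$((16#${x:16:8}))"
    # prints: data=1 metadata=1 entry=0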
> [root@gluster02 ~]# getfattr -m . -d -e hex /export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file: export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-32=0x000000000000000000000000
> trusted.afr.sr_vol01-client-33=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
> [root@gluster03 ~]# getfattr -m . -d -e hex /export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> getfattr: Removing leading '/' from absolute path names
> # file: export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> trusted.afr.dirty=0x000000000000000000000000
> trusted.afr.sr_vol01-client-40=0x000000000000000000000000
> trusted.afr.sr_vol01-client-41=0x000000000000000000000000
> trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
>
>
>
> Seen from a client via a glusterfs mount:
> [root@client ~]# ls -al /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
>
>
> Via NFS (just after unmounting and mounting the volume again):
> [root@client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 44332659200 Feb 17 23:55 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> Doing the same list a couple of seconds later:
> [root@client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> And again, and again, and again:
> [root@client ~]# ls -al /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
> -rw-r--r--. 1 root root 0 Feb 18 00:51 /mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
>
> This really seems odd. Why do we see the real data file only once?
>
> It increasingly looks like this crazy file duplication (and writing
> of sticky-bit files) was actually triggered by rebooting one of the
> three nodes while there was still an active NFS connection (even with
> no data exchange at all), since all the zero-byte files (of the
> non-sticky-bit type) were created at either 00:51 or 00:41, the exact
> moments when one of the three nodes in the cluster was rebooted. This
> would mean that replication with GlusterFS currently provides hardly
> any redundancy. Quite the opposite: if one of the machines goes down,
> all of your data gets seriously disorganised. I am busy configuring a
> test installation to see how this can best be reproduced for a bug
> report...
>
> Does anyone have a suggestion on how best to get rid of the duplicates,
> or rather to get this mess organised the way it should be?
> This is a cluster with millions of files. A rebalance does not fix the
> issue, and neither does a rebalance fix-layout. Since this is a
> replicated volume, every file should be there 2x, not 3x. Can I safely
> just remove all the zero-byte files outside of the .glusterfs directory,
> including the sticky-bit files?
>
> The empty zero-byte files outside of .glusterfs on every brick I can
> probably remove safely like this:
> find /export/* -path '*/.glusterfs' -prune -o -type f -size 0 -perm 1000 -exec rm {} \;
> no?
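Note that your find only matches files whose mode is exactly 1000, so
it will only ever touch the sticky-bit linkfiles, not 0644 zero-byte
copies like the two shown above. Before deleting anything I would do a
read-only pass and confirm each candidate really carries the linkto
xattr; a rough sketch:

    find /export/* -path '*/.glusterfs' -prune -o -type f -size 0 -perm 1000 -print |
    while read -r f; do
        # report only genuine DHT linkfiles
        getfattr --absolute-names -n trusted.glusterfs.dht.linkto "$f" >/dev/null 2>&1 \
            && echo "linkfile: $f"
    done

Linkfiles that are removed get recreated on the next lookup, so losing
one is not a problem.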
>
> Thanks!
>
> Cheers,
> Olav
> On 18/02/15 22:10, Olav Peeters wrote:
>> Thanks Tom and Joe,
>> for the fast response!
>>
>> Before I started my upgrade I stopped all clients using the volume
>> and stopped all VMs with VHDs on the volume, but I guess (and this
>> may be the missing piece for reproducing this in a lab) I did not
>> detach the NFS shared-storage mount from the XenServer pool to this
>> volume, since that is an extremely risky business. I also did not
>> stop the volume. That was admittedly a bit stupid, but since I had
>> done upgrades this way in the past without any issues I skipped this
>> step (a really bad habit). I'll make amends and file a proper bug
>> report :-). I agree with you Joe, this should never happen, even when
>> someone ignores the advice of stopping the volume. If it were also
>> necessary to detach shared-storage NFS connections to a volume, then
>> frankly glusterfs would be unusable in a private cloud. No one can
>> afford downtime of the whole infrastructure just for a glusterfs
>> upgrade. Ideally a replicated gluster volume should even be able to
>> remain online and in use during (at least a minor-version) upgrade.
>>
>> I don't know whether a heal was perhaps busy when I started the
>> upgrade; I forgot to check. I did check the CPU activity on the
>> gluster nodes, which was very low (in the 0.0X range via top), so I
>> doubt it. I will add this to the bug report as a suggestion in case
>> they cannot reproduce it with an open NFS connection.
>>
>> By the way, is it sufficient to do:
>> service glusterd stop
>> service glusterfsd stop
>> and do a:
>> ps aux | grep gluster
>> to see if everything has stopped and kill any leftovers should this
>> be necessary?
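A note on that: "service glusterd stop" only stops the management
daemon, and "service glusterfsd stop" only the brick processes; the
gluster NFS server and the self-heal daemon run as separate glusterfs
processes and can stay up. So yes, check before assuming everything is
down (rough sketch; kill only when you are sure nothing should be
running on that node):

    service glusterd stop
    service glusterfsd stop
    ps aux | grep '[g]luster'    # anything listed here is a leftover gluster process
    # pkill glusterfs ; pkill glusterfsd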
>>
>> For the fix, do you agree that if I run e.g.:
>> find /export/* -type f -size 0 -perm 1000 -exec /bin/rm {} \;
>> on every node (where /export is the location of all my bricks), this
>> will be safe, also in a replicated set-up?
>> No needed zero-byte files will be deleted, e.g. in the .glusterfs
>> directory of every brick?
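The entries under .glusterfs for regular files are just hardlinks to
the data, and linkfiles get recreated on lookup, but I would still
keep the .glusterfs prune from your other command rather than sweep
the whole brick, and do a dry run first. Roughly:

    # dry run: list what would be removed
    find /export/* -path '*/.glusterfs' -prune -o -type f -size 0 -perm 1000 -print
    # once the list looks sane:
    find /export/* -path '*/.glusterfs' -prune -o -type f -size 0 -perm 1000 -exec /bin/rm -f {} \;

Whether the leftover gfid hardlinks of removed linkfiles under
.glusterfs need cleaning up as well is something I would verify on a
single file before doing it at scale.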
>>
>> Thanks for your support!
>>
>> Cheers,
>> Olav
>>
>>
>>
>>
>>
>> On 18/02/15 20:51, Joe Julian wrote:
>>>
>>> On 02/18/2015 11:43 AM, tbenzvi at 3vgeomatics.com wrote:
>>>> Hi Olav,
>>>>
>>>> I have a hunch that our problem was caused by improper unmounting
>>>> of the gluster volume, and have since found that the proper order
>>>> should be: kill all jobs using the volume -> unmount the volume on
>>>> clients -> gluster volume stop -> stop the gluster service (if
>>>> necessary).
>>>> In my case, I wrote a Python script to find duplicate files on the
>>>> mounted volume, then delete the corresponding link files on the
>>>> bricks (making sure to also delete the files in the .glusterfs
>>>> directory).
>>>> However, your find command was also suggested to me and I think
>>>> it's a simpler solution. I believe removing all link files (even
>>>> ones that are not causing duplicates) is fine, since on the next
>>>> file access gluster will do a lookup on all bricks and recreate any
>>>> link files if necessary. Hopefully a gluster expert can chime in on
>>>> this point as I'm not completely sure.
>>>
>>> You are correct.
>>>
>>>> Keep in mind your setup is somewhat different from mine, as I have
>>>> only 5 bricks with no replication.
>>>> Regards,
>>>> Tom
>>>>
>>>> --------- Original Message ---------
>>>> Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>> From: "Olav Peeters" <opeeters at gmail.com>
>>>> Date: 2/18/15 10:52 am
>>>> To: gluster-users at gluster.org, tbenzvi at 3vgeomatics.com
>>>>
>>>> Hi all,
>>>>     I'm having this problem after upgrading from 3.5.3 to 3.6.2.
>>>> At the moment I am still waiting for a heal to finish (on a
>>>> 31TB volume with 42 bricks, replicated over three nodes).
>>>>
>>>> Tom,
>>>> how did you remove the duplicates?
>>>> with 42 bricks I will not be able to do this manually..
>>>> Did a:
>>>>     find $brick_root -type f -size 0 -perm 1000 -exec /bin/rm {} \;
>>>> work for you?
>>>>
>>>>     Should this type of thing ideally not be checked and mended
>>>>     by a heal?
>>>>
>>>> Does anyone have an idea yet how this happens in the first
>>>> place? Can it be connected to upgrading?
>>>>
>>>> Cheers,
>>>> Olav
>>>>
>>>>
>>>>
>>>> On 01/01/15 03:07, tbenzvi at 3vgeomatics.com wrote:
>>>>
>>>>         No, the files can be read on a newly mounted client! I went
>>>> ahead and deleted all of the link files associated with
>>>> these duplicates, and then remounted the volume. The
>>>> problem is fixed!
>>>> Thanks again for the help, Joe and Vijay.
>>>> Tom
>>>>
>>>> --------- Original Message ---------
>>>>         Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>>         From: "Vijay Bellur" <vbellur at redhat.com>
>>>>         Date: 12/28/14 3:23 am
>>>>         To: tbenzvi at 3vgeomatics.com, gluster-users at gluster.org
>>>>
>>>>             On 12/28/2014 01:20 PM, tbenzvi at 3vgeomatics.com wrote:
>>>>             > Hi Vijay,
>>>>             > Yes the files are still readable from the .glusterfs path.
>>>>             > There is no explicit error. However, trying to read a
>>>>             > text file in python simply gives me null characters:
>>>>             >
>>>>             > >>> open('ott_mf_itab').readlines()
>>>> >
>>>>
['\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']
>>>> >
>>>> > And reading binary files does the same
>>>> >
>>>>
>>>>             Is this behavior seen with a freshly mounted client too?
>>>>
>>>> -Vijay
>>>>
>>>>             > --------- Original Message ---------
>>>>             > Subject: Re: [Gluster-users] Hundreds of duplicate files
>>>>             > From: "Vijay Bellur" <vbellur at redhat.com>
>>>>             > Date: 12/27/14 9:57 pm
>>>>             > To: tbenzvi at 3vgeomatics.com, gluster-users at gluster.org
>>>>             >
>>>>             > On 12/28/2014 10:13 AM, tbenzvi at 3vgeomatics.com wrote:
>>>>             > > Thanks Joe, I've read your blog post as well as your post
>>>>             > > regarding the .glusterfs directory.
>>>>             > > I found some unneeded duplicate files which were not being
>>>>             > > read properly. I then deleted the link file from the brick.
>>>>             > > This always removes the duplicate file from the listing, but
>>>>             > > the file does not always become readable. If I also delete
>>>>             > > the associated file in the .glusterfs directory on that
>>>>             > > brick, then some more files become readable. However this
>>>>             > > solution still doesn't work for all files.
>>>>             > > I know the file on the brick is not corrupt as it can be
>>>>             > > read directly from the brick directory.
>>>>             >
>>>>             > For files that are not readable from the client, can you
>>>>             > check if the file is readable from the .glusterfs/ path?
>>>>             >
>>>>             > What is the specific error that is seen while trying to read
>>>>             > one such file from the client?
>>>>             >
>>>>             > Thanks,
>>>>             > Vijay
>>>>             >
>>>>             > _______________________________________________
>>>>             > Gluster-users mailing list
>>>>             > Gluster-users at gluster.org
>>>>             > http://www.gluster.org/mailman/listinfo/gluster-users
>>>>             >
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>>>
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>