Dear Ravi,
In the last week I have completed a fix-layout and a full index heal on
this volume. Now I've started a rebalance, and since yesterday I've seen a
few terabytes of data moving around between bricks, which I'm sure is a
good sign.
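
I can follow the rebalance's progress with the status command (the volume
name here is just an example):

  # gluster volume rebalance apps status
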
While I wait for the rebalance to finish, I'm wondering if you know what
would cause directories to be missing from the FUSE mount point? If I list
a directory explicitly by its path I can see its contents, but it does not
appear in its parent directory's listing. In the case of duplicated files
the cause is always that the files are not on the correct bricks (according
to the Dynamo/Elastic Hash algorithm), and I can fix it by copying the file
to the correct brick(s) and removing it from the others (along with their
.glusterfs hard links). So what could cause directories to be missing?
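
For reference, this is roughly the per-file cleanup I've been doing for the
duplicates (the brick and mount paths are examples; the gfid is the one from
the licenseserver.cfg case below):

  # after copying the file to the brick(s) the DHT layout expects, remove
  # it from each brick that should NOT have it, along with its .glusterfs
  # hard link; gfid 0x878003a2fb5243b6a0d14d2f8b4306bd maps to
  # .glusterfs/87/80/878003a2-fb52-43b6-a0d1-4d2f8b4306bd
  rm /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
  rm /data/glusterfs/sdb/apps/.glusterfs/87/80/878003a2-fb52-43b6-a0d1-4d2f8b4306bd
  # finally, stat the file from the FUSE mount so it gets looked up again
  stat /mnt/apps/clcgenomics/clclicsrv/licenseserver.cfg
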
Thank you,
On Wed, Jun 5, 2019 at 1:08 AM Alan Orth <alan.orth at gmail.com> wrote:
> Hi Ravi,
>
> You're right that I had mentioned using rsync to copy the brick content to
> a new host, but in the end I actually decided not to bring it up on a new
> brick. Instead I added the original brick back into the volume. So the
> xattrs and symlinks to .glusterfs on the original brick are fine. I think
> the problem probably lies with a remove-brick that got interrupted. A few
> weeks ago, during the maintenance, I had tried to remove a brick, and
> after twenty minutes with no obvious progress I stopped it; after that the
> bricks were still part of the volume.
>
> In the last few days I have run a fix-layout that took 26 hours and
> finished successfully. Then I started a full index heal and it has healed
> about 3.3 million files in a few days and I see a clear increase of network
> traffic from old brick host to new brick host over that time. Once the full
> index heal completes I will try to do a rebalance.
>
> Thank you,
>
>
> On Mon, Jun 3, 2019 at 7:40 PM Ravishankar N <ravishankar at redhat.com>
> wrote:
>
>>
>> On 01/06/19 9:37 PM, Alan Orth wrote:
>>
>> Dear Ravi,
>>
>> The .glusterfs hardlinks/symlinks should be fine. I'm not sure how I
>> could verify them for six bricks and millions of files, though... :\
>>
>> Hi Alan,
>>
>> The reason I asked this is because you had mentioned in one of your
>> earlier emails that when you moved content from the old brick to the new
>> one, you had skipped the .glusterfs directory. So I was assuming that
>> when you added this new brick back to the cluster, it might have been
>> missing the .glusterfs entries. If that is the case, one way to verify
>> could be to check, using a script, that all files on the brick have a
>> link count of at least 2 and all directories have valid symlinks inside
>> .glusterfs pointing to themselves.
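>>
>> A rough check along these lines might do (an untested sketch; adjust the
>> brick path for each brick):
>>
>>   BRICK=/data/glusterfs/sdb/apps   # example brick root
>>   # regular files should have a link count of at least 2
>>   # (the file itself plus its hard link under .glusterfs)
>>   find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type f -links 1 -print
>>   # symlinks under .glusterfs that do not resolve to a directory
>>   # (each directory's gfid symlink should point back at it)
>>   find "$BRICK/.glusterfs" -type l ! -xtype d -print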
>>
>>
>> I had a small success in fixing some issues with duplicated files on the
>> FUSE mount point yesterday. I read quite a bit about the elastic hashing
>> algorithm that determines which files get placed on which bricks based on
>> the hash of their filename and the trusted.glusterfs.dht xattr on brick
>> directories (thanks to Joe Julian's blog post and Python script for
>> showing how it works¹). With that knowledge I looked closer at one of the
>> files that was appearing as duplicated on the FUSE mount and found that
>> it was also duplicated on more than `replica 2` bricks. For this
>> particular file I found two "real" files and several zero-size files with
>> trusted.glusterfs.dht.linkto xattrs. Neither of the "real" files was on
>> the correct brick as far as the DHT layout is concerned, so I copied one
>> of them to the correct brick, deleted the others and their hard links,
>> did a `stat` on the file from the FUSE mount point, and it fixed itself.
>> Yay!
>>
>> Could this have been caused by a replace-brick that got interrupted and
>> didn't finish re-labeling the xattrs?
>>
>> No, replace-brick only initiates AFR self-heal, which just copies the
>> contents from the other brick(s) of the *same* replica pair into the
>> replaced brick. The link-to files are created by DHT when you rename a
>> file from the client. If the new name hashes to a different brick, DHT
>> does not move the entire file there. It instead creates the link-to file
>> (the one with the dht.linkto xattr) on the hashed subvol. The value of
>> this xattr points to the brick where the actual data resides (use
>> `getfattr -e text` to see it for yourself). Perhaps you had attempted a
>> rebalance or remove-brick earlier and interrupted it?
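>>
>> For example, on the brick holding one of the zero-byte files (path is
>> illustrative):
>>
>>   # getfattr -n trusted.glusterfs.dht.linkto -e text /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>
>> The value names the replica subvolume that holds the real data
>> (apps-replicate-2 in the xattr dumps further down this thread).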
>>
>> Should I be thinking of some heuristics to identify and fix these issues
>> (incorrect brick placement) with a script, or is this something a
>> fix-layout or repeated volume heals can fix? I've already completed a
>> whole heal on this particular volume this week and it healed about
>> 1,000,000 files (mostly data and metadata, but about 20,000 entry heals
>> as well).
>>
>> Maybe you should let the AFR self-heals complete first and then attempt a
>> full rebalance to take care of the DHT link-to files. But if the files
>> number in the millions, it could take quite some time to complete.
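>>
>> Something like this, once the pending heal counts reach zero (volume
>> name assumed):
>>
>>   # gluster volume heal apps info summary
>>   # gluster volume rebalance apps start
>>   # gluster volume rebalance apps status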
>> Regards,
>> Ravi
>>
>> Thanks for your support,
>>
>> ¹ https://joejulian.name/post/dht-misses-are-expensive/
>>
>> On Fri, May 31, 2019 at 7:57 AM Ravishankar N <ravishankar at redhat.com>
>> wrote:
>>
>>>
>>> On 31/05/19 3:20 AM, Alan Orth wrote:
>>>
>>> Dear Ravi,
>>>
>>> I spent a bit of time inspecting the xattrs on some files and
>>> directories on a few bricks for this volume and it looks a bit messy.
>>> Even if I could make sense of it for a few and potentially heal them
>>> manually, there are millions of files and directories in total, so
>>> that's definitely not a scalable solution. After a few missteps with
>>> `replace-brick ... commit force` in the last week (one of which was on a
>>> brick that was dead/offline), as well as some premature `remove-brick`
>>> commands, I'm unsure how to proceed and I'm getting demotivated. It's
>>> scary how quickly things get out of hand in distributed systems...
>>>
>>> Hi Alan,
>>> The one good thing about gluster is that the data is always available
>>> directly on the backend bricks even if your volume has inconsistencies
>>> at the gluster level. So theoretically, if your cluster is FUBAR, you
>>> could just create a new volume and copy all the data onto it via its
>>> mount from the old volume's bricks.
>>>
>>>
>>> I had hoped that bringing the old brick back up would help, but by the
>>> time I added it again a few days had passed and all the brick-ids had
>>> changed due to the replace/remove brick commands, not to mention that
>>> the trusted.afr.$volume-client-xx values were now probably pointing to
>>> the wrong bricks (?).
>>>
>>> Anyways, a few hours ago I started a full heal on the volume and I see
>>> that there is a sustained 100MiB/sec of network traffic going from the
>>> old brick's host to the new one. The completed heals reported in the
>>> logs look promising too:
>>>
>>> Old brick host:
>>>
>>> # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
>>> 281614 Completed data selfheal
>>> 84 Completed entry selfheal
>>> 299648 Completed metadata selfheal
>>>
>>> New brick host:
>>>
>>> # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
>>> 198256 Completed data selfheal
>>> 16829 Completed entry selfheal
>>> 229664 Completed metadata selfheal
>>>
>>> So that's good I guess, though I have no idea how long it will take or
>>> if it will fix the "missing files" issue on the FUSE mount. I've
>>> increased cluster.shd-max-threads to 8 to hopefully speed up the heal
>>> process.
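>>> (For reference, that's set with something like
>>> `gluster volume set <volume> cluster.shd-max-threads 8`.)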
>>>
>>> The afr xattrs should not cause files to disappear from the mount. If
>>> the xattr names do not match what each AFR subvol expects for its
>>> children (e.g. in a replica 2 volume, trusted.afr.*-client-{0,1} for the
>>> 1st subvol, client-{2,3} for the 2nd subvol, and so on), then it won't
>>> heal the data, that is all. But in your case I see some inconsistencies,
>>> like one brick having the actual file (licenseserver.cfg) and the other
>>> having a linkto file (the one with the dht.linkto xattr) *in the same
>>> replica pair*.
>>>
>>>
>>> I'd be happy for any advice or pointers,
>>>
>>> Did you check if the .glusterfs hardlinks/symlinks exist and are in
>>> order for all bricks?
>>>
>>> -Ravi
>>>
>>>
>>> On Wed, May 29, 2019 at 5:20 PM Alan Orth <alan.orth at gmail.com>
>>> wrote:
>>>
>>>> Dear Ravi,
>>>>
>>>> Thank you for the link to the blog post series; it is very informative
>>>> and current! If I understand your blog post correctly then I think the
>>>> answer to your previous question about pending AFRs is: no, there are
>>>> no pending AFRs. I have identified one file that is a good test case
>>>> for trying to understand what happened after I issued the `gluster
>>>> volume replace-brick ... commit force` a few days ago and then added
>>>> the same original brick back to the volume later. This is the current
>>>> state of the replica 2 distribute/replicate volume:
>>>>
>>>> [root@wingu0 ~]# gluster volume info apps
>>>>
>>>> Volume Name: apps
>>>> Type: Distributed-Replicate
>>>> Volume ID: f118d2da-79df-4ee1-919d-53884cd34eda
>>>> Status: Started
>>>> Snapshot Count: 0
>>>> Number of Bricks: 3 x 2 = 6
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: wingu3:/mnt/gluster/apps
>>>> Brick2: wingu4:/mnt/gluster/apps
>>>> Brick3: wingu05:/data/glusterfs/sdb/apps
>>>> Brick4: wingu06:/data/glusterfs/sdb/apps
>>>> Brick5: wingu0:/mnt/gluster/apps
>>>> Brick6: wingu05:/data/glusterfs/sdc/apps
>>>> Options Reconfigured:
>>>> diagnostics.client-log-level: DEBUG
>>>> storage.health-check-interval: 10
>>>> nfs.disable: on
>>>>
>>>> I checked the xattrs of one file that is missing from the volume's
>>>> FUSE mount (though I can read it if I access its full path explicitly),
>>>> but is present in several of the volume's bricks (some with full size,
>>>> others empty):
>>>>
>>>> [root@wingu0 ~]# getfattr -d -m. -e hex /mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>>> trusted.afr.apps-client-3=0x000000000000000000000000
>>>> trusted.afr.apps-client-5=0x000000000000000000000000
>>>> trusted.afr.dirty=0x000000000000000000000000
>>>> trusted.bit-rot.version=0x0200000000000000585a396f00046e15
>>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>>>
>>>> [root@wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>>> trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>>>>
>>>> [root@wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>>>
>>>> [root@wingu06 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>> getfattr: Removing leading '/' from absolute path names
>>>> # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
>>>> security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
>>>> trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
>>>> trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
>>>> trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
>>>>
>>>> According to the trusted.afr.apps-client-xx xattrs this particular
>>>> file should be on the bricks with IDs "apps-client-3" and
>>>> "apps-client-5". It took me a few hours to realize that the brick-id
>>>> values are recorded in the volume's volfiles in
>>>> /var/lib/glusterd/vols/apps/bricks. After comparing those brick-id
>>>> values with a volfile backup from before the replace-brick, I realized
>>>> that the files are simply on the wrong brick now as far as Gluster is
>>>> concerned. This particular file is now on the brick for
>>>> "apps-client-4". As an experiment I copied this one file to the two
>>>> bricks listed in the xattrs and I was then able to see the file from
>>>> the FUSE mount (yay!).
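>>>>
>>>> For reference, the mapping can be listed with something along these
>>>> lines (assuming the key in those per-brick files is literally
>>>> "brick-id"):
>>>>
>>>>   # grep -H brick-id /var/lib/glusterd/vols/apps/bricks/*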
>>>>
>>>> Other than replacing the brick, removing it, and then adding the old
>>>> brick on the original server back, there has been no change in the data
>>>> this entire time. Can I change the brick IDs in the volfiles so they
>>>> reflect where the data actually is? Or perhaps script something to
>>>> reset all the xattrs on the files/directories to point to the correct
>>>> bricks?
>>>>
>>>> Thank you for any help or pointers,
>>>>
>>>> On Wed, May 29, 2019 at 7:24 AM Ravishankar N <ravishankar at redhat.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On 29/05/19 9:50 AM, Ravishankar N wrote:
>>>>>
>>>>>
>>>>> On 29/05/19 3:59 AM, Alan Orth wrote:
>>>>>
>>>>> Dear Ravishankar,
>>>>>
>>>>> I'm not sure if Brick4 had pending AFRs because I don't know what
>>>>> that means, and it's been a few days so I am not sure I would be able
>>>>> to find that information.
>>>>>
>>>>> When you find some time, have a look at a blog <http://wp.me/peiBB-6b>
>>>>> series I wrote about AFR; I've tried to explain in it what one needs
>>>>> to know to debug replication-related issues.
>>>>>
>>>>> I made a typo. The correct URL for the blog is https://wp.me/peiBB-6b
>>>>>
>>>>> -Ravi
>>>>>
>>>>>
>>>>> Anyways, after wasting a few days rsyncing the old brick to a new
>>>>> host, I decided to just try to add the old brick back into the volume
>>>>> instead of bringing it up on the new host. I created a new brick
>>>>> directory on the old host, moved the old brick's contents into that
>>>>> new directory (minus the .glusterfs directory), added the new brick to
>>>>> the volume, and then did Vlad's find/stat trick¹ from the brick to the
>>>>> FUSE mount point.
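>>>>>
>>>>> The trick, roughly: walk the brick and stat every path via the FUSE
>>>>> mount so that Gluster looks each one up and creates any missing
>>>>> metadata (paths here are illustrative):
>>>>>
>>>>>   cd /data/glusterfs/sdb/apps
>>>>>   find . -not -path './.glusterfs/*' -exec stat /mnt/apps/{} \; >/dev/null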
>>>>>
>>>>> The interesting problem I have now is that some files don't appear in
>>>>> the FUSE mount's directory listings, but I can actually list them
>>>>> directly and even read them. What could cause that?
>>>>>
>>>>> Not sure; there are too many variables in the hacks that you did for
>>>>> me to take a guess. You can check whether the contents of the
>>>>> .glusterfs folder are in order on the new brick (for example, that the
>>>>> hardlinks for files and the symlinks for directories are present,
>>>>> etc.).
>>>>> Regards,
>>>>> Ravi
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> ¹ https://lists.gluster.org/pipermail/gluster-users/2018-February/033584.html
>>>>>
>>>>> On Fri, May 24, 2019 at 4:59 PM Ravishankar N <ravishankar at redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On 23/05/19 2:40 AM, Alan Orth wrote:
>>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> I seem to have gotten into a tricky situation. Today I brought up a
>>>>>> shiny new server with new disk arrays and attempted to replace one
>>>>>> brick of a replica 2 distribute/replicate volume on an older server
>>>>>> using the `replace-brick` command:
>>>>>>
>>>>>> # gluster volume replace-brick homes wingu0:/mnt/gluster/homes wingu06:/data/glusterfs/sdb/homes commit force
>>>>>>
>>>>>> The command was successful and I see the new brick in the output of
>>>>>> `gluster volume info`. The problem is that Gluster doesn't seem to be
>>>>>> migrating the data,
>>>>>>
>>>>>> `replace-brick` definitely must heal (not migrate) the data. In your
>>>>>> case, data must have been healed from Brick-4 to the replaced
>>>>>> Brick-3. Are there any errors in the self-heal daemon logs on
>>>>>> Brick-4's node? Does Brick-4 have pending AFR xattrs blaming Brick-3?
>>>>>> The doc is a bit out of date; the replace-brick command internally
>>>>>> does all the setfattr steps that are mentioned in it.
>>>>>>
>>>>>> -Ravi
>>>>>>
>>>>>>
>>>>>> and now the original brick that I replaced is no longer part of the
>>>>>> volume (and a few terabytes of data are just sitting on the old
>>>>>> brick):
>>>>>>
>>>>>> # gluster volume info homes | grep -E "Brick[0-9]:"
>>>>>> Brick1: wingu4:/mnt/gluster/homes
>>>>>> Brick2: wingu3:/mnt/gluster/homes
>>>>>> Brick3: wingu06:/data/glusterfs/sdb/homes
>>>>>> Brick4: wingu05:/data/glusterfs/sdb/homes
>>>>>> Brick5: wingu05:/data/glusterfs/sdc/homes
>>>>>> Brick6: wingu06:/data/glusterfs/sdc/homes
>>>>>>
>>>>>> I see the Gluster docs have a more complicated procedure for
>>>>>> replacing bricks that involves getfattr/setfattr¹. How can I tell
>>>>>> Gluster about the old brick? I see that I have a backup of the old
>>>>>> volfile thanks to yum's rpmsave function, if that helps.
>>>>>>
>>>>>> We are using Gluster 5.6 on CentOS 7. Thank you for any advice you
>>>>>> can give.
>>>>>>
>>>>>> ¹ https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-faulty-brick
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
--
Alan Orth
alan.orth at gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." –Friedrich Nietzsche