I am continuing a thread from March last year; please
see those previous postings for the background.
I am having the same problem again, but this time I have
found the cause and a way to fix it. It looks to me like
a bug, though I can't be sure.
I have a live mail spool on a replica 3 volume. It has
a standard IMAP directory structure in the form
/volume_mountpoint/username/Maildir/{new,cur,tmp} .
It is important to know that the {new,cur,tmp} directories
are automatically created by the mail server if they
do not already exist. New unseen mail is placed in new
and is automatically moved to cur when an IMAP client
sees it.
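To make that mkdir-on-demand behaviour concrete, here is a minimal sketch of Maildir-style delivery in Python. It is a toy illustration, not the mail server's actual code; the function name and layout are my own:

```python
import os
import socket
import time

def deliver(maildir, message):
    """Toy Maildir delivery: recreate new/cur/tmp if missing, write the
    message into tmp under a unique name, then rename it into new."""
    for sub in ("new", "cur", "tmp"):
        # mkdir-on-demand: this is what lets a mail server silently
        # create a fresh "new" directory if the old one is unreachable.
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    # Unique filename in the usual Maildir spirit: time, pid, hostname.
    name = "%d.P%d.%s" % (time.time(), os.getpid(), socket.gethostname())
    tmp_path = os.path.join(maildir, "tmp", name)
    with open(tmp_path, "wb") as f:
        f.write(message)
    # rename() is atomic within one filesystem, so a reader of new/
    # never sees a half-written message.
    new_path = os.path.join(maildir, "new", name)
    os.rename(tmp_path, new_path)
    return new_path
```

An IMAP client marking the message as seen corresponds to another rename, from new/ to cur/.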
Once again I have some entries that simply won't heal,
with the same "errno 22" in the logs as last time. All
of these unhealable entries are directories. I compared
the directories and their contents with ls -ld /path/to/dir
and ls -l /path/to/dir on the mounts of all three bricks.
The directories and their contents were identical everywhere.
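For reference, what ls cannot show is the gfid, so a mismatch like the one below stays invisible to it. One can compare the trusted.gfid extended attribute of the same directory on each brick (e.g. getfattr -n trusted.gfid -e hex, run as root on each brick server). A small Python sketch of that comparison; the helper names are mine and the brick paths would be site-specific:

```python
import os

def read_gfid(brick_path):
    """Read GlusterFS's trusted.gfid xattr from a path on a brick.
    trusted.* xattrs are privileged, so this needs root on the brick."""
    return os.getxattr(brick_path, "trusted.gfid").hex()

def gfids_agree(brick_paths, reader=read_gfid):
    """Compare the gfid of the same directory across bricks. A healthy
    replica shows one gfid; a duplicate-entry problem can show several."""
    seen = {path: reader(path) for path in brick_paths}
    return len(set(seen.values())) == 1, seen
```

ls -ld only compares mode, owner, size and mtime, which were all identical here; the gfid is the piece of identity the bricks can still disagree about.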
So I tried moving one of the unhealable directories
off the gluster replica, intending to move it back
on again afterwards. What happened surprised me greatly
(output shortened, uid:gid omitted):
# ls -l /mnt/vmail/username/Maildir/new
total 274
-rw------- 1 10050 Mar 4 08:45
1646383528.M178705P59709V000000000000002EI82CC032CE98F87ED_1.node03.nettheatre.org,S=10050
-rw------- 1 183700 Mar 4 09:26
1646385969.M991789P60955V000000000000002EI9EB5A06448629596_1.node03.nettheatre.org,S=183700
-rw------- 1 6062 Mar 4 09:52
1646387533.M495363P61757V000000000000002EIB73C97A7F4E5E243_1.node03.nettheatre.org,S=6062
-rw------- 1 15646 Mar 4 10:20
1646389259.M17459P62633V000000000000002EI97FFAE35F02DDCC8_1.node03.nettheatre.org,S=15646
-rw------- 1 9254 Mar 4 10:56
1646391406.M98944P63701V000000000000002EIBE0BA94C5363CF98_1.node03.nettheatre.org,S=9254
-rw------- 1 31719 Mar 4 11:07
1646392073.M104124P64011V000000000000002EI8BEB5A4B698B5F97_1.node03.nettheatre.org,S=31719
-rw------- 1 5782 Mar 4 12:12
1646395962.M316395P65769V000000000000002EIA75B42807A9649D5_1.node03.nettheatre.org,S=5782
-rw------- 1 16061 Mar 4 12:22
1646396577.M41309P66103V000000000000002EIA108E5579AA913E1_1.node03.nettheatre.org,S=16061
# mv /mnt/vmail/username/Maildir/new /root/
# ls -l /mnt/vmail/username/Maildir/new
total 72
-rw------- 1 1071 Oct 11 11:23
1633951401.M287288P1545V000000000000FD00I000000000164106D_1.node01.nettheatre.org,S=1071
-rw------- 1 3569 Oct 11 11:24
1633951466.M405994P1571V000000000000FD00I000000000164106E_1.node01.nettheatre.org,S=3569
-rw------- 1 2521 Oct 11 11:51
1633953065.M213650P2762V000000000000FD00I000000000164108A_1.node01.nettheatre.org,S=2521
-rw------- 1 8674 Oct 11 12:16
1633954562.M295498P4083V000000000000FD00I0000000001641099_1.node01.nettheatre.org,S=8674
-rw------- 1 8629 Oct 11 12:16
1633954562.M939396P4087V000000000000FD00I000000000164109C_1.node01.nettheatre.org,S=8629
-rw------- 1 9362 Oct 11 12:39
1633955941.M968102P5102V000000000000FD00I000000000164109D_1.node01.nettheatre.org,S=9362
-rw------- 1 12023 Oct 11 13:41
1633959672.M502160P8408V000000000000FD00I000000000164109E_1.node01.nettheatre.org,S=12023
-rw------- 1 12020 Oct 11 14:06
1633961218.M38654P9430V000000000000FD00I00000000016410A1_1.node01.nettheatre.org,S=12020
The above is the exact sequence of commands, nothing
skipped. I moved the "new" directory off the volume,
and underneath it there was another directory with the
exact same name and completely different content. That
clearly explains why the directory could not heal.
This is the same phenomenon that you get if you
mount a partition on a mountpoint that already
contains files: the contents of the newly mounted
partition mask the physical contents of the mountpoint.
How could this happen? I can only guess that the
directory "new" at some point became unavailable
on the brick that the mail server was working on.
A mail arrived for the user, so the mail server
created a "new" directory again, to which gluster
gave a new gfid. End result: two gfids for the exact
same path name. Of course it can't heal.
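My guess can be sketched as a toy simulation, deliberately ignoring quorum, AFR journalling and everything else real gluster does; Brick and mkdir_if_missing are invented names, and this is only one possible sequence that ends in two gfids for one path:

```python
import uuid

class Brick:
    """Toy brick: just a map from path to gfid, plus reachability."""
    def __init__(self):
        self.entries = {}
        self.online = True

def mkdir_if_missing(bricks, path):
    """Model the mail server's mkdir-on-demand as seen through the
    client: if no reachable brick has the path, create it with a
    fresh gfid on every brick that is reachable right now."""
    online = [b for b in bricks if b.online]
    if not any(path in b.entries for b in online):
        gfid = uuid.uuid4().hex
        for b in online:
            b.entries[path] = gfid

bricks = [Brick() for _ in range(3)]

# "new" is first created while brick 0 is unreachable...
bricks[0].online = False
mkdir_if_missing(bricks, "Maildir/new")

# ...and later mail arrives while only brick 0 is reachable,
# so the mail server recreates "new" and it gets a second gfid.
bricks[0].online = True
bricks[1].online = bricks[2].online = False
mkdir_if_missing(bricks, "Maildir/new")
bricks[1].online = bricks[2].online = True

# Two different gfids now share the exact same path: unhealable.
gfids = {b.entries["Maildir/new"] for b in bricks}
```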
And sure enough, as soon as I moved Maildir/new
off the volume, that entry healed. All I had to
do then was
# mv /root/new/* /mnt/vmail/username/Maildir/new/
to put the newer mail back where the mail server
will (hopefully) see it.
These seem relevant:
https://access.redhat.com/errata/RHBA-2021:1462
https://bugzilla.redhat.com/show_bug.cgi?id=1640148
I have
cluster.use-anonymous-inode yes
The volume in question here was created and
populated on glusterfs-server 9.x; not sure about
the minor version back then. From September 2021
until now I've been running 9.3. The double-entry
error above occurred on or shortly after October 11,
so it's certainly on 9.3.
While I suspect that this is a bug, I won't open
an issue on GitHub, (a) because I'm not completely
sure it is a bug, and (b) because there's a bot
running around there closing even confirmed bugs
that haven't been worked on for a while.
Please ask if there's anything you'd like me to
test or report.
Z