Computerisms Corporation
2017-Oct-27 05:44 UTC
[Samba] ctdb vacuum timeouts and record locks
Hi List,

I set up a ctdb cluster a couple of months back.  Things seemed pretty
solid for the first 2-3 weeks, but then I started getting reports of
people not being able to access files, or sometimes directories.  It has
taken me a while to figure some stuff out, but the common denominator
seems to be vacuuming timeouts for locking.tdb in the ctdb log, which
might repeat every 2 minutes and 10 seconds for anywhere from an hour to
more than a day, after which the log also starts reporting failures to
get a RECORD lock on the same tdb file.  Whenever I get a report about
inaccessible files I find this in the ctdb logs:

ctdbd[89]: Vacuuming child process timed out for db locking.tdb
ctdbd[89]: Vacuuming child process timed out for db locking.tdb
ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 10 seconds
ctdbd[89]: Set lock debugging helper to "/usr/local/samba/etc/ctdb/debug_locks.sh"
/usr/local/samba/etc/ctdb/debug_locks.sh: 142: /usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory nonexistent
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
sh: echo: I/O error
cat: write error: Broken pipe
sh: echo: I/O error
ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 20 seconds
/usr/local/samba/etc/ctdb/debug_locks.sh: 142: /usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory nonexistent
sh: echo: I/O error
sh: echo: I/O error

From googling, it is okay for the vacuuming process to time out; it
should succeed the next time, and if it doesn't the only harm is a
bloated file.  But it never succeeds after the first time I see this
message, and the locking.tdb file does not change size, bigger or
smaller.

I am not really clear on what the script cannot create, but I found no
evidence of the gstack program being available on Debian, so I changed
the script to run pstack instead, and then ran it manually with set -x
while the logs were recording the problem.  I think this is the trace
output it is trying to come up with, but sadly this isn't meaningful to
me (yet!):

cat /proc/30491/stack
[<ffffffff8197d00d>] inet_recvmsg+0x7d/0xb0
[<ffffffffc07c3856>] request_wait_answer+0x166/0x1f0 [fuse]
[<ffffffff814b8d50>] prepare_to_wait_event+0xf0/0xf0
[<ffffffffc07c3958>] __fuse_request_send+0x78/0x80 [fuse]
[<ffffffffc07c6bdd>] fuse_simple_request+0xbd/0x190 [fuse]
[<ffffffffc07ccc37>] fuse_setlk+0x177/0x190 [fuse]
[<ffffffff816592f7>] SyS_flock+0x117/0x190
[<ffffffff81403b1c>] do_syscall_64+0x7c/0xf0
[<ffffffff81a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

This might happen twice in a day or once in a week; it doesn't seem
consistent, and so far I haven't found any catalyst.

My setup is two servers.  The OS is Debian running Samba AD on dedicated
SSDs, and each server has a RAID array of HDDs for storage, with a
mirrored GlusterFS running on top of them.  Each OS has an LXC container
running the clustered member servers, with the GlusterFS mounted into
the containers.  The tdb files are in the containers, not on the shared
storage.  I do not use ctdb to start smbd/nmbd.  I can't think what else
is relevant about my setup as it pertains to this issue...

I can fix access to the files by stopping the ctdb process and just
letting the other cluster member run, but the only way I have found so
far to fix the locking.tdb file is to shut down the container; sometimes
I have to forcefully kill it from the host.

The errors are not confined to one member of the cluster; I have seen
them happen on both of them.
Though, of the people reporting the problem, it often seems to be the
same files causing it.  Before I had figured out to check the ctdb logs,
there were several occasions where people couldn't access a specific
folder, but removing a specific file from that folder fixed it.

I have put lots of hours into Google on this and nothing I have found
has turned the light bulb on in my brain.  Maybe (hopefully, actually) I
am overlooking something obvious.  Wondering if anyone can point me at
the next step in troubleshooting this?

--
Bob Miller
Cell: 867-334-7117
Office: 867-633-3760
www.computerisms.ca
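As an aside, the check that the failing debug_locks.sh helper attempts
can be approximated by hand: /proc/locks records which process holds
each lock, and that process's kernel stack can then be dumped.  A rough
sketch, assuming root inside the container; the tdb path below is only
illustrative and will differ per install:

  #!/bin/sh
  # Rough manual version of what the lock-debugging helper tries to do:
  # find which processes hold a lock on locking.tdb and dump their
  # kernel stacks.  TDB is an example path -- adjust it to the CTDB
  # database directory on your install.  Run as root.
  TDB="/usr/local/samba/var/lib/ctdb/locking.tdb.0"

  inode=$(stat -c %i "$TDB")

  # /proc/locks shows dev:inode for every held lock; field 5 is the PID.
  # Lines containing "->" are blocked waiters, so skip them.
  grep ":${inode} " /proc/locks | grep -v -- '->' | awk '{print $5}' | sort -u |
  while read -r pid; do
      echo "=== pid $pid ($(cat /proc/$pid/comm 2>/dev/null)) ==="
      cat "/proc/$pid/stack" 2>/dev/null
  done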
Hi Bob,

On Thu, 26 Oct 2017 22:44:30 -0700, Computerisms Corporation via samba
<samba at lists.samba.org> wrote:

> I set up a ctdb cluster a couple of months back.  Things seemed pretty
> solid for the first 2-3 weeks, but then I started getting reports of
> people not being able to access files, or sometimes directories.  It
> has taken me a while to figure some stuff out, but the common
> denominator seems to be vacuuming timeouts for locking.tdb in the ctdb
> log, which might repeat every 2 minutes and 10 seconds for anywhere
> from an hour to more than a day, after which the log also starts
> reporting failures to get a RECORD lock on the same tdb file.
> Whenever I get a report about inaccessible files I find this in the
> ctdb logs:
>
> ctdbd[89]: Vacuuming child process timed out for db locking.tdb
> ctdbd[89]: Vacuuming child process timed out for db locking.tdb
> ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 10 seconds
> ctdbd[89]: Set lock debugging helper to
> "/usr/local/samba/etc/ctdb/debug_locks.sh"
> /usr/local/samba/etc/ctdb/debug_locks.sh: 142:
> /usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory
> nonexistent
> sh: echo: I/O error
> sh: echo: I/O error
> sh: echo: I/O error
> sh: echo: I/O error
> cat: write error: Broken pipe
> sh: echo: I/O error
> ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 20 seconds
> /usr/local/samba/etc/ctdb/debug_locks.sh: 142:
> /usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory
> nonexistent
> sh: echo: I/O error
> sh: echo: I/O error

That's weird.  The only file really created by that script is the lock
file that is used to make sure we don't debug locks too many times.
That should be in:

  "${CTDB_SCRIPT_VARDIR}/debug_locks.lock"

The other possibility is the use of the script_log() function to try to
get the output logged.  script_log() isn't my greatest moment.  When
debugging you could just replace it with the logger command to get the
output out to syslog.
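A minimal sketch of the substitution being described here; dump_lock_info
and the syslog tag are made-up stand-ins, and the real call sites in
debug_locks.sh look different:

  #!/bin/sh
  # Hypothetical illustration only -- the real debug_locks.sh differs.
  # Idea: wherever the script pipes its collected output into
  # script_log(), pipe it into logger(1) instead so it lands in syslog.

  dump_lock_info () {
      # stand-in for whatever the script actually gathers
      cat /proc/locks
  }

  # original style (roughly):  dump_lock_info | script_log "ctdbd"
  # debugging substitute:
  dump_lock_info 2>&1 | logger -t ctdb-lock-debug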
> From googling, it is okay for the vacuuming process to time out; it
> should succeed the next time, and if it doesn't the only harm is a
> bloated file.  But it never succeeds after the first time I see this
> message, and the locking.tdb file does not change size, bigger or
> smaller.
>
> I am not really clear on what the script cannot create, but I found no
> evidence of the gstack program being available on Debian, so I changed
> the script to run pstack instead, and then ran it manually with set -x
> while the logs were recording the problem.  I think this is the trace
> output it is trying to come up with, but sadly this isn't meaningful
> to me (yet!):
>
> cat /proc/30491/stack
> [<ffffffff8197d00d>] inet_recvmsg+0x7d/0xb0
> [<ffffffffc07c3856>] request_wait_answer+0x166/0x1f0 [fuse]
> [<ffffffff814b8d50>] prepare_to_wait_event+0xf0/0xf0
> [<ffffffffc07c3958>] __fuse_request_send+0x78/0x80 [fuse]
> [<ffffffffc07c6bdd>] fuse_simple_request+0xbd/0x190 [fuse]
> [<ffffffffc07ccc37>] fuse_setlk+0x177/0x190 [fuse]
> [<ffffffff816592f7>] SyS_flock+0x117/0x190
> [<ffffffff81403b1c>] do_syscall_64+0x7c/0xf0
> [<ffffffff81a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff

I'm pretty sure gstack used to be shipped as an example in the gdb
package in Debian.  However, it isn't there and changelog.Debian.gz
doesn't mention it.  I had a quick try of pstack but couldn't get sense
out of it.  :-(

> This might happen twice in a day or once in a week; it doesn't seem
> consistent, and so far I haven't found any catalyst.
>
> My setup is two servers.  The OS is Debian running Samba AD on
> dedicated SSDs, and each server has a RAID array of HDDs for storage,
> with a mirrored GlusterFS running on top of them.  Each OS has an LXC
> container running the clustered member servers, with the GlusterFS
> mounted into the containers.  The tdb files are in the containers, not
> on the shared storage.  I do not use ctdb to start smbd/nmbd.  I can't
> think what else is relevant about my setup as it pertains to this
> issue...

Are the TDB files really on a FUSE filesystem?  Is that an artifact of
the LXC containers?  If so, could it be that locking isn't reliable on
the FUSE filesystem?

> I can fix access to the files by stopping the ctdb process and just
> letting the other cluster member run, but the only way I have found so
> far to fix the locking.tdb file is to shut down the container;
> sometimes I have to forcefully kill it from the host.
>
> The errors are not confined to one member of the cluster; I have seen
> them happen on both of them.  Though, of the people reporting the
> problem, it often seems to be the same files causing it.  Before I had
> figured out to check the ctdb logs, there were several occasions where
> people couldn't access a specific folder, but removing a specific file
> from that folder fixed it.
>
> I have put lots of hours into Google on this and nothing I have found
> has turned the light bulb on in my brain.  Maybe (hopefully, actually)
> I am overlooking something obvious.  Wondering if anyone can point me
> at the next step in troubleshooting this?

Is it possible to try this without the containers?  That would
certainly tell you if the problem is related to the container
infrastructure...

peace & happiness,
martin
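Regarding the FUSE question above: the kernel stack in the first post
appears to show a process blocked in fuse_setlk() under SyS_flock, i.e.
a lock request on the FUSE mount waiting for the gluster client to
answer.  One crude plausibility check is flock(1) run on both nodes; a
sketch using the /CTFN mount from the original post, with a made-up test
file name (this only exercises flock(), not the POSIX byte-range locks
smbd also uses):

  # create a scratch file on the gluster mount (name is made up)
  touch /CTFN/.locktest

  # On node A -- take an exclusive lock and hold it for a minute:
  flock -x /CTFN/.locktest -c 'echo "node A holds the lock"; sleep 60'

  # Meanwhile on node B -- with -n this should fail immediately if the
  # lock is propagated through gluster, and succeed if it is not:
  flock -n -x /CTFN/.locktest -c 'echo "got the lock"' \
      && echo "lock NOT visible from node B" \
      || echo "lock correctly seen as held"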
Computerisms Corporation
2017-Oct-27 17:09 UTC
[Samba] ctdb vacuum timeouts and record locks
Hi Martin,

Thanks for reading and taking the time to reply.

>> ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 20 seconds
>> /usr/local/samba/etc/ctdb/debug_locks.sh: 142:
>> /usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory
>> nonexistent
>> sh: echo: I/O error
>> sh: echo: I/O error
>
> That's weird.  The only file really created by that script is the lock
> file that is used to make sure we don't debug locks too many times.
> That should be in:
>
>   "${CTDB_SCRIPT_VARDIR}/debug_locks.lock"

Next time it happens I will check this.

> The other possibility is the use of the script_log() function to try
> to get the output logged.  script_log() isn't my greatest moment.
> When debugging you could just replace it with the logger command to
> get the output out to syslog.

Okay, that sounds useful, will see what I can do next time I see the
problem...

>> My setup is two servers.  The OS is Debian running Samba AD on
>> dedicated SSDs, and each server has a RAID array of HDDs for storage,
>> with a mirrored GlusterFS running on top of them.  Each OS has an LXC
>> container running the clustered member servers, with the GlusterFS
>> mounted into the containers.  The tdb files are in the containers,
>> not on the shared storage.  I do not use ctdb to start smbd/nmbd.  I
>> can't think what else is relevant about my setup as it pertains to
>> this issue...
>
> Are the TDB files really on a FUSE filesystem?  Is that an artifact of
> the LXC containers?  If so, could it be that locking isn't reliable on
> the FUSE filesystem?

No.  The TDB files are in the container, and the container is on the
SSD with the OS.  Running mount from within the container shows:

/dev/sda1 on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)

However, the gluster native client is fuse-based, so the data is stored
on a fuse filesystem which is mounted in the container:

masterchieflian:ctfngluster on /CTFN type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,allow_other,max_read=131072)

Since this is where the files that become inaccessible are, perhaps
this is really where the problem is, and not with the locking.tdb file?
I will investigate file locks on the gluster system...

> Is it possible to try this without the containers?  That would
> certainly tell you if the problem is related to the container
> infrastructure...

I like to think everything is possible, but it's not really feasible in
this case.  Since there are only two physical servers, and they need to
be running AD, the only way to separate the containers now is with
additional machines to act as member servers.  And because everything
tested fine, and actually was fine for at least two weeks, these servers
are in production now and have been for a few months.  If I have to go
this way, it will certainly be a last resort...

Thanks again for your reply, will get back to you with what I find...

> peace & happiness,
> martin
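Since the plan above is to gather more detail "next time it happens",
one option is a small watcher that snapshots lock and process state
whenever the vacuum-timeout message shows up.  Everything in this sketch
is an assumption to adapt: the ctdb log location, the polling interval,
and the output directory.

  #!/bin/sh
  # Hypothetical watcher: poll the ctdb log for the vacuum-timeout
  # message and snapshot /proc/locks and the process list so there is
  # data to look at after the fact.  LOG and OUT are assumptions --
  # point them at the real log file and a scratch directory.
  LOG="/usr/local/samba/var/log/log.ctdb"
  OUT="/root/ctdb-lock-snapshots"

  mkdir -p "$OUT"
  while sleep 60; do
      if tail -n 20 "$LOG" | grep -q "Vacuuming child process timed out"; then
          ts=$(date +%Y%m%d-%H%M%S)
          cp /proc/locks "$OUT/locks.$ts"
          ps axww > "$OUT/ps.$ts"
      fi
  done

Note it will keep taking snapshots every minute for as long as the
message stays in the log tail, so prune the output directory as needed.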