thr3ads.net - samba - [Samba] ctdb vacuum timeouts and record locks [Nov 2017]

Martin Schwenke

2017-Nov-06 01:15 UTC

[Samba] ctdb vacuum timeouts and record locks

On Thu, 2 Nov 2017 11:17:27 -0700, Computerisms Corporation via samba
<samba at lists.samba.org> wrote:
> This occurred again this morning, when the user reported the problem, I 
> found in the ctdb logs that vacuuming has been going on since last 
> night.  The need to fix it was urgent (when isn't it?) so I didn't have
> time to poke around for clues, but immediately restarted the lxc 
> container.  But this time it wouldn't restart, which I had time to trace
> to a hung smbd process, and between that and a run of the debug_locks.sh 
> script, I traced it to the user reporting the problem.  Given that the 
> user was primarily having problems with files in a given folder, I am 
> thinking this is because of some kind of lock on a file within that 
> folder.
> 
> Ended up rebooting both physical machines, problem solved.  for now.
> 
> So, not sure how to determine if this is a gluster problem, an lxc 
> problem, or a ctdb/smbd problem.  Thoughts/suggestions are welcome...

You need a stack trace of the stuck smbd process.  If it is wedged in a
system call on the cluster filesystem then you can blame the cluster
filesystem.  debug_locks.sh is meant to be able to get you the relevant
stack trace via gstack.  In fact, even before you get the stack trace
you could check a process listing to see if the process is stuck in D
state.

gstack basically does:

  gdb -batch -ex "thread apply all bt" -p <pid>

For a single-threaded process it leaves out "thread apply all".
However, in recent GDB I'm not sure it makes a difference... seems to
work for me on Linux.

Note that gstack/gdb will hang when run against a process in D state.
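
The process-listing check mentioned above can be sketched in Python.  This
is only a rough illustration, assuming a Linux /proc layout.  The names
parse_stat and uninterruptible_processes are made up for this example and
are not part of ctdb or debug_locks.sh.  It reads /proc/<pid>/stat instead
of attaching to the process with gdb.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so locate the first '(' and the last ')'.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    # The field right after the command name is the one-letter state.
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in D state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited or is not readable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

A process can pass through D state briefly during normal I/O, so a single
hit is not proof of a hang.  An smbd that shows up on every run is the one
worth a closer look.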

peace & happiness,
martin

Computerisms Corporation

2017-Nov-15 06:48 UTC

Hi Martin,

Well, it has been over a week since my last hung process, but I got
another one today...

>> So, not sure how to determine if this is a gluster problem, an lxc
>> problem, or a ctdb/smbd problem.  Thoughts/suggestions are welcome...
> 
> You need a stack trace of the stuck smbd process.  If it is wedged in a
> system call on the cluster filesystem then you can blame the cluster
> filesystem.  debug_locks.sh is meant to be able to get you the relevant
> stack trace via gstack.  In fact, even before you get the stack trace
> you could check a process listing to see if the process is stuck in D
> state.

So, yes, I do have a process stuck in the D state, and it is an smbd
process.  Matching up the times in the logs, I see that the
"Vacuuming child process timed out for db locking.tdb" error in ctdb
lines up with the user who owns the smbd process accessing a file
that has been problematic before.  It is an xlsx file.

> gstack basically does:
> 
>    gdb -batch -ex "thread apply all bt" -p <pid>
> 
> For a single-threaded process it leaves out "thread apply all".
> However, in recent GDB I'm not sure it makes a difference... seems to
> work for me on Linux.
> 
> Note that gstack/gdb will hang when run against a process in D state.

Indeed, gdb, pstack, and strace all either hang or output no information.

I have been trying to find a way to get the actual gdb output, but all I 
can seem to find is the contents of /proc/<pid>/stack:

[<ffffffffc05ed856>] request_wait_answer+0x166/0x1f0 [fuse]
[<ffffffffa04b8d50>] prepare_to_wait_event+0xf0/0xf0
[<ffffffffc05ed958>] __fuse_request_send+0x78/0x80 [fuse]
[<ffffffffc05f0bdd>] fuse_simple_request+0xbd/0x190 [fuse]
[<ffffffffc05f6c37>] fuse_setlk+0x177/0x190 [fuse]
[<ffffffffa0659467>] SyS_flock+0x117/0x190
[<ffffffffa0403b1c>] do_syscall_64+0x7c/0xf0
[<ffffffffa0a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

I am still not too sure how to interpret this, but I think this is 
pointing me to the gluster file system, so will see what I can find 
chasing that down...

> 
> peace & happiness,
> martin
>

Martin Schwenke

2017-Nov-15 07:47 UTC

On Tue, 14 Nov 2017 22:48:57 -0800, Computerisms Corporation via samba
<samba at lists.samba.org> wrote:
> well, it has been over a week since my last hung process, but got 
> another one today...
> >> So, not sure how to determine if this is a gluster problem, an lxc
> >> problem, or a ctdb/smbd problem.  Thoughts/suggestions are welcome...
> > 
> > You need a stack trace of the stuck smbd process.  If it is wedged in a
> > system call on the cluster filesystem then you can blame the cluster
> > filesystem.  debug_locks.sh is meant to be able to get you the relevant
> > stack trace via gstack.  In fact, even before you get the stack trace
> > you could check a process listing to see if the process is stuck in D
> > state.  
> 
> So, yes, I do have a process stuck in the D state, and it is an smbd
> process.  Matching up the times in the logs, I see that the
> "Vacuuming child process timed out for db locking.tdb" error in ctdb
> lines up with the user who owns the smbd process accessing a file
> that has been problematic before.  It is an xlsx file.
> 
> > gstack basically does:
> > 
> >    gdb -batch -ex "thread apply all bt" -p <pid>
> > 
> > For a single-threaded process it leaves out "thread apply all".
> > However, in recent GDB I'm not sure it makes a difference... seems to
> > work for me on Linux.
> > 
> > Note that gstack/gdb will hang when run against a process in D state.
> 
> Indeed, gdb, pstack, and strace all either hang or output no information.
> 
> I have been trying to find a way to get the actual gdb output, but all I 
> can seem to find is the contents of /proc/<pid>/stack:
> 
> [<ffffffffc05ed856>] request_wait_answer+0x166/0x1f0 [fuse]
> [<ffffffffa04b8d50>] prepare_to_wait_event+0xf0/0xf0
> [<ffffffffc05ed958>] __fuse_request_send+0x78/0x80 [fuse]
> [<ffffffffc05f0bdd>] fuse_simple_request+0xbd/0x190 [fuse]
> [<ffffffffc05f6c37>] fuse_setlk+0x177/0x190 [fuse]
> [<ffffffffa0659467>] SyS_flock+0x117/0x190
> [<ffffffffa0403b1c>] do_syscall_64+0x7c/0xf0
> [<ffffffffa0a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> I am still not too sure how to interpret this, but I think this is 
> pointing me to the gluster file system, so will see what I can find 
> chasing that down...

Yes, it does look like it is in the gluster filesystem.
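
One way to read a /proc/<pid>/stack dump like the one quoted above is to
pull out the system call and the kernel modules named in the frames.  The
Python below is a minimal sketch of that, not an existing tool.  The
function names are invented for this example, and the sample data is the
trace from this thread.

```python
import re

# Sample data: the /proc/<pid>/stack contents posted in this thread.
TRACE = """
[<ffffffffc05ed856>] request_wait_answer+0x166/0x1f0 [fuse]
[<ffffffffa04b8d50>] prepare_to_wait_event+0xf0/0xf0
[<ffffffffc05ed958>] __fuse_request_send+0x78/0x80 [fuse]
[<ffffffffc05f0bdd>] fuse_simple_request+0xbd/0x190 [fuse]
[<ffffffffc05f6c37>] fuse_setlk+0x177/0x190 [fuse]
[<ffffffffa0659467>] SyS_flock+0x117/0x190
[<ffffffffa0403b1c>] do_syscall_64+0x7c/0xf0
[<ffffffffa0a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
"""

# "[<address>] symbol+offset/size [module]", where the module is optional.
FRAME_RE = re.compile(r"\[<([0-9a-f]+)>\]\s+(\S+)(?:\s+\[([^\]]+)\])?")


def parse_stack(text):
    """Return a list of (symbol, module) tuples, innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        symbol = match.group(2).split("+", 1)[0]  # drop "+offset/size"
        frames.append((symbol, match.group(3)))
    return frames


def syscall_and_modules(frames):
    """Return the first syscall entry symbol and the modules involved."""
    prefixes = ("SyS_", "sys_", "__x64_sys_")
    syscall = next((s for s, _ in frames if s.startswith(prefixes)), None)
    modules = sorted({m for _, m in frames if m})
    return syscall, modules


if __name__ == "__main__":
    syscall, modules = syscall_and_modules(parse_stack(TRACE))
    print(syscall, modules)
```

For this trace the syscall entry is SyS_flock and the only module named is
fuse, with request_wait_answer as the innermost frame.  That reads as a
flock() call whose lock request was handed to the FUSE filesystem and is
still waiting for an answer, which fits a FUSE-mounted gluster volume.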

Are you only accessing the filesystem via Samba or do you also have
something like NFS exports?  If you are only exporting via Samba then
you could try setting "posix locking = no" in your Samba
configuration.  However, please read the documentation for that option
in smb.conf(5) and be sure of your use-case before trying this on a
production system...
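
For illustration only, a share definition with that option might look like
the fragment below.  The share name and path are placeholders and are not
taken from this thread.  smb.conf(5) lists "posix locking" as a per-share
option, so it could also go in [global] to apply to every share.

```
[projects]
    # hypothetical share name and path
    path = /mnt/gluster/projects
    # keep SMB byte-range locks in Samba's own lock database only,
    # without mapping them to POSIX locks on the filesystem
    posix locking = no
```

Running testparm after the change should show whether the configuration
still parses cleanly.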

peace & happiness,
martin
