On Thu, 2 Nov 2017 11:17:27 -0700, Computerisms Corporation via samba <samba at lists.samba.org> wrote:

> This occurred again this morning, when the user reported the problem, I
> found in the ctdb logs that vacuuming has been going on since last
> night. The need to fix it was urgent (when isn't it?) so I didn't have
> time to poke around for clues, but immediately restarted the lxc
> container. But this time it wouldn't restart, which I had time to trace
> to a hung smbd process, and between that and a run of the debug_locks.sh
> script, I traced it to the user reporting the problem. Given that the
> user was primarily having problems with files in a given folder, I am
> thinking this is because of some kind of lock on a file within that
> folder.
>
> Ended up rebooting both physical machines, problem solved. for now.
>
> So, not sure how to determine if this is a gluster problem, an lxc
> problem, or a ctdb/smbd problem. Thoughts/suggestions are welcome...

You need a stack trace of the stuck smbd process. If it is wedged in a
system call on the cluster filesystem then you can blame the cluster
filesystem. debug_locks.sh is meant to be able to get you the relevant
stack trace via gstack. In fact, even before you get the stack trace
you could check a process listing to see if the process is stuck in D
state.

gstack basically does:

  gdb -batch -ex "thread apply all bt" -p <pid>

For a single-threaded process it leaves out "thread apply all".
However, in recent GDB I'm not sure it makes a difference... seems to
work for me on Linux.

Note that gstack/gdb will hang when run against a process in D state.

peace & happiness,
martin
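The "check a process listing for D state" step can also be done by reading /proc directly, which does not attach to the process and so should not hang the way gdb does. The following is a minimal sketch, not part of debug_locks.sh or ctdb; the function names `parse_stat_state` and `find_d_state` are made up for illustration, and it assumes a Linux /proc layout.

```python
import os


def parse_stat_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The second field (comm) is wrapped in parentheses and may itself
    contain spaces or ')', so split after the LAST ')' instead of on
    whitespace.
    """
    rest = stat_line.rsplit(")", 1)[1]
    return rest.split()[0]


def find_d_state(proc_root="/proc"):
    """Return [(pid, comm), ...] for processes in uninterruptible sleep (D)."""
    hits = []
    if not os.path.isdir(proc_root):
        return hits  # not a Linux-style /proc; nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if parse_stat_state(line) == "D":
            comm = line[line.index("(") + 1:line.rindex(")")]
            hits.append((int(entry), comm))
    return hits


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

A process shows up in D state briefly during normal I/O, so a single hit means little; an smbd that stays in the list across repeated runs is the one worth investigating.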
Computerisms Corporation
2017-Nov-15 06:48 UTC
[Samba] ctdb vacuum timeouts and record locks
Hi Martin,

well, it has been over a week since my last hung process, but got
another one today...

>> So, not sure how to determine if this is a gluster problem, an lxc
>> problem, or a ctdb/smbd problem. Thoughts/suggestions are welcome...
>
> You need a stack trace of the stuck smbd process. If it is wedged in a
> system call on the cluster filesystem then you can blame the cluster
> filesystem. debug_locks.sh is meant to be able to get you the relevant
> stack trace via gstack. In fact, even before you get the stack trace
> you could check a process listing to see if the process is stuck in D
> state.

So, yes, I do have a process stuck in the D state. It is an smbd
process. Matching up the times in the logs, I see that the
"Vacuuming child process timed out for db locking.tdb" error in ctdb
lines up with the user who owns the smbd process accessing a file
that has been problematic before. It is an xlsx file.

> gstack basically does:
>
> gdb -batch -ex "thread apply all bt" -p <pid>
>
> For a single-threaded process it leaves out "thread apply all".
> However, in recent GDB I'm not sure it makes a difference... seems to
> work for me on Linux.
>
> Note that gstack/gdb will hang when run against a process in D state.

Indeed, gdb, pstack, and strace all either hang or output no information.

I have been trying to find a way to get the actual gdb output, but all I
can seem to find is the contents of /proc/<pid>/stack:

[<ffffffffc05ed856>] request_wait_answer+0x166/0x1f0 [fuse]
[<ffffffffa04b8d50>] prepare_to_wait_event+0xf0/0xf0
[<ffffffffc05ed958>] __fuse_request_send+0x78/0x80 [fuse]
[<ffffffffc05f0bdd>] fuse_simple_request+0xbd/0x190 [fuse]
[<ffffffffc05f6c37>] fuse_setlk+0x177/0x190 [fuse]
[<ffffffffa0659467>] SyS_flock+0x117/0x190
[<ffffffffa0403b1c>] do_syscall_64+0x7c/0xf0
[<ffffffffa0a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff

I am still not too sure how to interpret this, but I think this is
pointing me to the gluster file system, so will see what I can find
chasing that down...

>
> peace & happiness,
> martin
>
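One way to read a /proc/<pid>/stack dump like the one above: the top frame is where the task is sleeping, the bracketed name at the end of a line is the kernel module that owns the frame, and the `SyS_*` frame lower down is the system call that led there. The sketch below automates that reading. It is an illustration only; `parse_kernel_stack` and `summarize` are hypothetical helper names, the `SyS_`/`__x64_sys_` prefix check is a heuristic, and the frame format is assumed to match the trace posted here.

```python
import re
from collections import Counter

# Trace copied from the message above, used as sample input.
SAMPLE_STACK = """\
[<ffffffffc05ed856>] request_wait_answer+0x166/0x1f0 [fuse]
[<ffffffffa04b8d50>] prepare_to_wait_event+0xf0/0xf0
[<ffffffffc05ed958>] __fuse_request_send+0x78/0x80 [fuse]
[<ffffffffc05f0bdd>] fuse_simple_request+0xbd/0x190 [fuse]
[<ffffffffc05f6c37>] fuse_setlk+0x177/0x190 [fuse]
[<ffffffffa0659467>] SyS_flock+0x117/0x190
[<ffffffffa0403b1c>] do_syscall_64+0x7c/0xf0
[<ffffffffa0a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
[<ffffffffffffffff>] 0xffffffffffffffff
"""

# "[<addr>] symbol+0xoffset/0xsize [module]"; offset and module are optional.
FRAME_RE = re.compile(
    r"^\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>\S+?)"
    r"(?:\+0x[0-9a-f]+/0x[0-9a-f]+)?"
    r"(?:\s+\[(?P<module>[^\]]+)\])?$"
)


def parse_kernel_stack(text):
    """Return [(symbol, module_or_None), ...], innermost frame first."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("symbol"), match.group("module")))
    return frames


def summarize(frames):
    """Report the sleeping frame, per-module frame counts, and syscall frames."""
    modules = Counter(mod for _, mod in frames if mod)
    # Heuristic: syscall entry points are named SyS_*/sys_* or __x64_sys_*.
    syscalls = [sym for sym, _ in frames
                if sym.lower().startswith(("sys_", "__x64_sys_"))]
    return {
        "top": frames[0] if frames else None,
        "modules": modules,
        "syscalls": syscalls,
    }


if __name__ == "__main__":
    info = summarize(parse_kernel_stack(SAMPLE_STACK))
    print("sleeping in:", info["top"])
    print("module frames:", dict(info["modules"]))
    print("syscalls:", info["syscalls"])
```

For the posted trace this points at a flock() call waiting inside the fuse module for an answer to a lock request, which is consistent with the request being stuck on the FUSE-mounted gluster side rather than in smbd or ctdb themselves.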
On Tue, 14 Nov 2017 22:48:57 -0800, Computerisms Corporation via samba <samba at lists.samba.org> wrote:

> well, it has been over a week since my last hung process, but got
> another one today...
>
> >> So, not sure how to determine if this is a gluster problem, an lxc
> >> problem, or a ctdb/smbd problem. Thoughts/suggestions are welcome...
> >
> > You need a stack trace of the stuck smbd process. If it is wedged in a
> > system call on the cluster filesystem then you can blame the cluster
> > filesystem. debug_locks.sh is meant to be able to get you the relevant
> > stack trace via gstack. In fact, even before you get the stack trace
> > you could check a process listing to see if the process is stuck in D
> > state.
>
> So, yes, I do have a process stuck in the D state. It is an smbd
> process. Matching up the times in the logs, I see that the
> "Vacuuming child process timed out for db locking.tdb" error in ctdb
> lines up with the user who owns the smbd process accessing a file
> that has been problematic before. It is an xlsx file.
>
> > gstack basically does:
> >
> > gdb -batch -ex "thread apply all bt" -p <pid>
> >
> > For a single-threaded process it leaves out "thread apply all".
> > However, in recent GDB I'm not sure it makes a difference... seems to
> > work for me on Linux.
> >
> > Note that gstack/gdb will hang when run against a process in D state.
>
> Indeed, gdb, pstack, and strace all either hang or output no information.
>
> I have been trying to find a way to get the actual gdb output, but all I
> can seem to find is the contents of /proc/<pid>/stack:
>
> [<ffffffffc05ed856>] request_wait_answer+0x166/0x1f0 [fuse]
> [<ffffffffa04b8d50>] prepare_to_wait_event+0xf0/0xf0
> [<ffffffffc05ed958>] __fuse_request_send+0x78/0x80 [fuse]
> [<ffffffffc05f0bdd>] fuse_simple_request+0xbd/0x190 [fuse]
> [<ffffffffc05f6c37>] fuse_setlk+0x177/0x190 [fuse]
> [<ffffffffa0659467>] SyS_flock+0x117/0x190
> [<ffffffffa0403b1c>] do_syscall_64+0x7c/0xf0
> [<ffffffffa0a0632f>] entry_SYSCALL64_slow_path+0x25/0x25
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I am still not too sure how to interpret this, but I think this is
> pointing me to the gluster file system, so will see what I can find
> chasing that down...

Yes, it does look like it is in the gluster filesystem.

Are you only accessing the filesystem via Samba or do you also have
something like NFS exports? If you are only exporting via Samba then
you could try setting "posix locking = no" in your Samba
configuration. However, please read the documentation for that option
in smb.conf(5) and be sure of your use-case before trying this on a
production system...

peace & happiness,
martin