Computerisms Corporation
2017-Oct-27 17:09 UTC
[Samba] ctdb vacuum timeouts and record locks
Hi Martin,

Thanks for reading and taking the time to reply.

>> ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 20 seconds
>> /usr/local/samba/etc/ctdb/debug_locks.sh: 142:
>> /usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory
>> nonexistent
>> sh: echo: I/O error
>> sh: echo: I/O error
>
> That's weird.  The only file really created by that script is the lock
> file that is used to make sure we don't debug locks too many times.
> That should be in:
>
>   "${CTDB_SCRIPT_VARDIR}/debug_locks.lock"

Next time it happens I will check this.

> The other possibility is the use of the script_log() function to try to
> get the output logged.  script_log() isn't my greatest moment.  When
> debugging you could just replace it with the logger command to get the
> output out to syslog.

Okay, that sounds useful, will see what I can do next time I see the
problem...

>> My setup is two servers, the OS is Debian and is running Samba AD on
>> dedicated SSDs, and each server has a RAID array of HDDs for storage,
>> with a mirrored GlusterFS running on top of them.  Each OS has an LXC
>> container running the clustered member servers with the GlusterFS
>> mounted to the containers.  The tdb files are in the containers, not on
>> the shared storage.  I do not use ctdb to start smbd/nmbd.  I can't
>> think what else is relevant about my setup as it pertains to this
>> issue...
>
> Are the TDB files really on a FUSE filesystem?  Is that an artifact of
> the LXC containers?  If so, could it be that locking isn't reliable on
> the FUSE filesystem?

No.  The TDB files are in the container, and the container is on the SSD
with the OS.  Running mount from within the container shows:

/dev/sda1 on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)

However, the gluster native client is a FUSE-based system, so the data is
stored on a FUSE filesystem which is mounted in the container:

masterchieflian:ctfngluster on /CTFN type fuse.glusterfs
(rw,relatime,user_id=0,group_id=0,allow_other,max_read=131072)

Since this is where the files that become inaccessible are, perhaps this
is really where the problem is, and not with the locking.tdb file?  I
will investigate file locks on the gluster system...

> Is it possible to try this without the containers?  That would
> certainly tell you if the problem is related to the container
> infrastructure...

I like to think everything is possible, but it's not really feasible in
this case.  Since there are only two physical servers, and they need to
be running AD, the only way to separate the containers now is with
additional machines to act as member servers.  And because everything
tested fine, and actually was fine for at least two weeks, these servers
have been in production for a few months now.  If I have to go this way,
it will certainly be a last resort...

Thanks again for your reply, will get back to you with what I find...

> peace & happiness,
> martin
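As a first step for the gluster investigation, the ping_pong utility that
ships with ctdb should show whether byte-range (fcntl) locking behaves
sanely on the FUSE mount at all.  A minimal sketch, assuming ping_pong is
available from the ctdb build; the path and node count below are just
this setup's:

  # run on node 1 first, with the lock count set to number of nodes + 1
  ping_pong /CTFN/ping_pong.dat 3
  # then start the same command on node 2 while the first is still running;
  # the locks/sec rate should drop but keep ticking -- if either side hangs,
  # locking on the cluster filesystem is the prime suspect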
Computerisms Corporation
2017-Nov-02 18:17 UTC
[Samba] ctdb vacuum timeouts and record locks
Hi,

This occurred again this morning.  When the user reported the problem, I
found in the ctdb logs that vacuuming had been going on since last night.
The need to fix it was urgent (when isn't it?) so I didn't have time to
poke around for clues, but immediately restarted the lxc container.  This
time it wouldn't restart, which I had time to trace to a hung smbd
process, and between that and a run of the debug_locks.sh script, I
traced it to the user reporting the problem.  Given that the user was
primarily having problems with files in a given folder, I am thinking
this is because of some kind of lock on a file within that folder.

Ended up rebooting both physical machines; problem solved, for now.

So, not sure how to determine if this is a gluster problem, an lxc
problem, or a ctdb/smbd problem.  Thoughts/suggestions are welcome...

On 2017-10-27 10:09 AM, Computerisms Corporation via samba wrote:
> [...]
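For reference, next time a single smbd wedges like this, something along
these lines should map the stuck PID back to a user and the files it is
holding (assuming smbstatus itself is not blocked on the locked tdb; the
PID and the /CTFN path are placeholders for this setup):

  smbstatus -p                       # list smbd processes with PID, user, machine
  smbstatus -L | grep <pid>          # byte-range locks held by that smbd
  ls -l /proc/<pid>/fd | grep /CTFN  # files it has open on the gluster mount
  cat /proc/<pid>/stack              # kernel-side stack (root only), handy if it is stuck in FUSE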
Computerisms Corporation
2017-Nov-02 19:17 UTC
[Samba] ctdb vacuum timeouts and record locks
Hm, I stand corrected on the "problem solved" statement below.  IP
addresses are simply not cooperating on the 2nd node.

root at vault1:~# ctdb ip
Public IPs on node 0
192.168.120.90 0
192.168.120.91 0
192.168.120.92 0
192.168.120.93 0

root at vault2:/service/ctdb/log/main# ctdb ip
Public IPs on node 1
192.168.120.90 0
192.168.120.91 0
192.168.120.92 0
192.168.120.93 0

root at vault2:/service/ctdb/log/main# ctdb moveip 192.168.120.90 1
Control TAKEOVER_IP failed, ret=-1
Failed to takeover IP on node 1

root at vault1:~# ctdb moveip 192.168.120.90 0
Memory allocation error

root at vault2:/service/ctdb/log/main# ctdb ipinfo 192.168.120.90
Public IP[192.168.120.90] info on node 1
IP:192.168.120.90
CurrentNode:0
NumInterfaces:1
Interface[1]: Name:eth0 Link:up References:0

Logs on vault2 (stays banned because it can't obtain IP):

IP 192.168.120.90 still hosted during release IP callback, failing
IP 192.168.120.92 still hosted during release IP callback, failing

root at vault1:~# ctdb delip 192.168.120.90
root at vault1:~# ctdb delip 192.168.120.92

root at vault2:/service/ctdb/log/main# ctdb addip 192.168.120.90/22 eth0
Node already knows about IP 192.168.120.90

root at vault2:/service/ctdb/log/main# ctdb ip
Public IPs on node 1
192.168.120.90 -1
192.168.120.91 0
192.168.120.92 -1
192.168.120.93 0

I am using the 10.external eventscript.  ip addr show shows the correct
IP addresses on eth0 in the lxc container.

Rebooted the physical machine; this node is buggered.  Shut it down, used
ip addr add to put the addresses on the other node, used ctdb addip, and
the node took it; node1 is now functioning with all 4 IPs just fine.  Or
so it appears right now.  Something is seriously schizophrenic here...

On 2017-11-02 11:17 AM, Computerisms Corporation via samba wrote:
> [...]
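The "still hosted during release IP callback" errors suggest ctdb's idea
of the public addresses and what the kernel actually has configured have
drifted apart.  A quick way to compare the two (standard ctdb/iproute2
commands; the address below is just one of this cluster's):

  # run on each node / in each container
  ctdb ip all                # where ctdb thinks each public address lives
  ip -4 addr show dev eth0   # what is really configured on the interface
  # if an address ctdb claims to have released is still present, removing it
  # by hand mimics what the releaseip event should have done:
  ip addr del 192.168.120.90/22 dev eth0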
On Thu, 2 Nov 2017 11:17:27 -0700, Computerisms Corporation via samba
<samba at lists.samba.org> wrote:

> [...]
>
> So, not sure how to determine if this is a gluster problem, an lxc
> problem, or a ctdb/smbd problem.  Thoughts/suggestions are welcome...

You need a stack trace of the stuck smbd process.  If it is wedged in a
system call on the cluster filesystem then you can blame the cluster
filesystem.  debug_locks.sh is meant to be able to get you the relevant
stack trace via gstack.

In fact, even before you get the stack trace you could check a process
listing to see if the process is stuck in D state.

gstack basically does:

  gdb -batch -ex "thread apply all bt" -p <pid>

For a single-threaded process it leaves out "thread apply all".  However,
in recent GDB I'm not sure it makes a difference... it seems to work for
me on Linux.

Note that gstack/gdb will hang when run against a process in D state.

peace & happiness,
martin
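Putting that together, a minimal first check next time an smbd hangs
might look something like the following.  The ps/awk line is only an
illustrative sketch for spotting D-state processes; the gdb line is the
one quoted above, and <pid> is a placeholder:

  # spot smbd processes in uninterruptible sleep (state starts with D)
  ps -eo pid,stat,wchan:30,cmd | grep smbd | awk '$2 ~ /^D/'
  # for a process that is NOT in D state, grab a stack trace of all threads
  gdb -batch -ex "thread apply all bt" -p <pid>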