Computerisms Corporation
2017-Nov-02 19:17 UTC
[Samba] ctdb vacuum timeouts and record locks
Hm, I stand corrected on the "problem solved" statement below. IP addresses
are simply not cooperating on the 2nd node.

root at vault1:~# ctdb ip
Public IPs on node 0
192.168.120.90 0
192.168.120.91 0
192.168.120.92 0
192.168.120.93 0

root at vault2:/service/ctdb/log/main# ctdb ip
Public IPs on node 1
192.168.120.90 0
192.168.120.91 0
192.168.120.92 0
192.168.120.93 0

root at vault2:/service/ctdb/log/main# ctdb moveip 192.168.120.90 1
Control TAKEOVER_IP failed, ret=-1
Failed to takeover IP on node 1

root at vault1:~# ctdb moveip 192.168.120.90 0
Memory allocation error

root at vault2:/service/ctdb/log/main# ctdb ipinfo 192.168.120.90
Public IP[192.168.120.90] info on node 1
IP:192.168.120.90
CurrentNode:0
NumInterfaces:1
Interface[1]: Name:eth0 Link:up References:0

Logs on vault2 (stays banned because it can't obtain IP):
IP 192.168.120.90 still hosted during release IP callback, failing
IP 192.168.120.92 still hosted during release IP callback, failing

root at vault1:~# ctdb delip 192.168.120.90
root at vault1:~# ctdb delip 192.168.120.92
root at vault2:/service/ctdb/log/main# ctdb addip 192.168.120.90/22 eth0
Node already knows about IP 192.168.120.90
root at vault2:/service/ctdb/log/main# ctdb ip
Public IPs on node 1
192.168.120.90 -1
192.168.120.91 0
192.168.120.92 -1
192.168.120.93 0

I am using 10.external. ip addr show shows the correct IP addresses on
eth0 in the lxc container. I rebooted the physical machine; this node is
buggered. I shut it down, used ip addr add to put the addresses on the
other node, used ctdb addip, and the node took them; node 1 is now
functioning with all 4 IPs just fine. Or so it appears right now.

something is seriously schizophrenic here...

On 2017-11-02 11:17 AM, Computerisms Corporation via samba wrote:
> Hi,
>
> This occurred again this morning. When the user reported the problem, I
> found in the ctdb logs that vacuuming had been going on since last
> night. The need to fix it was urgent (when isn't it?) so I didn't have
> time to poke around for clues, but immediately restarted the lxc
> container. But this time it wouldn't restart, which I had time to trace
> to a hung smbd process, and between that and a run of the
> debug_locks.sh script, I traced it to the user reporting the problem.
> Given that the user was primarily having problems with files in a given
> folder, I am thinking this is because of some kind of lock on a file
> within that folder.
>
> Ended up rebooting both physical machines; problem solved, for now.
>
> So, not sure how to determine if this is a gluster problem, an lxc
> problem, or a ctdb/smbd problem. Thoughts/suggestions are welcome...
>
> On 2017-10-27 10:09 AM, Computerisms Corporation via samba wrote:
>> Hi Martin,
>>
>> Thanks for reading and taking the time to reply.
>>
>>>> ctdbd[89]: Unable to get RECORD lock on database locking.tdb for 20
>>>> seconds
>>>> /usr/local/samba/etc/ctdb/debug_locks.sh: 142:
>>>> /usr/local/samba/etc/ctdb/debug_locks.sh: cannot create : Directory
>>>> nonexistent
>>>> sh: echo: I/O error
>>>> sh: echo: I/O error
>>>
>>> That's weird. The only file really created by that script is the
>>> lock file that is used to make sure we don't debug locks too many
>>> times. That should be in:
>>>
>>> "${CTDB_SCRIPT_VARDIR}/debug_locks.lock"
>>
>> Next time it happens I will check this.
>>
>>> The other possibility is the use of the script_log() function to try
>>> to get the output logged. script_log() isn't my greatest moment.
>>> When debugging you could just replace it with the logger command to
>>> get the output out to syslog.
>>
>> Okay, that sounds useful, will see what I can do next time I see the
>> problem...
>>
>>>> My setup is two servers, the OS is debian and is running samba AD on
>>>> dedicated SSDs, and each server has a RAID array of HDDs for
>>>> storage, with a mirrored GlusterFS running on top of them. Each OS
>>>> has an LXC container running the clustered member servers with the
>>>> GlusterFS mounted to the containers. The tdb files are in the
>>>> containers, not on the shared storage. I do not use ctdb to start
>>>> smbd/nmbd. I can't think what else is relevant about my setup as it
>>>> pertains to this issue...
>>>
>>> Are the TDB files really on a FUSE filesystem? Is that an artifact
>>> of the LXC containers? If so, could it be that locking isn't
>>> reliable on the FUSE filesystem?
>>
>> No. The TDB files are in the container, and the container is on the
>> SSD with the OS. Running mount from within the container shows:
>>
>> /dev/sda1 on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)
>>
>> However, the gluster native client is a fuse-based system, so the data
>> is stored on a fuse filesystem which is mounted in the container:
>>
>> masterchieflian:ctfngluster on /CTFN type fuse.glusterfs
>> (rw,relatime,user_id=0,group_id=0,allow_other,max_read=131072)
>>
>> Since this is where the files that become inaccessible are, perhaps
>> this is really where the problem is, and not with the locking.tdb
>> file? I will investigate file locks on the gluster system...
>>
>>> Is it possible to try this without the containers? That would
>>> certainly tell you if the problem is related to the container
>>> infrastructure...
>>
>> I like to think everything is possible, but it's not really feasible
>> in this case. Since there are only two physical servers, and they
>> need to be running AD, the only way to separate the containers now is
>> with additional machines to act as member servers. And because
>> everything tested fine and actually was fine for at least two weeks,
>> these servers are in production now and have been for a few months.
>> If I have to go this way, it will certainly be a last resort...
>>
>> Thanks again for your reply, will get back to you with what I find...
>>
>>> peace & happiness,
>>> martin
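For reference, the manual takeover described above amounts to roughly the
following sequence, run on the surviving node. This is a sketch
reconstructed from the steps described in the post; the addresses, the
/22 netmask, and the eth0 interface name come from the transcript and
will differ in other setups:

    # assign the stranded addresses to this node's interface by hand
    ip addr add 192.168.120.90/22 dev eth0
    ip addr add 192.168.120.92/22 dev eth0

    # tell ctdb that this node now hosts them
    ctdb addip 192.168.120.90/22 eth0
    ctdb addip 192.168.120.92/22 eth0

    # verify that ctdb's view agrees with the kernel's
    ctdb ip
    ctdb ipinfo 192.168.120.90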
On Thu, 2 Nov 2017 12:17:56 -0700, Computerisms Corporation via samba
<samba at lists.samba.org> wrote:

> Hm, I stand corrected on the "problem solved" statement below. IP
> addresses are simply not cooperating on the 2nd node.
>
> root at vault1:~# ctdb ip
> Public IPs on node 0
> 192.168.120.90 0
> 192.168.120.91 0
> 192.168.120.92 0
> 192.168.120.93 0
>
> root at vault2:/service/ctdb/log/main# ctdb ip
> Public IPs on node 1
> 192.168.120.90 0
> 192.168.120.91 0
> 192.168.120.92 0
> 192.168.120.93 0
>
> root at vault2:/service/ctdb/log/main# ctdb moveip 192.168.120.90 1
> Control TAKEOVER_IP failed, ret=-1
> Failed to takeover IP on node 1
>
> root at vault1:~# ctdb moveip 192.168.120.90 0
> Memory allocation error
>
> root at vault2:/service/ctdb/log/main# ctdb ipinfo 192.168.120.90
> Public IP[192.168.120.90] info on node 1
> IP:192.168.120.90
> CurrentNode:0
> NumInterfaces:1
> Interface[1]: Name:eth0 Link:up References:0
>
> Logs on vault2 (stays banned because it can't obtain IP):
> IP 192.168.120.90 still hosted during release IP callback, failing
> IP 192.168.120.92 still hosted during release IP callback, failing
>
> root at vault1:~# ctdb delip 192.168.120.90
> root at vault1:~# ctdb delip 192.168.120.92
> root at vault2:/service/ctdb/log/main# ctdb addip 192.168.120.90/22 eth0
> Node already knows about IP 192.168.120.90
> root at vault2:/service/ctdb/log/main# ctdb ip
> Public IPs on node 1
> 192.168.120.90 -1
> 192.168.120.91 0
> 192.168.120.92 -1
> 192.168.120.93 0
>
> I am using 10.external. ip addr show shows the correct IP addresses on
> eth0 in the lxc container. I rebooted the physical machine; this node
> is buggered. I shut it down, used ip addr add to put the addresses on
> the other node, used ctdb addip, and the node took them; node 1 is now
> functioning with all 4 IPs just fine. Or so it appears right now.
>
> something is seriously schizophrenic here...

I'm wondering why you're using 10.external. Although we have tested it,
we haven't actually seen it used in production before! 10.external is a
hack to allow use of CTDB's connection tracking while managing the
public IP addresses externally. That is, you tell CTDB about the public
IPs and use "ctdb moveip" to inform CTDB about moved public IPs, and it
sends gratuitous ARPs and tickle ACKs on the takeover node. It doesn't
actually assign the public IP addresses to nodes.

The documentation might not be clear on this, but if you're using
10.external then you need to have the DisableIPFailover tunable set to 1
on all nodes so that CTDB doesn't try to move the IPs itself.

Please let us know if the documentation could be improved...

peace & happiness,
martin
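In concrete terms, that tunable can be set persistently with a CTDB_SET_
line in ctdbd.conf, or inspected and toggled at runtime with getvar and
setvar. A sketch follows; the /usr/local/samba path is an assumption
based on the install prefix seen earlier in this thread, and will differ
on packaged installs:

    # in ctdbd.conf (here presumably
    # /usr/local/samba/etc/ctdb/ctdbd.conf), on every node,
    # so that ctdb never tries to assign public IPs itself:
    CTDB_SET_DisableIPFailover=1

    # runtime check and (non-persistent) change on a running node:
    ctdb getvar DisableIPFailover
    ctdb setvar DisableIPFailover 1

Note that a setvar change does not survive a restart; the ctdbd.conf
setting is what guarantees the tunable on every start.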
Computerisms Corporation
2017-Nov-08 01:05 UTC
[Samba] ctdb vacuum timeouts and record locks
Hi Martin,

Thanks for your answer...

>> I am using 10.external. ip addr show shows the correct IP addresses
>> on eth0 in the lxc container. I rebooted the physical machine; this
>> node is buggered. I shut it down, used ip addr add to put the
>> addresses on the other node, used ctdb addip, and the node took them;
>> node 1 is now functioning with all 4 IPs just fine. Or so it appears
>> right now.
>>
>> something is seriously schizophrenic here...
>
> I'm wondering why you're using 10.external. Although we have tested
> it, we haven't actually seen it used in production before! 10.external
> is a hack to allow use of CTDB's connection tracking while managing
> the public IP addresses externally. That is, you tell CTDB about the
> public IPs and use "ctdb moveip" to inform CTDB about moved public
> IPs, and it sends gratuitous ARPs and tickle ACKs on the takeover
> node. It doesn't actually assign the public IP addresses to nodes.

Hm, okay, I was already clear that with 10.external it is a human's
responsibility to assign IPs to physical interfaces. In re-reading the
docs, I see that moveip only works with DeterministicIPs set to 0 and
NoIPFailback set to 1, and I am not sure those are set. I will check at
the next opportunity; if they aren't, that might explain the behaviour.
However, the IPs were correctly assigned using the ip command.

The reason I am using 10.external is that when I initially set up my
cluster test environment, none of ctdb's automatic networking
assignments worked: ip addr show wouldn't display the addresses as
being assigned to the interface. I never did get to the bottom of that
problem. I had thought perhaps the lxc container was the issue, but I
don't know why it would be; the ip commands all seem to work fine from
the cli. While I was trying to find my way around that, I found
10.external, and found that by adjusting my start scripts to include
the appropriate ip addr add commands, it worked fine. In my test
environment I played with the ctdb addip/delip/moveip commands and
with manually assigning the addresses, and it all worked fine. If I
turned off a node, I could uncomment a couple of lines in the start
script on the other node, restart, and everything moved to where it was
supposed to be. But not all things have worked in production as they
did in my testing environment, and it doesn't always seem to work the
same in production from one time to the next, for that matter...

> The documentation might not be clear on this, but if you're using
> 10.external then you need to have the DisableIPFailover tunable set to
> 1 on all nodes so that CTDB doesn't try to move the IPs itself.

I do have DisableIPFailover set. From the documentation, I am under the
impression that if I do ctdb delip on one node and ctdb addip on the
other node, and make sure the other node shows the correct additional
IPs assigned to the physical interface using the ip addr show command,
that should move an IP from one node to the other. But when I do this,
I will frequently still see messages like "<ip> still hosted during
release IP callback" or "failed to release <ip>" in the logs. Sometimes
on startup I will see log entries like "<ip> incorrectly on an
interface" when ip addr show shows the address is correctly on an
interface and ctdb ipinfo shows that the IP is assigned to the node.
Does this mean these commands are not working, or could it be that
10.external doesn't do the magic in these cases?

> Please let us know if the documentation could be improved...

Often documentation isn't straightforward until you have had some
experience and gained some of the context that those who wrote it have.
I am not sure about improving the documentation, but I can say I
learned significantly more about how to set things up, what to expect,
and what procedures to perform by reading mailing list posts than I did
by reading the manuals or the wiki...

> peace & happiness,
> martin
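One way to narrow down the mismatch described above is to compare the
kernel's view and ctdb's view of each address on each node, along with
the tunables that 10.external depends on. This is only a sketch, using
an address and interface from this thread:

    # the kernel's view of the interface on this node
    ip addr show dev eth0

    # ctdb's view of a specific address; CurrentNode should match
    # the pnn that "ctdb pnn" prints on the node holding the address
    ctdb ipinfo 192.168.120.90
    ctdb pnn

    # confirm the relevant tunables on every node
    ctdb getvar DisableIPFailover
    ctdb getvar DeterministicIPs
    ctdb getvar NoIPFailback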