I can't see how any of those would impact carp unless pf is now
incorrectly blocking carp packets, which seems unlikely from that commit.

Questions:

 * Are you running a firewall?
 * What does sysctl net.inet.carp report?
 * What exactly does ifconfig report about your carp on both hosts?
 * Have you tried enabling more detailed carp logging using sysctl
   net.inet.carp.log?

Regards
Steve

On 16/01/2019 14:31, Thomas Steen Rasmussen wrote:
> On 1/16/19 3:14 PM, Pete French wrote:
>> I just upgraded my pair of firewalls from 11 to 12, and am now in the
>> situation where CARP no longer works between them to fail over the
>> virtual addresses. Both machines come up thinking that they
>> are the master. If I manually set the advskew on the interfaces to
>> a high number on what should be passive then it briefly goes to backup
>> mode, but then goes back to master with the message:
>>
>>     BACKUP -> MASTER (preempting a slower master)
>>
>> This is kind of a big problem!
>
> Indeed. I am seeing the same thing. Which revision of 12 are you running?
>
> I am currently (yesterday and today) bisecting revisions to find the
> commit which broke this, because it worked in 12-BETA2 but doesn't
> work on latest 12-STABLE.
>
> I have narrowed it down to somewhere between 12-STABLE-342037 which
> works, and 12-STABLE-342055 which does not.
>
> Only 4 commits touch 12-STABLE branch in that range:
>
> ------------------------------------------------------------------------
> r342038 | eugen | 2018-12-13 10:52:40 +0000 (Thu, 13 Dec 2018) | 5 lines
>
> MFC r340394: ipfw.8: Fix part of the SYNOPSIS documenting
> LIST OF RULES AND PREPROCESSING that is still referred
> as last section of the SYNOPSIS later but was erroneously situated
> in the section IN-KERNEL NAT.
>
> ------------------------------------------------------------------------
> r342047 | markj | 2018-12-13 15:51:07 +0000 (Thu, 13 Dec 2018) | 3 lines
>
> MFC r341638:
> Let kern.trap_enotcap be set as a tunable.
>
> ------------------------------------------------------------------------
> r342048 | markj | 2018-12-13 16:07:35 +0000 (Thu, 13 Dec 2018) | 3 lines
>
> MFC r340405:
> Add accounting to per-domain UMA full bucket caches.
>
> ------------------------------------------------------------------------
> r342051 | kp | 2018-12-13 20:00:11 +0000 (Thu, 13 Dec 2018) | 20 lines
>
> pfsync: Performance improvement
>
> pfsync code is called for every new state, state update and state
> deletion in pf. While pf itself can operate on multiple states at the
> same time (on different cores, assuming the states hash to a different
> hashrow), pfsync only had a single lock.
> This greatly reduced throughput on multicore systems.
>
> Address this by splitting the pfsync queues into buckets, based on the
> state id. This ensures that updates for a given connection always end up
> in the same bucket, which allows pfsync to still collapse multiple
> updates into one, while allowing multiple cores to proceed at the same
> time.
>
> The number of buckets is tunable, but defaults to 2 x number of cpus.
> Benchmarking has shown improvement, depending on hardware and setup,
> from ~30% to ~100%.
>
> Sponsored by:   Orange Business Services
>
> ------------------------------------------------------------------------
>
> Of these I thought r342051 sounded most likely, so I am currently
> building r342050.
>
> I will write again in a few hours when I have isolated the commit.
>
> Best regards,
>
> Thomas Steen Rasmussen

> I can't see how any of those would impact carp unless pf is now
> incorrectly blocking carp packets, which seems unlikely from that commit.

Just looking at the code it does seem unlikely, true - but my working
system does not run pf+pfsync and the non-working one does, so it is
suspiciously in the right "place". If Thomas can bisect it and show it
works before but not after then it has to be in there somewhere I guess.

The dmesg "(preempting a slower master)" also makes me think that it
is receiving carp packets - though I haven't checked the code to see if it
produces that if it can't see any other masters at all.

> Questions:
>
>  * Are you running a firewall?

Yes, pf. The boxes are basically our external firewall/router. I also
run a load balancer on them - relayd before, but now haproxy after
yesterday's thread on here.

>  * What does sysctl net.inet.carp report?

$ sysctl net.inet.carp
net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.demotion: -240
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow: 1

>  * What exactly does ifconfig report about your carp on both hosts?

I only have carp enabled on one host for now, to prevent the downtime, but
ifconfig on the master is below. I am currently running with a separate
vhid for each address. I normally run with a separate vhid for each network
and address family though - i.e. 4 - but there's no difference in the
behaviour.

em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=81249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER>
        ether 00:25:90:31:bf:a2
        inet 10.32.10.1 netmask 0xffff0000 broadcast 10.32.255.255
        inet 10.32.10.6 netmask 0xffff0000 broadcast 10.32.255.255 vhid 1
        inet6 fe80::225:90ff:fe31:bfa2%em0 prefixlen 64 scopeid 0x1
        inet6 2a02:1658:1:2:e550::1 prefixlen 64
        inet6 2a02:1658:1:2:e550::6 prefixlen 64 vhid 2
        carp: MASTER vhid 1 advbase 1 advskew 10
        carp: MASTER vhid 2 advbase 1 advskew 10
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
em1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=81249b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LRO,WOL_MAGIC,VLAN_HWFILTER>
        ether 00:25:90:31:bf:a3
        inet 178.250.73.196 netmask 0xffffffc0 broadcast 178.250.73.255
        inet 178.250.73.198 netmask 0xffffffc0 broadcast 178.250.73.255 vhid 3
        inet 178.250.73.199 netmask 0xffffffc0 broadcast 178.250.73.255 vhid 5
        inet 178.250.73.200 netmask 0xffffffc0 broadcast 178.250.73.255 vhid 6
        inet 178.250.73.221 netmask 0xffffffc0 broadcast 178.250.73.255 vhid 7
        inet6 fe80::225:90ff:fe31:bfa3%em1 prefixlen 64 scopeid 0x2
        inet6 2a02:1658:1:1::1:2 prefixlen 64
        inet6 2a02:1658:1:1::1:1 prefixlen 64 vhid 4
        carp: MASTER vhid 3 advbase 1 advskew 10
        carp: MASTER vhid 5 advbase 1 advskew 10
        carp: MASTER vhid 6 advbase 1 advskew 10
        carp: MASTER vhid 7 advbase 1 advskew 10
        carp: MASTER vhid 4 advbase 1 advskew 10
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=0<> metric 0 mtu 33160
        groups: pflog
pfsync0: flags=41<UP,RUNNING> metric 0 mtu 1500
        pfsync: syncdev: em0 syncpeer: 10.32.10.2 maxupd: 128 defer: off
        groups: pfsync

>  * Have you tried enabling more detailed carp logging using sysctl
>    net.inet.carp.log?

I didn't have time unfortunately - at the point where all the alerts went
off and all of the systems were offline I just did what I needed to
in order to get it working again (i.e. shut down the passive side). This is
our main production firewall pair, so any downtime causes lots of problems
and we can't make any sales.

Is there anything in the above which looks fishy to you though?

-pete.

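For comparing ifconfig output from both firewalls once CARP is enabled on each, a small parser can flag any vhid that both hosts claim as MASTER. This is only a sketch: the function names are made up for illustration, it assumes both hosts use the same interface names, and the HOST_B sample is hypothetical (only one host's output was posted above).

```python
import re

# Matches status lines such as "carp: MASTER vhid 1 advbase 1 advskew 10".
CARP_LINE = re.compile(
    r"carp:\s+(?P<state>\w+)\s+vhid\s+(?P<vhid>\d+)"
    r"\s+advbase\s+(?P<advbase>\d+)\s+advskew\s+(?P<advskew>\d+)"
)


def carp_states(ifconfig_output):
    """Return {(interface, vhid): state} parsed from ifconfig text."""
    states = {}
    interface = None
    for line in ifconfig_output.splitlines():
        # Interface headers start in column 0, e.g. "em0: flags=...".
        header = re.match(r"^(\S+): flags=", line)
        if header:
            interface = header.group(1)
            continue
        match = CARP_LINE.search(line)
        if match:
            key = (interface, int(match.group("vhid")))
            states[key] = match.group("state")
    return states


def dual_masters(host_a, host_b):
    """List (interface, vhid) pairs that both hosts report as MASTER."""
    a, b = carp_states(host_a), carp_states(host_b)
    return sorted(k for k in a if a[k] == "MASTER" and b.get(k) == "MASTER")


# Fragment of the em0 output posted above.
HOST_A = (
    "em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tcarp: MASTER vhid 1 advbase 1 advskew 10\n"
    "\tcarp: MASTER vhid 2 advbase 1 advskew 10\n"
)
# Hypothetical peer output, invented for the example.
HOST_B = (
    "em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\tcarp: MASTER vhid 1 advbase 1 advskew 100\n"
    "\tcarp: BACKUP vhid 2 advbase 1 advskew 100\n"
)

print(dual_masters(HOST_A, HOST_B))  # [('em0', 1)]
```

An empty list means no vhid is claimed by both sides; any entry is a split-brain candidate worth checking with tcpdump on that interface.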
Thomas Steen Rasmussen
2019-Jan-16 16:39 UTC
CARP stopped working after upgrade from 11 to 12
On 1/16/19 3:53 PM, Steven Hartland wrote:

I have confirmed that pfsync is the culprit. Read on for details.

> I can't see how any of those would impact carp unless pf is now
> incorrectly blocking carp packets, which seems unlikely from that commit.
>

Well I would agree, but nevertheless, here we are.

> Questions:
>
>  * Are you running a firewall?

Yes, pf, but it permits CARP packets, and MASTER/SLAVE works well up to
and including r342050. Rebuild to r342051 with the exact same
configuration and now both nodes are MASTER.

>  * What does sysctl net.inet.carp report?

net.inet.carp.ifdown_demotion_factor: 240
net.inet.carp.senderr_demotion_factor: 240
net.inet.carp.demotion: 0
net.inet.carp.log: 1
net.inet.carp.preempt: 1
net.inet.carp.dscp: 56
net.inet.carp.allow: 1

>  * What exactly does ifconfig report about your carp on both hosts?

with 12-STABLE r342050:

[tykling@fwclu2a ~]$ uname -a
FreeBSD fwclu2a 12.0-STABLE FreeBSD 12.0-STABLE r342050 GENERIC amd64
[tykling@fwclu2a ~]$ ifconfig | grep carp
        carp: MASTER vhid 1 advbase 1 advskew 100
        carp: MASTER vhid 1 advbase 1 advskew 100
        carp: MASTER vhid 1 advbase 1 advskew 100
[tykling@fwclu2a ~]$

[tykling@fwclu2b ~]$ uname -a
FreeBSD fwclu2b 12.0-STABLE FreeBSD 12.0-STABLE r342050 GENERIC amd64
[tykling@fwclu2b ~]$ ifconfig | grep carp
        carp: BACKUP vhid 1 advbase 1 advskew 200
        carp: BACKUP vhid 1 advbase 1 advskew 200
        carp: BACKUP vhid 1 advbase 1 advskew 200
[tykling@fwclu2b ~]$

and with 12-STABLE r342051:

[tykling@fwclu2a ~]$ uname -a
FreeBSD fwclu2a 12.0-STABLE FreeBSD 12.0-STABLE r342051 GENERIC amd64
[tykling@fwclu2a ~]$ ifconfig | grep carp
        carp: MASTER vhid 1 advbase 1 advskew 100
        carp: MASTER vhid 1 advbase 1 advskew 100
        carp: MASTER vhid 1 advbase 1 advskew 100
[tykling@fwclu2a ~]$

[tykling@fwclu2b ~]$ uname -a
FreeBSD fwclu2b 12.0-STABLE FreeBSD 12.0-STABLE r342051 GENERIC amd64
[tykling@fwclu2b ~]$ ifconfig | grep carp
        carp: MASTER vhid 1 advbase 1 advskew 200
        carp: MASTER vhid 1 advbase 1 advskew 200
        carp: MASTER vhid 1 advbase 1 advskew 200
[tykling@fwclu2b ~]$

>  * Have you tried enabling more detailed carp logging using sysctl
>    net.inet.carp.log?
>

It is at 1 and increasing it to 2 doesn't appear to log anything new.

I tried disabling pfsync and rebooting both nodes, they came up as
MASTER/SLAVE then. Then I tried enabling pfsync and starting it, and on
the SLAVE node I immediately got:

Jan 16 16:34:56 fwclu2b kernel: carp: demoted by -240 to -240 (pfsync bulk done)
Jan 16 16:34:56 fwclu2b kernel: carp: 1@lagg2.52: BACKUP -> MASTER (preempting a slower master)
Jan 16 16:34:56 fwclu2b kernel: carp: 1@lagg2.51: BACKUP -> MASTER (preempting a slower master)
Jan 16 16:34:56 fwclu2b kernel: carp: 1@lagg3: BACKUP -> MASTER (preempting a slower master)

Stopping pfsync again does not make it go back to SLAVE.

Best regards,

Thomas Steen Rasmussen
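
The "demoted by -240 to -240" line is worth a second look. carp(4) describes net.inet.carp.demotion as a value added to the advskew sent in announcements, and advskew as a skew in 1/256 of a second on top of advbase. Under that reading, a negative demotion counter would make the backup advertise as if it were faster than the master. The sketch below is a simplified model only: the function names are hypothetical, and the clamp to the range 0..240 is an assumption that has not been checked against the 12-STABLE source. It illustrates the arithmetic and does not identify the actual bug.

```python
# Assumed upper bound for the skew sent on the wire (not verified here).
CARP_MAXSKEW = 240


def effective_advskew(advskew, demotion):
    """Configured advskew plus the global demotion counter, clamped.

    The clamp to 0..CARP_MAXSKEW is an assumption of this sketch.
    """
    return max(0, min(CARP_MAXSKEW, advskew + demotion))


def advert_interval(advbase, advskew, demotion=0):
    """Seconds between advertisements: advbase + skew / 256."""
    return advbase + effective_advskew(advskew, demotion) / 256


# Values taken from the ifconfig and log output above.
master = advert_interval(1, 100, demotion=0)         # fwclu2a
backup_normal = advert_interval(1, 200, demotion=0)  # fwclu2b, no demotion
backup_negative = advert_interval(1, 200, demotion=-240)

print(master, backup_normal, backup_negative)
# 1.390625 1.78125 1.0
```

With demotion at 0 the backup advertises more slowly than the master, as intended. With demotion at -240 the modelled interval drops below the master's, which would be consistent with the "preempting a slower master" message when net.inet.carp.preempt is 1.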