Thomas Steen Rasmussen
2019-Jan-16 14:31 UTC
CARP stopped working after upgrade from 11 to 12
On 1/16/19 3:14 PM, Pete French wrote:
> I just upgraded my pair of firewalls from 11 to 12, and am now in the
> situation where CARP no longer works between them to failover the
> virtual addresses. Both machines come up thinking that they
> are the master. If I manually set the advskew on the interfaces to
> a high number on what should be passive then it briefly goes to backup
> mode, but then goes back to master with the message:
>
>     BACKUP -> MASTER (preempting a slower master)
>
> This is kind of a big problem!

Indeed. I am seeing the same thing. Which revision of 12 are you running?

I am currently (yesterday and today) bisecting revisions to find the
commit which broke this, because it worked in 12-BETA2 but doesn't
work on latest 12-STABLE.

I have narrowed it down to somewhere between 12-STABLE-342037, which
works, and 12-STABLE-342055, which does not.

Only 4 commits touch the 12-STABLE branch in that range:

------------------------------------------------------------------------
r342038 | eugen | 2018-12-13 10:52:40 +0000 (Thu, 13 Dec 2018) | 5 lines

MFC r340394: ipfw.8: Fix part of the SYNOPSIS documenting
LIST OF RULES AND PREPROCESSING that is still referred
as last section of the SYNOPSIS later but was erroneously situated
in the section IN-KERNEL NAT.

------------------------------------------------------------------------
r342047 | markj | 2018-12-13 15:51:07 +0000 (Thu, 13 Dec 2018) | 3 lines

MFC r341638:
Let kern.trap_enotcap be set as a tunable.

------------------------------------------------------------------------
r342048 | markj | 2018-12-13 16:07:35 +0000 (Thu, 13 Dec 2018) | 3 lines

MFC r340405:
Add accounting to per-domain UMA full bucket caches.

------------------------------------------------------------------------
r342051 | kp | 2018-12-13 20:00:11 +0000 (Thu, 13 Dec 2018) | 20 lines

pfsync: Performance improvement

pfsync code is called for every new state, state update and state
deletion in pf. While pf itself can operate on multiple states at the
same time (on different cores, assuming the states hash to a different
hashrow), pfsync only had a single lock.
This greatly reduced throughput on multicore systems.

Address this by splitting the pfsync queues into buckets, based on the
state id. This ensures that updates for a given connection always end up
in the same bucket, which allows pfsync to still collapse multiple
updates into one, while allowing multiple cores to proceed at the same
time.

The number of buckets is tunable, but defaults to 2 x number of cpus.
Benchmarking has shown improvement, depending on hardware and setup,
from ~30% to ~100%.

Sponsored by:   Orange Business Services

------------------------------------------------------------------------

Of these I thought r342051 sounded most likely, so I am currently
building r342050.

I will write again in a few hours when I have isolated the commit.

Best regards,

Thomas Steen Rasmussen
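The bisection described in this message can be pictured as a binary search over the revisions that touch the branch. What follows is only a sketch under stated assumptions: `first_bad_revision` and the `is_broken` callback are hypothetical names, and the callback stands in for the manual step of building a revision and checking whether CARP fails over correctly.

```python
from typing import Callable, Sequence


def first_bad_revision(candidates: Sequence[int],
                       is_broken: Callable[[int], bool]) -> int:
    """Return the first revision in `candidates` for which is_broken() is true.

    Assumes `candidates` is sorted ascending, that the tree before the first
    candidate is known good, that the last candidate is known bad, and that
    every revision after the culprit stays broken.
    """
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(candidates[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is after mid
    return candidates[lo]


# The four commits listed in the message.
CANDIDATES = [342038, 342047, 342048, 342051]

if __name__ == "__main__":
    # Hypothetical outcome for illustration only: pretend the breakage
    # starts at r342051. The thread has not established this.
    print(first_bad_revision(CANDIDATES, lambda rev: rev >= 342051))
```

Since only those four commits touch the branch in the range, building r342050 tests the branch as it stood after r342048 and before r342051, so a working result there would point at r342051.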
I can't see how any of those would impact carp unless pf is now
incorrectly blocking carp packets, which seems unlikely from that commit.

Questions:

* Are you running a firewall?
* What does sysctl net.inet.carp report?
* What exactly does ifconfig report about your carp on both hosts?
* Have you tried enabling more detailed carp logging using
  sysctl net.inet.carp.log?

    Regards
    Steve
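When comparing what ifconfig reports on both hosts, it can help to extract just the CARP line per vhid and check for the "both are master" condition. The sketch below assumes the usual FreeBSD ifconfig line of the form `carp: MASTER vhid 1 advbase 1 advskew 0`; the function names and the sample output are hypothetical and are not taken from the hosts in this thread.

```python
import re
from typing import Dict, List

# Matches the per-vhid line that FreeBSD ifconfig prints for a CARP address,
# e.g. "carp: MASTER vhid 1 advbase 1 advskew 0".
CARP_LINE = re.compile(
    r"carp:\s+(?P<state>\w+)\s+vhid\s+(?P<vhid>\d+)"
    r"\s+advbase\s+(?P<advbase>\d+)\s+advskew\s+(?P<advskew>\d+)"
)


def carp_states(ifconfig_output: str) -> Dict[int, str]:
    """Map each vhid found in the ifconfig text to its reported state."""
    return {int(m.group("vhid")): m.group("state")
            for m in CARP_LINE.finditer(ifconfig_output)}


def double_masters(host_a: str, host_b: str) -> List[int]:
    """Return the vhids that both hosts report as MASTER."""
    a, b = carp_states(host_a), carp_states(host_b)
    return sorted(v for v in a.keys() & b.keys()
                  if a[v] == "MASTER" and b[v] == "MASTER")


# Illustrative sample output only (not from the thread).
SAMPLE_A = "\tcarp: MASTER vhid 1 advbase 1 advskew 0\n"
SAMPLE_B = ("\tcarp: MASTER vhid 1 advbase 1 advskew 100\n"
            "\tcarp: BACKUP vhid 2 advbase 1 advskew 100\n")

if __name__ == "__main__":
    print(double_masters(SAMPLE_A, SAMPLE_B))
```

Feeding it the saved ifconfig output from each firewall would list any vhid for which both machines claim to be master.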
> Indeed. I am seeing the same thing. Which revision of 12 are you running?

Ah, now that is very interesting - I wasn't expecting a reply so fast! I am
running r342847 - note, though, that this is also the version I am running
on the two test systems which do work.

> I am currently (yesterday and today) bisecting revisions to find the
> commit which broke this, because it worked in 12-BETA2 but doesn't work
> on latest 12-STABLE.

Well done, that takes a lot of effort to do. Thank you for doing this.

> MFC r340394: ipfw.8: Fix part of the SYNOPSIS documenting
> LIST OF RULES AND PREPROCESSING that is still referred
> as last section of the SYNOPSIS later but was erroneously situated
> in the section IN-KERNEL NAT.

Docs only, so can't be this one I think.

> MFC r341638:
> Let kern.trap_enotcap be set as a tunable.

Also can't be this one from eyeballing the code. It simply makes it writeable.

> MFC r340405:
> Add accounting to per-domain UMA full bucket caches.

This is not touching networking, so seems unlikely, though it is an actual
significant code change.

> r342051 | kp | 2018-12-13 20:00:11 +0000 (Thu, 13 Dec 2018) | 20 lines
>
> pfsync: Performance improvement

ahh..... now, this is where things look likely, as the difference between my
test machines and my live machines is that the live machines are using pf
with pfsync enabled.

> Of these I thought r342051 sounded most likely, so I am currently
> building r342050.

That's my feeling too going through the above. Have you also tried simply
disabling pfsync to see if CARP returns to normal? I could live without
pfsync to be honest, if that's what it takes to make this work.

> I will write again in a few hours when I have isolated the commit.

Thank you again for putting in the effort to bisect this - if you can isolate
it then we can back the change out and try it again and see if that helps.

-pete.
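For anyone following along, the bucketing that the r342051 commit message describes can be pictured with a small sketch. This is not the kernel code: the modulo mapping, the names `bucket_for` and `queue_updates`, and the CPU count are assumptions chosen for illustration. Only the "2 x number of cpus" default comes from the commit message.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

NCPUS = 4                      # assumed CPU count for the example
NBUCKETS = 2 * NCPUS           # default stated in the commit message


def bucket_for(state_id: int, nbuckets: int = NBUCKETS) -> int:
    """Pick a queue bucket from the state id (simple modulo, an assumption)."""
    return state_id % nbuckets


def queue_updates(
    updates: List[Tuple[int, str]]
) -> DefaultDict[int, Dict[int, str]]:
    """Queue (state_id, payload) updates per bucket.

    Each bucket keeps only the latest payload per state id, modelling how
    repeated updates for one connection can be collapsed into one.
    """
    buckets: DefaultDict[int, Dict[int, str]] = defaultdict(dict)
    for state_id, payload in updates:
        buckets[bucket_for(state_id)][state_id] = payload
    return buckets


if __name__ == "__main__":
    queued = queue_updates([(10, "first"), (10, "second"), (11, "other")])
    for bucket, states in sorted(queued.items()):
        print(bucket, states)
```

Because the bucket depends only on the state id, every update for a given connection lands in the same queue, which is what lets updates be collapsed while different buckets are presumably worked on by different cores.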