Once upon a time, John R Pierce <pierce at hogranch.com> said:
> On 6/22/2016 10:47 AM, Digimer wrote:
> > This is called "fabric fencing" and was originally the only supported
> > option in the very early days of HA. It has fallen out of favour for
> > several reasons, but it does still work fine. The main issue is that it
> > leaves the node in an unclean state. If an admin (out of ignorance or
> > panic) reconnects the node, all hell can break loose. So generally power
> > cycling is much safer.
>
> how is that any different than said ignorant admin powering up the
> shutdown node ?

On boot, the cluster software assumes it is "wrong" and doesn't connect
to any resources until it can verify state.

If the node is just disconnected and left running, and later
reconnected, it can try to write out (now old/incorrect) data to the
storage, corrupting things.

Speaking of shared storage, another fencing option is SCSI reservations.
It can be terribly finicky, but it can be useful.

-- 
Chris Adams <linux at cmadams.net>

On 22/06/16 02:01 PM, Chris Adams wrote:
> Once upon a time, John R Pierce <pierce at hogranch.com> said:
>> On 6/22/2016 10:47 AM, Digimer wrote:
>>> This is called "fabric fencing" and was originally the only supported
>>> option in the very early days of HA. It has fallen out of favour for
>>> several reasons, but it does still work fine. The main issue is that it
>>> leaves the node in an unclean state. If an admin (out of ignorance or
>>> panic) reconnects the node, all hell can break loose. So generally power
>>> cycling is much safer.
>>
>> how is that any different than said ignorant admin powering up the
>> shutdown node ?
>
> On boot, the cluster software assumes it is "wrong" and doesn't connect
> to any resources until it can verify state.
>
> If the node is just disconnected and left running, and later
> reconnected, it can try to write out (now old/incorrect) data to the
> storage, corrupting things.
>
> Speaking of shared storage, another fencing option is SCSI reservations.
> It can be terribly finicky, but it can be useful.

Close.

After a power cycle, the cluster software and any hosted services aren't
running. It's not that they think they're wrong, they just have no
existing state so they won't try to touch anything without first
ensuring it is safe to do so.

SCSI reservations, and anything else that blocks access, are technically
OK. However, I stand by the recommendation to power cycle lost nodes.
It's by far the safest (and easiest) approach. I know it goes against
the grain for sysadmins to yank power, but in an HA setup, nodes should
be disposable and replaceable. The nodes are not important, the hosted
services are.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

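The difference between the two fencing styles described above can be
shown with a small toy model. This is only a sketch under stated
assumptions: the classes SharedStorage and Node and the function
simulate() are hypothetical names made up for illustration, and nothing
here corresponds to the API of any real cluster stack. A fabric-fenced
node keeps its unflushed data in memory and writes it out when it is
reconnected; a power-cycled node comes back with no state at all.

```python
class SharedStorage:
    """Toy shared storage: remembers the last value written per block."""

    def __init__(self):
        self.blocks = {}

    def write(self, block, value, writer):
        self.blocks[block] = (value, writer)


class Node:
    """Hypothetical cluster node with an in-memory buffer of unflushed writes."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage
        self.connected = True
        self.pending = {}

    def buffer_write(self, block, value):
        self.pending[block] = value

    def flush(self):
        # A disconnected node cannot reach the storage at all.
        if not self.connected:
            return
        for block, value in self.pending.items():
            self.storage.write(block, value, self.name)
        self.pending.clear()

    def fabric_fence(self):
        # Cut the node off; it keeps running and keeps its buffered data.
        self.connected = False

    def reconnect(self):
        self.connected = True

    def power_cycle(self):
        # A reboot discards all in-memory state; the node returns "clean".
        self.pending.clear()


def simulate(fence_method):
    """Return the final (value, writer) of block 'record' for a fence method."""
    storage = SharedStorage()
    lost = Node("node-a", storage)
    survivor = Node("node-b", storage)

    # node-a buffers a write, then is declared lost and fenced.
    lost.buffer_write("record", "stale value")
    if fence_method == "fabric":
        lost.fabric_fence()
    else:
        lost.power_cycle()

    # The survivor takes over the service and writes newer data.
    survivor.buffer_write("record", "current value")
    survivor.flush()

    # Someone reconnects the fabric-fenced node, and it flushes its buffer.
    if fence_method == "fabric":
        lost.reconnect()
    lost.flush()
    return storage.blocks["record"]


if __name__ == "__main__":
    print("fabric:", simulate("fabric"))
    print("power: ", simulate("power"))
```

In the "fabric" run the stale value from node-a overwrites the
survivor's data, while in the "power" run the survivor's write is left
intact.
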
Once upon a time, Digimer <lists at alteeve.ca> said:
> The cluster software and any hosted services aren't running. It's not
> that they think they're wrong, they just have no existing state so they
> won't try to touch anything without first ensuring it is safe to do so.

Well, I was being short; what I meant was, in HA, if you aren't known to
be right, you are wrong, and you do nothing.

> SCSI reservations, and anything that blocks access is technically OK.
> However, I stand by the recommendation to power cycle lost nodes. It's
> by far the safest (and easiest) approach. I know this goes against the
> grain of sysadmins to yank power, but in an HA setup, nodes should be
> disposable and replaceable. The nodes are not important, the hosted
> services are.

One advantage SCSI reservations have is that if you can access the
storage, you can lock out everybody else. It doesn't require access to
a switch, management card, etc. (that may have its own problems). If
you can access the storage, you own it; if you can't, you don't.
Putting a lock directly on the actual shared resource can be the safest
path (if you can't access it, you can't screw it up).

I agree that rebooting a failed node is also good, just pointing out
that putting the lock directly on the shared resource is also good.

-- 
Chris Adams <linux at cmadams.net>

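The "lock on the shared resource" idea above can be sketched as a toy
model in which the disk itself refuses writes from any node whose key
has been removed. This is a simplified illustration loosely inspired by
persistent reservations, not the real SCSI command set; ReservedDisk,
ReservationConflict and the method names are hypothetical and chosen
only for this example.

```python
class ReservationConflict(Exception):
    """Raised when a node without a registered key touches the disk."""


class ReservedDisk:
    """Toy disk that enforces access itself, using registered node keys."""

    def __init__(self):
        self.registered = set()
        self.data = {}

    def register(self, key):
        self.registered.add(key)

    def preempt(self, my_key, victim_key):
        # Only a node that can still reach the disk with a valid key
        # is able to evict another node's key.
        if my_key not in self.registered:
            raise ReservationConflict(f"{my_key} is not registered")
        self.registered.discard(victim_key)

    def write(self, key, block, value):
        # The check happens at the storage, so no switch or management
        # card has to cooperate for the fence to hold.
        if key not in self.registered:
            raise ReservationConflict(f"{key} has been fenced")
        self.data[block] = value


def demo():
    """Return the disk contents after node-b fences node-a."""
    disk = ReservedDisk()
    disk.register("node-a")
    disk.register("node-b")

    # node-b still has access, so it locks node-a out.
    disk.preempt("node-b", "node-a")

    try:
        disk.write("node-a", "record", "stale value")
    except ReservationConflict:
        pass  # the fenced node's late write is rejected by the disk

    disk.write("node-b", "record", "current value")
    return disk.data


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the rejected write never depends on the
fenced node behaving well: even if it is still running and later
regains its path to the disk, the disk turns it away.
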
On 6/22/2016 11:06 AM, Digimer wrote:
> I know this goes against the
> grain of sysadmins to yank power, but in an HA setup, nodes should be
> disposable and replaceable. The nodes are not important, the hosted
> services are.

of course, the really tricky problem is implementing an iSCSI storage
infrastructure that's fully redundant and has no single point of
failure. this requires the redundant storage controllers to have
shared write-back cache, fully redundant networking, etc. The Fibre
Channel SAN folks had all this down pat 20 years ago, but at an
astronomical price point.

The more complex this stuff gets, the more points of potential failure
you introduce.

-- 
john r pierce, recycling bits in santa cruz

Digimer wrote:
> On 22/06/16 02:01 PM, Chris Adams wrote:
>> Once upon a time, John R Pierce <pierce at hogranch.com> said:
>>> On 6/22/2016 10:47 AM, Digimer wrote:
>>>> This is called "fabric fencing" and was originally the only supported
>>>> option in the very early days of HA. It has fallen out of favour for
>>>> several reasons, but it does still work fine. The main issue is that
>>>> it leaves the node in an unclean state. If an admin (out of ignorance or
>>>> panic) reconnects the node, all hell can break loose. So generally
>>>> power cycling is much safer.
<snip>
>> If the node is just disconnected and left running, and later
>> reconnected, it can try to write out (now old/incorrect) data to the
>> storage, corrupting things.
>>
>> Speaking of shared storage, another fencing option is SCSI reservations.
>> It can be terribly finicky, but it can be useful.
>
> Close.
>
> The cluster software and any hosted services aren't running. It's not
> that they think they're wrong, they just have no existing state so they
> won't try to touch anything without first ensuring it is safe to do so.
<snip>

Question: when y'all say "reconnect", is this different from stopping
the HA services, reconnecting to the network, and then starting the
services (which would let you avoid a reboot)?

      mark

On Wed, 22 Jun 2016, Digimer wrote:
> The nodes are not important, the hosted services are.

The only time this isn't true is when you're using the node to heat the
room. Otherwise, the service is always the important thing. (The node
may become synonymous with the service because there's no redundancy,
but that's a bug, not a feature.)

-- 
Paul Heinlein
heinlein at madboa.com
45°38' N, 122°6' W