thr3ads.net - CentOS - [CentOS] KVM HA [Jun 2016]

Chris Adams

2016-Jun-22 18:01 UTC

[CentOS] KVM HA

Once upon a time, John R Pierce <pierce at hogranch.com> said:
> On 6/22/2016 10:47 AM, Digimer wrote:
> > This is called "fabric fencing" and was originally the only supported
> > option in the very early days of HA. It has fallen out of favour for
> > several reasons, but it does still work fine. The main issue is that it
> > leaves the node in an unclean state. If an admin (out of ignorance or
> > panic) reconnects the node, all hell can break loose. So generally power
> > cycling is much safer.
>
> how is that any different than said ignorant admin powering up the
> shutdown node?

On boot, the cluster software assumes it is "wrong" and doesn't connect
to any resources until it can verify state.

If the node is just disconnected and left running, and later
reconnected, it can try to write out (now old/incorrect) data to the
storage, corrupting things.
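
To make the difference concrete, here is a minimal toy simulation, not real cluster software. The SharedStorage and Node classes, the "vm-disk" block name, and the run() helper are all hypothetical names invented for illustration. It assumes the lost node holds a buffered write in memory at the moment it is fenced.

```python
class SharedStorage:
    """Toy shared disk: block name -> data, plus the set of connected nodes."""

    def __init__(self):
        self.blocks = {}
        self.connected = set()

    def write(self, node, block, data):
        # A node cut off from the fabric cannot reach the disk at all.
        if node not in self.connected:
            return False
        self.blocks[block] = data
        return True


class Node:
    def __init__(self, name, storage):
        self.name = name
        self.storage = storage
        self.dirty = {}  # writes buffered in memory, not yet on disk

    def buffer_write(self, block, data):
        self.dirty[block] = data

    def flush(self):
        # Try to push every buffered write; keep the ones that fail.
        for block, data in list(self.dirty.items()):
            if self.storage.write(self.name, block, data):
                del self.dirty[block]

    def power_cycle(self):
        # A reboot throws away everything held in memory.
        self.dirty.clear()


def run(fence_by_power_cycle):
    storage = SharedStorage()
    storage.connected = {"node1", "node2"}
    node1 = Node("node1", storage)
    node2 = Node("node2", storage)

    # node1 has a write in flight when it is declared lost.
    node1.buffer_write("vm-disk", "old data from node1")

    if fence_by_power_cycle:
        node1.power_cycle()                 # power fencing: memory is gone
    else:
        storage.connected.discard("node1")  # fabric fencing: node keeps running
        node1.flush()                       # fails, so the write stays buffered

    # node2 recovers the service and writes current data.
    node2.buffer_write("vm-disk", "new data from node2")
    node2.flush()

    # An admin brings node1 back onto the fabric.
    storage.connected.add("node1")
    node1.flush()
    return storage.blocks["vm-disk"]
```

In this sketch, run(fence_by_power_cycle=False) ends with node1's stale write on the disk, while run(fence_by_power_cycle=True) leaves node2's data intact, because the rebooted node has nothing left to flush.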

Speaking of shared storage, another fencing option is SCSI reservations.
It can be terribly finicky, but it can be useful.
-- 
Chris Adams <linux at cmadams.net>

Digimer

2016-Jun-22 18:06 UTC

[CentOS] KVM HA

On 22/06/16 02:01 PM, Chris Adams wrote:
> Once upon a time, John R Pierce <pierce at hogranch.com> said:
>> On 6/22/2016 10:47 AM, Digimer wrote:
>>> This is called "fabric fencing" and was originally the only supported
>>> option in the very early days of HA. It has fallen out of favour for
>>> several reasons, but it does still work fine. The main issue is that it
>>> leaves the node in an unclean state. If an admin (out of ignorance or
>>> panic) reconnects the node, all hell can break loose. So generally power
>>> cycling is much safer.
>>
>> how is that any different than said ignorant admin powering up the
>> shutdown node?
>
> On boot, the cluster software assumes it is "wrong" and doesn't connect
> to any resources until it can verify state.
>
> If the node is just disconnected and left running, and later
> reconnected, it can try to write out (now old/incorrect) data to the
> storage, corrupting things.
>
> Speaking of shared storage, another fencing option is SCSI reservations.
> It can be terribly finicky, but it can be useful.

Close.

The cluster software and any hosted services aren't running. It's not
that they think they're wrong, they just have no existing state so they
won't try to touch anything without first ensuring it is safe to do so.
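
As a rough illustration of "no existing state, so touch nothing until it is verified safe", here is a small sketch. The FreshNode class, its rejoin() and start() methods, and the service names are hypothetical; real cluster stacks handle this very differently in detail.

```python
class FreshNode:
    """Toy model of a node that has just booted and holds no cluster state."""

    def __init__(self, name):
        self.name = name
        self.known_state = None  # nothing carried over from before the reboot
        self.running = set()

    def rejoin(self, cluster_view):
        # Adopt the cluster's current view: service name -> owning node (or None).
        self.known_state = dict(cluster_view)

    def start(self, service):
        # With no state at all, the only safe action is to do nothing.
        if self.known_state is None:
            return False
        # Refuse to start a service that another node is known to own.
        owner = self.known_state.get(service)
        if owner not in (None, self.name):
            return False
        self.running.add(service)
        self.known_state[service] = self.name
        return True
```

A freshly booted FreshNode("node1") refuses start("vm1") outright. After rejoin({"vm1": "node2", "vm2": None}) it still refuses "vm1", which node2 owns, but will start the unowned "vm2".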

SCSI reservations, and anything that blocks access is technically OK.
However, I stand by the recommendation to power cycle lost nodes. It's
by far the safest (and easiest) approach. I know this goes against the
grain of sysadmins to yank power, but in an HA setup, nodes should be
disposable and replaceable. The nodes are not important, the hosted
services are.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

Chris Adams

2016-Jun-22 18:12 UTC

[CentOS] KVM HA

Once upon a time, Digimer <lists at alteeve.ca> said:
> The cluster software and any hosted services aren't running. It's not
> that they think they're wrong, they just have no existing state so they
> won't try to touch anything without first ensuring it is safe to do so.

Well, I was being short; what I meant was, in HA, if you aren't known to
be right, you are wrong, and you do nothing.

> SCSI reservations, and anything that blocks access is technically OK.
> However, I stand by the recommendation to power cycle lost nodes. It's
> by far the safest (and easiest) approach. I know this goes against the
> grain of sysadmins to yank power, but in an HA setup, nodes should be
> disposable and replaceable. The nodes are not important, the hosted
> services are.

One advantage SCSI reservations have is that if you can access the
storage, you can lock out everybody else. It doesn't require access to
a switch, management card, etc. (that may have its own problems). If
you can access the storage, you own it; if you can't, you don't.
Putting a lock directly on the actual shared resource can be the safest
path (if you can't access it, you can't screw it up).
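
A toy model of the idea, loosely patterned on how SCSI persistent reservations let the disk itself reject writes from nodes whose key has been removed. The ReservedDisk class and its method names are hypothetical and greatly simplified; this is not how a real fence agent or SCSI tool is driven.

```python
class ReservedDisk:
    """Toy disk that accepts writes only from currently registered keys."""

    def __init__(self):
        self.registered = set()
        self.blocks = {}

    def register(self, key):
        self.registered.add(key)

    def preempt(self, my_key, victim_key):
        # Only a node that can still reach the disk with a valid key
        # may remove another node's key.
        if my_key not in self.registered:
            return False
        self.registered.discard(victim_key)
        return True

    def write(self, key, block, data):
        # The lock lives on the shared resource: no key, no write.
        if key not in self.registered:
            return False
        self.blocks[block] = data
        return True
```

Here the surviving node fences its peer by calling preempt() with the peer's key. From then on the disk refuses the fenced node's writes, and the fenced node cannot preempt back because its own key is gone.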

I agree that rebooting a failed node is also good, just pointing out
that putting the lock directly on the shared resource is also good.

-- 
Chris Adams <linux at cmadams.net>

John R Pierce

2016-Jun-22 18:31 UTC

[CentOS] KVM HA

On 6/22/2016 11:06 AM, Digimer wrote:
> I know this goes against the
> grain of sysadmins to yank power, but in an HA setup, nodes should be
> disposable and replaceable. The nodes are not important, the hosted
> services are.

Of course, the really tricky problem is implementing an iSCSI storage
infrastructure that's fully redundant and has no single point of
failure. This requires the redundant storage controllers to have
shared write-back cache, fully redundant networking, etc. The
Fibre Channel SAN folks had all this down pat 20 years ago, but at an
astronomical price point.

The more complex this stuff gets, the more points of potential failure
you introduce.

-- 
john r pierce, recycling bits in santa cruz

m.roth at 5-cent.us

2016-Jun-22 18:34 UTC

[CentOS] KVM HA

Digimer wrote:
> On 22/06/16 02:01 PM, Chris Adams wrote:
>> Once upon a time, John R Pierce <pierce at hogranch.com> said:
>>> On 6/22/2016 10:47 AM, Digimer wrote:
>>>> This is called "fabric fencing" and was originally the only supported
>>>> option in the very early days of HA. It has fallen out of favour for
>>>> several reasons, but it does still work fine. The main issue is that
>>>> it leaves the node in an unclean state. If an admin (out of ignorance or
>>>> panic) reconnects the node, all hell can break loose. So generally
>>>> power cycling is much safer.
<snip>
>> If the node is just disconnected and left running, and later
>> reconnected, it can try to write out (now old/incorrect) data to the
>> storage, corrupting things.
>>
>> Speaking of shared storage, another fencing option is SCSI reservations.
>> It can be terribly finicky, but it can be useful.
>
> Close.
>
> The cluster software and any hosted services aren't running. It's not
> that they think they're wrong, they just have no existing state so they
> won't try to touch anything without first ensuring it is safe to do so.
<snip>

Question: when y'all are saying "reconnect", is this different from
stopping the h/a services, reconnecting to the network, and then starting
the services (which would let you avoid a reboot)?

          mark

Paul Heinlein

2016-Jun-22 18:36 UTC

[CentOS] KVM HA

On Wed, 22 Jun 2016, Digimer wrote:
> The nodes are not important, the hosted services are.

The only time this isn't true is when you're using the node to heat
the room.

Otherwise, the service is always the important thing. (The node may
become synonymous with the service because there's no redundancy,
but that's a bug, not a feature.)

-- 
Paul Heinlein
heinlein at madboa.com
45°38' N, 122°6' W
