Hi Brent,

My name is Tuyen Nguyen and I work at the University of Massachusetts. For
the last few weeks I have been trying to get drbd working with a failover
Lustre OST, and I can only get it partially working.

The problem is that when the primary OST fails, Lustre switches over to the
backup node, but the drbd device on that node stays in secondary (read-only)
mode. I tried using heartbeat so that drbd on the backup OST node could
switch to primary (read/write) mode, but Lustre never registers the logical
takeover IP when Lustre starts on the OST nodes.

You said you got this working. Did you encounter the same problem that I
have? Would you be able to send me your docs on how to set up HA Lustre with
drbd?

Thanks a lot.
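A quick way to see the symptom described above is to check the drbd status on
the backup node before and after the takeover. The resource name "ost1" below
is only a placeholder for whatever the local drbd.conf defines, and promoting
by hand is just a diagnostic step, not a substitute for a working heartbeat
setup:

   # Show connection state and roles (Primary/Secondary) for the local
   # drbd devices; the backup node will still report Secondary here
   cat /proc/drbd

   # Promote the resource by hand on the backup node (what heartbeat is
   # supposed to do automatically)
   drbdadm primary ost1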
I've been experimenting with using drbd for network mirroring of the OSTs,
and, although I have much more tinkering to do, it's working fine so far...

On Wed, 1 Mar 2006, Andreas Dilger wrote:
> Several people have experimented with similar solutions using (G)NBD
> instead of iSCSI, but I've never heard back on whether it works well
> or not.
On Mar 01, 2006 12:17 -0600, Craig Hansen wrote:
> - Two dual-CPU servers, each with two local logical volumes.
> - Each server mounts one volume locally and exports the second via iSCSI.
> - Server A performs software mirroring between the local volume and
>   the iSCSI volume exported by Server B.
> - Server B does likewise with Server A's exported volume.
> - Each server runs an OST for its local volume plus a failover OST
>   for the partner server, which is normally idle.
> - If a disk fails, software mirroring takes over until it is resolved.
> - If server A fails, server B reconfigures and mounts the local volume
>   that was exported to server A via iSCSI, and the failover OST acts as
>   server A until it is resolved.
>
> Is this a valid approach for creating a group of HA OSTs? Is there a
> better way to get redundancy and failover protection for OST server
> failures?

It seems reasonable, at least. Several people have experimented with
similar solutions using (G)NBD instead of iSCSI, but I've never heard back
on whether it works well or not. I suspect that if you are using a dedicated
back-end network for the iSCSI (e.g. GigE with loopback cables?) that is as
fast as the front-end network and disk, it shouldn't be too much of a
performance problem.

Most of our support customers use external hardware RAID devices, and the
RAID is multi-ported FibreChannel that is just connected to each OSS
directly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
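For concreteness, the software-mirroring half of the scheme above could be
put together roughly as follows on server A. The iSCSI target name, volume
names, and device names are all made up, and this is an untested sketch that
assumes an open-iscsi initiator and md RAID1, not something anyone in this
thread has reported running:

   # Import the volume that server B exports via iSCSI
   iscsiadm -m discovery -t sendtargets -p serverB
   iscsiadm -m node -T iqn.2006-03.example:serverB.ost2 -p serverB --login

   # Mirror the local logical volume against the imported iSCSI disk;
   # /dev/md0 then becomes the backing device for server A's OST
   mdadm --create /dev/md0 --level=1 --raid-devices=2 \
         /dev/vg_local/ost1 /dev/sdc

   # Server B does the same thing in the other direction with the volume
   # that server A exports.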
On Mar 10, 2006 13:26 -0500, Brent A Nelson wrote:
> Someone correct me if I'm wrong (it would be pretty cool if I was), but
> you can't have OST services started for the same storage device from both
> nodes simultaneously, which is what you've been trying. Drbd, fortunately,
> prevented any damage by making the secondary read-only, but the Lustre
> services you started on the backup node were worthless since they were
> started without the ability to write to the drbd devices, so they
> presumably didn't start correctly and wouldn't recover even when the
> device later became read-write.

You are correct. Starting OSTs on multiple nodes while accessing the same
back-end storage device is a sure-fire way to corrupt the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Sorry for the delay in responding.

Unfortunately, I haven't automated everything yet; I've just verified
whole-node manual failovers. I'm currently busy banging my head against the
wall trying to LVS NFS from Lustre (unfs3 doesn't re-export, kernel NFS
presumably won't work with 2.6 kernels and Lustre, and nfs-user-server
exports with a different device ID from each machine in the LVS and has no
fsid= option for exports).

In your case, you are trying to fail over an OST even though the OSS itself
has not failed? If you want to do that, you'll need to stop Lustre on that
OST only, switch its drbd device to secondary on the node with the failed
disk, then switch the backup node's drbd device to primary, and then have
Lustre start up that OST on the backup node.

*** Warning! Unverified conjecture below! ***

However, in this situation I would probably do nothing (except replace the
failed drive at some point). The drbd device on the node with the failed
drive, as I understand it, should continue serving out data by requesting it
from the backup node, correct (just like RAID1)? I haven't tested this yet,
but as best as I could tell from Google, that should be correct. So Lustre
would not need to switch at all, although your performance would be lower
(and you wouldn't be redundant) until you replace the drive...

In the case of a failed OSS, the heartbeat script supplied with drbd should
do the trick. Note that your failed OSS may need to stop responding on the
network before the drbd device on the backup node can be switched to
primary, as drbd seems to have a heartbeat of its own. If you're just
testing failover, where your "failed" machine is still alive and
functioning, it won't work until you set the "failed" machine's drbd device
to secondary.

Thanks,

Brent Nelson
Director of Computing
Dept. of Physics
University of Florida

On Thu, 2 Mar 2006, Tuyen Nguyen wrote:
> The problem is that when the primary OST fails, Lustre switches over to
> the backup node, but the drbd device on that node stays in secondary
> (read-only) mode. [...]
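Spelled out as commands, the per-OST migration described above would look
roughly like this. The node names, drbd resource name, and config file are
placeholders, lconf --cleanup is shown stopping all of the node's services
rather than narrowing it to the one OST, and none of this has been verified:

   # On the node with the failed disk (oss1): stop its Lustre services
   # and give up the drbd primary role
   lconf --cleanup --node oss1 config.xml
   drbdadm secondary ost1

   # On the backup node (oss2): take the primary role and start the OST
   # as if this node were oss1 (drbd will refuse the promotion while the
   # other side is still primary, hence the demotion above)
   drbdadm primary ost1
   lconf --node oss1 config.xml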
You need to first do the switch to primary on drbd. Then you need to have
Lustre start up the services that the primary node would normally run by
doing (on the backup node):

lconf --node primary-node file.xml

I've done this with both servers normally active (an active/active setup),
with one serving out half the OSTs and the other node serving out the other
half. In the event of failure, I switch the other node to be drbd primary
for everything and start up the remaining OSTs using the command above, and
it works!

In your case, it looks like your backup node should only start Lustre
services in the event of failure (unless you want to change to an
active/active setup like the scheme above).

Someone correct me if I'm wrong (it would be pretty cool if I was), but you
can't have OST services started for the same storage device from both nodes
simultaneously, which is what you've been trying. Drbd, fortunately,
prevented any damage by making the secondary read-only, but the Lustre
services you started on the backup node were worthless since they were
started without the ability to write to the drbd devices, so they presumably
didn't start correctly and wouldn't recover even when the device later
became read-write.

Thanks,

Brent

On Fri, 10 Mar 2006, Tuyen Nguyen wrote:
> Hi Brent,
> Thanks for the reply. Here is the test I did, and it is not working.
>
> I manually power off the running OSS. From the client, I am still able to
> see the Lustre storage without problems; the failover OSS does pick up.
> However, any files on the failover OSS are read-only. I manually switch
> the drbd device on the failover OSS to primary, and I still can't read or
> write any files on that OSS. The following is what I found on the
> Internet about setting up a failover OSS, but it doesn't make sense to
> me. If it makes sense to you, would you explain it to me? Thanks a lot.
>
>   "Unfortunately, we do not support IP takeover at this time. What we do
>   is this:
>   The servers are configured with a specific IP.
>   The clients know about both IPs, and will attempt to connect in a
>   round-robin fashion until they succeed.
>
>   Here's a typical configuration for OST failover (servers orlando and
>   oscar):
>
>   --add ost --node orlando --ost ost1-home --failover --group orlando \
>     --lov lov-home --dev /dev/ost1
>   --add ost --node orlando --ost ost2-home --failover \
>     --lov lov-home --dev /dev/ost2
>
>   --add ost --node oscar --ost ost2-home --failover --group oscar \
>     --lov lov-home --dev /dev/ost2
>   --add ost --node oscar --ost ost1-home --failover \
>     --lov lov-home --dev /dev/ost1"
>
> This is for shared SAN storage. If I treat the drbd device as the shared
> storage, do the examples above mean that I have to set up each OST twice,
> once for each OSS sharing the same drbd device?
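Written out as commands, the whole-node takeover above would look roughly
like this. The node names, drbd resource names, and file.xml are
placeholders, and the sketch assumes one drbd resource per OST that the
failed node normally serves:

   # On the surviving node, after oss-a has gone down:

   # 1. Promote the drbd resources that oss-a normally holds as primary
   drbdadm primary ost1a
   drbdadm primary ost2a

   # 2. Start the Lustre services that oss-a would normally run
   lconf --node oss-a file.xml

   # Failing back is the reverse: stop those services here, demote the
   # resources to secondary, and let oss-a take primary again.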
I should point out that the new Lustre manual in the 1.4.6 distribution does
a good job of explaining how failover works. Also, you'll probably want to
use the --group and --select options to the lconf command described in the
failover section of the manual, rather than my trick of selecting the other
node in the command below...

Thanks,

Brent

On Fri, 10 Mar 2006, Brent A Nelson wrote:
> You need to first do the switch to primary on drbd. Then you need to have
> Lustre start up the services that the primary node would normally run by
> doing (on the backup node):
>
> lconf --node primary-node file.xml
> [...]
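As a rough illustration only, using the orlando/oscar group names from the
lmc snippet quoted earlier in the thread, the manual-style invocation should
look something like the following. The drbd resource name is a placeholder
and the exact --select syntax is from memory and unverified, so check it
against the failover section of the 1.4.6 manual before relying on it:

   # On oscar, after promoting the relevant drbd device(s) to primary,
   # start the services in the "orlando" group with oscar selected as
   # the node that should run them
   drbdadm primary ost1
   lconf --group orlando --select orlando=oscar config.xml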