Hi Brent,

My name is Tuyen Nguyen and I work at the University of Massachusetts. For
the last few weeks I have been trying to get drbd working with a failover
Lustre OST, and I can only get it partially working.

The problem is that when the primary OST fails, Lustre switches over to the
backup node, but the drbd device on that node stays in secondary (read-only)
mode. I tried using heartbeat so that drbd on the backup OST node could
switch to primary (read/write) mode, but Lustre never registers the logical
takeover IP when Lustre starts on the OST nodes.

You said you got this working. Did you encounter the same problem that I
have? Would you be able to send me your docs on how to set up HA Lustre with
drbd?

Thanks a lot.
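A quick way to see the symptom described above is to check the drbd status on
the backup node before and after the takeover. The resource name "ost1" below
is only a placeholder for whatever the local drbd.conf defines, and promoting
by hand is just a diagnostic step, not a substitute for a working heartbeat
setup:

   # Show connection state and roles (Primary/Secondary) for the local
   # drbd devices; the backup node will still report Secondary here
   cat /proc/drbd

   # Promote the resource by hand on the backup node (what heartbeat is
   # supposed to do automatically)
   drbdadm primary ost1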
I've been experimenting with using drbd for network mirroring of the OSTs,
and, although I have much more tinkering to do, it's working fine so far...

On Wed, 1 Mar 2006, Andreas Dilger wrote:
> Several people have experimented with similar solutions using (G)NBD
> instead of iSCSI, but I've never heard back on whether it works well
> or not.
On Mar 01, 2006 12:17 -0600, Craig Hansen wrote:
> - Two dual-CPU servers, each with two local logical volumes.
> - Each server mounts one volume locally and exports the second via iSCSI.
> - Server A performs software mirroring between the local volume and
>   the iSCSI volume exported by Server B.
> - Server B does likewise with Server A's exported volume.
> - Each server runs an OST for its local volume plus a failover OST
>   for the partner server, which is normally idle.
> - If a disk fails, software mirroring takes over until it is resolved.
> - If server A fails, server B reconfigures and mounts the local volume
>   that was exported to server A via iSCSI, and the failover OST acts as
>   server A until it is resolved.
>
> Is this a valid approach for creating a group of HA OSTs? Is there a
> better way to get redundancy and failover protection for OST server
> failures?

It seems reasonable, at least. Several people have experimented with
similar solutions using (G)NBD instead of iSCSI, but I've never heard back
on whether it works well or not. I suspect that if you are using a dedicated
back-end network for the iSCSI (e.g. GigE with loopback cables?) that is as
fast as the front-end network and disk, it shouldn't be too much of a
performance problem.

Most of our support customers use external hardware RAID devices, and the
RAID is multi-ported FibreChannel that is just connected to each OSS
directly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
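For concreteness, the software-mirroring half of the scheme above could be
put together roughly as follows on server A. The iSCSI target name, volume
names, and device names are all made up, and this is an untested sketch that
assumes an open-iscsi initiator and md RAID1, not something anyone in this
thread has reported running:

   # Import the volume that server B exports via iSCSI
   iscsiadm -m discovery -t sendtargets -p serverB
   iscsiadm -m node -T iqn.2006-03.example:serverB.ost2 -p serverB --login

   # Mirror the local logical volume against the imported iSCSI disk;
   # /dev/md0 then becomes the backing device for server A's OST
   mdadm --create /dev/md0 --level=1 --raid-devices=2 \
         /dev/vg_local/ost1 /dev/sdc

   # Server B does the same thing in the other direction with the volume
   # that server A exports.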
On Mar 10, 2006 13:26 -0500, Brent A Nelson wrote:
> Someone correct me if I'm wrong (it would be pretty cool if I was), but
> you can't have OST services started for the same storage device from both
> nodes simultaneously, which is what you've been trying. Drbd, fortunately,
> prevented any damage by making the secondary read-only, but the Lustre
> services you started on the backup node were worthless since they were
> started without the ability to write to the drbd devices, so they
> presumably didn't start correctly and wouldn't recover even when the
> device later became read-write.

You are correct. Starting OSTs on multiple nodes while accessing the same
back-end storage device is a sure-fire way to corrupt the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Sorry for the delay in responding.

Unfortunately, I haven't automated everything yet; I've just verified
whole-node manual failovers. I'm currently busy banging my head against the
wall trying to LVS NFS from Lustre (unfs3 doesn't re-export, kernel NFS
presumably won't work with 2.6 kernels and Lustre, and nfs-user-server
exports with a different device ID from each machine in the LVS and has no
fsid= option for exports).

In your case, you are trying to fail over an OST even though the OSS itself
has not failed? If you want to do that, you'll need to stop Lustre on that
OST only, switch its drbd device to secondary on the node with the failed
disk, then switch the backup node's drbd device to primary, and then have
Lustre start up that OST on the backup node.

*** Warning! Unverified conjecture below! ***

However, in this situation I would probably do nothing (except replace the
failed drive at some point). The drbd device on the node with the failed
drive, as I understand it, should continue serving out data by requesting it
from the backup node, correct (just like RAID1)? I haven't tested this yet,
but as best as I could tell from Google, that should be correct. So Lustre
would not need to switch at all, although your performance would be lower
(and you wouldn't be redundant) until you replace the drive...

In the case of a failed OSS, the heartbeat script supplied with drbd should
do the trick. Note that your failed OSS may need to stop responding on the
network before the drbd device on the backup node can be switched to
primary, as drbd seems to have a heartbeat of its own. If you're just
testing failover, where your "failed" machine is still alive and
functioning, it won't work until you set the "failed" machine's drbd device
to secondary.

Thanks,

Brent Nelson
Director of Computing
Dept. of Physics
University of Florida

On Thu, 2 Mar 2006, Tuyen Nguyen wrote:
> The problem is that when the primary OST fails, Lustre switches over to
> the backup node, but the drbd device on that node stays in secondary
> (read-only) mode. [...]
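Spelled out as commands, the per-OST migration described above would look
roughly like this. The node names, drbd resource name, and config file are
placeholders, lconf --cleanup is shown stopping all of the node's services
rather than narrowing it to the one OST, and none of this has been verified:

   # On the node with the failed disk (oss1): stop its Lustre services
   # and give up the drbd primary role
   lconf --cleanup --node oss1 config.xml
   drbdadm secondary ost1

   # On the backup node (oss2): take the primary role and start the OST
   # as if this node were oss1 (drbd will refuse the promotion while the
   # other side is still primary, hence the demotion above)
   drbdadm primary ost1
   lconf --node oss1 config.xml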
You need to first do the switch to primary on drbd. Then you need to have
Lustre start up the services that the primary node would normally run by
doing (on the backup node):

lconf --node primary-node file.xml

I've done this with both servers normally active (an active/active setup),
with one serving out half the OSTs and the other node serving out the other
half. In the event of failure, I switch the other node to be drbd primary
for everything and start up the remaining OSTs using the command above, and
it works!

In your case, it looks like your backup node should only start Lustre
services in the event of failure (unless you want to change to an
active/active setup like the scheme above).

Someone correct me if I'm wrong (it would be pretty cool if I was), but you
can't have OST services started for the same storage device from both nodes
simultaneously, which is what you've been trying. Drbd, fortunately,
prevented any damage by making the secondary read-only, but the Lustre
services you started on the backup node were worthless since they were
started without the ability to write to the drbd devices, so they presumably
didn't start correctly and wouldn't recover even when the device later
became read-write.

Thanks,

Brent

On Fri, 10 Mar 2006, Tuyen Nguyen wrote:
> Hi Brent,
> Thanks for the reply. Here is the test I did, and it is not working.
>
> I manually power off the running OSS. From the client, I am still able to
> see the Lustre storage without problems; the failover OSS does pick up.
> However, any files on the failover OSS are read-only. I manually switch
> the drbd device on the failover OSS to primary, and I still can't read or
> write any files on that OSS. The following is what I found on the
> Internet about setting up a failover OSS, but it doesn't make sense to
> me. If it makes sense to you, would you explain it to me? Thanks a lot.
>
>   "Unfortunately, we do not support IP takeover at this time. What we do
>   is this:
>   The servers are configured with a specific IP.
>   The clients know about both IPs, and will attempt to connect in a
>   round-robin fashion until they succeed.
>
>   Here's a typical configuration for OST failover (servers orlando and
>   oscar):
>
>   --add ost --node orlando --ost ost1-home --failover --group orlando \
>     --lov lov-home --dev /dev/ost1
>   --add ost --node orlando --ost ost2-home --failover \
>     --lov lov-home --dev /dev/ost2
>
>   --add ost --node oscar --ost ost2-home --failover --group oscar \
>     --lov lov-home --dev /dev/ost2
>   --add ost --node oscar --ost ost1-home --failover \
>     --lov lov-home --dev /dev/ost1"
>
> This is for shared SAN storage. If I treat the drbd device as the shared
> storage, do the examples above mean that I have to set up each OST twice,
> once for each OSS sharing the same drbd device?
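Written out as commands, the whole-node takeover above would look roughly
like this. The node names, drbd resource names, and file.xml are
placeholders, and the sketch assumes one drbd resource per OST that the
failed node normally serves:

   # On the surviving node, after oss-a has gone down:

   # 1. Promote the drbd resources that oss-a normally holds as primary
   drbdadm primary ost1a
   drbdadm primary ost2a

   # 2. Start the Lustre services that oss-a would normally run
   lconf --node oss-a file.xml

   # Failing back is the reverse: stop those services here, demote the
   # resources to secondary, and let oss-a take primary again.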
I should point out that the new Lustre manual in the 1.4.6 distribution does
a good job of explaining how failover works. Also, you'll probably want to
use the --group and --select options to the lconf command described in the
failover section of the manual, rather than my trick of selecting the other
node in the command below...

Thanks,

Brent

On Fri, 10 Mar 2006, Brent A Nelson wrote:
> You need to first do the switch to primary on drbd. Then you need to have
> Lustre start up the services that the primary node would normally run by
> doing (on the backup node):
>
> lconf --node primary-node file.xml
> [...]
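As a rough illustration only, using the orlando/oscar group names from the
lmc snippet quoted earlier in the thread, the manual-style invocation should
look something like the following. The drbd resource name is a placeholder
and the exact --select syntax is from memory and unverified, so check it
against the failover section of the 1.4.6 manual before relying on it:

   # On oscar, after promoting the relevant drbd device(s) to primary,
   # start the services in the "orlando" group with oscar selected as
   # the node that should run them
   drbdadm primary ost1
   lconf --group orlando --select orlando=oscar config.xml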