On Tue, 2 May 2006, Andreas Dilger wrote:
> On May 02, 2006 18:51 +0200, Alexander Jolk wrote:
>> I'd configure a pair of these OSSs with two RAID0 sets striped across
>> all six disks, and form two DRBD volumes to export as OST. For the DRBD
>> interconnect I was planning on using a crossover ethernet cable with
>> jumbo frames; connection to the rest of the network is over the other
>> ethernet port with standard MTU.
>
> I'd recommend against RAID0, just because disk failure is by far the
> most common failure mode. You'll have to resync the whole volume for
> each disk failure, opening up the possibility of a double failure.

What if he reversed his scenario, using RAID0 on top of drbd (6 drbd
pairs), essentially making a RAID10 setup? Similarly he could skip the
RAID0, and have each drbd pair be a Lustre OST so that Lustre handles
the striping...
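To make the second option concrete: once each drbd device is formatted as
its own OST, a client can ask Lustre to stripe new files in a directory
across all of the OSTs. A minimal sketch, assuming the positional lfs
setstripe syntax of the Lustre 1.4 era and an invented mount point; check
the lfs man page of the release actually deployed:

  # default striping for new files under /mnt/lustre/scratch:
  # 1MB stripe size, start on any OST (-1), stripe over all OSTs (-1)
  lfs setstripe /mnt/lustre/scratch 1048576 -1 -1

  # show the resulting default layout
  lfs getstripe /mnt/lustre/scratch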
On May 02, 2006 18:51 +0200, Alexander Jolk wrote:
> I'd configure a pair of these OSSs with two RAID0 sets striped across
> all six disks, and form two DRBD volumes to export as OST. For the DRBD
> interconnect I was planning on using a crossover ethernet cable with
> jumbo frames; connection to the rest of the network is over the other
> ethernet port with standard MTU.

I'd recommend against RAID0, just because disk failure is by far the
most common failure mode. You'll have to resync the whole volume for
each disk failure, opening up the possibility of a double failure.

> The MDS would be an identical pair, possibly with smaller SCSI disks
> and/or RAID5 internally.

The MDS should have RAID1, since it is doing almost exclusively small
random IO.

> An LTO-2 backup server (using amanda) would be a lustre client; a few
> OSSs would very possibly serve as additional amanda clients in order to
> speed up the nightly runs. (I'm particularly unsure about this point.)

Running read-only clients on an OSS is believed to be safe (though not
yet fully supported); the problem with client-on-OSS is the potential
for deadlock under heavy write load.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
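For the RAID1 suggestion, a minimal Linux software-RAID sketch; the device
names below are examples only, and the resulting md device would hold the
MDT:

  # mirror two internal SCSI disks for the MDS storage
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

  # watch the initial resync finish before putting the MDT into service
  cat /proc/mdstat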
Brent A Nelson wrote:
> On Tue, 2 May 2006, Andreas Dilger wrote:
>
>> On May 02, 2006 18:51 +0200, Alexander Jolk wrote:
>>
>>> I'd configure a pair of these OSSs with two RAID0 sets striped across
>>> all six disks, and form two DRBD volumes to export as OST. For the DRBD
>>> interconnect I was planning on using a crossover ethernet cable with
>>> jumbo frames; connection to the rest of the network is over the other
>>> ethernet port with standard MTU.
>>
>> I'd recommend against RAID0, just because disk failure is by far the
>> most common failure mode. You'll have to resync the whole volume for
>> each disk failure, opening up the possibility of a double failure.
>
> What if he reversed his scenario, using RAID0 on top of drbd (6 drbd
> pairs), essentially making a RAID10 setup? Similarly he could skip the
> RAID0, and have each drbd pair be a Lustre OST so that Lustre handles
> the striping...

Sounds reasonable to me, thanks for the input. Just to make sure I
follow correctly, if I do 6 DRBD pairs, three of which are exported by
each of the OSSs, what happens if one disk fails? As long as the
heartbeat between the nodes works, the other node won't be tempted to
stonith the first one?

Does anybody have an idea of the I/O bandwidth that I might reasonably
hope to attain with this kind of setup?

Alex
--
Alexander Jolk * BUF Compagnie * alexj@buf.com
Tel +33-1 42 68 18 28 * Fax +33-1 42 68 18 29
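One way to picture the normal three-OSTs-per-OSS split is a heartbeat v1
haresources file using the drbddisk agent that ships with drbd; node and
resource names here are invented, and the command that actually starts each
OST (lconf in Lustre 1.4, mount -t lustre in 1.6) would be appended as a
further resource on each line:

  # /etc/ha.d/haresources -- preferred node followed by its resources
  oss1 drbddisk::ost1 drbddisk::ost2 drbddisk::ost3
  oss2 drbddisk::ost4 drbddisk::ost5 drbddisk::ost6

If oss1 dies, heartbeat makes ost1-ost3 primary on oss2, which then serves
all six OSTs until the pair is healthy again.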
On Wed, 3 May 2006, Alexander Jolk wrote:
> Brent A Nelson wrote:
>> On Tue, 2 May 2006, Andreas Dilger wrote:
>>
>>> On May 02, 2006 18:51 +0200, Alexander Jolk wrote:
>>>
>>>> I'd configure a pair of these OSSs with two RAID0 sets striped across
>>>> all six disks, and form two DRBD volumes to export as OST. For the DRBD
>>>> interconnect I was planning on using a crossover ethernet cable with
>>>> jumbo frames; connection to the rest of the network is over the other
>>>> ethernet port with standard MTU.
>>>
>>> I'd recommend against RAID0, just because disk failure is by far the
>>> most common failure mode. You'll have to resync the whole volume for
>>> each disk failure, opening up the possibility of a double failure.
>>
>> What if he reversed his scenario, using RAID0 on top of drbd (6 drbd
>> pairs), essentially making a RAID10 setup? Similarly he could skip the
>> RAID0, and have each drbd pair be a Lustre OST so that Lustre handles
>> the striping...
>
> Sounds reasonable to me, thanks for the input. Just to make sure I follow
> correctly, if I do 6 DRBD pairs, three of which are exported by each of the
> OSSs, what happens if one disk fails? As long as the heartbeat between the
> nodes works, the other node won't be tempted to stonith the first one?
>
> Does anybody have an idea of the I/O bandwidth that I might reasonably hope
> to attain with this kind of setup?

If you're letting heartbeat handle stonith (which I'm not certain is all
that necessary anymore with drbd, at least not as necessary as it used
to be), the node wouldn't be expected to miss heartbeating and wouldn't
be killed.

From a drbd point of view, I THINK it will be just like RAID1 (at least
with appropriate drbd settings; drbd can be told to panic the whole node
on disk error, which would trigger heartbeat failover): drbd would serve
out the other drive in the pair (across the network from the other
node). From a Lustre point of view, nothing would have happened.

I really, really need to get around to testing this...

Thanks,

Brent
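The disk-error behaviour mentioned above is a per-resource drbd setting. A
sketch in drbd 0.7-style configuration, with invented resource, device and
address values; check drbd.conf(5) for the version actually installed:

  resource ost1 {
    protocol C;
    disk {
      # 'detach' drops to diskless mode and serves I/O from the peer's
      # disk over the network; 'panic' instead takes the whole node down
      # so that heartbeat fails everything over to the other OSS.
      on-io-error detach;
    }
    on oss1 {
      device    /dev/drbd0;
      disk      /dev/sda3;
      address   10.0.0.1:7788;
      meta-disk internal;
    }
    on oss2 {
      device    /dev/drbd0;
      disk      /dev/sda3;
      address   10.0.0.2:7788;
      meta-disk internal;
    }
  }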
On a low-cost, unreliable gigabit switch, with 4 two-machine node pairs
running drbd 0.6 (8 x 250GB SATA disks on each machine) and Lustre 1.2,
we got around 170MB/s (the same switch was used for both drbd and lustre
traffic). Each machine has 1GB of RAM and a P4 at 2.4GHz.

We have since upgraded the switch and are deploying drbd 0.7 with Lustre
1.4. I have no performance data so far.

On Wed, 2006-05-03 at 16:35 +0200, Alexander Jolk wrote:
> Brent A Nelson wrote:
> > On Tue, 2 May 2006, Andreas Dilger wrote:
> >
> >> On May 02, 2006 18:51 +0200, Alexander Jolk wrote:
> >>
> >>> I'd configure a pair of these OSSs with two RAID0 sets striped across
> >>> all six disks, and form two DRBD volumes to export as OST. For the DRBD
> >>> interconnect I was planning on using a crossover ethernet cable with
> >>> jumbo frames; connection to the rest of the network is over the other
> >>> ethernet port with standard MTU.
> >>
> >> I'd recommend against RAID0, just because disk failure is by far the
> >> most common failure mode. You'll have to resync the whole volume for
> >> each disk failure, opening up the possibility of a double failure.
> >
> > What if he reversed his scenario, using RAID0 on top of drbd (6 drbd
> > pairs), essentially making a RAID10 setup? Similarly he could skip the
> > RAID0, and have each drbd pair be a Lustre OST so that Lustre handles
> > the striping...
>
> Sounds reasonable to me, thanks for the input. Just to make sure I
> follow correctly, if I do 6 DRBD pairs, three of which are exported by
> each of the OSSs, what happens if one disk fails? As long as the
> heartbeat between the nodes works, the other node won't be tempted to
> stonith the first one?
>
> Does anybody have an idea of the I/O bandwidth that I might reasonably
> hope to attain with this kind of setup?
>
> Alex
Hi,

we are currently thinking about a new installation of lustre using pairs
of mutually mirrored DRBDs. I found an old mail by Jan Bruvoll from March
2004 where he described more or less exactly the same setup that I'm
thinking about, but I haven't seen much else on this topic. Let me
describe quickly what we are planning to do; I'd very much like some
feedback on my plans.

We are considering lustre because we want a single exported volume with
more than 500MB/s I/O bandwidth; our current crop of NFS servers quite
often saturates on one server while the others are idling. We have about
6TB of online accessible storage, with about 6 more to come, and some
350 client machines, all under Linux (Debian sarge). We are a 24/7
operation, more or less.

Each OSS would be a Dell PowerEdge 2850 or 2650 server with dual Xeon
3GHz and 1GB of RAM, six internal 146GB SCSI disks (five for the PE
2650), and dual 1000Base-T ethernet. I'd configure a pair of these OSSs
with two RAID0 sets striped across all six disks, and form two DRBD
volumes to export as OSTs. For the DRBD interconnect I was planning on
using a crossover ethernet cable with jumbo frames; connection to the
rest of the network is over the other ethernet port with standard MTU.

(Rationale for these ideas: by striping across all disks, I get more
bandwidth for a single OST; operations on different OSTs are less
correlated than on the individual disks for one OST. Dedicating one
ethernet port to DRBD speeds up every single write operation; the
remaining ethernet port should be almost sufficient for the typical
bandwidth of a striped 6-disk RAID0.)

In normal operation, each of the two servers would export one volume and
be DRBD slave of the other; in case a server goes down, the other one
takes over. The MDS would be an identical pair, possibly with smaller
SCSI disks and/or RAID5 internally.

We currently have 15 similar server machines that are each exporting an
individual NFS volume. We would be planning on integrating them
progressively into the lustre volume, using Lustre 1.6.x. The starting
config would consist of 4 OST pairs.

The 350 client machines would all access the same Lustre volume.
Installing a new kernel on all of them is not a big problem; just
installing a kernel module is even easier. (We are using cfengine for
the whole net.)

An LTO-2 backup server (using amanda) would be a lustre client; a few
OSSs would very possibly serve as additional amanda clients in order to
speed up the nightly runs. (I'm particularly unsure about this point.)

Is anybody using this kind of setup, and do you think we are on the
right track? I have run a few tests with lustre, but I haven't done
failover yet.

Sorry for being so long-winded...

Alex
--
Alexander Jolk * BUF Compagnie * alexj@buf.com
Tel +33-1 42 68 18 28 * Fax +33-1 42 68 18 29
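As a rough sketch of the two-RAID0-sets layout and the dedicated
jumbo-frame interconnect described above, with example device and
interface names (assuming two partitions per disk, one for each set, and
a NIC driver that supports a 9000-byte MTU):

  # two RAID0 sets, each striped across all six internal disks
  mdadm --create /dev/md0 --level=0 --raid-devices=6 /dev/sd[a-f]1
  mdadm --create /dev/md1 --level=0 --raid-devices=6 /dev/sd[a-f]2

  # jumbo frames only on the crossover link carrying DRBD traffic;
  # the client-facing port keeps the standard 1500-byte MTU
  ifconfig eth1 mtu 9000

Each md device would then be the backing store for one DRBD resource,
which in turn is exported as one OST.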