Stuart Marshall
2009-May-14 20:08 UTC
[Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6
Hi All,

With the upgrade from 1.6.x to 1.8.x we are planning to reconfigure our
RAID systems.

The OST RAID hardware are Sun 6140 arrays with 16x500GB SATA disks. Each
6140 tray has one OSS node (Sun X2200 M2). We have redundant paths and
ultimately plan a failover strategy. The MDT will be a RAID 1+0 Sun 2540
with 12x73GB SAS disks.

Each 6140 tray will be configured as either 1 or 2 RAID6 volumes. The
Lustre manual recommends more, smaller OSTs rather than fewer large ones,
and other docs I've seen seem to indicate that the optimal number of
drives is ~(6+2). For these 16-disk trays, the choice would be one
(12+2 R6) + external journal and/or hot spares, or two (5+2 R6)s + ext.
journal and/or hot spares.

So my questions are:

1.) What are the trade-offs of a RAID1 external journal with no hot spare
vs. a single-disk external journal with a hot spare (the spare being for
the R6 volume)? Specifically:

- If a single-disk external journal is lost, can we run fsck and only
lose the transactions that have not been committed to disk? If so, then
the loss of the disk hosting the external journal would not be
catastrophic for the file system as a whole.

- How comfortable are RAID6 users with no hot spares? (We'll have cold
spares handy, but prefer to get through weekends w/out service.)

2.) The external journal only takes up ~400MB. If we create 2 RAID6
volumes, can we put 2 external journals on one disk or RAID1 set
(suitably partitioned), or do we need to blow an entire disk for one
external journal?

3.) In planning for "segment size" (chunk size in the Lustre manual) we'd
have to go to 128kB or lower. However, in single-disk tests (SATA), it
seems that larger is better, so perhaps this argues for small RAID6 sets
as mentioned in the manual. Just wondering what other folks have found
here also.

We have the opportunity to test several scenarios with 2 6140 trays that
are not part of the 1.6.x production system, so I expect we will test
performance as a function of the number of drives in the RAID6 volume
(e.g. 12+2 vs 5+2) along with array write segment sizes via sgpdd-survey.
I'll report back with test results once we sort out which knobs seem to
make the most difference.

Any advice or comments welcome,
Stuart
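P.S. For concreteness, a rough sketch of how I'd expect to set up an
external journal for one OST (device names are placeholders and this is
untested, so treat it as a starting point rather than a recipe):

  # format the small journal device (a single disk or a partition on
  # the RAID1 set); block size must match the OST filesystem
  mke2fs -b 4096 -O journal_dev /dev/sdj1

  # format the RAID6 volume as an OST, pointing ldiskfs at that journal
  mkfs.lustre --ost --fsname=lfs --mgsnode=mgs@tcp0 \
      --mkfsoptions="-j -J device=/dev/sdj1" /dev/sdc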
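P.P.S. The sgpdd-survey runs would be along these lines (parameter names
are from my reading of the lustre-iokit README, so double-check them
against your copy; sizes and devices are made up):

  # sweep concurrent regions and threads over both test LUNs
  size=8192 crglo=1 crghi=16 thrlo=1 thrhi=32 \
      scsidevs="/dev/sdc /dev/sdd" ./sgpdd-survey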
Robin Humble
2009-May-15 03:37 UTC
[Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6
Hi Stuart,

On Thu, May 14, 2009 at 01:08:36PM -0700, Stuart Marshall wrote:
> Each 6140 tray will be configured either as 1 or 2 RAID6 volumes. The
> lustre manual recommends more smaller OSTs over large and other docs I've
> seen seem to indicate that the optimal number of drives is ~(6+2). For
> these 16 disk trays, the choice would be one (12+2R6) + external journal
> and/or hot spares or two (5+2R6)'s + ext. jrnl and/or hot spares.

2^n+parity (e.g. 8+2 R6) is generally best with software raid, and
presumably with your 6140 too. 8+2 with a 64k/128k chunk size means
512kB/1MB per data stripe, which plays nicely with Lustre's 1MB data
transfer sizes. presumably you have 6+2 because that fits neatly into
your 16 disk units - these things are always a compromise :-/

> So my questions are:
>
> 1.) What are the trade-offs of RAID1 external journal with no hot spare
> vs. single disk ext journal with a hot spare (spare is for R6 volume)?
> Specifically:

an external journal takes away 1/2 the seeks (the small writes to the
journal) when writing to RAID5/6's, so it can double your write speeds.
it does for us with software raid. having said that, if you have a
large NVRAM cache in your hardware raid then you might not notice these
extra seeks, as they mostly go to ram and are flushed to spinning disk
much less frequently.

also I believe Lustre 1.8 hides the slowness of internal journals
better than 1.6. IIRC, it allows multiple outstanding writes to be in
flight (like metadata in 1.6) and holds copies of data on clients for
replay in case an OSS crashes. so with 1.8 you may not notice external
journals helping all that much.

> - If a single disk external journal is lost, can we run fsck and only
> lose the transactions that have not been committed to disk? If so, then
> the loss of the disk hosting the external journal would not be
> catastrophic for the file system as a whole.

I think so, yes, although we run external journals on RAID1. if you
lose the journal device then you might have to tune2fs to delete the
external journal from the fs before you fsck, as fsck will go looking
for the (dead/missing) journal device and will sulk. (rough sketch of
that recovery below, after the hot-spares bit.)

one problem we came across was that ext3/ldiskfs hard-codes the device
name of the external journal (e.g. /dev/md5 or /dev/sdc1 or whatever)
into the filesystem. that means that when you fail over OSS's it will
look for /dev/whatever on the failed-over node, and won't mount if it
can't find it. so you need non-intersecting namespaces of journal
devices within an OSS pair, so that each regular and failed-over
RAID5/6 can always find its correct journal device. I didn't manage to
get ext3/ldiskfs to be sane and use UUIDs instead of hardcoded device
names :-/ presumably you could also tune2fs to rename or delete the
external journal as part of a failover, but that's a horrible hack.

> - How comfortable are RAID6 users with no hot spares? (We'll have cold
> spares handy, but prefer to get through weekends w/out service)

fairly comfy. you can do the sums and work out the likelihood of dual
failures given your drive sizes and error rates, and it's not
outrageous. that assumes no correlations between drive failures of
course...
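the recovery I mean would go something like this (made-up device names,
and I haven't tested this exact sequence, so check it before relying on
it):

  # the journal device is gone, so force-remove the journal reference
  # from the OST filesystem (here /dev/sdc), then fsck it
  tune2fs -f -O ^has_journal /dev/sdc
  e2fsck -f /dev/sdc

  # re-attach a journal afterwards - a fresh external one here
  mke2fs -b 4096 -O journal_dev /dev/sdj1
  tune2fs -j -J device=/dev/sdj1 /dev/sdc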
> 2.) The external journal only takes up ~400MB. If we create 2 RAID6
> volumes, can we put 2 external journals on one disk or RAID1 set
> (suitably partitioned), or do we need to blow an entire disk for one
> external journal?

ext3/ldiskfs won't let multiple fs's share one journal (although
apparently it's technically possible), but as you say, you can just
make 2 small partitions and put a journal on each (sketch in the P.S.
below). they will interfere if both fs's are writing heavily (no
interference on reads), but I'd guess (only a guess - I haven't
measured it) the penalty should still be smaller than with internal
journals. the Lustre 1.8 changes should probably help both the shared
external and internal journal cases. I believe Sun folks have some
numbers about such shared scenarios that you might be able to cajole
out of them.

> 3.) In planning for "segment size" (chunk size in lustre manual) we'd
> have to go to 128kB or lower. However, in single disk tests (SATA), it
> seems that larger is better so perhaps this argues for small RAID6 sets
> as mentioned in the manual. Just wondering what other folks have found
> here also.

you don't want your RAID chunk size to be such that
data disks * chunk > 1MB, as then every Lustre op will be hitting less
than one full stripe on the RAID, which causes read-modify-writes and
will be slow.

> We have the opportunity to test several scenarios with 2 6140 trays that
> are not part of the 1.6.x production system so I expect we will test
> performance as a function of the number of drives in the RAID6 volume
> (eg. 12+2 vs 5+2) along with array write segment sizes via sgpdd-survey.
>
> I'll report back with test results once we sort out which knobs seem to
> make the most difference.

that would be great to know. 6140's are probably quite different from
the software raid md SAS JBODs we run here.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
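P.S. a sketch of the two-journals-on-one-RAID1 idea, with made-up device
names (carve two small partitions out of the RAID1 LUN with fdisk
first):

  # one small journal partition per OST; block size must match the fs
  mke2fs -b 4096 -O journal_dev /dev/sdj1
  mke2fs -b 4096 -O journal_dev /dev/sdj2

then hand one partition to each OST with
--mkfsoptions="-j -J device=/dev/sdjN" at mkfs.lustre time.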
Ralf Utermann
2009-May-15 12:33 UTC
[Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6
Stuart Marshall wrote:
> Hi All,
>
> With the upgrade from 1.6.x to 1.8.x we are planning to reconfigure our
> RAID systems.
>
> The OST RAID hardware are Sun 6140 arrays with 16x500GB SATA disks.
> Each 6140 tray has one OSS node (Sun X2200 M2). We have redundant paths
> and ultimately plan a failover strategy. The MDT will be a RAID 1+0 Sun
> 2540 with 12x73GB SAS disks.
>
> Each 6140 tray will be configured either as 1 or 2 RAID6 volumes. The
> lustre manual recommends more smaller OSTs over large and other docs
> I've seen seem to indicate that the optimal number of drives is ~(6+2).
> For these 16 disk trays, the choice would be one (12+2R6) + external
> journal and/or hot spares or two (5+2R6)'s + ext. jrnl and/or hot spares.

We have a similar hardware setup: 2 OSS nodes attached to a Sun 6140
plus one CSM200 extension tray, which means 32x500GB SATA disks. Because
I assumed, as Robin says in his post, that 2^n+parity is optimal for
this hardware, I went back to RAID5 for the OSTs and configured 2 x 4+1
and 2 x 8+1. Then there is one RAID1 for external journals and 2 disks
left as hot spares. So the OSTs are not all of the same size, but each
OSS then serves one 4+1 and one 8+1 OST; I hope Lustre will spread the
data in a reasonable way. The chunk sizes used are 256k and 128k, so a
stripe always adds up to 1M.

> So my questions are:
>
> 1.) What are the trade-offs of RAID1 external journal with no hot spare
> vs. single disk ext journal with a hot spare (spare is for R6 volume)?
> Specifically:
>
> - If a single disk external journal is lost, can we run fsck and only
> lose the transactions that have not been committed to disk? If so, then
> the loss of the disk hosting the external journal would not be
> catastrophic for the file system as a whole.
>
> - How comfortable are RAID6 users with no hot spares? (We'll have cold
> spares handy, but prefer to get through weekends w/out service)
>
> 2.) The external journal only takes up ~400MB. If we create 2 RAID6
> volumes, can we put 2 external journals on one disk or RAID1 set
> (suitably partitioned), or do we need to blow an entire disk for one
> external journal?

we have the 4 journal volumes on one RAID1 virtual disk, but I did not
compare this to other setups with performance tests.

I did some performance tests with iozone in our dual-Gigabit
environment, and I see the performance going down significantly with
smaller block sizes for patchless Lustre clients. This is seen for some
OSTs, but not for others. I don't know whether this has something to do
with the 6140 and its setup here; the patched clients don't see this
problem, and I have not looked further into it.

Best regards, Ralf
--
Ralf Utermann
_____________________________________________________________________
Universität Augsburg, Institut für Physik -- EDV-Betreuer
Universitätsstr. 1, D-86135 Augsburg
Phone: +49-821-598-3231 Fax: -3411
SMTP: Ralf.Utermann at Physik.Uni-Augsburg.DE
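P.S. the iozone runs were of this general shape (paths, sizes and
thread counts here are illustrative, not the exact commands used):

  # throughput write/read test, 4 threads, sweeping the record size
  # (-r) down from 1024k to see where performance drops off
  iozone -i 0 -i 1 -t 4 -s 2g -r 1024k \
      -F /lustre/f1 /lustre/f2 /lustre/f3 /lustre/f4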
Andreas Dilger
2009-May-15 19:37 UTC
[Lustre-discuss] external journal raid1 vs. single disk ext journal + hot spare on raid6
On May 14, 2009 23:37 -0400, Robin Humble wrote:
> one problem we came across was that ext3/ldiskfs hard-codes the device
> name of the external journal (eg. /dev/md5 or /dev/sdc1 or whatever)
> into the filesystem.
> that means that when you failover OSS's it will look for /dev/whatever
> on the failed-over node, and won't mount if it can't find it.
> so you need non-intersecting namespaces of journal devices within an OSS
> pair, so that each regular and failed-over RAID5/6 can always find its
> correct journal device.
> I didn't manage to get ext3/ldiskfs to be sane and use UUIDs instead of
> hardcoded device names :-/

There is a "journal_dev" mount option for this. We'd like to make
mount.lustre find this device automatically, but it hasn't been fixed
yet. See bug 16861.

> presumably you could also tune2fs to rename or delete the external
> journal as part of a failover, but that's a horrible hack.

No, that will potentially lose some data, since ext3 considers data
written to the journal as "safe".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
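P.S. for reference, assuming this maps to the ext3-style journal_dev=
option, usage on a failover node would look roughly like this (device
names and numbers are made up):

  # the journal shows up as /dev/sdj1 (major 8, minor 145) on the
  # failover node; journal_dev= takes the encoded major/minor number,
  # here 8*256 + 145 = 2193
  mount -t ldiskfs -o journal_dev=2193 /dev/sdc /mnt/ost0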