Hi All,

Hoping to gain some insight from some people who have done large scale systems before. I'm hoping to get some performance estimates, suggestions and/or general discussion/feedback. I cannot discuss the exact specifics of the purpose but will go into as much detail as I can.

Technical Specs:
216x 3TB 7k3000 HDDs
24x 9 drive RAIDZ3
4x JBOD Chassis (45 bay)
1x server (36 bay)
2x AMD 12 Core CPU
128GB ECC RAM
2x 480GB SSD Cache
10Gbit NIC

Workloads:

Mainly streaming compressed data. That is, pulling compressed data in a sequential manner; however, multiple streams could be happening at once, making the access pattern somewhat random. We are hoping to have 5 clients pull 500Mbit sustained each.

Considerations:

The main reason RAIDZ3 was chosen was so we can distribute the parity across the JBOD enclosures. With this method, even if an entire JBOD enclosure is taken offline, the data is still accessible.

Questions:

How to manage the physical locations of such a vast number of drives? I have read this (http://blogs.oracle.com/eschrock/entry/external_storage_enclosures_in_solaris) and am hoping someone can shed some light on whether SES2 enclosure identification has worked for them (the enclosures are SES2).

What kind of performance would you expect from this setup? I know we can multiply the base IOPS by 24, but what about max sequential read/write?

Thanks,

Phil
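[A minimal sketch of the layout Phil describes, assuming hypothetical device names where c1-c5 stand for the five chassis. Each 9-drive raidz3 vdev takes at most two drives from any one enclosure, so losing a whole enclosure costs a vdev at most two drives, within the three raidz3 can tolerate:

    zpool create tank \
        raidz3 c1t0d0 c1t1d0 c2t0d0 c2t1d0 c3t0d0 c3t1d0 c4t0d0 c4t1d0 c5t0d0 \
        raidz3 c1t2d0 c1t3d0 c2t2d0 c2t3d0 c3t2d0 c3t3d0 c4t2d0 c4t3d0 c5t2d0 \
        ...    # 22 more raidz3 groups following the same pattern, with real device names

This is only a sketch of the vdev-to-enclosure mapping idea, not a tested configuration.]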
Wow. If you ever finish this monster, I would really like to hear more about the performance and how you connected everything. Could be useful as a reference for anyone else building big stuff. *drool*
Phil Harrison wrote:
> Hi All,
>
> Hoping to gain some insight from some people who have done large scale
> systems before? I'm hoping to get some performance estimates, suggestions
> and/or general discussion/feedback.

No personal experience, but you may find this useful:

"Petabytes on a budget"
http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

--
Roberto Waltman
They don't go into too much detail on their setup, and they are not running Solaris, but they do mention how their SATA cards see different drives based on where they are placed. They also have a second revision at http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets/ which talks about building their system with 135TB in a single 45 bay 4U box.

I am also interested in this kind of scale. Looking at the Backblaze box, I am thinking of building something like this, but not in one go... so, anything you do find out in your build, keep us informed! :)

--Tiernan

On Mon, Jul 25, 2011 at 4:25 PM, Roberto Waltman <lists at rwaltman.com> wrote:
> No personal experience, but you may find this useful:
> "Petabytes on a budget"
> http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
>
> --
> Roberto Waltman

--
Tiernan O'Toole
blog.lotas-smartman.net
www.geekphotographer.com
www.tiernanotoole.ie
On Sun, Jul 24, 2011 at 11:34 PM, Phil Harrison <philhar88 at gmail.com> wrote:
> What kind of performance would you expect from this setup? I know we can
> multiply the base IOPS by 24 but what about max sequential read/write?

You should have a theoretical max close to 144x single-disk throughput. Each raidz3 has 6 "data drives" which can be read from simultaneously, multiplied by your 24 vdevs.

Of course, you'll hit your controllers' limits well before that. Even with a controller per JBOD, you'll be limited by the SAS connection. The 7k3000 has throughput from 115 - 150 MB/s, meaning each of your JBODs will be capable of 5.2 GB/sec - 6.8 GB/sec, roughly 10 times the bandwidth of a single SAS 6G connection. Use multipathing if you can to increase the bandwidth to each JBOD.

Depending on the types of access that clients are performing, your cache devices may not be any help. If the data is read multiple times by multiple clients, then you'll see some benefit. If it's only being read infrequently or by one client, it probably won't help much at all. That said, if your access is mostly sequential then random access latency shouldn't affect you too much, and you will still have more bandwidth from your main storage pools than from the cache devices.

-B

--
Brandon High : bhigh at freaks.com
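[A quick way to sanity-check these estimates once the pool is built is to watch per-vdev throughput while the clients are streaming; "tank" is a placeholder pool name:

    zpool iostat -v tank 5

This prints bandwidth and operations per second for the pool, each raidz3 vdev, and each disk every 5 seconds, which makes it easy to spot a controller or expander that is capping out while the others sit idle.]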
> Workloads:
>
> Mainly streaming compressed data. That is, pulling compressed data in
> a sequential manner however could have multiple streams happening at
> once making it somewhat random. We are hoping to have 5 clients pull
> 500Mbit sustained.

That shouldn't be much of a problem with that amount of drives. I have a couple of smaller setups with 11x 7-drive raidz2, about 100TiB each, and even they can handle a 2.5Gbps load.

> Considerations:
>
> The main reason RAIDZ3 was chosen was so we can distribute the parity
> across the JBOD enclosures. With this method even if an entire JBOD
> enclosure is taken offline the data is still accessible.

Sounds like a good idea to me.

> How to manage the physical locations of such a vast number of drives?
> I have read this (
> http://blogs.oracle.com/eschrock/entry/external_storage_enclosures_in_solaris
> ) and am hoping some can shed some light if the SES2 enclosure
> identification has worked for them? (enclosures are SES2)

Which enclosures will you be using? From the data you've posted, it looks like SuperMicro, and AFAIK the ones we have don't support SES2.

> What kind of performance would you expect from this setup? I know we
> can multiply the base IOPS by 24 but what about max sequential
> read/write?

Parallel read/write from several clients will look like random I/O on the server. If bandwidth is crucial, use RAID1+0.

Also, it looks to me like you're planning to fill up all external bays with data drives - where do you plan to put the root? If you're looking at the SuperMicro SC847 line, there's indeed room for a couple of 2.5" drives inside, but the chassis is screwed tightly together and doesn't allow for opening during runtime. Also, those drives are placed in a rather awkward slot.

If planning to use RAIDZ, a couple of SSDs for the SLOG will help write performance a lot, especially during scrub/resilver. For streaming, L2ARC won't be of much use, though.

Finally, a few spares won't hurt, even with redundancy levels as high as RAIDZ3.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
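[A minimal sketch of what Roy suggests, assuming an existing pool named "tank" and placeholder device names for the SSDs and spare disks:

    zpool add tank log mirror c6t0d0 c6t1d0    # mirrored SLOG on two SSDs
    zpool add tank cache c6t2d0                # L2ARC cache device (optional for a mostly-streaming workload)
    zpool add tank spare c5t8d0 c5t9d0         # a couple of hot spares

The mirror on the log device is deliberate: losing an unmirrored SLOG at the wrong moment can cost recent synchronous writes.]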
> Even with a controller per JBOD, you'll be limited by the SAS
> connection. The 7k3000 has throughput from 115 - 150 MB/s, meaning
> each of your JBODs will be capable of 5.2 GB/sec - 6.8 GB/sec, roughly
> 10 times the bandwidth of a single SAS 6g connection. Use multipathing
> if you can to increase the bandwidth to each JBOD.

With (something like) an LSI 9211 and those SuperMicro babies I guess he's planning on using, you'll have one quad-lane SAS2 cable to each backplane/SAS expander, one in front and one in the back, meaning a theoretical 24Gbps (or 2.4GB/s) to each backplane. With a maximum of 24 drives per backplane, this should probably suffice, since you'll never get 150MB/s sustained from all drives.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
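[On Solaris-derived systems, multipathing to dual-expander JBODs is usually a matter of enabling MPxIO and then verifying that each disk shows up once with two paths. A sketch only; HBA support and device naming vary:

    stmsboot -e       # enable MPxIO for supported HBAs; reboot when prompted
    mpathadm list lu  # after reboot: each drive should appear as one logical unit with two operational paths
]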
Phil,

Recently, we have built a large configuration on a 4-way Xeon server with 8 4U 24-bay JBODs. We are using 2x LSI 6160 SAS switches so we can easily expand the storage in the future.

1) If you are planning to expand your storage, you should consider using an LSI SAS switch for easy future expansion.
2) We carefully pick one HD from each JBOD to create each RAIDZ2, so we can lose two JBODs at the same time while data is still accessible. It is good to know you have the same idea.
3) Sequential read/write is currently limited by the 10G NIC. Local storage can easily hit 1500MB/s+ with even a small number of HDs. Again, 10G is the bottleneck.
4) I recommend you use native SAS HDs in a large scale system if possible. Native SAS HDs work better.
5) We are using DSM to locate failed disks and monitor the FRUs of the JBODs: http://dataonstorage.com/dsm

I hope the above points can help. The configuration is similar to configuration 3 in the following link:
http://dataonstorage.com/dataon-solutions/lsi-6gb-sas-switch-sas6160-storage.html

Technical Specs:
DNS-4800 4-way Intel Xeon 7550 server with 256G RAM
2x LSI 9200-8E HBA
2x LSI 6160 SAS Switch
8x DNS-1600 4U 24-bay JBOD (dual IO in MPxIO) with 2TB Seagate SAS HD
RAIDZ2
STEC ZeusRAM for ZIL
Intel 320 SSD for L2ARC
10G NIC

Rocky
On 25/07/2011 2:34 AM, Phil Harrison wrote:
> Technical Specs:
> 216x 3TB 7k3000 HDDs
> 24x 9 drive RAIDZ3
> 4x JBOD Chassis (45 bay)
> 1x server (36 bay)
> [...]
> Considerations:
>
> The main reason RAIDZ3 was chosen was so we can distribute the parity across the JBOD enclosures. With this method even
> if an entire JBOD enclosure is taken offline the data is still accessible.

What kind of 45 bay enclosures? Have you tested this and taken an enclosure out?

Thanks,
Evgueni
Try mirrors. You will get much better multi-user performance, and you can easily split the mirrors across enclosures. If your priority is performance over capacity, you could experiment with n-way mirrors, since more mirrors will load balance reads better than more stripes.
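[A sketch of the kind of experiment Rob mentions, assuming an existing two-way mirror vdev and placeholder device names. Attaching a third disk turns it into a three-way mirror without recreating the pool, and detaching takes it back:

    zpool attach tank c1t0d0 c3t0d0    # add c3t0d0 as a third side of the mirror containing c1t0d0
    zpool detach tank c3t0d0           # undo it later if the extra read bandwidth isn't needed
]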
Are mirrors really a realistic alternative? I mean, if I have to resilver a raid with 3TB discs, it can take days, I suspect. With 4TB disks it can take a week, maybe. So, if I use mirrors and one disk breaks, then I have only a single copy, with no redundancy, while the mirror repairs. The repair will take a long time and will stress the disks, which means the other disk might malfunction.

Therefore, I think raidz2 or raidz3, which allow 2 or 3 disks to break while you resilver, are preferable. Hence, mirrors are not a realistic alternative when using large disks.

True/false? What do you guys say?
On 08/ 6/11 10:42 AM, Orvar Korvar wrote:
> Are mirrors really a realistic alternative?

To what? Some context would be helpful.

> I mean, if I have to resilver a raid with 3TB discs, it can take days, I suspect. With 4TB disks it can take a week, maybe. So, if I use mirrors and one disk breaks, then I have only a single copy while the mirror repairs. The repair will take a long time and will stress the disks, which means the other disk might malfunction.
>
> Therefore, I think raidz2 or raidz3, which allow 2 or 3 disks to break while you resilver, are preferable. Hence, mirrors are not a realistic alternative when using large disks.
>
> True/false? What do you guys say?

I don't have any exact like-for-like comparison data, but from what I've seen, a mirror resilvers a lot faster than a drive in a raidz(2) vdev.

--
Ian.
Generally, mirrors resilver MUCH faster than RAIDZ, and you only lose redundancy on that stripe, so combined, you're much closer to RAIDZ2 odds than you might think, especially with hot spare(s), which I'd recommend.

When you're talking about IOPS, each stripe can support 1 simultaneous user.

Writing: each RAIDZ group = 1 stripe; each mirror group = 1 stripe. So, 216 drives can be 24 stripes or 108 stripes.

Reading: each RAIDZ group = 1 stripe; each mirror group = 1 stripe per drive. So, 216 drives can be 24 stripes or 216 stripes. Actually, reads from mirrors are even more efficient than reads from stripes, because the software can optimally load balance across mirrors.

So, back to the original poster's question: 24 stripes might be enough to support 5 clients, but 216 stripes could support many more.

Actually, this is an area where RAID5/6 has an advantage over RAIDZ, if I understand correctly, because for RAID5/6 on read-only workloads, each drive acts like a stripe. For workloads with writing, though, RAIDZ is significantly faster than RAID5/6, but mirrors/RAID10 give the best performance for all workloads.
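[A rough back-of-the-envelope version of that comparison, assuming ~75 random read IOPS per 7200rpm drive (an assumed figure, not a measurement):

    echo $((24 * 75))     # 24 raidz3 vdevs        -> ~1800 random read IOPS
    echo $((216 * 75))    # 216 mirrored drives    -> ~16200 random read IOPS
    echo $((108 * 75))    # 108 two-way mirror vdevs -> ~8100 random write IOPS
]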
Ok, so mirrors resilver faster.

But it is not uncommon that another disk shows problems during resilver (for instance r/w errors); this scenario would mean your entire raid is gone, right? If you are using mirrors and one disk crashes and you start a resilver, then the other disk shows r/w errors because of the increased load - then you are screwed? Because large disks take a long time to resilver, possibly weeks?

In that case, it would be preferable to use mirrors with 3 disks in each vdev - "tri-mirrors". Each vdev could be one 3-disk raidz3.
Shouldn't the choice of RAID type also be based on the I/O requirements?

Anyway, with RAID-10, even a second failed disk is not catastrophic, so long as it is not the counterpart of the first failed disk, no matter the number of disks (with 2-way mirrors). But that's why we do backups, right?

Mark

Sent from my iPhone

On Aug 6, 2011, at 7:01 AM, Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> wrote:
> Ok, so mirrors resilver faster.
>
> But it is not uncommon that another disk shows problems during resilver (for instance r/w errors); this scenario would mean your entire raid is gone, right? If you are using mirrors and one disk crashes and you start a resilver, then the other disk shows r/w errors because of the increased load - then you are screwed? Because large disks take a long time to resilver, possibly weeks?
>
> In that case, it would be preferable to use mirrors with 3 disks in each vdev - "tri-mirrors". Each vdev could be one 3-disk raidz3.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Orvar Korvar
>
> Ok, so mirrors resilver faster.
>
> But, it is not uncommon that another disk shows problem during resilver (for
> instance r/w errors), this scenario would mean your entire raid is gone, right?

Imagine you have 8 disks configured as 4x 2-way mirrors. Capacity of 4 disks. Imagine, for comparison, you have 6 disks configured as raidz2. Capacity of 4 disks.

Imagine, in the event of a disk failure, the mirrored configuration resilvers 4x faster, which is a good estimate because each mirrored vdev has 1/4 as many objects on it.

Yes, it's possible for 2 disk failures to destroy the mirrored configuration, if they happen to both be partners of each other. But the probability of a 2nd disk failure being the partner of the first failed disk is only 1/7, and it only results in pool failure if it occurs within the resilver window, which is 4x less probable.

You can work out the probabilities, but suffice it to say, the probability of pool failure using the mirrored configuration is not dramatically different from the probability of pool failure with the raidz configuration. If you want to know the precise probabilities, you have to fill in all the variables: number of drives in each configuration, resilver times, MTTDL for each drive, etc. Sometimes the mirrors are more reliable, sometimes the raidz is more reliable.

Performance of the mirrors is always equal to or better than performance of the raidz. Cost of the mirrors is always equal to or higher than the cost of the raidz.

> If you are using mirrors, and one disk crashes and you start resilver. Then the
> other disk shows r/w errors because of the increased load - then you are
> screwed? Because large disks take long time to resilver, possibly weeks?

If one disk fails in a mirror, then one disk has increased load. If one disk fails in a raidz, then N disks have increased load. So no, I don't think this is a solid argument against mirrors. ;-)

Incidentally, large disks only take weeks to resilver in a large raidz configuration. That never happens in a mirrored configuration. ;-)

> In that case, it would be preferable to use mirrors with 3 disks in each vdev.
> Trimorrs. Each vdev should be one raidz3.

If I'm not mistaken, a 3-way mirror is not implemented behind the scenes in the same way as a 3-disk raidz3. You should use a 3-way mirror instead of a 3-disk raidz3.
I may have RAIDZ reading wrong here. Perhaps someone could clarify.

For a read-only workload, does each RAIDZ drive act like a stripe, similar to RAID5/6? Do they have independent queues?

It would seem that there is no escaping read/modify/write operations for sub-block writes, forcing the RAIDZ group to act like a single stripe.
RAIDZ has to rebuild data by reading all drives in the group and reconstructing from parity. Mirrors simply copy a drive.

Compare a 3TB mirror vs. a 9x 3TB RAIDZ2:

Mirror: read 3TB, write 3TB.
RAIDZ2: read 24TB, reconstruct data on CPU, write 3TB.

In this case, RAIDZ is at least 8x slower to resilver (assuming CPU and writing happen in parallel). In the meantime, performance for the array is severely degraded for RAIDZ, but not for mirrors.

Aside from resilvering, for many workloads, I have seen over 10x (!) better performance from mirrors.
> I may have RAIDZ reading wrong here. Perhaps someone could clarify.
>
> For a read-only workload, does each RAIDZ drive act like a stripe,
> similar to RAID5/6? Do they have independent queues?
>
> It would seem that there is no escaping read/modify/write operations
> for sub-block writes, forcing the RAIDZ group to act like a single stripe.

Can RAIDZ even do a partial block read? Perhaps it needs to read the full block (from all drives) in order to verify the checksum. If so, then RAIDZ groups would always act like one stripe, unlike RAID5/6.
On Sat, 6 Aug 2011, Orvar Korvar wrote:
> Ok, so mirrors resilver faster.
>
> But, it is not uncommon that another disk shows problem during
> resilver (for instance r/w errors), this scenario would mean your
> entire raid is gone, right? If you are using mirrors, and one disk
> crashes and you start resilver. Then the other disk shows r/w errors

Those using mirrors or raidz1 are best advised to perform periodic scrubs. This helps avoid future media read errors and also helps flush out failing hardware.

Regardless, it is true that two hard failures can take out your whole pool.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sat, 6 Aug 2011, Rob Cohen wrote:
> I may have RAIDZ reading wrong here. Perhaps someone could clarify.
>
> For a read-only workload, does each RAIDZ drive act like a stripe,
> similar to RAID5/6? Do they have independent queues?

They act like a stripe, as in RAID5/6.

> It would seem that there is no escaping read/modify/write operations
> for sub-block writes, forcing the RAIDZ group to act like a single
> stripe.

True.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sat, 6 Aug 2011, Rob Cohen wrote:
> Can RAIDZ even do a partial block read? Perhaps it needs to read
> the full block (from all drives) in order to verify the checksum.
> If so, then RAIDZ groups would always act like one stripe, unlike
> RAID5/6.

ZFS does not do partial block reads/writes. It must read the whole block in order to validate the checksum. If there is a checksum failure, then RAID5 type algorithms are used to produce a corrected block.

For this reason, it is wise to make sure that the zfs filesystem blocksize is appropriate for the task, and make sure that the system has sufficient RAM that the zfs ARC can cache enough data that it does not need to re-read from disk for recently accessed files.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
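[For what Bob describes, the relevant knob is the dataset recordsize. A sketch only; "tank/streams" and "tank/smallblocks" are placeholder dataset names, and 128K is the default maximum block size on ZFS of this era:

    zfs get recordsize tank/streams           # check the current setting
    zfs set recordsize=128K tank/streams      # large records suit big sequential streams
    zfs set recordsize=8K tank/smallblocks    # smaller records suit small random I/O, e.g. databases

Note that recordsize only affects files written after the change; existing files keep the block size they were written with.]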
Thanks for clarifying.

If a block is spread across all drives in a RAIDZ group, and there are no partial block reads, how can each drive in the group act like a stripe? Many RAID5/6 implementations can do partial block reads, allowing for parallel random reads across drives (as long as there are no writes in the queue).

Perhaps you are saying that they act like stripes for bandwidth purposes, but not for read ops/sec?

-Rob
Hello Bob Friesenhahn and List,

On August 06 2011, 18:34, Bob Friesenhahn wrote in [1]:
> Those using mirrors or raidz1 are best advised to perform periodic
> scrubs. This helps avoid future media read errors and also helps
> flush out failing hardware.

And what is your suggestion for scrubbing a mirror pool? Once per month, every 2 weeks, every week?

> Regardless, it is true that two hard failures can take out your whole
> pool.

In RAIDZ1, but not in a mirror?

--
Best Regards
Alexander
August, 06 2011
........
[1] mid:alpine.GSO.2.01.1108061131170.1997 at freddy.simplesystems.org
........
Hello Rob Cohen and List,

On August 06 2011, 17:32, Rob Cohen wrote in [1]:
> In this case, RAIDZ is at least 8x slower to resilver (assuming CPU
> and writing happen in parallel). In the mean time, performance for
> the array is severely degraded for RAIDZ, but not for mirrors.
>
> Aside from resilvering, for many workloads, I have seen over 10x
> (!) better performance from mirrors.

Horrible. My little pool needs more than 8 hours for a scrub with no workload. The pool has 6 Hitachi 2TB drives.

# zpool status archepool
  pool: archepool
 state: ONLINE
  scan: scrub repaired 0 in 8h14m with 0 errors on Sun Jul 31 19:14:47 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        archepool                    ONLINE       0     0     0
          mirror-0                   ONLINE       0     0     0
            c1t50024E9003CE0317d0    ONLINE       0     0     0
            c1t50024E9003CF7685d0    ONLINE       0     0     0
          mirror-1                   ONLINE       0     0     0
            c1t50024E9003CE031Bd0    ONLINE       0     0     0
            c1t50024E9003CE0368d0    ONLINE       0     0     0
          mirror-2                   ONLINE       0     0     0
            c1t5000CCA369CA262Bd0    ONLINE       0     0     0
            c1t5000CCA369CBF60Cd0    ONLINE       0     0     0

errors: No known data errors

How much time would the thread opener need with his config?

> Technical Specs:
> 216x 3TB 7k3000 HDDs
> 24x 9 drive RAIDZ3

I suspect a resilver would need weeks, and the chance that a second or third HD crashes in that time is high. Murphy's Law.

--
Best Regards
Alexander
August, 06 2011
........
[1] mid:1688088365.31312644757091.JavaMail.Twebapp at sf-app1
........
> How much time would the thread opener need with his config?
>
> > Technical Specs:
> > 216x 3TB 7k3000 HDDs
> > 24x 9 drive RAIDZ3
>
> I suspect a resilver would need weeks, and the chance that a second or
> third HD crashes in that time is high. Murphy's Law.

With a full pool, perhaps a couple of weeks, but unless the pool is full (something that's strongly discouraged), a few days should do. I'm currently replacing WD drives on a server with 4 9-drive RAIDZ2 vdevs, and it takes about two days. A single drive replacement takes about 24 hours.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
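[A minimal sketch of the replacement cycle Roy describes, assuming a pool named "tank" and placeholder device names:

    zpool replace tank c2t5d0 c2t11d0   # swap the old drive for the new one; resilver starts automatically
    zpool status -v tank                # shows resilver progress and an estimated completion time
    zpool status -x                     # quick health summary across all pools once it finishes
]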
On Sat, 6 Aug 2011, Rob Cohen wrote:
> Perhaps you are saying that they act like stripes for bandwidth purposes, but not for read ops/sec?

Exactly.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sat, 6 Aug 2011, Alexander Lesle wrote:
> > Those using mirrors or raidz1 are best advised to perform periodic
> > scrubs. This helps avoid future media read errors and also helps
> > flush out failing hardware.
>
> And what is your suggestion for scrubbing a mirror pool?
> Once per month, every 2 weeks, every week.

I think that this depends on the type of hardware you have, how much new data is written over a period of time, the typical I/O load on the server (i.e. does scrubbing impact usability?), and how critical the data is to you. Even power consumption and air conditioning can be a factor, since scrubbing is an intensive operation which will increase power consumption.

Written data which has not been scrubbed at least once remains subject to the possibility that it was not written correctly in the first place.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> If I'm not mistaken, a 3-way mirror is not implemented behind the scenes in
> the same way as a 3-disk raidz3. You should use a 3-way mirror instead of a
> 3-disk raidz3.

RAIDZ2 requires at least 4 drives, and RAIDZ3 requires at least 5 drives. But, yes, a 3-way mirror is implemented totally differently. Mirrored drives have identical copies of the data. RAIDZ drives store the data once, plus parity data.

A 3-way mirror gives improved redundancy and read performance, but at a high capacity cost, and slower writes than a 2-way mirror. It's more common to do 2-way mirrors + hot spare. This gives comparable protection to RAIDZ2, but with MUCH better performance.

Of course, mirrors cost more capacity, but it helps that ZFS's compression and thin provisioning can often offset the loss in capacity without sacrificing performance (especially when used in combination with L2ARC).
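[A sketch of the 2-way-mirrors-plus-spare layout Rob describes, with placeholder device names and compression enabled to claw back some of the capacity:

    zpool create tank \
        mirror c1t0d0 c2t0d0 \
        mirror c1t1d0 c2t1d0 \
        mirror c1t2d0 c2t2d0 \
        spare c3t0d0
    zfs set compression=on tank    # lzjb by default; cheap on CPU and often a decent space win
]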
Hello Bob Friesenhahn and List,

On August 06 2011, 20:41, Bob Friesenhahn wrote in [1]:
> I think that this depends on the type of hardware you have, how much
> new data is written over a period of time, the typical I/O load on the
> server (i.e. does scrubbing impact usability?), and how critical the
> data is to you. Even power consumption and air conditioning can be a
> factor since scrubbing is an intensive operation which will increase
> power consumption.

Thanks, Bob, for answering.

The hardware is an SM board, Xeon, 16 GB reg RAM, LSI 9211-8i HBA, and 6x Hitachi 2TB Deskstar 5K3000 HDS5C3020ALA632. The server is standing in the basement at 32°C. The HDs are filled to 80% and the workload is mostly reading.

What's best? Scrubbing every week, every second week, or once a month?

--
Best Regards
Alexander
August, 07 2011
........
[1] mid:alpine.GSO.2.01.1108061337570.1997 at freddy.simplesystems.org
........
> The hardware is an SM board, Xeon, 16 GB reg RAM, LSI 9211-8i HBA,
> and 6x Hitachi 2TB Deskstar 5K3000 HDS5C3020ALA632.
> The server is standing in the basement at 32°C.
> The HDs are filled to 80% and the workload is mostly reading.
>
> What's best? Scrubbing every week, every second week, or once a month?

Generally, you can't scrub too often. If you have a set of striped mirrors, the scrub shouldn't take too long. The extra stress on the drives during scrub shouldn't matter much; drives are made to be used.

By the way, 32°C is a bit high for most servers. Could you check the drive temperature with smartctl or ipmi tools?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
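[A quick way to do what Roy suggests. A sketch only; the exact device path and -d option depend on the OS and HBA, so treat these as placeholders:

    smartctl -a /dev/rdsk/c1t0d0s0 | grep -i temperature   # full SMART report, filtered to the temperature attribute
    smartctl -H /dev/rdsk/c1t0d0s0                         # overall SMART health verdict

On some HBAs you may need something like -d sat or -d scsi for smartctl to talk to the drive.]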
Hello Roy Sigurd Karlsbakk and List,

On August 07 2011, 19:27, Roy Sigurd Karlsbakk wrote in [1]:
> Generally, you can't scrub too often. If you have a set of striped
> mirrors, the scrub shouldn't take too long. The extra stress on the
> drives during scrub shouldn't matter much; drives are made to be
> used. By the way, 32°C is a bit high for most servers. Could you
> check the drive temperature with smartctl or ipmi tools?

Thanks, Roy, for answering. The temperatures are between 27°C and 31°C, checked with smartctl. At the moment I scrub every Sunday.

--
Best Regards
Alexander
August, 08 2011
........
[1] mid:8906420.12.1312738051044.JavaMail.root at zimbra
........
Alexander Lesle wrote:
> And what is your suggestion for scrubbing a mirror pool?
> Once per month, every 2 weeks, every week.

There isn't just one answer.

For a pool with redundancy, you need to do a scrub just before the redundancy is lost, so you can be reasonably sure the remaining data is correct and can rebuild the redundancy. The problem comes with knowing when this might happen. Of course, if you are doing some planned maintenance which will reduce the pool redundancy, then always do a scrub before that. However, in most cases, the redundancy is lost without prior warning, and you need to do periodic scrubs to cater for this case. I do a scrub via cron once a week on my home system. Having almost completely filled the pool, this was taking about 24 hours. However, now that I've replaced the disks and done a send/recv of the data across to a new larger pool which is only 1/3rd full, that's dropped down to 2 hours.

For a pool with no redundancy, where you rely only on backups for recovery, the scrub needs to be integrated into the backup cycle, such that you will discover corrupt data before it has crept too far through your backup cycle to be able to find a non-corrupt version of the data.

When you have a new hardware setup, I would perform scrubs more frequently, as a further check that the hardware doesn't have any systemic problems, until you have gained confidence in it.

--
Andrew Gabriel
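[A minimal sketch of the weekly cron approach Andrew mentions; "tank" is a placeholder pool name, and zpool scrub returns immediately while the scrub runs in the background:

    # root crontab entry: kick off a scrub every Sunday at 02:00
    0 2 * * 0 /usr/sbin/zpool scrub tank

    # check on it later
    zpool status tank | grep scrub
]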
On 2011-Aug-08 17:12:15 +0800, Andrew Gabriel <Andrew.Gabriel at oracle.com> wrote:
> periodic scrubs to cater for this case. I do a scrub via cron once a
> week on my home system. Having almost completely filled the pool, this
> was taking about 24 hours. However, now that I've replaced the disks and
> done a send/recv of the data across to a new larger pool which is only
> 1/3rd full, that's dropped down to 2 hours.

FWIW, scrub time is more related to how fragmented a pool is, rather than how full it is. My main pool is only at 61% (of 5.4TiB) and has never been much above that, but has lots of snapshots and a fair amount of activity. A scrub takes around 17 hours.

This is another area where the mythical block rewrite would help a lot.

--
Peter Jeremy
On Aug 8, 2011, at 4:01 PM, Peter Jeremy wrote:
> On 2011-Aug-08 17:12:15 +0800, Andrew Gabriel <Andrew.Gabriel at oracle.com> wrote:
>> periodic scrubs to cater for this case. I do a scrub via cron once a
>> week on my home system. Having almost completely filled the pool, this
>> was taking about 24 hours. However, now that I've replaced the disks and
>> done a send/recv of the data across to a new larger pool which is only
>> 1/3rd full, that's dropped down to 2 hours.
>
> FWIW, scrub time is more related to how fragmented a pool is, rather
> than how full it is. My main pool is only at 61% (of 5.4TiB) and has
> never been much above that but has lots of snapshots and a fair amount
> of activity. A scrub takes around 17 hours.

Don't forget, scrubs are throttled on later versions of ZFS. In a former life, we did a study of when to scrub, and the answer was about once a year for enterprise-grade storage. Once a week is OK for the paranoid.

> This is another area where the mythical block rewrite would help a lot.

Maybe. By then I'll be retired and fishing somewhere, scaring the children with stories about how hard we had it back in the days when we stored data on spinning platters :-)

-- richard