I'm setting up a server with 20x1TB disks. Initially I had thought to set up the disks using 2 RaidZ2 groups of 10 disks. However, I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks. So I'm thinking a better configuration would be 2 x 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that's 14TB worth of data instead of 16TB. What are your suggestions and experiences? -- This message posted from opensolaris.org
Hi

On Monday 06 September 2010 17:53:44 hatish wrote:
> I'm setting up a server with 20x1TB disks. Initially I had thought to set up the disks using 2 RaidZ2 groups of 10 disks. However, I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks. So I'm thinking a better configuration would be 2 x 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that's 14TB worth of data instead of 16TB.
>
> What are your suggestions and experiences?

Another consideration is that all vdevs in one pool should be equal, i.e. not mixed like 2x7 and 1x6 (you will most likely need to force that configuration anyway). First, I'd assess what you want/expect from this file system in the end: maximum performance, maximum reliability or maximum size - as always, pick two ;)

Cheers, Carsten
Can you add another disk? Then you would have three 7-disk vdevs. (Always use raidz2.) -- This message posted from opensolaris.org
Otherwise you could keep 2 disks as hot spares and build three 6-disk vdevs. -- This message posted from opensolaris.org
On Mon, Sep 6, 2010 at 8:53 AM, hatish <hatish at gmail.com> wrote:
> I'm setting up a server with 20x1TB disks. Initially I had thought to set up the disks using 2 RaidZ2 groups of 10 disks. However, I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks. So I'm thinking a better configuration would be 2 x 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that's 14TB worth of data instead of 16TB.

2 x 10-disk raidz2 should be fine for general storage. It depends on what your performance needs are.

Or go with 3 x 6-disk vdevs, a spare and an L2ARC.

-B -- Brandon High : bhigh at freaks.com
----- Original Message -----
> On Mon, Sep 6, 2010 at 8:53 AM, hatish <hatish at gmail.com> wrote:
> > I'm setting up a server with 20x1TB disks. Initially I had thought to
> > set up the disks using 2 RaidZ2 groups of 10 disks. However, I have
> > just read the Best Practices guide, and it says your group shouldn't
> > have > 9 disks. So I'm thinking a better configuration would be 2 x
> > 7-disk RaidZ2 + 1 x 6-disk RaidZ2. However, that's 14TB worth of data
> > instead of 16TB.
>
> 2 x 10-disk raidz2 should be fine for general storage. It depends on
> what your performance needs are.
>
> Or go with 3 x 6-disk vdevs, a spare and an L2ARC.

A 7k2 drive for L2ARC?

Vennlige hilsener / Best regards

roy -- Roy Sigurd Karlsbakk (+47) 97542685 roy at karlsbakk.net http://blogg.karlsbakk.net/ -- In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
On Mon, Sep 6, 2010 at 2:36 PM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> A 7k2 drive for L2ARC?

It wouldn't be great, but you could put an SSD in the bay instead. -B -- Brandon High : bhigh at freaks.com
Thanks for all the replies :)

My mindset is split in two now...

Some detail - I'm using 4 1-to-5 SATA port multipliers connected to a 4-port SATA RAID card.

I only need reliability and size; as long as my performance is the equivalent of one drive, I'm happy.

I'm assuming all the data used in the group is read once when re-creating a lost drive. Also assuming space consumed is 50%.

So option 1 - Stay with the 2 x 10-drive RaidZ2. My concern is the stress on the drives when one drive fails and the others go crazy (read-wise) to re-create the new drive. Is there no way to reduce this stress? Maybe limit the data rate, so it's not quite so stressful, even though it will end up taking longer? (quite acceptable) [Available Space: 16TB, Redundancy Space: 4TB, Repair data read: 4.5TB]

And option 2 - Add a 21st drive to one of the motherboard SATA ports, and then go with 3 x 7-drive RaidZ2. [Available Space: 15TB, Redundancy Space: 6TB, Repair data read: 3TB]

Sadly, SSDs won't go too well in a PM-based setup like mine. I may add one directly onto the MB if I can afford it. But again, performance is not a priority.

Any further thoughts and ideas are much appreciated. -- This message posted from opensolaris.org
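As a rough sketch of the space figures being compared in this thread (assuming 1 TB per disk and ignoring ZFS metadata/slop overhead - a simplification, not exact numbers):

# Rough capacity comparison for the layouts discussed in this thread.
# Assumes 1 TB per disk and ignores ZFS metadata/slop overhead.

def layout_space(vdevs, disks_per_vdev, parity, disk_tb=1):
    """Return (usable TB, parity TB) for identical raidz vdevs."""
    data_disks = disks_per_vdev - parity
    return vdevs * data_disks * disk_tb, vdevs * parity * disk_tb

print("2 x 10-disk raidz2 :", layout_space(2, 10, 2))  # (16, 4)
print("3 x 7-disk raidz2  :", layout_space(3, 7, 2))   # (15, 6)

# The mixed 2x7 + 1x6 raidz2 layout from the original post:
u1, p1 = layout_space(2, 7, 2)
u2, p2 = layout_space(1, 6, 2)
print("2x7 + 1x6 raidz2   :", (u1 + u2, p1 + p2))      # (14, 6)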
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of hatish
>
> I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks.

I think the value you can take from this is: Why does the BPG say that? What is the reasoning behind it?

Anything that is a "rule of thumb" either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)
Makes sense. My understanding is not good enough to confidently make my own decisions, and I'm learning as I go.

The BPG says: - The recommended number of disks per group is between 3 and 9. If you have more disks, use multiple groups.

If there was a reason leading up to this statement, I didn't follow it. However, a few paragraphs later, their RaidZ2 example says [4x(9+2), 2 hot spares, 18.0 TB]. So I guess 8+2 should be quite acceptable, especially since performance is the lowest priority.

On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of hatish
> >
> > I have just read the Best Practices guide, and it says your group shouldn't have > 9 disks.
>
> I think the value you can take from this is:
> Why does the BPG say that? What is the reasoning behind it?
>
> Anything that is a "rule of thumb" either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)
Maybe 5x(3+1) - use one disk from each controller; 15TB usable space, and 3+1 raidz rebuild time should be reasonable.

On 9/7/2010 4:40 AM, hatish wrote:
> Thanks for all the replies :)
>
> My mindset is split in two now...
>
> Some detail - I'm using 4 1-to-5 SATA port multipliers connected to a 4-port SATA RAID card.
>
> I only need reliability and size; as long as my performance is the equivalent of one drive, I'm happy.
>
> I'm assuming all the data used in the group is read once when re-creating a lost drive. Also assuming space consumed is 50%.
>
> So option 1 - Stay with the 2 x 10-drive RaidZ2. My concern is the stress on the drives when one drive fails and the others go crazy (read-wise) to re-create the new drive. Is there no way to reduce this stress? Maybe limit the data rate, so it's not quite so stressful, even though it will end up taking longer? (quite acceptable) [Available Space: 16TB, Redundancy Space: 4TB, Repair data read: 4.5TB]
>
> And option 2 - Add a 21st drive to one of the motherboard SATA ports, and then go with 3 x 7-drive RaidZ2. [Available Space: 15TB, Redundancy Space: 6TB, Repair data read: 3TB]
>
> Sadly, SSDs won't go too well in a PM-based setup like mine. I may add one directly onto the MB if I can afford it. But again, performance is not a priority.
>
> Any further thoughts and ideas are much appreciated.
> On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
>
> I think the value you can take from this is:
> Why does the BPG say that? What is the reasoning behind it?
>
> Anything that is a "rule of thumb" either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)

Let's examine the myth that you should limit the number of drives in a vdev because of resilver time. The myth goes something like this: You shouldn't use more than ___ drives in a vdev raidz_ configuration, because all the drives need to read during a resilver, so the more drives are present, the longer the resilver time.

The truth of the matter is: Only the size of used data is read. Because this is ZFS, it's smarter than a hardware solution which would have to read all disks in their entirety. In ZFS, if you have a 6-disk raidz1 with the capacity of 5 disks, and a total of 50G of data, then each disk has roughly 10G of data on it. During resilver, 5 disks will each read 10G of data, and 10G of data will be written to the new disk. If you have an 11-disk raidz1 with the capacity of 10 disks, then each disk has roughly 5G of data. 10 disks will each read 5G of data, and 5G of data will be written to the new disk. If anything, more disks means a faster resilver, because you're more easily able to saturate the bus, and you have a smaller amount of data that needs to be written to the replaced disk.

Let's examine the myth that you should limit the number of disks for the sake of redundancy. It is true that a carefully crafted system can survive things like SCSI controller or tray failure. Suppose you have 3 SCSI cards. Suppose you construct a raidz2 device using 2 disks from controller 0, 2 disks from controller 1, and 2 disks from controller 2. Then if a controller dies, you have only lost 2 disks, and you are degraded but still functional as long as you don't lose another disk. But you said you have 20 disks all connected to a single controller. So none of that matters in your case.

Personally, I can't imagine any good reason to generalize "don't use more than ___ devices in a vdev." To me, a 12-disk raidz2 is just as likely to fail as a 6-disk raidz1. But a 12-disk raidz2 is slightly more reliable than having two 6-disk raidz1's. Perhaps, maybe, a 64-bit processor is able to calculate parity on an 8-disk raidz set in a single operation, but requires additional operations to calculate parity if your raidz has 9 or more disks in it ... But I am highly skeptical of this line of reasoning, and AFAIK, nobody has ever suggested this before me. I made it up just now. I'm grasping at straws and stretching my imagination to find *any* merit in the statement, "don't use more than ___ disks in a vdev." I see no reasoning behind it, and unless somebody can say anything to support it, I think it's bunk.
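A minimal sketch of the arithmetic in that argument (assuming data is spread evenly across the vdev and that only allocated blocks are read during resilver):

# Per-disk resilver traffic for the two examples above, assuming data is
# spread evenly across the vdev and only used blocks are resilvered.

def resilver_traffic(total_data_gb, disks, parity):
    """Approximate GB on each disk (read per survivor, written to the new
    disk) and the number of surviving disks that must read."""
    data_disks = disks - parity
    per_disk_gb = total_data_gb / data_disks
    surviving_readers = disks - 1
    return per_disk_gb, surviving_readers

# 6-disk raidz1 holding 50 GB of data: ~10 GB per disk, 5 readers
print(resilver_traffic(50, 6, 1))   # (10.0, 5)

# 11-disk raidz1 holding the same 50 GB: ~5 GB per disk, 10 readers
print(resilver_traffic(50, 11, 1))  # (5.0, 10)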
On Wed, Sep 8, 2010 at 06:59, Edward Ned Harvey <shill at nedharvey.com> wrote:
>> On Tue, Sep 7, 2010 at 4:59 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
>>
>> I think the value you can take from this is:
>> Why does the BPG say that? What is the reasoning behind it?
>>
>> Anything that is a "rule of thumb" either has reasoning behind it (you should know the reasoning) or it doesn't (you should ignore the rule of thumb, dismiss it as myth.)
>
> Let's examine the myth that you should limit the number of drives in a vdev because of resilver time. The myth goes something like this: You shouldn't use more than ___ drives in a vdev raidz_ configuration, because all the drives need to read during a resilver, so the more drives are present, the longer the resilver time.
>
> The truth of the matter is: Only the size of used data is read. Because this is ZFS, it's smarter than a hardware solution which would have to read all disks in their entirety. In ZFS, if you have a 6-disk raidz1 with the capacity of 5 disks, and a total of 50G of data, then each disk has roughly 10G of data on it. During resilver, 5 disks will each read 10G of data, and 10G of data will be written to the new disk. If you have an 11-disk raidz1 with the capacity of 10 disks, then each disk has roughly 5G of data. 10 disks will each read 5G of data, and 5G of data will be written to the new disk. If anything, more disks means a faster resilver, because you're more easily able to saturate the bus, and you have a smaller amount of data that needs to be written to the replaced disk.

It is not a question of a vdev with 6 disks vs a vdev with 12 disks. It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2 vdevs you only have to read half the data compared to 1 vdev to resilver a disk. Or look at it this way: you will put more data on a 12-disk vdev than on a 6-disk vdev. I/O other than the resilver will also slow the resilver down more if you have large vdevs.
Rebuild time is not a concern for me. The concern with rebuilding was the stress it puts on the disks for an extended period of time (increasing the chances of another disk failure). The % of data used doesn't matter, as the system will try to get it done at max speed, thus creating the mentioned stress. But I suspect the port multipliers will do a good job of throttling the I/O such that the disks face minimal stress. Thus I'm pretty sure I'll stick with 2 x 10-disk RaidZ2. Thanks for all the input! -- This message posted from opensolaris.org
> From: pantzare at gmail.com [mailto:pantzare at gmail.com] On Behalf Of Mattias Pantzare
>
> It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
> vdevs you have to read half the data compared to 1 vdev to resilver a
> disk.

Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G.

Let's suppose you have 1T of data. You have 2 vdevs that are each a 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G.

Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.

In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and a cheap bus for cheap disks, so approx 5 disks max out approx 1 bus.)

In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ...

Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."
Mattias, what you say makes a lot of sense. When I saw *Both of the above situations resilver in equal time*, I was like "no way!" But like you said, assuming no bus bottlenecks.

This is my exact breakdown (cheap disks on a cheap bus :P) :

PCI-E 8X 4-port ESata RAID controller.
4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on the controller).
20 x Samsung 1TB HDDs (each connected to a port multiplier).

The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6 Tb at 3Gbps
= 57600 Gb at 3Gbps
= 19200 seconds
= 320 minutes
= 5 hours 20 minutes.

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 disks per PM] x 2, [3 disks per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs.

I've never actually sat and done the math before. Hope it's decently accurate :)

On Wed, Sep 8, 2010 at 3:27 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> > From: pantzare at gmail.com [mailto:pantzare at gmail.com] On Behalf Of
> > Mattias Pantzare
> >
> > It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
> > vdevs you have to read half the data compared to 1 vdev to resilver a
> > disk.
>
> Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G.
>
> Let's suppose you have 1T of data. You have 2 vdevs that are each a 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G.
>
> Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.
>
> In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and a cheap bus for cheap disks, so approx 5 disks max out approx 1 bus.)
>
> In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ...
>
> Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."
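A small sketch of the bandwidth-limited estimate in the post above (assuming a 10-disk vdev with drives 80% full, so 9 surviving disks of 800 GB each must be read, throttled to the 3 Gbps figure used in the post):

# Bandwidth-limited resilver estimate from the post above.
# Assumptions: 10-disk raidz2 vdev, drives 80% full (800 GB each),
# 9 surviving disks must be read, and reads are limited to 3 Gbps.

read_gb   = 800 * 9          # GB that must be read from surviving disks
read_gbit = read_gb * 8      # convert to gigabits
rate_gbps = 3                # effective read bandwidth assumed in the post

seconds = read_gbit / rate_gbps
print(f"{read_gb / 1000:.1f} TB read -> {seconds:.0f} s = {seconds / 3600:.1f} hours")
# 7.2 TB read -> 19200 s = 5.3 hours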
On Wed, Sep 8, 2010 at 15:27, Edward Ned Harvey <shill at nedharvey.com> wrote:
>> From: pantzare at gmail.com [mailto:pantzare at gmail.com] On Behalf Of
>> Mattias Pantzare
>>
>> It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
>> vdevs you have to read half the data compared to 1 vdev to resilver a
>> disk.
>
> Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have approx 100G on each disk, and you replace one disk. Then 11 disks will each read 100G, and the new disk will write 100G.
>
> Let's suppose you have 1T of data. You have 2 vdevs that are each a 6-disk raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk will write 100G.
>
> Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.
>
> In my personal experience, approx 5 disks can max out approx 1 bus. (It actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks on a good bus, or good disks on a crap bus, but generally speaking people don't do that. Generally people get a good bus for good disks, and a cheap bus for cheap disks, so approx 5 disks max out approx 1 bus.)
>
> In my personal experience, servers are generally built with a separate bus for approx every 5-7 disk slots. So what it really comes down to is ...
>
> Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."

This is assuming that you have no other I/O besides the scrub.

You should of course keep the number of disks in a vdev low for general performance reasons unless you only have linear reads (as your IOPS will be close to what only one disk can give for the whole vdev).
On Wed, Sep 8, 2010 at 6:27 AM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.

No, it (a 21-disk raidz3 vdev) most certainly will not resilver in the same amount of time. In fact, I highly doubt it would resilver at all.

My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE multilane controllers. Nice 10 TB storage pool. Worked beautifully as we filled it with data. Had less than 50% usage when a disk died.

No problem, it's ZFS, it's meant to be easy to replace a drive: just offline, swap, replace, wait for it to resilver.

Well, 3 days later, it was still under 10%, and every disk light was still solid green. SNMP showed over 100 MB/s of disk I/O continuously, and the box was basically unusable (5 minutes to get the password line to appear on the console).

Tried rebooting a few times, stopped all disk I/O to the machine (it was our backups box, running rsync every night for - at the time - 50+ remote servers), let it do its thing.

After 3 weeks of trying to get the resilver to complete (or even reach 50%), we pulled the plug and destroyed the pool, rebuilding it using 3x 8-drive raidz2 vdevs. Things have been a lot smoother ever since. Have replaced 8 of the drives (1 vdev) with 1.5 TB drives. Have replaced multiple dead drives. Resilvers, while running outgoing rsync all day and incoming rsync all night, take 3 days for a 1.5 TB drive (with SNMP showing 300 MB/s disk I/O).

You most definitely do not want to use a single super-wide raidz vdev. It just won't work.

> Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."

Yeah, I still don't buy it. Even spreading disks out such that you have 4 SATA drives per PCI-X/PCIe bus, I don't think you'd be able to get a 500 GB SATA disk to resilver in a 24-disk raidz vdev (even a raidz1) in a 50% full pool. Especially if you are using the pool for anything at the same time. -- Freddie Cash fjwcash at gmail.com
On 9/8/2010 10:08 PM, Freddie Cash wrote:
> On Wed, Sep 8, 2010 at 6:27 AM, Edward Ned Harvey <shill at nedharvey.com> wrote:
>> Both of the above situations resilver in equal time, unless there is a bus bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7 disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21 disks in a single raidz3 provides better redundancy than 3 vdevs each containing a 7-disk raidz1.
>
> No, it (a 21-disk raidz3 vdev) most certainly will not resilver in the same amount of time. In fact, I highly doubt it would resilver at all.
>
> My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE multilane controllers. Nice 10 TB storage pool. Worked beautifully as we filled it with data. Had less than 50% usage when a disk died.
>
> No problem, it's ZFS, it's meant to be easy to replace a drive: just offline, swap, replace, wait for it to resilver.
>
> Well, 3 days later, it was still under 10%, and every disk light was still solid green. SNMP showed over 100 MB/s of disk I/O continuously, and the box was basically unusable (5 minutes to get the password line to appear on the console).
>
> Tried rebooting a few times, stopped all disk I/O to the machine (it was our backups box, running rsync every night for - at the time - 50+ remote servers), let it do its thing.
>
> After 3 weeks of trying to get the resilver to complete (or even reach 50%), we pulled the plug and destroyed the pool, rebuilding it using 3x 8-drive raidz2 vdevs. Things have been a lot smoother ever since. Have replaced 8 of the drives (1 vdev) with 1.5 TB drives. Have replaced multiple dead drives. Resilvers, while running outgoing rsync all day and incoming rsync all night, take 3 days for a 1.5 TB drive (with SNMP showing 300 MB/s disk I/O).
>
> You most definitely do not want to use a single super-wide raidz vdev. It just won't work.
>
>> Instead of the Best Practices Guide saying "Don't put more than ___ disks into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck by constructing your vdevs using physical disks which are distributed across multiple buses, as necessary per the speed of your disks and buses."
>
> Yeah, I still don't buy it. Even spreading disks out such that you have 4 SATA drives per PCI-X/PCIe bus, I don't think you'd be able to get a 500 GB SATA disk to resilver in a 24-disk raidz vdev (even a raidz1) in a 50% full pool. Especially if you are using the pool for anything at the same time.

The thing that folks tend to forget is that RaidZ is IOPS limited. For the most part, if I want to reconstruct a single slab (stripe) of data, I have to issue a read to EACH disk in the vdev, and wait for that disk to return the value, before I can write the computed parity value out to the disk under reconstruction.

This is *regardless* of the amount of data being reconstructed.

So, the bottleneck tends to be the IOPS value of the single disk being reconstructed. Thus, having fewer disks in a vdev leads to less data being required to be resilvered, which leads to fewer IOPS being required to finish the resilver.

Example (for ease of calculation, let's do the disk-drive mfg's cheat of 1k = 1000 bytes):

Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk (4x32k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive.

Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k = 71e6 IOPS to complete.

From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS bound by the single disk resilvering, you have a fixed limit.

In addition, remember that having more disks means you have to wait longer for each IOPS to complete. That is, it takes longer (fractionally, but in the aggregate, a measurable amount) for 9 drives to each return 14k of info than it does for 4 drives to return 32k of data. This is due to rotational and seek access delays. So, not only are you having to do more total IOPS in Scenario 2, but each IOPS takes longer to complete (the read cycle taking longer, the write/reconstruct cycle taking the same amount of time).

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
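A small sketch of that IOPS arithmetic (assuming a 128k record split evenly across the data disks, one I/O per record on the rebuilding disk, and the 1k = 1000 bytes convention used in the post):

# I/Os needed to rebuild a 1 TB drive in a raidz1, following the reasoning
# above. Assumes 128k records split evenly across data disks and one I/O
# per record on the rebuilding disk (1k = 1000 bytes, as in the post).

def rebuild_ios(disks, parity=1, record_kb=128, drive_tb=1):
    data_disks = disks - parity
    per_disk_kb = record_kb / data_disks   # portion of each record per disk
    drive_kb = drive_tb * 1e9              # 1 TB in KB (vendor math)
    return per_disk_kb, drive_kb / per_disk_kb

print(rebuild_ios(5))    # ~32 KB per I/O  -> ~31e6 I/Os
print(rebuild_ios(10))   # ~14 KB per I/O  -> ~70e6 I/Os (the post's ~71e6)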
Erik: does that mean that keeping the number of data drives in a raidz(n) to a power of two is better? In the example you gave, you mentioned 14kB being written to each drive. That doesn't sound very efficient to me.

(When I say the above, I mean a five-disk raidz or a ten-disk raidz2, etc.)

Cheers,

On 9 September 2010 18:58, Erik Trimble <erik.trimble at oracle.com> wrote:
> The thing that folks tend to forget is that RaidZ is IOPS limited. For the most part, if I want to reconstruct a single slab (stripe) of data, I have to issue a read to EACH disk in the vdev, and wait for that disk to return the value, before I can write the computed parity value out to the disk under reconstruction.
>
> This is *regardless* of the amount of data being reconstructed.
>
> So, the bottleneck tends to be the IOPS value of the single disk being reconstructed. Thus, having fewer disks in a vdev leads to less data being required to be resilvered, which leads to fewer IOPS being required to finish the resilver.
>
> Example (for ease of calculation, let's do the disk-drive mfg's cheat of 1k = 1000 bytes):
>
> Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab sizes. Thus, I have 32k of data for each slab written to each disk (4x32k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to reconstruct the full 1TB drive.
>
> Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes. In this case, there's only about 14k of data on each drive for a slab. This means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k = 71e6 IOPS to complete.
>
> From this, it can be pretty easy to see that the number of required IOPS to the resilvered disk goes up linearly with the number of data drives in a vdev. Since you're always going to be IOPS bound by the single disk resilvering, you have a fixed limit.
>
> In addition, remember that having more disks means you have to wait longer for each IOPS to complete. That is, it takes longer (fractionally, but in the aggregate, a measurable amount) for 9 drives to each return 14k of info than it does for 4 drives to return 32k of data. This is due to rotational and seek access delays. So, not only are you having to do more total IOPS in Scenario 2, but each IOPS takes longer to complete (the read cycle taking longer, the write/reconstruct cycle taking the same amount of time).
>
> -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
On 9/9/2010 2:15 AM, taemun wrote:
> Erik: does that mean that keeping the number of data drives in a raidz(n) to a power of two is better? In the example you gave, you mentioned 14kB being written to each drive. That doesn't sound very efficient to me.
>
> (When I say the above, I mean a five-disk raidz or a ten-disk raidz2, etc.)
>
> Cheers,

Well, since the size of a slab can vary (from 512 bytes to 128k), it's hard to say. Length (size) of the slab is likely the better determination. Remember, each block on a hard drive is 512 bytes (for now). So, it's really not any more efficient to write 16k than 14k (or vice versa). Both are integer multiples of 512 bytes.

IIRC, there was something about using a power-of-two number of data drives in a RAIDZ, but I can't remember what that was. It may just be a phantom memory.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Very interesting... Well, let's see if we can do the numbers for my setup.

From a previous post of mine:

[i]This is my exact breakdown (cheap disks on a cheap bus :P) :

PCI-E 8X 4-port ESata RAID controller.
4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on the controller).
20 x Samsung 1TB HDDs (each connected to a port multiplier).

The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time.

So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB.
Best case scenario, we can read 7.2TB at 3Gbps
= 57.6 Tb at 3Gbps
= 57600 Gb at 3Gbps
= 19200 seconds
= 320 minutes
= 5 hours 20 minutes.

Even if it takes twice that amount of time, I'm happy.

Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 data disks per PM] x 2, [2 data disks & 1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs.

I've never actually sat and done the math before. Hope it's decently accurate :)[/i]

My scenario, as from Erik's post: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.

Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So that's 7600GB, which is 60800Gb. There will be no other I/O while a rebuild is going.
Best case: I'll read at 12Gbps & write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB. -- This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Freddie Cash
>
> No, it (a 21-disk raidz3 vdev) most certainly will not resilver in the
> same amount of time. In fact, I highly doubt it would resilver at all.
>
> My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB
> Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE
> multilane controllers. Nice 10 TB storage pool. Worked beautifully as
> we filled it with data. Had less than 50% usage when a disk died.
>
> No problem, it's ZFS, it's meant to be easy to replace a drive: just
> offline, swap, replace, wait for it to resilver.
>
> Well, 3 days later, it was still under 10%, and every disk light was
> still solid green. SNMP showed over 100 MB/s of disk I/O continuously,

I don't believe your situation is typical. I think you either encountered a bug, or you had something happening that you weren't aware of (scrub, autosnapshots, etc) ... because the only time I've ever seen anything remotely similar to the behavior you described was the bug I've mentioned in other emails, which occurs when the disk is 100% full and a scrub is taking place. I know it's not the same bug for you, because you said your pool was only 50% full. But I don't believe that what you saw was normal or typical.
On 9/9/2010 5:49 AM, hatish wrote:
> Very interesting...
>
> Well, let's see if we can do the numbers for my setup.
>
> From a previous post of mine:
>
> [i]This is my exact breakdown (cheap disks on a cheap bus :P) :
>
> PCI-E 8X 4-port ESata RAID controller.
> 4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on the controller).
> 20 x Samsung 1TB HDDs (each connected to a port multiplier).
>
> The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time.
>
> So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB.
> Best case scenario, we can read 7.2TB at 3Gbps
> = 57.6 Tb at 3Gbps
> = 57600 Gb at 3Gbps
> = 19200 seconds
> = 320 minutes
> = 5 hours 20 minutes.
>
> Even if it takes twice that amount of time, I'm happy.
>
> Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 data disks per PM] x 2, [2 data disks & 1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs.
>
> I've never actually sat and done the math before. Hope it's decently accurate :)[/i]
>
> My scenario, as from Erik's post: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.
>
> Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So that's 7600GB, which is 60800Gb. There will be no other I/O while a rebuild is going.
> Best case: I'll read at 12Gbps & write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB.

Actually, your biggest bottleneck will be the IOPS limits of the drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.

So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is over 173 hours. Or, about 7.25 WEEKS.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
On Thu, Sep 9, 2010 at 09:03, Erik Trimble <erik.trimble at oracle.com> wrote:
> Actually, your biggest bottleneck will be the IOPS limits of the drives. A
> 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
>
> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
> IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is
> over 173 hours. Or, about 7.25 WEEKS.

No argument on IOPS, but 173 hours is 7 days, or a little over one week.

Will
Ahhhh, I see. But I think your math is a bit out: 62.5e6 I/Os @ 100 IOPS = 625000 seconds = 10416m = 173h = 7D6h. So 7 days & 6 hours.

That's long, but I can live with it. This isn't for an enterprise environment. While the length of time is a worry in terms of increasing the chance another drive will fail, in my mind that is mitigated by the fact that the drives won't be under major stress during that time. It's a workable solution.

On Thu, Sep 9, 2010 at 3:03 PM, Erik Trimble <erik.trimble at oracle.com> wrote:
> On 9/9/2010 5:49 AM, hatish wrote:
>> Very interesting...
>>
>> Well, let's see if we can do the numbers for my setup.
>>
>> From a previous post of mine:
>>
>> [i]This is my exact breakdown (cheap disks on a cheap bus :P) :
>>
>> PCI-E 8X 4-port ESata RAID controller.
>> 4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on the controller).
>> 20 x Samsung 1TB HDDs (each connected to a port multiplier).
>>
>> The PCIe 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore a 12Gbps limit on the controller. Each PM can give up to 3Gbps, which is shared amongst 5 drives. According to Samsung's site, max read speed is 250MBps, which translates to 2Gbps. Multiply by 5 drives and you get 10Gbps, which is 333% of the PM's capability. So the drives aren't likely to hit max read speed for long lengths of time, especially during rebuild time.
>>
>> So the bus is going to be quite a bottleneck. Let's assume that the drives are 80% full. That's 800GB that needs to be read on each drive, which is (800x9) 7.2TB.
>> Best case scenario, we can read 7.2TB at 3Gbps
>> = 57.6 Tb at 3Gbps
>> = 57600 Gb at 3Gbps
>> = 19200 seconds
>> = 320 minutes
>> = 5 hours 20 minutes.
>>
>> Even if it takes twice that amount of time, I'm happy.
>>
>> Initially I had been thinking 2 PMs for each vdev. But now I'm thinking maybe split it as wide as I can ([2 data disks per PM] x 2, [2 data disks & 1 parity disk per PM] x 2) for each vdev. It'll give the best possible speed, but still won't max out the HDDs.
>>
>> I've never actually sat and done the math before. Hope it's decently accurate :)[/i]
>>
>> My scenario, as from Erik's post: I have 10 1TB disks in a raidz2, and I have 128k slab sizes. Thus, I have 16k of data for each slab written to each disk (8x16k data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 16k of data on the failed drive. It thus takes about 1TB/16k = 62.5e6 IOPS to reconstruct the full 1TB drive.
>>
>> Let's assume the drives are at 95% capacity, which is a pretty bad scenario. So that's 7600GB, which is 60800Gb. There will be no other I/O while a rebuild is going.
>> Best case: I'll read at 12Gbps & write at 3Gbps (4:1). I read 128K for every 16K I write (8:1). Hence the read bandwidth will be the bottleneck. So 60800Gb @ 12Gbps is 5066s, which is 84m27s (never gonna happen). A more realistic read of 1.5Gbps gives me 40533s, which is 675m33s, which is 11h15m33s. Which is a more realistic time to read 7.6TB.
>
> Actually, your biggest bottleneck will be the IOPS limits of the drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
>
> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100 IOPS, that means you will finish (best case) in 62.5e4 seconds. Which is over 173 hours. Or, about 7.25 WEEKS.
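A quick check of that figure (assuming ~100 IOPS on the rebuilding 7200 RPM drive and 16 KB of data reconstructed per I/O, as in the 10-disk raidz2 scenario above):

# IOPS-limited resilver estimate for the 10-disk raidz2 scenario above.
# Assumes ~100 IOPS on the rebuilding 7200 RPM drive and 16 KB reconstructed
# per I/O (1 TB drive, vendor 1k = 1000 bytes math).

total_ios = 1e12 / 16e3      # ~62.5e6 I/Os to rebuild the whole drive
iops      = 100

seconds = total_ios / iops
print(f"{seconds:.0f} s = {seconds / 3600:.0f} h = {seconds / 86400:.1f} days")
# 625000 s = 174 h = 7.2 days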
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Erik Trimble
>
> the thing that folks tend to forget is that RaidZ is IOPS limited. For
> the most part, if I want to reconstruct a single slab (stripe) of data,
> I have to issue a read to EACH disk in the vdev, and wait for that disk
> to return the value, before I can write the computed parity value out to
> the disk under reconstruction.

If I'm trying to interpret your whole message, Erik, and condense it, I think I get the following. Please tell me if and where I'm wrong.

In any given zpool, some number of slabs are used in the whole pool. In raidzN, a portion of each slab is written on each disk. Therefore, during resilver, if there are a total of 1 million slabs used in the zpool, it means each good disk will need to read 1 million partial slabs, and the replaced disk will need to write 1 million partial slabs. Each good disk receives a read request in parallel, and all of them must complete before a write is given to the new disk. Each read/write cycle is completed before the next cycle begins. (It seems this could be accelerated by allowing all the good disks to continue reading in parallel instead of waiting, right?)

The conclusion I would reach is:

Given no bus bottleneck:

It is true that resilvering a raidz will be slower with many disks in the vdev, because the average latency for the worst of N disks will increase as N increases. But that effect is only marginal, and bounded between the average latency of a single disk and the worst-case latency of a single disk.

The characteristic that *really* makes a big difference is the number of slabs in the pool, i.e. whether your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files.
> From: Hatish Narotam [mailto:hatish at gmail.com]
>
> PCI-E 8X 4-port ESata RAID controller.
> 4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on
> the controller).
> 20 x Samsung 1TB HDDs (each connected to a port multiplier).

Assuming your disks can all sustain 500Mbit/sec, which I find to be typical for 7200rpm SATA disks, and you have groups of 5 that all have a 3Gbit upstream bottleneck, it means each of your groups of 5 should be fine in a raidz1 configuration.

You think that your SATA card can do 32Gbit because it's on a PCIe x8 bus. I highly doubt it unless you paid a grand or two for your SATA controller, but please prove me wrong. ;-) I think the backplane of the SATA controller is more likely either 3G or 6G.

If it's 3G, then you should use 4 groups of raidz1.
If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of 500Mbit can only sustain 5Gbit).
If it's 12G or higher, then you can make all of your drives one big vdev of raidz3.

> According to Samsung's site, max read speed is 250MBps, which
> translates to 2Gbps. Multiply by 5 drives gives you 10Gbps.

I guarantee you this is not a sustainable speed for 7.2krpm SATA disks. You can get a decent measure of sustainable speed by doing something like:

(write 1G byte)
time dd if=/dev/zero of=/some/file bs=1024k count=1024
(beware: you might get an inaccurate speed measurement here due to ram buffering. See below.)

(reboot to ensure nothing is in cache)
(read 1G byte)
time dd if=/some/file of=/dev/null bs=1024k
(Now you're certain you have a good measurement. If it matches the measurement you had before, that means your original measurement was also accurate. ;-) )
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> The characteristic that *really* makes a big difference is the number of
> slabs in the pool, i.e. whether your filesystem is composed of mostly small
> files or fragments, versus mostly large unfragmented files.

Oh, if at least some of my reasoning was correct, there is one valuable take-away point for hatish:

Given some number X of total slabs used in the whole pool: If you use a single vdev for the whole pool, you will have X partial slabs written on each disk. If you have 2 vdevs, you'll have approx X/2 partial slabs written on each disk. 3 vdevs ~> X/3 partial slabs on each disk. Therefore, the resilver time approximately divides by the number of separate vdevs you are using in your pool.

So the largest factor affecting resilver time of a single large vdev versus many smaller vdevs is NOT the quantity of data written on each disk, but just the fact that fewer slabs are used on each disk when using smaller vdevs.

If you want to choose between (a) a 21-disk raidz3 versus (b) 3 vdevs of 7-disk raidz1 each, then: the raidz3 provides better redundancy, but has the disadvantage that every slab must be partially written on every disk.
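A minimal sketch of that take-away (assuming slabs are spread evenly across vdevs, one I/O per partial slab on the replaced disk, and the ~100 IOPS figure from earlier in the thread):

# With more vdevs, each disk carries fewer partial slabs, so the rebuilding
# disk needs fewer I/Os. Assumes slabs spread evenly across vdevs, one I/O
# per partial slab on the new disk, and ~100 IOPS on that disk.

def rebuild_estimate(total_slabs, vdevs, iops=100):
    ios = total_slabs / vdevs           # partial slabs on the replaced disk
    return ios, ios / iops / 3600       # (I/Os, hours at the IOPS limit)

for vdevs in (1, 2, 3):
    ios, hours = rebuild_estimate(1_000_000, vdevs)
    print(f"{vdevs} vdev(s): {ios:.0f} I/Os on the new disk, ~{hours:.1f} h")
# 1 vdev: ~2.8 h, 2 vdevs: ~1.4 h, 3 vdevs: ~0.9 h (for 1e6 slabs total)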
Hi,

*The PCIE 8x port gives me 4GBps, which is 32Gbps. No problem there. Each ESata port guarantees 3Gbps, therefore 12Gbps limit on the controller.*

I was simply listing the bandwidth available at the different stages of the data cycle. The PCIe port gives me 32Gbps. The SATA card gives me a possible 12Gbps. I'd rather be cautious and assume I'll get more like 6Gbps - it is a cheap card after all.

*I guarantee you this is not a sustainable speed for 7.2krpm SATA disks.* (I am well aware :) )

*Which is 333% of the PM's capability.* Assuming that it is, 5 drives at that speed will max out my PM 3 times over. So my PM will automatically throttle the drives' speed to a third of that on account of the PM being maxed out.

Thanks for the rough I/O speed check :)

On Thu, Sep 9, 2010 at 3:20 PM, Edward Ned Harvey <shill at nedharvey.com> wrote:
> > From: Hatish Narotam [mailto:hatish at gmail.com]
> >
> > PCI-E 8X 4-port ESata RAID controller.
> > 4 x ESata-to-5-SATA port multipliers (each connected to an ESata port on
> > the controller).
> > 20 x Samsung 1TB HDDs (each connected to a port multiplier).
>
> Assuming your disks can all sustain 500Mbit/sec, which I find to be typical for 7200rpm SATA disks, and you have groups of 5 that all have a 3Gbit upstream bottleneck, it means each of your groups of 5 should be fine in a raidz1 configuration.
>
> You think that your SATA card can do 32Gbit because it's on a PCIe x8 bus. I highly doubt it unless you paid a grand or two for your SATA controller, but please prove me wrong. ;-) I think the backplane of the SATA controller is more likely either 3G or 6G.
>
> If it's 3G, then you should use 4 groups of raidz1.
> If it's 6G, then you can use 2 groups of raidz2 (because 10 drives of 500Mbit can only sustain 5Gbit).
> If it's 12G or higher, then you can make all of your drives one big vdev of raidz3.
>
> > According to Samsung's site, max read speed is 250MBps, which
> > translates to 2Gbps. Multiply by 5 drives gives you 10Gbps.
>
> I guarantee you this is not a sustainable speed for 7.2krpm SATA disks. You can get a decent measure of sustainable speed by doing something like:
>
> (write 1G byte)
> time dd if=/dev/zero of=/some/file bs=1024k count=1024
> (beware: you might get an inaccurate speed measurement here due to ram buffering. See below.)
>
> (reboot to ensure nothing is in cache)
> (read 1G byte)
> time dd if=/some/file of=/dev/null bs=1024k
> (Now you're certain you have a good measurement. If it matches the measurement you had before, that means your original measurement was also accurate. ;-) )
Erik wrote:
> Actually, your biggest bottleneck will be the IOPS limits of the drives.
> A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
> IOPS, that means you will finish (best case) in 62.5e4 seconds. Which
> is over 173 hours. Or, about 7.25 WEEKS.

My OCD is coming out and I will split that hair with you. 173 hours is just over a week.

This is a fascinating and timely discussion. My personal (biased and unhindered by facts) preference is wide-stripe RAIDZ3. Ned is right that I kept reading that RAIDZx should not exceed _ devices and couldn't find real numbers behind those conclusions. Discussions in this thread have opened my eyes a little, and I am in the middle of deploying a second 22-disk fibre array on my home server, so I have been struggling with the best way to allocate pools.

Up until reading this thread, the biggest downside to wide stripes that I was aware of has been low IOPS. And let's be clear: while on paper the IOPS of a wide stripe is the same as a single disk, it actually is worse. In truth, the service time for any request on a wide stripe is the service time of the SLOWEST disk for that request. The slowest disk may vary from request to request, but will always delay the entire stripe operation.

Since all of the 44 spindles are 15K disks, I am about to convince myself to go with two pools of wide stripes and keep several spindles for L2ARC and SLOG. The thinking is that other background operations (scrub and resilver) can take place with little impact on application performance, since those will be using L2ARC and SLOG.

Of course, I could be wrong on any of the above.

Cheers, Marty -- This message posted from opensolaris.org
On 9/9/2010 6:19 AM, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Erik Trimble
>>
>> the thing that folks tend to forget is that RaidZ is IOPS limited. For
>> the most part, if I want to reconstruct a single slab (stripe) of data,
>> I have to issue a read to EACH disk in the vdev, and wait for that disk
>> to return the value, before I can write the computed parity value out to
>> the disk under reconstruction.
>
> If I'm trying to interpret your whole message, Erik, and condense it, I think I get the following. Please tell me if and where I'm wrong.
>
> In any given zpool, some number of slabs are used in the whole pool. In raidzN, a portion of each slab is written on each disk. Therefore, during resilver, if there are a total of 1 million slabs used in the zpool, it means each good disk will need to read 1 million partial slabs, and the replaced disk will need to write 1 million partial slabs. Each good disk receives a read request in parallel, and all of them must complete before a write is given to the new disk. Each read/write cycle is completed before the next cycle begins. (It seems this could be accelerated by allowing all the good disks to continue reading in parallel instead of waiting, right?)
>
> The conclusion I would reach is:
>
> Given no bus bottleneck:
>
> It is true that resilvering a raidz will be slower with many disks in the vdev, because the average latency for the worst of N disks will increase as N increases. But that effect is only marginal, and bounded between the average latency of a single disk and the worst-case latency of a single disk.
>
> The characteristic that *really* makes a big difference is the number of slabs in the pool, i.e. whether your filesystem is composed of mostly small files or fragments, versus mostly large unfragmented files.

Oh, and a mea culpa on converting hours to weeks instead of days. I did the math, then forgot which unit I was dealing in. Oops.

Your reading of my posts is correct. Indeed, the number of slabs is critical, as this directly impacts the IOPS needed.

One of the very nice speedups for resilvering would be the ability to do a larger "read" of several contiguous slabs (as physically laid out on the disks) in a single IOPS - the difference between reading a 128k slab portion and 5 consecutive 64k slab portions is trivial, so the ability to do more than one slab at a time would be critical for improving resilver times. I have *no* idea how hard this is - given that resilvering currently walks the space allocation tree (which is in creation-time order), it generally doesn't get good consecutive slab requests this way, so things would have to change from being tree-driven to being layout-on-disk-driven.

-- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Comment at end...

Mattias Pantzare wrote:
> On Wed, Sep 8, 2010 at 15:27, Edward Ned Harvey <shill at nedharvey.com> wrote:
>>> From: pantzare at gmail.com [mailto:pantzare at gmail.com] On Behalf Of
>>> Mattias Pantzare
>>>
>>> It is about 1 vdev with 12 disks or 2 vdevs with 6 disks. If you have 2
>>> vdevs you have to read half the data compared to 1 vdev to resilver a disk.
>>
>> Let's suppose you have 1T of data. You have a 12-disk raidz2. So you have
>> approx 100G on each disk, and you replace one disk. Then 11 disks will each
>> read 100G, and the new disk will write 100G.
>>
>> Let's suppose you have 1T of data. You have 2 vdevs that are each 6-disk
>> raidz1. Then we'll estimate 500G is on each vdev, so each disk has approx
>> 100G. You replace a disk. Then 5 disks will each read 100G, and 1 disk
>> will write 100G.
>>
>> Both of the above situations resilver in equal time, unless there is a bus
>> bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7
>> disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21
>> disks in a single raidz3 provides better redundancy than 3 vdevs each
>> containing a 7-disk raidz1.
>>
>> In my personal experience, approx 5 disks can max out approx 1 bus. (It
>> actually ranges from 2 to 7 disks, if you have an imbalance of cheap disks
>> on a good bus, or good disks on a crap bus, but generally speaking people
>> don't do that. Generally people get a good bus for good disks, and cheap
>> disks for a crap bus, so approx 5 disks max out approx 1 bus.)
>>
>> In my personal experience, servers are generally built with a separate bus
>> for approx every 5-7 disk slots. So what it really comes down to is ...
>>
>> Instead of the Best Practices Guide saying "Don't put more than ___ disks
>> into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck
>> by constructing your vdevs using physical disks which are distributed
>> across multiple buses, as necessary per the speed of your disks and buses."
>
> This is assuming that you have no other IO besides the scrub.
>
> You should of course keep the number of disks in a vdev low for
> general performance reasons unless you only have linear reads (as your
> IOPS will be close to what only one disk can give for the whole vdev).

There is another optimization in the Best Practices Guide that says the number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8. I.e. 2^n + P, where n is 1, 2, or 3 and P is the RAIDZ level.

I.e. optimal sizes:
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev
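For reference, a few lines of Python regenerate that table from the "2^n data disks plus parity" rule quoted above; it is pure arithmetic and does not inspect an actual pool.

    # Enumerate the "2^n data disks + parity" vdev widths behind the table above.

    def optimal_widths(parity, exponents=(1, 2, 3)):
        return [2 ** n + parity for n in exponents]

    for parity, name in ((1, "RAIDZ1"), (2, "RAIDZ2"), (3, "RAIDZ3")):
        print(name, optimal_widths(parity))
    # RAIDZ1 [3, 5, 9]
    # RAIDZ2 [4, 6, 10]
    # RAIDZ3 [5, 7, 11]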
Erik Trimble wrote:
> On 9/9/2010 2:15 AM, taemun wrote:
>> Erik: does that mean that keeping the number of data drives in a
>> raidz(n) to a power of two is better? In the example you gave, you
>> mentioned 14kb being written to each drive. That doesn't sound very
>> efficient to me.
>>
>> (when I say the above, I mean a five disk raidz or a ten disk raidz2, etc)
>>
>> Cheers,
>
> Well, since the size of a slab can vary (from 512 bytes to 128k), it's
> hard to say. Length (size) of the slab is likely the better
> determination. Remember each block on a hard drive is 512 bytes (for
> now). So, it's really not any more efficient to write 16k than 14k
> (or vice versa). Both are integer multiples of 512 bytes.
>
> IIRC, there was something about using a power-of-two number of data
> drives in a RAIDZ, but I can't remember what that was. It may just be
> a phantom memory.

Not a phantom memory... From Matt Ahrens in a thread titled 'Metaslab alignment on RAID-Z':
http://www.opensolaris.org/jive/thread.jspa?messageID=60241

'To eliminate the blank "round up" sectors for power-of-two blocksizes of 8k or larger, you should use a power-of-two plus 1 number of disks in your raid-z group -- that is, 3, 5, or 9 disks (for double-parity, use a power-of-two plus 2 -- that is, 4, 6, or 10). Smaller blocksizes are more constrained; for 4k, use 3 or 5 disks (for double parity, use 4 or 6) and for 2k, use 3 disks (for double parity, use 4).'

These round-up sectors are skipped and used as padding to simplify space accounting and improve performance. I may have referred to them as zero-padding sectors in other posts, however they're not necessarily zeroed. See the thread titled 'raidz stripe size (not stripe width)':
http://opensolaris.org/jive/thread.jspa?messageID=495351

This looks to be the reasoning behind the optimization in the ZFS Best Practices Guide that says the number of devices in a vdev should be (N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8. I.e. 2^n + P, where n is 1, 2, or 3 and P is the RAIDZ level.

I.e. optimal sizes:
RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

The Best Practices Guide recommendation of 3-9 devices per vdev appears to be based on RAIDZ1's optimal sizes of 3-9 devices when n = 1 to 3 in 2^n + P. Victor Latushkin said the same thing in a thread titled 'odd versus even'. Adam Leventhal said this had a 'very slight space-efficiency benefit' in the same thread.
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg05460.html

---

That said, the recommendations in the Best Practices Guide for RAIDZ2 to start with 5 disks and RAIDZ3 to start with 8 disks do not match the last recommendation. What is the reasoning behind 5 and 8? Reliability vs space?

Start a single-parity RAIDZ (raidz) configuration at 3 disks (2+1)
Start a double-parity RAIDZ (raidz2) configuration at 5 disks (3+2)
Start a triple-parity RAIDZ (raidz3) configuration at 8 disks (5+3)
(N+P) with P = 1 (raidz), 2 (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8

Perhaps the Best Practices Guide should also recommend:
- the use of striped vdevs in order to bring up the IOPS number, particularly when using enough hard drives to meet the capacity and reliability requirements
- avoiding slow consumer-class drives (fast ones may be okay for some users)
- more sample array configurations for common drive chassis capacities
- consider using a RAIDZ1 main pool with a RAIDZ1 backup pool rather than higher-level RAIDZ or mirroring (touch on the value of backup vs. stronger RAIDZ)
- watch out for BIOS or firmware upgrades that change host protected area (HPA) settings on drives, making them appear smaller than before

The BPG should also resolve this discrepancy:

Storage Pools section: "For production systems, use whole disks rather than slices for storage pools for the following reasons"

Additional Cautions for Storage Pools: "Consider planning ahead and reserving some space by creating a slice which is smaller than the whole disk instead of the whole disk."

---

Other (somewhat) related threads:

From Darren Dunham in a thread titled 'ZFS raidz2 number of disks':
http://groups.google.com/group/comp.unix.solaris/browse_thread/thread/dd1b5997bede5265

'> 1 Why is the recommendation for a raidz2 3-9 disks, what are the cons for having 16 in a pool compared to 8?

Reads potentially have to pull data from all data columns to reconstruct a filesystem block for verification. For random read workloads, increasing the number of columns in the raidz does not increase the read iops. So limiting the column count usually makes sense (with a cost tradeoff). 16 is valid, but not recommended.'

From Richard Elling in a thread titled 'rethinking RaidZ and Record size':
http://opensolaris.org/jive/thread.jspa?threadID=121016

'The raidz pathological worst case is a random read from a many-column raidz where files have records 128 KB in size. The inflated read problem is why it makes sense to match recordsize for fixed-record workloads. This includes CIFS workloads which use 4 KB records. It is also why having many columns in the raidz for large records does not improve performance. Hence the 3 to 9 raidz disk limit recommendation in the zpool man page.'

From Adam Leventhal in a thread titled 'Problem with RAID-Z in builds snv_120 - snv_123':
http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg28907.html

'Basically, RAID-Z writes full stripes every time; note that without careful accounting it would be possible to effectively fragment the vdev such that single sectors were free but useless, since single-parity RAID-Z requires two adjacent sectors to store data (one for data, one for parity). To address this, RAID-Z rounds up its allocation to the next (nparity + 1). This ensures that all space is accounted for. RAID-Z will thus skip sectors that are unused based on this rounding. For example, under raidz1 a write of 1024 bytes would result in 512 bytes of parity, 512 bytes of data on two devices, and 512 bytes skipped.

To improve performance, ZFS aggregates multiple adjacent IOs into a single large IO. Further, hard drives themselves can perform aggregation of adjacent IOs. We noted that these skipped sectors were inhibiting performance, so we added "optional" IOs that could be used to improve aggregation. This yielded a significant performance boost for all RAID-Z configurations.'

From Adam Leventhal in a thread titled 'triple-parity: RAID-Z3':
http://opensolaris.org/jive/thread.jspa?threadID=108154

'> So I'm not sure what the 'RAID-Z should mind the gap on writes'
> comment is getting at either.
>
> Clarification?
I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly.

Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about.

Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks.'
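To make the round-up rule concrete, here is a simplified sketch of the allocation accounting described in those quotes. It is my own model, not the actual ZFS allocator code, but it reproduces Adam's 1024-byte raidz1 example and Matt Ahrens' 4k guidance above.

    import math

    SECTOR = 512  # bytes; assumes 512-byte sectors as in the quoted examples

    def raidz_sectors(write_bytes, width, nparity):
        # Simplified model: data sectors, plus one parity sector per row of
        # (width - nparity) data sectors, with the total rounded up to a
        # multiple of (nparity + 1); the remainder is skipped padding.
        data = math.ceil(write_bytes / SECTOR)
        parity = math.ceil(data / (width - nparity)) * nparity
        total = data + parity
        skipped = -total % (nparity + 1)
        return data, parity, skipped

    # Adam's raidz1 example: a 1024-byte write
    print(raidz_sectors(1024, width=5, nparity=1))   # (2, 1, 1) -> one skipped sector
    # Matt Ahrens' 4k guidance: a 5-disk raidz1 needs no round-up, a 4-disk does
    print(raidz_sectors(4096, width=5, nparity=1))   # (8, 2, 0)
    print(raidz_sectors(4096, width=4, nparity=1))   # (8, 3, 1)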
> From: Haudy Kazemi [mailto:kaze0010 at umn.edu]
>
> There is another optimization in the Best Practices Guide that says the
> number of devices in a vdev should be (N+P) with P = 1 (raidz), 2
> (raidz2), or 3 (raidz3) and N equal to 2, 4, or 8.
> I.e. 2^n + P, where n is 1, 2, or 3 and P is the RAIDZ level.
>
> I.e. optimal sizes:
> RAIDZ1 vdevs should have 3, 5, or 9 devices in each vdev
> RAIDZ2 vdevs should have 4, 6, or 10 devices in each vdev
> RAIDZ3 vdevs should have 5, 7, or 11 devices in each vdev

This sounds logical, although I don't know how real it is. The logic seems to be ... assuming slab sizes of 128K, the amount of data written to each disk within the vdev gets divided into something which is a multiple of 512b or 4K (newer drives supposedly starting to use 4K block sizes instead of 512b).

But I have doubts about the real-ness here, because ... an awful lot of the time your actual slabs are smaller than 128K, simply because you're not performing sustained sequential writes very often. But it seems to make sense that whenever you *do* have some sequential writes, you would want the data written to each disk to be a multiple of 512b or 4K. If you had a 128K slab divided across 5 data disks, then each disk would write 25.6K, and even for sustained sequential writes some degree of fragmentation would be impossible to avoid. Actually, I don't think fragmentation is technically the correct term for that behavior. It might be more appropriate to simply say it forces a less-than-100% duty cycle.

And another thing ... doesn't the checksum take up some space anyway? Even if you obeyed the BPG and used ... let's say ... 4 disks for N ... then each disk has 32K of data to write, which is a multiple of 4K and 512b ... but each disk also needs to write the checksum. So each disk writes 32K + a few bytes. Which defeats the whole purpose anyway, doesn't it?

The effect, if real at all, might be negligible. I don't know how small it is, but I'm quite certain it's not huge.
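A quick check of the per-disk arithmetic behind that alignment argument, as a minimal sketch. (Side note on the checksum question: ZFS stores block checksums in the parent block pointer rather than appended to the data, so they do not add "a few bytes" to each disk's share of the record.)

    # Per-disk share of a 128K record for a few data-disk counts, and whether
    # that share is a whole number of 512-byte or 4K sectors.

    RECORD = 128 * 1024

    for ndata in (4, 5, 8):
        per_disk = RECORD / ndata
        print(f"{ndata} data disks: {per_disk/1024:.1f} KiB/disk, "
              f"512B multiple: {per_disk % 512 == 0}, 4K multiple: {per_disk % 4096 == 0}")
    # 4 data disks: 32.0 KiB/disk, 512B multiple: True, 4K multiple: True
    # 5 data disks: 25.6 KiB/disk, 512B multiple: False, 4K multiple: False
    # 8 data disks: 16.0 KiB/disk, 512B multiple: True, 4K multiple: True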
Ahhh! So that's how the formula works. That makes perfect sense.

Let's take my case as a scenario: each of my vdevs is a 10-disk RaidZ2 (8 data + 2 parity). Using a 128K stripe, I'll have 128K/8 = 16K per data drive and 16K per parity drive. That fits both 512B and 4KB sectors.

It works in my favour that I'll have high average file sizes (>250MB), so I'll see minimal effect of the "fragmentation" mentioned.
-- This message posted from opensolaris.org
On Sep 9, 2010, at 6:39 AM, Marty Scholes wrote:
> Erik wrote:
>> Actually, your biggest bottleneck will be the IOPS limits of the
>> drives. A 7200RPM SATA drive tops out at 100 IOPS. Yup. That's it.
>> So, if you need to do 62.5e6 IOPS, and the rebuild drive can do just 100
>> IOPS, that means you will finish (best case) in 62.5e4 seconds. Which
>> is over 173 hours. Or, about 7.25 WEEKS.
>
> My OCD is coming out and I will split that hair with you. 173 hours is just over a week.
>
> This is a fascinating and timely discussion. My personal (biased and unhindered by facts) preference is wide-stripe RAIDZ3. Ned is right that I kept reading that RAIDZx should not exceed _ devices and couldn't find real numbers behind those conclusions.

There isn't a real number. We know that a 46-disk raidz stripe is a recipe for unhappiness (because people actually tried that when the thumper was released). And we know that a 2-disk raidz1 is kinda like mirroring -- a hard sell. So we had to find a number that was between the two, somewhere in the realm of reasonable.

> Discussions in this thread have opened my eyes a little and I am in the middle of deploying a second 22-disk fibre array on my home server, so I have been struggling with the best way to allocate pools.

Simple, mirror it and be happy :-).

> Up until reading this thread, the biggest downside to wide stripes that I was aware of has been low IOPS. And let's be clear: while on paper the IOPS of a wide stripe is the same as a single disk, it is actually worse. In truth, the service time for any request on a wide stripe is the service time of the SLOWEST disk for that request. The slowest disk may vary from request to request, but it will always delay the entire stripe operation.

Yes, but this is not a problem for async writes, so it will depend on the workload.

> Since all of the 44 spindles are 15K disks, I am about to convince myself to go with two pools of wide stripes and keep several spindles for L2ARC and SLOG. The thinking is that other background operations (scrub and resilver) can take place with little impact to application performance, since application I/O will be served from the L2ARC and SLOG.
>
> Of course, I could be wrong on any of the above.

If you get it wrong, you can reconfigure most things on the fly, except that you can't add columns to a raidz or shrink it. A good strategy is to start with what you need and add disks as capacity requires. Oh, and by the way, the easiest way to do that is with mirrors :-)

But if you insist on raidz, then consider something like 6-way or 8-way sets, because that is the typical denominator for most hardware trays today.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
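For a rough sense of the capacity trade-off behind that advice on a 22-slot tray, here is a small sketch. The layouts and the 1 TB per-disk figure are placeholders of mine for illustration, not recommendations from the thread.

    # Usable capacity for a few ways of carving up 22 equal disks.

    def usable_tb(disks_per_vdev, parity, vdevs, size_tb=1.0):
        return vdevs * (disks_per_vdev - parity) * size_tb

    print("11 x 2-way mirrors:", usable_tb(2, 1, 11), "TB usable")
    print("3 x 6-disk raidz2 (4 slots left for spares/L2ARC/SLOG):", usable_tb(6, 2, 3), "TB usable")
    print("2 x 11-disk raidz2 (wide stripes):", usable_tb(11, 2, 2), "TB usable")
    # -> 11.0 TB, 12.0 TB and 18.0 TB respectively: wider stripes buy capacity
    #    at the cost of per-vdev IOPS and rebuild behavior discussed earlier.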