MLR
2012-Mar-21 03:16 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
I read the "ZFS_Best_Practices_Guide" and "ZFS_Evil_Tuning_Guide", and have some questions: 1. Cache device for L2ARC Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool setup shouldn''t we be reading at least at the 500MB/s read/write range? Why would we want a ~500MB/s cache? 2. ZFS dynamically strips along the top-most vdev''s and that "performance for 1 vdev is equivalent to performance of one drive in that group". Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, the disks will go ~100MB/s each , this zpool would theoretically read/write at ~100MB/s max (how about real world average?)? If this was RAID6 I think this would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 10x- 14x faster than zfs, and both provide the same amount of redundancy)? Is my thinking off in the RAID6 or RAIDZ2 numbers? Why doesn''t ZFS try to dynamically strip inside vdevs (and if it is, is there an easy to understand explanation why a vdev doesn''t read from multiple drives at once when requesting data, or why a zpool wouldn''t make N number of requests to a vdev with N being the number of disks in that vdev)? Since "performance for 1 vdev is equivalent to performance of one drive in that group" it seems like the higher raidzN are not very useful. If your using raidzN your probably looking for a lower than mirroring parity (aka 10%-33%), but if you try to use raidz3 with 15% parity your putting 20 HDDs in 1 vdev which is terrible (almost unimaginable) if your running at 1/20 the "ideal" performance. Main Question: 3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure (Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I want around 20%-25% parity. My system is like so: Main Application: Home NAS * Like to optimize max space with 20%(ideal) or 25% parity - would like ''decent'' reading performance - ''decent'' being max of 10GigE Ethernet, right now it is only 1 gigabit Ethernet but hope to leave room to update in future if 10GigE becomes cheaper. My RAID5 runs at ~500MB/s so was hoping to get at least above that with the 20 disk raid. * 16GB RAM * Open to using ZIL/L2ARC, but, left out for now: writing doesn''t occur much (~7GB a week, maybe a big burst every couple months), and don''t really read same data multiple times. What would be the best setup? I''m thinking one of the following: a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability) b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs) c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance) I am leaning towards "a." since I am thinking "raidz3"+"raidz2" should provide a little more reliability than 5 "raidz1"''s, but, worried that the real world read/write performance will be low (theoridical is ~200MB/s, and, since the 2nd vdev is 3x the size as the 1st, I am probably looking at more like 133MB/s?). The 12 disk array is also above the "9 disk group max" recommendation in the Best Practices guide, so not sure if this affects read performance (if it is just resilver time I am not as worried about it as long it isn''t like 3x longer)? I guess I''m hoping "a." 
really isn''t ~200MB/s hehe, if it is I''m leaning towards "b.", but, if so, all three are downgrades from my initial setup read performance wise -_-. Is someone able to correct my understanding if some of my numbers are off, or would someone have a better raidzN configuration I should consider? Thanks for any help.
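To make the parity/space arithmetic concrete, here is a rough back-of-the-envelope sketch in POSIX shell (sizes in GB, ignoring ZFS metadata overhead and TB/TiB rounding; usable space per raidz vdev is taken as (disks - parity) * disk size):

  # option a: 8x1.5TB raidz2 + 12x3TB raidz3
  echo $(( (8-2)*1500 + (12-3)*3000 ))      # 36000 GB usable
  # option b: 8x1.5TB raidz2 + 3 vdevs of 4x3TB raidz1
  echo $(( (8-2)*1500 + 3*(4-1)*3000 ))     # 36000 GB usable
  # option c: 2 vdevs of 4x1.5TB raidz1 + 3 vdevs of 4x3TB raidz1
  echo $(( 2*(4-1)*1500 + 3*(4-1)*3000 ))   # 36000 GB usable

All three layouts spend 5 of the 20 disks on parity (25%) and land on the same raw usable capacity; the differences discussed in the replies below are about performance and redundancy, not space.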
Jim Klimov
2012-Mar-21 11:56 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 7:16, MLR wrote:
> I read the "ZFS_Best_Practices_Guide" and "ZFS_Evil_Tuning_Guide", and have some questions:
>
> 1. Cache device for L2ARC
> Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool setup shouldn't we be reading at least at the 500MB/s read/write range? Why would we want a ~500MB/s cache?

Basically, SSDs shine best in random IOs. For example, my (consumer-grade) 2TB disks in a home NAS yield up to 160MB/s in linear reads, but drop to about 3MB/s in random performance, occasionally bursting to 10-20MB/s for a short time. The ZFS COW-based data structure is quite fragmented, so there are many random seeks. Raw low-level performance gets hurt as a tradeoff for reliability, and SSDs, along with large RAM buffers, are ways to recover and boost the performance. There is an especially large amount of work with metadata when/if you use deduplication - tens of gigabytes of RAM are recommended for a decent-sized pool of a few TB.

> 2. ZFS dynamically stripes along the top-most vdevs and "performance for 1 vdev is equivalent to performance of one drive in that group". Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, the disks will go ~100MB/s each, this zpool would theoretically read/write at ~100MB/s max (how about real world average?)? If this was RAID6 I think this would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 10x-14x faster than zfs, and both provide the same amount of redundancy)? Is my thinking off in the RAID6 or RAIDZ2 numbers?

I think your numbers are not right. They would make sense for a RAID0 of 14 drives, though. All correctly implemented synchronously-redundant schemes must wait for all storage devices to complete writing, so they are "not faster" than single devices during writes, and due to bus contention, etc. are often a bit slower overall. Reads, on the other hand, can be parallelised on RAIDzN as well as on RAID5/6 and can boost read performance more or less like striping.

As for "same level of redundancy", many people would point out that usual RAIDs don't have a method to know which part of the array is faulty (i.e. when one sector in a RAID stripe becomes corrupted, there is no way to certainly reconstruct correct data, and often no quick way to detect the corruption either). Many arrays depend on timestamps of the component disks to detect stale data, and can only recover well from full-disk failures.

> Why doesn't ZFS try to dynamically stripe inside vdevs (and if it is, is there an easy to understand explanation why a vdev doesn't read from multiple drives at once when requesting data, or why a zpool wouldn't make N number of requests to a vdev with N being the number of disks in that vdev)?

That it does, somewhat. In RAID terms you can think of a ZFS pool with several top-level vdevs, each made up from several leaf devices, as implementing RAID50 or RAID60, to contain lots of "blocks". There are "banks" (TLVDEVs) of disks in redundant arrays, and these have block data (and redundancy blocks) striped across sectors of different disks. A pool stripes (RAID0) userdata across several TLVDEVs by storing different blocks in different "banks". Loss of a whole TLVDEV is fatal, like in RAID50.

ZFS has a variable step though, so depending on block size, the block-stripe size within a TLVDEV can vary. For minimal-sized blocks on a raidz or raidz2 TLVDEV you'd have one or two redundancy sectors and a data sector, using two or three disks only. Other "same-numbered" sectors of other disks in the TLVDEV can be used by another such stripe. There are nice illustrations in the docs and blogs regarding the layout. Note that redundancy disks are not used during normal reads of uncorrupted data. However, I believe that there may be a slight benefit from ZFS for smaller blocks which are not using the whole raidzN array stripe, since parallel disks can be used to read parts of different blocks. But the random seeks involved in mechanical disks would probably make it unnoticeable, and there's probably a lot of randomness in the storage of small blocks.

> Since "performance for 1 vdev is equivalent to performance of one drive in that group" it seems like the higher raidzN are not very useful. If you're using raidzN you're probably looking for a lower than mirroring parity (aka 10%-33%), but if you try to use raidz3 with 15% parity you're putting 20 HDDs in 1 vdev which is terrible (almost unimaginable) if you're running at 1/20 the "ideal" performance.

There are several tradeoffs, and other people on the list can explain them better (and did in the past - search the archives). Mostly this regards resilver times (how many disks are used to rebuild another disk) and striping performance. There were also some calculations regarding, e.g., 10-disk sets: you can make two raidz1 arrays or one raidz2 array. They give you the same userspace sizes (8 data disks), but the latter is deemed a lot more reliable.

Basically, with mirroring you pay the most (2x-3x redundancy for each disk) and get the best performance as well as the best redundancy. With raidzN you get more useable space on the same disks at a greater hit to performance, but cheaper. For many home users that does not matter. Say, your camera's CF card can stream its photos at 10MB/s into your home storage box, so a sustained 10 or 50MB/s of writes suffices for you.

One thing to note though is that with larger drives you get longer times just to read in the whole drive trying to detect errors when scrubbing - and this is something your system should proactively do. This opens windows for multiple-drive errors, which can happen to become unrecoverable (i.e. several hits to the same block exceeding its redundancy level). With multi-TB disks it is recommended to have at least 3-disk redundancy via 3-4-way mirrors or raidz3, or in traditional systems "RAID7" or "RAID6.3" as some call it. Apparently, having 3 parity disks in a raidz3 array places some requirement on the minimal size of the array so that it becomes just reasonable (perhaps 8-10 disks overall).

> Main Question:
> 3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure (Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I want around 20%-25% parity. My system is like so:
>
> Main Application: Home NAS
> * Like to optimize max space with 20% (ideal) or 25% parity - would like 'decent' reading performance
> - 'decent' being max of 10GigE Ethernet, right now it is only 1 gigabit Ethernet but hope to leave room to update in future if 10GigE becomes cheaper.
> My RAID5 runs at ~500MB/s so was hoping to get at least above that with the 20 disk raid.

10GigE is a theoretical 1250MB/s.
That might be achievable for writes with mirrored disks and/or good fast caching (in bursts, or if your working set fits in the cache), but seems unlikely with raidz sets. For reads, caching would likewise help; disk speeds would be good if you have written lots of data contiguously (so that the disks won't have to seek too much and can yield linear reads). I am not ready to conjure up some numbers out of thin air now, and hopefully someone else will reply to your main question in detail.

I assume your other hardware won't be a bottleneck? (PCI buses, disk controllers, RAM access, etc.)

> * 16GB RAM

Not so much for ZFS advanced features - don't try dedup ;) Also, remember that L2ARC indexing still needs some RAM to reference the cached blocks. The reference size is constant (about 200 bytes per block), but due to varying block size the ratio (GB of RAM => GB of L2ARC) can differ and depends on your usage. In particular, for dedup the ratio is very bad, about 2x (a dedup-table entry is about twice as large as the reference to it from the RAM ARC to L2ARC).

> * Open to using ZIL/L2ARC, but, left out for now: writing doesn't occur much (~7GB a week, maybe a big burst every couple months), and don't really read same data multiple times.

A dedicated fast and reliable (i.e. mirrored SSD or RAM drive) ZIL would help if you have synchronous writes - for example, compilation of large projects creating many files, especially over NFS. A ZIL is a rather specific investment, so it might not help you at all, and ideally it is a write-only device (read in only after crashes). So for SSDs you should expect a lot of wear, and orient towards a mirror of SLC devices. Or RAM disks. Or maybe small dedicated HDDs to offload the write-seeks from the main pool (that last idea is often argued for/against)...

> What would be the best setup? I'm thinking one of the following:
> a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability)
> b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs)
> c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance)
>
> I am leaning towards "a." since I am thinking "raidz3"+"raidz2" should provide a little more reliability than 5 "raidz1"s, but, worried that the real world read/write performance will be low (theoretical is ~200MB/s, and, since the 2nd vdev is 3x the size as the 1st, I am probably looking at more like 133MB/s?). The 12 disk array is also above the "9 disk group max" recommendation in the Best Practices guide, so not sure if this affects read performance (if it is just resilver time I am not as worried about it as long it isn't like 3x longer)?

One thing to note is that many people would not recommend using a "disbalanced" ZFS array - one expanded by adding a TLVDEV after many writes, or one consisting of differently-sized TLVDEVs. ZFS does a rather good job of trying to use available storage most efficiently, but it has often been reported that it hits some algorithmic bottleneck when one of the TLVDEVs is about 80-90% full (even if the others are new and empty). Blocks are balanced across TLVDEVs on write, so your old data is not magically redistributed until you explicitly rewrite it (i.e. zfs send or rsync into another dataset on this pool).

So I'd suggest that you keep your disks separate, with two pools made from the 1.5TB disks and from the 3TB disks, and use these pools for different tasks (i.e. a working set with relatively high turnaround and fragmentation, and WORM static data with little fragmentation and high read performance). Also this would allow you to more easily upgrade/replace the whole set of 1.5TB disks when the time comes.

Note that the two disk types can also have other different characteristics, most notably the native sector size (4KB vs. 512b). You might expose your pool to a hit in reliability and performance if you used the 4KB-sectored disks with emulated 512b sectors as 512b-sectored disks; however, you'd gain some more useable space in exchange. You don't have these negative hits when you use a native 512b disk as a 512b disk. It is likely that when you decide to replace the 1.5TB disks, all those available on the market will be 4KB-sectored, so in-place replacement of disks (replacing pool disks one by one and resilvering) would be a bad option IF your 1.5TB disks have native 512b sectors and you use them as such in the pool. If interested, read up more on "ashift=9 vs. ashift=12" issues in ZFS.

> I guess I'm hoping "a." really isn't ~200MB/s hehe, if it is I'm leaning towards "b.", but, if so, all three are downgrades from my initial setup read performance wise -_-.
>
> Is someone able to correct my understanding if some of my numbers are off, or would someone have a better raidzN configuration I should consider? Thanks for any help.

Again, I hope someone else will correctly suggest the setup for your numbers. I'm somewhat more successful with theory now ;(

HTH,
//Jim Klimov
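For illustration, Jim's two-pool suggestion might look roughly like this (a sketch only; the cNtNd0 device names are placeholders, and on 4KB-sector 3TB drives you would want to make sure the pool is created with ashift=12):

  # old 1.5TB drives as one pool, e.g. a single 8-disk raidz2
  zpool create oldtank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0

  # new 3TB drives as a separate pool, e.g. three 4-disk raidz1 top-level vdevs
  zpool create newtank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
                       raidz c2t4d0 c2t5d0 c2t6d0 c2t7d0 \
                       raidz c2t8d0 c2t9d0 c2t10d0 c2t11d0

  zpool status newtank   # verify the vdev layout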
Paul Kraus
2012-Mar-21 12:41 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Wed, Mar 21, 2012 at 7:56 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2012-03-21 7:16, MLR wrote:
> One thing to note is that many people would not recommend using a "disbalanced" ZFS array - one expanded by adding a TLVDEV after many writes, or one consisting of differently-sized TLVDEVs.
>
> ZFS does a rather good job of trying to use available storage most efficiently, but it was often reported that it hits some algorithmic bottleneck when one of the TLVDEVs is about 80-90% full (even if others are new and empty). Blocks are balanced across TLVDEVs on write, so your old data is not magically redistributed until you explicitly rewrite it (i.e. zfs send or rsync into another dataset on this pool).

I have been running ZFS in a mission critical application since zpool version 10 and have not seen any issues with some of the vdevs in a zpool full while others are virtually empty. We have been running commercial Solaris 10 releases. The configuration was that each business unit had a separate zpool consisting of mirrored pairs of 500 GB LUNs from SAN-based storage. Each zpool started with enough storage for that business unit. As each business unit filled their space, we added additional mirrored pairs of LUNs. So the smallest unit had one mirror vdev and the largest had 13 vdevs. In the case of the two largest (13 and 11 vdevs), most of the vdevs were well above 90% utilized and there were 2 or 3 almost empty vdevs. We never saw any reliability issues with this condition. In terms of performance, the storage was NOT our performance bottleneck, so I do not know if there were any performance issues with this situation.

> So I'd suggest that you keep your disks separate, with two pools made from 1.5TB disks and from 3TB disks, and use these pools for different tasks (i.e. a working set with relatively high turnaround and fragmentation, and WORM static data with little fragmentation and high read performance). Also this would allow you to more easily upgrade/replace the whole set of 1.5TB disks when the time comes.

I have never tried mixing drives of different size or performance characteristics in the same zpool or vdev, except as a temporary migration strategy. You already know that growing a RAIDz vdev is currently impossible, so with a RAIDz strategy your only option for growth is to add complete RAIDz vdevs, and you _want_ those to match in terms of performance or you will have unpredictable performance. For situations where you _might_ want to grow the data capacity in the future I recommend mirrors, but ... and Richard Elling posted hard data on this to the list a while back ... to get the reliability of RAIDz2 you need more than a 2-way mirror. In my mind, the larger the amount of data (and size of drives), the _more_ reliability you need.

We are no longer using the configuration described above. The current configuration is five JBOD chassis of 24 drives each. We have 22 vdevs, each a RAIDz2 consisting of one drive from each chassis, and 10 hot spares. Our priority was reliability followed by capacity and performance. If we could have, we would have just used 3- or 4-way mirrors, but we needed more capacity than that provided. I note that in pre-production testing we did have two of the five JBOD chassis go offline at once and did not lose _any_ data. The total pool size is about 40 TB.

We also have a redundant copy of the data on a remote system. That system only has two JBOD chassis and capacity is the priority. The zpool consists of two vdevs, each a RAIDz2 of 23 drives, and two hot spares. The performance is dreadful, but we _have_ the data in case of a real disaster.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
Jim Klimov
2012-Mar-21 12:55 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 16:41, Paul Kraus wrote:
> I have been running ZFS in a mission critical application since zpool version 10 and have not seen any issues with some of the vdevs in a zpool full while others are virtually empty. We have been running commercial Solaris 10 releases. The configuration was that each

Thanks for sharing some real-life data from larger deployments, as you often do. That's something I don't often have access to nowadays, with the liberty to tell :)

Nice to hear about the lack of degradation in this scenario you have; it was one proposed a few years back on the Sun Forums, I believe. Perhaps the problems come if you similarly expand raidz-based arrays by adding TLVDEVs, or with OpenSolaris's experimental features?.. I don't know, really :)

//Jim
Edward Ned Harvey
2012-Mar-21 13:28 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of MLR
>
> Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool setup shouldn't we be reading at least at the 500MB/s read/write range? Why would we want a ~500MB/s cache?

You don't add l2arc because you care about MB/sec. You add it because you care about IOPS (read). Similarly, you don't add a dedicated log device for MB/sec. You add it for IOPS (sync write). Any pool - raidz, raidz2, mirror - will give you optimum *sequential* throughput. All the performance enhancements are for random IO. Mirrors outperform raidzN, but in either case, you get improvements by adding log & cache.

> Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool,

It's not advisable to put more than ~8 disks in a single vdev, because it really hurts during resilver time. Maybe a week or two to resilver like that.

> the disks will go ~100MB/s each, this zpool would theoretically read/write at

No matter which configuration you choose, you can expect optimum throughput from all drives in sequential operations. Random IO is a different story.

> What would be the best setup? I'm thinking one of the following:
> a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability)

No. 12 in a single vdev is too much.

> b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs)

Not bad, but different size vdevs will perform differently (8 disks vs 4), so... See below.

> c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance)

This would be your optimal configuration.
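As a sketch of what adding a cache device looks like, plus the RAM cost of indexing it using the ~200 bytes per block figure Jim mentioned earlier (device name and cache size are hypothetical):

  # add an SSD as L2ARC to an existing pool
  zpool add tank cache c3t0d0

  # rough ARC overhead for indexing a 100GB L2ARC holding 8KB blocks
  echo $(( 100 * 1024 * 1024 * 1024 / 8192 * 200 / 1024 / 1024 ))   # ~2500 MB of RAM

With large (128KB) blocks the same cache would need only on the order of 150MB of headers, so the ratio really does depend on the workload.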
Paul Kraus
2012-Mar-21 13:30 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Tue, Mar 20, 2012 at 11:16 PM, MLR <maillistreader1 at gmail.com> wrote:

> 1. Cache device for L2ARC
> Say we get a decent ssd, ~500MB/s read/write. If we have a 20 HDD zpool setup shouldn't we be reading at least at the 500MB/s read/write range? Why would we want a ~500MB/s cache?

Without knowing the I/O pattern, saying 500 MB/sec. is meaningless. Achieving 500MB/sec. with 8KB files and lots of random accesses is really hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of 100MB+ files is much easier. An SSD will be as fast on random I/O as on sequential (compared to an HDD). An SSD will be as fast on small I/O as on large (once again, compared to an HDD). Due to its COW design, once a file is _changed_, ZFS no longer accesses it strictly sequentially. If the files are written once and never changed, then they _may_ be sequential on disk.

An important point to remember about the ARC / L2ARC is that it (they?) are ADAPTIVE. The amount of space used by the ARC will grow as ZFS reads data and shrink as other processes need memory. I also suspect that data eventually ages out of the ARC. The L2ARC is (mostly) just an extension of the ARC, except that it does not have to give up capacity as other processes need more memory.

> 2. ZFS dynamically stripes along the top-most vdevs and "performance for 1 vdev is equivalent to performance of one drive in that group". Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool, the disks will go ~100MB/s each,

Assuming the disks will do 100MB/sec. for your data :-)

> this zpool would theoretically read/write at ~100MB/s max (how about real world average?)?

Yes. In a RAIDz<n>, when a write is dispatched to the vdev _all_ the drives must complete the write before the write is complete. All the drives in the vdev are written to in parallel. This is (or should be) the case for _any_ RAID scheme, including RAID1 (mirroring). If a zpool has more than one vdev, then writes are distributed among the vdevs based on a number of factors (which others are _much_ more qualified to discuss). For ZFS, performance is proportional to the number of vdevs, NOT the number of drives or the number of drives per vdev. See https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc for some testing I did a while back. I did not test sequential read as that is not part of our workload.

> If this was RAID6 I think this would go theoretically ~1.4GB/s, but in real life I am thinking ~1GB/s (aka 10x-14x faster than zfs, and both provide the same amount of redundancy)? Is my thinking off in the RAID6 or RAIDZ2 numbers? Why doesn't ZFS try to dynamically stripe inside vdevs (and if it is, is there an easy to understand explanation why a vdev doesn't read from multiple drives at once when requesting data, or why a zpool wouldn't make N number of requests to a vdev with N being the number of disks in that vdev)?

I understand why the read performance scales with the number of vdevs, but I have never really understood _why_ it does not also scale with the number of drives in each vdev. When I did my testing with 40 drives, I expected similar READ performance regardless of the layout, but that was NOT the case.

> Since "performance for 1 vdev is equivalent to performance of one drive in that group" it seems like the higher raidzN are not very useful. If you're using raidzN you're probably looking for a lower than mirroring parity (aka 10%-33%), but if you try to use raidz3 with 15% parity you're putting 20 HDDs in 1 vdev which is terrible (almost unimaginable) if you're running at 1/20 the "ideal" performance.

The recommendation is to not go over 8 or so drives per vdev, but that is a performance issue, NOT a reliability one. I have also not been able to duplicate others' observations that 2^N drives per vdev is a magic number (4, 8, 16, etc.). As you can see from the above, even a 40 drive vdev works and is reliable, just (relatively) slow :-)

> Main Question:
> 3. I am updating my old RAID5 and want to reuse my old drives. I have 8 1.5TB drives and buying new 3TB drives to fill up the rest of a 20 disk enclosure (Norco RPC-4220); there is also 1 spare, plus the bootdrive so 22 total. I want around 20%-25% parity. My system is like so:

Is the enclosure just a JBOD? If it is not, can it present drives directly? If you cannot get at the drives individually, then the rest of the discussion is largely moot. You are buying 3TB drives; by definition you are NOT looking for performance or reliability but capacity. What is the uncorrectable error rate on these 3TB drives? What is the real random I/Ops capability of these 3TB drives? I am not trying to be mean here, but I would hate to see you put a ton of effort into this and then be disappointed with the result due to a poor choice of hardware.

> Main Application: Home NAS
> * Like to optimize max space with 20% (ideal) or 25% parity - would like 'decent' reading performance
> - 'decent' being max of 10GigE Ethernet, right now it is only 1 gigabit Ethernet but hope to leave room to update in future if 10GigE becomes cheaper.

1,250MB/sec of random I/O (assuming small files) is very non-trivial to achieve and is way more than "decent"... On my home network I see 30MB/sec of large file traffic per client, and I rarely have more than one client doing lots of I/O at a time. How much space do you _need_, including reasonable growth?

> My RAID5 runs at ~500MB/s so was hoping to get at least above that with the 20 disk raid.

How did you measure this?

> * 16GB RAM

What OS? I have a 16 CPU Solaris 10 SPARC server with 16 GB of RAM serving up 20TB of random small files. The ARC uses between 8 and 10 GB, with between 1 and 2 GB free. But our users are generally accessing less than 3 TB of data at a time.

> * Open to using ZIL/L2ARC, but, left out for now: writing doesn't occur much (~7GB a week, maybe a big burst every couple months), and don't really read same data multiple times.

A ZIL helps sync write performance (NFS). L2ARC gives you more ARC space, which helps all reads.

> What would be the best setup? I'm thinking one of the following:
> a. 1vdev of 8 1.5TB disks (raidz2). 1vdev of 12 3TB disks (raidz3)? (~200MB/s reading, best reliability)
> b. 1vdev of 8 1.5TB disks (raidz2). 3vdev of 4 3TB disks (raidz)? (~400MB/s reading, evens out size across vdevs)
> c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance)

With the eight 1.5TB drives you can do:
1 x 8 (raidz<n>) == worst performance
2 x 4 (raidz<n>) == better performance; if raidz2, capacity is the same as mirrors but with better reliability
4 x 2 (mirror) == best performance

With the twelve 3TB drives you can do:
1 x 12 (raidz<n>) == worst performance
2 x 6 (raidz<n>) == better performance
3 x 4 (raidz<n>) == better performance
4 x 3 (mirror) == best performance
6 x 2 (mirror) == almost best performance

I agree with Jim that you should keep the 1.5TB and the 3TB drives in separate zpools. Although you _can_ partition the 3TB drives to look like two 1.5TB drives: group the first partition on each 3TB drive with the 1.5TB drives and use the second partitions as a second zpool. There are caveats with doing that, but it may fit your needs...

With 20 logical 1.5TB drives you can do:
1 x 20 (raidz<n>) == bad performance, don't do this :-)
2 x 10 (raidz<n>) == better
3 x 6 + 2 hot spares (raidz<n>)
4 x 5 (raidz<n>)
6 x 3 + 2 hot spares (mirror)
9 x 2 + 2 hot spares (mirror)

Plus another 12 logical 1.5TB drives:
1 x 12 (raidz<n>) == worst performance
2 x 6 (raidz<n>) == better performance
3 x 4 (raidz<n>) == better performance
4 x 3 (mirror) == best performance
6 x 2 (mirror) == almost best performance

If you have the time, set up each configuration and _measure_ the performance. If you can, load up a bunch of data (at least 33% full) and then trigger a scrub to see how long a resilver takes; see the command sketch below. Remember here that you are looking for _relative_ measures (unless you have a performance goal you need to hit).

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
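For the "measure it" step Paul suggests, a minimal sketch of the commands involved (the pool name is a placeholder):

  zpool scrub tank        # force a full read of all allocated data
  zpool status tank       # shows scrub/resilver progress and elapsed time
  zpool iostat -v tank 5  # per-vdev bandwidth and IOPS, sampled every 5 seconds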
Edward Ned Harvey
2012-Mar-21 13:30 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of MLR
>
> c. 2vdev of 4 1.5TB disks (raidz). 3vdev of 4 3TB disks (raidz)? (~500MB/s reading, maximize vdevs for performance)

If possible, spread your vdevs across 4 different controllers/busses. So if you lose any one controller/bus, you will only be degraded; the pool won't go offline.
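A hypothetical way to lay out option "c" in that spirit, with each 4-disk raidz vdev taking one disk from each of four controllers (c1-c4 and the target numbers are placeholders; the first two vdevs would be the 1.5TB disks, the last three the 3TB disks):

  zpool create tank raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0 \
                    raidz c1t1d0 c2t1d0 c3t1d0 c4t1d0 \
                    raidz c1t2d0 c2t2d0 c3t2d0 c4t2d0 \
                    raidz c1t3d0 c2t3d0 c3t3d0 c4t3d0 \
                    raidz c1t4d0 c2t4d0 c3t4d0 c4t4d0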
Jim Klimov
2012-Mar-21 13:51 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 17:28, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of MLR
...
>> Am I correct in thinking this means, for example, I have a single 14 disk raidz2 vdev zpool,
>
> It's not advisable to put more than ~8 disks in a single vdev, because it really hurts during resilver time. Maybe a week or two to resilver like that.

Yes, that's important to note also. While ZFS marketing initially stressed that, unlike traditional RAID systems, a "rebuild" of ZFS onto a spare/replacement disk only needs to copy referenced data and not the whole disk, it somehow fell off the picture that such a rebuild is a lot of random IO - because the data block tree must be read in as a tree walk, often with emphasis on block "age" (its birth TXG number). If your pool is reasonably full (and who runs it empty?) then this is indeed lots of random IO, and a blind full-disk copy would have gone orders of magnitude faster. The fewer disks that participate in this thrashing, the faster it will go (less data needed overall to reconstruct a disk's worth of sectors from redundancy data).

That's the way I understand the problem, anyway...

//Jim
Paul Kraus
2012-Mar-21 14:37 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Wed, Mar 21, 2012 at 9:51 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2012-03-21 17:28, Edward Ned Harvey wrote:
>> It's not advisable to put more than ~8 disks in a single vdev, because it really hurts during resilver time. Maybe a week or two to resilver like that.
>
> Yes, that's important to note also. While ZFS marketing initially stressed that unlike traditional RAID systems, a "rebuild" of ZFS onto a spare/replacement disk only needs to copy referenced data and not the whole disk, it somehow fell off the picture that such rebuild is a lot of random IO - because the data block tree must be read in as a tree walk, often with emphasis on block "age" (its birth TXG number). If your pool is reasonably full (and who runs it empty?) then this is indeed lots of random IO, and a blind full-disk copy would have gone orders of magnitude faster. The fewer disks that participate in this thrashing, the faster it will go (less data needed overall to reconstruct a disk's worth of sectors from redundancy data).

There are two different cases here... a resilver to reconstruct data from a failed drive, and a scrub to proactively find bad sectors.

The best case for the first (bad drive replacement) is a mirrored drive, in my experience. In that case only the data involved in the failure needs to be read and written. I am unclear how much of the data is read _from_other_vdevs_ in the case of a failure of a drive in a RAIDz<n> vdev. I have seen disk activity on non-failure-related vdevs during a drive replacement, which is why I am unsure in this case.

In the case of a "scrub", _all_ of the data in the zpool is read and the checksums checked. My 22 vdev zpool takes about 300 hours for this, while the 2 vdev zpool takes over 600 hours. Both have comparable amounts of data and snapshots. The 22 vdev zpool is on a production server with normal I/O activity; the 2 vdev one is only receiving zfs snapshots and doing no other I/O.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
Marion Hakanson
2012-Mar-21 17:40 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
paul at kraus-haus.org said:
> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless. Achieving 500MB/sec. with 8KB files and lots of random accesses is really hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of 100MB+ files is much easier.
> . . .
> For ZFS, performance is proportional to the number of vdevs NOT the number of drives or the number of drives per vdev. See https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc for some testing I did a while back. I did not test sequential read as that is not part of our workload.
> . . .
> I understand why the read performance scales with the number of vdevs, but I have never really understood _why_ it does not also scale with the number of drives in each vdev. When I did my testing with 40 drives, I expected similar READ performance regardless of the layout, but that was NOT the case.

In your first paragraph you make the important point that "performance" is too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above, you go back to using "performance" in its ambiguous form. I assume that by "performance" you are mostly focusing on random-read performance...

My experience is that sequential read performance _does_ scale with the number of drives in each vdev. Both sequential and random write performance also scale in this manner (note that ZFS tends to save up small, random writes and flush them out in a sequential batch).

Small, random read performance does not scale with the number of drives in each raidz[123] vdev because of the dynamic striping. In order to read a single logical block, ZFS has to read all the segments of that logical block, which have been spread out across multiple drives, in order to validate the checksum before returning that logical block to the application. This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.

paul at kraus-haus.org said:
> The recommendation is to not go over 8 or so drives per vdev, but that is a performance issue NOT a reliability one. I have also not been able to duplicate others observations that 2^N drives per vdev is a magic number (4, 8, 16, etc). As you can see from the above, even a 40 drive vdev works and is reliable, just (relatively) slow :-)

Again, the "performance issue" you describe above is for the random-read case, not sequential. If you rarely experience small-random-read workloads, then raidz* will perform just fine. We often see 2000 MBytes/sec sequential read (and write) performance on a raidz3 pool consisting of 3, 12-disk vdevs (using 2TB drives).

However, when a disk fails and must be resilvered, that's when you will run into the slow performance of the small, random read workload. This is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives, especially of the 1TB+ size. That way if it takes 200 hours to resilver, you've still got a lot of redundancy in place.

Regards,

Marion
Jim Klimov
2012-Mar-21 18:26 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 21:40, Marion Hakanson wrote:
> Small, random read performance does not scale with the number of drives in each raidz[123] vdev because of the dynamic striping. In order to read a single logical block, ZFS has to read all the segments of that logical block, which have been spread out across multiple drives, in order to validate the checksum before returning that logical block to the application. This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.

True, but if the stars align so nicely that all the sectors related to the block are read simultaneously in parallel from several drives of the top-level vdev, so that there is no (substantial) *latency* incurred by waiting between the first and last drives to complete the read request, then the *aggregate bandwidth* of the array is (should be) similar to the performance (bandwidth) of a stripe. This gain would probably be hidden by caches and averages, unless the stars align so nicely for many blocks in a row, such as a sequential uninterrupted read of a file written out sequentially - so that the component drives would stream it off the platter track by track in a row... Ah, what a wonderful world that would be! ;)

Also, after a sector is read by the disk and passed to the OS, it is supposedly cached until all sectors of the block arrive into the cache and the checksum matches. During this time the HDD is available to do other queued mechanical tasks. I am not sure which cache that might be: too early for the ARC - no block yet - and the vdev caches now drop non-metadata sectors. Perhaps it is just a variable buffer space in the instance of the reading routine, which tries to gather all the pieces of the block together and pass it to the reader (and into the ARC)...

//Jim
Richard Elling
2012-Mar-21 18:53 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
comments below...

On Mar 21, 2012, at 10:40 AM, Marion Hakanson wrote:

> paul at kraus-haus.org said:
>> Without knowing the I/O pattern, saying 500 MB/sec. is meaningless. Achieving 500MB/sec. with 8KB files and lots of random accesses is really hard, even with 20 HDDs. Achieving 500MB/sec. of sequential streaming of 100MB+ files is much easier.
>> . . .
>> For ZFS, performance is proportional to the number of vdevs NOT the number of drives or the number of drives per vdev. See https://docs.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc for some testing I did a while back. I did not test sequential read as that is not part of our workload.

Actually, few people have sequential workloads. Many think they do, but I say prove it with iopattern.

>> . . .
>> I understand why the read performance scales with the number of vdevs, but I have never really understood _why_ it does not also scale with the number of drives in each vdev. When I did my testing with 40 drives, I expected similar READ performance regardless of the layout, but that was NOT the case.
>
> In your first paragraph you make the important point that "performance" is too ambiguous in this discussion. Yet in the 2nd & 3rd paragraphs above, you go back to using "performance" in its ambiguous form. I assume that by "performance" you are mostly focusing on random-read performance...
>
> My experience is that sequential read performance _does_ scale with the number of drives in each vdev. Both sequential and random write performance also scale in this manner (note that ZFS tends to save up small, random writes and flush them out in a sequential batch).

Yes. I wrote a small, random read performance model that considers the various caches. It is described here:
http://info.nexenta.com/rs/nexenta/images/tech_brief_nexenta_performance.pdf
The spreadsheet shown in figure 3 is available for the asking (and it works on your iphone or ipad :-)

> Small, random read performance does not scale with the number of drives in each raidz[123] vdev because of the dynamic striping. In order to read a single logical block, ZFS has to read all the segments of that logical block, which have been spread out across multiple drives, in order to validate the checksum before returning that logical block to the application. This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.

It is not as bad as that. The actual worst case number for a HDD with zfs_vdev_max_pending of one is:
    average IOPS * ((D+P) / D)
where,
    D = number of data disks
    P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
    total disks per set = D + P
We did many studies that verified this. More recent studies show zfs_vdev_max_pending has a huge impact on average latency of HDDs, which I also described in my talk at the OpenStorage Summit last fall.

> paul at kraus-haus.org said:
>> The recommendation is to not go over 8 or so drives per vdev, but that is a performance issue NOT a reliability one. I have also not been able to duplicate others observations that 2^N drives per vdev is a magic number (4, 8, 16, etc). As you can see from the above, even a 40 drive vdev works and is reliable, just (relatively) slow :-)

Paul, I have a considerable amount of data that refutes your findings. Can we agree that YMMV and varies dramatically, depending on your workload?

> Again, the "performance issue" you describe above is for the random-read case, not sequential. If you rarely experience small-random-read workloads, then raidz* will perform just fine. We often see 2000 MBytes/sec sequential read (and write) performance on a raidz3 pool consisting of 3, 12-disk vdevs (using 2TB drives).

Yes, this is relatively easy to see. I've seen 6GBytes/sec for large configs, but that begins to push the system boundaries in many ways.

> However, when a disk fails and must be resilvered, that's when you will run into the slow performance of the small, random read workload. This is why I use raidz2 or raidz3 on vdevs consisting of more than 6-7 drives, especially of the 1TB+ size. That way if it takes 200 hours to resilver, you've still got a lot of redundancy in place.

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
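Plugging rough numbers into Richard's formula, assuming ~100 small random read IOPS for a single 7200rpm drive (an illustrative figure only):

  # worst case for one 8-disk raidz2 vdev (D=6, P=2)
  echo $(( 100 * (6 + 2) / 6 ))       # ~133 IOPS, i.e. a bit better than one disk
  # a pool of four such vdevs scales with the vdev count
  echo $(( 4 * 100 * (6 + 2) / 6 ))   # ~533 IOPS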
maillist reader
2012-Mar-21 19:36 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
Thank you all for the information, I believe it is much clearer to me. "Sequential reads" should scale with the number of disks in the entire zpool (regardless of the number of vdevs), and "random reads" will scale with just the number of vdevs (i.e. the idea I had before only applies to "random reads"), which I am much happier with. Everything on my system should be mostly sequential, as editing should not occur much (i.e. no virtual-machine type things); when things get changed it usually means deleting the old file and adding the "updated" file.

I read though that ZFS does not have a "defragmentation" tool - is this still the case? It would seem with such a performance difference between "sequential reads" and "random reads" for raidzNs, a defragmentation tool would be very high on ZFS's TODO list ;).

> Is the enclosure just a JBOD? If it is not, can it present drives
&
> I assume your other hardware won't be a bottleneck? (PCI buses, disk controllers, RAM access, etc.)

A little more extra information about my system: it is a JBOD. The disks go to a SAS-2 expander (RES2SV240), and that has a single connection to a Tyan motherboard which has an LSI SAS 2008 chip controller built in. The CPU is an i3 with a DMI of 5 GT/s (DMI is new to me vs FSB). RAM is server-grade unbuffered ECC DDR3 1333 8GB sticks. It is a dedicated machine which will do nothing but serve files over the network. To my understanding the network or the disks themselves should be my bottleneck; the SAS-2 connection between the SAS expander and the mobo should be 24Gbit/s or 3GB/s (1.5GB/s if SAS-1), and 5 GT/s should provide ~20 GB/s max bandwidth for the 64-bit machine from what I read online.

I don't think this affects me, but I was also curious: does anyone know if the mobo <> SAS expander will still establish a SAS-2 connection (if they both support SAS-2) when the backplanes only support SAS-1 / SATA 3Gb/s? I never looked up the backplane part numbers in my Norco, but I think they support SATA 6Gb/s, so I assume they support SAS-2. But in essence SAS expander <> HDDs won't be over 3Gb/s per port, so as long as SAS expander <> mobo establishes its own SAS-2 connection regardless of what SAS expander <> HDDs do, then I don't even have to think about it. 1.5GB/s (SAS-1) is still above my optimal max anyway though. In essence, if the drives can provide it (and the network interface is ignored) I think the theoretical limitation is 3GB/s. I mentioned 1.25GB/s for the 10GigE as the max I am looking at, but I'd be happy with anywhere between 500MB/s and 1GB/s for sequential reads of large files (don't really care about any type of writes, and hopefully random reads do not happen often *will test with iopattern*).

> What is the uncorrectable error rate on these 3TB drives? What is the real random I/Ops capability of these 3TB drives?

I'm unsure of these myself; all the other parts have arrived, or are en route, but I have not actually bought the HDDs yet so I can still choose almost anything. It will probably be the cheapest consumer drives I can get though (probably "Seagate Barracuda Green ST3000DM001"s or "Hitachi 5K3000 3TB"s). The 1.5TBs I have in my old system are pretty much the same thing.

> How much space do you _need_, including reasonable growth?

My old system is 9.55TB and almost full, and I have about 3TB spread out elsewhere. This was set up about 5 years ago. With the 20 disk enclosure I'm thinking about 30TB usable space (but maybe only using 15 disks at first), and hoping it'll last for another 5 years.

> How did you measure this?

ATTO Benchmark is what I used on the local machine for the 500MB/s number. For small files (1kB-16kB) it is small (50MB-150MB); for the larger 256kB+ transfers it reads ~550MB/s. This is hardware RAID5 though. Over the 1Gbit network Windows 7 always gets up to ~100MB/s when writing/reading from the RAID5 share.

> What OS? I have a 16 CPU Solaris 10 SPARC server with 16 GB of RAM

The new ZFS system OS will probably be OpenIndiana with the v28 zpool. I have been looking at FreeNAS (FreeBSD) and am a little up in the air on which to choose.

Thank you all for the information. I will very likely create two zpools (one for the 1.5TB drives, and one for the 3TB drives). Initially I thought down the road if the pool ever fills up (probably like 5+ years) I would start swapping the 1.5TB drives with 3TB drives to let the small vdev "expand" after all were replaced, but I didn't realize there could potentially be problems from the sector size differences between 1.5TB drives (~5 years old) and 3TB+ drives (~5 years in the future).
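For the iopattern test mentioned above, a sketch assuming the DTraceToolkit is installed on the OpenIndiana box (path and options may differ by version):

  # report the random vs. sequential split of disk I/O every 10 seconds
  ./iopattern 10
  # columns include %RAN and %SEQ plus I/O counts, sizes and KB read/written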
Bob Friesenhahn
2012-Mar-21 19:59 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Wed, 21 Mar 2012, maillist reader wrote:
> I read though that ZFS does not have a "defragmentation" tool, is this still the case? It would seem with such a performance difference between "sequential reads" and "random reads" for raidzNs, a defragmentation tool would be very high on ZFS's TODO list ;).

ZFS does not usually suffer significantly from fragmentation. To be clear, "random reads" means random file access and the necessary head seeks. Any mechanical-based device will suffer from this, and it is not specific to zfs. Something which is specific to zfs is that within a vdev, a stripe must be read and written completely. Partial stripe reads are not supported, since a full read is necessary in order to validate the block checksum.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jim Klimov
2012-Mar-22 10:03 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-21 22:53, Richard Elling wrote:
...
>> This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.
>
> It is not as bad as that. The actual worst case number for a HDD with zfs_vdev_max_pending of one is:
>     average IOPS * ((D+P) / D)
> where,
>     D = number of data disks
>     P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
>     total disks per set = D + P

I wrote in this thread that AFAIK for small blocks (i.e. 1-sector size worth of data) there would be P+1 sectors used to store the block, which is an even worse case, at least capacity-wise, as well as impacting fragmentation => seeks, but it might occasionally allow parallel reads of different objects (tasks running on disks not involved in storage of the one data sector and maybe its parities, when required).

Is there any truth to this picture?

Were there any research or tests regarding storage of many small files (1-sector sized or close to that) on different vdev layouts? I believe that such files would use a single-sector-sized set of indirect blocks (dittoed at least twice), so one single-sector sized file would use at least 9 sectors in raidz2.

Thanks :)

> We did many studies that verified this. More recent studies show zfs_vdev_max_pending has a huge impact on average latency of HDDs, which I also described in my talk at OpenStorage Summit last fall.

What about drives without (a good implementation of) NCQ/TCQ/whatever? Does ZFS in-kernel caching, queuing and sorting of pending requests provide a similar service? Is it controllable with the same switch?

Or, alternatively, is it a kernel-only feature which does not depend on hardware *CQ? Are there any benefits to disks with *CQ then? :)
Edward Ned Harvey
2012-Mar-22 16:30 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Paul Kraus
>
> There are two different cases here... resilver to reconstruct data from a failed drive and a scrub to proactively find bad sectors.
>
> The best case situation for the first case (bad drive replacement) is a mirrored drive in my experience. In that case only the data involved in the failure needs to be read and written. I am

During a resilver, all the data in the vdev must be read & reconstructed to write the new disk. Notice I said "vdev." If you have a pool made of a single vdev, then that means all the data in your pool. However, if you have a pool made of a million vdevs, then ~ one millionth of the pool must be read.

If you configured your pool using mirrors instead of raidz, then you have minimized the size of your vdevs and maximized the IOPS you're able to perform *per* vdev. So mirrors resilver many times faster than raidz, but still, mirrors in my experience resilver ~10x slower than blindly reading & writing the entire disk serially.

In my experience, a hardware raid resilver takes a couple or a few hours (divide total disk size by total sustainable throughput), while a zfs mirror resilver takes a dozen hours, or a day or two (lots of random IO). While raidz takes several days, if not multiple weeks, to resilver. Of course all this is variable and dependent on both your data and usage patterns.
Edward Ned Harvey
2012-Mar-22 16:42 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of maillist reader
>
> I read though that ZFS does not have a "defragmentation" tool, is this still the case?

True.

> It would seem with such a performance difference between "sequential reads" and "random reads" for raidzNs, a defragmentation tool would be very high on ZFS's TODO list ;).

It is high on the todo list, and in fact a lot of other useful stuff is dependent on the same code, so when/if implemented, it will enable a lot of new features, where defrag is just one such new feature.

However, there's a very difficult decision regarding *what* you count as defragmentation (not to mention a lot of work to be done). The goal of defrag is to align data on disk serially so as to maximize the useful speed of the disks. Unfortunately, there are some really big competing demands, where data is read in different orders. For example, the traditional perception of defrag would align the disk blocks of individual files. Thus, when you later return to read those files sequentially, you would have maximum performance. But that's not the same order of data read as compared to scrub/resilver/zfs send. Scrub/resilver/zfs send operate in (at least approximately) temporal order. So if you defrag at the file level, you hurt the performance of scrub/resilver/send. If you defrag at the temporal pool level (which is the default position, the current behavior) you hurt the performance of file operations. Pick your poison.
Richard Elling
2012-Mar-22 16:52 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Mar 22, 2012, at 3:03 AM, Jim Klimov wrote:

> 2012-03-21 22:53, Richard Elling wrote:
> ...
>>> This is why a single vdev's random-read performance is equivalent to the random-read performance of a single drive.
>>
>> It is not as bad as that. The actual worst case number for a HDD with zfs_vdev_max_pending of one is:
>>     average IOPS * ((D+P) / D)
>> where,
>>     D = number of data disks
>>     P = number of parity disks (1 for raidz, 2 for raidz2, 3 for raidz3)
>>     total disks per set = D + P
>
> I wrote in this thread that AFAIK for small blocks (i.e. 1-sector size worth of data) there would be P+1 sectors used to store the block, which is an even worse case at least capacity-wise, as well as impacting fragmentation => seeks, but might occasionally allow parallel reads of different objects (tasks running on disks not involved in storage of the one data sector and maybe its parities when required).
>
> Is there any truth to this picture?

Yes, but it is a rare case for 512b sectors. It could be more common for 4KB sector disks when ashift=12. However, in that case the performance increases to the equivalent of mirroring, so there are some benefits. FWIW, some people call this "RAID-1E".

> Were there any research or tests regarding storage of many small files (1-sector sized or close to that) on different vdev layouts?

It is not a common case, so why bother?

> I believe that such files would use a single-sector-sized set of indirect blocks (dittoed at least twice), so one single-sector sized file would use at least 9 sectors in raidz2.

No. You can't account for the metadata that way. Metadata space is not 1:1 with data space. Metadata tends to get written in 16KB chunks, compressed.

> Thanks :)
>
>> We did many studies that verified this. More recent studies show zfs_vdev_max_pending has a huge impact on average latency of HDDs, which I also described in my talk at OpenStorage Summit last fall.
>
> What about drives without (a good implementation of) NCQ/TCQ/whatever?

All HDDs I've tested suck. The form of the suckage is that the number of IOPS stays relatively constant, but the average latency increases dramatically. This makes sense, due to the way elevator algorithms work.

> Does ZFS in-kernel caching, queuing and sorting of pending requests provide a similar service? Is it controllable with the same switch?

There are many caches at play here, with many tunables. The analysis doesn't fit in an email.

> Or, alternatively, is it a kernel-only feature which does not depend on hardware *CQ? Are there any benefits to disks with *CQ then? :)

Yes, SSDs with NCQ work very well.
 -- richard

-- 
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
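For reference, a sketch of how zfs_vdev_max_pending was typically adjusted on Solaris-derived systems of that era (the value 10 is only an example; consult the Evil Tuning Guide before touching it):

  # change it on a live system:
  echo zfs_vdev_max_pending/W0t10 | mdb -kw
  # or persistently, by adding a line to /etc/system:
  set zfs:zfs_vdev_max_pending = 10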
Jim Klimov
2012-Mar-22 18:01 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
2012-03-22 20:52, Richard Elling wrote:
> Yes, but it is a rare case for 512b sectors.
> It could be more common for 4KB sector disks when ashift=12.
...
>> Were there any research or tests regarding storage of many small
>> files (1-sector sized or close to that) on different vdev layouts?
>
> It is not a common case, so why bother?

I think that a certain Bob F. would disagree, especially when larger
native sectors and ashift=12 come into play. Namely, one scenario where
this is important is automated storage of thumbnails for websites, or of
similar small objects in vast amounts. I agree that hordes of 512b files
would be rare; 4KB-sized files (or a bit larger - 2-3 userdata sectors)
are a lot more probable ;)

>> I believe that such files would use a single-sector-sized set of
>> indirect blocks (dittoed at least twice), so one single-sector
>> sized file would use at least 9 sectors in raidz2.
>
> No. You can't account for the metadata that way. Metadata space is not
> 1:1 with data space. Metadata tends to get written in 16KB chunks,
> compressed.

I deliberately made an example of single-sector-sized files. The way I
understand it (maybe wrongly), the tree of indirect blocks (the dnode?)
for a file is stored separately from other similar objects. While
different L0 blkptr_t objects (BPs) "parented" by the same L1 object are
stored as a single block on disk (128 BPs of 128 bytes each = 16KB),
further redundanced and ditto-copied, I believe that L0 BPs from
different files are stored in separate blocks - as are L0 BPs parented
by different L1 BPs covering different byte ranges of the same file.
Likewise for further layers of L(N+1) pointers if the file is
sufficiently large (in the number of blocks used to write it).

The BP tree for a file is itself an object in a ZFS dataset,
individually referenced (as an inode number), and there is a pointer to
its root from the DMU dnode of the dataset.

If the above is true, then a single-block file should have a single L0
blkptr as its whole indirect tree of block pointers, and that L0 would
be stored in a dedicated block (not shared with other files' BPs),
inflated by ditto copies=2 and raidz/mirror redundancy.

Right/wrong?

Thanks,
//Jim
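As a sanity check on the single-sector case, here is a sketch of the raidz
allocation arithmetic as I understand it (roughly: data sectors, plus one
parity sector per parity device per row, rounded up to a multiple of
parity+1). Treat it as an illustration of the "P+1 sectors" point rather
than an authoritative accounting, and note it says nothing about the
metadata blocks discussed above:

import math

def raidz_allocated_sectors(data_sectors, total_disks, parity):
    # Data sectors + parity sectors per "row", padded up to a multiple of
    # (parity + 1). A sketch of my reading of the raidz allocator, not a
    # substitute for what ZFS actually writes (metadata excluded).
    data_disks = total_disks - parity
    parity_sectors = math.ceil(data_sectors / data_disks) * parity
    total = data_sectors + parity_sectors
    return math.ceil(total / (parity + 1)) * (parity + 1)

# One-sector file block on an 8-disk raidz2: 1 data + 2 parity = 3 (P+1).
print(raidz_allocated_sectors(1, 8, 2))    # -> 3
# A 32-sector block (128KB at ashift=9) on the same vdev:
print(raidz_allocated_sectors(32, 8, 2))   # -> 45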
Bob Friesenhahn
2012-Mar-22 20:33 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Thu, 22 Mar 2012, Jim Klimov wrote:
>
> I think that a certain Bob F. would disagree, especially
> when larger native sectors and ashift=12 come into play.
> Namely, one scenario where this is important is automated
> storage of thumbnails for websites, or some similar small
> objects in vast amounts.

I don't know about that Bob F., but this Bob F. just took a look and
noticed that thumbnail files for full-color images are typically 4KB or
a bit larger. Low-color thumbnails can be much smaller.

For a very large photo site, it would make sense to replicate just the
thumbnails across a number of front-end servers and put the larger files
on fewer storage servers, because the large files are requested much
less often and stream out better. This would mean that those front-end
"thumbnail" servers would primarily contain small files.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Jeff Bacon
2012-Mar-24 23:33 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> 2012-03-21 16:41, Paul Kraus wrote:
>> I have been running ZFS in a mission critical application since
>> zpool version 10 and have not seen any issues with some of the vdevs
>> in a zpool full while others are virtually empty. We have been running
>> commercial Solaris 10 releases. The configuration was that each
>
> Thanks for sharing some real-life data from larger deployments,
> as you often did. That's something I don't often have access
> to nowadays, with a liberty to tell :)

Here's another datapoint, then:

I'm using sol10u9 and u10 on a number of Supermicro boxes, mostly X8DTH
boards with LSI 9211/9208 controllers and E5600 CPUs. The application is
NFS file service to a bunch of clients, and we also have an in-house
database application written in Java which implements a column-oriented
db in files. Just about all of it is raidz2, much of it running
gzip-compressed.

Since I can't find anything saying not to - other than some common
wisdom about not putting all your eggs in one basket, which I'm choosing
to reject in some cases - I just keep adding vdevs to the pool. We
started with 2TB Barracudas for dev/test/archive usage and Constellations
for prod, now 3TB drives, and have just added some of the new Pipeline
drives, with nothing of particular interest to note so far.

You can create a startlingly large pool this way:

ny-fs7(68)% zpool list
NAME   SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
srv    177T   114T  63.3T  64%  ONLINE  -

Most pools are smaller. This is an archive box that's also the guinea
pig: 12 vdevs of 7 drives in raidz2. The largest prod one is 130TB in
11 vdevs of 8 drives raidz2; I won't guess at the mix of 2TB and 3TB.
These are both sol10u9.

Another box has 150TB in 6 pools, raidz2/gzip using 2TB Constellations,
dual X5690s with 144GB RAM running 20-30 Java db workers. We do manage
to break this box on the odd occasion - there's a race condition in the
ZIO code where a buffer can be freed while the block buffer is in the
process of being "loaned" out to the compression code. However, it takes
600 zpool threads plus another 600-900 Java threads running at the same
time with a backlog of 80000 ZIOs in queue, so it's not the sort of
thing that anyone's likely to run across much. :) It's fixed in sol11, I
understand; however, our intended fix is to split the whole thing so
that the workload (which for various reasons needs to be on one box) is
moved to a 4-socket Westmere, and all of the data pools are served via
NFS from other boxes.

I did lose some data once, long ago, using LSI 1068-based controllers
on older kit, but pretty much I can attribute that to something between
me-being-stupid and the 1068s really not being especially friendly
towards the LSI expander chips in the older 3Gb/s SMC backplanes when
used for SATA-over-SAS tunneling. The current arrangements are pretty
solid otherwise.

The SATA-based boxes can be a little cranky when a drive toasts, of
course - they sit and hang for a while until they finally decide to
offline the drive. We take that as par for the course; for the
application in question (basically, storing huge amounts of data on the
odd occasion that someone has a need for it), it's not exactly a
showstopper.

I am curious as to whether there is any practical upper limit on the
number of vdevs, or how far one might push this kind of configuration
in terms of pool size - assuming a sufficient quantity of RAM, of
course... I'm sure I will need to split this up someday but for the
application there's just something hideously convenient about leaving
it all in one filesystem in one pool.

-bacon
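For readers trying to map drive counts onto numbers like the 177T above,
here is a very rough usable-capacity estimator. All the simplifications are
mine: identical drives, no accounting for metadata, allocation padding, pool
slop space, or the 2TB/3TB mix described in the post:

def raidz_usable_tib(vdevs, disks_per_vdev, parity, drive_tb):
    # Data capacity of identical raidz vdevs with parity removed,
    # converted from decimal-TB drives to TiB. Ballpark only: ignores
    # metadata, raidz padding and reserved slop space.
    tib_per_drive = drive_tb * 1e12 / 2**40
    return vdevs * (disks_per_vdev - parity) * tib_per_drive

# Hypothetical all-3TB version of a 12 x (7-disk raidz2) pool:
print("%.0f TiB usable (approx)" % raidz_usable_tib(12, 7, 2, 3.0))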
Richard Elling
2012-Mar-25 00:06 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
Thanks for sharing, Jeff! Comments below...

On Mar 24, 2012, at 4:33 PM, Jeff Bacon wrote:
> [...]
> I did lose some data once, long ago, using LSI 1068-based controllers
> on older kit, but pretty much I can attribute that to something between
> me-being-stupid and the 1068s really not being especially friendly
> towards the LSI expander chips in the older 3Gb/s SMC backplanes when
> used for SATA-over-SAS tunneling. The current arrangements are pretty
> solid otherwise.

In general, mixing SATA and SAS directly behind expanders (eg without
SAS/SATA interposers) seems to be bad juju that an OS can't fix.

> The SATA-based boxes can be a little cranky when a drive toasts, of
> course - they sit and hang for a while until they finally decide to
> offline the drive. We take that as par for the course; for the
> application in question (basically, storing huge amounts of data on the
> odd occasion that someone has a need for it), it's not exactly a
> showstopper.
>
> I am curious as to whether there is any practical upper limit on the
> number of vdevs, or how far one might push this kind of configuration
> in terms of pool size - assuming a sufficient quantity of RAM, of
> course... I'm sure I will need to split this up someday but for the
> application there's just something hideously convenient about leaving
> it all in one filesystem in one pool.

I've run pools with > 100 top-level vdevs. It is not uncommon to see 40+
top-level vdevs.
 -- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Jeff Bacon
2012-Mar-25 13:26 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
> In general, mixing SATA and SAS directly behind expanders (eg without
> SAS/SATA interposers) seems to be bad juju that an OS can't fix.

In general I'd agree. Just mixing them in the same box can be
problematic, I've noticed - though I think as much as anything that the
firmware on the 3G/s expanders just isn't as well-tuned as the firmware
on the 6G/s expanders, and for all I know there's a firmware update that
will make things better.

SSDs seem to be an exception, however. Several boxes have a mix of
Crucial C300, OCZ Vertex Pro, and OCZ Vertex-3 SSDs for the usual
purposes on the expander with the Constellations, or in one case,
Cheetah 15ks. One box has SSDs and Cheetah 15ks/Constellations on the
same expander under massive loads - the aforementioned box suffering
from 80k ZIO queues - with nary a blip. (The SSDs are swap drives, and
we were force-swapping processes out to disk as part of task management.
Meanwhile, the Java processes are doing batch import processing using
the Cheetahs as a staging area, so those two expanders are under
constant heavy load. Yes, that is as ugly as it sounds; don't ask, and
don't do this yourself. This is what happens when you develop a database
without clear specs and have to just throw hardware underneath it,
guessing all the way. But it gives you an idea of the load they were/are
under.)

The SSDs were chosen with an eye towards expander-friendliness, and
tested relatively extensively before use. YMMV of course, and this is
nowhere to skimp on a-data or Kingston; buy what Anand says to buy and
you seem to do very well.

I would say: never do it on LSI 3G/s expanders. Be careful with using
SATA spindles. Test the hell out of any SSD you use first. But you do
seem to be able to get away with the better consumer-class SATA SSDs.

(I realize that many here would say that if you are going to use SSDs
in an enterprise config, you shouldn't be messing with anything short
of Deneva or the SAS-based SSDs. I'd say there are simply a bunch of
caveats with the consumer MLC SSDs in such situations to consider, and
if you are very clear about them up front, then they can be just fine.

I suspect the real difficulty in these situations is in having a
management chain that is capable of both grokking the caveats up front
and remembering that they agreed to them when something does go wrong.
:) As in this case I am the management chain, it's not an issue. This
is of course not the usual case.)

-bacon
Richard Elling
2012-Mar-25 16:55 UTC
[zfs-discuss] Basic ZFS Questions + Initial Setup Recommendation
On Mar 25, 2012, at 6:26 AM, Jeff Bacon wrote:
>> In general, mixing SATA and SAS directly behind expanders (eg without
>> SAS/SATA interposers) seems to be bad juju that an OS can't fix.
>
> In general I'd agree. Just mixing them in the same box can be
> problematic, I've noticed - though I think as much as anything that the
> firmware on the 3G/s expanders just isn't as well-tuned as the firmware
> on the 6G/s expanders, and for all I know there's a firmware update
> that will make things better.

I haven't noticed a big difference in the expanders. Does anyone else
see issues with 6G expanders?

> SSDs seem to be an exception, however. Several boxes have a mix of
> Crucial C300, OCZ Vertex Pro, and OCZ Vertex-3 SSDs for the usual
> purposes on the expander with the Constellations, or in one case,
> Cheetah 15ks. [...] But it gives you an idea of the load they were/are
> under.)

Sometime over beers we can trade war stories... many beers... :-)

> The SSDs were chosen with an eye towards expander-friendliness, and
> tested relatively extensively before use. YMMV of course, and this is
> nowhere to skimp on a-data or Kingston; buy what Anand says to buy and
> you seem to do very well.

Yes. Be aware that companies like Kingston rebadge drives from other,
reputable suppliers. And some reputable suppliers have less-than-perfect
models.

> I would say: never do it on LSI 3G/s expanders. Be careful with using
> SATA spindles. Test the hell out of any SSD you use first. But you do
> seem to be able to get away with the better consumer-class SATA SSDs.
>
> (I realize that many here would say that if you are going to use SSDs
> in an enterprise config, you shouldn't be messing with anything short
> of Deneva or the SAS-based SSDs. I'd say there are simply a bunch of
> caveats with the consumer MLC SSDs in such situations to consider, and
> if you are very clear about them up front, then they can be just fine.
>
> I suspect the real difficulty in these situations is in having a
> management chain that is capable of both grokking the caveats up front
> and remembering that they agreed to them when something does go wrong.
> :) As in this case I am the management chain, it's not an issue. This
> is of course not the usual case.)

We'd like to think that given the correct information, reasonable people
will make the best choice. And then there are PHBs.
 -- richard

--
DTrace Conference, April 3, 2012, http://wiki.smartos.org/display/DOC/dtrace.conf
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422