Nathan Kroenert
2012-May-28 12:48 UTC
[zfs-discuss] Advanced Format HDD's - are we there yet? (or - how to buy a drive that won't be teh sux0rs on zfs)
Hi folks,

Looking to get some larger drives for one of my boxes. It runs exclusively ZFS and has been using Seagate 2TB units up until now (which have 512-byte sectors).

Can anyone suggest 3TB or, preferably, 4TB drives that actually work well with ZFS out of the box (and don't perform like rubbish)?

I'm using Oracle Solaris 11 and would prefer not to have to use a hacked-up zpool to create something with ashift=12.

Thoughts on the best drives - or is Solaris 11 actually ready to go with whatever I throw at it? :)

And - am I doomed to have to use these so-called 'advanced format' drives (which as far as I can tell are in no way actually advanced, and only benefit HDD makers and not the end user)?

Cheers!
Nathan.
Nigel W
2012-May-28 15:23 UTC
On Mon, May 28, 2012 at 6:48 AM, Nathan Kroenert <nathan at tuneunix.com> wrote:
> Anyone offer up suggestions of either 3 or preferably 4TB drives that
> actually work well with ZFS out of the box? (And not perform like
> rubbish)...

With our NCP 3 boxes the WD drives seem to be working okay. (These are consumer-level drives, which seem to be working out fine for what we do with our NCP boxes at $work.)

> And - am I doomed to have to use these so called 'advanced format' drives
> (which as far as I can tell are in no way actually advanced, and only
> benefit HDD makers and not the end user).

Yes. All of the manufacturers are moving to "advanced format" drives, more accurately known as 4K sector size drives. After a snafu last week at $work where a 512-byte pool would not resilver with a 4K drive plugged in, it appears that (keep in mind that these are consumer drives) Seagate no longer manufactures the 7200.12 series drives, which have a selectable sector size. The new 7200.14 series is 4K only.

WD, for the time being, appears to still present 512-byte sectors in its current lineup. What kind of performance penalty this carries I don't know, as we have not tested any yet. Presumably, though, WD is going to stop doing that eventually, just as Seagate already has.

No, you are correct, the "advanced format" thing is just marketing. But it does have very real benefits. As far as I understand the technical details from the HDD manufacturers, the bit density and the sector count are getting so high on these multi-terabyte drives that a larger and larger percentage of the platter area (and, by extension, a larger absolute number of GBs) has to be used for ECC and sector-locating magic. This means they are wasting more and more platter space by sticking with 512-byte sectors.

With larger sectors they need fewer sectors, which means less "wasted" space. So all around it is good for everyone; it's just that the HDD manufacturers are trying to change 15+ years of complacency with respect to the sector size of HDDs.

Nigel
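A rough worked example of the space argument above. The per-sector overhead figures (15 bytes of gap/sync/address mark, 50 bytes of ECC for a 512-byte sector, 100 bytes of ECC for a 4K sector) are commonly quoted illustrative values and do not come from this thread, so treat them as assumptions rather than any vendor's specification:

```python
# Sketch: fraction of platter space that holds user data for a given
# sector size. Overhead numbers are ASSUMED illustrative values.

def format_efficiency(user_bytes, overhead_bytes):
    """Return user data as a fraction of the total bytes per sector."""
    return user_bytes / (user_bytes + overhead_bytes)

LEGACY_OVERHEAD = 15 + 50    # assumed: gap/sync/address mark + ECC
AF_OVERHEAD = 15 + 100       # assumed: same framing, larger ECC field

legacy = format_efficiency(512, LEGACY_OVERHEAD)
advanced = format_efficiency(4096, AF_OVERHEAD)

print(f"512-byte sectors: {legacy:.1%} of the space is user data")
print(f"4K sectors:       {advanced:.1%} of the space is user data")
```

Under those assumptions, one 4K sector replaces eight 512-byte sectors and pays the framing cost once instead of eight times, which is where the saving comes from.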
Richard Elling
2012-May-28 20:39 UTC
On May 28, 2012, at 5:48 AM, Nathan Kroenert wrote:
> I'm using Oracle Solaris 11 , and would prefer not to have to use a
> hacked up zpool to create something with ashift=12.
>
> Thoughts on the best drives - or is Solaris 11 actually ready to go
> with whatever I throw at it? :)

Ashift is set automatically if the disk is truly 4k sector only (and doesn't lie). This has been true for ZFS for a very, very long time.

 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
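For readers unfamiliar with the term: ashift is the base-2 logarithm of the sector size the pool aligns its allocations to, so 512 bytes gives 9 and 4096 bytes gives 12. The helper below is a minimal sketch of that arithmetic only; `ashift_for` is a hypothetical name, not a ZFS function:

```python
# Sketch: map a reported sector size to the ashift value it implies.
# ashift_for is a hypothetical helper, not part of ZFS.

def ashift_for(sector_bytes):
    """Return log2(sector_bytes); the size must be a power of two."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_bytes.bit_length() - 1

for size in (512, 4096):
    print(f"{size}-byte sectors -> ashift={ashift_for(size)}")
```

A drive that uses 4K sectors internally but reports 512 would be treated as the first case, which is the "lying" scenario mentioned above.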
nathan
2012-May-28 22:49 UTC
On 29/05/2012 6:39 AM, Richard Elling wrote:
> Ashift is set automatically if the disk is truly 4k sector only (and
> doesn't lie). This has been true for ZFS for a very, very long time.
> -- richard

Hi Richard,

Indeed - and this is the very reason I asked for suggestions of drives, as I'm yet to find a computer shop that's happy for me to plug things in to check if they report 512b or 4k. (Can't imagine why ;)

I'll keep my fingers crossed.

Nathan.
Daniel Carosone
2012-May-29 00:13 UTC
On Mon, May 28, 2012 at 09:23:25AM -0600, Nigel W wrote:
> After a snafu
> last week at $work where a 512 byte pool would not resilver with a 4K
> drive plugged in, it appears that (keep in mind that these are
> consumer drives) Seagate no longer manufactures the 7200.12 series
> drives which has a select-able sector size. The new 7200.14 series is
> 4k only.

Does this mean they actually present 4k sectors externally, rather than using 4k internally and emulating 512b externally? If so, this is a good thing - and good to know.

> WD for the time being appears to still present 512 byte
> sectors in their current lineup. What kind of performance penalty this
> carries I don't know as we have not tested any as of yet. Presumably
> though, WD is going to stop doing that eventually just like Seagate
> already has.

One hopes so.

There are two problems using ZFS on drives with 4k sectors:

 1) If the drive lies and presents 512-byte sectors, and you don't
    manually force ashift=12, then the emulation can be slow (and
    possibly error prone). There is essentially an internal
    read-modify-write (RMW) cycle when a 4k sector is partially
    updated. We use ZFS to get away from the perils of RMW :)

 2) With ashift=12, whether forced manually or set automatically
    because the disks present 4k sectors, ZFS is less space-efficient
    for metadata and keeps fewer historical uberblocks.

For choosing a tradeoff today, I'll take 2 over 1, after experience with both. 1 bites, seemingly especially with raidz types, but also with mirrors. Also because a code change could at least improve the metadata packing in future.

AFAIK, Hitachi is the only vendor still offering 512-native consumer drives in the 2 and 3TB sizes. They cost a little more, so that's another tradeoff.

--
Dan.
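The two problems listed above can be put into numbers with a small sketch. Both functions are simplified models with hypothetical names. The uberblock figures assume the commonly described on-disk layout (a 128 KiB uberblock ring per label, with slots of at least 1 KiB); that layout is an assumption here, not something stated in this thread:

```python
# Sketch of the two tradeoffs, under stated assumptions.

def needs_rmw(offset_bytes, length_bytes, physical_bytes=4096):
    """True if a write only partially covers some physical sector,
    forcing the drive to read, modify and rewrite that sector."""
    return (offset_bytes % physical_bytes != 0
            or length_bytes % physical_bytes != 0)

def uberblock_slots(ashift, ring_bytes=128 * 1024, min_slot_shift=10):
    """Number of uberblocks that fit in the ring (ASSUMED layout)."""
    slot_shift = max(ashift, min_slot_shift)
    return ring_bytes >> slot_shift

# Problem 1: a 512-byte write to a 512e drive touches part of a 4K sector.
print("512-byte write needs RMW:", needs_rmw(512, 512))
print("aligned 4K write needs RMW:", needs_rmw(8192, 4096))

# Problem 2: fewer historical uberblocks at ashift=12.
for shift in (9, 12):
    print(f"ashift={shift}: {uberblock_slots(shift)} uberblock slots")
```

With those assumptions the ring holds 128 uberblocks at ashift=9 and 32 at ashift=12, which illustrates the "fewer historical uberblocks" point.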
Nigel W
2012-May-29 05:02 UTC
On Mon, May 28, 2012 at 6:13 PM, Daniel Carosone <dan at geek.com.au> wrote:
> Does this mean they actually present with 4k sectors externally,
> rather than use 4k internally and emulate 512b externally? If so,
> this is a good thing - and good to know.

Based on the numbers stamped on the drive and on Seagate support, yes: the 7200.14 presents 4k sectors, and the 7200.12 has a jumper that switches between 512 and 4k, though I don't know if that means the disk is 4k or 512 internally.

> There are two problems using ZFS on drives with 4k sectors:
>
>  1) if the drive lies and presents 512-byte sectors, and you don't
>     manually force ashift=12, then the emulation can be slow (and
>     possibly error prone). There is essentially an internal RMW cycle
>     when a 4k sector is partially updated. We use ZFS to get away
>     from the perils of RMW :)
>
>  2) with ashift=12, whether forced manually or automatically because
>     the disks present 4k sectors, ZFS is less space-efficient for
>     metadata and keeps fewer historical uberblocks.
>
> For choosing a tradeoff today, I'll take 2 over 1, after experience
> with both. 1 bites, seemingly especially with raidz types, but also
> with mirrors. Also because a code change could at least improve the
> metadata packing in future.

Yes, that would suck the performance out, and it is something that we have discussed at $work, though so far it seems we have just lucked out and haven't seen performance issues as a result of this.

> AFAIK, Hitachi is the only vendor still offering 512-native consumer
> drives in the 2&3T sizes. They cost a little more, so that's another
> tradeoff.

Hmm. That is interesting to know. At the very least it is another possible source of 512-byte drives if we need them for replacing drives in pools that are stuck with ashift=9.
Bill Sommerfeld
2012-May-29 05:38 UTC
On 05/28/12 17:13, Daniel Carosone wrote:
> There are two problems using ZFS on drives with 4k sectors:
>
>  1) if the drive lies and presents 512-byte sectors, and you don't
>     manually force ashift=12, then the emulation can be slow (and
>     possibly error prone). There is essentially an internal RMW cycle
>     when a 4k sector is partially updated. We use ZFS to get away
>     from the perils of RMW :)
>
>  2) with ashift=12, whether forced manually or automatically because
>     the disks present 4k sectors, ZFS is less space-efficient for
>     metadata and keeps fewer historical uberblocks.

Two more specific problems I've run into recently:

 1) If you move a disk with an ashift=9 pool on it from a
    controller/enclosure/... combo where it claims to have 512-byte
    sectors to a path where it is detected as having 4k sectors (even
    if it can cope with 512-byte aligned I/O), the pool will fail to
    import and appear to be gravely corrupted; the error message you
    get will make no mention of the sector size change. Move the disk
    back to the original location and it imports cleanly.

 2) If you have a pool with ashift=9 and a disk dies, and the intended
    replacement is detected as having 4k sectors, it will not be
    possible to attach the disk as a replacement drive.
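Both failures described above come down to the same comparison: a disk that is detected with sectors larger than the pool's ashift allows cannot serve that pool. The sketch below is only a simplified model of the behaviour reported in this message, with a hypothetical function name; it is not how ZFS itself implements the check:

```python
# Simplified model of the reported behaviour, not actual ZFS logic.
# can_join_pool is a hypothetical helper name.

def can_join_pool(pool_ashift, detected_sector_bytes):
    """A disk fits if its detected sector size does not exceed the
    allocation alignment (2**ashift) the pool was created with."""
    return detected_sector_bytes <= (1 << pool_ashift)

print("4K-detected disk into ashift=9 pool: ", can_join_pool(9, 4096))
print("512-byte disk into ashift=9 pool:    ", can_join_pool(9, 512))
print("512-byte disk into ashift=12 pool:   ", can_join_pool(12, 512))
```

This is also why a pool created with ashift=12 is the more future-proof choice when replacement drives are likely to be 4K.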
John Martin
2012-May-29 10:54 UTC
On 05/28/12 08:48, Nathan Kroenert wrote:
> Looking to get some larger drives for one of my boxes. It runs
> exclusively ZFS and has been using Seagate 2TB units up until now (which
> are 512 byte sector).
>
> Anyone offer up suggestions of either 3 or preferably 4TB drives that
> actually work well with ZFS out of the box? (And not perform like
> rubbish)...
>
> I'm using Oracle Solaris 11 , and would prefer not to have to use a
> hacked up zpool to create something with ashift=12.

Are you replacing a failed drive or creating a new pool?

I had a drive in a mirrored pool fail recently. Both drives were 1TB Seagate ST310005N1A1AS-RK with 512-byte sectors. All the 1TB Seagate boxed drives I could find with the same part number on the box (with factory seals in place) were really ST1000DM003-9YN1 with 512e/4096p. Just being cautious, I ended up migrating the pools over to a pair of the new drives. The pools were created with ashift=12 automatically:

   $ zdb -C | grep ashift
               ashift: 12
               ashift: 12
               ashift: 12

Resilvering the three pools concurrently went fairly quickly:

   $ zpool status
     scan: resilvered 223G in 2h14m with 0 errors on Tue May 22 21:02:32 2012
     scan: resilvered 145G in 4h13m with 0 errors on Tue May 22 23:02:38 2012
     scan: resilvered 153G in 3h44m with 0 errors on Tue May 22 22:30:51 2012

What performance problem do you expect?
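To turn the resilver lines above into average rates, a small parser is enough. This is a sketch that assumes the `G` in the `zpool status` output means GiB and that the line format matches the three examples quoted here; `resilver_rate_mib_s` is a hypothetical helper:

```python
import re

# Matches e.g. "resilvered 223G in 2h14m" (format taken from the
# output quoted above; other zpool versions may print it differently).
PATTERN = re.compile(r"resilvered (\d+)G in (\d+)h(\d+)m")

def resilver_rate_mib_s(line):
    """Average resilver rate in MiB/s, ASSUMING 'G' means GiB."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a recognised resilver line")
    gib, hours, minutes = (int(g) for g in match.groups())
    seconds = hours * 3600 + minutes * 60
    return gib * 1024 / seconds

lines = [
    "scan: resilvered 223G in 2h14m with 0 errors on Tue May 22 21:02:32 2012",
    "scan: resilvered 145G in 4h13m with 0 errors on Tue May 22 23:02:38 2012",
    "scan: resilvered 153G in 3h44m with 0 errors on Tue May 22 22:30:51 2012",
]
for line in lines:
    print(f"{resilver_rate_mib_s(line):.1f} MiB/s")
```

Because the three resilvers ran concurrently on the same pair of disks, the per-pool averages understate what the drives were delivering in total.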
bofh
2012-May-29 11:26 UTC
On Tue, May 29, 2012 at 6:54 AM, John Martin <john.m.martin at oracle.com> wrote:
>  $ zdb -C | grep ashift
>               ashift: 12
>               ashift: 12
>               ashift: 12

That's interesting. I just created a raidz3 pool out of 7x3TB drives. My drives were:

   ST3000DM001-9YN1
   Hitachi HDS72303
   Hitachi HDS72303
   ST3000DM001-9YN1
   Hitachi HDS5C303
   Hitachi HDS5C303
   ST33000651AS

   ashift: 9

Is that standard? I did nothing but plug them in and zpool create. They seem to run pretty fast; I can get up to 400 MB/s writes from /dev/zero... :)

--
http://www.glumbert.com/media/shift
http://www.youtube.com/watch?v=tGvHNNOLnCk
"This officer's men seem to follow him merely out of idle curiosity." -- Sandhurst officer cadet evaluation.
"Securing an environment of Windows platforms from abuse - external or internal - is akin to trying to install sprinklers in a fireworks factory where smoking on the job is permitted." -- Gene Spafford
learn french: http://www.youtube.com/watch?v=30v_g83VHK4
Nathan Kroenert
2012-May-29 12:35 UTC
Hi John,

Actually, last time I tried the whole AF (4k) thing, its performance was worse than woeful.

But admittedly, that was a little while ago.

The drives were the Seagate green Barracuda IIRC, and performance for just about everything was 20MB/s per spindle or worse, when it should have been closer to 100MB/s when streaming. Things were worse still when doing random...

I'm actually looking to put in something larger than the 3*2TB drives (triple mirror for read perf) this pool has in it - preferably 3 * 4TB drives. (I don't want to put in more spindles - just replace the current ones...)

I might just have to bite the bullet and try something with current SW. :)

Nathan.

On 05/29/12 08:54 PM, John Martin wrote:
> What performance problem do you expect?
John Martin
2012-May-29 13:04 UTC
On 05/29/12 08:35, Nathan Kroenert wrote:
> The drives were the seagate green barracuda IIRC, and performance for
> just about everything was 20MB/s per spindle or worse, when it should
> have been closer to 100MB/s when streaming. Things were worse still when
> doing random...
>
> I'm actually looking to put in something larger than the 3*2TB drives
> (triple mirror for read perf) this pool has in it - preferably 3 * 4TB
> drives. (I don't want to put in more spindles - just replace the current
> ones...)
>
> I might just have to bite the bullet and try something with current SW. :).

Raw read from one of the mirrors:

   # timex dd if=/dev/rdsk/c0t2d0s2 of=/dev/null bs=1024000 count=10000
   10000+0 records in
   10000+0 records out

   real       49.26
   user        0.01
   sys         0.27

filebench filemicro_seqread reports an impossibly high number (4GB/s), so the ARC is likely handling all reads.

The label on the boxes I bought says:

   1TB 32MB INTERNAL KIT 7200
   ST310005N1A1AS-RK
   S/N: ...
   PN:9BX1A8-573

The drives in the box were really ST1000DM003-9YN162 with 64MB of cache. I have multiple pools on each disk, so the cache should be disabled. The drive reports 512-byte logical sectors and 4096-byte physical sectors.
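The dd run above does not print a rate, but it follows directly from the block size, the count and the elapsed `real` time. A minimal sketch of that arithmetic, using decimal megabytes (`dd_throughput_mb_s` is a hypothetical helper):

```python
# Sketch: throughput implied by the dd/timex figures quoted above.

def dd_throughput_mb_s(block_bytes, count, elapsed_seconds):
    """Bytes transferred divided by elapsed time, in MB/s (10**6 bytes)."""
    return block_bytes * count / elapsed_seconds / 1e6

rate = dd_throughput_mb_s(block_bytes=1024000, count=10000,
                          elapsed_seconds=49.26)
print(f"raw sequential read: {rate:.0f} MB/s")
```

That works out to a little over 200 MB/s for a raw sequential read from a single 512e drive, far above the 20MB/s per spindle Nathan saw earlier.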
John Martin
2012-May-29 13:05 UTC
On 05/29/12 07:26, bofh wrote:
> ashift:9 is that standard?

Depends on what the drive reports as physical sector size.
Casper.Dik at oracle.com
2012-May-29 13:06 UTC
>The drives were the seagate green barracuda IIRC, and performance for
>just about everything was 20MB/s per spindle or worse, when it should
>have been closer to 100MB/s when streaming. Things were worse still when
>doing random...

It is possible that your partitions weren't aligned at 4K, and that will cause serious issues with those drives. (Solaris now tries to make sure that all partitions are on 4K boundaries, or makes sure that the zpool dev_t is aligned to 4K.)

Casper
Jim Klimov
2012-May-29 13:10 UTC
2012-05-29 16:35, Nathan Kroenert wrote:
> Actually, last time I tried the whole AF (4k) thing, it's performance
> was worse than woeful.
>
> But admittedly, that was a little while ago.
>
> The drives were the seagate green barracuda IIRC, and performance for
> just about everything was 20MB/s per spindle or worse, when it should
> have been closer to 100MB/s when streaming. Things were worse still when
> doing random...

On one hand, it is possible that, being green, the drives aren't very capable of fast IO - they had different design goals and tradeoffs.

But actually I was going to ask if you paid attention to partitioning. At what offsets did your ZFS pool data start? Was that offset divisible by 4KB (i.e. 256 512-byte sectors, as is the default now, vs. 34 sectors of the older default)?

If the drive had 4KB native sectors but the logical FS blocks were not aligned with that, then every write IO would involve RMW of many sectors (perhaps the disk's caching might alleviate this for streaming writes, though).

Also note that ZFS IO often is random even for reads, since you have to read metadata and file data, often from different dispersed locations. Again, OS caching helps statistically, when you have much RAM dedicated to caching. Hmmm... did you use dedup in those tests? That is another source of performance degradation on smaller machines (under tens of GBs of RAM).

HTH,
//Jim
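The alignment question above is easy to check by hand: multiply the starting sector by the logical sector size and see whether the result is a multiple of the physical sector size. A minimal sketch, using the two starting sectors mentioned in this message (`is_aligned` is a hypothetical helper):

```python
# Sketch: is a partition's starting offset aligned to the physical
# sector size? Start sectors 34 and 256 come from the message above.

def is_aligned(start_sector, logical_bytes=512, physical_bytes=4096):
    """True if the byte offset of start_sector is a multiple of the
    physical sector size."""
    return (start_sector * logical_bytes) % physical_bytes == 0

for start in (34, 256):
    offset = start * 512
    print(f"start sector {start}: byte offset {offset}, "
          f"aligned to 4K: {is_aligned(start)}")
```

Sector 34 lands 1024 bytes past a 4K boundary, so every filesystem block written from there straddles two physical sectors; sector 256 starts exactly on a boundary.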
Richard Elling
2012-May-29 23:51 UTC
On May 29, 2012, at 6:10 AM, Jim Klimov wrote:
> Also note that ZFS IO often is random even for reads, since you
> have to read metadata and file data often from different dispersed
> locations.

This is true for almost all other file systems, too. For example, in UFS, metadata is stored in fixed locations on the disk as defined when the filesystem is created.

 -- richard
nathan
2012-May-30 02:33 UTC
On 29/05/2012 11:10 PM, Jim Klimov wrote:
> On one hand, it is possible that being green, the drives aren't very
> capable of fast IO - they had different design goals and tradeoffs.

Indeed! I just wasn't expecting it to be so profound.

> But actually I was going to ask if you paid attention to partitioning?
> At what offsets did your ZFS pool data start? Was that offset divisible
> by 4KB (i.e. 256 512byte sectors as is default now vs 34 sectors of
> the older default)?

It was. Actually I tried it in a variety of ways, including auto EFI partition (zpool create with the whole disk), using an SMI label, and trying a variety of tricks with offsets. Again, it was a while ago - before the time of the SD RMW fix...

> If the drive had 4kb native sectors but the logical FS blocks were
> not aligned with that, then every write IO would involve RMW of
> many sectors (perhaps disk's caching might alleviate this for
> streaming writes though).

Yep - that's what it *felt* like, and I didn't seem to be able to change that at the time.

> Also note that ZFS IO often is random even for reads, since you
> have to read metadata and file data often from different dispersed
> locations. Again, OS caching helps statistically, when you have
> much RAM dedicated to caching. Hmmm... did you use dedup in those
> tests?- that is another source of performance degradation on smaller
> machines (under tens of GBs of RAM).

At the time, I had 1TB of data, and 1TB of space... I'd expect that most of the data would have been written 'closeish' to sequential on disk, though I'll confess I only spent a short time looking at the 'physical' read/write locations being sent down through the stack. (Where the drive writes them - well... that's different. ;)

I have been contacted off list by a few folks who have indicated success with current drives and current Solaris bits. I'm thinking that it might be time to take another run at it.

I'll let the list know the results. ;)

Cheers
Nathan.
Henrik Johansson
2012-Jun-07 08:32 UTC
[zfs-discuss] Advanced format 4K bug in illumos? (was: [zfs-discuss] Advanced Format HDD's - are we there yet? ...)
Hello,

While we are talking about Advanced Format, does anyone know if bugid 7021758 is an issue in illumos?

Synopsis: zpool disk corruption detected on 4k block disks
http://wesunsolve.net/bugid/id/7021758

Regards
Henrik
http://sparcv9.blogspot.com