Hi guys,

I currently have an 18-drive system built from 13x 2.0TB Samsungs and 5x 1TB WDs. I'm about to swap out all of my 1TB drives for 2TB ones to grow the pool a bit.

My question is: the replacement 2TB drives are from various manufacturers (Seagate/Hitachi/Samsung), and I know from previous experience that the geometry/boundaries of each manufacturer's 2TB offerings differ. Is there a way to quickly ascertain whether my Seagate/Hitachi drives are as large as the 2.0TB Samsungs? I'd like to avoid replacing all the drives and then not being able to grow the pool.

Thanks,
Michael
Michael Armstrong writes:
> Is there a way to quickly ascertain if my seagate/hitachi drives are as
> large as the 2.0tb samsungs? I'd like to avoid the situation of replacing
> all drives and then not being able to grow the pool...

Hitachi prints the block count of the drive on the physical product label. If you compare that number to the one given in the Solaris label as printed by the prtvtoc command, you should be able to answer your question. Don't know about the Seagate drives, but they should at least have a block count somewhere in their documentation.

HTH -- Volker
--
------------------------------------------------------------------------
Volker A. Brandt               Consulting and Support for Oracle Solaris
Brandt & Brandt Computer GmbH                   WWW: http://www.bb-c.de/
Am Wiesenpfad 6, 53340 Meckenheim, GERMANY            Email: vab at bb-c.de
Handelsregister: Amtsgericht Bonn, HRB 10513              Schuhgröße: 46
Geschäftsführer: Rainer J.H. Brandt und Volker A. Brandt

"When logic and proportion have fallen sloppy dead"
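A minimal sketch of that comparison, assuming a Solaris-style device path (c0t1d0 is a placeholder; on EFI-labelled disks the whole-disk node may be needed instead of the s2 slice):

# Print the Solaris label; the header reports bytes per sector and the
# accessible sector (or cylinder) count, which gives the usable block count
# to compare against the figure printed on the drive's physical label.
prtvtoc /dev/rdsk/c0t1d0s2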
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Michael Armstrong
>
> Is there a way to quickly ascertain if my seagate/hitachi drives are as large as
> the 2.0tb samsungs? I'd like to avoid the situation of replacing all drives and
> then not being able to grow the pool...

It doesn't matter. If you have a bunch of drives that are all approximately the same size but vary slightly, and you make (for example) a raidz out of them, then the raidz is only limited by the size of the smallest one. So you will only be wasting 1% of the drives that are slightly larger.

Also, given that you have a pool currently made up of 13x2T and 5x1T, I presume these are separate vdevs. You don't have one huge 18-disk raidz3, do you? That would be bad, and it would also mean that you're currently wasting 13x1T. I assume the 5x1T are a single raidzN. You can increase the size of these disks without any cares about the size of the other 13.

Just make sure you have the autoexpand property set.

But most of all, make sure you do a scrub first, and make sure you complete the resilver in between each disk swap. Do not pull out more than one disk (or whatever your redundancy level is) while it's still resilvering from the previously replaced disk. If you're very thorough, you would also do a scrub in between each disk swap, but if it's just a bunch of home movies that are replaceable, you will probably skip that step.
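A minimal sketch of that swap workflow, assuming a reasonably recent zpool version with the autoexpand property; the pool name "tank" and the device names are placeholders:

# Let the vdev grow automatically once every disk in it has been enlarged.
zpool set autoexpand=on tank

# Scrub first and check the result before touching any hardware.
zpool scrub tank
zpool status tank

# Replace one disk at a time; wait until zpool status reports the resilver
# as completed before pulling the next disk.
zpool replace tank c1t4d0 c1t9d0
zpool status tank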
On Fri, Apr 13, 2012 at 9:35 AM, Edward Ned Harvey
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Michael Armstrong
> >
> > Is there a way to quickly ascertain if my seagate/hitachi drives are as large as
> > the 2.0tb samsungs? I'd like to avoid the situation of replacing all drives and
> > then not being able to grow the pool...
>
> It doesn't matter. If you have a bunch of drives that are all approx the
> same size but vary slightly, and you make (for example) a raidz out of them,
> then the raidz will only be limited by the size of the smallest one. So you
> will only be wasting 1% of the drives that are slightly larger.
>
> Also, given that you have a pool currently made up of 13x2T and 5x1T ... I
> presume these are separate vdevs. You don't have one huge 18-disk raidz3,
> do you? That would be bad. And it would also mean that you're currently
> wasting 13x1T. I assume the 5x1T are a single raidzN. You can increase the
> size of these disks, without any cares about the size of the other 13.
>
> Just make sure you have the autoexpand property set.
>
> But most of all, make sure you do a scrub first, and make sure you complete
> the resilver in between each disk swap. Do not pull out more than one disk
> (or whatever your redundancy level is) while it's still resilvering from the
> previously replaced disk. If you're very thorough, you would also do a
> scrub in between each disk swap, but if it's just a bunch of home movies
> that are replaceable, you will probably skip that step.

You will however have an issue replacing them if one should fail. You need
to have the same block count to replace a device, which is why I asked for a
"right-sizing" years ago. Deaf ears :/

--Tim
On Fri, Apr 13, 2012 at 9:30 AM, Tim Cook <tim at cook.ms> wrote:
> You will however have an issue replacing them if one should fail. You need
> to have the same block count to replace a device, which is why I asked for a
> "right-sizing" years ago. Deaf ears :/

I thought ZFSv20-something added a "if the blockcount is within 10%,
then allow the replace to succeed" feature, to work around this issue?

--
Freddie Cash
fjwcash at gmail.com
Yes, this is another thing I'm wary of... I should have slightly under-provisioned at the start or mixed manufacturers... Now I may have to replace 2TB failures with 2.5TB drives for the sake of a block.

Sent from my iPhone

On 13 Apr 2012, at 17:30, Tim Cook <tim at cook.ms> wrote:

> You will however have an issue replacing them if one should fail. You need
> to have the same block count to replace a device, which is why I asked for a
> "right-sizing" years ago. Deaf ears :/
>
> --Tim
On Fri, Apr 13, 2012 at 11:46 AM, Freddie Cash <fjwcash at gmail.com> wrote:
> On Fri, Apr 13, 2012 at 9:30 AM, Tim Cook <tim at cook.ms> wrote:
> > You will however have an issue replacing them if one should fail. You need
> > to have the same block count to replace a device, which is why I asked for a
> > "right-sizing" years ago. Deaf ears :/
>
> I thought ZFSv20-something added a "if the blockcount is within 10%,
> then allow the replace to succeed" feature, to work around this issue?
>
> --
> Freddie Cash
> fjwcash at gmail.com

That would be news to me. I'd love to hear it's true, though. When I made the request there was excuse after excuse about how it would be difficult, and how Sun always provided replacement drives of identical size, etc. (although there were people who responded who had in fact received drives of different sizes from Sun in RMA). I was hoping that now that the braintrust has moved on from Sun they'd embrace what I consider a common-sense decision, but I don't think it's happened.

--Tim
Am 13.04.12 19:22, schrieb Tim Cook:
> On Fri, Apr 13, 2012 at 11:46 AM, Freddie Cash <fjwcash at gmail.com> wrote:
> > I thought ZFSv20-something added a "if the blockcount is within 10%,
> > then allow the replace to succeed" feature, to work around this issue?
>
> That would be news to me. I'd love to hear it's true, though. When I made
> the request there was excuse after excuse about how it would be difficult,
> and how Sun always provided replacement drives of identical size, etc.
> (although there were people who responded who had in fact received drives
> of different sizes from Sun in RMA). I was hoping that now that the
> braintrust has moved on from Sun they'd embrace what I consider a
> common-sense decision, but I don't think it's happened.

I tend to think that S11 has even tightened the gap there. When I upgraded from SE11 to S11, a couple of drives became "corrupt" when S11 tried to mount the zpool, which consists of vdev mirrors. Switching back to SE11 and importing the very same zpool went without issue. In the SR I opened for that issue, it was stated that S11 is even more picky about drive sizes than SE11 had been, and I had to replace the drives with new ones. Interestingly these were all Hitachi drives of the "same" size, but prtvtoc in fact displayed fewer sectors for the ones that were refused by S11 - and the percentage was closer to 5% than to 10%, afair. I was able to create another zpool from the refused drives later on in S11, though.

Cheers,
budy
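A quick way to spot such differences up front is to compare the sector counts the labels report across the candidate disks; a rough sketch, with placeholder device names:

# List the accessible sector/cylinder count for each candidate disk so a
# slightly smaller drive shows up before a zpool replace is attempted.
# On EFI-labelled disks the whole-disk node may be needed instead of s2.
for d in c1t0d0 c1t1d0 c1t2d0; do
    echo "== $d =="
    prtvtoc /dev/rdsk/${d}s2 | grep -i accessible
done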
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Freddie Cash
>
> I thought ZFSv20-something added a "if the blockcount is within 10%,
> then allow the replace to succeed" feature, to work around this issue?

About 2 yrs ago, I replaced a drive with 1 block less, and it was a big problem. This was a drive bought from Oracle, to replace an Oracle drive, on a supported Sun system, and it was the same model drive with a higher firmware rev. We worked on it extensively, eventually managed to shoe-horn the drive in there, and I pledged I would always partition drives slightly smaller from now on.

Then, about 2 weeks later, the support rep emailed me to say they had implemented a new feature which could auto-resize across some small difference, a small percentage or 1 MB or something like that.

So there is some solid reason to corroborate Freddie's suspicion, but there's no way I'm going to find any material to reference now. The timing even sounds about right to support the v20 idea. I haven't tested or proven it myself, but I am confidently assuming, moving forward, that small variations will be handled gracefully.
http://wesunsolve.net/bugid/id/6563887
 -- richard

On Apr 14, 2012, at 6:04 AM, Edward Ned Harvey wrote:

> > I thought ZFSv20-something added a "if the blockcount is within 10%,
> > then allow the replace to succeed" feature, to work around this issue?
>
> Then, about 2 weeks later, the support rep emailed me to say they had
> implemented a new feature which could auto-resize across some small
> difference, a small percentage or 1 MB or something like that.

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
On Sat, Apr 14, 2012 at 09:04:45AM -0400, Edward Ned Harvey wrote:
> Then, about 2 weeks later, the support rep emailed me to say they had
> implemented a new feature which could auto-resize across some small
> difference, a small percentage or 1 MB or something like that.

There are two elements to this:
- the size of the actual data on the disk
- the logical block count, and the resulting LBAs of the labels positioned relative to the end of the disk.

The available size of the disk has always been rounded to a whole number of metaslabs, once the front and back label space is trimmed off. Combined with the fact that metaslab size is determined dynamically at vdev creation time based on device size, there can easily be some amount of unused space at the end, after the last metaslab and before the end labels. It is slop in this space that allows for the small differences you describe above, even for disks laid out in earlier zpool versions. A little poking with zdb and a few calculations will show you just how much a given disk has.

However, to make the replacement actually work, the zpool code needed to not insist on an absolute >= number of blocks (rather to check the more proper condition, that there was room for all the metaslabs). There was also testing to ensure that it handled the end labels moving inwards in absolute position, for a replacement onto slightly smaller rather than same/larger disks. That was the change that happened at the time.

(If you somehow had disks that fit exactly a whole number of metaslabs, you might still have an issue, I suppose. Perhaps that's likely if you carefully calculated LUN sizes to carve out of some other storage, in which case you can do the same for replacements.)

--
Dan.
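A rough sketch of that poking, assuming the pool is named "tank" and using the asize and metaslab_shift fields that zdb prints for each top-level vdev:

# Cached pool config: each top-level vdev reports asize (usable bytes after
# the labels are trimmed) and metaslab_shift (log2 of the metaslab size).
zdb -C tank | egrep 'asize|metaslab_shift'

# Per-vdev metaslab layout, including how many metaslabs were created.
zdb -m tank

# The slop on a vdev is then roughly:
#   asize - (number_of_metaslabs * 2^metaslab_shift)
# i.e. the space after the last metaslab and before the end labels.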
On 2012-Apr-14 02:30:54 +1000, Tim Cook <tim at cook.ms> wrote:
> You will however have an issue replacing them if one should fail. You need
> to have the same block count to replace a device, which is why I asked for a
> "right-sizing" years ago.

The "traditional" approach to this is to slice the disk yourself, so you have a slice of a known size and a dummy slice of a couple of GB in case a replacement is a bit smaller. Unfortunately, ZFS on Solaris disables the drive cache if you don't give it a complete disk, so this approach incurs a significant performance overhead there. FreeBSD leaves the drive cache enabled in either situation. I'm not sure how OI or Linux behave.

--
Peter Jeremy
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Peter Jeremy
>
> On 2012-Apr-14 02:30:54 +1000, Tim Cook <tim at cook.ms> wrote:
> > You will however have an issue replacing them if one should fail. You need
> > to have the same block count to replace a device, which is why I asked for a
> > "right-sizing" years ago.
>
> The "traditional" approach to this is to slice the disk yourself, so you have
> a slice of a known size and a dummy slice of a couple of GB in case a
> replacement is a bit smaller. Unfortunately, ZFS on Solaris disables the drive
> cache if you don't give it a complete disk, so this approach incurs a
> significant performance overhead there.

It's not so much that it "disables" it, as that it "doesn't enable" it. By default, for anything, the write-back cache (on-disk) would be disabled, but if you're using the whole disk for ZFS, then ZFS enables it, because it's known to be safe. (Unless... never mind.)

Whenever I've deployed ZFS on partitions, I just script the enabling of the write-back cache. So Peter's message is true, but it's solvable.
For the archives...

On Apr 16, 2012, at 3:37 PM, Peter Jeremy wrote:
> On 2012-Apr-14 02:30:54 +1000, Tim Cook <tim at cook.ms> wrote:
> > You will however have an issue replacing them if one should fail. You need
> > to have the same block count to replace a device, which is why I asked for a
> > "right-sizing" years ago.
>
> The "traditional" approach to this is to slice the disk yourself, so you have
> a slice of a known size and a dummy slice of a couple of GB in case a
> replacement is a bit smaller. Unfortunately, ZFS on Solaris disables the drive
> cache if you don't give it a complete disk, so this approach incurs a
> significant performance overhead there. FreeBSD leaves the drive cache enabled
> in either situation. I'm not sure how OI or Linux behave.

Write-back cache enablement is toxic for file systems that do not issue cache flush commands, such as Solaris' UFS. In the early days of ZFS, on Solaris 10 or before ZFS was bootable on OpenSolaris, it was not uncommon to have ZFS and UFS on the same system.

NB, there are a number of consumer-grade IDE/*ATA disks that ignore disabling the write buffer. Hence, it is not always a win to enable a write buffer that cannot be disabled.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
2012-04-17 5:15, Richard Elling wrote:
> For the archives...
>
> Write-back cache enablement is toxic for file systems that do not issue
> cache flush commands, such as Solaris' UFS. In the early days of ZFS,
> on Solaris 10 or before ZFS was bootable on OpenSolaris, it was not
> uncommon to have ZFS and UFS on the same system.
>
> NB, there are a number of consumer-grade IDE/*ATA disks that ignore
> disabling the write buffer. Hence, it is not always a win to enable a
> write buffer that cannot be disabled.
> -- richard

For the sake of the archives, can you please post a common troubleshooting technique which users can try at home to see whether their disks honour the request or not? ;) I guess it would involve comparing random-write bandwidth in the two cases?

And for the sake of the archives, here's what I do on my home system to toggle the write cache on the disks its pools use (it could be scripted better to detect the disk names from the zpool listing, but it works for me as-is):

# cat /etc/rc2.d/S95disable-pool-wcache
#!/bin/sh
case "$1" in
start)
        # Disable the on-disk write cache for every pool disk at boot.
        for C in 7; do for T in 0 1 2 3 4 5; do
                ( echo cache; echo write; echo display; echo disable; echo display ) \
                        | format -e -d c${C}t${T}d0 &
        done; done
        wait
        sync
        ;;
stop)
        # Re-enable the write cache on shutdown.
        for C in 7; do for T in 0 1 2 3 4 5; do
                ( echo cache; echo write; echo display; echo enable; echo display ) \
                        | format -e -d c${C}t${T}d0 &
        done; done
        wait
        sync
        ;;
*)
        # Anything else: just display the current cache settings.
        for C in 7; do for T in 0 1 2 3 4 5; do
                ( echo cache; echo write; echo display ) \
                        | format -e -d c${C}t${T}d0 &
        done; done
        wait
        sync
        ;;
esac
On Apr 17, 2012, at 12:25 AM, Jim Klimov wrote:
> 2012-04-17 5:15, Richard Elling wrote:
> > For the archives...
> >
> > Write-back cache enablement is toxic for file systems that do not issue
> > cache flush commands, such as Solaris' UFS. In the early days of ZFS,
> > on Solaris 10 or before ZFS was bootable on OpenSolaris, it was not
> > uncommon to have ZFS and UFS on the same system.
> >
> > NB, there are a number of consumer-grade IDE/*ATA disks that ignore
> > disabling the write buffer. Hence, it is not always a win to enable a
> > write buffer that cannot be disabled.
> > -- richard
>
> For the sake of the archives, can you please post a common troubleshooting
> technique which users can try at home to see whether their disks honour the
> request or not? ;) I guess it would involve comparing random-write bandwidth
> in the two cases?

I am aware of only one method that is guaranteed to work: contact the manufacturer, sign the NDA, read the docs.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
On 2012-Apr-17 17:25:36 +1000, Jim Klimov <jimklimov at cos.ru> wrote:
> For the sake of the archives, can you please post a common troubleshooting
> technique which users can try at home to see whether their disks honour the
> request or not? ;) I guess it would involve comparing random-write bandwidth
> in the two cases?

1) Issue a "disable write cache" command to the drive
2) Write several MB of data to the drive
3) As soon as the drive acknowledges completion, remove power to the drive
   (this will require an electronic switch in the drive's power lead)
4) Wait until the drive spins down
5) Power the drive up and wait until it is ready
6) Verify that the data written in (2) can be read back
7) Argue with the drive vendor that the drive doesn't meet specifications :-)

A similar approach can also be used to verify that NCQ & cache flush commands actually work.

--
Peter Jeremy
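For steps 2 and 6, one possible write-then-verify harness, assuming the Solaris dd and digest(1) utilities; the device name and sizes are placeholders, and the target slice's contents are destroyed:

# Step 2: write a known pattern to the raw device and record its checksum.
# WARNING: this overwrites the start of the placeholder slice c2t0d0s0.
dd if=/dev/urandom of=/tmp/pattern bs=1024k count=8
digest -a sha256 /tmp/pattern
dd if=/tmp/pattern of=/dev/rdsk/c2t0d0s0 bs=1024k count=8

# ... cut power to the drive, wait for spin-down, power it back up ...

# Step 6: read the same region back and compare checksums; a mismatch
# suggests the "disable write cache" request was not honoured.
dd if=/dev/rdsk/c2t0d0s0 of=/tmp/readback bs=1024k count=8
digest -a sha256 /tmp/readback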