Hi all,

I'm having some trouble with adding cache drives to a zpool, anyone got
any ideas?

muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
Password:
cannot open '/dev/dsk/c25t10d1p2': I/O error
muslimwookie@Pyzee:~$

I have two SSDs in the system. I've created an 8GB partition on each drive
for use as a mirrored write cache (slog). I also have the remainder of each
drive partitioned for use as read cache. However, when attempting to add it
I get the error above.

Here's a zpool status:

  pool: aggr0
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 21 21:13:45 2013
        1.13T scanned out of 20.0T at 106M/s, 51h52m to go
        74.2G resilvered, 5.65% done
config:

        NAME                         STATE     READ WRITE CKSUM
        aggr0                        DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c7t5000C50035CA68EDd0    ONLINE       0     0     0
            c7t5000C5003679D3E2d0    ONLINE       0     0     0
            c7t50014EE2B16BC08Bd0    ONLINE       0     0     0
            c7t50014EE2B174216Dd0    ONLINE       0     0     0
            c7t50014EE2B174366Bd0    ONLINE       0     0     0
            c7t50014EE25C1E7646d0    ONLINE       0     0     0
            c7t50014EE25C17A62Cd0    ONLINE       0     0     0
            c7t50014EE25C17720Ed0    ONLINE       0     0     0
            c7t50014EE206C2AFD1d0    ONLINE       0     0     0
            c7t50014EE206C8E09Fd0    ONLINE       0     0     0
            c7t50014EE602DFAACAd0    ONLINE       0     0     0
            c7t50014EE602DFE701d0    ONLINE       0     0     0
            c7t50014EE20677C1C1d0    ONLINE       0     0     0
            replacing-13             UNAVAIL      0     0     0
              c7t50014EE6031198C1d0  UNAVAIL      0     0     0  cannot open
              c7t50014EE0AE2AB006d0  ONLINE       0     0     0  (resilvering)
            c7t50014EE65835480Dd0    ONLINE       0     0     0
        logs
          mirror-1                   ONLINE       0     0     0
            c25t10d1p1               ONLINE       0     0     0
            c25t9d1p1                ONLINE       0     0     0

errors: No known data errors

As you can see, I've successfully added the 8GB partitions as a mirrored
write cache. Interestingly, when I do a zpool iostat -v it shows the log
mirror's capacity as 111GB:

                               capacity     operations    bandwidth
pool                         alloc   free   read  write   read  write
---------------------------  -----  -----  -----  -----  -----  -----
aggr0                        20.0T  7.27T  1.33K    139  81.7M  4.19M
  raidz2                     20.0T  7.27T  1.33K    115  81.7M  2.70M
    c7t5000C50035CA68EDd0        -      -    566      9  6.91M   241K
    c7t5000C5003679D3E2d0        -      -    493      8  6.97M   242K
    c7t50014EE2B16BC08Bd0        -      -    544      9  7.02M   239K
    c7t50014EE2B174216Dd0        -      -    525      9  6.94M   241K
    c7t50014EE2B174366Bd0        -      -    540      9  6.95M   241K
    c7t50014EE25C1E7646d0        -      -    549      9  7.02M   239K
    c7t50014EE25C17A62Cd0        -      -    534      9  6.93M   241K
    c7t50014EE25C17720Ed0        -      -    542      9  6.95M   241K
    c7t50014EE206C2AFD1d0        -      -    549      9  7.02M   239K
    c7t50014EE206C8E09Fd0        -      -    526     10  6.94M   241K
    c7t50014EE602DFAACAd0        -      -    576     10  6.91M   241K
    c7t50014EE602DFE701d0        -      -    591     10  7.00M   239K
    c7t50014EE20677C1C1d0        -      -    530     10  6.95M   241K
    replacing                    -      -      0    922      0  7.11M
      c7t50014EE6031198C1d0      -      -      0      0      0      0
      c7t50014EE0AE2AB006d0      -      -      0    622      2  7.10M
    c7t50014EE65835480Dd0        -      -    595     10  6.98M   239K
logs                             -      -      -      -      -      -
  mirror                      740K   111G      0     43      0  2.75M
    c25t10d1p1                   -      -      0     43      3  2.75M
    c25t9d1p1                    -      -      0     43      3  2.75M
---------------------------  -----  -----  -----  -----  -----  -----
rpool                        7.32G  12.6G      2      4  41.9K  43.2K
  c4t0d0s0                   7.32G  12.6G      2      4  41.9K  43.2K
---------------------------  -----  -----  -----  -----  -----  -----

Something funky is going on here...

Wooks
Andrew Werchowiecki wrote:
> Hi all,
>
> I'm having some trouble with adding cache drives to a zpool, anyone
> got any ideas?
>
> muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
> Password:
> cannot open '/dev/dsk/c25t10d1p2': I/O error
> muslimwookie@Pyzee:~$
>
> I have two SSDs in the system. I've created an 8GB partition on each
> drive for use as a mirrored write cache. I also have the remainder of
> each drive partitioned for use as read cache. However, when
> attempting to add it I get the error above.

Create one 100% Solaris partition and then use format to create two
slices.

--
Ian.
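For the archives, a rough sketch of the layout Ian is suggesting. The device
names are the two SSDs from the thread and the 8GB/remainder split is Wooks'
own; the interactive prompts vary between Solaris releases, so treat this as
an outline rather than a transcript:

  # format -e c25t10d1
  format> fdisk          (create one SOLARIS2 partition using 100% of the disk)
  format> partition
  partition> 0           (slice 0: tag usr, roughly 8GB, for the mirrored slog)
  partition> 1           (slice 1: tag usr, the remaining space, for the cache)
  partition> label       (choose the SMI label here, not EFI)
  partition> quit
  format> quit

  Repeat for c25t9d1, then:

  # zpool add aggr0 log mirror c25t10d1s0 c25t9d1s0
  # zpool add aggr0 cache c25t10d1s1 c25t9d1s1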
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Mar-15 12:44 UTC
Re: partitioned cache devices
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-
> bounces@opensolaris.org] On Behalf Of Andrew Werchowiecki
>
> muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
> Password:
> cannot open '/dev/dsk/c25t10d1p2': I/O error
> muslimwookie@Pyzee:~$
>
> I have two SSDs in the system. I've created an 8GB partition on each drive
> for use as a mirrored write cache. I also have the remainder of each drive
> partitioned for use as read cache. However, when attempting to add it I get
> the error above.

Sounds like you're probably running into confusion about how to partition the
drive. If you create fdisk partitions, they will be accessible as p0, p1, p2,
but I think p0 unconditionally refers to the whole drive, so the first
partition is p1 and the second is p2.

If you create one big Solaris fdisk partition and then slice it via
"partition", where s2 is typically the encompassing slice and people usually
use s1 and s2 and s6 for actual slices, then they will be accessible via s1,
s2, s6.

Generally speaking, it's inadvisable to split the slog/cache devices anyway,
because:

If you're splitting it, evidently you're focusing on the wasted space: buying
an expensive 128G device where you couldn't possibly ever use more than 4G or
8G in the slog. But that's not what you should be focusing on. You should be
focusing on the speed (that's why you bought it in the first place). The slog
is write-only, and the cache is a mixture of read/write, where it should
hopefully be doing more reads than writes. But regardless of your actual
success with the cache device, your cache device will be busy most of the
time, and competing against the slog.

You have a mirror, you say. You should probably drop both the cache and log,
then use one whole device for the cache and one whole device for the log.
The only risk you'll run is:

Since a slog is write-only (except during mount, typically at boot), it's
possible to have a failure mode where you think you're writing to the log,
but the first time you go back and read, you discover an error and discover
the device has gone bad. In other words, without ever doing any reads, you
might not notice when/if the device goes bad. Fortunately, there's an easy
workaround. You could periodically (say, once a month) script the removal of
your log device, create a junk pool, write a bunch of data to it, scrub it
(thus verifying it was written correctly), and in the absence of any scrub
errors, destroy the junk pool and re-add the device as a slog to the main
pool.

I've never heard of anyone actually being that paranoid, and I've never heard
of anyone actually experiencing the aforementioned possible undetected device
failure mode. So this is all mostly theoretical.

Mirroring the slog device really isn't necessary in the modern age.
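What that monthly check might look like as a script: a sketch only, untested,
and assuming a single whole-device slog named c25t10d1 rather than the
mirrored-partition layout in Wooks' pool.

  #!/bin/sh
  # Pull the slog out of the main pool and exercise it in a throwaway pool.
  LOGDEV=c25t10d1                  # assumption: a whole-device slog
  zpool remove aggr0 $LOGDEV
  zpool create junkpool $LOGDEV
  # Write a few GB so most of the device gets touched.
  dd if=/dev/urandom of=/junkpool/testfile bs=1024k count=4096
  zpool scrub junkpool
  # Wait for the scrub to finish before looking at the result.
  while zpool status junkpool | grep -q "scrub in progress"; do
          sleep 60
  done
  zpool status -x junkpool         # expect the pool to be reported healthy
  zpool destroy junkpool
  # Put the device back as a slog on the main pool.
  zpool add aggr0 log $LOGDEV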
It's a home setup, the performance penalty from splitting the cache devices
is non-existent, and that workaround sounds like a pretty crazy amount of
overhead when I could instead just have a mirrored slog.

I'm less concerned about wasted space, more concerned about the number of SAS
ports I have available.

I understand that p0 refers to the whole disk... in the logs I pasted I'm not
attempting to mount p0. I'm trying to work out why I'm getting an error
attempting to mount p2, after p1 has successfully mounted. Further, this has
been done before on other systems in the same hardware configuration in
exactly the same fashion, and I've gone over the steps trying to make sure I
haven't missed something, but can't see a fault.

I'm not keen on using Solaris slices because I don't have an understanding of
what that does to the pool's OS interoperability.
On Mar 16, 2013, at 7:01 PM, Andrew Werchowiecki
<Andrew.Werchowiecki@xpanse.com.au> wrote:

> It's a home setup, the performance penalty from splitting the cache devices
> is non-existent, and that workaround sounds like a pretty crazy amount of
> overhead when I could instead just have a mirrored slog.
>
> I'm less concerned about wasted space, more concerned about the number of
> SAS ports I have available.
>
> I understand that p0 refers to the whole disk... in the logs I pasted I'm
> not attempting to mount p0. I'm trying to work out why I'm getting an error
> attempting to mount p2, after p1 has successfully mounted. Further, this
> has been done before on other systems in the same hardware configuration in
> exactly the same fashion, and I've gone over the steps trying to make sure
> I haven't missed something, but can't see a fault.

You can have only one Solaris partition at a time. Ian already shared the
answer, "Create one 100% Solaris partition and then use format to create two
slices."
 -- richard

> I'm not keen on using Solaris slices because I don't have an understanding
> of what that does to the pool's OS interoperability.
--
ZFS and performance consulting
http://www.RichardElling.com
On Sun, Mar 17, 2013 at 1:01 PM, Andrew Werchowiecki
<Andrew.Werchowiecki@xpanse.com.au> wrote:

> I understand that p0 refers to the whole disk... in the logs I pasted I'm
> not attempting to mount p0. I'm trying to work out why I'm getting an error
> attempting to mount p2, after p1 has successfully mounted. Further, this
> has been done before on other systems in the same hardware configuration in
> exactly the same fashion, and I've gone over the steps trying to make sure
> I haven't missed something, but can't see a fault.

How did you create the partitions? Are they marked as a Solaris partition, or
something else (e.g. fdisk on Linux uses type "83" by default)?

> I'm not keen on using Solaris slices because I don't have an understanding
> of what that does to the pool's OS interoperability.

Linux can read Solaris slices and import Solaris-made pools just fine, as
long as you're using a compatible zpool version (e.g. zpool version 28).

--
Fajar
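For what it's worth, one way to answer Fajar's question from the Solaris side
(a sketch; the device is the SSD from the thread, and the type IDs in the
comments are from memory, so check them against the fdisk(1M) man page):

  # Dump the FDISK table of the whole disk (p0) to stdout; the Id column
  # shows the partition type, e.g. 191 for Solaris2 or 238 for an EFI
  # protective partition.
  fdisk -W - /dev/rdsk/c25t10d1p0

  # Show the slice layout inside the Solaris or EFI label.
  prtvtoc /dev/rdsk/c25t10d1s0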
I did something like the following:

  format -e /dev/rdsk/c5t0d0p0
    fdisk
      1 (create)
      F (EFI)
      6 (exit)
    partition
      label
        1
        y
      0
        usr
        wm
        64
        4194367e
      1
        usr
        wm
        4194368
        117214990
      label
        1
        y

             Total disk size is 9345 cylinders
             Cylinder size is 12544 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1                 EFI               0  9345    9346    100

  partition> print
  Current partition table (original):
  Total disk sectors available: 117214957 + 16384 (reserved sectors)

  Part      Tag    Flag     First Sector        Size        Last Sector
    0        usr    wm                64       2.00GB          4194367
    1        usr    wm           4194368      53.89GB          117214990
    2 unassigned    wm                 0           0                  0
    3 unassigned    wm                 0           0                  0
    4 unassigned    wm                 0           0                  0
    5 unassigned    wm                 0           0                  0
    6 unassigned    wm                 0           0                  0
    8   reserved    wm         117214991       8.00MB          117231374

This isn't the output from when I did it, but it is exactly the same steps
that I followed.

Thanks for the info about slices, I may give that a go later on. I'm not keen
on that because I have clear evidence (as in zpools set up this way, right
now, working, without issue) that GPT partitions of the style shown above
work, and I want to see why it doesn't work in my setup rather than simply
ignoring it and moving on.
Andrew Werchowiecki wrote:
> Thanks for the info about slices, I may give that a go later on. I'm not
> keen on that because I have clear evidence (as in zpools set up this way,
> right now, working, without issue) that GPT partitions of the style shown
> above work, and I want to see why it doesn't work in my setup rather than
> simply ignoring it and moving on.

Didn't you read Richard's post?

"You can have only one Solaris partition at a time."

Your original example failed when you tried to add a second.

--
Ian.
Hi Andrew,

Your original syntax was incorrect.

A p* device is a larger container for the d* device or s* devices. In the
case of a cache device, you need to specify a d* or s* device. That you can
add p* devices to a pool at all is a bug.

Adding different slices from c25t10d1 as both log and cache devices would
need the s* identifier, but you've already added the entire c25t10d1 as the
log device. A better configuration would be using c25t10d1 for log and
c25t9d1 for cache, or providing some spares for this large pool.

After you remove the log devices, re-add like this:

# zpool add aggr0 log c25t10d1
# zpool add aggr0 cache c25t9d1

You might review the ZFS recommended practices section here:

http://docs.oracle.com/cd/E26502_01/html/E29007/zfspools-4.html#storage-2

See example 3-4 for adding a cache device, here:

http://docs.oracle.com/cd/E26502_01/html/E29007/gayrd.html#gazgw

Always have good backups.

Thanks,

Cindy
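For reference, the removal step Cindy mentions would look roughly like the
following. "mirror-1" is the log mirror's name as shown in the earlier zpool
status output, so confirm it against your own status output before removing
anything:

  # zpool remove aggr0 mirror-1     # detach the mirrored slog from the pool
  # zpool status aggr0              # confirm the logs section is gone
  # zpool add aggr0 log c25t10d1
  # zpool add aggr0 cache c25t9d1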
Andrew Werchowiecki wrote:
>              Total disk size is 9345 cylinders
>              Cylinder size is 12544 (512 byte) blocks
>
>                                                Cylinders
>       Partition   Status    Type          Start   End   Length    %
>       =========   ======    ============  =====   ===   ======   ===
>           1                 EFI               0  9345    9346    100

You only have a p1 (and for a GPT/EFI labeled disk, you can only have p1 - no
other FDISK partitions are allowed).

> partition> print
> Current partition table (original):
> Total disk sectors available: 117214957 + 16384 (reserved sectors)
>
> Part      Tag    Flag     First Sector        Size        Last Sector
>   0        usr    wm                64       2.00GB          4194367
>   1        usr    wm           4194368      53.89GB          117214990
>   2 unassigned    wm                 0           0                  0
>   3 unassigned    wm                 0           0                  0
>   4 unassigned    wm                 0           0                  0
>   5 unassigned    wm                 0           0                  0
>   6 unassigned    wm                 0           0                  0
>   8   reserved    wm         117214991       8.00MB          117231374

You have an s0 and s1.

> This isn't the output from when I did it, but it is exactly the same steps
> that I followed.
>
> Thanks for the info about slices, I may give that a go later on. I'm not
> keen on that because I have clear evidence (as in zpools set up this way,
> right now, working, without issue) that GPT partitions of the style shown
> above work, and I want to see why it doesn't work in my setup rather than
> simply ignoring it and moving on.

You would have to blow away the partitioning you have, and create an FDISK
partitioned disk (not EFI), and then create a p1 and p2 partition. (Don't use
the 'partition' subcommand, which confusingly creates Solaris slices.) Give
the FDISK partitions a partition type which nothing will recognise, such as
'other', so that nothing will try and interpret them as OS partitions. Then
you can use them as raw devices, and they should be portable between OSes
which can handle FDISK partitioned devices.

--
Andrew
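Roughly, the steps Andrew is describing would go something like this. An
outline only: the fdisk menu wording differs between releases, and wiping the
SSD destroys the slog partition currently on it, so remove that from the pool
first.

  # format -e c25t10d1
  format> fdisk         delete the existing EFI partition, then create
                        partition 1 (type "Other", roughly 8GB, not active)
                        and partition 2 (type "Other", the remaining space)
  format> quit

  # zpool add aggr0 log mirror c25t10d1p1 c25t9d1p1
  # zpool add aggr0 cache c25t10d1p2 c25t9d1p2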
On 2013-03-19 20:38, Cindy Swearingen wrote:
> Hi Andrew,
>
> Your original syntax was incorrect.
>
> A p* device is a larger container for the d* device or s* devices.
> In the case of a cache device, you need to specify a d* or s* device.
> That you can add p* devices to a pool at all is a bug.

I disagree; at least, I've always thought differently: the "d" device is the
whole-disk denomination, with a unique number for a particular controller
link ("c+t").

The disk has some partitioning table, MBR or GPT/EFI. In these tables,
partition "p0" stands for the table itself (i.e. to manage partitioning), and
the rest kind of "depends". In the case of MBR tables, one partition may be
named as having a Solaris (or Solaris2) type, and there it holds an SMI table
of Solaris slices, and these slices can hold legacy filesystems or components
of ZFS pools. In the case of GPT, the GPT partitions can be used directly by
ZFS. However, they are also denominated as "slices" in ZFS and the format
utility.

I believe Solaris-based OSes accessing a "p"-named partition and an "s"-named
slice of the same number on a GPT disk should lead to the same range of bytes
on disk, but I am not really certain about this.

Also, if a "whole disk" is given to ZFS (and for OSes other than the latest
Solaris 11 this means non-rpool disks), then ZFS labels the disk as GPT and
defines a partition for itself plus a small trailing partition (likely to
level out discrepancies with replacement disks that might happen to be a few
sectors too small). In this case ZFS reports that it uses "cXtYdZ" as a pool
component, since it considers itself in charge of the partitioning table and
its inner contents, and doesn't intend to share the disk with other usages
(dual-booting and other OSes' partitions, or SLOG and L2ARC parts, etc.).
This also "allows" ZFS to influence hardware-related choices, like caching
and throttling, and likely auto-expansion with changed LUN sizes by fixing up
the partition table along the way, since it assumes being 100% in charge of
the disk.

I don't think there is a "crime" in trying to use the partitions (of either
kind) as ZFS leaf vdevs; even the zpool(1M) manpage states that:

     ... The following virtual devices are supported:

     disk
         A block device, typically located under /dev/dsk. ZFS can use
         individual slices or partitions, though the recommended mode of
         operation is to use whole disks. ...

This is orthogonal to the fact that there can only be one Solaris slice
table, inside one partition, on MBR. AFAIK this is irrelevant on GPT/EFI - no
SMI slices there.

On my old home NAS with OpenSolaris I certainly did have MBR partitions on
the rpool disk, intended initially for some dual-booted OSes but repurposed
as L2ARC and ZIL devices for the storage pool on other disks, when I played
with that technology. Didn't gain much with a single spindle ;)

HTH,
//Jim Klimov
On 03/19/13 20:27, Jim Klimov wrote:
> I disagree; at least, I've always thought differently:
> the "d" device is the whole-disk denomination, with a
> unique number for a particular controller link ("c+t").
>
> The disk has some partitioning table, MBR or GPT/EFI.
> In these tables, partition "p0" stands for the table
> itself (i.e. to manage partitioning),

p0 is the whole disk regardless of any partitioning. (Hence you can use p0 to
access any type of partition table.)

> and the rest kind of "depends". In the case of MBR tables,
> one partition may be named as having a Solaris (or Solaris2)
> type, and there it holds an SMI table of Solaris slices, and
> these slices can hold legacy filesystems or components of ZFS
> pools. In the case of GPT, the GPT partitions can be used
> directly by ZFS. However, they are also denominated as
> "slices" in ZFS and the format utility.

The GPT partitioning spec requires the disk to be FDISK partitioned with just
one single FDISK partition of type EFI, so that tools which predate GPT
partitioning will still see such a GPT disk as fully assigned to FDISK
partitions, and it is therefore less likely to be accidentally blown away.

> I believe Solaris-based OSes accessing a "p"-named
> partition and an "s"-named slice of the same number
> on a GPT disk should lead to the same range of bytes
> on disk, but I am not really certain about this.

No, you'll see just p0 (whole disk) and p1 (whole disk less space for the
backwards-compatible FDISK partitioning).

> Also, if a "whole disk" is given to ZFS (and for OSes
> other than the latest Solaris 11 this means non-rpool
> disks), then ZFS labels the disk as GPT and defines a
> partition for itself plus a small trailing partition
> (likely to level out discrepancies with replacement
> disks that might happen to be a few sectors too small).
> In this case ZFS reports that it uses "cXtYdZ" as a
> pool component,

For an EFI disk, the device name without a final p* or s* component is the
whole EFI partition. (It's actually the s7 slice minor device node, but the
s7 is dropped from the device name to avoid the confusion we had with s2 on
SMI labeled disks being the whole SMI partition.)

> since it considers itself in charge of the partitioning
> table and its inner contents, and doesn't intend to share
> the disk with other usages (dual-booting and other OSes'
> partitions, or SLOG and L2ARC parts, etc.). This also
> "allows" ZFS to influence hardware-related choices, like
> caching and throttling, and likely auto-expansion with
> changed LUN sizes by fixing up the partition table along
> the way, since it assumes being 100% in charge of the disk.
>
> I don't think there is a "crime" in trying to use the
> partitions (of either kind) as ZFS leaf vdevs; even the
> zpool(1M) manpage states that:
>
>      ... The following virtual devices are supported:
>      disk
>          A block device, typically located under /dev/dsk.
>          ZFS can use individual slices or partitions, though
>          the recommended mode of operation is to use whole
>          disks. ...

Right.

> This is orthogonal to the fact that there can only be
> one Solaris slice table, inside one partition, on MBR.
> AFAIK this is irrelevant on GPT/EFI - no SMI slices there.

There's a simpler way to think of it on x86. You always have FDISK
partitioning (p1, p2, p3, p4). You can then have SMI or GPT/EFI slices (both
called s0, s1, ...) in an FDISK partition of the appropriate type.

With SMI labeling, s2 is by convention the whole Solaris FDISK partition
(although this is not enforced).
With EFI labeling, s7 is enforced as the whole EFI FDISK partition, and so
the trailing s7 is dropped off the device name for clarity.

This simplicity is brought about because the GPT spec requires that backwards
compatible FDISK partitioning is included, but with just one partition
assigned.

--
Andrew
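A quick way to see that mapping on a live x86 system is simply to list the
device nodes (c0t0d0 is a placeholder disk name here):

  ls -l /dev/dsk/c0t0d0*   # p0-p4 are the FDISK partition nodes, s0-s15 the slice nodes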
On 2013-03-19 22:07, Andrew Gabriel wrote:
> The GPT partitioning spec requires the disk to be FDISK
> partitioned with just one single FDISK partition of type EFI,
> so that tools which predate GPT partitioning will still see
> such a GPT disk as fully assigned to FDISK partitions, and
> it is therefore less likely to be accidentally blown away.

Okay, I guess I got entangled in terminology now ;)

Anyhow, your words are not all news to me, though my write-up was likely
misleading to unprepared readers... sigh... Thanks for the clarifications and
deeper details that I did not know!

So, we can concur that GPT does indeed include the fake MBR header with one
EFI partition which addresses the smaller of 2TB (the MBR limit) or the disk
size, minus a few sectors for the GPT housekeeping. Inside the EFI partition
are defined the GPT, um, partitions (represented as "s"lices in Solaris).
This is after all a GUID *Partition* Table, and that's how parted refers to
them too ;)

Notably, there are also unportable tricks to fool legacy OSes and bootloaders
into addressing the same byte ranges via both MBR entries (forged manually,
abusing the GPT/EFI spec) and proper GPT entries, as partitions in the sense
of each table.

//Jim