Hi all,

I'm having some trouble with adding cache drives to a zpool, anyone got
any ideas?

muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
Password:
cannot open '/dev/dsk/c25t10d1p2': I/O error
muslimwookie@Pyzee:~$

I have two SSDs in the system. I've created an 8GB partition on each drive
for use as a mirrored write cache (slog). I also have the remainder of each
drive partitioned for use as read cache. However, when attempting to add it
I get the error above.

Here's a zpool status:

  pool: aggr0
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 21 21:13:45 2013
        1.13T scanned out of 20.0T at 106M/s, 51h52m to go
        74.2G resilvered, 5.65% done
config:

        NAME                         STATE     READ WRITE CKSUM
        aggr0                        DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c7t5000C50035CA68EDd0    ONLINE       0     0     0
            c7t5000C5003679D3E2d0    ONLINE       0     0     0
            c7t50014EE2B16BC08Bd0    ONLINE       0     0     0
            c7t50014EE2B174216Dd0    ONLINE       0     0     0
            c7t50014EE2B174366Bd0    ONLINE       0     0     0
            c7t50014EE25C1E7646d0    ONLINE       0     0     0
            c7t50014EE25C17A62Cd0    ONLINE       0     0     0
            c7t50014EE25C17720Ed0    ONLINE       0     0     0
            c7t50014EE206C2AFD1d0    ONLINE       0     0     0
            c7t50014EE206C8E09Fd0    ONLINE       0     0     0
            c7t50014EE602DFAACAd0    ONLINE       0     0     0
            c7t50014EE602DFE701d0    ONLINE       0     0     0
            c7t50014EE20677C1C1d0    ONLINE       0     0     0
            replacing-13             UNAVAIL      0     0     0
              c7t50014EE6031198C1d0  UNAVAIL      0     0     0  cannot open
              c7t50014EE0AE2AB006d0  ONLINE       0     0     0  (resilvering)
            c7t50014EE65835480Dd0    ONLINE       0     0     0
        logs
          mirror-1                   ONLINE       0     0     0
            c25t10d1p1               ONLINE       0     0     0
            c25t9d1p1                ONLINE       0     0     0

errors: No known data errors

As you can see, I've successfully added the 8GB partitions as a mirrored
write cache. Interestingly, when I do a zpool iostat -v it shows the log
mirror's capacity as 111GB:

                               capacity     operations    bandwidth
pool                         alloc   free   read  write   read  write
---------------------------  -----  -----  -----  -----  -----  -----
aggr0                        20.0T  7.27T  1.33K    139  81.7M  4.19M
  raidz2                     20.0T  7.27T  1.33K    115  81.7M  2.70M
    c7t5000C50035CA68EDd0        -      -    566      9  6.91M   241K
    c7t5000C5003679D3E2d0        -      -    493      8  6.97M   242K
    c7t50014EE2B16BC08Bd0        -      -    544      9  7.02M   239K
    c7t50014EE2B174216Dd0        -      -    525      9  6.94M   241K
    c7t50014EE2B174366Bd0        -      -    540      9  6.95M   241K
    c7t50014EE25C1E7646d0        -      -    549      9  7.02M   239K
    c7t50014EE25C17A62Cd0        -      -    534      9  6.93M   241K
    c7t50014EE25C17720Ed0        -      -    542      9  6.95M   241K
    c7t50014EE206C2AFD1d0        -      -    549      9  7.02M   239K
    c7t50014EE206C8E09Fd0        -      -    526     10  6.94M   241K
    c7t50014EE602DFAACAd0        -      -    576     10  6.91M   241K
    c7t50014EE602DFE701d0        -      -    591     10  7.00M   239K
    c7t50014EE20677C1C1d0        -      -    530     10  6.95M   241K
    replacing                    -      -      0    922      0  7.11M
      c7t50014EE6031198C1d0      -      -      0      0      0      0
      c7t50014EE0AE2AB006d0      -      -      0    622      2  7.10M
    c7t50014EE65835480Dd0        -      -    595     10  6.98M   239K
logs                             -      -      -      -      -      -
  mirror                      740K   111G      0     43      0  2.75M
    c25t10d1p1                   -      -      0     43      3  2.75M
    c25t9d1p1                    -      -      0     43      3  2.75M
---------------------------  -----  -----  -----  -----  -----  -----
rpool                        7.32G  12.6G      2      4  41.9K  43.2K
  c4t0d0s0                   7.32G  12.6G      2      4  41.9K  43.2K
---------------------------  -----  -----  -----  -----  -----  -----

Something funky is going on here...

Wooks
Andrew Werchowiecki wrote:
> Hi all,
>
> I'm having some trouble with adding cache drives to a zpool, anyone
> got any ideas?
>
> muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
> Password:
> cannot open '/dev/dsk/c25t10d1p2': I/O error
> muslimwookie@Pyzee:~$
>
> I have two SSDs in the system. I've created an 8GB partition on each
> drive for use as a mirrored write cache. I also have the remainder of
> each drive partitioned for use as read cache. However, when
> attempting to add it I get the error above.

Create one 100% Solaris partition and then use format to create two
slices.

--
Ian.
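For the archives, a rough sketch of the layout Ian is suggesting. The device
names are the two SSDs from the thread and the 8GB/remainder split is Wooks'
own; the interactive prompts vary between Solaris releases, so treat this as
an outline rather than a transcript:

  # format -e c25t10d1
  format> fdisk          (create one SOLARIS2 partition using 100% of the disk)
  format> partition
  partition> 0           (slice 0: tag usr, roughly 8GB, for the mirrored slog)
  partition> 1           (slice 1: tag usr, the remaining space, for the cache)
  partition> label       (choose the SMI label here, not EFI)
  partition> quit
  format> quit

  Repeat for c25t9d1, then:

  # zpool add aggr0 log mirror c25t10d1s0 c25t9d1s0
  # zpool add aggr0 cache c25t10d1s1 c25t9d1s1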
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2013-Mar-15 12:44 UTC
Re: partitioned cache devices
> From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-
> bounces@opensolaris.org] On Behalf Of Andrew Werchowiecki
>
> muslimwookie@Pyzee:~$ sudo zpool add aggr0 cache c25t10d1p2
> Password:
> cannot open '/dev/dsk/c25t10d1p2': I/O error
> muslimwookie@Pyzee:~$
>
> I have two SSDs in the system. I've created an 8GB partition on each drive
> for use as a mirrored write cache. I also have the remainder of each drive
> partitioned for use as read cache. However, when attempting to add it I get
> the error above.

Sounds like you're probably running into confusion about how to partition the
drive. If you create fdisk partitions, they will be accessible as p0, p1, p2,
but I think p0 unconditionally refers to the whole drive, so the first
partition is p1 and the second is p2.

If you create one big Solaris fdisk partition and then slice it via
"partition", where s2 is typically the encompassing slice and people usually
use s1 and s2 and s6 for actual slices, then they will be accessible via s1,
s2, s6.

Generally speaking, it's inadvisable to split the slog/cache devices anyway,
because:

If you're splitting it, evidently you're focusing on the wasted space: buying
an expensive 128G device where you couldn't possibly ever use more than 4G or
8G in the slog. But that's not what you should be focusing on. You should be
focusing on the speed (that's why you bought it in the first place). The slog
is write-only, and the cache is a mixture of read/write, where it should
hopefully be doing more reads than writes. But regardless of your actual
success with the cache device, your cache device will be busy most of the
time, and competing against the slog.

You have a mirror, you say. You should probably drop both the cache and log,
then use one whole device for the cache and one whole device for the log.
The only risk you'll run is:

Since a slog is write-only (except during mount, typically at boot), it's
possible to have a failure mode where you think you're writing to the log,
but the first time you go back and read, you discover an error and discover
the device has gone bad. In other words, without ever doing any reads, you
might not notice when/if the device goes bad. Fortunately, there's an easy
workaround. You could periodically (say, once a month) script the removal of
your log device, create a junk pool, write a bunch of data to it, scrub it
(thus verifying it was written correctly), and in the absence of any scrub
errors, destroy the junk pool and re-add the device as a slog to the main
pool.

I've never heard of anyone actually being that paranoid, and I've never heard
of anyone actually experiencing the aforementioned possible undetected device
failure mode. So this is all mostly theoretical.

Mirroring the slog device really isn't necessary in the modern age.
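What that monthly check might look like as a script: a sketch only, untested,
and assuming a single whole-device slog named c25t10d1 rather than the
mirrored-partition layout in Wooks' pool.

  #!/bin/sh
  # Pull the slog out of the main pool and exercise it in a throwaway pool.
  LOGDEV=c25t10d1                  # assumption: a whole-device slog
  zpool remove aggr0 $LOGDEV
  zpool create junkpool $LOGDEV
  # Write a few GB so most of the device gets touched.
  dd if=/dev/urandom of=/junkpool/testfile bs=1024k count=4096
  zpool scrub junkpool
  # Wait for the scrub to finish before looking at the result.
  while zpool status junkpool | grep -q "scrub in progress"; do
          sleep 60
  done
  zpool status -x junkpool         # expect the pool to be reported healthy
  zpool destroy junkpool
  # Put the device back as a slog on the main pool.
  zpool add aggr0 log $LOGDEV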
It's a home setup, the performance penalty from splitting the cache devices
is non-existent, and that workaround sounds like a pretty crazy amount of
overhead when I could instead just have a mirrored slog.

I'm less concerned about wasted space, more concerned about the number of SAS
ports I have available.

I understand that p0 refers to the whole disk... in the logs I pasted I'm not
attempting to mount p0. I'm trying to work out why I'm getting an error
attempting to mount p2, after p1 has successfully mounted. Further, this has
been done before on other systems in the same hardware configuration in
exactly the same fashion, and I've gone over the steps trying to make sure I
haven't missed something, but can't see a fault.

I'm not keen on using Solaris slices because I don't have an understanding of
what that does to the pool's OS interoperability.
On Mar 16, 2013, at 7:01 PM, Andrew Werchowiecki
<Andrew.Werchowiecki@xpanse.com.au> wrote:

> It's a home setup, the performance penalty from splitting the cache devices
> is non-existent, and that workaround sounds like a pretty crazy amount of
> overhead when I could instead just have a mirrored slog.
>
> I'm less concerned about wasted space, more concerned about the number of
> SAS ports I have available.
>
> I understand that p0 refers to the whole disk... in the logs I pasted I'm
> not attempting to mount p0. I'm trying to work out why I'm getting an error
> attempting to mount p2, after p1 has successfully mounted. Further, this
> has been done before on other systems in the same hardware configuration in
> exactly the same fashion, and I've gone over the steps trying to make sure
> I haven't missed something, but can't see a fault.

You can have only one Solaris partition at a time. Ian already shared the
answer, "Create one 100% Solaris partition and then use format to create two
slices."
 -- richard

> I'm not keen on using Solaris slices because I don't have an understanding
> of what that does to the pool's OS interoperability.
--
ZFS and performance consulting
http://www.RichardElling.com
On Sun, Mar 17, 2013 at 1:01 PM, Andrew Werchowiecki
<Andrew.Werchowiecki@xpanse.com.au> wrote:

> I understand that p0 refers to the whole disk... in the logs I pasted I'm
> not attempting to mount p0. I'm trying to work out why I'm getting an error
> attempting to mount p2, after p1 has successfully mounted. Further, this
> has been done before on other systems in the same hardware configuration in
> exactly the same fashion, and I've gone over the steps trying to make sure
> I haven't missed something, but can't see a fault.

How did you create the partitions? Are they marked as a Solaris partition, or
something else (e.g. fdisk on Linux uses type "83" by default)?

> I'm not keen on using Solaris slices because I don't have an understanding
> of what that does to the pool's OS interoperability.

Linux can read Solaris slices and import Solaris-made pools just fine, as
long as you're using a compatible zpool version (e.g. zpool version 28).

--
Fajar
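For what it's worth, one way to answer Fajar's question from the Solaris side
(a sketch; the device is the SSD from the thread, and the type IDs in the
comments are from memory, so check them against the fdisk(1M) man page):

  # Dump the FDISK table of the whole disk (p0) to stdout; the Id column
  # shows the partition type, e.g. 191 for Solaris2 or 238 for an EFI
  # protective partition.
  fdisk -W - /dev/rdsk/c25t10d1p0

  # Show the slice layout inside the Solaris or EFI label.
  prtvtoc /dev/rdsk/c25t10d1s0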
I did something like the following:

  format -e /dev/rdsk/c5t0d0p0
    fdisk
      1 (create)
      F (EFI)
      6 (exit)
    partition
      label
        1
        y
      0
        usr
        wm
        64
        4194367e
      1
        usr
        wm
        4194368
        117214990
      label
        1
        y

             Total disk size is 9345 cylinders
             Cylinder size is 12544 (512 byte) blocks

                                               Cylinders
      Partition   Status    Type          Start   End   Length    %
      =========   ======    ============  =====   ===   ======   ===
          1                 EFI               0  9345    9346    100

  partition> print
  Current partition table (original):
  Total disk sectors available: 117214957 + 16384 (reserved sectors)

  Part      Tag    Flag     First Sector        Size        Last Sector
    0        usr    wm                64       2.00GB          4194367
    1        usr    wm           4194368      53.89GB          117214990
    2 unassigned    wm                 0           0                  0
    3 unassigned    wm                 0           0                  0
    4 unassigned    wm                 0           0                  0
    5 unassigned    wm                 0           0                  0
    6 unassigned    wm                 0           0                  0
    8   reserved    wm         117214991       8.00MB          117231374

This isn't the output from when I did it, but it is exactly the same steps
that I followed.

Thanks for the info about slices, I may give that a go later on. I'm not keen
on that because I have clear evidence (as in zpools set up this way, right
now, working, without issue) that GPT partitions of the style shown above
work, and I want to see why it doesn't work in my setup rather than simply
ignoring it and moving on.
Andrew Werchowiecki wrote:
> Thanks for the info about slices, I may give that a go later on. I'm not
> keen on that because I have clear evidence (as in zpools set up this way,
> right now, working, without issue) that GPT partitions of the style shown
> above work, and I want to see why it doesn't work in my setup rather than
> simply ignoring it and moving on.

Didn't you read Richard's post?

"You can have only one Solaris partition at a time."

Your original example failed when you tried to add a second.

--
Ian.
Hi Andrew,

Your original syntax was incorrect.

A p* device is a larger container for the d* device or s* devices. In the
case of a cache device, you need to specify a d* or s* device. That you can
add p* devices to a pool at all is a bug.

Adding different slices from c25t10d1 as both log and cache devices would
need the s* identifier, but you've already added the entire c25t10d1 as the
log device. A better configuration would be using c25t10d1 for log and
c25t9d1 for cache, or providing some spares for this large pool.

After you remove the log devices, re-add like this:

# zpool add aggr0 log c25t10d1
# zpool add aggr0 cache c25t9d1

You might review the ZFS recommended practices section here:

http://docs.oracle.com/cd/E26502_01/html/E29007/zfspools-4.html#storage-2

See example 3-4 for adding a cache device, here:

http://docs.oracle.com/cd/E26502_01/html/E29007/gayrd.html#gazgw

Always have good backups.

Thanks,

Cindy
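For reference, the removal step Cindy mentions would look roughly like the
following. "mirror-1" is the log mirror's name as shown in the earlier zpool
status output, so confirm it against your own status output before removing
anything:

  # zpool remove aggr0 mirror-1     # detach the mirrored slog from the pool
  # zpool status aggr0              # confirm the logs section is gone
  # zpool add aggr0 log c25t10d1
  # zpool add aggr0 cache c25t9d1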
Andrew Werchowiecki wrote:
>              Total disk size is 9345 cylinders
>              Cylinder size is 12544 (512 byte) blocks
>
>                                                Cylinders
>       Partition   Status    Type          Start   End   Length    %
>       =========   ======    ============  =====   ===   ======   ===
>           1                 EFI               0  9345    9346    100

You only have a p1 (and for a GPT/EFI labeled disk, you can only have p1 - no
other FDISK partitions are allowed).

> partition> print
> Current partition table (original):
> Total disk sectors available: 117214957 + 16384 (reserved sectors)
>
> Part      Tag    Flag     First Sector        Size        Last Sector
>   0        usr    wm                64       2.00GB          4194367
>   1        usr    wm           4194368      53.89GB          117214990
>   2 unassigned    wm                 0           0                  0
>   3 unassigned    wm                 0           0                  0
>   4 unassigned    wm                 0           0                  0
>   5 unassigned    wm                 0           0                  0
>   6 unassigned    wm                 0           0                  0
>   8   reserved    wm         117214991       8.00MB          117231374

You have an s0 and s1.

> This isn't the output from when I did it, but it is exactly the same steps
> that I followed.
>
> Thanks for the info about slices, I may give that a go later on. I'm not
> keen on that because I have clear evidence (as in zpools set up this way,
> right now, working, without issue) that GPT partitions of the style shown
> above work, and I want to see why it doesn't work in my setup rather than
> simply ignoring it and moving on.

You would have to blow away the partitioning you have, and create an FDISK
partitioned disk (not EFI), and then create a p1 and p2 partition. (Don't use
the 'partition' subcommand, which confusingly creates Solaris slices.) Give
the FDISK partitions a partition type which nothing will recognise, such as
'other', so that nothing will try and interpret them as OS partitions. Then
you can use them as raw devices, and they should be portable between OSes
which can handle FDISK partitioned devices.

--
Andrew
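Roughly, the steps Andrew is describing would go something like this. An
outline only: the fdisk menu wording differs between releases, and wiping the
SSD destroys the slog partition currently on it, so remove that from the pool
first.

  # format -e c25t10d1
  format> fdisk         delete the existing EFI partition, then create
                        partition 1 (type "Other", roughly 8GB, not active)
                        and partition 2 (type "Other", the remaining space)
  format> quit

  # zpool add aggr0 log mirror c25t10d1p1 c25t9d1p1
  # zpool add aggr0 cache c25t10d1p2 c25t9d1p2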
On 2013-03-19 20:38, Cindy Swearingen wrote:
> Hi Andrew,
>
> Your original syntax was incorrect.
>
> A p* device is a larger container for the d* device or s* devices.
> In the case of a cache device, you need to specify a d* or s* device.
> That you can add p* devices to a pool at all is a bug.

I disagree; at least, I've always thought differently: the "d" device is the
whole-disk denomination, with a unique number for a particular controller
link ("c+t").

The disk has some partitioning table, MBR or GPT/EFI. In these tables,
partition "p0" stands for the table itself (i.e. to manage partitioning), and
the rest kind of "depends". In the case of MBR tables, one partition may be
named as having a Solaris (or Solaris2) type, and there it holds an SMI table
of Solaris slices, and these slices can hold legacy filesystems or components
of ZFS pools. In the case of GPT, the GPT partitions can be used directly by
ZFS. However, they are also denominated as "slices" in ZFS and the format
utility.

I believe Solaris-based OSes accessing a "p"-named partition and an "s"-named
slice of the same number on a GPT disk should lead to the same range of bytes
on disk, but I am not really certain about this.

Also, if a "whole disk" is given to ZFS (and for OSes other than the latest
Solaris 11 this means non-rpool disks), then ZFS labels the disk as GPT and
defines a partition for itself plus a small trailing partition (likely to
level out discrepancies with replacement disks that might happen to be a few
sectors too small). In this case ZFS reports that it uses "cXtYdZ" as a pool
component, since it considers itself in charge of the partitioning table and
its inner contents, and doesn't intend to share the disk with other usages
(dual-booting and other OSes' partitions, or SLOG and L2ARC parts, etc.).
This also "allows" ZFS to influence hardware-related choices, like caching
and throttling, and likely auto-expansion with changed LUN sizes by fixing up
the partition table along the way, since it assumes being 100% in charge of
the disk.

I don't think there is a "crime" in trying to use the partitions (of either
kind) as ZFS leaf vdevs; even the zpool(1M) manpage states that:

     ... The following virtual devices are supported:

     disk
         A block device, typically located under /dev/dsk. ZFS can use
         individual slices or partitions, though the recommended mode of
         operation is to use whole disks. ...

This is orthogonal to the fact that there can only be one Solaris slice
table, inside one partition, on MBR. AFAIK this is irrelevant on GPT/EFI - no
SMI slices there.

On my old home NAS with OpenSolaris I certainly did have MBR partitions on
the rpool disk, intended initially for some dual-booted OSes but repurposed
as L2ARC and ZIL devices for the storage pool on other disks, when I played
with that technology. Didn't gain much with a single spindle ;)

HTH,
//Jim Klimov
On 03/19/13 20:27, Jim Klimov wrote:
> I disagree; at least, I've always thought differently:
> the "d" device is the whole-disk denomination, with a
> unique number for a particular controller link ("c+t").
>
> The disk has some partitioning table, MBR or GPT/EFI.
> In these tables, partition "p0" stands for the table
> itself (i.e. to manage partitioning),

p0 is the whole disk regardless of any partitioning. (Hence you can use p0 to
access any type of partition table.)

> and the rest kind of "depends". In the case of MBR tables,
> one partition may be named as having a Solaris (or Solaris2)
> type, and there it holds an SMI table of Solaris slices, and
> these slices can hold legacy filesystems or components of ZFS
> pools. In the case of GPT, the GPT partitions can be used
> directly by ZFS. However, they are also denominated as
> "slices" in ZFS and the format utility.

The GPT partitioning spec requires the disk to be FDISK partitioned with just
one single FDISK partition of type EFI, so that tools which predate GPT
partitioning will still see such a GPT disk as fully assigned to FDISK
partitions, and it is therefore less likely to be accidentally blown away.

> I believe Solaris-based OSes accessing a "p"-named
> partition and an "s"-named slice of the same number
> on a GPT disk should lead to the same range of bytes
> on disk, but I am not really certain about this.

No, you'll see just p0 (whole disk) and p1 (whole disk less space for the
backwards-compatible FDISK partitioning).

> Also, if a "whole disk" is given to ZFS (and for OSes
> other than the latest Solaris 11 this means non-rpool
> disks), then ZFS labels the disk as GPT and defines a
> partition for itself plus a small trailing partition
> (likely to level out discrepancies with replacement
> disks that might happen to be a few sectors too small).
> In this case ZFS reports that it uses "cXtYdZ" as a
> pool component,

For an EFI disk, the device name without a final p* or s* component is the
whole EFI partition. (It's actually the s7 slice minor device node, but the
s7 is dropped from the device name to avoid the confusion we had with s2 on
SMI labeled disks being the whole SMI partition.)

> since it considers itself in charge of the partitioning
> table and its inner contents, and doesn't intend to share
> the disk with other usages (dual-booting and other OSes'
> partitions, or SLOG and L2ARC parts, etc.). This also
> "allows" ZFS to influence hardware-related choices, like
> caching and throttling, and likely auto-expansion with
> changed LUN sizes by fixing up the partition table along
> the way, since it assumes being 100% in charge of the disk.
>
> I don't think there is a "crime" in trying to use the
> partitions (of either kind) as ZFS leaf vdevs; even the
> zpool(1M) manpage states that:
>
>      ... The following virtual devices are supported:
>      disk
>          A block device, typically located under /dev/dsk.
>          ZFS can use individual slices or partitions, though
>          the recommended mode of operation is to use whole
>          disks. ...

Right.

> This is orthogonal to the fact that there can only be
> one Solaris slice table, inside one partition, on MBR.
> AFAIK this is irrelevant on GPT/EFI - no SMI slices there.

There's a simpler way to think of it on x86. You always have FDISK
partitioning (p1, p2, p3, p4). You can then have SMI or GPT/EFI slices (both
called s0, s1, ...) in an FDISK partition of the appropriate type.

With SMI labeling, s2 is by convention the whole Solaris FDISK partition
(although this is not enforced).
With EFI labeling, s7 is enforced as the whole EFI FDISK partition, and so
the trailing s7 is dropped off the device name for clarity.

This simplicity is brought about because the GPT spec requires that backwards
compatible FDISK partitioning is included, but with just one partition
assigned.

--
Andrew
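A quick way to see that mapping on a live x86 system is simply to list the
device nodes (c0t0d0 is a placeholder disk name here):

  ls -l /dev/dsk/c0t0d0*   # p0-p4 are the FDISK partition nodes, s0-s15 the slice nodes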
On 2013-03-19 22:07, Andrew Gabriel wrote:
> The GPT partitioning spec requires the disk to be FDISK
> partitioned with just one single FDISK partition of type EFI,
> so that tools which predate GPT partitioning will still see
> such a GPT disk as fully assigned to FDISK partitions, and
> it is therefore less likely to be accidentally blown away.

Okay, I guess I got entangled in terminology now ;)

Anyhow, your words are not all news to me, though my write-up was likely
misleading to unprepared readers... sigh... Thanks for the clarifications and
deeper details that I did not know!

So, we can concur that GPT does indeed include the fake MBR header with one
EFI partition which addresses the smaller of 2TB (the MBR limit) or the disk
size, minus a few sectors for the GPT housekeeping. Inside the EFI partition
are defined the GPT, um, partitions (represented as "s"lices in Solaris).
This is after all a GUID *Partition* Table, and that's how parted refers to
them too ;)

Notably, there are also unportable tricks to fool legacy OSes and bootloaders
into addressing the same byte ranges via both MBR entries (forged manually,
abusing the GPT/EFI spec) and proper GPT entries, as partitions in the sense
of each table.

//Jim