Lutz Schumann
2010-Jan-20 11:17 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Hello,

we tested clustering with ZFS and the setup looks like this:

- 2 head nodes (nodea, nodeb)
- head nodes contain l2arc devices (nodea_l2arc, nodeb_l2arc)
- two external jbods
- two mirror zpools (pool1, pool2)
- each mirror is a mirror of one disk from each jbod
- no ZIL (anyone know of a well-priced SAS SSD?)

We want active/active and added the l2arc devices to the pools:

- pool1 has nodea_l2arc as cache
- pool2 has nodeb_l2arc as cache

Everything is great so far.

One thing to note is that nodea_l2arc and nodeb_l2arc are named identically! (c0t2d0 on both nodes.)

What we found is that during tests, the pool just picked up the device nodeb_l2arc automatically, although it was never explicitly added to the pool pool1. We had a setup stage when pool1 was configured on nodea with nodea_l2arc and pool2 was configured on nodeb without an l2arc. Then we did a failover. Then pool1 picked up the (until then) unconfigured nodeb_l2arc.

Is this intended? Why is an L2ARC device automatically picked up if the device name is the same?

In a later stage we had both pools configured with the corresponding l2arc device (pool1 at nodea with nodea_l2arc and pool2 at nodeb with nodeb_l2arc). Then we also did a failover. The l2arc device of the pool failing over was marked as "too many corruptions" instead of "missing".

So from these tests it looks like ZFS just picks up the device with the same name and replaces the l2arc without looking at the device signatures to only consider devices that are part of a pool. We have not tested with a data disk as "c0t2d0", but if the same behaviour occurs - god save us all.

Can someone clarify the logic behind this?

Can someone also give a hint how to rename SAS disk devices in OpenSolaris? (As a workaround I would like to rename c0t2d0 on nodea (nodea_l2arc) to c0t24d0, and c0t2d0 on nodeb (nodeb_l2arc) to c0t48d0.)

P.S. Release is build 104 (NexentaCore 2).

Thanks!
Richard Elling
2010-Jan-20 22:08 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Hi Lutz,

On Jan 20, 2010, at 3:17 AM, Lutz Schumann wrote:
> Hello,
>
> we tested clustering with ZFS and the setup looks like this:
>
> - 2 head nodes (nodea, nodeb)
> - head nodes contain l2arc devices (nodea_l2arc, nodeb_l2arc)

This makes me nervous. I suspect this is not in the typical QA test plan.

> - two external jbods
> - two mirror zpools (pool1, pool2)
> - each mirror is a mirror of one disk from each jbod
> - no ZIL (anyone know of a well-priced SAS SSD?)
>
> We want active/active and added the l2arc devices to the pools:
>
> - pool1 has nodea_l2arc as cache
> - pool2 has nodeb_l2arc as cache
>
> Everything is great so far.
>
> One thing to note is that nodea_l2arc and nodeb_l2arc are named identically! (c0t2d0 on both nodes.)
>
> What we found is that during tests, the pool just picked up the device nodeb_l2arc
> automatically, although it was never explicitly added to the pool pool1.

This is strange. Each vdev is supposed to be uniquely identified by its GUID.
This is how ZFS can identify the proper configuration when two pools have
the same name. Can you check the GUIDs (using zdb) to see if there is a
collision?
 -- richard
Tomas Ă–gren
2010-Jan-20 22:20 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On 20 January, 2010 - Richard Elling sent me these 2,7K bytes:

>> What we found is that during tests, the pool just picked up the device
>> nodeb_l2arc automatically, although it was never explicitly added to
>> the pool pool1.
>
> This is strange. Each vdev is supposed to be uniquely identified by its GUID.
> This is how ZFS can identify the proper configuration when two pools have
> the same name. Can you check the GUIDs (using zdb) to see if there is a
> collision?

Reproducible:

itchy:/tmp/blah# mkfile 64m 64m disk1
itchy:/tmp/blah# zfs create -V 64m rpool/blahcache
itchy:/tmp/blah# zpool create blah /tmp/blah/disk1
itchy:/tmp/blah# zpool add blah cache /dev/zvol/dsk/rpool/blahcache
itchy:/tmp/blah# zpool status blah
  pool: blah
 state: ONLINE
 scrub: none requested
config:

        NAME                             STATE     READ WRITE CKSUM
        blah                             ONLINE       0     0     0
          /tmp/blah/disk1                ONLINE       0     0     0
        cache
          /dev/zvol/dsk/rpool/blahcache  ONLINE       0     0     0

errors: No known data errors
itchy:/tmp/blah# zpool export blah
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
--------------------------------------------
LABEL 0
--------------------------------------------
    version=15
    state=4
    guid=6931317478877305718
....
itchy:/tmp/blah# zfs destroy rpool/blahcache
itchy:/tmp/blah# zfs create -V 64m rpool/blahcache
itchy:/tmp/blah# dd if=/dev/zero of=/dev/zvol/dsk/rpool/blahcache bs=1024k count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.559299 seconds, 120 MB/s
itchy:/tmp/blah# zpool import -d /tmp/blah
  pool: blah
    id: 16691059548146709374
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        blah                             ONLINE
          /tmp/blah/disk1                ONLINE
        cache
          /dev/zvol/dsk/rpool/blahcache
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
--------------------------------------------
LABEL 0
--------------------------------------------
--------------------------------------------
LABEL 1
--------------------------------------------
--------------------------------------------
LABEL 2
--------------------------------------------
--------------------------------------------
LABEL 3
--------------------------------------------
itchy:/tmp/blah# zpool import -d /tmp/blah blah
itchy:/tmp/blah# zpool status
  pool: blah
 state: ONLINE
 scrub: none requested
config:

        NAME                             STATE     READ WRITE CKSUM
        blah                             ONLINE       0     0     0
          /tmp/blah/disk1                ONLINE       0     0     0
        cache
          /dev/zvol/dsk/rpool/blahcache  ONLINE       0     0     0

errors: No known data errors
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
--------------------------------------------
LABEL 0
--------------------------------------------
    version=15
    state=4
    guid=6931317478877305718
...

It did indeed overwrite my formerly clean blahcache. Smells like a
serious bug.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Richard Elling
2010-Jan-20 23:20 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Though the ARC case, PSARC/2007/618, is "unpublished," I gather from
googling and the source that L2ARC devices are considered auxiliary,
in the same category as spares. If so, then it is perfectly reasonable
to expect that it gets picked up regardless of the GUID. This also
implies that it is shareable between pools until assigned. Brief
testing confirms this behaviour. I learn something new every day :-)

So, I suspect Lutz sees a race when both pools are imported onto one
node. This still makes me nervous though...
 -- richard
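For illustration only, one way to see the distinction described above is to dump the vdev labels; the device paths below are assumptions, not taken from the posts. A data vdev's label carries the full pool configuration (pool name, pool GUID, vdev tree), while the label zdb prints for a cache device holds only a handful of fields (version, state, the device's own guid), with nothing tying it to a particular pool:

# hypothetical device paths -- substitute your own
zdb -l /dev/rdsk/c3t10d0s0   # data vdev: full pool configuration in the label
zdb -l /dev/rdsk/c0t2d0s0    # cache device: version/state/guid only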
Daniel Carosone
2010-Jan-21 00:17 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
> googling and the source that L2ARC devices are considered auxiliary,
> in the same category as spares. If so, then it is perfectly reasonable to
> expect that it gets picked up regardless of the GUID. This also implies
> that it is shareable between pools until assigned. Brief testing confirms
> this behaviour. I learn something new every day :-)
>
> So, I suspect Lutz sees a race when both pools are imported onto one
> node. This still makes me nervous though...

Yes. What if device reconfiguration renumbers my controllers, will
l2arc suddenly start trashing a data disk? The same problem used to
be a risk for swap, but less so now that we swap to named zvol.

There's work afoot to make l2arc persistent across reboot, which
implies some organised storage structure on the device. Fixing this
shouldn't wait for that.

-- Dan.
Richard Elling
2010-Jan-21 17:36 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:
> On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
>> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
>> googling and the source that L2ARC devices are considered auxiliary,
>> in the same category as spares. If so, then it is perfectly reasonable to
>> expect that it gets picked up regardless of the GUID. This also implies
>> that it is shareable between pools until assigned. Brief testing confirms
>> this behaviour. I learn something new every day :-)
>>
>> So, I suspect Lutz sees a race when both pools are imported onto one
>> node. This still makes me nervous though...
>
> Yes. What if device reconfiguration renumbers my controllers, will
> l2arc suddenly start trashing a data disk? The same problem used to
> be a risk for swap, but less so now that we swap to named zvol.

This will not happen unless the labels are rewritten on your data disk,
and if that occurs, all bets are off.

> There's work afoot to make l2arc persistent across reboot, which
> implies some organised storage structure on the device. Fixing this
> shouldn't wait for that.

Upon further review, the ruling on the field is confirmed ;-) The L2ARC
is shared amongst pools just like the ARC. What is important is that at
least one pool has a cache vdev. I suppose one could make the case
that a new command is needed in addition to zpool and zfs (!) to manage
such devices. But perhaps we can live with the oddity for a while?

As such, for Lutz's configuration, I am now less nervous. If I understand
correctly, you could add the cache vdev to rpool and forget about how
it works with the shared pools.
 -- richard
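As a rough sketch of that suggestion (the device name is an assumption, and later posts in this thread report that on build 104 a cache device in the root pool was not used for data-pool I/O):

zpool add rpool cache c0t2d0   # node-local SSD, name assumed
zpool status rpool             # the device should appear under a 'cache' section

The thinking expressed above is that the L2ARC is shared anyway, so the cache device can live in a pool that never moves between heads.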
Daniel Carosone
2010-Jan-21 21:13 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Thu, Jan 21, 2010 at 09:36:06AM -0800, Richard Elling wrote:
> On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:
>> On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
>>> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
>>> googling and the source that L2ARC devices are considered auxiliary,
>>> in the same category as spares. If so, then it is perfectly reasonable to
>>> expect that it gets picked up regardless of the GUID. This also implies
>>> that it is shareable between pools until assigned. Brief testing confirms
>>> this behaviour. I learn something new every day :-)
>>>
>>> So, I suspect Lutz sees a race when both pools are imported onto one
>>> node. This still makes me nervous though...
>>
>> Yes. What if device reconfiguration renumbers my controllers, will
>> l2arc suddenly start trashing a data disk? The same problem used to
>> be a risk for swap, but less so now that we swap to named zvol.
>
> This will not happen unless the labels are rewritten on your data disk,
> and if that occurs, all bets are off.

It occurred to me later yesterday, while offline, that the pool in
question might have autoreplace=on set. If that were true, it would
explain why a disk in the same controller slot was overwritten and
used.

Lutz, is the pool autoreplace property on? If so, "god help us all"
is no longer quite so necessary.

>> There's work afoot to make l2arc persistent across reboot, which
>> implies some organised storage structure on the device. Fixing this
>> shouldn't wait for that.
>
> Upon further review, the ruling on the field is confirmed ;-) The L2ARC
> is shared amongst pools just like the ARC. What is important is that at
> least one pool has a cache vdev.

Wait, huh? That's a totally separate issue from what I understood
from the discussion. What I was worried about was that disk Y, that
happened to have the same cLtMdN address as disk X on another node,
was overwritten and trashed on import to become l2arc.

Maybe I missed some other detail in the thread and reached the wrong
conclusion?

> As such, for Lutz's configuration, I am now less nervous. If I understand
> correctly, you could add the cache vdev to rpool and forget about how
> it works with the shared pools.

The fact that l2arc devices could be caching data from any pool in the
system is .. a whole different set of (mostly performance) wrinkles.

For example, if I have a pool of very slow disks (usb or remote
iscsi), and a pool of faster disks, and l2arc for the slow pool on the
same faster disks, it's pointless having the faster pool using l2arc
on the same disks or even the same type of disks. I'd need to set the
secondarycache properties of one pool according to the configuration
of another.

> I suppose one could make the case
> that a new command is needed in addition to zpool and zfs (!) to manage
> such devices. But perhaps we can live with the oddity for a while?

This part, I expect, will be resolved or clarified as part of the
l2arc persistence work, since then their attachment to specific pools
will need to be clear and explicit.

Perhaps the answer is that the cache devices become their own pool
(since they're going to need filesystem-like structured storage
anyway). The actual cache could be a zvol (or new object type) within
that pool, and then (if necessary) an association is made between
normal pools and the cache (especially if I have multiple of them).
No new top-level commands needed.

-- Dan.
Richard Elling
2010-Jan-21 23:33 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
[Richard makes a hobby of confusing Dan :-)] more below..

On Jan 21, 2010, at 1:13 PM, Daniel Carosone wrote:
> On Thu, Jan 21, 2010 at 09:36:06AM -0800, Richard Elling wrote:
>> On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:
>>> Yes. What if device reconfiguration renumbers my controllers, will
>>> l2arc suddenly start trashing a data disk? The same problem used to
>>> be a risk for swap, but less so now that we swap to named zvol.
>>
>> This will not happen unless the labels are rewritten on your data disk,
>> and if that occurs, all bets are off.
>
> It occurred to me later yesterday, while offline, that the pool in
> question might have autoreplace=on set. If that were true, it would
> explain why a disk in the same controller slot was overwritten and
> used.
>
> Lutz, is the pool autoreplace property on? If so, "god help us all"
> is no longer quite so necessary.

I think this is a different issue. But since the label in a cache device does
not associate it with a pool, it is possible that any pool which expects a
cache will find it. This seems to be as designed.

>>> There's work afoot to make l2arc persistent across reboot, which
>>> implies some organised storage structure on the device. Fixing this
>>> shouldn't wait for that.
>>
>> Upon further review, the ruling on the field is confirmed ;-) The L2ARC
>> is shared amongst pools just like the ARC. What is important is that at
>> least one pool has a cache vdev.
>
> Wait, huh? That's a totally separate issue from what I understood
> from the discussion. What I was worried about was that disk Y, that
> happened to have the same cLtMdN address as disk X on another node,
> was overwritten and trashed on import to become l2arc.
>
> Maybe I missed some other detail in the thread and reached the wrong
> conclusion?
>
>> As such, for Lutz's configuration, I am now less nervous. If I understand
>> correctly, you could add the cache vdev to rpool and forget about how
>> it works with the shared pools.
>
> The fact that l2arc devices could be caching data from any pool in the
> system is .. a whole different set of (mostly performance) wrinkles.
>
> For example, if I have a pool of very slow disks (usb or remote
> iscsi), and a pool of faster disks, and l2arc for the slow pool on the
> same faster disks, it's pointless having the faster pool using l2arc
> on the same disks or even the same type of disks. I'd need to set the
> secondarycache properties of one pool according to the configuration
> of another.

Don't use slow devices for L2ARC. Secondarycache is a dataset property,
not a pool property. You can definitely manage the primary and secondary
cache policies for each dataset.

>> I suppose one could make the case
>> that a new command is needed in addition to zpool and zfs (!) to manage
>> such devices. But perhaps we can live with the oddity for a while?
>
> This part, I expect, will be resolved or clarified as part of the
> l2arc persistence work, since then their attachment to specific pools
> will need to be clear and explicit.

Since the ARC is shared amongst all pools, it makes sense to share
L2ARC amongst all pools.

> Perhaps the answer is that the cache devices become their own pool
> (since they're going to need filesystem-like structured storage
> anyway). The actual cache could be a zvol (or new object type) within
> that pool, and then (if necessary) an association is made between
> normal pools and the cache (especially if I have multiple of them).
> No new top-level commands needed.

I propose a best practice of adding the cache device to rpool and be
happy.
 -- richard
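A hedged sketch of the per-dataset cache policy mentioned above; the dataset names are made up for illustration:

zfs set secondarycache=all slowpool/data       # eligible for L2ARC
zfs set secondarycache=none fastpool/scratch   # bypass L2ARC for this dataset
zfs get primarycache,secondarycache slowpool/data fastpool/scratch

Both primarycache and secondarycache accept all, metadata or none, so ARC and L2ARC behaviour can be tuned per dataset rather than per pool.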
Daniel Carosone
2010-Jan-22 00:32 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Thu, Jan 21, 2010 at 03:33:28PM -0800, Richard Elling wrote:
> [Richard makes a hobby of confusing Dan :-)]

Heh.

>> Lutz, is the pool autoreplace property on? If so, "god help us all"
>> is no longer quite so necessary.
>
> I think this is a different issue.

I agree. For me, it was the main issue, and I still want clarity on
it. However, at this point I'll go back to the start of the thread
and look at what was actually reported again in more detail.

> But since the label in a cache device does
> not associate it with a pool, it is possible that any pool which expects a
> cache will find it. This seems to be as designed.

Hm. My recollection was that node b's disk in that controller slot
was totally unlabelled, but perhaps I'm misremembering.. as above.

>> For example, if I have a pool of very slow disks (usb or remote
>> iscsi), and a pool of faster disks, and l2arc for the slow pool on the
>> same faster disks, it's pointless having the faster pool using l2arc
>> on the same disks or even the same type of disks. I'd need to set the
>> secondarycache properties of one pool according to the configuration
>> of another.
>
> Don't use slow devices for L2ARC.

Slow is entirely relative, as we discussed here just recently. They
just need to be faster than the pool devices I want to cache. The
wrinkle here is that it's now clear they should be faster than the
devices in all other pools as well (or I need to take special
measures).

Faster is better regardless, and suitable l2arc ssd's are "cheap
enough" now. It's mostly academic that, previously, faster/local hard
disks were "fast enough", since now you can have both.

> Secondarycache is a dataset property, not a pool property. You can
> definitely manage the primary and secondary cache policies for each
> dataset.

Yeah, properties of the root fs and of the pool are easily conflated.

>> This part, I expect, will be resolved or clarified as part of the
>> l2arc persistence work, since then their attachment to specific pools
>> will need to be clear and explicit.
>
> Since the ARC is shared amongst all pools, it makes sense to share
> L2ARC amongst all pools.

Of course it does - apart from the wrinkles we now know we need to
watch out for.

>> Perhaps the answer is that the cache devices become their own pool
>> (since they're going to need filesystem-like structured storage
>> anyway). The actual cache could be a zvol (or new object type) within
>> that pool, and then (if necessary) an association is made between
>> normal pools and the cache (especially if I have multiple of them).
>> No new top-level commands needed.
>
> I propose a best practice of adding the cache device to rpool and be
> happy.

It is *still* not that simple. Forget my slow disks caching an even
slower pool (which is still fast enough for my needs, thanks to the
cache and zil). Consider a server config thus:

- two MLC SSDs (x25-M, OCZ Vertex, whatever)
- SSDs partitioned in two, mirrored rpool & 2x l2arc
- a bunch of disks for a data pool

This is a likely/common configuration, commodity systems being limited
mostly by number of sata ports. I'd even go so far as to propose it
as another best practice, for those circumstances.

Now, why would I waste l2arc space, bandwidth, and wear cycles to
cache rpool to the same ssd's that would be read on a miss anyway?

So, there's at least one more step required for happiness:

# zfs set secondarycache=none rpool

(plus relying on property inheritance through the rest of rpool)

-- Dan.
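Putting the pieces of that proposal together, a sketch with assumed names (the data pool 'tank' and the slice layout are illustrations, not from the posts):

zpool add tank cache c0t0d0s1 c0t1d0s1   # second slice of each root SSD as L2ARC
zfs set secondarycache=none rpool        # rpool misses would hit the same SSDs anyway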
Richard Elling
2010-Jan-22 01:52 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Jan 21, 2010, at 4:32 PM, Daniel Carosone wrote:
>> I propose a best practice of adding the cache device to rpool and be
>> happy.
>
> It is *still* not that simple. Forget my slow disks caching an even
> slower pool (which is still fast enough for my needs, thanks to the
> cache and zil). Consider a server config thus:
>
> - two MLC SSDs (x25-M, OCZ Vertex, whatever)
> - SSDs partitioned in two, mirrored rpool & 2x l2arc
> - a bunch of disks for a data pool
>
> This is a likely/common configuration, commodity systems being limited
> mostly by number of sata ports. I'd even go so far as to propose it
> as another best practice, for those circumstances.
>
> Now, why would I waste l2arc space, bandwidth, and wear cycles to
> cache rpool to the same ssd's that would be read on a miss anyway?
>
> So, there's at least one more step required for happiness:
> # zfs set secondarycache=none rpool
>
> (plus relying on property inheritance through the rest of rpool)

I agree with this, except for the fact that the most common installers
(LiveCD, Nexenta, etc.) use the whole disk for rpool[1]. So the likely
and common configuration today is moving towards one whole root disk.
That could change in the future.

[1] Solaris 10? well... since installation is hard anyway, might as well do this.
 -- richard
Daniel Carosone
2010-Jan-22 03:03 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Thu, Jan 21, 2010 at 05:52:57PM -0800, Richard Elling wrote:
> I agree with this, except for the fact that the most common installers
> (LiveCD, Nexenta, etc.) use the whole disk for rpool[1].

Er, no. You certainly get the option of "whole disk" or "make
partitions", at least with the opensolaris livecd.

-- Dan.
Lutz Schumann
2010-Jan-23 17:53 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Hi,

i found some time and was able to test again:

- verify with a unique guid on the device
- verify with autoreplace = off

Indeed autoreplace was set to yes for the pools. So I disabled autoreplace:

VOL     PROPERTY     VALUE  SOURCE
nxvol2  autoreplace  off    default

Erased the labels on the cache disk and added it again to the pool. Now both cache disks have different guids:

# cache device in node1
root at nex1:/volumes# zdb -l -e /dev/rdsk/c0t2d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    state=4
    guid=15970804704220025940

# cache device in node2
root at nex2:/volumes# zdb -l -e /dev/rdsk/c0t2d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    state=4
    guid=2866316542752696853

The GUIDs are different. However, after switching the pool nxvol2 to node1 (where nxvol1 was active), the disks were picked up as cache devices:

# nxvol2 switched to this node ...
  volume: nxvol2
   state: ONLINE
   scrub: none requested
  config:

        NAME         STATE     READ WRITE CKSUM
        nxvol2       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c4t13d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c4t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c4t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t18d0  ONLINE       0     0     0
            c4t22d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t17d0  ONLINE       0     0     0
            c4t21d0  ONLINE       0     0     0
        cache
          c0t2d0     FAULTED      0     0     0  corrupted data

# nxvol1 was active here before ...
nmc at nex1:/$ show volume nxvol1 status
  volume: nxvol1
   state: ONLINE
   scrub: none requested
  config:

        NAME         STATE     READ WRITE CKSUM
        nxvol1       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t15d0  ONLINE       0     0     0
            c4t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t14d0  ONLINE       0     0     0
            c4t17d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t13d0  ONLINE       0     0     0
            c4t16d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t12d0  ONLINE       0     0     0
            c4t15d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c4t14d0  ONLINE       0     0     0
        cache
          c0t2d0     ONLINE       0     0     0

So this is true with and without autoreplace, and with different guids on the devices.
Richard Elling
2010-Jan-24 01:51 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
AIUI, this works as designed. I think the best practice will be to add
the L2ARC to syspool (nee rpool). However, for current NexentaStor
releases, you cannot add cache devices to syspool.

Earlier I mentioned that this made me nervous. I no longer hold any
reservation against it. It should work just fine as-is.
 -- richard
Lutz Schumann
2010-Jan-24 08:55 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Thanks for the feedback Richard.

Does that mean that the L2ARC can be part of ANY pool and that there is
only ONE L2ARC for all pools active on the machine?

Thesis:
- There is one L2ARC on the machine for all pools
- All active pools share the same L2ARC
- The L2ARC can be part of any pool, including the root pool (syspool)

If this is true, the solution would be one of:

a) Add the L2ARC to the syspool, or
b) Add another two (standby) L2ARC devices in the head that are used in
   case of a failover. (Thus a configuration that accepts degraded
   performance after a failover has to live with this "corrupt data"
   effect.)

True?
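For option (b), a rough and untested sketch with assumed device and pool names:

# on the surviving head, after pool1 has failed over to it:
zpool add pool1 cache c0t24d0      # attach a standby cache device local to this head
# before failing pool1 back again:
zpool remove pool1 c0t24d0         # cache devices can be removed online

Whether this avoids the "corrupted data" state seen above depends on the pick-up behaviour being discussed in this thread, so treat it as a sketch only.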
Lutz Schumann
2010-Jan-28 18:54 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Actually I tested this.

If I add an L2ARC device to the syspool, it is not used when issuing I/O
to the data pool. (Note: on the root pool it must not be a whole disk,
but only a slice of it; otherwise ZFS complains that root disks may not
contain an EFI label.)

So this does not work - unfortunately :(

Just for info,
Robert
Richard Elling
2010-Jan-28 18:57 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Jan 28, 2010, at 10:54 AM, Lutz Schumann wrote:
> Actually I tested this.
>
> If I add an L2ARC device to the syspool, it is not used when issuing I/O
> to the data pool. (Note: on the root pool it must not be a whole disk,
> but only a slice of it; otherwise ZFS complains that root disks may not
> contain an EFI label.)

In my tests it does work. Can you share your test plan?
 -- richard
Lutz Schumann
2010-Jan-28 19:35 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Yes, here it is .... (performance is vmware on a laptop, so sorry for that)

How did I test?

1) My disks:

LUN ID       Device  Type   Size     Volume   Mounted  Remov  Attach
c0t0d0       sd4     cdrom  No Media          no       yes    ata
c1t0d0       sd0     disk   8GB      syspool  no       no     mpt
c1t1d0       sd1     disk   20GB     data     no       no     mpt
c1t2d0       sd2     disk   20GB     data     no       no     mpt
c1t3d0       sd3     disk   20GB     data     no       no     mpt
c1t4d0       sd8     disk   4GB               no       no     mpt
syspo~/swap  zvol           768.0MB  syspool  no       no

2) My pools:

  volume: data
   state: ONLINE
   scrub: none requested
  config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

  volume: syspool
   state: ONLINE
   scrub: none requested
  config:

        NAME        STATE     READ WRITE CKSUM
        syspool     ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

3) Add the cache device to syspool:

zpool add -f syspool cache c1t4d0s2

root at nexenta:/volumes# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

  pool: syspool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        syspool     ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0
        cache
          c1t4d0s2  ONLINE       0     0     0

errors: No known data errors

4) Do I/O on the data volume and watch with "zpool iostat" whether the l2arc is filled:

cd /volumes/data
iozone -s 1G -i 0 -i 1    (for I/O)

Typically looks like this:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        1.47G  58.0G      0    131      0  9.47M
  raidz1    1.47G  58.0G      0    131      0  9.47M
    c1t1d0      -      -      0    100      0  8.45M
    c1t2d0      -      -      0     77      0  4.74M
    c1t3d0      -      -      0     77      0  5.48M
----------  -----  -----  -----  -----  -----  -----
syspool     1.87G  6.06G      2      0  23.8K      0
  c1t0d0s0  1.87G  6.06G      2      0  23.7K      0
cache           -      -      -      -      -      -
  c1t4d0s2  95.9M  3.89G      0      0      0   127K
----------  -----  -----  -----  -----  -----  -----

5) Do the same I/O on the syspool:

cd /volumes
iozone -s 1G -i 0 -i 1    (for I/O)

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         407K  59.5G      0      0      0      0
  raidz1     407K  59.5G      0      0      0      0
    c1t1d0      -      -      0      0      0      0
    c1t2d0      -      -      0      0      0      0
    c1t3d0      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
syspool     2.35G  5.59G      0    167  6.25K  14.2M
  c1t0d0s0  2.35G  5.59G      0    167  6.25K  14.2M
cache           -      -      -      -      -      -
  c1t4d0s2   406M  3.59G      0     80      0  9.59M
----------  -----  -----  -----  -----  -----  -----

6) You can see that the l2arc in syspool is only used when the I/O goes to the syspool itself.

Release is build 104 with some patches.
Lutz Schumann
2010-Feb-01 13:53 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
I tested some more and found that pool disks are picked up.

Head1: Cachedevice1 (c0t0d0)
Head2: Cachedevice2 (c0t0d0)
Pool:  shared, c1t<X>d<Y>

I created a pool on shared storage.
Added the cache device on Head1.
Switched the pool to Head2 (export + import).
Created a pool on Head1 containing just the cache device (c0t0d0).
Exported that pool on Head1.
Switched the shared pool back from Head2 to Head1 (export + import).
The disk c0t0d0 was picked up as a cache device ...

This practically means my exported pool was destroyed. In production
this would have been hell.

Am I missing something here?
Richard Elling
2010-Feb-01 18:35 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Feb 1, 2010, at 5:53 AM, Lutz Schumann wrote:
> I tested some more and found that pool disks are picked up.
>
> Head1: Cachedevice1 (c0t0d0)
> Head2: Cachedevice2 (c0t0d0)
> Pool:  shared, c1t<X>d<Y>
>
> I created a pool on shared storage.
> Added the cache device on Head1.
> Switched the pool to Head2 (export + import).
> Created a pool on Head1 containing just the cache device (c0t0d0).

This is not possible, unless there is a bug. You cannot create a pool
with only a cache device. I have verified this on b131:

# zpool create norealpool cache /dev/ramdisk/rc1
invalid vdev specification: at least one toplevel vdev must be specified

This is also consistent with the notion that cache devices are auxiliary
devices and do not have pool configuration information in the label.
 -- richard
Lutz Schumann
2010-Feb-01 20:22 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
>> Created a pool on Head1 containing just the cache device (c0t0d0).
>
> This is not possible, unless there is a bug. You cannot create a pool
> with only a cache device. I have verified this on b131:
>
> # zpool create norealpool cache /dev/ramdisk/rc1
> invalid vdev specification: at least one toplevel vdev must be specified
>
> This is also consistent with the notion that cache devices are auxiliary
> devices and do not have pool configuration information in the label.

Sorry for the confusion ... a little misunderstanding. I created a pool
whose only data disk is the disk formerly used as a cache device in the
pool that switched. Then I exported this pool made from just a single
disk (data disk). And switched back. The exported pool was picked up as
a cache device ... this seems really problematic.

Robert
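Spelling the reported sequence out as commands, a hedged reconstruction with assumed device names (not the exact commands from the posts):

# on head1
zpool create pool1 mirror c1t1d0 c1t2d0   # shared JBOD disks, names assumed
zpool add pool1 cache c0t0d0              # head1-local disk as L2ARC
zpool export pool1                        # then: zpool import pool1 on head2
zpool create scratch c0t0d0               # reuse the former cache disk as a data pool
zpool export scratch
# on head2: zpool export pool1
# on head1 again
zpool import pool1                        # pool1 grabs c0t0d0 as cache again,
                                          # overwriting the exported 'scratch' pool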
Daniel Carosone
2010-Feb-08 22:10 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Mon, Feb 01, 2010 at 12:22:55PM -0800, Lutz Schumann wrote:
> Sorry for the confusion ... a little misunderstanding. I created a pool
> whose only data disk is the disk formerly used as a cache device in the
> pool that switched. Then I exported this pool made from just a single
> disk (data disk). And switched back. The exported pool was picked up as
> a cache device ... this seems really problematic.

This is exactly the scenario I was concerned about earlier in the
thread. Thanks for confirming that it occurs.

Please verify that the pool had autoreplace=off (just to avoid that
distraction), and file a bug.

Cache devices should not automatically destroy disk contents based
solely on device path, especially where that device path came along
with a pool import. Cache devices need labels to confirm their
identity.

This is irrespective of whether the cache contents after the label are
persistent or volatile, ie should be fixed without waiting for the CR
about persistent l2arc.

-- Dan.
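A quick way to rule out the autoreplace question (pool name assumed for illustration):

zpool get autoreplace pool1        # should report 'off' for a clean test
zpool set autoreplace=off pool1    # if it was on, turn it off and retest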
Lutz Schumann
2010-Feb-24 12:25 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
I fully agree. This needs fixing.

I can think of so many situations where device names change in
OpenSolaris (especially with movable pools). This problem can lead to
serious data corruption.

Besides persistent L2ARC (which is much more difficult, I would say),
making the L2ARC rely on labels instead of device paths is essential.

Can someone open a CR for this?