Lutz Schumann
2010-Jan-20 11:17 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Hello,

we tested clustering with ZFS and the setup looks like this:

- 2 head nodes (nodea, nodeb)
- head nodes contain l2arc devices (nodea_l2arc, nodeb_l2arc)
- two external jbods
- two mirror zpools (pool1, pool2)
- each mirror is a mirror of one disk from each jbod
- no ZIL (anyone know of a well-priced SAS SSD?)

We want active/active and added the l2arc devices to the pools:

- pool1 has nodea_l2arc as cache
- pool2 has nodeb_l2arc as cache

Everything is great so far.

One thing to note is that nodea_l2arc and nodeb_l2arc are named identically! (c0t2d0 on both nodes.)

What we found is that during tests, the pool just picked up the device nodeb_l2arc automatically, although it was never explicitly added to the pool pool1. We had a setup stage when pool1 was configured on nodea with nodea_l2arc and pool2 was configured on nodeb without an l2arc. Then we did a failover. Then pool1 picked up the (until then) unconfigured nodeb_l2arc.

Is this intended? Why is an L2ARC device automatically picked up if the device name is the same?

In a later stage we had both pools configured with the corresponding l2arc device (pool1 at nodea with nodea_l2arc and pool2 at nodeb with nodeb_l2arc). Then we also did a failover. The l2arc device of the pool failing over was marked as "too many corruptions" instead of "missing".

So from these tests it looks like ZFS just picks up the device with the same name and replaces the l2arc without looking at the device signatures to only consider devices that are part of a pool. We have not tested with a data disk as "c0t2d0", but if the same behaviour occurs - god save us all.

Can someone clarify the logic behind this?

Can someone also give a hint how to rename SAS disk devices in OpenSolaris? (As a workaround I would like to rename c0t2d0 on nodea (nodea_l2arc) to c0t24d0, and c0t2d0 on nodeb (nodeb_l2arc) to c0t48d0.)

P.S. Release is build 104 (NexentaCore 2).

Thanks!
Richard Elling
2010-Jan-20 22:08 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Hi Lutz,

On Jan 20, 2010, at 3:17 AM, Lutz Schumann wrote:
> Hello,
>
> we tested clustering with ZFS and the setup looks like this:
>
> - 2 head nodes (nodea, nodeb)
> - head nodes contain l2arc devices (nodea_l2arc, nodeb_l2arc)

This makes me nervous. I suspect this is not in the typical QA test plan.

> - two external jbods
> - two mirror zpools (pool1, pool2)
> - each mirror is a mirror of one disk from each jbod
> - no ZIL (anyone know of a well-priced SAS SSD?)
>
> We want active/active and added the l2arc devices to the pools:
>
> - pool1 has nodea_l2arc as cache
> - pool2 has nodeb_l2arc as cache
>
> Everything is great so far.
>
> One thing to note is that nodea_l2arc and nodeb_l2arc are named identically! (c0t2d0 on both nodes.)
>
> What we found is that during tests, the pool just picked up the device nodeb_l2arc
> automatically, although it was never explicitly added to the pool pool1.

This is strange. Each vdev is supposed to be uniquely identified by its GUID.
This is how ZFS can identify the proper configuration when two pools have
the same name. Can you check the GUIDs (using zdb) to see if there is a
collision?
 -- richard
Tomas Ă–gren
2010-Jan-20 22:20 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On 20 January, 2010 - Richard Elling sent me these 2,7K bytes:

>> What we found is that during tests, the pool just picked up the device
>> nodeb_l2arc automatically, although it was never explicitly added to
>> the pool pool1.
>
> This is strange. Each vdev is supposed to be uniquely identified by its GUID.
> This is how ZFS can identify the proper configuration when two pools have
> the same name. Can you check the GUIDs (using zdb) to see if there is a
> collision?

Reproducible:

itchy:/tmp/blah# mkfile 64m 64m disk1
itchy:/tmp/blah# zfs create -V 64m rpool/blahcache
itchy:/tmp/blah# zpool create blah /tmp/blah/disk1
itchy:/tmp/blah# zpool add blah cache /dev/zvol/dsk/rpool/blahcache
itchy:/tmp/blah# zpool status blah
  pool: blah
 state: ONLINE
 scrub: none requested
config:

        NAME                             STATE     READ WRITE CKSUM
        blah                             ONLINE       0     0     0
          /tmp/blah/disk1                ONLINE       0     0     0
        cache
          /dev/zvol/dsk/rpool/blahcache  ONLINE       0     0     0

errors: No known data errors
itchy:/tmp/blah# zpool export blah
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
--------------------------------------------
LABEL 0
--------------------------------------------
    version=15
    state=4
    guid=6931317478877305718
....
itchy:/tmp/blah# zfs destroy rpool/blahcache
itchy:/tmp/blah# zfs create -V 64m rpool/blahcache
itchy:/tmp/blah# dd if=/dev/zero of=/dev/zvol/dsk/rpool/blahcache bs=1024k count=64
64+0 records in
64+0 records out
67108864 bytes (67 MB) copied, 0.559299 seconds, 120 MB/s
itchy:/tmp/blah# zpool import -d /tmp/blah
  pool: blah
    id: 16691059548146709374
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        blah                             ONLINE
          /tmp/blah/disk1                ONLINE
        cache
          /dev/zvol/dsk/rpool/blahcache
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
--------------------------------------------
LABEL 0
--------------------------------------------
--------------------------------------------
LABEL 1
--------------------------------------------
--------------------------------------------
LABEL 2
--------------------------------------------
--------------------------------------------
LABEL 3
--------------------------------------------
itchy:/tmp/blah# zpool import -d /tmp/blah blah
itchy:/tmp/blah# zpool status
  pool: blah
 state: ONLINE
 scrub: none requested
config:

        NAME                             STATE     READ WRITE CKSUM
        blah                             ONLINE       0     0     0
          /tmp/blah/disk1                ONLINE       0     0     0
        cache
          /dev/zvol/dsk/rpool/blahcache  ONLINE       0     0     0

errors: No known data errors
itchy:/tmp/blah# zdb -l /dev/zvol/dsk/rpool/blahcache
--------------------------------------------
LABEL 0
--------------------------------------------
    version=15
    state=4
    guid=6931317478877305718
...

It did indeed overwrite my formerly clean blahcache. Smells like a
serious bug.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Richard Elling
2010-Jan-20 23:20 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Though the ARC case, PSARC/2007/618, is "unpublished," I gather from
googling and the source that L2ARC devices are considered auxiliary,
in the same category as spares. If so, then it is perfectly reasonable
to expect that it gets picked up regardless of the GUID. This also
implies that it is shareable between pools until assigned. Brief
testing confirms this behaviour. I learn something new every day :-)

So, I suspect Lutz sees a race when both pools are imported onto one
node. This still makes me nervous though...
 -- richard
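For illustration only, one way to see the distinction described above is to dump the vdev labels; the device paths below are assumptions, not taken from the posts. A data vdev's label carries the full pool configuration (pool name, pool GUID, vdev tree), while the label zdb prints for a cache device holds only a handful of fields (version, state, the device's own guid), with nothing tying it to a particular pool:

# hypothetical device paths -- substitute your own
zdb -l /dev/rdsk/c3t10d0s0   # data vdev: full pool configuration in the label
zdb -l /dev/rdsk/c0t2d0s0    # cache device: version/state/guid only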
Daniel Carosone
2010-Jan-21 00:17 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
> googling and the source that L2ARC devices are considered auxiliary,
> in the same category as spares. If so, then it is perfectly reasonable to
> expect that it gets picked up regardless of the GUID. This also implies
> that it is shareable between pools until assigned. Brief testing confirms
> this behaviour. I learn something new every day :-)
>
> So, I suspect Lutz sees a race when both pools are imported onto one
> node. This still makes me nervous though...

Yes. What if device reconfiguration renumbers my controllers, will
l2arc suddenly start trashing a data disk? The same problem used to
be a risk for swap, but less so now that we swap to named zvol.

There's work afoot to make l2arc persistent across reboot, which
implies some organised storage structure on the device. Fixing this
shouldn't wait for that.

-- Dan.
Richard Elling
2010-Jan-21 17:36 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:
> On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
>> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
>> googling and the source that L2ARC devices are considered auxiliary,
>> in the same category as spares. If so, then it is perfectly reasonable to
>> expect that it gets picked up regardless of the GUID. This also implies
>> that it is shareable between pools until assigned. Brief testing confirms
>> this behaviour. I learn something new every day :-)
>>
>> So, I suspect Lutz sees a race when both pools are imported onto one
>> node. This still makes me nervous though...
>
> Yes. What if device reconfiguration renumbers my controllers, will
> l2arc suddenly start trashing a data disk? The same problem used to
> be a risk for swap, but less so now that we swap to named zvol.

This will not happen unless the labels are rewritten on your data disk,
and if that occurs, all bets are off.

> There's work afoot to make l2arc persistent across reboot, which
> implies some organised storage structure on the device. Fixing this
> shouldn't wait for that.

Upon further review, the ruling on the field is confirmed ;-) The L2ARC
is shared amongst pools just like the ARC. What is important is that at
least one pool has a cache vdev. I suppose one could make the case
that a new command is needed in addition to zpool and zfs (!) to manage
such devices. But perhaps we can live with the oddity for a while?

As such, for Lutz's configuration, I am now less nervous. If I understand
correctly, you could add the cache vdev to rpool and forget about how
it works with the shared pools.
 -- richard
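As a rough sketch of that suggestion (the device name is an assumption, and later posts in this thread report that on build 104 a cache device in the root pool was not used for data-pool I/O):

zpool add rpool cache c0t2d0   # node-local SSD, name assumed
zpool status rpool             # the device should appear under a 'cache' section

The thinking expressed above is that the L2ARC is shared anyway, so the cache device can live in a pool that never moves between heads.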
Daniel Carosone
2010-Jan-21 21:13 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Thu, Jan 21, 2010 at 09:36:06AM -0800, Richard Elling wrote:
> On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:
>> On Wed, Jan 20, 2010 at 03:20:20PM -0800, Richard Elling wrote:
>>> Though the ARC case, PSARC/2007/618 is "unpublished," I gather from
>>> googling and the source that L2ARC devices are considered auxiliary,
>>> in the same category as spares. If so, then it is perfectly reasonable to
>>> expect that it gets picked up regardless of the GUID. This also implies
>>> that it is shareable between pools until assigned. Brief testing confirms
>>> this behaviour. I learn something new every day :-)
>>>
>>> So, I suspect Lutz sees a race when both pools are imported onto one
>>> node. This still makes me nervous though...
>>
>> Yes. What if device reconfiguration renumbers my controllers, will
>> l2arc suddenly start trashing a data disk? The same problem used to
>> be a risk for swap, but less so now that we swap to named zvol.
>
> This will not happen unless the labels are rewritten on your data disk,
> and if that occurs, all bets are off.

It occurred to me later yesterday, while offline, that the pool in
question might have autoreplace=on set. If that were true, it would
explain why a disk in the same controller slot was overwritten and
used.

Lutz, is the pool autoreplace property on? If so, "god help us all"
is no longer quite so necessary.

>> There's work afoot to make l2arc persistent across reboot, which
>> implies some organised storage structure on the device. Fixing this
>> shouldn't wait for that.
>
> Upon further review, the ruling on the field is confirmed ;-) The L2ARC
> is shared amongst pools just like the ARC. What is important is that at
> least one pool has a cache vdev.

Wait, huh? That's a totally separate issue from what I understood
from the discussion. What I was worried about was that disk Y, that
happened to have the same cLtMdN address as disk X on another node,
was overwritten and trashed on import to become l2arc.

Maybe I missed some other detail in the thread and reached the wrong
conclusion?

> As such, for Lutz's configuration, I am now less nervous. If I understand
> correctly, you could add the cache vdev to rpool and forget about how
> it works with the shared pools.

The fact that l2arc devices could be caching data from any pool in the
system is .. a whole different set of (mostly performance) wrinkles.

For example, if I have a pool of very slow disks (usb or remote
iscsi), and a pool of faster disks, and l2arc for the slow pool on the
same faster disks, it's pointless having the faster pool using l2arc
on the same disks or even the same type of disks. I'd need to set the
secondarycache properties of one pool according to the configuration
of another.

> I suppose one could make the case
> that a new command is needed in addition to zpool and zfs (!) to manage
> such devices. But perhaps we can live with the oddity for a while?

This part, I expect, will be resolved or clarified as part of the
l2arc persistence work, since then their attachment to specific pools
will need to be clear and explicit.

Perhaps the answer is that the cache devices become their own pool
(since they're going to need filesystem-like structured storage
anyway). The actual cache could be a zvol (or new object type) within
that pool, and then (if necessary) an association is made between
normal pools and the cache (especially if I have multiple of them).
No new top-level commands needed.

-- Dan.
Richard Elling
2010-Jan-21 23:33 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
[Richard makes a hobby of confusing Dan :-)] more below..

On Jan 21, 2010, at 1:13 PM, Daniel Carosone wrote:
> On Thu, Jan 21, 2010 at 09:36:06AM -0800, Richard Elling wrote:
>> On Jan 20, 2010, at 4:17 PM, Daniel Carosone wrote:
>>> Yes. What if device reconfiguration renumbers my controllers, will
>>> l2arc suddenly start trashing a data disk? The same problem used to
>>> be a risk for swap, but less so now that we swap to named zvol.
>>
>> This will not happen unless the labels are rewritten on your data disk,
>> and if that occurs, all bets are off.
>
> It occurred to me later yesterday, while offline, that the pool in
> question might have autoreplace=on set. If that were true, it would
> explain why a disk in the same controller slot was overwritten and
> used.
>
> Lutz, is the pool autoreplace property on? If so, "god help us all"
> is no longer quite so necessary.

I think this is a different issue. But since the label in a cache device does
not associate it with a pool, it is possible that any pool which expects a
cache will find it. This seems to be as designed.

>>> There's work afoot to make l2arc persistent across reboot, which
>>> implies some organised storage structure on the device. Fixing this
>>> shouldn't wait for that.
>>
>> Upon further review, the ruling on the field is confirmed ;-) The L2ARC
>> is shared amongst pools just like the ARC. What is important is that at
>> least one pool has a cache vdev.
>
> Wait, huh? That's a totally separate issue from what I understood
> from the discussion. What I was worried about was that disk Y, that
> happened to have the same cLtMdN address as disk X on another node,
> was overwritten and trashed on import to become l2arc.
>
> Maybe I missed some other detail in the thread and reached the wrong
> conclusion?
>
>> As such, for Lutz's configuration, I am now less nervous. If I understand
>> correctly, you could add the cache vdev to rpool and forget about how
>> it works with the shared pools.
>
> The fact that l2arc devices could be caching data from any pool in the
> system is .. a whole different set of (mostly performance) wrinkles.
>
> For example, if I have a pool of very slow disks (usb or remote
> iscsi), and a pool of faster disks, and l2arc for the slow pool on the
> same faster disks, it's pointless having the faster pool using l2arc
> on the same disks or even the same type of disks. I'd need to set the
> secondarycache properties of one pool according to the configuration
> of another.

Don't use slow devices for L2ARC. Secondarycache is a dataset property,
not a pool property. You can definitely manage the primary and secondary
cache policies for each dataset.

>> I suppose one could make the case
>> that a new command is needed in addition to zpool and zfs (!) to manage
>> such devices. But perhaps we can live with the oddity for a while?
>
> This part, I expect, will be resolved or clarified as part of the
> l2arc persistence work, since then their attachment to specific pools
> will need to be clear and explicit.

Since the ARC is shared amongst all pools, it makes sense to share
L2ARC amongst all pools.

> Perhaps the answer is that the cache devices become their own pool
> (since they're going to need filesystem-like structured storage
> anyway). The actual cache could be a zvol (or new object type) within
> that pool, and then (if necessary) an association is made between
> normal pools and the cache (especially if I have multiple of them).
> No new top-level commands needed.

I propose a best practice of adding the cache device to rpool and be
happy.
 -- richard
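A hedged sketch of the per-dataset cache policy mentioned above; the dataset names are made up for illustration:

zfs set secondarycache=all slowpool/data       # eligible for L2ARC
zfs set secondarycache=none fastpool/scratch   # bypass L2ARC for this dataset
zfs get primarycache,secondarycache slowpool/data fastpool/scratch

Both primarycache and secondarycache accept all, metadata or none, so ARC and L2ARC behaviour can be tuned per dataset rather than per pool.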
Daniel Carosone
2010-Jan-22 00:32 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Thu, Jan 21, 2010 at 03:33:28PM -0800, Richard Elling wrote:
> [Richard makes a hobby of confusing Dan :-)]

Heh.

>> Lutz, is the pool autoreplace property on? If so, "god help us all"
>> is no longer quite so necessary.
>
> I think this is a different issue.

I agree. For me, it was the main issue, and I still want clarity on
it. However, at this point I'll go back to the start of the thread
and look at what was actually reported again in more detail.

> But since the label in a cache device does
> not associate it with a pool, it is possible that any pool which expects a
> cache will find it. This seems to be as designed.

Hm. My recollection was that node b's disk in that controller slot
was totally unlabelled, but perhaps I'm misremembering.. as above.

>> For example, if I have a pool of very slow disks (usb or remote
>> iscsi), and a pool of faster disks, and l2arc for the slow pool on the
>> same faster disks, it's pointless having the faster pool using l2arc
>> on the same disks or even the same type of disks. I'd need to set the
>> secondarycache properties of one pool according to the configuration
>> of another.
>
> Don't use slow devices for L2ARC.

Slow is entirely relative, as we discussed here just recently. They
just need to be faster than the pool devices I want to cache. The
wrinkle here is that it's now clear they should be faster than the
devices in all other pools as well (or I need to take special
measures).

Faster is better regardless, and suitable l2arc ssd's are "cheap
enough" now. It's mostly academic that, previously, faster/local hard
disks were "fast enough", since now you can have both.

> Secondarycache is a dataset property, not a pool property. You can
> definitely manage the primary and secondary cache policies for each
> dataset.

Yeah, properties of the root fs and of the pool are easily conflated.

>> This part, I expect, will be resolved or clarified as part of the
>> l2arc persistence work, since then their attachment to specific pools
>> will need to be clear and explicit.
>
> Since the ARC is shared amongst all pools, it makes sense to share
> L2ARC amongst all pools.

Of course it does - apart from the wrinkles we now know we need to
watch out for.

>> Perhaps the answer is that the cache devices become their own pool
>> (since they're going to need filesystem-like structured storage
>> anyway). The actual cache could be a zvol (or new object type) within
>> that pool, and then (if necessary) an association is made between
>> normal pools and the cache (especially if I have multiple of them).
>> No new top-level commands needed.
>
> I propose a best practice of adding the cache device to rpool and be
> happy.

It is *still* not that simple. Forget my slow disks caching an even
slower pool (which is still fast enough for my needs, thanks to the
cache and zil). Consider a server config thus:

- two MLC SSDs (x25-M, OCZ Vertex, whatever)
- SSDs partitioned in two, mirrored rpool & 2x l2arc
- a bunch of disks for a data pool

This is a likely/common configuration, commodity systems being limited
mostly by number of sata ports. I'd even go so far as to propose it
as another best practice, for those circumstances.

Now, why would I waste l2arc space, bandwidth, and wear cycles to
cache rpool to the same ssd's that would be read on a miss anyway?

So, there's at least one more step required for happiness:

# zfs set secondarycache=none rpool

(plus relying on property inheritance through the rest of rpool)

-- Dan.
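Putting the pieces of that proposal together, a sketch with assumed names (the data pool 'tank' and the slice layout are illustrations, not from the posts):

zpool add tank cache c0t0d0s1 c0t1d0s1   # second slice of each root SSD as L2ARC
zfs set secondarycache=none rpool        # rpool misses would hit the same SSDs anyway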
Richard Elling
2010-Jan-22 01:52 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Jan 21, 2010, at 4:32 PM, Daniel Carosone wrote:
>> I propose a best practice of adding the cache device to rpool and be
>> happy.
>
> It is *still* not that simple. Forget my slow disks caching an even
> slower pool (which is still fast enough for my needs, thanks to the
> cache and zil). Consider a server config thus:
>
> - two MLC SSDs (x25-M, OCZ Vertex, whatever)
> - SSDs partitioned in two, mirrored rpool & 2x l2arc
> - a bunch of disks for a data pool
>
> This is a likely/common configuration, commodity systems being limited
> mostly by number of sata ports. I'd even go so far as to propose it
> as another best practice, for those circumstances.
>
> Now, why would I waste l2arc space, bandwidth, and wear cycles to
> cache rpool to the same ssd's that would be read on a miss anyway?
>
> So, there's at least one more step required for happiness:
> # zfs set secondarycache=none rpool
>
> (plus relying on property inheritance through the rest of rpool)

I agree with this, except for the fact that the most common installers
(LiveCD, Nexenta, etc.) use the whole disk for rpool[1]. So the likely
and common configuration today is moving towards one whole root disk.
That could change in the future.

[1] Solaris 10? well... since installation is hard anyway, might as well do this.
 -- richard
Daniel Carosone
2010-Jan-22 03:03 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Thu, Jan 21, 2010 at 05:52:57PM -0800, Richard Elling wrote:
> I agree with this, except for the fact that the most common installers
> (LiveCD, Nexenta, etc.) use the whole disk for rpool[1].

Er, no. You certainly get the option of "whole disk" or "make
partitions", at least with the opensolaris livecd.

-- Dan.
Lutz Schumann
2010-Jan-23 17:53 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Hi,

i found some time and was able to test again:

- verify with a unique guid on the device
- verify with autoreplace = off

Indeed autoreplace was set to yes for the pools. So I disabled autoreplace:

VOL     PROPERTY     VALUE  SOURCE
nxvol2  autoreplace  off    default

Erased the labels on the cache disk and added it again to the pool. Now both cache disks have different guids:

# cache device in node1
root at nex1:/volumes# zdb -l -e /dev/rdsk/c0t2d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    state=4
    guid=15970804704220025940

# cache device in node2
root at nex2:/volumes# zdb -l -e /dev/rdsk/c0t2d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14
    state=4
    guid=2866316542752696853

The GUIDs are different. However, after switching the pool nxvol2 to node1 (where nxvol1 was active), the disks were picked up as cache devices:

# nxvol2 switched to this node ...
  volume: nxvol2
   state: ONLINE
   scrub: none requested
  config:

        NAME         STATE     READ WRITE CKSUM
        nxvol2       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t10d0  ONLINE       0     0     0
            c4t13d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t9d0   ONLINE       0     0     0
            c4t12d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t8d0   ONLINE       0     0     0
            c4t11d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t18d0  ONLINE       0     0     0
            c4t22d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t17d0  ONLINE       0     0     0
            c4t21d0  ONLINE       0     0     0
        cache
          c0t2d0     FAULTED      0     0     0  corrupted data

# nxvol1 was active here before ...
nmc at nex1:/$ show volume nxvol1 status
  volume: nxvol1
   state: ONLINE
   scrub: none requested
  config:

        NAME         STATE     READ WRITE CKSUM
        nxvol1       ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t15d0  ONLINE       0     0     0
            c4t18d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t14d0  ONLINE       0     0     0
            c4t17d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t13d0  ONLINE       0     0     0
            c4t16d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t12d0  ONLINE       0     0     0
            c4t15d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c3t11d0  ONLINE       0     0     0
            c4t14d0  ONLINE       0     0     0
        cache
          c0t2d0     ONLINE       0     0     0

So this is true with and without autoreplace, and with different guids on the devices.
Richard Elling
2010-Jan-24 01:51 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
AIUI, this works as designed. I think the best practice will be to add
the L2ARC to syspool (nee rpool). However, for current NexentaStor
releases, you cannot add cache devices to syspool.

Earlier I mentioned that this made me nervous. I no longer hold any
reservation against it. It should work just fine as-is.
 -- richard
Lutz Schumann
2010-Jan-24 08:55 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Thanks for the feedback Richard.

Does that mean that the L2ARC can be part of ANY pool and that there is
only ONE L2ARC for all pools active on the machine?

Thesis:
- There is one L2ARC on the machine for all pools
- All active pools share the same L2ARC
- The L2ARC can be part of any pool, including the root pool (syspool)

If this is true, the solution would be one of:

a) Add the L2ARC to the syspool, or
b) Add another two (standby) L2ARC devices in the head that are used in
   case of a failover. (Thus a configuration that accepts degraded
   performance after a failover has to live with this "corrupt data"
   effect.)

True?
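For option (b), a rough and untested sketch with assumed device and pool names:

# on the surviving head, after pool1 has failed over to it:
zpool add pool1 cache c0t24d0      # attach a standby cache device local to this head
# before failing pool1 back again:
zpool remove pool1 c0t24d0         # cache devices can be removed online

Whether this avoids the "corrupted data" state seen above depends on the pick-up behaviour being discussed in this thread, so treat it as a sketch only.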
Lutz Schumann
2010-Jan-28 18:54 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Actually I tested this.

If I add an L2ARC device to the syspool, it is not used when issuing I/O
to the data pool. (Note: on the root pool it must not be a whole disk,
but only a slice of it; otherwise ZFS complains that root disks may not
contain an EFI label.)

So this does not work - unfortunately :(

Just for info,
Robert
Richard Elling
2010-Jan-28 18:57 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Jan 28, 2010, at 10:54 AM, Lutz Schumann wrote:
> Actually I tested this.
>
> If I add an L2ARC device to the syspool, it is not used when issuing I/O
> to the data pool. (Note: on the root pool it must not be a whole disk,
> but only a slice of it; otherwise ZFS complains that root disks may not
> contain an EFI label.)

In my tests it does work. Can you share your test plan?
 -- richard
Lutz Schumann
2010-Jan-28 19:35 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
Yes, here it is .... (performance is vmware on a laptop, so sorry for that)

How did I test?

1) My disks:

LUN ID       Device  Type   Size     Volume   Mounted  Remov  Attach
c0t0d0       sd4     cdrom  No Media          no       yes    ata
c1t0d0       sd0     disk   8GB      syspool  no       no     mpt
c1t1d0       sd1     disk   20GB     data     no       no     mpt
c1t2d0       sd2     disk   20GB     data     no       no     mpt
c1t3d0       sd3     disk   20GB     data     no       no     mpt
c1t4d0       sd8     disk   4GB               no       no     mpt
syspo~/swap  zvol           768.0MB  syspool  no       no

2) My pools:

  volume: data
   state: ONLINE
   scrub: none requested
  config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

  volume: syspool
   state: ONLINE
   scrub: none requested
  config:

        NAME        STATE     READ WRITE CKSUM
        syspool     ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0

errors: No known data errors

3) Add the cache device to syspool:

zpool add -f syspool cache c1t4d0s2

root at nexenta:/volumes# zpool status
  pool: data
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0

errors: No known data errors

  pool: syspool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        syspool     ONLINE       0     0     0
          c1t0d0s0  ONLINE       0     0     0
        cache
          c1t4d0s2  ONLINE       0     0     0

errors: No known data errors

4) Do I/O on the data volume and watch with "zpool iostat" whether the l2arc is filled:

cd /volumes/data
iozone -s 1G -i 0 -i 1    (for I/O)

Typically looks like this:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        1.47G  58.0G      0    131      0  9.47M
  raidz1    1.47G  58.0G      0    131      0  9.47M
    c1t1d0      -      -      0    100      0  8.45M
    c1t2d0      -      -      0     77      0  4.74M
    c1t3d0      -      -      0     77      0  5.48M
----------  -----  -----  -----  -----  -----  -----
syspool     1.87G  6.06G      2      0  23.8K      0
  c1t0d0s0  1.87G  6.06G      2      0  23.7K      0
cache           -      -      -      -      -      -
  c1t4d0s2  95.9M  3.89G      0      0      0   127K
----------  -----  -----  -----  -----  -----  -----

5) Do the same I/O on the syspool:

cd /volumes
iozone -s 1G -i 0 -i 1    (for I/O)

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         407K  59.5G      0      0      0      0
  raidz1     407K  59.5G      0      0      0      0
    c1t1d0      -      -      0      0      0      0
    c1t2d0      -      -      0      0      0      0
    c1t3d0      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
syspool     2.35G  5.59G      0    167  6.25K  14.2M
  c1t0d0s0  2.35G  5.59G      0    167  6.25K  14.2M
cache           -      -      -      -      -      -
  c1t4d0s2   406M  3.59G      0     80      0  9.59M
----------  -----  -----  -----  -----  -----  -----

6) You can see that the l2arc in syspool is only used when the I/O goes to the syspool itself.

Release is build 104 with some patches.
Lutz Schumann
2010-Feb-01 13:53 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
I tested some more and found that pool disks are picked up.

Head1: Cachedevice1 (c0t0d0)
Head2: Cachedevice2 (c0t0d0)
Pool:  shared, c1t<X>d<Y>

I created a pool on shared storage.
Added the cache device on Head1.
Switched the pool to Head2 (export + import).
Created a pool on Head1 containing just the cache device (c0t0d0).
Exported that pool on Head1.
Switched the shared pool back from Head2 to Head1 (export + import).
The disk c0t0d0 was picked up as a cache device ...

This practically means my exported pool was destroyed. In production
this would have been hell.

Am I missing something here?
Richard Elling
2010-Feb-01 18:35 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Feb 1, 2010, at 5:53 AM, Lutz Schumann wrote:
> I tested some more and found that pool disks are picked up.
>
> Head1: Cachedevice1 (c0t0d0)
> Head2: Cachedevice2 (c0t0d0)
> Pool:  shared, c1t<X>d<Y>
>
> I created a pool on shared storage.
> Added the cache device on Head1.
> Switched the pool to Head2 (export + import).
> Created a pool on Head1 containing just the cache device (c0t0d0).

This is not possible, unless there is a bug. You cannot create a pool
with only a cache device. I have verified this on b131:

# zpool create norealpool cache /dev/ramdisk/rc1
invalid vdev specification: at least one toplevel vdev must be specified

This is also consistent with the notion that cache devices are auxiliary
devices and do not have pool configuration information in the label.
 -- richard
Lutz Schumann
2010-Feb-01 20:22 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
>> Created a pool on Head1 containing just the cache device (c0t0d0).
>
> This is not possible, unless there is a bug. You cannot create a pool
> with only a cache device. I have verified this on b131:
>
> # zpool create norealpool cache /dev/ramdisk/rc1
> invalid vdev specification: at least one toplevel vdev must be specified
>
> This is also consistent with the notion that cache devices are auxiliary
> devices and do not have pool configuration information in the label.

Sorry for the confusion ... a little misunderstanding. I created a pool
whose only data disk is the disk formerly used as a cache device in the
pool that switched. Then I exported this pool made from just a single
disk (data disk). And switched back. The exported pool was picked up as
a cache device ... this seems really problematic.

Robert
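Spelling the reported sequence out as commands, a hedged reconstruction with assumed device names (not the exact commands from the posts):

# on head1
zpool create pool1 mirror c1t1d0 c1t2d0   # shared JBOD disks, names assumed
zpool add pool1 cache c0t0d0              # head1-local disk as L2ARC
zpool export pool1                        # then: zpool import pool1 on head2
zpool create scratch c0t0d0               # reuse the former cache disk as a data pool
zpool export scratch
# on head2: zpool export pool1
# on head1 again
zpool import pool1                        # pool1 grabs c0t0d0 as cache again,
                                          # overwriting the exported 'scratch' pool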
Daniel Carosone
2010-Feb-08 22:10 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
On Mon, Feb 01, 2010 at 12:22:55PM -0800, Lutz Schumann wrote:
> Sorry for the confusion ... a little misunderstanding. I created a pool
> whose only data disk is the disk formerly used as a cache device in the
> pool that switched. Then I exported this pool made from just a single
> disk (data disk). And switched back. The exported pool was picked up as
> a cache device ... this seems really problematic.

This is exactly the scenario I was concerned about earlier in the
thread. Thanks for confirming that it occurs.

Please verify that the pool had autoreplace=off (just to avoid that
distraction), and file a bug.

Cache devices should not automatically destroy disk contents based
solely on device path, especially where that device path came along
with a pool import. Cache devices need labels to confirm their
identity.

This is irrespective of whether the cache contents after the label are
persistent or volatile, ie should be fixed without waiting for the CR
about persistent l2arc.

-- Dan.
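A quick way to rule out the autoreplace question (pool name assumed for illustration):

zpool get autoreplace pool1        # should report 'off' for a clean test
zpool set autoreplace=off pool1    # if it was on, turn it off and retest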
Lutz Schumann
2010-Feb-24 12:25 UTC
[zfs-discuss] L2ARC in Cluster is picked up although not part of the pool
I fully agree. This needs fixing.

I can think of so many situations where device names change in
OpenSolaris (especially with movable pools). This problem can lead to
serious data corruption.

Besides persistent L2ARC (which is much more difficult, I would say),
making the L2ARC rely on labels instead of device paths is essential.

Can someone open a CR for this?