Hi,

I have a question I cannot seem to find an answer to.

I know I can set up a stripe of L2ARC SSDs with, say, 4 SSDs.

I know if I set up the ZIL on an SSD and the SSD goes bad, the ZIL will be relocated back to the pool. I'd probably have it mirrored anyway, just in case. However, you cannot mirror the L2ARC, so...

What I want to know is: what happens if one of those SSDs goes bad? What happens to the L2ARC? Is it just taken offline, or will it continue to perform even with one drive missing?

Sorry if these questions have been asked before, but I cannot seem to find an answer.

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034
On 05 May, 2010 - Michael Sullivan sent me these 0,9K bytes:

> I know if I set up the ZIL on an SSD and the SSD goes bad, the ZIL will be relocated back to the pool. I'd probably have it mirrored anyway, just in case. However, you cannot mirror the L2ARC, so...

Given a new enough OpenSolaris build, yes. Otherwise, your pool is screwed iirc.

> What I want to know is: what happens if one of those SSDs goes bad? What happens to the L2ARC? Is it just taken offline, or will it continue to perform even with one drive missing?

L2ARC is a pure cache thing: if it gives bad data (checksum error), it will be ignored; if you yank it, it will be ignored. It's very safe to have "crap hardware" there (as long as it doesn't start messing up some bus or similar). Cache devices can be added/removed at any time as well.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On Tue, May 4, 2010 at 12:16 PM, Michael Sullivan <michael.p.sullivan at mac.com> wrote:

> What I want to know is: what happens if one of those SSDs goes bad? What happens to the L2ARC? Is it just taken offline, or will it continue to perform even with one drive missing?

Data in the L2ARC is checksummed. If a checksum fails, or the device disappears, data is read from the pool.

The L2ARC is essentially a throw-away cache for reads. If it's there, reads can be faster as data is not pulled from disk. If it's not there, data just gets pulled from disk as per normal. There's nothing really special about the L2ARC devices.

--
Freddie Cash
fjwcash at gmail.com
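A minimal sketch of the administration side of this, assuming a pool named "tank" and hypothetical device names; cache devices are simply listed after the pool, each standing on its own:

  # add four SSDs as independent cache (L2ARC) devices
  zpool add tank cache c2t0d0 c2t1d0 c2t2d0 c2t3d0

  # each cache device shows up individually under the "cache" section
  zpool status tank

  # a cache device (working or failed) can be removed at any time
  zpool remove tank c2t3d0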
The L2ARC will continue to function.

-marc

On 5/4/10, Michael Sullivan <michael.p.sullivan at mac.com> wrote:
> What I want to know is: what happens if one of those SSDs goes bad? What happens to the L2ARC? Is it just taken offline, or will it continue to perform even with one drive missing?

--
Sent from my mobile device
Ok, thanks. So, if I understand correctly, it will just remove the device from the vdev and continue to use the good ones in the stripe.

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 5 May 2010, at 04:34 , Marc Nicholas wrote:
> The L2ARC will continue to function.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Michael Sullivan
>
> I have a question I cannot seem to find an answer to.

Google for the ZFS Best Practices Guide (on solarisinternals). I know this answer is there.

> I know if I set up the ZIL on an SSD and the SSD goes bad, the ZIL will be relocated back to the pool. I'd probably have it mirrored anyway, just in case. However, you cannot mirror the L2ARC, so...

Careful. The "log device removal" feature exists, and is present in the developer builds of OpenSolaris today. However, it's not included in OpenSolaris 2009.06, and it's not included in the latest and greatest Solaris 10 yet. Which means, right now, if you lose an unmirrored ZIL (log) device, your whole pool is lost, unless you're running a developer build of OpenSolaris.

> What I want to know is: what happens if one of those SSDs goes bad? What happens to the L2ARC? Is it just taken offline, or will it continue to perform even with one drive missing?

In the L2ARC (cache) there is no ability to mirror, because cache device removal has always been supported. You can't mirror a cache device, because you don't need to.

If one of the cache devices fails, no harm is done. That device goes offline. The rest stay online.

> Sorry if these questions have been asked before, but I cannot seem to find an answer.

Since you said this twice, I'll answer it twice. ;-) I think the best advice regarding cache/log device mirroring is in the ZFS Best Practices Guide.
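To make the contrast concrete, a sketch with hypothetical device names: log (slog) devices accept the mirror keyword, while cache devices are always listed as independent, unmirrorable entries.

  # mirrored log: strongly recommended before pool version 19
  zpool add tank log mirror c3t0d0 c3t1d0

  # cache devices cannot be mirrored; several can simply be listed side by side
  zpool add tank cache c4t0d0 c4t1d0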
Hi Ed,

Thanks for your answers. They seem to make sense, sort of.

On 6 May 2010, at 12:21 , Edward Ned Harvey wrote:

> Google for the ZFS Best Practices Guide (on solarisinternals). I know this answer is there.

My Google is very strong and I have the Best Practices Guide committed to bookmark, as well as most of it to memory. While it explains how to implement these, there is no information regarding failure of a device in a striped L2ARC set of SSDs. I have been hard pressed to find this information anywhere, short of testing it myself, but I don't have the necessary hardware in a lab to test correctly. If someone has pointers to references, could you please provide them chapter and verse, rather than the advice to "go read the manual."

> Careful. The "log device removal" feature exists, and is present in the developer builds of OpenSolaris today. However, it's not included in OpenSolaris 2009.06, and it's not included in the latest and greatest Solaris 10 yet. Which means, right now, if you lose an unmirrored ZIL (log) device, your whole pool is lost, unless you're running a developer build of OpenSolaris.

I'm running 2009.11, which is the latest OpenSolaris. I should have made that clear, and that I don't intend this to be on a Solaris 10 system, and am waiting for the next production build anyway. As you say, it does not exist in 2009.06, but that is not the latest production OpenSolaris, which is 2009.11, and I'd be more interested in its behavior than an older release.

I am also well aware that losing a ZIL device will cause loss of the entire pool. Which is why I would never have a ZIL device unless it was mirrored and on different controllers.

From the information I've been reading about the loss of a ZIL device, it will be relocated to the storage pool it is assigned to. I'm not sure which version this is in, but if someone could provide the release number it is included in (and actually works), that would be nice. Also, will this functionality be included in the mythical 2010.03 release?

Also, I'd be interested to know what features along these lines will be available in 2010.03, if it ever sees the light of day.

> In the L2ARC (cache) there is no ability to mirror, because cache device removal has always been supported. You can't mirror a cache device, because you don't need to.
>
> If one of the cache devices fails, no harm is done. That device goes offline. The rest stay online.

So what you are saying is that if a single device fails in a striped L2ARC vdev, then the entire vdev is taken offline and the fallback is to simply use the regular ARC and fetch from the pool whenever there is a cache miss. Or, does what you are saying here mean that if I have 4 SSDs in a stripe for my L2ARC, and one device fails, the L2ARC will be reconfigured dynamically using the remaining SSDs?

It would be good to get an answer to this from someone who has actually tested this, or who is more intimately familiar with the ZFS code, rather than all the speculation I've been getting so far.

> Since you said this twice, I'll answer it twice. ;-) I think the best advice regarding cache/log device mirroring is in the ZFS Best Practices Guide.

Been there, read that, many, many times. It's an invaluable reference, I agree.

Thanks,

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034
> From: Michael Sullivan [mailto:michael.p.sullivan at mac.com]
>
> My Google is very strong and I have the Best Practices Guide committed to bookmark, as well as most of it to memory.
>
> While it explains how to implement these, there is no information regarding failure of a device in a striped L2ARC set of SSDs.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Cache_Devices

"It is not possible to mirror or use raidz on cache devices, nor is it necessary. If a cache device fails, the data will simply be read from the main pool storage devices instead."

I guess I didn't write this part, but: if you have multiple cache devices, they are all independent from each other. Failure of one does not negate the functionality of the others.

> I'm running 2009.11, which is the latest OpenSolaris.

Quoi?? 2009.06 is the latest available from opensolaris.com and opensolaris.org. If you want something newer, AFAIK, you have to go to a developer build, such as osol-dev-134. Sure you didn't accidentally get 2008.11?

> I am also well aware that losing a ZIL device will cause loss of the entire pool. Which is why I would never have a ZIL device unless it was mirrored and on different controllers.

Um ... the log device is not special. If you lose *any* unmirrored device, you lose the pool. Except for cache devices, or log devices on zpool >= 19.

> From the information I've been reading about the loss of a ZIL device, it will be relocated to the storage pool it is assigned to. I'm not sure which version this is in, but if someone could provide the release number it is included in (and actually works), that would be nice.

What the heck? Didn't I just answer that question? I know I said this is answered in the ZFS Best Practices Guide.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

"Prior to pool version 19, if you have an unmirrored log device that fails, your whole pool is permanently lost. Prior to pool version 19, mirroring the log device is highly recommended. In pool version 19 or greater, if an unmirrored log device fails during operation, the system reverts to the default behavior, using blocks from the main storage pool for the ZIL, just as if the log device had been gracefully removed via the 'zpool remove' command."

> Also, will this functionality be included in the mythical 2010.03 release?

Zpool 19 was released in build 125, Oct 16, 2009. You can rest assured it will be included in 2010.03, or 04, or whenever that thing comes out.

> So what you are saying is that if a single device fails in a striped L2ARC vdev, then the entire vdev is taken offline and the fallback is to simply use the regular ARC and fetch from the pool whenever there is a cache miss.

It sounds like you're only going to believe it if you test it. Go for it. That's what I did before I wrote that section of the ZFS Best Practices Guide.

In ZFS, there is no such thing as striping, although the term is commonly used, because adding multiple devices creates all the benefit of striping plus all the benefit of concatenation. Colloquially, people think concatenation is weird or unused or something, so people just naturally gravitated to calling it a stripe in ZFS too, although that's not technically correct according to the traditional RAID definition. But nobody bothered to create a new term, "stripecat" or whatever, for ZFS.

> Or, does what you are saying here mean that if I have 4 SSDs in a stripe for my L2ARC, and one device fails, the L2ARC will be reconfigured dynamically using the remaining SSDs?

No reconfiguration necessary, because it's not a stripe. It's 4 separate devices, which ZFS can use simultaneously if it wants to.
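For anyone checking their own system against the pool version 19 cutoff mentioned above, the usual commands (the pool name "tank" is a placeholder):

  # list the pool versions this build supports, with a note on what each added
  zpool upgrade -v

  # show the version a particular pool is currently running
  zpool get version tank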
On 6 May 2010, at 13:18 , Edward Ned Harvey wrote:

> "It is not possible to mirror or use raidz on cache devices, nor is it necessary. If a cache device fails, the data will simply be read from the main pool storage devices instead."

I understand this.

> I guess I didn't write this part, but: if you have multiple cache devices, they are all independent from each other. Failure of one does not negate the functionality of the others.

Ok, this is what I wanted to know: the L2ARC devices assigned to the pool are not striped but are independent. Loss of one drive will just cause a cache miss and force ZFS to go out to the pool for its objects.

But then, I'm not talking about using RAIDZ on a cache device. I'm talking about a striped device, which would be RAID-0. If the SSDs are all assigned to L2ARC, then they are not striped in any fashion (RAID-0), but are completely independent, and the L2ARC will continue to operate, just missing a single SSD.

> Quoi?? 2009.06 is the latest available from opensolaris.com and opensolaris.org. Sure you didn't accidentally get 2008.11?

My mistake: snv_111b, which is 2009.06. I knew it went up to 11 somewhere.

> Um ... the log device is not special. If you lose *any* unmirrored device, you lose the pool. Except for cache devices, or log devices on zpool >= 19.

Well, if I've got a separate ZIL which is mirrored for performance, and mirrored because I think my data is valuable and important, I will have something more than RAID-0 on my main storage pool too. More than likely RAIDZ2, since I plan on using the L2ARC to help improve performance along with separate mirrored SSD ZIL devices.

> Prior to pool version 19, if you have an unmirrored log device that fails, your whole pool is permanently lost. ... In pool version 19 or greater, if an unmirrored log device fails during operation, the system reverts to the default behavior, using blocks from the main storage pool for the ZIL, just as if the log device had been gracefully removed via the "zpool remove" command.

No need to get defensive here; all I'm looking for is the zpool version number which supports it and the version of OpenSolaris which supports that zpool version.

I think that if you are building for performance, it would be almost intuitive to have a mirrored ZIL in the event of failure, and perhaps even a hot spare available as well. I don't like the idea of my ZIL being transferred back to the pool, but having it transferred back is better than the alternative, which would be data loss or corruption.

> Zpool 19 was released in build 125, Oct 16, 2009. You can rest assured it will be included in 2010.03, or 04, or whenever that thing comes out.

Thanks, build 125.

> In ZFS, there is no such thing as striping, although the term is commonly used ... But nobody bothered to create a new term, "stripecat" or whatever, for ZFS.

Ummm, yes you can create concatenated devices with ZFS. I have done this, and have even created mirrors of concatenated vdevs in both the RAID-0+1 and RAID-1+0 configurations. The only RAID level not supported by ZFS is RAID-4.

> No reconfiguration necessary, because it's not a stripe. It's 4 separate devices, which ZFS can use simultaneously if it wants to.

It's not a RAID-0 concatenated device either, then? I mean, that would be the most efficient way to balance the load across the four devices simultaneously, wouldn't it? Just want to make sure. I'm at a bit of a loss as to how the internals of the L2ARC work in the event of a single drive failure.

My assumption was that if you used 4 SSDs as L2ARC, ZFS would be smart enough to spread the load and arrange the SSDs into a concatenated stripe. I'm not entirely clear on whether it actually does this or not, but from a performance standpoint it would make sense. Additionally, from the VM model that ZFS is derived from, I would imagine these disks are arranged in a concat fashion.

I understand that since the L2ARC is read only, a failure of a single device would result in a failure to validate the checksum, or even to retrieve the block. The fall-back in this case would be like any other cache miss: go back out to the pool for the object.

What I'd also like to know is a bit more about how the L2ARC is organized and used, regardless of whether the devices are concatenated or independent (though I'd like verification of this, just for my own trivial pursuit), and how ZFS goes about reconfiguring or removing the failed device from the L2ARC.

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034
On Wed, 5 May 2010, Edward Ned Harvey wrote:

> In the L2ARC (cache) there is no ability to mirror, because cache device removal has always been supported. You can't mirror a cache device, because you don't need to.

How do you know that I don't need it? The ability seems useful to me.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 06 May, 2010 - Bob Friesenhahn sent me these 0,6K bytes:

> How do you know that I don't need it? The ability seems useful to me.

The gain is quite minimal. If the first device fails (which doesn't happen too often, I hope), then the data will be read from the normal pool once and then stored in the ARC/L2ARC again. It just behaves like a cache miss for that specific block. If this happens often enough to become a performance problem, then you should throw away that L2ARC device because it's broken beyond usability.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
On 06/05/2010 15:31, Tomas Ögren wrote:

> The gain is quite minimal. If the first device fails (which doesn't happen too often, I hope), then the data will be read from the normal pool once and then stored in the ARC/L2ARC again. It just behaves like a cache miss for that specific block. If this happens often enough to become a performance problem, then you should throw away that L2ARC device because it's broken beyond usability.

Well, if an L2ARC device fails there might be an unacceptable drop in delivered performance. If it were mirrored, the drop usually would be much smaller, or there could be no drop at all if a mirror had an option to read only from one side.

Being able to mirror the L2ARC might especially be useful once a persistent L2ARC is implemented, as after a node restart or a resource failover in a cluster the L2ARC will be kept warm. Then the only thing which might affect L2 performance considerably would be an L2ARC device failure...

--
Robert Milkowski
http://milek.blogspot.com
On Wed, May 5, 2010 at 8:47 PM, Michael Sullivan <michael.p.sullivan at mac.com> wrote:

> While it explains how to implement these, there is no information regarding failure of a device in a striped L2ARC set of SSDs. I have been hard pressed to find this information anywhere, short of testing it myself, but I don't have the necessary hardware in a lab to test correctly. If someone has pointers to references, could you please provide them chapter and verse, rather than the advice to "go read the manual."

Yes, but the answer is in the man page, so reading it is a good idea: "If a read error is encountered on a cache device, that read I/O is reissued to the original storage pool device, which might be part of a mirrored or raidz configuration."

> I'm running 2009.11, which is the latest OpenSolaris. I should have made that clear, and that I don't intend this to be on a Solaris 10 system, and am waiting for the next production build anyway.

The "latest" is b134, which contains many, many fixes over 2009.11, though it's a dev release.

> From the information I've been reading about the loss of a ZIL device, it will be relocated to the storage pool it is assigned to. I'm not sure which version this is in... Also, will this functionality be included in the mythical 2010.03 release?

It went in somewhere around b118, I think, so it will be in the next scheduled release.

> Also, I'd be interested to know what features along these lines will be available in 2010.03, if it ever sees the light of day.

Look at the latest dev release. b134 was originally slated to be 2010.03, so the feature set of the final release should be very close.

> So what you are saying is that if a single device fails in a striped L2ARC vdev, then the entire vdev is taken offline and the fallback is to simply use the regular ARC and fetch from the pool whenever there is a cache miss.

The strict interpretation of the documentation is that the read is re-issued. My understanding is that the block that failed to be read would then be read from the original pool.

> Or, does what you are saying here mean that if I have 4 SSDs in a stripe for my L2ARC, and one device fails, the L2ARC will be reconfigured dynamically using the remaining SSDs?

Auto-healing in zfs would resilver the block that failed to be read, either onto the same device or onto another cache device in the pool, exactly as if a read had failed on a normal pool device. It wouldn't reconfigure the cache devices, but each failed read would cause the blocks to be reallocated to a functioning device, which has the same effect in the end.

-B
--
Brandon High : bhigh at freaks.com
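One rough way to see the "independent devices" behavior for yourself, assuming a pool named "tank" with several cache SSDs attached:

  # per-device statistics: each cache SSD is listed on its own line,
  # not hidden behind a single striped cache entry
  zpool iostat -v tank 5

  # after a cache SSD fails or is pulled, the pool itself stays ONLINE;
  # only that cache entry changes state (the exact label varies by build)
  zpool status -v tank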
Everyone,

Thanks for the help. I really appreciate it.

Well, I actually walked through the source code with an associate today and we found out how things work by looking at the code.

It appears that the L2ARC is just assigned in round-robin fashion. If a device goes offline, it moves on to the next one and marks the failed one as offline. The failure to retrieve the requested object is treated like a cache miss and everything goes along its merry way, as far as we can tell.

I would have hoped it to be different in some way. If the L2ARC were striped for performance reasons, that would be really cool, using that device as an extension of the VM model it is modeled after; that would mean using the L2ARC as an extension of the virtual address space and striping it to make it more efficient. Way cool. If it took out the bad device and reconfigured the stripe, that would be even cooler. Replacing it with a hot spare, more cool still. However, it appears from the source code that the L2ARC is just a (sort of) jumbled collection of ZFS objects. Yes, it gives you better performance if you have it, but it doesn't really use it in a way you might expect something as cool as ZFS would.

I understand why it is read only, and why it invalidates its cache when a write occurs; that is to be expected for any object written.

If an object is not there because of a failure, or because it has been removed from the cache, it is treated as a cache miss, all well and good: go fetch from the pool.

I also understand why the ZIL is important and that it should be mirrored if it is to be on a separate device. Though I'm wondering how it is handled internally when there is a failure of one of its default devices; but then again, it's on a regular pool and should be redundant enough, with only some degradation in speed.

Breaking these devices out from their default locations is great for performance, and I understand that. I just wish the knowledge of how they work and their internal mechanisms were not so much of a black box. Maybe that is due to the speed at which ZFS is progressing and the features it adds with each subsequent release.

Overall, I am very impressed with ZFS, its flexibility and, even more so, its breaking all the rules about how storage should be managed; I really like it. I have yet to see anything come close to its approach to disk data management. Let's just hope it keeps moving forward; it is truly a unique way to view disk storage.

Anyway, sorry for the ramble, but to everyone, thanks again for the answers.

Mike

---
Michael Sullivan
michael.p.sullivan at me.com
http://www.kamiogi.net/
Japan Mobile: +81-80-3202-2599
US Phone: +1-561-283-2034

On 7 May 2010, at 00:00 , Robert Milkowski wrote:
> Well, if an L2ARC device fails there might be an unacceptable drop in delivered performance. If it were mirrored, the drop usually would be much smaller, or there could be no drop at all if a mirror had an option to read only from one side.
>
> Being able to mirror the L2ARC might especially be useful once a persistent L2ARC is implemented, as after a node restart or a resource failover in a cluster the L2ARC will be kept warm. Then the only thing which might affect L2 performance considerably would be an L2ARC device failure...
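For those who would rather watch the behavior than read arc.c, a rough way to observe the round-robin/cache-miss handling while pulling a cache SSD (the statistic names are from memory of the then-current builds, so treat them as approximate):

  # L2ARC counters live in the ARC kstats; l2_hits, l2_misses and l2_size
  # should keep moving on the surviving devices after one SSD disappears
  kstat -p zfs:0:arcstats | grep l2_

  # per-device traffic, to confirm the remaining cache SSDs still serve reads
  zpool iostat -v tank 5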
Hi Michael,

What makes you think striping the SSDs would be faster than round-robin?

-marc

On Thu, May 6, 2010 at 1:09 PM, Michael Sullivan <michael.p.sullivan at mac.com> wrote:
> It appears that the L2ARC is just assigned in round-robin fashion. If a device goes offline, it moves on to the next one and marks the failed one as offline. The failure to retrieve the requested object is treated like a cache miss and everything goes along its merry way, as far as we can tell.
>
> I would have hoped it to be different in some way. If the L2ARC were striped for performance reasons, that would be really cool, using that device as an extension of the VM model it is modeled after.
Hi Marc,

Well, if you are striping over multiple devices, the I/O should be spread over the devices and you should be reading from them all simultaneously rather than just accessing a single device. Traditional striping would cut a transfer to roughly 1/n of the single-disk time, where n is the number of disks the stripe is spread across.

The round-robin access I am referring to is the way the L2ARC vdevs appear to be accessed. Any given object will be taken from a single device rather than from several devices simultaneously, which would increase the I/O throughput. So, theoretically, a stripe spread over 4 disks would give 4 times the performance of reading from a single disk. This also assumes the controller can handle multiple I/Os, or that you are striped over different disk controllers for each disk in the stripe.

SSDs are fast, but if I can read a block from more devices simultaneously, it will cut the latency of the overall read.

On 7 May 2010, at 02:57 , Marc Nicholas wrote:
> What makes you think striping the SSDs would be faster than round-robin?
On Thu, May 6, 2010 at 1:18 AM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:

> Prior to pool version 19, if you have an unmirrored log device that fails, your whole pool is permanently lost. Prior to pool version 19, mirroring the log device is highly recommended. In pool version 19 or greater, if an unmirrored log device fails during operation, the system reverts to the default behavior, using blocks from the main storage pool for the ZIL, just as if the log device had been gracefully removed via the "zpool remove" command.

This week I had a bad experience replacing an SSD that was in a hardware RAID-1 volume. While rebuilding, the source SSD failed and the volume was brought off-line by the controller. The server kept working just fine but seemed to have switched from the 30-second interval to all writes going directly to the disks. I could confirm this with iostat.

We've had some compatibility issues between LSI MegaRAID cards and a few MTRON SSDs, and I didn't believe the SSD had really died. So I brought it off-line and back on-line and everything started to work. ZFS showed the log device c3t1d0 as removed. After the RAID-1 volume was back, I replaced that device with itself and a resilver process started. I don't know what it was resilvering against, but it took 2h10min. I should probably have tried a zpool offline/online too.

So I think if a log device fails AND you have to import your pool later (server rebooted, etc.), then you have lost your pool (prior to version 19). Right?

This happened on OpenSolaris 2009.06.

--
Giovanni
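Roughly the sequence described above, in commands, with "tank" as a placeholder pool name and c3t1d0 taken from the report; whether a plain offline/online or a replace is the right move depends on what actually failed:

  # see how ZFS currently views the log device
  zpool status tank

  # once the hardware RAID-1 volume is healthy again, try bringing the
  # log device back...
  zpool online tank c3t1d0

  # ...or replace it with itself, which kicks off a resilver
  zpool replace tank c3t1d0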
On Fri, 7 May 2010, Michael Sullivan wrote:

> Well, if you are striping over multiple devices, the I/O should be spread over the devices and you should be reading from them all simultaneously rather than just accessing a single device. Traditional striping would cut a transfer to roughly 1/n of the single-disk time, where n is the number of disks the stripe is spread across.

This is true. Use of mirroring also improves performance, since a mirror multiplies the read performance for the same data. The value of the various approaches likely depends on the total size of the working set and the number of simultaneous requests. Currently available L2ARC SSD devices are very good with a high number of I/Os, but they are quite a bottleneck for bulk reads as compared to the L1ARC in RAM.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Thu, May 6, 2010 at 11:08 AM, Michael Sullivan <michael.p.sullivan at mac.com> wrote:

> The round-robin access I am referring to is the way the L2ARC vdevs appear to be accessed. Any given object will be taken from a single device rather than from several devices simultaneously, which would increase the I/O throughput. So, theoretically, a stripe spread over 4 disks would give 4 times the performance of reading from a single disk.

I believe that the L2ARC behaves the same as a pool with multiple top-level vdevs. It's not typical striping, where every write goes to all devices. Writes may go to only one device, or may avoid a device entirely while using several others. The decision about where to place data is made at write time, so no fixed-width stripes are created at allocation time.

In your example, if the file had at least four blocks, there is a likelihood that it would be spread across the four top-level vdevs.

-B
--
Brandon High : bhigh at freaks.com
On 06/05/2010 19:08, Michael Sullivan wrote:

> SSDs are fast, but if I can read a block from more devices simultaneously, it will cut the latency of the overall read.

Keep in mind that the largest block is currently 128KB and you always need to read an entire block. Splitting a block across several L2ARC devices would probably decrease performance, and it would invalidate every block with a piece on a failed device if a single L2ARC device died. Additionally, keeping each block on only one L2ARC device still allows reads from all of the L2ARC devices at the same time.

--
Robert Milkowski
http://milek.blogspot.com
On May 6, 2010, at 11:08 AM, Michael Sullivan wrote:

> Well, if you are striping over multiple devices, the I/O should be spread over the devices and you should be reading from them all simultaneously rather than just accessing a single device. Traditional striping would cut a transfer to roughly 1/n of the single-disk time, where n is the number of disks the stripe is spread across.

In theory, for bandwidth, yes, striping does improve by N. For latency, striping adds little, and in some cases is worse. The ZFS dynamic stripe tries to balance this tradeoff towards latency for HDDs by grouping blocks so that only one seek+rotate is required. More below...

> The round-robin access I am referring to is the way the L2ARC vdevs appear to be accessed.

RAID-0 striping is also round-robin.

> So, any given object will be taken from a single device rather than from several devices simultaneously... This also assumes the controller can handle multiple I/Os, or that you are striped over different disk controllers for each disk in the stripe.

All modern controllers handle multiple, concurrent I/Os.

> SSDs are fast, but if I can read a block from more devices simultaneously, it will cut the latency of the overall read.

OTOH, if you have to wait for N HDDs to seek+rotate, then the latency is that of the slowest disk. The classic analogy is: nine women cannot produce a baby in one month. The difference is:

  ZFS dynamic stripe:
    latency per I/O = fixed latency of one vdev + size / min(media bandwidth, path bandwidth)

  RAID-0:
    latency per I/O = max(fixed latency of the devices) + size / min(N * media bandwidth, path bandwidth)

For HDDs, the media bandwidth is around 100 MB/sec for many devices, far less than the path bandwidth on a modern system. For many SSDs, the media bandwidth is close to the path bandwidth. Newer SSDs have media bandwidth > 3Gbps, but 6Gbps SAS is becoming readily available.

In other words, if the path bandwidth isn't a problem, and the media bandwidth of an SSD is 3x that of an HDD, then the bandwidth requirement that dictated RAID-0 for HDDs is reduced by a factor of 3. Yet another reason why HDDs lost the performance battle. This is also why not many folks choose to use HDDs for L2ARC: the latency gain over the pool is marginal for HDDs. This is also one reason why there is no concatenation in ZFS.
-- richard

--
ZFS storage and performance consulting at http://www.RichardElling.com
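A rough worked example of the two formulas above, using illustrative round numbers only (about 8 ms seek+rotate and 100 MB/s media bandwidth per HDD, about 0.1 ms access and 250 MB/s for an SSD, path bandwidth not limiting, 128 KB block, transfer terms rounded):

  one HDD:            8 ms   + 128 KB / 100 MB/s  =  8   + 1.3 ms = 9.3 ms
  RAID-0 of 4 HDDs:   8 ms   + 128 KB / 400 MB/s  =  8   + 0.3 ms = 8.3 ms
  one SSD:            0.1 ms + 128 KB / 250 MB/s  =  0.1 + 0.5 ms = 0.6 ms

Striping only shaves the transfer term; the seek dominates for HDDs, which is the latency point being made above.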
On Fri, May 7, 2010 at 4:57 AM, Brandon High <bhigh at freaks.com> wrote:

> I believe that the L2ARC behaves the same as a pool with multiple top-level vdevs. It's not typical striping, where every write goes to all devices. Writes may go to only one device, or may avoid a device entirely while using several others. The decision about where to place data is made at write time, so no fixed-width stripes are created at allocation time.

There is not much to believe or disbelieve here. Write accesses to the L2ARC devices are grouped and sent in sequence; a queue is used to coalesce them into larger, fewer chunks to write. The L2ARC behaves in a rotor fashion, simply sweeping writes through the available space. That's all the magic, nothing very special...

Answering Mike's main question: the behavior on failure is quite simple. Once some L2ARC device[s] are gone, the others will continue to function. Impact: a little performance loss, and some time is needed to warm things up and sort them out again. No serious consequences or data loss here.

Take care, folks.

--
Kind regards, BM
Things that are stupid at the beginning rarely end up wisely.