I was checking with Sun support regarding this issue, and they say "The CR
currently has a high priority and the fix is understood. However, there is
no eta, workaround, nor IDR."

If it's a high priority, and it's known how to fix it, I was curious as to
why there has been no progress. As I understand, if a failure of the log
device occurs while the pool is active, it automatically switches back to
an embedded pool log. It seems removal would be as simple as following the
failure path to an embedded log, and then updating the pool metadata to
remove the log device. Is it more complicated than that? We're about to do
some testing with slogs, and it would make me a lot more comfortable to
deploy one in production if there was a backout plan :)...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
Paul B. Henson wrote:
> I was checking with Sun support regarding this issue, and they say "The CR
> currently has a high priority and the fix is understood. However, there is
> no eta, workaround, nor IDR."
>
> If it's a high priority, and it's known how to fix it, I was curious as to
> why there has been no progress. As I understand, if a failure of the log
> device occurs while the pool is active, it automatically switches back to
> an embedded pool log. It seems removal would be as simple as following the
> failure path to an embedded log, and then updating the pool metadata to
> remove the log device. Is it more complicated than that? We're about to do
> some testing with slogs, and it would make me a lot more comfortable to
> deploy one in production if there was a backout plan :)...

If you don't have mirrored slogs and the slog fails, you may lose any data
that was in a txg group waiting to be committed to the main pool vdevs -
you will never know if you lost any data or not.

I think this thread is the latest discussion about slogs and their
behavior:

https://opensolaris.org/jive/thread.jspa?threadID=102392&tstart=0

--
Dave
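As an aside, protecting the slog with a mirror is a one-liner; a minimal
sketch with hypothetical pool and device names (standard zpool syntax of
this era):

-----
# add a mirrored log device to an existing pool
zpool add tank log mirror c4t0d0 c4t1d0

# or turn an existing single slog into a mirror by attaching a second device
zpool attach tank c4t0d0 c4t1d0
-----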
Paul B. Henson wrote:
> I was checking with Sun support regarding this issue, and they say "The CR
> currently has a high priority and the fix is understood. However, there is
> no eta, workaround, nor IDR."
>
> If it's a high priority, and it's known how to fix it, I was curious as to
> why there has been no progress. As I understand, if a failure of the log
> device occurs while the pool is active, it automatically switches back to
> an embedded pool log. It seems removal would be as simple as following the
> failure path to an embedded log, and then updating the pool metadata to
> remove the log device. Is it more complicated than that? We're about to do
> some testing with slogs, and it would make me a lot more comfortable to
> deploy one in production if there was a backout plan :)...

Removal of a slog is different than failure of a slog. Removal is an
administrative task, not a repair task. You can lump slog removal in with
the more general shrink or top-level vdev removal tasks.
 -- richard
Does Solaris flush a slog device before it powers down? If so, removal
during a shutdown cycle wouldn't lose any data.

On Wed, May 20, 2009 at 7:57 AM, Dave <dave-zfs at dubkat.com> wrote:
> If you don't have mirrored slogs and the slog fails, you may lose any data
> that was in a txg group waiting to be committed to the main pool vdevs -
> you will never know if you lost any data or not.
>
> I think this thread is the latest discussion about slogs and their
> behavior:
>
> https://opensolaris.org/jive/thread.jspa?threadID=102392&tstart=0
On Tue, 19 May 2009, Dave wrote:

> If you don't have mirrored slogs and the slog fails, you may lose any
> data that was in a txg group waiting to be committed to the main pool
> vdevs - you will never know if you lost any data or not.

True; but from what I understand the failure rate of SSDs is so low it's
not worth mirroring them. My main issue is that we're considering dropping
an SSD into an x4500, which is not an officially supported configuration,
and I'd like an easy way to back out of it if it becomes an issue.

> I think this thread is the latest discussion about slogs and their
> behavior:
>
> https://opensolaris.org/jive/thread.jspa?threadID=102392&tstart=0

Yah, that's my thread too :).

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
On Tue, 19 May 2009, Richard Elling wrote:

> Removal of a slog is different than failure of a slog. Removal is an
> administrative task, not a repair task. You can lump slog removal in
> with the more general shrink or top-level vdev removal tasks.

Granted; however, for shrinking or removing vdevs you first need to do the
technical step of getting data off of the devices you want to remove.
Presumably that code has not yet been implemented, so before you can
implement removal of data devices, you have to write that code. On the
other hand, clearly there already exists code that will switch from a
separate log device back to an embedded one. The only thing missing is a
simple "update the metadata" step. While they might be lumped together in
the same category, in terms of the amount of effort and complexity to
implement, unless there's something I'm not understanding, there should be
a couple of orders of magnitude difference.

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
On Tue, 19 May 2009, Nicholas Lee wrote:

> Does Solaris flush a slog device before it powers down? If so, removal
> during a shutdown cycle wouldn't lose any data.

Actually, if you remove a slog from a pool while the pool is exported, the
pool becomes completely inaccessible. I have an open support ticket
requesting details of any potential recovery method for such a situation.

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
On May 19, 2009, at 12:57 PM, Dave wrote:

> If you don't have mirrored slogs and the slog fails, you may lose any
> data that was in a txg group waiting to be committed to the main pool
> vdevs - you will never know if you lost any data or not.

None of the above is correct. First off, you only lose data if the slog
fails *and* the machine panics/reboots before the transaction group is
synced (5-30s by default depending on load, though there is a CR filed to
immediately sync on slog failure). You will not lose any data once the txg
is synced - syncing the transaction group does not require reading from
the slog, so failure of the log device does not impact normal operation.

The latter half of the above statement is also incorrect. Should you find
yourself in the double failure described above, you will get an FMA fault
that describes the nature of the problem and the implications. If the slog
is truly dead, you can 'zpool clear' (or 'fmadm repair') the fault and use
whatever data you still have in the pool. If the slog is just missing, you
can insert it and continue without losing data. In no case will ZFS
silently continue without committed data.

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
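To make that recovery path concrete, the commands involved would look
roughly like the sketch below; the pool name and fault UUID are
hypothetical, and the exact fault class FMA reports may differ:

-----
# see what FMA has logged about the failed slog
fmdump -v
fmadm faulty

# if the slog is truly dead, acknowledge the fault and carry on with the pool
zpool clear tank

# ...or, equivalently, mark the FMA fault as repaired (UUID is a placeholder)
fmadm repair 7f88960c-5b44-4e9e-9dac-dca6f4bd9a22
-----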
So the txg is synced to the slog device but retained in memory, and then,
rather than reading it back from the slog into memory, it is copied to the
pool from the in-memory copy? With the txg being the working set of the
active commit, it might be a set of NFS iops?

On Wed, May 20, 2009 at 3:43 PM, Eric Schrock <Eric.Schrock at sun.com> wrote:
> On May 19, 2009, at 12:57 PM, Dave wrote:
>
>> If you don't have mirrored slogs and the slog fails, you may lose any
>> data that was in a txg group waiting to be committed to the main pool
>> vdevs - you will never know if you lost any data or not.
>
> None of the above is correct. First off, you only lose data if the slog
> fails *and* the machine panics/reboots before the transaction group is
> synced (5-30s by default depending on load, though there is a CR filed to
> immediately sync on slog failure). You will not lose any data once the txg
> is synced - syncing the transaction group does not require reading from
> the slog, so failure of the log device does not impact normal operation.
>
> The latter half of the above statement is also incorrect. Should you find
> yourself in the double failure described above, you will get an FMA fault
> that describes the nature of the problem and the implications. If the slog
> is truly dead, you can 'zpool clear' (or 'fmadm repair') the fault and use
> whatever data you still have in the pool. If the slog is just missing, you
> can insert it and continue without losing data. In no case will ZFS
> silently continue without committed data.
>
> - Eric
>
> --
> Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
On May 19, 2009, at 8:56 PM, Nicholas Lee wrote:

> So the txg is synced to the slog device but retained in memory, and
> then, rather than reading it back from the slog into memory, it is
> copied to the pool from the in-memory copy?

Yes, that is correct. It is best to think of the ZIL and the txg sync
process as orthogonal - data goes to both locations at different times.
The ZIL (technically "all ZILs", since they're per-filesystem) is *only*
read in the event of log replay (unclean shutdown). During normal
operation it is never read. Hence the benefit of write-biased SSDs - it
doesn't matter if reads are fast for slogs.

> With the txg being the working set of the active commit, it might be a
> set of NFS iops?

If the NFS ops are synchronous, then yes. Async operations do not use the
ZIL and therefore don't have anything to do with slogs.

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
On Tue, 19 May 2009, Eric Schrock wrote:

> The latter half of the above statement is also incorrect. Should you
> find yourself in the double failure described above, you will get an FMA
> fault that describes the nature of the problem and the implications. If
> the slog is truly dead, you can 'zpool clear' (or 'fmadm repair') the
> fault and use whatever data you still have in the pool. If the slog is
> just missing, you can insert it and continue without losing data. In no
> case will ZFS silently continue without committed data.

How about the case where a slog device dies while a pool is not active? I
created a pool with one mirror pair and a slog, and then intentionally
corrupted the slog while the pool was exported (dd if=/dev/zero
of=/dev/dsk/<slog>), and the pool is now inaccessible:

-----
root at s10 ~ # zpool import
  pool: export
    id: 7254558150370674682
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing devices and try
        again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

        export       UNAVAIL  missing device
          mirror     ONLINE
            c0t1d0   ONLINE
            c0t2d0   ONLINE

        Additional devices are known to be part of this pool, though their
        exact configuration cannot be determined.
-----

zpool clear doesn't help:

-----
root at s10 ~ # zpool clear export
cannot open 'export': no such pool
-----

and there's no fault logged:

-----
root at s10 ~ # fmdump
TIME                 UUID                                 SUNW-MSG-ID
fmdump: /var/fm/fmd/fltlog is empty
-----

How do you recover from this scenario?

BTW, you don't happen to have any insight on why slog removal hasn't been
implemented yet?

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
I guess this also means the relative value of a slog is also limited by
the amount of memory that can be allocated to the txg.

On Wed, May 20, 2009 at 4:03 PM, Eric Schrock <Eric.Schrock at sun.com> wrote:
> Yes, that is correct. It is best to think of the ZIL and the txg sync
> process as orthogonal - data goes to both locations at different times.
> The ZIL (technically "all ZILs", since they're per-filesystem) is *only*
> read in the event of log replay (unclean shutdown). During normal
> operation it is never read. Hence the benefit of write-biased SSDs - it
> doesn't matter if reads are fast for slogs.
Eric Schrock wrote:
> On May 19, 2009, at 12:57 PM, Dave wrote:
>
>> If you don't have mirrored slogs and the slog fails, you may lose any
>> data that was in a txg group waiting to be committed to the main pool
>> vdevs - you will never know if you lost any data or not.
>
> None of the above is correct. First off, you only lose data if the slog
> fails *and* the machine panics/reboots before the transaction group is
> synced (5-30s by default depending on load, though there is a CR filed
> to immediately sync on slog failure). You will not lose any data once
> the txg is synced - syncing the transaction group does not require
> reading from the slog, so failure of the log device does not impact
> normal operation.

Thanks for correcting my statement. There is still a potential window of
roughly 60 seconds for data loss if there are two transaction groups
waiting to sync with a 30 second txg commit timer, correct?

> The latter half of the above statement is also incorrect. Should you
> find yourself in the double failure described above, you will get an FMA
> fault that describes the nature of the problem and the implications. If
> the slog is truly dead, you can 'zpool clear' (or 'fmadm repair') the
> fault and use whatever data you still have in the pool. If the slog is
> just missing, you can insert it and continue without losing data. In no
> case will ZFS silently continue without committed data.

How will it know that data was actually lost? Or does it just alert you
that it's possible data was lost?

There's also the worry that the pool is not importable if you did have the
double failure scenario and the log really is gone. Re: bug ID 6733267,
e.g. if you had done a 'zpool import -o cachefile=none mypool'.

--
Dave
Eric Schrock wrote:
> On May 19, 2009, at 8:56 PM, Nicholas Lee wrote:
>
>> So the txg is synced to the slog device but retained in memory, and
>> then, rather than reading it back from the slog into memory, it is
>> copied to the pool from the in-memory copy?
>
> Yes, that is correct. It is best to think of the ZIL and the txg sync
> process as orthogonal - data goes to both locations at different times.
> The ZIL (technically "all ZILs", since they're per-filesystem) is *only*
> read in the event of log replay (unclean shutdown). During normal
> operation it is never read. Hence the benefit of write-biased SSDs - it
> doesn't matter if reads are fast for slogs.

Yes. OTOH we also know that if the pool is exported, then there is nothing
to be read from the slog. So in the case of a properly exported pool, we
should be allowed to import sans slog. No? I seem to recall an RFE for
this, but can't seem to locate it now.
 -- richard
On May 19, 2009, at 9:45 PM, Dave wrote:

> Thanks for correcting my statement. There is still a potential window of
> roughly 60 seconds for data loss if there are two transaction groups
> waiting to sync with a 30 second txg commit timer, correct?

No, only the syncing transaction group is affected. And that's only if
you're pushing a very small amount of data. Once you've reached a certain
amount of data we push the txg regardless of how long it's been. I believe
the ZFS team is also re-evaluating that timeout (it used to be much
smaller), so it's definitely not set in stone.

> How will it know that data was actually lost?

It attempts to replay the log records.

> Or does it just alert you that it's possible data was lost?

No, we know that there should be a log record but we couldn't read it.

> There's also the worry that the pool is not importable if you did have
> the double failure scenario and the log really is gone. Re: bug ID
> 6733267, e.g. if you had done a 'zpool import -o cachefile=none mypool'.

Yes, import relies on the vdev GUID sum matching, which is orthogonal to a
failed slog device during open. A failed slog device can prevent such a
pool from being imported.

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
Paul B. Henson wrote:
> I was checking with Sun support regarding this issue, and they say "The CR
> currently has a high priority and the fix is understood. However, there is
> no eta, workaround, nor IDR."
>
> If it's a high priority, and it's known how to fix it, I was curious as to
> why there has been no progress. As I understand, if a failure of the log

Why do you think there is no progress? The code for this is hard to
implement. People are working very hard on it, but that doesn't mean there
isn't any progress just because there is no eta or workaround available
via Sun Service.

--
Darren J Moffat
On Tue, 19 May 2009, Dave wrote:

> Paul B. Henson wrote:
>> I was checking with Sun support regarding this issue, and they say "The CR
>> currently has a high priority and the fix is understood. However, there is
>> no eta, workaround, nor IDR."
>>
>> If it's a high priority, and it's known how to fix it, I was curious as to
>> why there has been no progress. As I understand, if a failure of the log
>> device occurs while the pool is active, it automatically switches back to
>> an embedded pool log. It seems removal would be as simple as following the
>> failure path to an embedded log, and then updating the pool metadata to
>> remove the log device. Is it more complicated than that? We're about to do
>> some testing with slogs, and it would make me a lot more comfortable to
>> deploy one in production if there was a backout plan :)...
>
> If you don't have mirrored slogs and the slog fails, you may lose any data
> that was in a txg group waiting to be committed to the main pool vdevs -
> you will never know if you lost any data or not.

No, you won't lose any data. The issue is that if you export the pool,
currently you won't be able to import it with a broken log device. But
while a pool is up and its slog device fails, it won't lose any data.

--
Robert Milkowski
http://milek.blogspot.com
On Wed, 20 May 2009, Darren J Moffat wrote:

> Why do you think there is no progress?

Sorry if that's a wrong assumption, but I posted questions regarding it to
this list with no response from a Sun employee until yours, and the
engineer assigned to my support ticket was unable to provide any
information as to the current state, whether anyone was working on it, or
why it was so hard to do. Barring any data to the contrary, it appeared
from an external perspective that it was stalled.

> The code for this is hard to implement. People are working very hard on
> it, but that doesn't mean there isn't any progress just because there is
> no eta or workaround available via Sun Service.

Why is it so hard? I understand that removing a data vdev or shrinking the
size of a pool is complicated, but what makes it so difficult to remove a
slog? If the slog fails, the pool returns to an embedded log; it seems the
only difference between a pool with a failed slog and a pool with no slog
is that the former knows it used to have a slog. Why is it not as simple
as updating the metadata for a pool with a failed slog so it no longer
thinks it has a slog?

On another note, do you have any idea how one might recover from the case
where a slog device fails while the pool is inactive and renders it
inaccessible?

Thanks...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
Paul B. Henson wrote:
> On Wed, 20 May 2009, Darren J Moffat wrote:
>
>> Why do you think there is no progress?
>
> Sorry if that's a wrong assumption, but I posted questions regarding it to
> this list with no response from a Sun employee until yours, and the
> engineer assigned to my support ticket was unable to provide any
> information as to the current state, whether anyone was working on it, or
> why it was so hard to do. Barring any data to the contrary, it appeared
> from an external perspective that it was stalled.

How Sun Services reports the status of escalations to customers under
contract is not a discussion for a public alias like this, so I won't
comment on this.

>> The code for this is hard to implement. People are working very hard on
>> it, but that doesn't mean there isn't any progress just because there is
>> no eta or workaround available via Sun Service.
>
> Why is it so hard? I understand that removing a data vdev or shrinking the
> size of a pool is complicated, but what makes it so difficult to remove a
> slog? If the slog fails, the pool returns to an embedded log; it seems the
> only difference between a pool with a failed slog and a pool with no slog
> is that the former knows it used to have a slog. Why is it not as simple
> as updating the metadata for a pool with a failed slog so it no longer
> thinks it has a slog?

If the engineers that are working on this wish to comment I'm sure they
will, but I know it really isn't that simple.

> On another note, do you have any idea how one might recover from the case
> where a slog device fails while the pool is inactive and renders it
> inaccessible?

I do; because I've done it to my own personal data pool. However it is not
a procedure I'm willing to tell anyone how to do - so please don't ask -
a) it was highly dangerous and involved using multiple different zfs
kernel modules as well as updating disk labels from inside kmdb, and b) it
was several months ago and I think I'd forget a critical step. I'm not
even sure I could repeat it on a pool with a non-trivial number of
datasets.

It did, however, give me an appreciation for the issues in implementing
this as a generic solution in a way that is safe to do and that works for
all types of pool and slog config (mine was a very simple configuration:
mirror + slog).

--
Darren J Moffat
On Wed, 20 May 2009, Darren J Moffat wrote:

> How Sun Services reports the status of escalations to customers under
> contract is not a discussion for a public alias like this, so I won't
> comment on this.

Heh, but maybe it should be a discussion for some internal forum; more
information = less anxious customers :)...

> If the engineers that are working on this wish to comment I'm sure they
> will, but I know it really isn't that simple.

I hope they do, as an information vacuum tends to result in false
assumptions.

> I do; because I've done it to my own personal data pool. However it is
> not a procedure I'm willing to tell anyone how to do - so please don't
[...]
> implementing this as a generic solution in a way that is safe to do and
> that works for all types of pool and slog config (mine was a very simple
> configuration: mirror + slog).

Hmm, well, it just seems horribly wrong for the failure of a slog to
result in complete data loss, *particularly* when all of the data is
perfectly valid and just sitting there beyond your reach.

One suggestion I received off-list was to dump your virgin slog right
after creation (I did a dd if=/dev/zero of=/dev/dsk/<slogtobe>, a zpool
add <pool> log <slog>, then a dd if=/dev/<slog> of=slog.dd count=<blocks
until everything is zeros>), and then you could restore it if you lost
your slog. I tested this, and sure enough if I wrote the slog dump back to
the corrupted device the pool was happy again. This only seemed to work if
I restored it to the exact same device it was on before; restoring it to a
different device didn't work (I thought zfs was device name agnostic?).

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
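Spelling that workaround out as a rough sketch (pool and device names are
hypothetical, the block count is a placeholder for "enough to capture the
labels", and it only appears to work when restored onto the identical
device):

-----
# zero the device that will become the slog, so its starting contents are known
dd if=/dev/zero of=/dev/dsk/c4t0d0s0 bs=1024k

# add it as a separate log device, then immediately dump the virgin slog
zpool add tank log c4t0d0s0
dd if=/dev/dsk/c4t0d0s0 of=/var/tmp/tank-slog.dd bs=512 count=16384

# later, if the slog is corrupted while the pool is exported, write the dump
# back to the *same* device and retry the import
dd if=/var/tmp/tank-slog.dd of=/dev/dsk/c4t0d0s0 bs=512
zpool import tank
-----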
Paul B. Henson wrote:
> On Wed, 20 May 2009, Darren J Moffat wrote:
>
>> How Sun Services reports the status of escalations to customers under
>> contract is not a discussion for a public alias like this, so I won't
>> comment on this.
>
> Heh, but maybe it should be a discussion for some internal forum; more
> information = less anxious customers :)...
>
>> If the engineers that are working on this wish to comment I'm sure they
>> will, but I know it really isn't that simple.
>
> I hope they do, as an information vacuum tends to result in false
> assumptions.

As does attempting to gain the same information via several routes. Since
you apparently have a Sun Service support contract, I highly recommend
that you pursue that route alone for further information on this bug's
status for your production needs.

Note that I don't represent the engineers working on this issue, and my
statements are purely my personal view on things based on my experience of
recovering from the situation. I was only able to do that recovery due to
my experience in the ZFS code base from the ZFS crypto project.

--
Darren J Moffat
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> in the case of a properly exported pool, we should be allowed re> to import sans slog. seems so, but the non properly exported case is still important. for example NFS HA clusters would stop working if slogs were always ignored on import---you have to ''import -f'' and roll the slog like it does now. And I think the desire for the removal feature is both these: * don''t want to lose whole pool if slog goes bad * want to test out slog, then remove it. or replace with smaller slog. so your case handles the second (though you''d have to export/import to remove, while you can add online) but it does not handle the first. I want to make sure the common conception of the scope of the slog user-whining doesn''t ``creep'''' narrower. I think the original slog work included a comment like: q. shouldn''t we be able to remove them? a. yes that would be extremely easy to do, but my deadline is already tight so I want to resist scope creep. now, here we are, a year later... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090520/81553d99/attachment.bin>
>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes:djm> I do; because I''ve done it to my own personal data pool. djm> However it is not a procedure I''m willing to tell anyone how djm> to do - so please don''t ask - k, fine, fair enough and noted. djm> a) it was highly dangerous and involved using multiple djm> different zfs kernel modules was well as however...utter hogwash! Nothing is ``highly dangerous'''' when your pool is completely unreadable. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090520/3cf428c9/attachment.bin>
On Wed, May 20, 2009 at 12:42, Miles Nordin <carton at ivy.net> wrote:
>>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes:
>
>   djm> a) it was highly dangerous and involved using multiple
>   djm> different zfs kernel modules as well as
>
> however...utter hogwash! Nothing is ``highly dangerous'' when your
> pool is completely unreadable.

It is if you turn your "unreadable but fixable" pool into a "completely
unrecoverable" pool. If my pool loses its log disk, I'm waiting for an
official tool to fix it.

Will
Will Murnane wrote:
> On Wed, May 20, 2009 at 12:42, Miles Nordin <carton at ivy.net> wrote:
>>>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes:
>>
>>   djm> a) it was highly dangerous and involved using multiple
>>   djm> different zfs kernel modules as well as
>>
>> however...utter hogwash! Nothing is ``highly dangerous'' when your
>> pool is completely unreadable.
>
> It is if you turn your "unreadable but fixable" pool into a
> "completely unrecoverable" pool. If my pool loses its log disk, I'm
> waiting for an official tool to fix it.

Whoa.

The slog is a top-level vdev like the others. The current situation is
that loss of a top-level vdev results in a pool that cannot be imported.
If you are concerned about the loss of a top-level vdev, then you need to
protect them. For slogs, mirrors work. For the main pool, mirrors and
raidz[12] work.

There was a conversation regarding whether it would be a best practice to
always mirror the slog. Since the recovery from slog failure modes is
better than that of the other top-level vdevs, the case for recommending a
mirrored slog is less clear. If you are paranoid, then mirror the slog.
 -- richard
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> Whoa. re> The slog is a top-level vdev like the others. The current re> situation is that loss of a top-level vdev results in a pool re> that cannot be imported. this taxonomy is wilfully ignorant of the touted way pools will keep working if their slog dies, reverting to normal inside-the-main-pool zil. and about ZFS''s supposed ability to tell from the other pool devices and report to fmdump if the slog was empty or full before it failed. also the difference between slogs failing on imported pools (okay) vs failed slogs on exported pools (entire pool lost). It''s not as intuitive as you imply, and I don''t think worrying about a corner case where you lose the whole pool is ``paranoid''''. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090520/56721da4/attachment.bin>
Richard Elling wrote:
> Will Murnane wrote:
>> On Wed, May 20, 2009 at 12:42, Miles Nordin <carton at ivy.net> wrote:
>>>>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes:
>>>
>>>   djm> a) it was highly dangerous and involved using multiple
>>>   djm> different zfs kernel modules as well as
>>>
>>> however...utter hogwash! Nothing is ``highly dangerous'' when your
>>> pool is completely unreadable.
>>
>> It is if you turn your "unreadable but fixable" pool into a
>> "completely unrecoverable" pool. If my pool loses its log disk, I'm
>> waiting for an official tool to fix it.
>
> Whoa.
>
> The slog is a top-level vdev like the others. The current situation is
> that loss of a top-level vdev results in a pool that cannot be imported.
> If you are concerned about the loss of a top-level vdev, then you need
> to protect them. For slogs, mirrors work. For the main pool, mirrors and
> raidz[12] work.
>
> There was a conversation regarding whether it would be a best practice
> to always mirror the slog. Since the recovery from slog failure modes is
> better than that of the other top-level vdevs, the case for recommending
> a mirrored slog is less clear. If you are paranoid, then mirror the slog.
>  -- richard

I can't test this myself at the moment, but the reporter of Bug ID 6733267
says even one failed slog from a pair of mirrored slogs will prevent an
exported zpool from being imported. Has anyone tested this recently?

--
Dave
Not sure if this is a wacky question. Given that a slog device does not
really need much more than 10 GB: if I was to use a pair of X25-Es (or
STEC devices or whatever) in a mirror as the boot device, and then either
1. created a loopback file vdev, or 2. used a separate mirrored slice for
the slog, would this cause issues? Other than h/w management details.

Nicholas
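For what it's worth, option 2 amounts to handing zpool a pair of slices
rather than whole disks; a minimal sketch, assuming a small spare slice
(s3 here, purely hypothetical) was carved out on each boot SSD with
format(1M):

-----
# mirror a small slice from each boot SSD as the pool's log device
zpool add tank log mirror c0t0d0s3 c0t1d0s3
-----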
On Wed, May 20, 2009 at 9:35 AM, Paul B. Henson <henson at acm.org> wrote:
> On Wed, 20 May 2009, Darren J Moffat wrote:
>
>> Why do you think there is no progress?
>
> Sorry if that's a wrong assumption, but I posted questions regarding it to
> this list with no response from a Sun employee until yours, and the
> engineer assigned to my support ticket was unable to provide any
> information as to the current state, whether anyone was working on it, or
> why it was so hard to do. Barring any data to the contrary, it appeared
> from an external perspective that it was stalled.
>
>> The code for this is hard to implement. People are working very hard on
>> it, but that doesn't mean there isn't any progress just because there is
>> no eta or workaround available via Sun Service.
>
> Why is it so hard? I understand that removing a data vdev or shrinking
> the size of a pool is complicated, but what makes it so difficult to
> remove a slog? If the slog fails, the pool returns to an embedded log; it
> seems the only difference between a pool with a failed slog and a pool
> with no slog is that the former knows it used to have a slog. Why is it
> not as simple as updating the metadata for a pool with a failed slog so
> it no longer thinks it has a slog?
>
> On another note, do you have any idea how one might recover from the case
> where a slog device fails while the pool is inactive and renders it
> inaccessible?
>
> Thanks...

I stumbled across this just now while performing a search for something
else.

http://opensolaris.org/jive/thread.jspa?messageID=377018

I have no idea of the quality or correctness of this solution.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>
>    re> Whoa.
>
>    re> The slog is a top-level vdev like the others. The current
>    re> situation is that loss of a top-level vdev results in a pool
>    re> that cannot be imported.
>
> this taxonomy is wilfully ignorant of the touted way pools will keep
> working if their slog dies, reverting to the normal inside-the-main-pool
> zil. and about ZFS's supposed ability to tell from the other pool
> devices and report to fmdump whether the slog was empty or full before
> it failed.

What happens depends on what "it" is when "it failed", as well as the
nature of the failure. ZFS reads the slog on import, so there is no notion
of the slog being empty or full at any given instant other than at import
time, when we may also know whether the pool was exported or not.

Another way to look at this: there is no explicit flag set in the pool
that indicates whether the slog is empty or full. Adding one would be
silly because it would be inherently slow, and improving performance is
why the slog exists in the first place. We can't count on the in-RAM
status, since that won't survive an outage. At import time, the vdev trees
are reconstructed -- this is where the cleverness needs to be applied.

> also the difference between slogs failing on imported pools (okay) vs
> failed slogs on exported pools (entire pool lost). It's not as intuitive
> as you imply, and I don't think worrying about a corner case where you
> lose the whole pool is ``paranoid''.

In any case, if you lose a device which has unprotected data on it, then
the data is lost. If you do not protect the slog, then your system is
susceptible to data loss if you lose the slog device. This fact is not
under question. What is under question is the probability that you will
lose a slog device *and* lose data, which we know to be less than the
probability of losing a device -- we just can't say how much less.

Redundancy offers nothing more than insurance. The people who buy
insurance are either gullible or, to some extent, paranoid. I prefer to
believe that the folks on this forum are not gullible :-)
 -- richard
Well, it worked for me, at least. Note that this is a very limited
recovery case - it only works if you have the GUID of the slog device from
zpool.cache, which in the case of a fail-on-export and reimport might not
be available. The original author of the fix seems to imply that you can
use any size device as the replacement slog, but I had trouble doing that.
I didn't investigate enough to say conclusively that that is not possible,
though.

It's a very limited fix, but for the case Paul outlined, it will work,
assuming zpool.cache is available.

On Thu, May 21, 2009 at 9:21 AM, Mike Gerdts <mgerdts at gmail.com> wrote:
> On Wed, May 20, 2009 at 9:35 AM, Paul B. Henson <henson at acm.org> wrote:
> > [...]
> > On another note, do you have any idea how one might recover from the
> > case where a slog device fails while the pool is inactive and renders
> > it inaccessible?
>
> I stumbled across this just now while performing a search for something
> else.
>
> http://opensolaris.org/jive/thread.jspa?messageID=377018
>
> I have no idea of the quality or correctness of this solution.
>
> --
> Mike Gerdts
> http://mgerdts.blogspot.com/
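As an aside, while the pool is still healthy the slog GUID that this kind
of recovery depends on can be recorded ahead of time; a rough sketch, with
a hypothetical device name (zdb output formats vary by build):

-----
# dump the cached pool configuration (from /etc/zfs/zpool.cache) and note
# the guid shown for the log vdev
zdb -C

# the on-disk labels of the slog device also carry its guid
zdb -l /dev/dsk/c4t0d0s0
-----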
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes: >>>>> "es" == Eric Schrock <Eric.Schrock at Sun.COM> writes:re> Another way to look at this, there is no explicit flag set in re> the pool that indicates whether the slog is empty or re> full. Not that it makes a huge difference to me, but Eric seemed to say that actually there is just such a flag: dave> Or does it just alert you that it''s possible data was lost? es> No, we know that there should be a log record but we couldn''t es> read it. doesn''t make perfect sense to me, either, since keeping a slog-full bit synchronously updated seems like it''d have in many cases almost the same cost as not using the slog. Maybe it''s asynchronously updated and useful but not perfectly reliable, or maybe it''s a ``slog empty'''' but that''s only set on export or clean shutdown. Anyway, Richard I think your whole argument is ridiculous: you''re acting like losing 30 seconds of data and losing the entire pool are equivalent. Who is this line of reasoning supposed to serve? From here it looks like everyone loses the further you advance it. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090521/6092bc84/attachment.bin>
On Thu, 21 May 2009, Miles Nordin wrote:

> Anyway, Richard, I think your whole argument is ridiculous: you're
> acting like losing 30 seconds of data and losing the entire pool are
> equivalent. Who is this line of reasoning supposed to serve? From here
> it looks like everyone loses the further you advance it.

For some people losing 30 seconds of data and losing the entire pool could
be equivalent. In fact, it could be a billion dollar error.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>>>>>> "es" == Eric Schrock <Eric.Schrock at Sun.COM> writes:
>
>    re> Another way to look at this: there is no explicit flag set in
>    re> the pool that indicates whether the slog is empty or full.
>
> Not that it makes a huge difference to me, but Eric seemed to say that
> actually there is just such a flag:
>
>  dave> Or does it just alert you that it's possible data was lost?
>
>    es> No, we know that there should be a log record but we couldn't
>    es> read it.
>
> doesn't make perfect sense to me, either, since keeping a slog-full bit
> synchronously updated seems like it'd have in many cases almost the same
> cost as not using the slog. Maybe it's asynchronously updated and useful
> but not perfectly reliable, or maybe it's a ``slog empty'' bit that's
> only set on export or clean shutdown.

In the case of an export, we know that the log has been flushed to the
pool and we know that an export has occurred. This is why there is an RFE
to use this info to permit the pool to be imported sans slog.

> Anyway, Richard, I think your whole argument is ridiculous: you're
> acting like losing 30 seconds of data and losing the entire pool are
> equivalent. Who is this line of reasoning supposed to serve? From here
> it looks like everyone loses the further you advance it.

I recommend protecting your data. Data retention is important. I don't see
where there is an argument here, just an engineering trade-off between
data retention, data availability, performance, and cost... nothing new
here.
 -- richard
On Thu, 21 May 2009, Peter Woodman wrote:

> Well, it worked for me, at least. Note that this is a very limited
> recovery case - it only works if you have the GUID of the slog device
> from zpool.cache, which in the case of a fail-on-export and reimport
> might not be available. The original author of the fix seems to imply
> that you can use any size device as the replacement slog, but I had
> trouble doing that. I didn't investigate enough to say conclusively that
> that is not possible, though.
>
> It's a very limited fix, but for the case Paul outlined, it will work,
> assuming zpool.cache is available.

Cool, thanks for the info. While this wouldn't be much help if you only
came across it after a critical failure, knowing about it beforehand I can
make a backup of the cache file stored someplace other than the pool with
the slog. I'll have to test it out some, but if it works out it will make
me more comfortable about deploying a slog in production.

Still haven't heard anything official from Sun support about recovering
from this situation. I'm also still really curious why it's so hard to
remove a slog, but no one in the know has replied, and that's not the type
of question Sun support tends to follow up on :(...

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
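A minimal sketch of that precaution; the cache file lives at
/etc/zfs/zpool.cache on Solaris, and the destination host and paths below
are hypothetical:

-----
# keep a copy of the cache file somewhere that does not depend on this pool
cp /etc/zfs/zpool.cache /var/tmp/zpool.cache.backup
scp /etc/zfs/zpool.cache admin@backuphost:/backups/$(hostname)-zpool.cache
-----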
On Thu, 21 May 2009, Bob Friesenhahn wrote:

> For some people losing 30 seconds of data and losing the entire pool
> could be equivalent. In fact, it could be a billion dollar error.

I don't think anybody's saying to just ignore a missing slog and continue
on like nothing's wrong. Let the pool fail to import, generate errors and
faults. But if I'm willing to lose whatever uncommitted transactions are
in the slog, why should my entire pool sit inaccessible? Some manual
option to force an import of a pool with a missing slog would at least
give you the option of getting to your data, which would be a lot better
than the current situation.

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
Miles Nordin wrote:
>>>>>> "djm" == Darren J Moffat <darrenm at opensolaris.org> writes:
>
>    djm> I do; because I've done it to my own personal data pool.
>    djm> However it is not a procedure I'm willing to tell anyone how
>    djm> to do - so please don't ask -
>
> k, fine, fair enough and noted.
>
>    djm> a) it was highly dangerous and involved using multiple
>    djm> different zfs kernel modules as well as
>
> however...utter hogwash! Nothing is ``highly dangerous'' when your pool
> is completely unreadable.

That's your opinion. However, I'm the one that did this on *my* data, and
I knew what I was doing because I develop in the ZFS code base. I
calculated the risk I was taking on *my* pool based on the hacks I did for
this.

I considered what I was doing "highly dangerous" because, if I made a
mistake, I could make the situation even worse for myself than it already
was - one that would be even harder to recover from. I had some files in
that pool that I didn't have a backup of yet and really didn't want to
lose, because if I did I'd be in trouble with my spouse. So for me this
was highly dangerous on my data.

I would expect others to make a similar risk judgement based on the value
of the data in the pool. If the only copy of the data is in the pool, what
I did is risky (and remember, you have no idea how I did this, because I'm
not going to explain it - I know I can't accurately reproduce it in
email).

--
Darren J Moffat
On Tue, May 19, 2009 at 2:16 PM, Paul B. Henson <henson at acm.org> wrote:
> I was checking with Sun support regarding this issue, and they say "The CR
> currently has a high priority and the fix is understood. However, there is
> no eta, workaround, nor IDR."
>
> If it's a high priority, and it's known how to fix it, I was curious as to
> why there has been no progress. As I understand, if a failure of the log
> device occurs while the pool is active, it automatically switches back to
> an embedded pool log. It seems removal would be as simple as following the
> failure path to an embedded log, and then updating the pool metadata to
> remove the log device. Is it more complicated than that? We're about to do
> some testing with slogs, and it would make me a lot more comfortable to
> deploy one in production if there was a backout plan :)...

A rather interesting putback just happened...

http://hg.genunix.org/onnv-gate.hg/rev/cc5b64682e64

6803605 should be able to offline log devices
6726045 vdev_deflate_ratio is not set when offlining a log device
6599442 zpool import has faults in the display

I love comments that tell you what is really going on...

     /*
      * If this device has the only valid copy of some data,
-     * don't allow it to be offlined.
+     * don't allow it to be offlined.  Log devices are always
+     * expendable.
      */

For some reason, the CRs listed above are not available through
bugs.opensolaris.org. However, at least 6803605 is available through
sunsolve if you have a support contract.

--
Mike Gerdts
http://mgerdts.blogspot.com/
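Assuming a build that includes this putback, taking a log device out of
service would presumably be just the ordinary offline/online syntax (pool
and device names hypothetical):

-----
# take the slog out of service; per the new comment, log devices are expendable
zpool offline tank c4t0d0

# bring it back into service later if desired
zpool online tank c4t0d0
-----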
>>>>> "mg" == Mike Gerdts <mgerdts at gmail.com> writes:mg> A rather interesting putback just happened... yeah, it is good when you can manually offline the same set of devices as the set of those which are allowed to fail without invoking the pool''s failmode. I guess the putback means one less such difference. However the set of devices you need available to (force-) import a pool should also be the same as the set of devices you need to run it. The goal should be to make all three set requirements (zpool offline, failmode, import) the same. The offline and online cases aren''t totally equivalent, so there will be corner cases like the quorum rules there were with SVM, but by following procedures, banging on things, invoking ``simon-sez'''' flags, the three cases should ultimately end up demanding the same sets of devices. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090522/e0714999/attachment.bin>
Mike Gerdts wrote:
> On Tue, May 19, 2009 at 2:16 PM, Paul B. Henson <henson at acm.org> wrote:
> > [...]
> > We're about to do some testing with slogs, and it would make me a lot
> > more comfortable to deploy one in production if there was a backout
> > plan :)...
>
> A rather interesting putback just happened...
>
> http://hg.genunix.org/onnv-gate.hg/rev/cc5b64682e64
>
> 6803605 should be able to offline log devices
> 6726045 vdev_deflate_ratio is not set when offlining a log device
> 6599442 zpool import has faults in the display
>
> [...]

This putback is the precursor to slog device removal and the ability to
import pools with failed slogs. I'll provide more details as we get closer
to integrating the slog removal feature. We are working on it; it is one
of our top priorities.

Stay tuned for more details...

George