I'm debating using an external intent log on a new box that I'm about
to start working on, and I have a few questions.

1. If I use an external log initially and decide that it was a
mistake, is there a way to move back to the internal log without
rebuilding the entire pool?

2. What happens if the logging device fails completely? Does this
damage anything else in the pool, other than potentially losing
in-flight transactions?

3. What about corruption in the log? Is it checksummed like the rest of ZFS?

Thanks.


Scott
Scott Laird wrote:
> I'm debating using an external intent log on a new box that I'm about
> to start working on, and I have a few questions.
>
> 1. If I use an external log initially and decide that it was a
> mistake, is there a way to move back to the internal log without
> rebuilding the entire pool?

It's not currently possible to remove a separate log.
This was working once, but was stripped out until the
more generic zpool remove devices was provided.
This is bug 6574286:

http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

> 2. What happens if the logging device fails completely? Does this
> damage anything else in the pool, other than potentially losing
> in-flight transactions?

This should work. It shouldn't even lose the in-flight transactions.
ZFS reverts to using the main pool if a slog write fails or the
slog fills up.

> 3. What about corruption in the log? Is it checksummed like the rest of ZFS?

Yes, it's checksummed, but the checksumming is a bit different
from the pool blocks in the uberblock tree.

See also:
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

> Thanks.
>
> Scott
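For reference, a minimal sketch of how a separate intent log is specified
at pool creation time; the device names here are hypothetical:

    # c1t0d0/c1t1d0 are the main pool disks; c2t0d0 is the fast device
    # used as the separate intent log (slog).
    zpool create tank mirror c1t0d0 c1t1d0 log c2t0d0

    # The log shows up under its own "logs" section in the status output.
    zpool status tank

Until bug 6574286 is fixed there is no corresponding "zpool remove" for
the log device, so the only way back to an internal-only log is to
recreate the pool.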
On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
> Scott Laird wrote:
> > I'm debating using an external intent log on a new box that I'm about
> > to start working on, and I have a few questions.
> >
> > 1. If I use an external log initially and decide that it was a
> > mistake, is there a way to move back to the internal log without
> > rebuilding the entire pool?
>
> It's not currently possible to remove a separate log.
> This was working once, but was stripped out until the
> more generic zpool remove devices was provided.
> This is bug 6574286:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

Okay, so hopefully it'll work in a couple quarters?

> > 2. What happens if the logging device fails completely? Does this
> > damage anything else in the pool, other than potentially losing
> > in-flight transactions?
>
> This should work. It shouldn't even lose the in-flight transactions.
> ZFS reverts to using the main pool if a slog write fails or the
> slog fills up.

So the only way to lose transactions would be a crash or power loss,
leaving outstanding transactions in the log, followed by the log
device failing to start up on reboot? I assume that would be handled
relatively cleanly (files have out-of-date data), as opposed to
something nasty like the pool failing to start up.

> > 3. What about corruption in the log? Is it checksummed like the rest of ZFS?
>
> Yes, it's checksummed, but the checksumming is a bit different
> from the pool blocks in the uberblock tree.
>
> See also:
> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

That started this whole mess :-). I'd like to try out one of the
Gigabyte SATA ramdisk cards that are discussed in the comments. It
supposedly has 18 hours of battery life, so a long-term power outage
would kill the log. I could reasonably expect one 18+ hour power
outage over the life of the filesystem. I'm fine with losing in-flight
data (I'd expect the log to be replayed before the UPS shuts the
system down anyway), but I'd rather not lose the whole pool or
something extreme like that.

I'm willing to trade the chance of some transaction losses during an
exceptional event for more performance, but I'd rather not have to
pull out the backups if I can ever avoid it.


Scott
Scott Laird wrote:
> On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
>>
>> Scott Laird wrote:
>>> I'm debating using an external intent log on a new box that I'm about
>>> to start working on, and I have a few questions.
>>>
>>> 1. If I use an external log initially and decide that it was a
>>> mistake, is there a way to move back to the internal log without
>>> rebuilding the entire pool?
>> It's not currently possible to remove a separate log.
>> This was working once, but was stripped out until the
>> more generic zpool remove devices was provided.
>> This is bug 6574286:
>>
>> http://bugs.opensolaris.org/view_bug.do?bug_id=6574286
>
> Okay, so hopefully it'll work in a couple quarters?

It's not being worked on currently, but hopefully will be fixed in 6 months.

>>> 2. What happens if the logging device fails completely? Does this
>>> damage anything else in the pool, other than potentially losing
>>> in-flight transactions?
>> This should work. It shouldn't even lose the in-flight transactions.
>> ZFS reverts to using the main pool if a slog write fails or the
>> slog fills up.
>
> So the only way to lose transactions would be a crash or power loss,
> leaving outstanding transactions in the log, followed by the log
> device failing to start up on reboot? I assume that would be handled
> relatively cleanly (files have out-of-date data), as opposed to
> something nasty like the pool failing to start up.

I just checked on the behaviour of this. The log is treated as part
of the main pool. If it is not replicated and disappears, then the pool
can't be opened - just like any unreplicated device in the main pool.
If the slog is found but can't be opened or is corrupted, then the
pool will be opened but the slog isn't used.
This seems a bit inconsistent.

>>> 3. What about corruption in the log? Is it checksummed like the rest of ZFS?
>> Yes, it's checksummed, but the checksumming is a bit different
>> from the pool blocks in the uberblock tree.
>>
>> See also:
>> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
>
> That started this whole mess :-). I'd like to try out one of the
> Gigabyte SATA ramdisk cards that are discussed in the comments.

A while ago there was a comment on this alias that these cards
weren't purchasable. Unfortunately, I don't know what is available.

> It supposedly has 18 hours of battery life, so a long-term power
> outage would kill the log. I could reasonably expect one 18+ hour
> power outage over the life of the filesystem. I'm fine with losing
> in-flight data (I'd expect the log to be replayed before the UPS shuts
> the system down anyway), but I'd rather not lose the whole pool or
> something extreme like that.
>
> I'm willing to trade the chance of some transaction losses during an
> exceptional event for more performance, but I'd rather not have to
> pull out the backups if I can ever avoid it.
>
> Scott
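A minimal sketch of the mirrored-slog layout that avoids the
unreplicated-log case described above; the device names are hypothetical:

    # Two main pool disks plus two small fast devices mirrored as the
    # separate intent log, so the loss of one slog device does not
    # prevent the pool from being opened.
    zpool create tank mirror c1t0d0 c1t1d0 log mirror c2t0d0 c3t0d0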
On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
> > So the only way to lose transactions would be a crash or power loss,
> > leaving outstanding transactions in the log, followed by the log
> > device failing to start up on reboot? I assume that would be handled
> > relatively cleanly (files have out-of-date data), as opposed to
> > something nasty like the pool failing to start up.
>
> I just checked on the behaviour of this. The log is treated as part
> of the main pool. If it is not replicated and disappears, then the pool
> can't be opened - just like any unreplicated device in the main pool.
> If the slog is found but can't be opened or is corrupted, then the
> pool will be opened but the slog isn't used.
> This seems a bit inconsistent.

Hmm, yeah. What would happen if I mirrored the ramdisk with a hard
drive? Would ZFS block until the data's stable on both devices, or
would it continue once the write is complete on the ramdisk?

Failing that, would replacing the missing log with a blank device let
me bring the pool back up, or would it be dead at that point?

> >>> 3. What about corruption in the log? Is it checksummed like the rest of ZFS?
> >> Yes, it's checksummed, but the checksumming is a bit different
> >> from the pool blocks in the uberblock tree.
> >>
> >> See also:
> >> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
> >
> > That started this whole mess :-). I'd like to try out one of the
> > Gigabyte SATA ramdisk cards that are discussed in the comments.
>
> A while ago there was a comment on this alias that these cards
> weren't purchasable. Unfortunately, I don't know what is available.

The umem one is unavailable, but the Gigabyte model is easy to find.
I had Amazon overnight one to me; it's probably sitting at home right
now.


Scott
On Thu, Oct 18, 2007 at 02:29:27PM -0600, Neil Perrin wrote:
>
> > So the only way to lose transactions would be a crash or power loss,
> > leaving outstanding transactions in the log, followed by the log
> > device failing to start up on reboot? I assume that would be handled
> > relatively cleanly (files have out-of-date data), as opposed to
> > something nasty like the pool failing to start up.
>
> I just checked on the behaviour of this. The log is treated as part
> of the main pool. If it is not replicated and disappears, then the pool
> can't be opened - just like any unreplicated device in the main pool.
> If the slog is found but can't be opened or is corrupted, then the
> pool will be opened but the slog isn't used.
> This seems a bit inconsistent.
>

It's worth noting that this is a generic problem. In the world of
metadata replication (ditto blocks), even an unreplicated normal device
does not necessarily render a pool completely faulted. The code needs
to be modified across the board so that the root vdev never ends up in
the FAULTED state, and then pool health is based solely on the ability
to read some basic piece of information, such as a successful
dsl_pool_open(). From the looks of things, this will "just work" if we
get rid of the too_many_errors() call and associated code in
vdev_root.c, but I'm sure there would be some odd edge conditions.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
Scott Laird wrote:
> On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
>>> So the only way to lose transactions would be a crash or power loss,
>>> leaving outstanding transactions in the log, followed by the log
>>> device failing to start up on reboot? I assume that would be handled
>>> relatively cleanly (files have out-of-date data), as opposed to
>>> something nasty like the pool failing to start up.
>> I just checked on the behaviour of this. The log is treated as part
>> of the main pool. If it is not replicated and disappears, then the pool
>> can't be opened - just like any unreplicated device in the main pool.
>> If the slog is found but can't be opened or is corrupted, then the
>> pool will be opened but the slog isn't used.
>> This seems a bit inconsistent.
>
> Hmm, yeah. What would happen if I mirrored the ramdisk with a hard
> drive? Would ZFS block until the data's stable on both devices, or
> would it continue once the write is complete on the ramdisk?

ZFS ensures all mirror sides have the data before returning.

> Failing that, would replacing the missing log with a blank device let
> me bring the pool back up, or would it be dead at that point?

Replacing the device would work:

: mull ; mkfile 100m /p1 /p2
: mull ; zpool create whirl /p1 log /p2
: mull ; echo abc > /whirl/f
: mull ; sync
: mull ; rm /p2
: mull ; sync
<reset system>
: mull ; zpool status
  pool: whirl
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       UNAVAIL      0     0     0  insufficient replicas
          /p1       ONLINE       0     0     0
        logs        UNAVAIL      0     0     0  insufficient replicas
          /p2       UNAVAIL      0     0     0  cannot open
: mull ; mkfile 100m /p2 /p3
: mull ; zpool online whirl /p2
warning: device '/p2' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
: mull ; zpool status
  pool: whirl
 state: ONLINE
status: One or more devices could not be used because the label is
        missing or invalid.  Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       ONLINE       0     0     0
          /p1       ONLINE       0     0     0
        logs        ONLINE       0     0     0
          /p2       UNAVAIL      0     0     0  corrupted data

errors: No known data errors
: mull ; zpool replace whirl /p2 /p3
: mull ; zpool status
  pool: whirl
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Oct 18 18:16:39 2007
config:

        NAME          STATE     READ WRITE CKSUM
        whirl         ONLINE       0     0     0
          /p1         ONLINE       0     0     0
        logs          ONLINE       0     0     0
          replacing   ONLINE       0     0     0
            /p2       UNAVAIL      0     0     0  corrupted data
            /p3       ONLINE       0     0     0

errors: No known data errors
: mull ; zpool status
  pool: whirl
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Oct 18 18:16:39 2007
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       ONLINE       0     0     0
          /p1       ONLINE       0     0     0
        logs        ONLINE       0     0     0
          /p3       ONLINE       0     0     0

errors: No known data errors
: mull ; zfs mount
: mull ; zfs mount -a
: mull ; cat /whirl/f
abc
: mull ;

>>>>> 3. What about corruption in the log? Is it checksummed like the rest of ZFS?
>>>> Yes, it's checksummed, but the checksumming is a bit different
>>>> from the pool blocks in the uberblock tree.
>>>>
>>>> See also:
>>>> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
>>> That started this whole mess :-). I'd like to try out one of the
>>> Gigabyte SATA ramdisk cards that are discussed in the comments.
>> A while ago there was a comment on this alias that these cards
>> weren't purchasable. Unfortunately, I don't know what is available.
>
> The umem one is unavailable, but the Gigabyte model is easy to find.
> I had Amazon overnight one to me; it's probably sitting at home right
> now.

Cool, let us know how it goes.

Neil.
On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
> > The umem one is unavailable, but the Gigabyte model is easy to find.
> > I had Amazon overnight one to me; it's probably sitting at home right
> > now.
>
> Cool, let us know how it goes.

Not so well. I was completely unable to get the card to work at all.

The motherboard's BIOS wouldn't even list the GC-RAMDISK during the
bus scan. Solaris saw it, but couldn't talk to it:

Oct 20 12:50:54 fs2 ahci: [ID 632458 kern.warning] WARNING:
ahci_port_reset: port 1 the device hardware has been initialized and
the power-up diagnostics failed

The Supermicro 8-port SATA card's BIOS saw it, but Solaris reported
errors at boot time:

Oct 20 12:06:00 fs2 marvell88sx: [ID 748163 kern.warning] WARNING:
marvell88sx0: device on port 5 still busy after reset

I tried using it with the motherboard's Marvell-based eSATA ports, but
that made the POST hang for a minute or two and Solaris spewed errors
all over the console after boot.

I'm sending it back.


Scott
On 10/22/07, Scott Laird <scott at sigkill.org> wrote:
> Oct 20 12:50:54 fs2 ahci: [ID 632458 kern.warning] WARNING:
> ahci_port_reset: port 1 the device hardware has been initialized and
> the power-up diagnostics failed

IIRC, Gigabyte cheaped out and didn't implement SMART. Thus, everything
that tries to keep track of this marks the drives as failed.

> I'm sending it back.

Good plan.

Will
>> This should work. It shouldn't even lose the in-flight transactions.
>> ZFS reverts to using the main pool if a slog write fails or the
>> slog fills up.
>
> So the only way to lose transactions would be a crash or power loss,
> leaving outstanding transactions in the log, followed by the log
> device failing to start up on reboot? I assume that would be handled
> relatively cleanly (files have out-of-date data), as opposed to
> something nasty like the pool failing to start up.

It's just data loss from the zpool's perspective. However, it is loss
of data that applications had committed, so applications that relied on
committing data for their own consistency might end up with a corrupted
view of the world. NFS clients fall in this bin.

Mirroring the NVRAM cards in the Separate Intent log seems like a very
good idea.

-r
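As a sketch, with hypothetical device names standing in for the two
NVRAM cards, a mirrored log can be added to an existing pool like this:

    # c4t0d0 and c5t0d0 are the two NVRAM cards; adding them as a
    # mirrored log vdev keeps the slog available if one card fails.
    zpool add tank log mirror c4t0d0 c5t0d0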