I'm debating using an external intent log on a new box that I'm about
to start working on, and I have a few questions.

1. If I use an external log initially and decide that it was a
mistake, is there a way to move back to the internal log without
rebuilding the entire pool?

2. What happens if the logging device fails completely? Does this
damage anything else in the pool, other than potentially losing
in-flight transactions?

3. What about corruption in the log? Is it checksummed like the rest of ZFS?

Thanks.


Scott
Scott Laird wrote:
> I'm debating using an external intent log on a new box that I'm about
> to start working on, and I have a few questions.
>
> 1. If I use an external log initially and decide that it was a
> mistake, is there a way to move back to the internal log without
> rebuilding the entire pool?

It's not currently possible to remove a separate log.
This was working once, but was stripped out until the
more generic zpool remove devices was provided.
This is bug 6574286:

http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

> 2. What happens if the logging device fails completely? Does this
> damage anything else in the pool, other than potentially losing
> in-flight transactions?

This should work. It shouldn't even lose the in-flight transactions.
ZFS reverts to using the main pool if a slog write fails or the
slog fills up.

> 3. What about corruption in the log? Is it checksummed like the rest of ZFS?

Yes, it's checksummed, but the checksumming is a bit different
from the pool blocks in the uberblock tree.

See also:
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

> Thanks.
>
> Scott
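For reference, a minimal sketch of how a separate intent log is specified
at pool creation time; the device names here are hypothetical:

    # c1t0d0/c1t1d0 are the main pool disks; c2t0d0 is the fast device
    # used as the separate intent log (slog).
    zpool create tank mirror c1t0d0 c1t1d0 log c2t0d0

    # The log shows up under its own "logs" section in the status output.
    zpool status tank

Until bug 6574286 is fixed there is no corresponding "zpool remove" for
the log device, so the only way back to an internal-only log is to
recreate the pool.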
On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
>
> Scott Laird wrote:
> > I'm debating using an external intent log on a new box that I'm about
> > to start working on, and I have a few questions.
> >
> > 1. If I use an external log initially and decide that it was a
> > mistake, is there a way to move back to the internal log without
> > rebuilding the entire pool?
>
> It's not currently possible to remove a separate log.
> This was working once, but was stripped out until the
> more generic zpool remove devices was provided.
> This is bug 6574286:
>
> http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

Okay, so hopefully it'll work in a couple quarters?

> > 2. What happens if the logging device fails completely? Does this
> > damage anything else in the pool, other than potentially losing
> > in-flight transactions?
>
> This should work. It shouldn't even lose the in-flight transactions.
> ZFS reverts to using the main pool if a slog write fails or the
> slog fills up.

So the only way to lose transactions would be a crash or power loss,
leaving outstanding transactions in the log, followed by the log
device failing to start up on reboot? I assume that would be handled
relatively cleanly (files have out-of-date data), as opposed to
something nasty like the pool failing to start up.

> > 3. What about corruption in the log? Is it checksummed like the rest of ZFS?
>
> Yes, it's checksummed, but the checksumming is a bit different
> from the pool blocks in the uberblock tree.
>
> See also:
> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

That started this whole mess :-). I'd like to try out one of the
Gigabyte SATA ramdisk cards that are discussed in the comments. It
supposedly has 18 hours of battery life, so a long-term power outage
would kill the log. I could reasonably expect one 18+ hour power
outage over the life of the filesystem. I'm fine with losing in-flight
data (I'd expect the log to be replayed before the UPS shuts the
system down anyway), but I'd rather not lose the whole pool or
something extreme like that.

I'm willing to trade the chance of some transaction losses during an
exceptional event for more performance, but I'd rather not have to
pull out the backups if I can ever avoid it.


Scott
Scott Laird wrote:
> On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
>>
>> Scott Laird wrote:
>>> I'm debating using an external intent log on a new box that I'm about
>>> to start working on, and I have a few questions.
>>>
>>> 1. If I use an external log initially and decide that it was a
>>> mistake, is there a way to move back to the internal log without
>>> rebuilding the entire pool?
>> It's not currently possible to remove a separate log.
>> This was working once, but was stripped out until the
>> more generic zpool remove devices was provided.
>> This is bug 6574286:
>>
>> http://bugs.opensolaris.org/view_bug.do?bug_id=6574286
>
> Okay, so hopefully it'll work in a couple quarters?

It's not being worked on currently, but hopefully will be fixed in 6 months.

>>> 2. What happens if the logging device fails completely? Does this
>>> damage anything else in the pool, other than potentially losing
>>> in-flight transactions?
>> This should work. It shouldn't even lose the in-flight transactions.
>> ZFS reverts to using the main pool if a slog write fails or the
>> slog fills up.
>
> So the only way to lose transactions would be a crash or power loss,
> leaving outstanding transactions in the log, followed by the log
> device failing to start up on reboot? I assume that would be handled
> relatively cleanly (files have out-of-date data), as opposed to
> something nasty like the pool failing to start up.

I just checked on the behaviour of this. The log is treated as part
of the main pool. If it is not replicated and disappears, then the pool
can't be opened - just like any unreplicated device in the main pool.
If the slog is found but can't be opened or is corrupted, then the
pool will be opened but the slog isn't used.
This seems a bit inconsistent.

>>> 3. What about corruption in the log? Is it checksummed like the rest of ZFS?
>> Yes, it's checksummed, but the checksumming is a bit different
>> from the pool blocks in the uberblock tree.
>>
>> See also:
>> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
>
> That started this whole mess :-). I'd like to try out one of the
> Gigabyte SATA ramdisk cards that are discussed in the comments.

A while ago there was a comment on this alias that these cards
weren't purchasable. Unfortunately, I don't know what is available.

> It supposedly has 18 hours of battery life, so a long-term power
> outage would kill the log. I could reasonably expect one 18+ hour
> power outage over the life of the filesystem. I'm fine with losing
> in-flight data (I'd expect the log to be replayed before the UPS shuts
> the system down anyway), but I'd rather not lose the whole pool or
> something extreme like that.
>
> I'm willing to trade the chance of some transaction losses during an
> exceptional event for more performance, but I'd rather not have to
> pull out the backups if I can ever avoid it.
>
> Scott
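A minimal sketch of the mirrored-slog layout that avoids the
unreplicated-log case described above; the device names are hypothetical:

    # Two main pool disks plus two small fast devices mirrored as the
    # separate intent log, so the loss of one slog device does not
    # prevent the pool from being opened.
    zpool create tank mirror c1t0d0 c1t1d0 log mirror c2t0d0 c3t0d0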
On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
> > So the only way to lose transactions would be a crash or power loss,
> > leaving outstanding transactions in the log, followed by the log
> > device failing to start up on reboot? I assume that would be handled
> > relatively cleanly (files have out-of-date data), as opposed to
> > something nasty like the pool failing to start up.
>
> I just checked on the behaviour of this. The log is treated as part
> of the main pool. If it is not replicated and disappears, then the pool
> can't be opened - just like any unreplicated device in the main pool.
> If the slog is found but can't be opened or is corrupted, then the
> pool will be opened but the slog isn't used.
> This seems a bit inconsistent.

Hmm, yeah. What would happen if I mirrored the ramdisk with a hard
drive? Would ZFS block until the data's stable on both devices, or
would it continue once the write is complete on the ramdisk?

Failing that, would replacing the missing log with a blank device let
me bring the pool back up, or would it be dead at that point?

> >>> 3. What about corruption in the log? Is it checksummed like the rest of ZFS?
> >> Yes, it's checksummed, but the checksumming is a bit different
> >> from the pool blocks in the uberblock tree.
> >>
> >> See also:
> >> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
> >
> > That started this whole mess :-). I'd like to try out one of the
> > Gigabyte SATA ramdisk cards that are discussed in the comments.
>
> A while ago there was a comment on this alias that these cards
> weren't purchasable. Unfortunately, I don't know what is available.

The umem one is unavailable, but the Gigabyte model is easy to find.
I had Amazon overnight one to me; it's probably sitting at home right
now.


Scott
On Thu, Oct 18, 2007 at 02:29:27PM -0600, Neil Perrin wrote:
>
> > So the only way to lose transactions would be a crash or power loss,
> > leaving outstanding transactions in the log, followed by the log
> > device failing to start up on reboot? I assume that would be handled
> > relatively cleanly (files have out-of-date data), as opposed to
> > something nasty like the pool failing to start up.
>
> I just checked on the behaviour of this. The log is treated as part
> of the main pool. If it is not replicated and disappears, then the pool
> can't be opened - just like any unreplicated device in the main pool.
> If the slog is found but can't be opened or is corrupted, then the
> pool will be opened but the slog isn't used.
> This seems a bit inconsistent.
>

It's worth noting that this is a generic problem. In the world of
metadata replication (ditto blocks), even an unreplicated normal device
does not necessarily render a pool completely faulted. The code needs
to be modified across the board so that the root vdev never ends up in
the FAULTED state, and then pool health is based solely on the ability
to read some basic piece of information, such as a successful
dsl_pool_open(). From the looks of things, this will "just work" if we
get rid of the too_many_errors() call and associated code in
vdev_root.c, but I'm sure there would be some odd edge conditions.

- Eric

--
Eric Schrock, Solaris Kernel Development
http://blogs.sun.com/eschrock
Scott Laird wrote:
> On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
>>> So the only way to lose transactions would be a crash or power loss,
>>> leaving outstanding transactions in the log, followed by the log
>>> device failing to start up on reboot? I assume that would be handled
>>> relatively cleanly (files have out-of-date data), as opposed to
>>> something nasty like the pool failing to start up.
>> I just checked on the behaviour of this. The log is treated as part
>> of the main pool. If it is not replicated and disappears, then the pool
>> can't be opened - just like any unreplicated device in the main pool.
>> If the slog is found but can't be opened or is corrupted, then the
>> pool will be opened but the slog isn't used.
>> This seems a bit inconsistent.
>
> Hmm, yeah. What would happen if I mirrored the ramdisk with a hard
> drive? Would ZFS block until the data's stable on both devices, or
> would it continue once the write is complete on the ramdisk?

ZFS ensures all mirror sides have the data before returning.

> Failing that, would replacing the missing log with a blank device let
> me bring the pool back up, or would it be dead at that point?

Replacing the device would work:

: mull ; mkfile 100m /p1 /p2
: mull ; zpool create whirl /p1 log /p2
: mull ; echo abc > /whirl/f
: mull ; sync
: mull ; rm /p2
: mull ; sync
<reset system>
: mull ; zpool status
  pool: whirl
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-3C
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       UNAVAIL      0     0     0  insufficient replicas
          /p1       ONLINE       0     0     0
        logs        UNAVAIL      0     0     0  insufficient replicas
          /p2       UNAVAIL      0     0     0  cannot open
: mull ; mkfile 100m /p2 /p3
: mull ; zpool online whirl /p2
warning: device '/p2' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
: mull ; zpool status
  pool: whirl
 state: ONLINE
status: One or more devices could not be used because the label is
        missing or invalid.  Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       ONLINE       0     0     0
          /p1       ONLINE       0     0     0
        logs        ONLINE       0     0     0
          /p2       UNAVAIL      0     0     0  corrupted data

errors: No known data errors
: mull ; zpool replace whirl /p2 /p3
: mull ; zpool status
  pool: whirl
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Oct 18 18:16:39 2007
config:

        NAME          STATE     READ WRITE CKSUM
        whirl         ONLINE       0     0     0
          /p1         ONLINE       0     0     0
        logs          ONLINE       0     0     0
          replacing   ONLINE       0     0     0
            /p2       UNAVAIL      0     0     0  corrupted data
            /p3       ONLINE       0     0     0

errors: No known data errors
: mull ; zpool status
  pool: whirl
 state: ONLINE
 scrub: resilver completed with 0 errors on Thu Oct 18 18:16:39 2007
config:

        NAME        STATE     READ WRITE CKSUM
        whirl       ONLINE       0     0     0
          /p1       ONLINE       0     0     0
        logs        ONLINE       0     0     0
          /p3       ONLINE       0     0     0

errors: No known data errors
: mull ; zfs mount
: mull ; zfs mount -a
: mull ; cat /whirl/f
abc
: mull ;

>>>>> 3. What about corruption in the log? Is it checksummed like the rest of ZFS?
>>>> Yes, it's checksummed, but the checksumming is a bit different
>>>> from the pool blocks in the uberblock tree.
>>>>
>>>> See also:
>>>> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
>>> That started this whole mess :-). I'd like to try out one of the
>>> Gigabyte SATA ramdisk cards that are discussed in the comments.
>> A while ago there was a comment on this alias that these cards
>> weren't purchasable. Unfortunately, I don't know what is available.
>
> The umem one is unavailable, but the Gigabyte model is easy to find.
> I had Amazon overnight one to me; it's probably sitting at home right
> now.

Cool, let us know how it goes.

Neil.
On 10/18/07, Neil Perrin <Neil.Perrin at sun.com> wrote:
> > The umem one is unavailable, but the Gigabyte model is easy to find.
> > I had Amazon overnight one to me; it's probably sitting at home right
> > now.
>
> Cool, let us know how it goes.

Not so well. I was completely unable to get the card to work at all.

The motherboard's BIOS wouldn't even list the GC-RAMDISK during the
bus scan. Solaris saw it, but couldn't talk to it:

Oct 20 12:50:54 fs2 ahci: [ID 632458 kern.warning] WARNING:
ahci_port_reset: port 1 the device hardware has been initialized and
the power-up diagnostics failed

The Supermicro 8-port SATA card's BIOS saw it, but Solaris reported
errors at boot time:

Oct 20 12:06:00 fs2 marvell88sx: [ID 748163 kern.warning] WARNING:
marvell88sx0: device on port 5 still busy after reset

I tried using it with the motherboard's Marvell-based eSATA ports, but
that made the POST hang for a minute or two and Solaris spewed errors
all over the console after boot.

I'm sending it back.


Scott
On 10/22/07, Scott Laird <scott at sigkill.org> wrote:
> Oct 20 12:50:54 fs2 ahci: [ID 632458 kern.warning] WARNING:
> ahci_port_reset: port 1 the device hardware has been initialized and
> the power-up diagnostics failed

IIRC, Gigabyte cheaped out and didn't implement SMART. Thus, everything
that tries to keep track of this marks the drives as failed.

> I'm sending it back.

Good plan.

Will
>> This should work. It shouldn't even lose the in-flight transactions.
>> ZFS reverts to using the main pool if a slog write fails or the
>> slog fills up.
>
> So the only way to lose transactions would be a crash or power loss,
> leaving outstanding transactions in the log, followed by the log
> device failing to start up on reboot? I assume that would be handled
> relatively cleanly (files have out-of-date data), as opposed to
> something nasty like the pool failing to start up.

It's just data loss from the zpool's perspective. However, it is loss
of data that applications had committed, so applications that relied on
committing data for their own consistency might end up with a corrupted
view of the world. NFS clients fall in this bin.

Mirroring the NVRAM cards in the Separate Intent log seems like a very
good idea.

-r
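As a sketch, with hypothetical device names standing in for the two
NVRAM cards, a mirrored log can be added to an existing pool like this:

    # c4t0d0 and c5t0d0 are the two NVRAM cards; adding them as a
    # mirrored log vdev keeps the slog available if one card fails.
    zpool add tank log mirror c4t0d0 c5t0d0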