StorageConcepts
2010-Aug-23 16:41 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
Hello,

we are currently extensively testing the DDRX1 drive for ZIL and we are going through all the corner cases.

The headline above all our tests is "do we still need to mirror the ZIL" given all the current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't export the pool, and with the latest upstream you can also import a pool with a missing ZIL)? This question is especially interesting with RAM-based devices, because they don't wear out, have a very low bit error rate and use one PCIx slot - which are rare. Price is another aspect here :)

During our tests we found some strange behaviour of ZFS ZIL failures which is not device related, and we are looking for help from the ZFS gurus here :)

The test in question is called "offline ZIL corruption". The question is: what happens if my ZIL data is corrupted while a server is transported or moved and not properly shut down? For this we do:

- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS and create a pool and add the ZIL
- ProductOS: Issue synchronous I/O with an increasing TNX number (and print the latest committed transaction)
- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL ..., ~300 MB from the start of the disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the transaction number against the one recorded

We ran the test, and it seems that with modern snv_134 the pool comes up after the corruption with everything reported as OK, while ~10000 transactions (some seconds of writes with the DDRX1) are missing and nobody knows about it. We ran a scrub, and the scrub does not even detect this. ZFS automatically repairs the labels on the ZIL, but no error is reported about the missing data.

While it is clear to us that if we do not have a mirrored ZIL, the data we overwrote in the ZIL is lost, we are really wondering why ZFS does not REPORT this corruption instead of silently ignoring it.

Is this a bug or ... ahem ... a feature? :)

Regards,
Robert
--
This message posted from opensolaris.org
This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design means the minimum number of writes needed to add an intent log record is just one write.

So corruption of an intent log is not going to generate any errors.

Neil.

On 08/23/10 10:41, StorageConcepts wrote:
> [original report quoted in full; trimmed]
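To make the failure mode concrete, here is a toy, compilable model of the chain walk Neil describes. All names, layouts and helpers below are invented for illustration and are not the real zil.c code; the point is only that a readable-but-overwritten block terminates replay without producing any error.

    /*
     * Toy model of the end-of-chain convention described above.  All
     * names, layouts and helpers are invented for this sketch; they are
     * not the real zil.c data structures.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define NBLKS 4

    typedef struct log_block {
        int      lb_next;          /* index of the next log block, -1 = none */
        uint64_t lb_cksum;         /* checksum embedded in the same block    */
        uint64_t lb_payload;       /* stand-in for the log records           */
    } log_block_t;

    static log_block_t slog[NBLKS];    /* stand-in for the slog device */

    /* Trivial stand-in checksum over the payload and next pointer. */
    static uint64_t
    cksum(const log_block_t *lb)
    {
        return (lb->lb_payload * 2654435761ULL ^ (uint64_t)(lb->lb_next + 1));
    }

    /*
     * Walk the chain: an unreadable block would be an I/O error, but a
     * checksum mismatch is simply treated as the end of the intent log.
     */
    static void
    zil_walk(void)
    {
        int i = 0, replayed = 0;

        while (i >= 0) {
            if (cksum(&slog[i]) != slog[i].lb_cksum)
                break;                       /* assumed end of chain, no error */
            replayed++;
            i = slog[i].lb_next;
        }
        printf("replayed %d log blocks\n", replayed);
    }

    int
    main(void)
    {
        /* Build a 4-block chain. */
        for (int i = 0; i < NBLKS; i++) {
            slog[i].lb_payload = 100 + i;
            slog[i].lb_next = (i < NBLKS - 1) ? i + 1 : -1;
            slog[i].lb_cksum = cksum(&slog[i]);
        }
        zil_walk();                          /* replays 4 blocks */

        /* "Offline corruption": overwrite block 1, as the dd test does. */
        slog[1].lb_payload = 0xdeadbeef;
        zil_walk();                          /* replays only 1 block, no error */
        return (0);
    }

Running it prints "replayed 4 log blocks" and then, after the simulated overwrite of block 1, "replayed 1 log blocks" - with nothing to indicate that anything was lost, which is exactly the behaviour Robert observed.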
Does that mean that when the beginning of the intent log chain gets corrupted, all other intent log data after the corrupted area is lost, because the checksum of the first corrupted block doesn't match?

Regards,
Markus

Neil Perrin <neil.perrin at oracle.com> wrote on 23 August 2010 at 19:44:
> This is a consequence of the design for performance of the ZIL code.
> [...]
> So corruption of an intent log is not going to generate any errors.
>
> Neil.

--------------------------------------------------
StorageConcepts Europe GmbH
Storage: Consulting. Implementation. Support.

Markus Keil             keil at storageconcepts.de
                        http://www.storageconcepts.de
Wiener Straße 114-116   Telefon: +49 (351) 8 76 92-21
01219 Dresden           Telefax: +49 (351) 8 76 92-99
Commercial Register Dresden, HRB 28281
Managing Directors: Robert Heinzmann, Gerd Jelinek
--------------------------------------------------
On 08/23/10 13:12, Markus Keil wrote:
> Does that mean that when the beginning of the intent log chain gets corrupted,
> all other intent log data after the corrupted area is lost, because the
> checksum of the first corrupted block doesn't match?

- Yes, but you wouldn't want to replay the following entries anyway, in case the log records in the missing log block were important (e.g. a file create). Mirroring the slogs is recommended to minimise concerns about slog corruption.

> Regards,
> Markus
Edward Ned Harvey
2010-Aug-26 02:33 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Neil Perrin
>
> This is a consequence of the design for performance of the ZIL code.
> Intent log blocks are dynamically allocated and chained together.
> When reading the intent log we read each block and checksum it
> with the embedded checksum within the same block. If we can't read
> a block due to an IO error then that is reported, but if the checksum
> does not match then we assume it's the end of the intent log chain.
> Using this design means the minimum number of writes needed to add
> an intent log record is just one write.
>
> So corruption of an intent log is not going to generate any errors.

I didn't know that. Very interesting. This raises another question ...

It's commonly stated that, even with log device removal supported, the most common failure mode for an SSD is to blindly write without reporting any errors, and to only detect that the device has failed upon read. So ... if an SSD is in this failure mode, you won't detect it? At bootup, the checksum will simply mismatch, and we'll chug along forward, having lost the data ... (nothing can prevent that) ... but we don't know that we've lost data?

Worse yet ... in preparation for the above SSD failure mode, it's commonly recommended to still mirror your log device, even if you have log device removal. If you have a mirror, and the data on each half of the mirror doesn't match (one device failed, and the other device is good) ... do you read the data from *both* sides of the mirror, in order to discover the corrupted log device and correctly move forward without data loss?
On 08/25/10 20:33, Edward Ned Harvey wrote:
> It's commonly stated that, even with log device removal supported, the most
> common failure mode for an SSD is to blindly write without reporting any
> errors, and to only detect that the device has failed upon read. So ... if an
> SSD is in this failure mode, you won't detect it? At bootup, the checksum
> will simply mismatch, and we'll chug along forward, having lost the data ...
> (nothing can prevent that) ... but we don't know that we've lost data?

- Indeed, we wouldn't know we lost data.

> Worse yet ... in preparation for the above SSD failure mode, it's commonly
> recommended to still mirror your log device, even if you have log device
> removal. If you have a mirror, and the data on each half of the mirror
> doesn't match (one device failed, and the other device is good) ... do you
> read the data from *both* sides of the mirror, in order to discover the
> corrupted log device and correctly move forward without data loss?

Hmm, I need to check, but if we get a checksum mismatch then I don't think we try the other mirror(s). This is automatic for the 'main pool', but of course the ZIL code is different by necessity. This problem can of course be fixed. (It will be a week and a bit before I can report back on this, as I'm on vacation.)

Neil.
StorageConcepts
2010-Aug-26 06:40 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
Hello,

actually this is bad news. I always assumed that the mirror redundancy of the ZIL can also be used to handle bad blocks on the ZIL device (just as the main-pool self healing does for data blocks).

I don't actually know how SSDs "die", but because of the "wear out" characteristics I can imagine an increased number of bad blocks / bit errors at the EOL of such a device - probably undiscovered. Because the ZIL is write-only, you only know whether it worked when you need it - which is bad. So my suggestion was always to run with one ZIL device during pre-production, and add the ZIL mirror two weeks later when production starts. This way they don't age exactly the same, and zil2 has two more weeks of expected lifetime (or even more, assuming the usual heavier writes during stress testing). I would call this pre-aging. However, if the second ZIL device is not used to recover from bad blocks, this does not make a lot of sense.

So I would say there are 2 bugs / missing features in this:

1) the ZIL needs to report truncated transactions on ZIL corruption
2) the ZIL should use its mirrored counterpart to recover from bad block checksums

Now with OpenSolaris being closed by Oracle and Illumos being just started, I don't know how to handle bug openings :) - is bugs.opensolaris.org still maintained?

Regards,
Robert
--
This message posted from opensolaris.org
Edward Ned Harvey
2010-Aug-26 13:14 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
> From: Neil Perrin [mailto:neil.perrin at oracle.com]
>
> Hmm, I need to check, but if we get a checksum mismatch then I don't
> think we try the other mirror(s). This is automatic for the 'main pool',
> but of course the ZIL code is different by necessity. This problem can
> of course be fixed. (It will be a week and a bit before I can report
> back on this, as I'm on vacation.)

Thanks... If indeed that is the behavior, then I would conclude:

* Call it a bug. It needs a bug fix.
* Prior to log device removal (zpool version 19) it is critical to mirror the log device.
* After the introduction of log device removal, and before this bug fix is available, it is pointless to mirror log devices.
* After this bug fix is introduced, it is again recommended to mirror slogs.
Edward Ned Harvey
2010-Aug-26 13:17 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of StorageConcepts
>
> So I would say there are 2 bugs / missing features in this:
>
> 1) the ZIL needs to report truncated transactions on ZIL corruption
> 2) the ZIL should use its mirrored counterpart to recover from bad block checksums

Add to that: during scrubs, perform some reads on log devices (even if there's nothing to read). In fact, during scrubs, perform some reads on every device (even if it's actually empty).
On Aug 26, 2010, at 9:14 AM, Edward Ned Harvey wrote:
> * After the introduction of log device removal, and before this bug fix is
> available, it is pointless to mirror log devices.

That's a bit of an overstatement. Mirrored logs protect against a wide variety of failure modes. Neil just isn't sure whether it does the right thing for checksum errors. That is a very small subset of the possible device failure modes.

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote:
>
> 1) the ZIL needs to report truncated transactions on ZIL corruption

As Neil outlined, this isn't possible while preserving current ZIL performance. There is no way to distinguish the "last" ZIL block without incurring additional writes for every block. If it's even possible to implement this "paranoid ZIL" tunable, are you willing to take a 2-5x performance hit to be able to detect this failure mode?

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum is detected, it is taken to be the end of the log, but this rather defeats the checksum's original purpose, which is to detect device failure. Thus we would first need to change this behavior so that the checksum is only used for failure detection. This leaves the question of how to detect the end of the log, which I think could be done by using a monotonically incrementing counter on the ZIL entries. Once we find an entry where the counter != n+1, then we know we have reached the end of the sequence.

Now that we can use checksums to detect device failure, it would be possible to implement a ZIL scrub, allowing an environment to detect ZIL device degradation before it actually results in a catastrophe.

--
Saso

On 08/26/2010 03:22 PM, Eric Schrock wrote:
> As Neil outlined, this isn't possible while preserving current ZIL
> performance. There is no way to distinguish the "last" ZIL block without
> incurring additional writes for every block. If it's even possible to
> implement this "paranoid ZIL" tunable, are you willing to take a 2-5x
> performance hit to be able to detect this failure mode?
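A minimal sketch of the separation Saso proposes, with invented names; slog_read(), checksum(), replay() and report_corruption() are assumed placeholder helpers, not real ZFS interfaces. The sequence number is used to find the end of the log, and the checksum is reserved for corruption detection.

    /*
     * Sketch of the proposal: a monotonically increasing per-block
     * sequence number marks the end of the log, so a checksum mismatch
     * can be reported as corruption instead of being treated as "end".
     * Names, layout and helpers are illustrative assumptions only.
     */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct log_block {
        uint64_t lb_seq;        /* monotonically increasing counter   */
        uint64_t lb_cksum;      /* embedded checksum over the payload */
        uint64_t lb_payload[62];
    } log_block_t;

    /* Assumed helpers for the sketch. */
    extern bool     slog_read(uint64_t idx, log_block_t *lb);  /* false = I/O error */
    extern uint64_t checksum(const uint64_t *payload, int nwords);
    extern void     replay(const log_block_t *lb);
    extern void     report_corruption(uint64_t idx);

    typedef enum { WALK_OK, WALK_CORRUPT, WALK_IOERR } walk_result_t;

    walk_result_t
    zil_walk_with_seq(uint64_t head_idx)
    {
        log_block_t lb;
        uint64_t idx = head_idx;
        uint64_t expect = 0;
        bool first = true;

        for (;;) {
            /* Blocks assumed laid out consecutively for this sketch. */
            if (!slog_read(idx, &lb))
                return (WALK_IOERR);
            if (first) {
                expect = lb.lb_seq;      /* baseline from the head block       */
                first = false;
            }
            if (lb.lb_seq != expect)
                return (WALK_OK);        /* sequence gap: genuine end of log   */
            if (checksum(lb.lb_payload, 62) != lb.lb_cksum) {
                report_corruption(idx);  /* checksum now means corruption only */
                return (WALK_CORRUPT);
            }
            replay(&lb);
            idx++;
            expect++;
        }
    }

Note that this still cannot distinguish a block in the middle of the chain whose sequence field was destroyed along with everything else from a genuine end of log - the sequence simply stops matching - which is essentially Eric's objection; Robert's suggestion below of recording the expected final transaction number in the uberblock is one way to close that gap.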
StorageConcepts
2010-Aug-26 14:15 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
Actually - I can't read ZFS code, so the next assumptions are more or less based on brainware - excuse me in advance :)

How does ZFS detect "up to date" ZILs? With the tnx check of the uberblock - right? In our corruption case, we had 2 valid uberblocks at the end and ZFS used those to import the pool; this is what the end uberblock is for. OK, so the uberblock contains the pointer to the start of the ZIL chain - right?

Assume we add the tnx number of the current transaction this ZIL is part of to the blocks written to the ZIL (specially packaged ZIL blocks). The ZIL blocks then become a little bit bigger than the data blocks, but the transaction count stays the same. OK, for SSDs block alignment might be an issue ... agreed. For DRAM-based ZILs this is not a problem - except for bandwidth.

Logic: on ZIL import, check:

- If the pointer to the ZIL chain is empty:
  if yes -> clean pool
  if not -> we need to replay
- Now if the block the root pointer points to is OK (checksum), the ZIL is used and replayed. At the end, the tnx of the last ZIL block must equal the pool tnx. If equal, then OK; if not, report an error about missing ZIL parts and switch to the mirror (if available).

> As Neil outlined, this isn't possible while
> preserving current ZIL performance. There is no way
> to distinguish the "last" ZIL block without incurring
> additional writes for every block. If it's even
> possible to implement this "paranoid ZIL" tunable,
> are you willing to take a 2-5x performance hit to be
> able to detect this failure mode?

Robert
--
This message posted from opensolaris.org
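A pseudocode-level sketch of the import-time check Robert is proposing; everything here (the names, the structure, and the idea that the uberblock records the expected tnx) is an illustration of the proposal, not how zil_claim()/zil_replay() actually work.

    /*
     * Sketch of the proposed check: every log block carries the tnx it
     * belongs to, and at import time the last replayable block's tnx is
     * compared with the tnx the uberblock claims.  All names and helpers
     * are assumptions made for illustration only.
     */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct zil_check {
        uint64_t zc_head;          /* first log block, 0 = empty chain   */
        uint64_t zc_pool_tnx;      /* last tnx recorded in the uberblock */
    } zil_check_t;

    /* Assumed helper: walk one slog side, return tnx of its last valid block. */
    extern bool zil_walk_side(int side, uint64_t head, uint64_t *last_tnx);
    extern void zfs_report_error(const char *msg);

    bool
    zil_import_check(const zil_check_t *zc, int nsides)
    {
        uint64_t last_tnx;

        if (zc->zc_head == 0)
            return (true);                 /* clean pool, nothing to replay */

        for (int side = 0; side < nsides; side++) {
            if (!zil_walk_side(side, zc->zc_head, &last_tnx))
                continue;                  /* unreadable side, try the mirror */
            if (last_tnx == zc->zc_pool_tnx)
                return (true);             /* full chain replayed from this side */
        }
        zfs_report_error("intent log truncated: some committed "
            "synchronous writes were lost");
        return (false);
    }

The point of the sketch is that the extra information rides inside blocks that are written anyway - they just get slightly bigger - which is Robert's answer to Eric's "additional writes for every block" objection.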
Darren J Moffat
2010-Aug-26 14:31 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
On 26/08/2010 15:08, Saso Kiselkov wrote:
> If I might add my $0.02: it appears that the ZIL is implemented as a
> kind of circular log buffer. As I understand it, when a corrupt checksum

It is NOT circular, since that implies a limited number of entries that get overwritten.

> is detected, it is taken to be the end of the log, but this rather
> defeats the checksum's original purpose, which is to detect device
> failure. Thus we would first need to change this behavior so that the
> checksum is only used for failure detection. This leaves the question of
> how to detect the end of the log, which I think could be done by using a
> monotonically incrementing counter on the ZIL entries. Once we find an
> entry where the counter != n+1, then we know we have reached the end of
> the sequence.

See the comment part way down zil_read_log_block about how we do something pretty much like that for checking the chain of log blocks:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block

This is the checksum in the BP checksum field.

But before we even got there we checked the ZILOG2 checksum as part of doing the zio (in the zio_checksum_verify() stage):

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error

A ZILOG2 checksum is an embedded-in-the-block (at the start; the original ZILOG put it at the end) version of fletcher4. If that failed - i.e. the block was corrupt - we would have returned an error back through the dsl_read() of the log block.

--
Darren J Moffat
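For readers who don't want to open the source links, a rough picture of the layout difference Darren mentions. The structs below are simplified stand-ins and do not match the real zil.h/zio.h definitions (the real next-block pointer is a full blkptr_t and the checksum is a 256-bit fletcher4).

    /*
     * Illustrative stand-ins only: ZILOG2 embeds its chaining/checksum
     * header at the start of each log block, the original ZILOG format
     * kept a trailer at the end.
     */
    #include <stdint.h>

    #define LOG_BLOCK_SIZE  4096            /* illustrative block size only */

    typedef struct next_ptr_sim {           /* stand-in for blkptr_t */
        uint64_t np_words[16];
    } next_ptr_sim_t;

    typedef struct cksum_sim {              /* stand-in for zio_cksum_t */
        uint64_t zc_word[4];
    } cksum_sim_t;

    /* ZILOG2-style: self-describing header first, records after it. */
    typedef struct zilog2_block {
        next_ptr_sim_t lb_next_blk;         /* chain to the next log block  */
        uint64_t       lb_nused;            /* bytes of records in use      */
        cksum_sim_t    lb_cksum;            /* embedded fletcher4 checksum  */
        uint8_t        lb_records[LOG_BLOCK_SIZE -
                           sizeof (next_ptr_sim_t) - sizeof (uint64_t) -
                           sizeof (cksum_sim_t)];
    } zilog2_block_t;

    /* Original ZILOG-style: records first, trailer with the checksum last. */
    typedef struct zilog_block {
        uint8_t        lb_records[LOG_BLOCK_SIZE -
                           sizeof (next_ptr_sim_t) - sizeof (cksum_sim_t)];
        next_ptr_sim_t lb_next_blk;
        cksum_sim_t    lb_cksum;
    } zilog_block_t;

In both formats the checksum travels inside the log block itself, which is what makes the "checksum mismatch = end of chain" convention possible without any extra writes.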
On Wed, August 25, 2010 23:00, Neil Perrin wrote:
> On 08/25/10 20:33, Edward Ned Harvey wrote:
>
>> It's commonly stated that, even with log device removal supported, the
>> most common failure mode for an SSD is to blindly write without reporting
>> any errors, and to only detect that the device has failed upon read. So ...
>> if an SSD is in this failure mode, you won't detect it? At bootup, the
>> checksum will simply mismatch, and we'll chug along forward, having lost
>> the data ... (nothing can prevent that) ... but we don't know that we've
>> lost data?
>
> - Indeed, we wouldn't know we lost data.

Does a scrub go through the slog and/or L2ARC devices, or only the "primary" storage components?

If it doesn't go through these "secondary" devices, that may be a useful RFE, as one would ideally want to test the data on every component of a storage system.
Darren J Moffat
2010-Aug-26 14:48 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
On 26/08/2010 15:42, David Magda wrote:
> Does a scrub go through the slog and/or L2ARC devices, or only the
> "primary" storage components?

A scrub traverses datasets, including the ZIL, so the scrub will read (and if needed resilver) on a slog device too.

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu_traverse.c

A scrub does not traverse an L2ARC device, because we hold in-memory checksums (in the ARC header) for everything on the cache devices; if we get a checksum failure on a read, we remove the L2ARC cached entry and read from the main pool again. The L2ARC cache devices are purely caches: there is NEVER data on them that isn't already on the main pool devices.

--
Darren J Moffat
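A small sketch of the L2ARC read path Darren describes, with invented names; l2arc_read(), pool_read() and checksum() are assumed placeholder helpers, not the real arc.c interfaces. The checksum lives only in the in-memory ARC header, so a mismatch simply drops the cache entry and the read is satisfied from the main pool.

    /*
     * Illustrative sketch only: verify an L2ARC buffer against the
     * checksum kept in the in-memory ARC header, and fall back to the
     * main pool on any failure.
     */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct arc_hdr {
        uint64_t ah_l2_offset;     /* where the buffer sits on the cache device  */
        uint64_t ah_cksum;         /* checksum kept in memory, not on the device */
        bool     ah_in_l2;
    } arc_hdr_t;

    /* Assumed helpers. */
    extern bool     l2arc_read(uint64_t off, void *buf, size_t len);
    extern bool     pool_read(void *buf, size_t len);   /* authoritative copy */
    extern uint64_t checksum(const void *buf, size_t len);

    bool
    arc_read_buf(arc_hdr_t *hdr, void *buf, size_t len)
    {
        if (hdr->ah_in_l2 &&
            l2arc_read(hdr->ah_l2_offset, buf, len) &&
            checksum(buf, len) == hdr->ah_cksum)
            return (true);              /* cache hit, verified in memory   */

        hdr->ah_in_l2 = false;          /* drop the stale L2ARC entry      */
        return (pool_read(buf, len));   /* data is always in the pool too  */
    }

This is also why George notes below that the L2ARC is considered volatile: losing or corrupting the cache device can never lose data.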
I see, thank you for the clarification. So it would be possible to have something equivalent to main-storage self-healing on the ZIL, with a ZIL scrub to activate it. Or is that already implemented as well? (Sorry for asking these obvious questions, but I'm not familiar with the ZFS source code.)

--
Saso

On 08/26/2010 04:31 PM, Darren J Moffat wrote:
> See the comment part way down zil_read_log_block about how we do
> something pretty much like that for checking the chain of log blocks:
> [...]
> A ZILOG2 checksum is an embedded-in-the-block (at the start; the
> original ZILOG put it at the end) version of fletcher4. If that failed -
> i.e. the block was corrupt - we would have returned an error back through
> the dsl_read() of the log block.
Edward Ned Harvey wrote:
> It's commonly stated that, even with log device removal supported, the most
> common failure mode for an SSD is to blindly write without reporting any
> errors, and to only detect that the device has failed upon read. So ... if an
> SSD is in this failure mode, you won't detect it? At bootup, the checksum
> will simply mismatch, and we'll chug along forward, having lost the data ...
> (nothing can prevent that) ... but we don't know that we've lost data?

If the drive's firmware isn't returning any kind of write error then there isn't much that ZFS can really do here (regardless of whether this is an SSD or not). Turning every write into a read/write operation would totally defeat the purpose of the ZIL.

It's my understanding that SSDs will eventually transition to read-only devices once they've exhausted their spare reallocation blocks. This should propagate to the OS as an EIO, which means that ZFS will instead store the ZIL data on the main storage pool.

> Worse yet ... in preparation for the above SSD failure mode, it's commonly
> recommended to still mirror your log device, even if you have log device
> removal. If you have a mirror, and the data on each half of the mirror
> doesn't match (one device failed, and the other device is good) ... do you
> read the data from *both* sides of the mirror, in order to discover the
> corrupted log device and correctly move forward without data loss?

Yes, we read all sides of the mirror when we claim (i.e. read) the log blocks for a log device. This is exactly what a scrub would do for a mirrored data device.

- George
David Magda wrote:
> Does a scrub go through the slog and/or L2ARC devices, or only the
> "primary" storage components?

A scrub will go through slogs and primary storage devices. The L2ARC device is considered volatile, and data loss is not possible should it fail.

- George
Edward Ned Harvey wrote:
> Add to that: during scrubs, perform some reads on log devices (even if
> there's nothing to read).

We do read from log devices if there is data stored on them.

> In fact, during scrubs, perform some reads on every device (even if it's
> actually empty.)

Reading from the data portion of an empty device wouldn't really show us much, as we would be reading a bunch of non-checksummed data. The best we can do is to "probe" the device's label region to determine its health. This is exactly what we do today.

- George
Bob Friesenhahn
2010-Aug-27 15:01 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
On Thu, 26 Aug 2010, George Wilson wrote:
> A scrub will go through slogs and primary storage devices. The L2ARC device
> is considered volatile, and data loss is not possible should it fail.

What gets "scrubbed" in the slog? The slog contains transient data which exists for only seconds at a time. The slog is quite likely to be empty at any given point in time.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:
> What gets "scrubbed" in the slog? The slog contains transient data which
> exists for only seconds at a time. The slog is quite likely to be empty at
> any given point in time.

Yes, the typical ZIL block never lives long enough to be scrubbed, but if there are any blocks which have not been replayed (i.e. ZIL blocks for an unmounted filesystem) then those will get scrubbed.

- George