StorageConcepts
2010-Aug-23 16:41 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
Hello,

we are currently extensively testing the DDRX1 drive for ZIL and we are going through all the corner cases.

The headline above all our tests is "do we still need to mirror the ZIL" given all the current fixes in ZFS (ZFS can recover from ZIL failure as long as you don't export the pool, and with the latest upstream you can also import a pool with a missing ZIL)? This question is especially interesting with RAM-based devices, because they don't wear out, have a very low bit error rate and use one PCIx slot - which are rare. Price is another aspect here :)

During our tests we found some strange behaviour of ZFS ZIL failures which is not device related, and we are looking for help from the ZFS gurus here :)

The test in question is called "offline ZIL corruption". The question is: what happens if my ZIL data is corrupted while a server is transported or moved and not properly shut down? For this we do:

- Prepare 2 OS installations (ProductOS and CorruptOS)
- Boot ProductOS and create a pool and add the ZIL
- ProductOS: Issue synchronous I/O with an increasing TNX number (and print the latest committed transaction)
- ProductOS: Power off the server and record the last committed transaction
- Boot CorruptOS
- Write random data to the beginning of the ZIL (dd if=/dev/urandom of=ZIL ..., ~300 MB from the start of the disk, overwriting the first two disk labels)
- Boot ProductOS
- Verify that the data corruption is detected by checking the file with the transaction number against the one recorded

We ran the test, and it seems that with modern snv_134 the pool comes up after the corruption with everything reported as OK, while ~10000 transactions (some seconds of writes with the DDRX1) are missing and nobody knows about it. We ran a scrub, and the scrub does not even detect this. ZFS automatically repairs the labels on the ZIL, but no error is reported about the missing data.

While it is clear to us that if we do not have a mirrored ZIL, the data we overwrote in the ZIL is lost, we are really wondering why ZFS does not REPORT this corruption instead of silently ignoring it.

Is this a bug or ... ahem ... a feature? :)

Regards,
Robert
--
This message posted from opensolaris.org
This is a consequence of the design for performance of the ZIL code. Intent log blocks are dynamically allocated and chained together. When reading the intent log we read each block and checksum it with the embedded checksum within the same block. If we can't read a block due to an IO error then that is reported, but if the checksum does not match then we assume it's the end of the intent log chain. Using this design means the minimum number of writes needed to add an intent log record is just one write.

So corruption of an intent log is not going to generate any errors.

Neil.

On 08/23/10 10:41, StorageConcepts wrote:
> [original report quoted in full; trimmed]
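To make the failure mode concrete, here is a toy, compilable model of the chain walk Neil describes. All names, layouts and helpers below are invented for illustration and are not the real zil.c code; the point is only that a readable-but-overwritten block terminates replay without producing any error.

    /*
     * Toy model of the end-of-chain convention described above.  All
     * names, layouts and helpers are invented for this sketch; they are
     * not the real zil.c data structures.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define NBLKS 4

    typedef struct log_block {
        int      lb_next;          /* index of the next log block, -1 = none */
        uint64_t lb_cksum;         /* checksum embedded in the same block    */
        uint64_t lb_payload;       /* stand-in for the log records           */
    } log_block_t;

    static log_block_t slog[NBLKS];    /* stand-in for the slog device */

    /* Trivial stand-in checksum over the payload and next pointer. */
    static uint64_t
    cksum(const log_block_t *lb)
    {
        return (lb->lb_payload * 2654435761ULL ^ (uint64_t)(lb->lb_next + 1));
    }

    /*
     * Walk the chain: an unreadable block would be an I/O error, but a
     * checksum mismatch is simply treated as the end of the intent log.
     */
    static void
    zil_walk(void)
    {
        int i = 0, replayed = 0;

        while (i >= 0) {
            if (cksum(&slog[i]) != slog[i].lb_cksum)
                break;                       /* assumed end of chain, no error */
            replayed++;
            i = slog[i].lb_next;
        }
        printf("replayed %d log blocks\n", replayed);
    }

    int
    main(void)
    {
        /* Build a 4-block chain. */
        for (int i = 0; i < NBLKS; i++) {
            slog[i].lb_payload = 100 + i;
            slog[i].lb_next = (i < NBLKS - 1) ? i + 1 : -1;
            slog[i].lb_cksum = cksum(&slog[i]);
        }
        zil_walk();                          /* replays 4 blocks */

        /* "Offline corruption": overwrite block 1, as the dd test does. */
        slog[1].lb_payload = 0xdeadbeef;
        zil_walk();                          /* replays only 1 block, no error */
        return (0);
    }

Running it prints "replayed 4 log blocks" and then, after the simulated overwrite of block 1, "replayed 1 log blocks" - with nothing to indicate that anything was lost, which is exactly the behaviour Robert observed.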
Does that mean that when the beginning of the intent log chain gets corrupted, all other intent log data after the corrupted area is lost, because the checksum of the first corrupted block doesn't match?

Regards,
Markus

Neil Perrin <neil.perrin at oracle.com> wrote on 23 August 2010 at 19:44:
> This is a consequence of the design for performance of the ZIL code.
> [...]
> So corruption of an intent log is not going to generate any errors.
>
> Neil.

--------------------------------------------------
StorageConcepts Europe GmbH
Storage: Consulting. Implementation. Support.

Markus Keil             keil at storageconcepts.de
                        http://www.storageconcepts.de
Wiener Straße 114-116   Telefon: +49 (351) 8 76 92-21
01219 Dresden           Telefax: +49 (351) 8 76 92-99
Commercial Register Dresden, HRB 28281
Managing Directors: Robert Heinzmann, Gerd Jelinek
--------------------------------------------------
On 08/23/10 13:12, Markus Keil wrote:
> Does that mean that when the beginning of the intent log chain gets corrupted,
> all other intent log data after the corrupted area is lost, because the
> checksum of the first corrupted block doesn't match?

- Yes, but you wouldn't want to replay the following entries anyway, in case the log records in the missing log block were important (e.g. a file create). Mirroring the slogs is recommended to minimise concerns about slog corruption.

> Regards,
> Markus
Edward Ned Harvey
2010-Aug-26 02:33 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Neil Perrin
>
> This is a consequence of the design for performance of the ZIL code.
> Intent log blocks are dynamically allocated and chained together.
> When reading the intent log we read each block and checksum it
> with the embedded checksum within the same block. If we can't read
> a block due to an IO error then that is reported, but if the checksum
> does not match then we assume it's the end of the intent log chain.
> Using this design means the minimum number of writes needed to add
> an intent log record is just one write.
>
> So corruption of an intent log is not going to generate any errors.

I didn't know that. Very interesting. This raises another question ...

It's commonly stated that, even with log device removal supported, the most common failure mode for an SSD is to blindly write without reporting any errors, and to only detect that the device has failed upon read. So ... if an SSD is in this failure mode, you won't detect it? At bootup, the checksum will simply mismatch, and we'll chug along forward, having lost the data ... (nothing can prevent that) ... but we don't know that we've lost data?

Worse yet ... in preparation for the above SSD failure mode, it's commonly recommended to still mirror your log device, even if you have log device removal. If you have a mirror, and the data on each half of the mirror doesn't match (one device failed, and the other device is good) ... do you read the data from *both* sides of the mirror, in order to discover the corrupted log device and correctly move forward without data loss?
On 08/25/10 20:33, Edward Ned Harvey wrote:
> It's commonly stated that, even with log device removal supported, the most
> common failure mode for an SSD is to blindly write without reporting any
> errors, and to only detect that the device has failed upon read. So ... if an
> SSD is in this failure mode, you won't detect it? At bootup, the checksum
> will simply mismatch, and we'll chug along forward, having lost the data ...
> (nothing can prevent that) ... but we don't know that we've lost data?

- Indeed, we wouldn't know we lost data.

> Worse yet ... in preparation for the above SSD failure mode, it's commonly
> recommended to still mirror your log device, even if you have log device
> removal. If you have a mirror, and the data on each half of the mirror
> doesn't match (one device failed, and the other device is good) ... do you
> read the data from *both* sides of the mirror, in order to discover the
> corrupted log device and correctly move forward without data loss?

Hmm, I need to check, but if we get a checksum mismatch then I don't think we try the other mirror(s). This is automatic for the 'main pool', but of course the ZIL code is different by necessity. This problem can of course be fixed. (It will be a week and a bit before I can report back on this, as I'm on vacation.)

Neil.
StorageConcepts
2010-Aug-26 06:40 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
Hello,

actually this is bad news. I always assumed that the mirror redundancy of the ZIL can also be used to handle bad blocks on the ZIL device (just as the main-pool self healing does for data blocks).

I don't actually know how SSDs "die", but because of the "wear out" characteristics I can imagine an increased number of bad blocks / bit errors at the EOL of such a device - probably undiscovered. Because the ZIL is write-only, you only know whether it worked when you need it - which is bad. So my suggestion was always to run with one ZIL device during pre-production, and add the ZIL mirror two weeks later when production starts. This way they don't age exactly the same, and zil2 has two more weeks of expected lifetime (or even more, assuming the usual heavier writes during stress testing). I would call this pre-aging. However, if the second ZIL device is not used to recover from bad blocks, this does not make a lot of sense.

So I would say there are 2 bugs / missing features in this:

1) the ZIL needs to report truncated transactions on ZIL corruption
2) the ZIL should use its mirrored counterpart to recover from bad block checksums

Now with OpenSolaris being closed by Oracle and Illumos being just started, I don't know how to handle bug openings :) - is bugs.opensolaris.org still maintained?

Regards,
Robert
--
This message posted from opensolaris.org
Edward Ned Harvey
2010-Aug-26 13:14 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
> From: Neil Perrin [mailto:neil.perrin at oracle.com]
>
> Hmm, I need to check, but if we get a checksum mismatch then I don't
> think we try the other mirror(s). This is automatic for the 'main pool',
> but of course the ZIL code is different by necessity. This problem can
> of course be fixed. (It will be a week and a bit before I can report
> back on this, as I'm on vacation.)

Thanks... If indeed that is the behavior, then I would conclude:

* Call it a bug. It needs a bug fix.
* Prior to log device removal (zpool version 19) it is critical to mirror the log device.
* After the introduction of log device removal, and before this bug fix is available, it is pointless to mirror log devices.
* After this bug fix is introduced, it is again recommended to mirror slogs.
Edward Ned Harvey
2010-Aug-26 13:17 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of StorageConcepts
>
> So I would say there are 2 bugs / missing features in this:
>
> 1) the ZIL needs to report truncated transactions on ZIL corruption
> 2) the ZIL should use its mirrored counterpart to recover from bad block checksums

Add to that: during scrubs, perform some reads on log devices (even if there's nothing to read). In fact, during scrubs, perform some reads on every device (even if it's actually empty).
On Aug 26, 2010, at 9:14 AM, Edward Ned Harvey wrote:
> * After the introduction of log device removal, and before this bug fix is
> available, it is pointless to mirror log devices.

That's a bit of an overstatement. Mirrored logs protect against a wide variety of failure modes. Neil just isn't sure whether it does the right thing for checksum errors. That is a very small subset of the possible device failure modes.

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
On Aug 26, 2010, at 2:40 AM, StorageConcepts wrote:
>
> 1) the ZIL needs to report truncated transactions on ZIL corruption

As Neil outlined, this isn't possible while preserving current ZIL performance. There is no way to distinguish the "last" ZIL block without incurring additional writes for every block. If it's even possible to implement this "paranoid ZIL" tunable, are you willing to take a 2-5x performance hit to be able to detect this failure mode?

- Eric

--
Eric Schrock, Fishworks                    http://blogs.sun.com/eschrock
If I might add my $0.02: it appears that the ZIL is implemented as a kind of circular log buffer. As I understand it, when a corrupt checksum is detected, it is taken to be the end of the log, but this rather defeats the checksum's original purpose, which is to detect device failure. Thus we would first need to change this behavior so that the checksum is only used for failure detection. This leaves the question of how to detect the end of the log, which I think could be done by using a monotonically incrementing counter on the ZIL entries. Once we find an entry where the counter != n+1, then we know we have reached the end of the sequence.

Now that we can use checksums to detect device failure, it would be possible to implement a ZIL scrub, allowing an environment to detect ZIL device degradation before it actually results in a catastrophe.

--
Saso

On 08/26/2010 03:22 PM, Eric Schrock wrote:
> As Neil outlined, this isn't possible while preserving current ZIL
> performance. There is no way to distinguish the "last" ZIL block without
> incurring additional writes for every block. If it's even possible to
> implement this "paranoid ZIL" tunable, are you willing to take a 2-5x
> performance hit to be able to detect this failure mode?
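A minimal sketch of the separation Saso proposes, with invented names; slog_read(), checksum(), replay() and report_corruption() are assumed placeholder helpers, not real ZFS interfaces. The sequence number is used to find the end of the log, and the checksum is reserved for corruption detection.

    /*
     * Sketch of the proposal: a monotonically increasing per-block
     * sequence number marks the end of the log, so a checksum mismatch
     * can be reported as corruption instead of being treated as "end".
     * Names, layout and helpers are illustrative assumptions only.
     */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct log_block {
        uint64_t lb_seq;        /* monotonically increasing counter   */
        uint64_t lb_cksum;      /* embedded checksum over the payload */
        uint64_t lb_payload[62];
    } log_block_t;

    /* Assumed helpers for the sketch. */
    extern bool     slog_read(uint64_t idx, log_block_t *lb);  /* false = I/O error */
    extern uint64_t checksum(const uint64_t *payload, int nwords);
    extern void     replay(const log_block_t *lb);
    extern void     report_corruption(uint64_t idx);

    typedef enum { WALK_OK, WALK_CORRUPT, WALK_IOERR } walk_result_t;

    walk_result_t
    zil_walk_with_seq(uint64_t head_idx)
    {
        log_block_t lb;
        uint64_t idx = head_idx;
        uint64_t expect = 0;
        bool first = true;

        for (;;) {
            /* Blocks assumed laid out consecutively for this sketch. */
            if (!slog_read(idx, &lb))
                return (WALK_IOERR);
            if (first) {
                expect = lb.lb_seq;      /* baseline from the head block       */
                first = false;
            }
            if (lb.lb_seq != expect)
                return (WALK_OK);        /* sequence gap: genuine end of log   */
            if (checksum(lb.lb_payload, 62) != lb.lb_cksum) {
                report_corruption(idx);  /* checksum now means corruption only */
                return (WALK_CORRUPT);
            }
            replay(&lb);
            idx++;
            expect++;
        }
    }

Note that this still cannot distinguish a block in the middle of the chain whose sequence field was destroyed along with everything else from a genuine end of log - the sequence simply stops matching - which is essentially Eric's objection; Robert's suggestion below of recording the expected final transaction number in the uberblock is one way to close that gap.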
StorageConcepts
2010-Aug-26 14:15 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
Actually - I can't read ZFS code, so the next assumptions are more or less based on brainware - excuse me in advance :)

How does ZFS detect "up to date" ZILs? With the tnx check of the uberblock - right? In our corruption case, we had 2 valid uberblocks at the end and ZFS used those to import the pool; this is what the end uberblock is for. OK, so the uberblock contains the pointer to the start of the ZIL chain - right?

Assume we add the tnx number of the current transaction this ZIL is part of to the blocks written to the ZIL (specially packaged ZIL blocks). The ZIL blocks then become a little bit bigger than the data blocks, but the transaction count stays the same. OK, for SSDs block alignment might be an issue ... agreed. For DRAM-based ZILs this is not a problem - except for bandwidth.

Logic: on ZIL import, check:

- If the pointer to the ZIL chain is empty:
  if yes -> clean pool
  if not -> we need to replay
- Now if the block the root pointer points to is OK (checksum), the ZIL is used and replayed. At the end, the tnx of the last ZIL block must equal the pool tnx. If equal, then OK; if not, report an error about missing ZIL parts and switch to the mirror (if available).

> As Neil outlined, this isn't possible while
> preserving current ZIL performance. There is no way
> to distinguish the "last" ZIL block without incurring
> additional writes for every block. If it's even
> possible to implement this "paranoid ZIL" tunable,
> are you willing to take a 2-5x performance hit to be
> able to detect this failure mode?

Robert
--
This message posted from opensolaris.org
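A pseudocode-level sketch of the import-time check Robert is proposing; everything here (the names, the structure, and the idea that the uberblock records the expected tnx) is an illustration of the proposal, not how zil_claim()/zil_replay() actually work.

    /*
     * Sketch of the proposed check: every log block carries the tnx it
     * belongs to, and at import time the last replayable block's tnx is
     * compared with the tnx the uberblock claims.  All names and helpers
     * are assumptions made for illustration only.
     */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct zil_check {
        uint64_t zc_head;          /* first log block, 0 = empty chain   */
        uint64_t zc_pool_tnx;      /* last tnx recorded in the uberblock */
    } zil_check_t;

    /* Assumed helper: walk one slog side, return tnx of its last valid block. */
    extern bool zil_walk_side(int side, uint64_t head, uint64_t *last_tnx);
    extern void zfs_report_error(const char *msg);

    bool
    zil_import_check(const zil_check_t *zc, int nsides)
    {
        uint64_t last_tnx;

        if (zc->zc_head == 0)
            return (true);                 /* clean pool, nothing to replay */

        for (int side = 0; side < nsides; side++) {
            if (!zil_walk_side(side, zc->zc_head, &last_tnx))
                continue;                  /* unreadable side, try the mirror */
            if (last_tnx == zc->zc_pool_tnx)
                return (true);             /* full chain replayed from this side */
        }
        zfs_report_error("intent log truncated: some committed "
            "synchronous writes were lost");
        return (false);
    }

The point of the sketch is that the extra information rides inside blocks that are written anyway - they just get slightly bigger - which is Robert's answer to Eric's "additional writes for every block" objection.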
Darren J Moffat
2010-Aug-26 14:31 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
On 26/08/2010 15:08, Saso Kiselkov wrote:
> If I might add my $0.02: it appears that the ZIL is implemented as a
> kind of circular log buffer. As I understand it, when a corrupt checksum

It is NOT circular, since that implies a limited number of entries that get overwritten.

> is detected, it is taken to be the end of the log, but this rather
> defeats the checksum's original purpose, which is to detect device
> failure. Thus we would first need to change this behavior so that the
> checksum is only used for failure detection. This leaves the question of
> how to detect the end of the log, which I think could be done by using a
> monotonically incrementing counter on the ZIL entries. Once we find an
> entry where the counter != n+1, then we know we have reached the end of
> the sequence.

See the comment part way down zil_read_log_block about how we do something pretty much like that for checking the chain of log blocks:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zil.c#zil_read_log_block

This is the checksum in the BP checksum field.

But before we even got there we checked the ZILOG2 checksum as part of doing the zio (in the zio_checksum_verify() stage):

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c#zio_checksum_error

A ZILOG2 checksum is an embedded-in-the-block (at the start; the original ZILOG put it at the end) version of fletcher4. If that failed - i.e. the block was corrupt - we would have returned an error back through the dsl_read() of the log block.

--
Darren J Moffat
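For readers who don't want to open the source links, a rough picture of the layout difference Darren mentions. The structs below are simplified stand-ins and do not match the real zil.h/zio.h definitions (the real next-block pointer is a full blkptr_t and the checksum is a 256-bit fletcher4).

    /*
     * Illustrative stand-ins only: ZILOG2 embeds its chaining/checksum
     * header at the start of each log block, the original ZILOG format
     * kept a trailer at the end.
     */
    #include <stdint.h>

    #define LOG_BLOCK_SIZE  4096            /* illustrative block size only */

    typedef struct next_ptr_sim {           /* stand-in for blkptr_t */
        uint64_t np_words[16];
    } next_ptr_sim_t;

    typedef struct cksum_sim {              /* stand-in for zio_cksum_t */
        uint64_t zc_word[4];
    } cksum_sim_t;

    /* ZILOG2-style: self-describing header first, records after it. */
    typedef struct zilog2_block {
        next_ptr_sim_t lb_next_blk;         /* chain to the next log block  */
        uint64_t       lb_nused;            /* bytes of records in use      */
        cksum_sim_t    lb_cksum;            /* embedded fletcher4 checksum  */
        uint8_t        lb_records[LOG_BLOCK_SIZE -
                           sizeof (next_ptr_sim_t) - sizeof (uint64_t) -
                           sizeof (cksum_sim_t)];
    } zilog2_block_t;

    /* Original ZILOG-style: records first, trailer with the checksum last. */
    typedef struct zilog_block {
        uint8_t        lb_records[LOG_BLOCK_SIZE -
                           sizeof (next_ptr_sim_t) - sizeof (cksum_sim_t)];
        next_ptr_sim_t lb_next_blk;
        cksum_sim_t    lb_cksum;
    } zilog_block_t;

In both formats the checksum travels inside the log block itself, which is what makes the "checksum mismatch = end of chain" convention possible without any extra writes.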
On Wed, August 25, 2010 23:00, Neil Perrin wrote:
> On 08/25/10 20:33, Edward Ned Harvey wrote:
>
>> It's commonly stated that, even with log device removal supported, the
>> most common failure mode for an SSD is to blindly write without reporting
>> any errors, and to only detect that the device has failed upon read. So ...
>> if an SSD is in this failure mode, you won't detect it? At bootup, the
>> checksum will simply mismatch, and we'll chug along forward, having lost
>> the data ... (nothing can prevent that) ... but we don't know that we've
>> lost data?
>
> - Indeed, we wouldn't know we lost data.

Does a scrub go through the slog and/or L2ARC devices, or only the "primary" storage components?

If it doesn't go through these "secondary" devices, that may be a useful RFE, as one would ideally want to test the data on every component of a storage system.
Darren J Moffat
2010-Aug-26 14:48 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
On 26/08/2010 15:42, David Magda wrote:
> Does a scrub go through the slog and/or L2ARC devices, or only the
> "primary" storage components?

A scrub traverses datasets, including the ZIL, so the scrub will read (and if needed resilver) on a slog device too.

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu_traverse.c

A scrub does not traverse an L2ARC device, because we hold in-memory checksums (in the ARC header) for everything on the cache devices; if we get a checksum failure on a read, we remove the L2ARC cached entry and read from the main pool again. The L2ARC cache devices are purely caches: there is NEVER data on them that isn't already on the main pool devices.

--
Darren J Moffat
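A small sketch of the L2ARC read path Darren describes, with invented names; l2arc_read(), pool_read() and checksum() are assumed placeholder helpers, not the real arc.c interfaces. The checksum lives only in the in-memory ARC header, so a mismatch simply drops the cache entry and the read is satisfied from the main pool.

    /*
     * Illustrative sketch only: verify an L2ARC buffer against the
     * checksum kept in the in-memory ARC header, and fall back to the
     * main pool on any failure.
     */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct arc_hdr {
        uint64_t ah_l2_offset;     /* where the buffer sits on the cache device  */
        uint64_t ah_cksum;         /* checksum kept in memory, not on the device */
        bool     ah_in_l2;
    } arc_hdr_t;

    /* Assumed helpers. */
    extern bool     l2arc_read(uint64_t off, void *buf, size_t len);
    extern bool     pool_read(void *buf, size_t len);   /* authoritative copy */
    extern uint64_t checksum(const void *buf, size_t len);

    bool
    arc_read_buf(arc_hdr_t *hdr, void *buf, size_t len)
    {
        if (hdr->ah_in_l2 &&
            l2arc_read(hdr->ah_l2_offset, buf, len) &&
            checksum(buf, len) == hdr->ah_cksum)
            return (true);              /* cache hit, verified in memory   */

        hdr->ah_in_l2 = false;          /* drop the stale L2ARC entry      */
        return (pool_read(buf, len));   /* data is always in the pool too  */
    }

This is also why George notes below that the L2ARC is considered volatile: losing or corrupting the cache device can never lose data.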
I see, thank you for the clarification. So it would be possible to have something equivalent to main-storage self-healing on the ZIL, with a ZIL scrub to activate it. Or is that already implemented as well? (Sorry for asking these obvious questions, but I'm not familiar with the ZFS source code.)

--
Saso

On 08/26/2010 04:31 PM, Darren J Moffat wrote:
> See the comment part way down zil_read_log_block about how we do
> something pretty much like that for checking the chain of log blocks:
> [...]
> A ZILOG2 checksum is an embedded-in-the-block (at the start; the
> original ZILOG put it at the end) version of fletcher4. If that failed -
> i.e. the block was corrupt - we would have returned an error back through
> the dsl_read() of the log block.
Edward Ned Harvey wrote:
> It's commonly stated that, even with log device removal supported, the most
> common failure mode for an SSD is to blindly write without reporting any
> errors, and to only detect that the device has failed upon read. So ... if an
> SSD is in this failure mode, you won't detect it? At bootup, the checksum
> will simply mismatch, and we'll chug along forward, having lost the data ...
> (nothing can prevent that) ... but we don't know that we've lost data?

If the drive's firmware isn't returning any kind of write error then there isn't much that ZFS can really do here (regardless of whether this is an SSD or not). Turning every write into a read/write operation would totally defeat the purpose of the ZIL.

It's my understanding that SSDs will eventually transition to read-only devices once they've exhausted their spare reallocation blocks. This should propagate to the OS as an EIO, which means that ZFS will instead store the ZIL data on the main storage pool.

> Worse yet ... in preparation for the above SSD failure mode, it's commonly
> recommended to still mirror your log device, even if you have log device
> removal. If you have a mirror, and the data on each half of the mirror
> doesn't match (one device failed, and the other device is good) ... do you
> read the data from *both* sides of the mirror, in order to discover the
> corrupted log device and correctly move forward without data loss?

Yes, we read all sides of the mirror when we claim (i.e. read) the log blocks for a log device. This is exactly what a scrub would do for a mirrored data device.

- George
David Magda wrote:
> Does a scrub go through the slog and/or L2ARC devices, or only the
> "primary" storage components?

A scrub will go through slogs and primary storage devices. The L2ARC device is considered volatile, and data loss is not possible should it fail.

- George
Edward Ned Harvey wrote:
> Add to that: during scrubs, perform some reads on log devices (even if
> there's nothing to read).

We do read from log devices if there is data stored on them.

> In fact, during scrubs, perform some reads on every device (even if it's
> actually empty.)

Reading from the data portion of an empty device wouldn't really show us much, as we would be reading a bunch of non-checksummed data. The best we can do is to "probe" the device's label region to determine its health. This is exactly what we do today.

- George
Bob Friesenhahn
2010-Aug-27 15:01 UTC
[zfs-discuss] ZFS offline ZIL corruption not detected
On Thu, 26 Aug 2010, George Wilson wrote:
> A scrub will go through slogs and primary storage devices. The L2ARC device
> is considered volatile, and data loss is not possible should it fail.

What gets "scrubbed" in the slog? The slog contains transient data which exists for only seconds at a time. The slog is quite likely to be empty at any given point in time.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:
> What gets "scrubbed" in the slog? The slog contains transient data which
> exists for only seconds at a time. The slog is quite likely to be empty at
> any given point in time.

Yes, the typical ZIL block never lives long enough to be scrubbed, but if there are any blocks which have not been replayed (i.e. ZIL blocks for an unmounted filesystem) then those will get scrubbed.

- George