behlendorf1@llnl.gov
2007-Jan-23 14:35 UTC
[Lustre-devel] [Bug 11602] fsx causes occasional checksum failures
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11602

Created an attachment (id=9404) --> (https://bugzilla.lustre.org/attachment.cgi?id=9404&action=view) Full dk log
adilger@clusterfs.com
2007-Jan-23 16:31 UTC
[Lustre-devel] [Bug 11602] fsx causes occasional checksum failures
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11602
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alex@clusterfs.com
(In reply to comment #0)
> For some time now we've noticed that with lustre checksums enabled, always
> on at LLNL, the fsx test will cause spurious checksum failure messages. The
> failures are harmless in nature but they generate some very scary looking
> log messages which we need to resolve.
>
> Now lustre should be locking these pages before they're being sent but
> something is still modifying them.
Actually, I think the issue is that the VM isn't locking the pages that are
being written, so the checksum that the client calculates becomes invalid as
fsx makes a small write into a page that is currently being sent.
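The race described above can be illustrated with a small sketch (Python used purely for illustration; the function and variable names are hypothetical, not Lustre code): the sender checksums an unlocked page, a concurrent write lands before the page contents are captured for the wire, and the receiver's checksum of the data it actually got no longer matches.

```python
import hashlib

def send_with_checksum(page: bytearray):
    # Checksum is computed over the live page buffer...
    checksum = hashlib.md5(page).digest()
    # ...but the page is not locked, so a concurrent small write
    # (like fsx's) can modify it before it goes on the wire.
    page[0] ^= 0xFF
    wire_data = bytes(page)
    return wire_data, checksum

page = bytearray(b"\x00" * 4096)
data, csum = send_with_checksum(page)
# Receiver checksums what it actually received -> mismatch.
print(hashlib.md5(data).digest() == csum)  # False
```

The mismatch is "harmless" in the sense the original report describes: the data that arrived is internally consistent; only the stale checksum disagrees with it.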
There is already some work in progress at CFS to change the client IO model to
fit it better into the standard Linux form, and also to address this same
problem for LAID (which _would_ result in data corruption, because the RAID
checksum would be incorrect and a lost disk would then produce incorrect
parity). Also, in 1.8 the current checksumming code is replaced by Kerberos
data authentication and encryption; since the encryption needs separate
buffers for the IO, the problem goes away as a result.
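The "separate buffers" point can be sketched the same way (again a hypothetical illustration, not Lustre code): copying the page into a private staging buffer before checksumming means later writes to the original page can no longer invalidate either the checksum or the data that is sent.

```python
import hashlib

def send_with_stable_checksum(page: bytearray):
    # Copy into a private buffer first; subsequent writes to `page`
    # cannot touch what was checksummed or what goes on the wire.
    staging = bytes(page)
    checksum = hashlib.md5(staging).digest()
    return staging, checksum

page = bytearray(b"\x00" * 4096)
data, csum = send_with_stable_checksum(page)
page[0] ^= 0xFF  # concurrent write lands after the copy; harmless now
print(hashlib.md5(data).digest() == csum)  # True
```

The cost of this approach is an extra copy per IO, which is presumably why it falls out "for free" only once encryption (which needs its own buffers anyway) is in the path.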