behlendorf1@llnl.gov
2007-Jan-23 14:35 UTC
[Lustre-devel] [Bug 11602] fsx causes occasional checksum failures
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11602

Created an attachment (id=9404) --> (https://bugzilla.lustre.org/attachment.cgi?id=9404&action=view) Full dk log
adilger@clusterfs.com
2007-Jan-23 16:31 UTC
[Lustre-devel] [Bug 11602] fsx causes occasional checksum failures
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11602
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alex@clusterfs.com
(In reply to comment #0)
> For some time now we've noticed that with lustre checksums enabled, always
> on at LLNL, the fsx test will cause spurious checksum failure messages. The
> failures are harmless in nature but they generate some very scary looking
> log messages which we need to resolve.
>
> Now lustre should be locking these pages before they're being sent but
> something is still modifying them.
Actually, I think the issue is that the VM isn't locking the pages that are
being written, so the checksum that the client calculates becomes invalid as
fsx makes a small write into a page that is currently being sent.
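The race described above can be illustrated with a small sketch (Python used purely for illustration; the function and variable names are hypothetical, not Lustre code): the sender checksums an unlocked page, a concurrent write lands before the page contents are captured for the wire, and the receiver's checksum of the data it actually got no longer matches.

```python
import hashlib

def send_with_checksum(page: bytearray):
    # Checksum is computed over the live page buffer...
    checksum = hashlib.md5(page).digest()
    # ...but the page is not locked, so a concurrent small write
    # (like fsx's) can modify it before it goes on the wire.
    page[0] ^= 0xFF
    wire_data = bytes(page)
    return wire_data, checksum

page = bytearray(b"\x00" * 4096)
data, csum = send_with_checksum(page)
# Receiver checksums what it actually received -> mismatch.
print(hashlib.md5(data).digest() == csum)  # False
```

The mismatch is "harmless" in the sense the original report describes: the data that arrived is internally consistent; only the stale checksum disagrees with it.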
There is already some work in progress at CFS to change the client IO model to
fit it better into the standard Linux form, and also to address this same
problem for LAID (which _would_ result in data corruption, because the RAID
checksum would be incorrect and a lost disk would then produce incorrect
parity). Also, in 1.8 the current checksumming code is replaced by Kerberos
data authentication and encryption; since the encryption needs separate
buffers for the IO, the problem goes away as a result.
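The "separate buffers" point can be sketched the same way (again a hypothetical illustration, not Lustre code): copying the page into a private staging buffer before checksumming means later writes to the original page can no longer invalidate either the checksum or the data that is sent.

```python
import hashlib

def send_with_stable_checksum(page: bytearray):
    # Copy into a private buffer first; subsequent writes to `page`
    # cannot touch what was checksummed or what goes on the wire.
    staging = bytes(page)
    checksum = hashlib.md5(staging).digest()
    return staging, checksum

page = bytearray(b"\x00" * 4096)
data, csum = send_with_stable_checksum(page)
page[0] ^= 0xFF  # concurrent write lands after the copy; harmless now
print(hashlib.md5(data).digest() == csum)  # True
```

The cost of this approach is an extra copy per IO, which is presumably why it falls out "for free" only once encryption (which needs its own buffers anyway) is in the path.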