behlendorf1@llnl.gov
2007-Jan-23 14:35 UTC
[Lustre-devel] [Bug 11602] fsx causes occasional checksum failures
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11602

Created an attachment (id=9404)
 --> (https://bugzilla.lustre.org/attachment.cgi?id=9404&action=view)
Full dk log
adilger@clusterfs.com
2007-Jan-23 16:31 UTC
[Lustre-devel] [Bug 11602] fsx causes occasional checksum failures
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alex@clusterfs.com

(In reply to comment #0)
> For some time now we've noticed that with lustre checksums enabled, always
> on at LLNL, the fsx test will cause spurious checksum failure messages. The
> failures are harmless in nature but they generate some very scary looking log
> messages which we need to resolve.
>
> Now lustre should be locking these pages before they're being sent but
> something is still modifying them.

Actually, I think the issue is that the VM isn't locking the pages that are being written, so the checksum that the client calculates becomes invalid as fsx makes a small write into a page that is currently being sent.

There is already some work in progress at CFS to change the client IO model in order to fit it better into the standard Linux form, and also to address this same problem for LAID (which _would_ result in data corruption issues because the RAID checksum would be incorrect and a lost disk would result in incorrect parity).

Also, in 1.8 the current checksumming code is replaced by Kerberos data authentication and encryption. The encryption needs separate buffers for the IO, so the problem goes away as a result.
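
For readers following the thread, the race described in this reply can be illustrated with a small user-space sketch. This is not Lustre code: the function names, the 4096-byte page size, and the use of zlib.crc32 as the checksum are all assumptions made for illustration. It only shows why a checksum taken over an unlocked page can disagree with the data that is actually sent, and why checksumming a separate copy of the buffer avoids the mismatch.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def small_write(page):
    """Stand-in for fsx making a small write into a page that is in flight."""
    page[100:104] = b"fsx!"


def send_in_place(page, racing_write):
    """Checksum the live page, then let a write land before the data is 'sent'."""
    client_sum = zlib.crc32(page)   # checksum computed by the client
    racing_write(page)              # page is not locked, so it can still change
    wire_data = bytes(page)         # data that actually goes out
    server_sum = zlib.crc32(wire_data)
    return client_sum, server_sum


def send_with_copy(page, racing_write):
    """Checksum and send a private copy, so later writes cannot affect either."""
    snapshot = bytes(page)          # separate buffer for the IO
    client_sum = zlib.crc32(snapshot)
    racing_write(page)              # modifies the page, not the snapshot
    server_sum = zlib.crc32(snapshot)
    return client_sum, server_sum


in_place = send_in_place(bytearray(PAGE_SIZE), small_write)
copied = send_with_copy(bytearray(PAGE_SIZE), small_write)

print("in place: checksums match?", in_place[0] == in_place[1])
print("with copy: checksums match?", copied[0] == copied[1])
```

In the in-place case the two checksums differ even though no data is lost, which matches the "harmless but scary" log messages in comment #0. The copy variant corresponds loosely to the separate IO buffers mentioned for the 1.8 encryption path, at the cost of an extra copy per page.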