Per Lundqvist
2006-Sep-01 04:37 UTC
[Lustre-discuss] silent corruption when a OSS restarts its lustre service (1.4.6.1)
We have experienced problems with data corruption on our Lustre filesystem
(described below) when restarting the lustre service on one OSS. The
problem can be reproduced by doing the following steps:

1. on a lustre client: start copying some files (/lustre/testfile.[0-99] ->
   /lustre/testdir) on the lustre file system:

     cd /lustre/testdir/
     cp -av /lustre/testfile.* .

2. on an OSS: service lustre stop; sleep 140s; service lustre start (was
   originally noticed when rebooting an OSS)

3. on the client: the copy operation hangs on e.g.

     `../testfile.41' -> `./testfile.41'

4. on the client (after the lustre service has started again on the OSS).

     cp: reading `../testfile.41': Input/output error
     cp: cannot stat `../testfile.42': Input/output error
     `../testfile.43' -> `./testfile.43'
     <...snip...>
     `../testfile.100' -> `./testfile.100'

   In this case, testfile.42 was never created while testfile.41 was.

   ** Sometimes the file copy operation completes with no errors - but
   ** still leaves corrupt files.

5. on the client, check md5sum of the original and the copy of each
   existing file:

     md5sum testfile.4 ../testfile.4
     9e5a980a2de40df0a8934bb814d83600  testfile.4
     078b31bee47dfcd8207610f85530c064  ../testfile.4

     md5sum testfile.41 ../testfile.41
     d41d8cd98f00b204e9800998ecf8427e  testfile.41
     078b31bee47dfcd8207610f85530c064  ../testfile.41

     ls -l testfile.4 ../testfile.4
     -rw-------  1 perl nsc 10485760 Jun 20 09:49 ../testfile.4
     -rw-------  1 perl nsc  9437184 Jun 20 09:49 testfile.4

     ls -l testfile.41 ../testfile.41
     -rw-------  1 perl nsc 10485760 Jun 20 09:49 ../testfile.41
     -rw-------  1 perl nsc        0 Aug 31 17:03 testfile.41

   ** corrupt files with correct file size have been observed too.

Is this expected behaviour? Shouldn't you be able to restart an OSS
without risking file data corruption?
This is our setup for this lustre filesystem:

  OS: CentOS 4.3 x86_64
  Kernel: 2.6.9.34
  Lustre version: 1.4.6.1
  1 single file system (tcp ethernet only, no HA)
  1 MDS (1 MDT)
  3 OSS (2 OSTs per OSS)

regards,
Per Lundqvist

--
Per Lundqvist
National Supercomputer Centre
Linköping University, Sweden
http://www.nsc.liu.se
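The per-file md5sum check in step 5 of the report can be scripted so that every copied file is compared, including ones whose size looks correct. The following is only a minimal sketch under stated assumptions: `verify_copies` is a hypothetical helper name (not part of Lustre), and the `MISSING`/`CORRUPT` labels are invented for illustration. It relies only on `md5sum` and `cut` from coreutils.

```shell
#!/bin/sh
# verify_copies SRC_DIR DST_DIR   (hypothetical helper, not a Lustre tool)
# Compares every SRC_DIR/testfile.* against the file of the same name in
# DST_DIR. Prints one line per missing or mismatching copy and returns
# non-zero if any problem was found.
verify_copies() {
    src_dir=$1
    dst_dir=$2
    bad=0
    for src in "$src_dir"/testfile.*; do
        # Skip the unexpanded pattern (no matches) and non-regular files.
        [ -f "$src" ] || continue
        name=${src##*/}
        dst=$dst_dir/$name
        if [ ! -f "$dst" ]; then
            echo "MISSING  $name"
            bad=1
            continue
        fi
        # Read via stdin so md5sum prints only the digest and "-".
        src_sum=$(md5sum < "$src" | cut -d' ' -f1)
        dst_sum=$(md5sum < "$dst" | cut -d' ' -f1)
        if [ "$src_sum" != "$dst_sum" ]; then
            echo "CORRUPT  $name"
            bad=1
        fi
    done
    return $bad
}
```

For the layout described in the report this would be called as `verify_copies /lustre /lustre/testdir`. Because it compares checksums rather than sizes, it should also flag the "correct file size" corruption case mentioned above.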
Andreas Dilger
2006-Sep-01 14:34 UTC
[Lustre-discuss] silent corruption when a OSS restarts its lustre service (1.4.6.1)
On Sep 01, 2006 12:37 +0200, Per Lundqvist wrote:
> We have experienced problems with data corruption on our Lustre filesystem
> (described below) when restarting the lustre service on one OSS. The
> problem can be reproduced by doing the following steps:
>
> 2. on an OSS: service lustre stop; sleep 140s; service lustre start (was
>    originally noticed when rebooting an OSS)

The default for "service lustre stop" is to evict all of the clients, so
it is not at all surprising that you are getting errors. You can change
this in /etc/init.d/lustre::LCONF_STOP_ARGS to use --failover instead of
--force, then the OST will not evict the clients on shutdown. If you
don't want to do this regularly, then you can also use
"lconf --cleanup --failover $config.xml" for the "want to keep same
clients" case.

Starting the OST in recovery mode means that when you start the OST it
will wait for old clients to connect, and this can take a few minutes to
complete before it allows new clients to mount. You can stop this via
"lctl --device $ost_dev abort_recovery" after the startup.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
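The two commands quoted in the reply above can be wrapped in small shell functions for an OSS maintenance procedure. This is only a sketch under assumptions: the function names `run`, `oss_stop_keep_clients` and `ost_abort_recovery` and the `DRY_RUN` variable are hypothetical, and the config file and OST device arguments are placeholders for site-specific values. The `lconf` and `lctl` invocations themselves are copied from the message and not extended with any other options. By default the sketch only prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical wrappers around the commands quoted in the reply above.
# With DRY_RUN unset or set to 1 the command line is printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop Lustre on this OSS without evicting clients ("keep same clients").
# $1: path to the Lustre XML config (placeholder, site-specific).
oss_stop_keep_clients() {
    run lconf --cleanup --failover "$1"
}

# After the OST has been started again, give up waiting for old clients
# to reconnect so that new clients can mount sooner.
# $1: OST device as understood by lctl (placeholder, site-specific).
ost_abort_recovery() {
    run lctl --device "$1" abort_recovery
}
```

Running `oss_stop_keep_clients` with the site's config path in the default dry-run mode shows the exact command line before anything is executed; setting `DRY_RUN=0` executes it. Aborting recovery is optional and, per the reply, only shortens the wait before new clients can mount.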
Per Lundqvist
2006-Sep-04 08:16 UTC
[Lustre-discuss] silent corruption when a OSS restarts its lustre service (1.4.6.1)
On Fri, 1 Sep 2006, Andreas Dilger wrote:
> The default for "service lustre stop" is to evict all of the clients, so
> it is not at all surprising that you are getting errors. You can change
> this in /etc/init.d/lustre::LCONF_STOP_ARGS to use --failover instead of
> --force then the OST will not evict the clients on shutdown. If you

Thanks Andreas, that did the trick. I can now reboot one OSS without
seeing data corruption on ongoing file transfers.

thanks,
Per Lundqvist

--
Per Lundqvist
National Supercomputer Centre
Linköping University, Sweden
http://www.nsc.liu.se