Hi all,

We are running Lustre 1.8.1 on about 800 clients. A few days ago we started hitting a weird problem several times: when writing data, some clients reported

  LustreError: 11-0: an error occurred while communicating with xx.xx.xx.xx at o2ib. The ost_write operation failed with -28

where xx.xx.xx.xx is one of our OSS nodes. As far as we know, errno 28 is "no space left on device", but neither the MDS nor the OSS is anywhere near full. After some time, everything appears to be OK again.

The space used:

# lfs df -h
UUID                   bytes      Used  Available  Use%  Mounted on
lustre-MDT0000_UUID   350.0G      1.1G     328.9G    0%  /home[MDT:0]
lustre-OST0000_UUID     6.2T    263.5G       5.6T    4%  /home[OST:0]
lustre-OST0001_UUID     6.2T    264.3G       5.6T    4%  /home[OST:1]
lustre-OST0002_UUID     5.7T    261.7G       5.2T    4%  /home[OST:2]
lustre-OST0003_UUID     5.4T    206.6G       4.9T    3%  /home[OST:3]
lustre-OST0004_UUID     4.6T    197.1G       4.2T    4%  /home[OST:4]
lustre-OST0005_UUID     4.6T    160.6G       4.2T    3%  /home[OST:5]
lustre-OST0006_UUID     4.6T    300.7G       4.1T    6%  /home[OST:6]
lustre-OST0007_UUID     4.6T    174.1G       4.2T    3%  /home[OST:7]
lustre-OST0008_UUID     6.9T    232.7G       6.4T    3%  /home[OST:8]
lustre-OST0009_UUID     6.9T    237.7G       6.4T    3%  /home[OST:9]
lustre-OST000a_UUID     6.2T    219.9G       5.6T    3%  /home[OST:10]
lustre-OST000b_UUID     6.2T    257.8G       5.6T    4%  /home[OST:11]
lustre-OST000c_UUID     6.2T    784.6G       5.1T   12%  /home[OST:12]
lustre-OST000d_UUID     6.2T    227.2G       5.6T    3%  /home[OST:13]
lustre-OST000e_UUID     5.7T    199.2G       5.2T    3%  /home[OST:14]
lustre-OST000f_UUID     5.4T    221.9G       4.9T    4%  /home[OST:15]
lustre-OST0010_UUID     4.6T    176.4G       4.2T    3%  /home[OST:16]
lustre-OST0011_UUID     4.6T    160.9G       4.2T    3%  /home[OST:17]
lustre-OST0012_UUID     3.1T    118.3G       2.8T    3%  /home[OST:18]
lustre-OST0013_UUID     3.1T     99.0G       2.8T    3%  /home[OST:19]
lustre-OST0014_UUID     6.9T    243.7G       6.4T    3%  /home[OST:20]
lustre-OST0015_UUID     6.9T    273.6G       6.3T    3%  /home[OST:21]
lustre-OST0016_UUID     6.2T    335.6G       5.5T    5%  /home[OST:22]
lustre-OST0017_UUID     6.2T    219.1G       5.6T    3%  /home[OST:23]

The inodes used:

# lfs df -ih
UUID                  Inodes     IUsed     IFree  IUse%  Mounted on
lustre-MDT0000_UUID    89.3M      2.1M     87.2M     2%  /home[MDT:0]
lustre-OST0000_UUID     6.2M     91.3K      6.1M     1%  /home[OST:0]
lustre-OST0001_UUID     6.2M     90.7K      6.1M     1%  /home[OST:1]
lustre-OST0002_UUID     5.7M     83.1K      5.6M     1%  /home[OST:2]
lustre-OST0003_UUID     5.4M     80.1K      5.3M     1%  /home[OST:3]
lustre-OST0004_UUID     4.6M     68.8K      4.6M     1%  /home[OST:4]
lustre-OST0005_UUID     4.6M     69.8K      4.6M     1%  /home[OST:5]
lustre-OST0006_UUID     4.6M     69.4K      4.6M     1%  /home[OST:6]
lustre-OST0007_UUID     4.6M     69.2K      4.6M     1%  /home[OST:7]
lustre-OST0008_UUID     6.9M    101.7K      6.8M     1%  /home[OST:8]
lustre-OST0009_UUID     6.9M    101.3K      6.8M     1%  /home[OST:9]
lustre-OST000a_UUID     6.2M     91.0K      6.1M     1%  /home[OST:10]
lustre-OST000b_UUID     6.2M     90.9K      6.1M     1%  /home[OST:11]
lustre-OST000c_UUID     6.2M     86.1K      6.1M     1%  /home[OST:12]
lustre-OST000d_UUID     6.2M     90.2K      6.1M     1%  /home[OST:13]
lustre-OST000e_UUID     5.7M     83.4K      5.6M     1%  /home[OST:14]
lustre-OST000f_UUID     5.4M     80.4K      5.3M     1%  /home[OST:15]
lustre-OST0010_UUID     4.6M     69.0K      4.6M     1%  /home[OST:16]
lustre-OST0011_UUID     4.6M     69.3K      4.6M     1%  /home[OST:17]
lustre-OST0012_UUID     3.1M     46.7K      3.0M     1%  /home[OST:18]
lustre-OST0013_UUID     3.1M     46.6K      3.0M     1%  /home[OST:19]
lustre-OST0014_UUID     6.9M    101.5K      6.8M     1%  /home[OST:20]
lustre-OST0015_UUID     6.9M    101.3K      6.8M     1%  /home[OST:21]
lustre-OST0016_UUID     6.2M     90.1K      6.1M     1%  /home[OST:22]
lustre-OST0017_UUID     6.2M     90.7K      6.1M     1%  /home[OST:23]

So why did this happen? BTW, neither the OSS nor the MDS logged anything about "no space left". The OS is SLES 10 SP2.
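One thing we wondered about (not sure it is relevant): -28 on an ost_write can apparently also come from the OST grant accounting rather than from the real free space. A sketch of the counters we could compare, assuming the usual 1.8 parameter names (obdfilter.*.tot_granted etc. on the OSS, osc.*.cur_grant_bytes on a client):

  # on one of the OSS nodes: space currently granted / pending / dirty per OST
  # (parameter names assumed from the 1.8 obdfilter /proc interface)
  lctl get_param obdfilter.*.tot_granted obdfilter.*.tot_pending obdfilter.*.tot_dirty

  # on a client: how much write grant each OSC currently holds (assumed name)
  lctl get_param osc.*.cur_grant_bytes

If tot_granted on an OST were close to its free space, that might explain clients getting -28 while lfs df still shows plenty of room, but we have not confirmed this.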