I had a client die with the following errors in dmesg:

Lustre: Client nobackup-client has started
standard.exe[4680]: segfault at 0000002ac4e88058 rip 00000033df807b7c rsp 0000000040bf7e08 error 4
LustreError: 11-0: an error occurred while communicating with 141.212.30.181@tcp. The ost_write operation failed with -28
LustreError: 11-0: an error occurred while communicating with 141.212.30.181@tcp. The ost_write operation failed with -28
LustreError: 11-0: an error occurred while communicating with 141.212.30.181@tcp. The ost_write operation failed with -28
LustreError: Skipped 3 previous similar messages
LustreError: 11-0: an error occurred while communicating with 141.212.30.181@tcp. The ost_write operation failed with -28
LustreError: Skipped 5 previous similar messages

After Abaqus segfaults, Lustre throws some errors, but the mount is still usable. The thing is, Abaqus (standard.exe) has never died on us like this before with this input on this hardware (we are new Lustre users). It's a standard test case provided by Abaqus (s4b).

I found no reference to -28 on Google. Any help would be great.

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
Michael MacDonald
2007-Oct-09 22:13 UTC
[Lustre-discuss] ost_write operation failed with -28
On Tue, 2007-10-09 at 17:57 -0400, Brock Palen wrote:
> I had a client die with the following errors in dmesg:
>
> Lustre: Client nobackup-client has started
> standard.exe[4680]: segfault at 0000002ac4e88058 rip 00000033df807b7c
> rsp 0000000040bf7e08 error 4
> LustreError: 11-0: an error occurred while communicating with
> 141.212.30.181@tcp. The ost_write operation failed with -28
> LustreError: 11-0: an error occurred while communicating with
> 141.212.30.181@tcp. The ost_write operation failed with -28
> LustreError: 11-0: an error occurred while communicating with
> 141.212.30.181@tcp. The ost_write operation failed with -28
> LustreError: Skipped 3 previous similar messages
> LustreError: 11-0: an error occurred while communicating with
> 141.212.30.181@tcp. The ost_write operation failed with -28
> LustreError: Skipped 5 previous similar messages

errno 28 is ENOSPC. Looks like you've got a full OST.
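For readers who want to decode such a return code themselves, here is a minimal Python sketch. It assumes the negative number in the log is a kernel-style negated errno, as the reply above indicates; the helper name `describe_rc` is made up for illustration, and errno numbering is platform-specific (28 is ENOSPC on Linux).

```python
import errno
import os


def describe_rc(rc: int) -> str:
    """Translate a negated errno such as -28 into its symbolic name and message.

    Hypothetical helper: it only consults the local platform's errno table.
    """
    code = abs(rc)
    # errno.errorcode maps numeric values to names such as "ENOSPC".
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    print(describe_rc(-28))
```

Running it on the client that logged the error should give the name and description that match that system's errno table.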
On Oct 9, 2007, at 6:13 PM, Michael MacDonald wrote:
> On Tue, 2007-10-09 at 17:57 -0400, Brock Palen wrote:
>> I had a client die with the following errors in dmesg:
>>
>> Lustre: Client nobackup-client has started
>> standard.exe[4680]: segfault at 0000002ac4e88058 rip 00000033df807b7c
>> rsp 0000000040bf7e08 error 4
>> LustreError: 11-0: an error occurred while communicating with
>> 141.212.30.181@tcp. The ost_write operation failed with -28
>> LustreError: 11-0: an error occurred while communicating with
>> 141.212.30.181@tcp. The ost_write operation failed with -28
>> LustreError: 11-0: an error occurred while communicating with
>> 141.212.30.181@tcp. The ost_write operation failed with -28
>> LustreError: Skipped 3 previous similar messages
>> LustreError: 11-0: an error occurred while communicating with
>> 141.212.30.181@tcp. The ost_write operation failed with -28
>> LustreError: Skipped 5 previous similar messages
>>
> errno 28 is ENOSPC. Looks like you've got a full OST.

Crap, you're right, one is full.

If a file is striped across multiple OSTs, I want to verify that Lustre does not just stop writing to the full one and write only to the remaining OSTs allocated to that file. Is that correct?
On Oct 09, 2007 18:23 -0400, Brock Palen wrote:
> On Oct 9, 2007, at 6:13 PM, Michael MacDonald wrote:
> > On Tue, 2007-10-09 at 17:57 -0400, Brock Palen wrote:
> >> I had a client die with the following errors in dmesg:
> >>
> >> Lustre: Client nobackup-client has started
> >> standard.exe[4680]: segfault at 0000002ac4e88058 rip 00000033df807b7c
> >> rsp 0000000040bf7e08 error 4
> >> LustreError: 11-0: an error occurred while communicating with
> >> 141.212.30.181@tcp. The ost_write operation failed with -28
> >> LustreError: 11-0: an error occurred while communicating with
> >> 141.212.30.181@tcp. The ost_write operation failed with -28
> >> LustreError: 11-0: an error occurred while communicating with
> >> 141.212.30.181@tcp. The ost_write operation failed with -28
> >> LustreError: Skipped 3 previous similar messages
> >> LustreError: 11-0: an error occurred while communicating with
> >> 141.212.30.181@tcp. The ost_write operation failed with -28
> >> LustreError: Skipped 5 previous similar messages
> >>
> > errno 28 is ENOSPC. Looks like you've got a full OST.
>
> Crap your right one is full.

This means your application isn't checking for errors returned by write().

> If a file is striped across multiple OST's I want to verify it does
> not just stop writing to that one and only write to the remaining
> OST's allocated to that file?

No, currently you get ENOSPC as above when one of the component OSTs is full. Generally OST space usage is even, unless you have single files that consume a large fraction of the available space in the filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
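To illustrate the point about checking write() results, here is a minimal Python sketch of a write loop that handles short writes and reports ENOSPC explicitly instead of carrying on with incomplete data. It is not how Abaqus works internally, and the helper name `write_all` is hypothetical. Note that Python surfaces a failed write as an `OSError` exception, whereas in C the same condition shows up as a -1 return value with `errno` set.

```python
import errno
import os


def write_all(fd: int, data: bytes) -> None:
    """Write every byte of data to fd, or raise if the write cannot complete.

    Hypothetical helper for illustration only.
    """
    view = memoryview(data)
    while len(view) > 0:
        try:
            # os.write may write fewer bytes than requested (a short write).
            written = os.write(fd, view)
        except OSError as exc:
            if exc.errno == errno.ENOSPC:
                # On a striped Lustre file this can occur when any single
                # component OST is full, even if the others have free space.
                raise RuntimeError("write failed: no space left on device") from exc
            raise
        # Drop the bytes that were written and retry with the remainder.
        view = view[written:]
```

A caller would typically wrap `write_all` in its own error handling so that the job stops with a clear message rather than failing later on a truncated file.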