Thomas Guthmann
2011-Dec-05 07:30 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
Hi,

Over the week-end, on the client side, we started to see a lot of:

LustreError: 11-0: an error occurred while communicating with 192.168.1.32 at tcp. The ost_write operation failed with -28
LustreError: Skipped 23528 previous similar messages

What do they mean or imply? I guess -28 means a specific error?

The host 192.168.1.32 is up and provides other Lustre filesystems which don't have this problem (fingers crossed). Once this error happened, I couldn't write to the filesystem anymore despite 250GB free (lfs df -h). Unmounting/remounting the lustrefs fixed the issue, but then the error came back later. As usual, everything went fine for 2 years until today ;)

Any ideas or leads I can follow to investigate this issue further?

BTW, on the server side we 'only' have the usual messages we had before the disaster, where xxxx is our lustrefs having the above issue:

Lustre: Skipped 2 previous similar messages
Lustre: xxxx-OST0004: slow direct_io 73s due to heavy IO load
Lustre: xxxx-OST0004: slow journal start 72s due to heavy IO load
Lustre: xxxx-OST0004: slow commitrw commit 72s due to heavy IO load
Lustre: xxxx-OST0003: slow journal start 146s due to heavy IO load
Lustre: xxxx-OST0003: slow brw_start 163s due to heavy IO load
Lustre: Skipped 1 previous similar message
Lustre: xxxx-OST0003: slow journal start 164s due to heavy IO load
Lustre: xxxx-OST0003: slow commitrw commit 164s due to heavy IO load
Lustre: xxxx-OST0003: slow direct_io 164s due to heavy IO load

centos5# rpm -qa | grep lustre
lustre-modules-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5
lustre-ldiskfs-3.1.4-2.6.18_194.17.1.el5_lustre.1.8.5
kernel-devel-2.6.18-194.17.1.el5_lustre.1.8.5
kernel-2.6.18-194.17.1.el5_lustre.1.8.5
lustre-1.8.5-2.6.18_194.17.1.el5_lustre.1.8.5

Cheers,
Thomas
Christian Becker
2011-Dec-05 07:47 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
Hi Thomas,

On 12/05/2011 08:30 AM, Thomas Guthmann wrote:
> Hi,
>
> Over the week-end on the client sides we started to see a lot of:
>
> LustreError: 11-0: an error occurred while communicating with
> 192.168.1.32 at tcp. The ost_write operation failed with -28
> LustreError: Skipped 23528 previous similar messages

The negative number in this line is the usual Linux error code, in this case:

# grep 28 /usr/include/asm-generic/errno-base.h
#define ENOSPC 28 /* No space left on device */

best regards,
Christian
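[Editor's note] Christian's header grep can also be done programmatically: Lustre console messages report ordinary Linux errno values, negated. A minimal Python sketch (not Lustre-specific, just the standard errno tables):

```python
import errno
import os

def translate_lustre_rc(rc):
    """Translate a negative return code from a Lustre console message
    (e.g. -28) into its errno symbol and human-readable description."""
    err = -rc  # Lustre logs errno values negated
    name = errno.errorcode.get(err, "UNKNOWN")
    return name, os.strerror(err)

name, desc = translate_lustre_rc(-28)
print(name, "-", desc)  # ENOSPC - No space left on device
```

The same lookup works for any of the negative codes that show up in Lustre logs (-5 is EIO, -2 is ENOENT, and so on).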
Thomas Guthmann
2011-Dec-06 01:05 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
Hi,

> # grep 28 /usr/include/asm-generic/errno-base.h
> #define ENOSPC 28 /* No space left on device */

Great, so that's really what's happening. But I have free space/inodes... I cannot remember anything in the documentation talking about 'reserved free space'.

So based on the following output, is it normal to have no space left on storage?

# lfs df -h
[..]
UUID                 bytes    Used  Available  Use%  Mounted on
foobar-MDT0000_UUID   4.1G  197.8M       3.7G    4%  /lustre/foobar[MDT:0]
foobar-OST0000_UUID   2.0T    1.8T      21.1G   93%  /lustre/foobar[OST:0]
foobar-OST0001_UUID   2.0T    1.8T      23.2G   93%  /lustre/foobar[OST:1]
foobar-OST0002_UUID   2.0T    1.8T      21.4G   93%  /lustre/foobar[OST:2]
foobar-OST0003_UUID   2.0T    1.8T      19.3G   93%  /lustre/foobar[OST:3]
foobar-OST0004_UUID   2.0T    1.8T      19.3G   93%  /lustre/foobar[OST:4]
foobar-OST0005_UUID   2.0T    1.9T      16.9G   94%  /lustre/foobar[OST:5]

# lfs df -i
[..]
UUID                   Inodes  IUsed      IFree  IUse%  Mounted on
foobar-MDT0000_UUID   1019403     64    1019339     0%  /lustre/foobar[MDT:0]
foobar-OST0000_UUID  32363906    102   32363804     0%  /lustre/foobar[OST:0]
foobar-OST0001_UUID  32920407     99   32920308     0%  /lustre/foobar[OST:1]
foobar-OST0002_UUID  32453038    100   32452938     0%  /lustre/foobar[OST:2]
foobar-OST0003_UUID  31904762    104   31904658     0%  /lustre/foobar[OST:3]
foobar-OST0004_UUID  31904338    103   31904235     0%  /lustre/foobar[OST:4]
foobar-OST0005_UUID  31280099    104   31279995     0%  /lustre/foobar[OST:5]

For my dmesg on the OSS, Heiko pointed out (in a private email) that I may have hit one of the following bottlenecks:
- Too little space left on the filesystem
- Performance of ext3/4 on large disks (note: I am using ext4/Lustre 1.8.5/centos5)
  ==> http://jira.whamcloud.com/browse/LU-15

But it still does not explain why I couldn't write anymore.

Cheers,
Thomas
John Hammond
2011-Dec-06 01:20 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
On 12/05/2011 01:30 AM, Thomas Guthmann wrote:
> Hi,
>
> Over the week-end on the client sides we started to see a lot of:
>
> LustreError: 11-0: an error occurred while communicating with
> 192.168.1.32 at tcp. The ost_write operation failed with -28
> LustreError: Skipped 23528 previous similar messages
>
> What do they mean or imply, I guess -28 means a specific error ?

See attached. YMMV.

--
John L. Hammond, Ph.D.
TACC, The University of Texas at Austin
jhammond at tacc.utexas.edu (512) 471-9304

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: errno
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20111205/c36dcd75/attachment.pl
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lopc
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20111205/c36dcd75/attachment-0001.pl
Rappleye, Jason (ARC-TN)[Computer Sciences Corporation]
2011-Dec-06 01:34 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
Hi Thomas,

The OSTs for which you're receiving ENOSPC might have an excessive amount of grant space outstanding. You can get the current grant space by running the following command on each OSS:

$ lctl get_param obdfilter.*.tot_granted

Units are in bytes.

One grant-related BZ that bit us hard is 22755; in particular the part that caused grant to grow when a user code continued trying to write even after write(2) started returning EDQUOTA :-(

We monitor and alert on high grant space usage on each OST, so we can avoid ENOSPC due to this issue.

Jason

--
Jason Rappleye
Systems Administrator
NASA Advanced Supercomputing Division

On 12/5/11 5:05 PM, "Thomas Guthmann" <tguthmann at iseek.com.au> wrote:
> Hi,
>
>> # grep 28 /usr/include/asm-generic/errno-base.h
>> #define ENOSPC 28 /* No space left on device */
>
> Great. So it's really what's happening. But I have free space/inodes...
> I cannot remember anything in the documentation talking about 'reserved
> free space'.
>
> [...]
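[Editor's note] The monitor-and-alert approach Jason describes can be sketched in a few lines. This assumes the `lctl get_param obdfilter.*.tot_granted` output format shown in this thread (one `name=value` line per OST); the 16 GiB threshold is an arbitrary illustration, not a recommended value:

```python
def parse_tot_granted(output):
    """Parse `lctl get_param obdfilter.*.tot_granted` output into
    a dict mapping OST name -> granted bytes."""
    granted = {}
    for line in output.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        # key looks like obdfilter.foobar-OST0003.tot_granted
        granted[key.split(".")[1]] = int(value)
    return granted

def over_threshold(granted, limit_bytes):
    """Return the OSTs whose outstanding grant exceeds limit_bytes."""
    return {ost: g for ost, g in granted.items() if g > limit_bytes}

# Sample output taken from this thread
sample = """\
obdfilter.foobar-OST0003.tot_granted=17429659648
obdfilter.foobar-OST0004.tot_granted=13648875520
obdfilter.foobar-OST0005.tot_granted=18136141824
"""
granted = parse_tot_granted(sample)
print(over_threshold(granted, 16 * 2**30))  # OSTs granting more than 16 GiB
```

In practice the parsed values would come from running lctl on each OSS (e.g. via ssh) and the result would feed whatever alerting system is already in place.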
Thomas Guthmann
2011-Dec-06 06:31 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
Hi Jason,

>> $ lctl get_param obdfilter.*.tot_granted
>> Units are in bytes.

Thanks, I wasn't aware of this "grant". I googled for it and found some information, but it's still unclear. Should I understand that the values in obdfilter.*.tot_granted are actually 'reserved' space, allocated by clients but not used? So REAL_FREESPACE = DF_FREESPACE - TOT_GRANTED? Correct?

FYI, I have the following values on the OSS it couldn't connect/write to:

obdfilter.foobar-OST0003.tot_granted=17429659648
obdfilter.foobar-OST0004.tot_granted=13648875520
obdfilter.foobar-OST0005.tot_granted=18136141824

and, from lfs df (seen from the client):

foobar-OST0003_UUID 2113787824 1986169192 20244388 93% /lustre/foobar[OST:3]
foobar-OST0004_UUID 2113787824 1986170884 20242696 93% /lustre/foobar[OST:4]
foobar-OST0005_UUID 2113787824 1988667844 17745736 94% /lustre/foobar[OST:5]

So, for instance, for OST5 I have 17745736 - (18136141824/1024) = 17745736 - 17711076 = 34660 KB left.

Am I right?

>> One grant-related BZ that bit us hard is 22755; in particular the
>> part that caused grant to grow when a user code continued trying to write
>> even after write(2) started returning EDQUOTA :-(

That's interesting information. I also found the same via [1], and apparently it may not be fixed everywhere, which may explain why I hit it with Lustre 1.8.5. But, again, my application was writing into sparse files, so the space was already allocated... and the sparse files haven't grown.

[1]: http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg07565.html

Thomas
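[Editor's note] Thomas's OST:5 arithmetic can be checked mechanically. A small sketch using the figures quoted in this thread (`lfs df` reports KB, `tot_granted` is in bytes); this mirrors the REAL_FREESPACE = DF_FREESPACE - TOT_GRANTED interpretation discussed here, which is an approximation, not an official Lustre formula:

```python
def usable_kb(lfs_df_avail_kb, tot_granted_bytes):
    """Approximate space still usable for new writes on an OST:
    the `lfs df` available figure minus outstanding grant."""
    return lfs_df_avail_kb - tot_granted_bytes // 1024

# OST:5 figures from the messages above
avail_kb = 17745736       # "Available" column of lfs df, in KB
granted = 18136141824     # obdfilter.foobar-OST0005.tot_granted, in bytes
print(usable_kb(avail_kb, granted), "KB left")  # 34660 KB left
```

So the arithmetic checks out: with essentially all of the remaining free space promised to clients as grant, new writes on that OST see ENOSPC even though `lfs df` shows gigabytes free.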
Thomas Guthmann
2011-Dec-06 06:57 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
Hi,

> See attached. YMMV.

Handy! Thanks for sharing, John.

Thomas
Rappleye, Jason (ARC-TN)[Computer Sciences Corporation]
2011-Dec-06 07:31 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
Hi Thomas,

On Dec 5, 2011, at 10:31 PM, Thomas Guthmann wrote:
> Thanks. I wasn't aware of this "grant". I googled for it and I found some
> information about it but it's still unclear. Should I understand that the
> value in obdfilter.*.tot_granted are actually 'reserved' space allocated
> by clients but not used ?

In a sense, yes. My understanding is that grant space exists to ensure that client applications can perform asynchronous writes without dirtying more pages than the available space on an OST. Otherwise, writes would have to be synchronous to ensure that clients didn't use more space than is available.

> So REAL_FREESPACE = DF_FREESPACE - TOT_GRANTED ? Correct ?

That's more or less how our monitoring tools interpret it; a knowledgeable Lustre engineer might chime in and say otherwise :-)

> FYI, I have the following values on the OSS it couldn't connect/write to :
>
> obdfilter.foobar-OST0003.tot_granted=17429659648
> obdfilter.foobar-OST0004.tot_granted=13648875520
> obdfilter.foobar-OST0005.tot_granted=18136141824
>
> and : lfs df (seen from the client)
>
> foobar-OST0003_UUID 2113787824 1986169192 20244388 93% /lustre/foobar[OST:3]
> foobar-OST0004_UUID 2113787824 1986170884 20242696 93% /lustre/foobar[OST:4]
> foobar-OST0005_UUID 2113787824 1988667844 17745736 94% /lustre/foobar[OST:5]
>
> So, for instance for OST5 I have 17745736 - (18136141824/1024) =
> 17745736 - 17711076 = 34660 KB left
>
> Am I right ?

Yes, though on our system with ~12,000 clients, those values of tot_granted are obscenely low. A better comparison would be tot_granted on a freshly mounted OST on your filesystem.

>> One grant-related BZ that bit us hard is 22755; in particular the
>> part that caused grant to grow when a user code continued trying to write
>> even after write(2) started returning EDQUOTA :-(
>
> That's interesting information. I also found the same via [1] and apparently
> it may not be fixed overall. Which may explain why I may have hit it with
> Lustre 1.8.5.
>
> But, again, my application was writing into sparse files so the space was
> already allocated... and the sparse files haven't grown.

Your specific problem may not be due to a bug. That last bit of the filesystem may not be easily usable due to the grant mechanism. I'll let someone with more knowledge about grants chime in here.

Also, as Heiko alluded to, running with an OST so full is going to increase the chance of exposure to the problems described in LU-15.

Jason

> [1]: http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg07565.html
Johann Lombardi
2011-Dec-06 08:36 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
On Tue, Dec 06, 2011 at 01:31:24AM -0600, Rappleye, Jason (ARC-TN)[Computer Sciences Corporation] wrote:
>> So REAL_FREESPACE = DF_FREESPACE - TOT_GRANTED ? Correct ?
>
> That's more or less how our monitoring tools interpret it; a knowledgeable
> Lustre engineer might chime in and say otherwise :-)

Please note that the space accounted in tot_granted is not *totally* unusable, since this space can still be consumed on clients by asynchronous writes. Actually, the main problem with grant is that there is no callback mechanism yet to reclaim the space granted to clients. In 1.8.1 we introduced a feature called "grant shrinking" which forces idle clients to release grant space after some time. However, this feature was disabled before GA because of some issues in the patch which have never been addressed since.

>> FYI, I have the following values on the OSS it couldn't connect/write to :
>>
>> obdfilter.foobar-OST0003.tot_granted=17429659648
>> obdfilter.foobar-OST0004.tot_granted=13648875520
>> obdfilter.foobar-OST0005.tot_granted=18136141824

By default, one single OSC should not own more than 32MB of grant space. With 18GB of total granted space, you should have ~560 clients. How many clients are mounting the filesystem?

>>> One grant-related BZ that bit us hard is 22755; in particular the
>>> part that caused grant to grow when a user code continued trying to write
>>> even after write(2) started returning EDQUOTA :-(

Indeed, this grant leak issue has unfortunately been hit by many customers.

>> That's interesting information. I also found the same via [1] and apparently
>> it may not be fixed overall. Which may explain why I may have hit it with
>> Lustre 1.8.5.

This particular bug is supposed to be fixed since 1.8.4.

>> But, again, my application was writing into sparse files so the space was
>> already allocated... and the sparse files haven't grown.

Lustre (like most filesystems) does not allocate blocks for "holes" in sparse files.

Cheers,
Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com
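[Editor's note] Johann's back-of-the-envelope client estimate can be reproduced, assuming the 32MB-per-OSC default grant he mentions (the figures are the ones quoted in this thread):

```python
def expected_clients(tot_granted_bytes, per_osc_grant=32 * 2**20):
    """Rough number of clients needed to legitimately account for a
    given tot_granted, assuming each OSC holds its full default grant."""
    return tot_granted_bytes / per_osc_grant

# tot_granted on foobar-OST0005
print(round(expected_clients(18136141824)), "clients")
```

This yields roughly 540, in the same ballpark as the ~560 Johann quotes; either way, with only 5 clients actually mounted, the mismatch is what makes the grant leak obvious.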
Thomas Guthmann
2011-Dec-07 04:05 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
Hi,

>>> FYI, I have the following values on the OSS it couldn't connect/write to :
>>> obdfilter.foobar-OST0003.tot_granted=17429659648
>>> obdfilter.foobar-OST0004.tot_granted=13648875520
>>> obdfilter.foobar-OST0005.tot_granted=18136141824
>
> By default, one single OSC should not own more than 32MB of grant space.
> With 18GB of total granted space, you should have ~560 clients. How many
> clients are mounting the filesystem?

Don't fret... 5 clients :) And only 2 of them write to a defined set of sparse files (no concurrent writes). No other files have been created since, AFAIK.

At "day #1" of the Lustre filesystem, we created 21 sparse files of 512GB each. Then the application wrote into them. We didn't write any other files except 2 new 512GB sparse files a month ago. (This explains why we have a very low number of used inodes - see previous email for lfs df -i.)

>>> But, again, my application was writing into sparse files so the space was
>>> already allocated... and the sparse files haven't grown.
>
> Lustre (like most filesystems) does not allocate blocks for "holes" in
> sparse files.

Hmm, what do you mean? It works like any other filesystem, so I shouldn't have hit a grant issue?

Cheers,
Thomas
Johann Lombardi
2011-Dec-07 09:09 UTC
[Lustre-discuss] What's the human translation for: ost_write operation failed with -28
On Wed, Dec 07, 2011 at 02:05:28PM +1000, Thomas Guthmann wrote:
>> By default, one single OSC should not own more than 32MB of grant space.
>> With 18GB of total granted space, you should have ~560 clients. How many
>> clients are mounting the filesystem?
>
> Don't fret... 5 clients :)

Then there is a grant leak. Could you please run "lctl get_param osc.*.cur_grant_bytes" on all the clients? BTW, do you run the same version of Lustre (I think you mentioned 1.8.5) on all the nodes? In any case, you can try to unmount/remount the OSTs to work around the problem.

>> Lustre (like most filesystems) does not allocate blocks for "holes" in
>> sparse files.
>
> Hmm, what do you mean ?

My point is just that writing to a hole in a sparse file is not any different from writing at the end of a file and increasing its size. In both cases we have to allocate blocks, and the write can fail with ENOSPC.

Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com
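[Editor's note] Johann's point about holes can be demonstrated on an ordinary local Linux filesystem: a sparse file consumes almost no blocks until the holes are actually written, and filling a hole is exactly when an ENOSPC can strike, even though the file size never changes. A small sketch (a hypothetical temp file, nothing Lustre-specific):

```python
import os
import tempfile

# Create a 512 MiB sparse file: truncate sets the size but
# allocates no data blocks for the hole
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    f.truncate(512 * 2**20)

before = os.stat(path).st_blocks  # 512-byte units; ~0 for a fresh hole

# Writing into the hole is when blocks actually get allocated --
# this is the write that can fail with ENOSPC on a full filesystem
with open(path, "r+b") as f:
    f.seek(100 * 2**20)
    f.write(b"x" * 2**20)  # fill 1 MiB of the hole

after = os.stat(path).st_blocks
size = os.path.getsize(path)
print(before, "->", after, "blocks; size still", size)
os.unlink(path)
```

The apparent size stays at 512 MiB throughout, while the block count jumps only after the write lands in the hole, which is why Thomas's "the space was already allocated" assumption didn't hold.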