Michael Robbert
2010-Jan-08 18:36 UTC
[Lustre-discuss] No space left on device for just one file
I have a user that reported a problem creating a file on our Lustre filesystem. When I investigated I found that the problem appears to be unique to just one filename in one directory. I have tried numerous ways of creating the file, including echo, touch, and "lfs setstripe"; all return "No space left on device". I have checked the filesystem with df and "lfs df"; both show that the filesystem and all OSTs are far from full for both blocks and inodes. Files with slight changes to the filename are created fine. We had a kernel panic on the MDS yesterday, and it is quite possible that the user had a compute job working in this directory at the time of that problem. I am guessing we have some kind of corruption in the directory. The directory has around 1 million files, so moving the data around may not be a quick operation, but we're willing to do it. I just want to know the best way, short of taking the filesystem offline, to fix this problem.

Any ideas? Thanks in advance,
Mike Robbert
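The failing operations and space checks described above amount to roughly the following (a sketch only; the directory and filename shown here are placeholders, the real path is given later in the thread):

    {client}# cd /lustre/scratch/some/user/dir      # placeholder path
    {client}# touch problem-file                    # fails: No space left on device
    {client}# echo test > problem-file              # fails the same way
    {client}# lfs setstripe problem-file            # create with default striping; fails the same way
    {client}# touch problem-file.other              # a slightly different name works fine
    {client}# df /lustre/scratch                    # plenty of free blocks
    {client}# lfs df /lustre/scratch                # per-OST blocks, all far from full
    {client}# lfs df -i /lustre/scratch             # per-OST inodes, all far from full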
Can you paste us the file name? I want to see if we can touch something like this.

On Fri, Jan 8, 2010 at 1:36 PM, Michael Robbert <mrobbert at mines.edu> wrote:
> [quoted text trimmed]
Michael Robbert
2010-Jan-11 19:59 UTC
[Lustre-discuss] No space left on device for just one file
The filename is not very unique. I can create a file with the same name in another directory or on another Lustre filesystem. It is just this exact path on this filesystem. The full path is:
/lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11/NLDAS.APCP.007100.pfb.00164
The mount point for this filesystem is /lustre/scratch/

Thanks,
Mike

On Jan 11, 2010, at 5:52 AM, Mag Gam wrote:
> Can you paste us the file name? I want to see if we can touch
> something like this.
> [remainder of quoted text trimmed]
Bernd Schubert
2010-Jan-11 23:15 UTC
[Lustre-discuss] No space left on device for just one file
Hello Robert,

could you please send a mail to our ticket system? Kit or I would then start investigating tomorrow.

Thanks,
Bernd

On Monday 11 January 2010, Michael Robbert wrote:
> [quoted text trimmed]
Andreas Dilger
2010-Jan-12 02:24 UTC
[Lustre-discuss] No space left on device for just one file
On 2010-01-11, at 15:59, Michael Robbert wrote:
> The filename is not very unique. I can create a file with the same
> name in another directory or on another Lustre filesystem. It is
> just this exact path on this filesystem. The full path is:
> /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11/NLDAS.APCP.007100.pfb.00164
> The mount point for this filesystem is /lustre/scratch/

Robert,
does the same problem happen on multiple client nodes, or is it only happening on a single client? Are there any messages on the MDS and/or the OSSes when this problem is happening? This problem is somewhat unusual, since I'm not aware of any places outside the disk filesystem code that would cause ENOSPC when creating a file.

Can you please do a bit of debugging on the system:

    {client}#     cd /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11
    {mds,client}# echo -1 > /proc/sys/lustre/debug        # enable full debug
    {mds,client}# lctl clear                              # clear debug logs
    {client}#     touch NLDAS.APCP.007100.pfb.00164
    {mds,client}# lctl dk > /tmp/debug.{mds,client}       # dump debug logs

The full logs will be large, so for now please extract just the ENOSPC error from them; that will be much shorter, may be enough to identify where the problem is located, and will be a lot friendlier to the list:

    grep -- "-28" /tmp/debug.{mds,client} > /tmp/debug-28.{mds,client}

along with the "lfs df" and "lfs df -i" output.

If this is only on a single client, just dropping the locks on the client might be enough to resolve the problem:

    for L in /proc/fs/lustre/ldlm/namespaces/*; do
        echo clear > $L
    done

If, on the other hand, this same problem is happening on all clients, then the problem is likely on the MDS.

> [quoted text trimmed]

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
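A side note on the grep above: -28 is the kernel errno value for ENOSPC, the error the client reports as "No space left on device". Assuming Python is available on the node, the mapping can be confirmed with:

    python -c 'import errno, os; print(os.strerror(errno.ENOSPC))'
    # prints: No space left on device    (errno.ENOSPC == 28)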
Michael Robbert
2010-Jan-12 16:59 UTC
[Lustre-discuss] No space left on device for just one file
Andreas,
Here are the results of my debugging. This problem does show up on multiple (presumably all) clients. I followed your instructions, changing lustre to lnet in step 2, and got debug output on both machines, but the -28 text only showed up on the client.

[root at ra 18X11]# grep -- "-28" /tmp/debug.client
00000100:00000200:5:1263315233.100525:0:22069:0:(client.c:841:ptlrpc_check_reply()) @@@ rc = 1 for req at 00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID at 172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
00000100:00000200:5:1263315233.100538:0:22069:0:(events.c:95:reply_in_callback()) @@@ type 5, status 0 req at 00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID at 172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
00000100:00100000:5:1263315233.100543:0:22069:0:(events.c:115:reply_in_callback()) @@@ unlink req at 00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID at 172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
00000100:00000040:5:1263315233.100565:0:22069:0:(client.c:863:ptlrpc_check_status()) @@@ status is -28 req at 00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID at 172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
00000100:00000001:5:1263315233.100570:0:22069:0:(client.c:869:ptlrpc_check_status()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
00000100:00000001:5:1263315233.100578:0:22069:0:(client.c:955:after_reply()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
00000100:00100000:5:1263315233.100581:0:22069:0:(lustre_net.h:984:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Interpret" req at 00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID at 172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Rpc:R/0/0 rc 0/-28
00000100:00000001:5:1263315233.100586:0:22069:0:(client.c:2094:ptlrpc_queue_wait()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
00000002:00000040:5:1263315233.100590:0:22069:0:(mdc_reint.c:67:mdc_reint()) error in handling -28
00000002:00000001:5:1263315233.100593:0:22069:0:(mdc_reint.c:227:mdc_create()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)
00000080:00000001:5:1263315233.100596:0:22069:0:(namei.c:881:ll_new_node()) Process leaving via err_exit (rc=18446744073709551588 : -28 : ffffffffffffffe4)
00000100:00000040:5:1263315233.100600:0:22069:0:(client.c:1629:__ptlrpc_req_finished()) @@@ refcount now 0 req at 00000103a5820800 x200609397/t0 o36->scratch-MDT0000_UUID at 172.16.34.1@o2ib:12/10 lens 376/424 e 0 to 1 dl 1263315433 ref 1 fl Interpret:R/0/0 rc 0/-28
00000080:00000001:5:1263315233.100620:0:22069:0:(namei.c:930:ll_mknod_generic()) Process leaving (rc=18446744073709551588 : -28 : ffffffffffffffe4)

Finally here is the lfs df output:

[root at ra 18X11]# lfs df
UUID                    1K-blocks          Used     Available  Use%  Mounted on
home-MDT0000_UUID      5127574032       2034740    4832512272    0%  /lustre/home[MDT:0]
home-OST0000_UUID      5768577552    1392861480    4082688968   24%  /lustre/home[OST:0]
home-OST0001_UUID      5768577552    1206861808    4268688824   20%  /lustre/home[OST:1]
home-OST0002_UUID      5768577552    1500109508    3975439928   26%  /lustre/home[OST:2]
home-OST0003_UUID      5768577552    1233475740    4242074712   21%  /lustre/home[OST:3]
home-OST0004_UUID      5768577552    1197398768    4278150628   20%  /lustre/home[OST:4]
home-OST0005_UUID      5768577552    1186058976    4289491656   20%  /lustre/home[OST:5]

filesystem summary:   34611465312    7716766280   25136534716   22%  /lustre/home

UUID                    1K-blocks          Used     Available  Use%  Mounted on
scratch-MDT0000_UUID   5127569936       9913156    4824629964    0%  /lustre/scratch[MDT:0]
scratch-OST0000_UUID   5768577552    4446029104    1029519960   77%  /lustre/scratch[OST:0]
scratch-OST0001_UUID   5768577552    3914730392    1560819220   67%  /lustre/scratch[OST:1]
scratch-OST0002_UUID   5768577552    4268932844    1206616396   74%  /lustre/scratch[OST:2]
scratch-OST0003_UUID   5768577552    4307085048    1168464192   74%  /lustre/scratch[OST:3]
scratch-OST0004_UUID   5768577552    3920023888    1555525724   67%  /lustre/scratch[OST:4]
scratch-OST0005_UUID   5768577552    3590710852    1884838760   62%  /lustre/scratch[OST:5]
scratch-OST0006_UUID   5768577552    4649048836     826500028   80%  /lustre/scratch[OST:6]
scratch-OST0007_UUID   5768577552    4089658692    1385890920   70%  /lustre/scratch[OST:7]
scratch-OST0008_UUID   5768577552    4151458292    1324090948   71%  /lustre/scratch[OST:8]
scratch-OST0009_UUID   5768577552    4116646240    1358902348   71%  /lustre/scratch[OST:9]
scratch-OST000a_UUID   5768577552    3750259568    1725290032   65%  /lustre/scratch[OST:10]
scratch-OST000b_UUID   5768577552    4346406836    1129141752   75%  /lustre/scratch[OST:11]
scratch-OST000c_UUID   5768577552    4376152100    1099396768   75%  /lustre/scratch[OST:12]
scratch-OST000d_UUID   5768577552    4312773056    1162776184   74%  /lustre/scratch[OST:13]
scratch-OST000e_UUID   5768577552    4900307080     575242532   84%  /lustre/scratch[OST:14]
scratch-OST000f_UUID   5768577552    4044304276    1431243940   70%  /lustre/scratch[OST:15]
scratch-OST0010_UUID   5768577552    3827521672    1648026552   66%  /lustre/scratch[OST:16]
scratch-OST0011_UUID   5768577552    3789120072    1686427400   65%  /lustre/scratch[OST:17]
scratch-OST0012_UUID   5768577552    4023497048    1452052192   69%  /lustre/scratch[OST:18]
scratch-OST0013_UUID   5768577552    4133682544    1341866324   71%  /lustre/scratch[OST:19]
scratch-OST0014_UUID   5768577552    3690021408    1785527832   63%  /lustre/scratch[OST:20]
scratch-OST0015_UUID   5768577552    3891559096    1583990144   67%  /lustre/scratch[OST:21]
scratch-OST0016_UUID   5768577552    4404600712    1070948896   76%  /lustre/scratch[OST:22]
scratch-OST0017_UUID   5768577552    4792223084     683326528   83%  /lustre/scratch[OST:23]
scratch-OST0018_UUID   5768577552    4486070024     989478844   77%  /lustre/scratch[OST:24]
scratch-OST0019_UUID   5768577552    4471754448    1003795164   77%  /lustre/scratch[OST:25]
scratch-OST001a_UUID   5768577552    4517349052     958199536   78%  /lustre/scratch[OST:26]
scratch-OST001b_UUID   5768577552    3989325372    1486223000   69%  /lustre/scratch[OST:27]
scratch-OST001c_UUID   5768577552    4024754964    1450793904   69%  /lustre/scratch[OST:28]
scratch-OST001d_UUID   5768577552    3883873220    1591676392   67%  /lustre/scratch[OST:29]
scratch-OST001e_UUID   5768577552    4928383088     547166152   85%  /lustre/scratch[OST:30]
scratch-OST001f_UUID   5768577552    4291418836    1184130776   74%  /lustre/scratch[OST:31]

filesystem summary:  184594481664  134329681744   40887889340   72%  /lustre/scratch

[root at ra 18X11]# lfs df -i
UUID                       Inodes         IUsed         IFree  IUse%  Mounted on
home-MDT0000_UUID      1287101228       5716405    1281384823     0%  /lustre/home[MDT:0]
home-OST0000_UUID       366288896        871143     365417753     0%  /lustre/home[OST:0]
home-OST0001_UUID       366288896        900011     365388885     0%  /lustre/home[OST:1]
home-OST0002_UUID       366288896        804892     365484004     0%  /lustre/home[OST:2]
home-OST0003_UUID       366288896        836213     365452683     0%  /lustre/home[OST:3]
home-OST0004_UUID       366288896        836852     365452044     0%  /lustre/home[OST:4]
home-OST0005_UUID       366288896        850446     365438450     0%  /lustre/home[OST:5]

filesystem summary:    1287101228       5716405    1281384823     0%  /lustre/home

UUID                       Inodes         IUsed         IFree  IUse%  Mounted on
scratch-MDT0000_UUID   1453492963     174078773    1279414190    11%  /lustre/scratch[MDT:0]
scratch-OST0000_UUID    337257280       6621404     330635876     1%  /lustre/scratch[OST:0]
scratch-OST0001_UUID    366288896       6697629     359591267     1%  /lustre/scratch[OST:1]
scratch-OST0002_UUID    366288896       5272904     361015992     1%  /lustre/scratch[OST:2]
scratch-OST0003_UUID    366288896       5161903     361126993     1%  /lustre/scratch[OST:3]
scratch-OST0004_UUID    366288896       5327683     360961213     1%  /lustre/scratch[OST:4]
scratch-OST0005_UUID    366288896       5582579     360706317     1%  /lustre/scratch[OST:5]
scratch-OST0006_UUID    285040431       5158974     279881457     1%  /lustre/scratch[OST:6]
scratch-OST0007_UUID    366288896       5307157     360981739     1%  /lustre/scratch[OST:7]
scratch-OST0008_UUID    366288896       5387313     360901583     1%  /lustre/scratch[OST:8]
scratch-OST0009_UUID    366288896       5426523     360862373     1%  /lustre/scratch[OST:9]
scratch-OST000a_UUID    366288896       5424803     360864093     1%  /lustre/scratch[OST:10]
scratch-OST000b_UUID    360664073       5122378     355541695     1%  /lustre/scratch[OST:11]
scratch-OST000c_UUID    353235316       5129413     348105903     1%  /lustre/scratch[OST:12]
scratch-OST000d_UUID    366288896       5053936     361234960     1%  /lustre/scratch[OST:13]
scratch-OST000e_UUID    222189585       5122229     217067356     2%  /lustre/scratch[OST:14]
scratch-OST000f_UUID    366288896       5281196     361007700     1%  /lustre/scratch[OST:15]
scratch-OST0010_UUID    366288896       5274738     361014158     1%  /lustre/scratch[OST:16]
scratch-OST0011_UUID    366288896       5409560     360879336     1%  /lustre/scratch[OST:17]
scratch-OST0012_UUID    366288896       5369406     360919490     1%  /lustre/scratch[OST:18]
scratch-OST0013_UUID    366288896       5502974     360785922     1%  /lustre/scratch[OST:19]
scratch-OST0014_UUID    366288896       5521406     360767490     1%  /lustre/scratch[OST:20]
scratch-OST0015_UUID    366288896       5550606     360738290     1%  /lustre/scratch[OST:21]
scratch-OST0016_UUID    345993048       4999552     340993496     1%  /lustre/scratch[OST:22]
scratch-OST0017_UUID    249051056       4963064     244087992     1%  /lustre/scratch[OST:23]
scratch-OST0018_UUID    325734426       5108454     320625972     1%  /lustre/scratch[OST:24]
scratch-OST0019_UUID    329427010       5222114     324204896     1%  /lustre/scratch[OST:25]
scratch-OST001a_UUID    317921820       5115591     312806229     1%  /lustre/scratch[OST:26]
scratch-OST001b_UUID    366288896       5353229     360935667     1%  /lustre/scratch[OST:27]
scratch-OST001c_UUID    366288896       5383473     360905423     1%  /lustre/scratch[OST:28]
scratch-OST001d_UUID    366288896       5411890     360877006     1%  /lustre/scratch[OST:29]
scratch-OST001e_UUID    216236615       6188887     210047728     2%  /lustre/scratch[OST:30]
scratch-OST001f_UUID    366288896       6465049     359823847     1%  /lustre/scratch[OST:31]

filesystem summary:    1453492963     174078773    1279414190    11%  /lustre/scratch

Thanks,
Mike Robbert

On Jan 11, 2010, at 7:24 PM, Andreas Dilger wrote:
> [quoted text trimmed]
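The -28 above comes back in the reply to the create RPC (mdc_reint), i.e. from the MDT itself, so the MDS kernel log is a natural next place to look; for example (a sketch, log locations vary by distribution):

    {mds}# dmesg | grep -i ldiskfs
    {mds}# grep -i ldiskfs /var/log/messages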
Bernd Schubert
2010-Jan-12 19:30 UTC
[Lustre-discuss] No space left on device for just one file
Hello Mike,

you really should file a ticket with us (DDN). I think your problem comes from these MDS messages:

LDISKFS-fs warning (device dm-1): ldiskfs_dx_add_entry: Directory index full!
LDISKFS-fs warning (device dm-1): ldiskfs_dx_add_entry: Directory index full!

And /dev/dm-1 is also the scratch MDT.

Cheers,
Bernd

On Tuesday 12 January 2010, Michael Robbert wrote:
> [quoted text trimmed]

--
Bernd Schubert
DataDirect Networks
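To gauge how large the directory and its htree index have actually grown, something along these lines could be used (a sketch; it assumes /dev/dm-1 is the scratch MDT as in the warnings above and that the MDT namespace sits under ROOT/, as is usual for ldiskfs MDTs; debugfs is opened read-only):

    {client}# ls -f /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11 | wc -l    # entry count, unsorted
    {mds}#    debugfs -c -R "stat ROOT/smoqbel/Cenval/CLM/Met.Forcing/18X11" /dev/dm-1   # directory inode size and block count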
Bernd Schubert
2010-Jan-12 19:41 UTC
[Lustre-discuss] No space left on device for just one file
Hmm, it seems there is no solution available except turning off dir-indexing. I have never looked into the dir-index code, but wouldn't it make sense to skip the index just for this directory?

https://bugzilla.lustre.org/show_bug.cgi?id=10129
http://kerneltrap.org/mailarchive/linux-kernel/2008/5/18/1861604

Thanks,
Bernd

On Tuesday 12 January 2010, Bernd Schubert wrote:
> [quoted text trimmed]
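For reference, "turning off dir-indexing" is a whole-filesystem feature change on the MDT rather than a per-directory one, and would look roughly like this (a sketch only, assuming /dev/dm-1 is the scratch MDT as above; it requires unmounting the MDT and should not be attempted without a backup and a support ticket):

    {mds}# umount /dev/dm-1                    # or the MDT mount point
    {mds}# tune2fs -O ^dir_index /dev/dm-1     # clear the dir_index feature flag
    {mds}# e2fsck -f /dev/dm-1                 # let e2fsck clear the per-directory index flags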
Andreas Dilger
2010-Jan-13 02:49 UTC
[Lustre-discuss] No space left on device for just one file
On 2010-01-12, at 15:30, Bernd Schubert wrote:
> I think your problem is from these MDS messages:
>
> LDISKFS-fs warning (device dm-1): ldiskfs_dx_add_entry: Directory index full!
> LDISKFS-fs warning (device dm-1): ldiskfs_dx_add_entry: Directory index full!

Hmm, I didn't see this in any emails. That definitely would have made it obvious what the problem is.

I didn't think that the size of the index would be a problem, since the filename is only 27 characters long and Michael said there were only a million files in the directory. That works out to a directory size of about 40MB, which isn't close to the upper limit.

There might be a problem if you have a 1M-file directory and are repeatedly creating and deleting long filenames in the same directory, which might leave some directory leaf blocks full while the blocks cannot be split to redistribute the entries therein.

It should be possible to unmount the MDT and run "e2fsck -fD /dev/{mdsdev}" so it will re-index the directory and reduce the number of blocks the directory is using.

> [remainder of quoted text trimmed]
We >>>>>> had a kernel panic on the MDS yesterday and it was quite possible >>>>>> that the user had a compute job working in this directory at the >>>>>> time of that problem. I am guessing we have some kind of >>>>>> corruption with the directory. This directory has around 1 >>>>>> million >>>>>> files so moving the data around may not be a quick operation, but >>>>>> we''re willing to do it. I just want to know the best way, short >>>>>> of >>>>>> taking the filesystem offline, to fix this problem. >>>>>> >>>>>> Any ideas? Thanks in advance, >>>>>> Mike Robbert >>>>>> _______________________________________________ >>>>>> Lustre-discuss mailing list >>>>>> Lustre-discuss at lists.lustre.org >>>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>>> >>>> _______________________________________________ >>>> Lustre-discuss mailing list >>>> Lustre-discuss at lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> Cheers, Andreas >>> -- >>> Andreas Dilger >>> Sr. Staff Engineer, Lustre Group >>> Sun Microsystems of Canada, Inc. >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> > > > -- > Bernd Schubert > DataDirect NetworksCheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
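A minimal sketch of the offline re-index Andreas suggests above, assuming the scratch MDT really is the dm-1 device Bernd mentioned; the mount point is a placeholder, the Lustre-patched e2fsprogs is assumed, and the MDT (and therefore the filesystem) is unavailable while this runs:

{mds}# umount /mnt/scratch-mdt                      # stop the MDT; substitute the real MDT mount point
{mds}# e2fsck -fD /dev/dm-1                         # -f forces a full check, -D rebuilds and compacts directory indexes
{mds}# mount -t lustre /dev/dm-1 /mnt/scratch-mdt   # restart the MDT

The -D pass rewrites the directory's index and leaf blocks, which is what should recover the space lost to the half-full leaf blocks Andreas describes.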
Michael Robbert
2010-Jan-13 14:37 UTC
[Lustre-discuss] No space left on device for just one file
Andreas, I never saw this message either. It is showing up in the output of dmesg, but is not written to any log files that I can find. I miscounted the number of files in my original message. The actual number is a little more than 11 million files. I am in the process of working with the user to decrease the number of files needed in a single directory. At this point I think that we've given up trying to use this directory for anything and will just purge it. I expect that will be a long process. Does anybody have any suggestions for making the removal of a directory with 11 million files a little less painful? As for moving forward, I'm waiting for the user to get some code changes that will allow these files to be split into 8 separate directories, as well as the possibility of stacking time-step files to further reduce the number of files. I still expect the number of files to be fairly large and was considering using the loopback file system trick to store these, since once they are created they will be read only. Any suggestions for doing this? Initial tests indicate that having the loopback file on a local disk for writing may be faster. Then I can copy it to Lustre and come up with some kind of configuration on the compute nodes so that the user can mount it RO when his jobs start. Thanks, Mike Robbert On Jan 12, 2010, at 7:49 PM, Andreas Dilger wrote:> On 2010-01-12, at 15:30, Bernd Schubert wrote: >> you really should file a ticket with us (DDN). I think your problem is >> from >> these MDS messages: >> >> LDISKFS-fs warning (device dm-1): ldiskfs_dx_add_entry: Directory >> index full! >> LDISKFS-fs warning (device dm-1): ldiskfs_dx_add_entry: Directory >> index full! > > Hmm, I didn't see this in any emails. That definitely would have made > it obvious what the problem is. I didn't think that the size of the > index would be a problem, since the filename is only 27 characters > long, and Michael said there were only a million files in the > directory. That works out to a directory size of about 40MB, and > isn't close to the upper limit. > > There might be a problem if you have a 1M file directory and are > repeatedly creating and deleting long filenames in the same directory, > which might leave some directory leaf blocks full, but the block > cannot be split to redistribute the values therein. > > It should be possible to unmount the MDT and run "e2fsck -fD > /dev/{mdsdev}" so it will re-index the directory and reduce the number of > blocks the directory is using. > >> And /dev/dm-1 is also the scratch MDT. >> >> >> Cheers, >> Bernd >> >> On Tuesday 12 January 2010, Michael Robbert wrote: >>> Andreas, >>> Here are the results of my debugging. This problem does show up on >>> multiple >>> (presumably all) clients. I followed your instructions, changing >>> lustre to >>> lnet in step 2, and got debug output on both machines, but the -28 >>> text >>> only showed up on the client.
> Cheers, Andreas > -- > Andreas Dilger > Sr. Staff Engineer, Lustre Group > Sun Microsystems of Canada, Inc.
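One possible shape of the loopback workflow Michael describes above; the image size, the choice of ext3, and every path here are illustrative assumptions rather than details from the thread:

{node}# dd if=/dev/zero of=/local/forcing.img bs=1M count=20480    # ~20GB container file on fast local disk
{node}# mkfs.ext3 -F /local/forcing.img                            # -F allows mkfs to run on a regular file
{node}# mkdir -p /mnt/forcing
{node}# mount -o loop /local/forcing.img /mnt/forcing              # writable while the files are generated
{node}# cp -a /local/output/. /mnt/forcing/                        # populate the image with the time-step files
{node}# umount /mnt/forcing
{node}# cp /local/forcing.img /lustre/scratch/smoqbel/forcing.img  # move the finished image onto Lustre

{client}# mount -o loop,ro /lustre/scratch/smoqbel/forcing.img /mnt/forcing   # read-only on each compute node at job start

The appeal, presumably, is that millions of small files collapse into a single large Lustre file, so the MDS sees one object instead of one per file; mounting the image read-only from many clients at once should be safe because nothing ever writes to it after it is built.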