Chad Kerner
2009-Jan-22 18:11 UTC
[Lustre-discuss] Question about No space left on device
Hello,

We are running Lustre 1.6.6 and are seeing a weird error on space usage. The
filesystem is not anywhere near full, but writes are failing if they hit OST 23.

If I do an lfs setstripe -i 23 chad, and then do

# dd if=/dev/zero of=chad
dd: writing to `chad': No space left on device
26+0 records in
25+0 records out
#

The actual device is fairly full.

# df /lustre/home/ost_h_24
Filesystem            1K-blocks       Used  Available Use% Mounted on
/dev/mapper/ost_h_24  564172088  509452368   26061444  96% /lustre/home/ost_h_24
#

The striping is set as:

stripe_count: 2 stripe_size: 1048576 stripe_offset: -1

So there should be enough space to still create that stripe. I was wondering
if anyone else sees this type of error and, if so, how you can get around it?

# lfs df /u
UUID                  1K-blocks        Used   Available Use% Mounted on
home-MDT0000_UUID     250741088     7437416   228974536   2% /u[MDT:0]
home-OST0000_UUID     564172088   465752856    69742520  82% /u[OST:0]
home-OST0001_UUID     564172088   481188596    54304660  85% /u[OST:1]
home-OST0002_UUID     564172088   476474964    59007024  84% /u[OST:2]
home-OST0003_UUID     564172088   477867116    57604028  84% /u[OST:3]
home-OST0004_UUID     564172088   467160140    68338168  82% /u[OST:4]
home-OST0005_UUID     564172088   460648340    74822768  81% /u[OST:5]
home-OST0006_UUID     564172088   461577612    73936192  81% /u[OST:6]
home-OST0007_UUID     564172088   463904716    71582888  82% /u[OST:7]
home-OST0008_UUID     564172088   490786452    67629072  86% /u[OST:8]
home-OST0009_UUID     564172088   494448544    41036872  87% /u[OST:9]
home-OST000a_UUID     564172088   469107600    66406208  83% /u[OST:10]
home-OST000b_UUID     564172088   471937184    63550148  83% /u[OST:11]
home-OST000c_UUID     564172088   470872456    64614456  83% /u[OST:12]
home-OST000d_UUID     564172088   468447640    67066156  83% /u[OST:13]
home-OST000e_UUID     564172088   461661040    73823812  81% /u[OST:14]
home-OST000f_UUID     564172088   472427452    63086356  83% /u[OST:15]
home-OST0010_UUID     564172088   465760292    69712320  82% /u[OST:16]
home-OST0011_UUID     564172088   463973452    71515716  82% /u[OST:17]
home-OST0012_UUID     564172088   478922600    56568572  84% /u[OST:18]
home-OST0013_UUID     564172088   488618676    46890016  86% /u[OST:19]
home-OST0014_UUID     564172088   472108048    63382768  83% /u[OST:20]
home-OST0015_UUID     564172088   489621560    45874056  86% /u[OST:21]
home-OST0016_UUID     564172088   475357600    83082828  84% /u[OST:22]
home-OST0017_UUID     564172088   509437736    26069524  90% /u[OST:23]

filesystem summary: 13540130112 11398062672  1499647128  84% /u

Thanks,
Chad
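P.S. For completeness, the layout of the test file can be confirmed from a
client with lfs getstripe (output omitted here; "chad" is simply the test
file created by the reproduction above):

# lfs getstripe chad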
Brian J. Murrell
2009-Jan-22 19:36 UTC
[Lustre-discuss] Question about No space left on device
On Thu, 2009-01-22 at 12:11 -0600, Chad Kerner wrote:
> Hello,
>
> We are running Lustre 1.6.6 and are seeing a weird error on space
> usage. The filesystem is not anywhere near full, but writes are failing
> if they hit OST 23.
>
> If I do an lfs setstripe -i 23 chad, and then do
> # dd if=/dev/zero of=chad
> dd: writing to `chad': No space left on device
> 26+0 records in
> 25+0 records out
> #
>
> The actual device is fairly full.
> # df /lustre/home/ost_h_24
> Filesystem            1K-blocks       Used  Available Use% Mounted on
> /dev/mapper/ost_h_24  564172088  509452368   26061444  96% /lustre/home/ost_h_24

At 4% free, unless you have changed the "reserved space" on the OSTs'
filesystem (see the ops manual), you are into the space that a normal user is
not allowed to write to (and gets ENOSPC when he does). By default, 5% of
every device is reserved for root.

That said, you really are running that OST quite full. Historically (I'm not
sure if this still applies -- maybe one of our ext3 experts can comment), if
you run an ext3 filesystem above 80% full you start to see performance
degradation.

Are you getting any ENOSPC (-28) errors other than by trying to force a write
to that full OST? I ask because, unless directed specifically to use a
particular OST (as your example does), the MDS should avoid using a full OST.
If it's not, that's a bug.

b.
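P.S. If you want to inspect or change that reserved percentage, it is just
the normal ext3 reserved-blocks setting on the backing device. A rough sketch
only (run on the OSS that serves the OST; the device name below is simply the
one from your df output, and lowering the root reserve is a trade-off, not a
recommendation):

# dumpe2fs -h /dev/mapper/ost_h_24 | grep -i "reserved block"
# tune2fs -m 1 /dev/mapper/ost_h_24

(the second command drops the root reserve to 1% of the device)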
Chad Kerner
2009-Jan-22 19:41 UTC
[Lustre-discuss] Question about No space left on device
Brian, yes, the users are not specifying any stripe parameters, and they are
getting this error.

Chad

On Thu, 22 Jan 2009, Brian J. Murrell wrote:
> On Thu, 2009-01-22 at 12:11 -0600, Chad Kerner wrote:
>> Hello,
>>
>> We are running Lustre 1.6.6 and are seeing a weird error on space
>> usage. The filesystem is not anywhere near full, but writes are failing
>> if they hit OST 23.
>>
>> If I do an lfs setstripe -i 23 chad, and then do
>> # dd if=/dev/zero of=chad
>> dd: writing to `chad': No space left on device
>> 26+0 records in
>> 25+0 records out
>> #
>>
>> The actual device is fairly full.
>> # df /lustre/home/ost_h_24
>> Filesystem            1K-blocks       Used  Available Use% Mounted on
>> /dev/mapper/ost_h_24  564172088  509452368   26061444  96% /lustre/home/ost_h_24
>
> At 4% free, unless you have changed the "reserved space" on the OSTs'
> filesystem (see the ops manual), you are into the space that a normal user
> is not allowed to write to (and gets ENOSPC when he does). By default, 5%
> of every device is reserved for root.
>
> That said, you really are running that OST quite full. Historically (I'm
> not sure if this still applies -- maybe one of our ext3 experts can
> comment), if you run an ext3 filesystem above 80% full you start to see
> performance degradation.
>
> Are you getting any ENOSPC (-28) errors other than by trying to force a
> write to that full OST? I ask because, unless directed specifically to use
> a particular OST (as your example does), the MDS should avoid using a full
> OST. If it's not, that's a bug.
>
> b.
Brian J. Murrell
2009-Jan-22 20:02 UTC
[Lustre-discuss] Question about No space left on device
On Thu, 2009-01-22 at 13:41 -0600, Chad Kerner wrote:
> Brian, yes, the users are not specifying any stripe parameters, and they
> are getting this error.

Are you absolutely positive there is no striping policy on the dir (or the
parent dir, in the absence of anything specific on a given dir, or its
parent, and so on) they are creating the file in? Understand that if a given
dir has no specific striping policy it inherits its parent's policy, and that
recurses all the way up to the root of the filesystem.

Also, attempted "appends" to any objects (i.e. files) on that OST will fail
with ENOSPC. Maybe that is what you are seeing.

If you are sure of the striping, and that the ENOSPC is not from trying to
append to existing files, then you have found a bug. Please file a bug at our
bugzilla. If you can show a test case and prove the striping, all the better.

If the ENOSPC is exclusively from file appends, then you will need to
rebalance the OST using the poor-man's migrate/rebalance procedure that we
have described here previously. I am sure you could cook one up. It's
basically a cp/mv of enough files (to reallocate objects) on that full OST
(lfs find) to get your full OST's usage down. You have to deactivate the OST
on the MDS first, to be sure not to allocate new objects to it until it's
back in balance.

b.
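P.S. A rough sketch of both checks, in case it saves someone some typing. The
paths, the OST index and the device numbers below are only placeholders --
check "lctl dl" on your MDS for the real OSC device -- and the loop is not
meant as a polished script (it rewrites files, so skip anything that is in
use):

Check whether a directory carries a striping policy that its children would
inherit:

# lfs getstripe /u/some/dir

Poor-man's rebalance of the full OST (index 23, i.e. home-OST0017).

On the MDS, stop new object allocation to that OST:

# lctl dl | grep OST0017
# lctl --device <devno-from-lctl-dl> deactivate

On a client, rewrite enough of the files that have objects there:

# lfs find -r --obd home-OST0017_UUID /u > /tmp/ost23.list
# while read f; do cp -a "$f" "$f.migr" && mv "$f.migr" "$f"; done < /tmp/ost23.list

Once usage is back down, re-enable the OST on the MDS:

# lctl --device <devno-from-lctl-dl> activate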
Heald, Nathan T.
2009-Jan-22 22:07 UTC
[Lustre-discuss] Question about bug 18018 (external journal bug)
I seem to understand that patches are available for Lustre 1.6.7, but I don't
see any specific patches for affected versions below that -- am I missing
something? I am looking for a patch for 1.6.4.3 on RHEL4; would this fall
under the RHEL4 patch (comment 34)?

https://bugzilla.lustre.org/show_bug.cgi?id=18018

Thanks,
-Nathan
Brian J. Murrell
2009-Jan-22 22:58 UTC
[Lustre-discuss] Question about bug 18018 (external journal bug)
On Thu, 2009-01-22 at 17:07 -0500, Heald, Nathan T. wrote:
> I seem to understand that patches are available for Lustre 1.6.7 but I
> don't see any specific patches for affected versions below that,

We typically only create patches for the "upcoming release(s)". Those patches
are what make the "current release" the "next release". The amount of work it
would take to create a version of every patch for every past release is just
far, far too much. This is one of the reasons we urge people to keep up with
our release cycle.

> I am looking for a patch for 1.6.4.3 on RHEL4, would this fall under
> the RHEL4 patch? (Comment 34):
> https://bugzilla.lustre.org/show_bug.cgi?id=18018

Given that this particular patch is for the kernel itself, it escapes the
above clause with regard to Lustre versions but falls under a similar clause,
in that we are only going to produce that patch for the particular RHEL4
kernel that is released with the "next release". If you want that patch for
previous RHEL4 kernels you will have to try to port it (if it even needs
porting -- it might just apply cleanly) yourself, maybe with help from people
here in the community.

b.
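P.S. If you do try it against your current kernel, a dry run will tell you
quickly whether the patch applies as-is. A sketch only -- the paths and the
patch file name below are placeholders for your own source tree and for
whichever attachment you pull from the bug:

# cd /path/to/your/rhel4-kernel-source
# patch -p1 --dry-run < bug18018-rhel4.patch

If that reports no rejects, re-run it without --dry-run and rebuild.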
Heald, Nathan T.
2009-Jan-22 23:33 UTC
[Lustre-discuss] Question about bug 18018 (external journal bug)
That's fine, I just wanted to make sure I wasn't missing something. We can
just plan to reboot after every unmount until we are ready to upgrade.

Thanks,
-Nathan

On 1/22/09 5:58 PM, "Brian J. Murrell" <Brian.Murrell at Sun.COM> wrote:

> On Thu, 2009-01-22 at 17:07 -0500, Heald, Nathan T. wrote:
>> I seem to understand that patches are available for Lustre 1.6.7 but I
>> don't see any specific patches for affected versions below that,
>
> We typically only create patches for the "upcoming release(s)". Those
> patches are what make the "current release" the "next release". The amount
> of work it would take to create a version of every patch for every past
> release is just far, far too much. This is one of the reasons we urge
> people to keep up with our release cycle.
>
>> I am looking for a patch for 1.6.4.3 on RHEL4, would this fall under
>> the RHEL4 patch? (Comment 34):
>> https://bugzilla.lustre.org/show_bug.cgi?id=18018
>
> Given that this particular patch is for the kernel itself, it escapes the
> above clause with regard to Lustre versions but falls under a similar
> clause, in that we are only going to produce that patch for the particular
> RHEL4 kernel that is released with the "next release". If you want that
> patch for previous RHEL4 kernels you will have to try to port it (if it
> even needs porting -- it might just apply cleanly) yourself, maybe with
> help from people here in the community.
>
> b.