Hi,

I've been trying to get swap on Lustre to work, without much success, using blockdev_attach and the resulting lloop0 device, and using losetup and the resulting loop device. This thread (http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg00856.html) claims that it works, but in all my attempts the host hangs almost as soon as swap is used (testing with memhog). In some cases it hangs hard, but on occasion, if I'm patient enough, the OOM killer will eventually kill something and the node will become responsive again. If I carefully increase memory with each successive memhog run I can get some pages to swap, but any real pressure always results in a hang. I'm attempting this on Red Hat EL 5.6 with a Lustre 1.8.4 patchless client over IB.

Digging around search results for "swap over NFS" I've found a lot of discussion about race conditions and different patches to address this, but CONFIG_NFS_SWAP seems to be missing from the Red Hat kernel. And when trying swap to an NFS server, I see the same behavior. Is swap to a network device doomed to always fail on Red Hat EL 5, and if not, does anyone have a recipe for getting swap on Lustre to work?

I've also fiddled with min_free_kbytes and swappiness in an attempt to induce swapping before the node's memory is actually all gone, but all this results in is an earlier hang with less memory having been used.

Thanks,

jbh
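A rough sketch of the two setups described above, for reference. The mount point, file paths, sizes, and tunable values are illustrative, and the lctl invocation assumes the usual "blockdev_attach <file> <device>" argument order:

    # create a swap file on the Lustre client mount (path and size illustrative)
    dd if=/dev/zero of=/mnt/lustre/swapfile bs=1M count=4096

    # lloop route: attach the file to the Lustre loop block device
    lctl blockdev_attach /mnt/lustre/swapfile /dev/lloop0
    mkswap /dev/lloop0
    swapon /dev/lloop0

    # alternative: plain loop device route
    losetup /dev/loop0 /mnt/lustre/swapfile
    mkswap /dev/loop0
    swapon /dev/loop0

    # VM tunables mentioned above (values illustrative)
    sysctl -w vm.min_free_kbytes=65536
    sysctl -w vm.swappiness=80

    # apply memory pressure
    memhog 4g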
On 08/17/2011 10:43 PM, John Hanks wrote:
> Hi,
>
> I've been trying to get swap on lustre to work with not much success
> using blockdev_attach and the resulting lloop0 device and using
> losetup and the resulting loop device. This thread
> (http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg00856.html)
> claims that it works, but in all my attempts almost as soon as swap is
> used (testing with memhog), the host hangs. In some cases it hangs
> hard, but on occasion if I'm patient enough the OOM will eventually
> kill something and the node will become responsive again. If I
> carefully increase memory with each successive memhog run I can get
> some pages to swap, but any real pressure always results in a hang.
> I'm attempting this on Redhat EL 5.6 with lustre 1.8.4 patchless
> client over IB.

As a rule of thumb, you should try to keep the path to swap as simple as possible. No memory/buffer allocations on the way to a paging event if you can possibly do this.

The Lustre client (and most NFS or even network block devices) all do memory allocation of buffers ... which is anathema to migrating pages out to disk. You can easily wind up in a "death spiral" race condition (and it sounds like you are there). You might be able to do something with iSCSI or SRP (though these also do block allocations and could trigger death spirals). If you can limit the number of buffers they allocate, and then force them to allocate the buffers at startup (by forcing some activity to the block device, and then pin this memory so that it can't be ejected ...), you might have a chance to do it as a block device. I think SRP can do this; I'm not sure if iSCSI initiators can pin buffers in RAM.

You might look at the swapz patches (we haven't integrated them into our kernel yet, but have been looking at it) to compress swap pages and store them ... in RAM. This may not work for you, but it could be an option.

Is there any particular reason you can't use a local drive for this (such as you don't have local drives, or they aren't big/fast enough)?

> Digging around search results for "swap over NFS" I've found a lot of
> discussion about race conditions and different patches to address
> this, but CONFIG_NFS_SWAP seems to be missing from the redhat kernel.
> And upon trying swap to an NFS server, I see the same behavior. Is
> swap to a network device doomed to always fail on Redhat EL 5 and if
> not, does anyone have a recipe for getting swap on lustre to work?
>
> I've also fiddled with min_free_kbytes and swappiness in an attempt to
> induce swapping before the node's memory is actually all gone but all
> this results in is an earlier hang with less memory having been used.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
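The swapz patches themselves are not shown here. As a rough sketch of the same compress-swap-into-RAM idea, the mainline zram module (a descendant of the compcache/ramzswap work) can be set up roughly as follows, assuming a kernel that ships the module; the size and priority are illustrative:

    modprobe zram num_devices=1
    echo $((1024*1024*1024)) > /sys/block/zram0/disksize   # 1 GB, illustrative
    mkswap /dev/zram0
    swapon -p 10 /dev/zram0   # higher priority than any network-backed swap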
On Wed, 2011-08-17 at 22:57 -0400, Joe Landman wrote:
> The Lustre client (and most NFS or even network block devices) all do
> memory allocation of buffers ... which is anathema to migrating pages
> out to disk. You can easily wind up in a "death spiral" race condition
> (and it sounds like you are there). You might be able to do something
> with iSCSI or SRP (though these also do block allocations and could
> trigger death spirals).

Your post is generally correct, but minor nit here: there is no memory allocation on the command path of the Linux SRP initiator, so the death spiral is not possible there. I suspect the iSCSI initiator takes similar precautions -- or uses mempools -- to avoid this fate, but I'm not as familiar with that code.

Cheers,
Dave
On Wed, Aug 17, 2011 at 8:57 PM, Joe Landman <landman at scalableinformatics.com> wrote:
> On 08/17/2011 10:43 PM, John Hanks wrote:
> As a rule of thumb, you should try to keep the path to swap as simple as
> possible. No memory/buffer allocations on the way to a paging event if
> you can possibly do this.

I do have a long path there, will try simplifying that and see if it helps.

> The Lustre client (and most NFS or even network block devices) all do
> memory allocation of buffers ... which is anathema to migrating pages
> out to disk. You can easily wind up in a "death spiral" race condition
> (and it sounds like you are there). You might be able to do something
> with iSCSI or SRP (though these also do block allocations and could
> trigger death spirals). If you can limit the number of buffers they
> allocate, and then force them to allocate the buffers at startup (by
> forcing some activity to the block device, and then pin this memory so
> that it can't be ejected ...), you might have a chance to do it as a
> block device. I think SRP can do this; I'm not sure if iSCSI initiators
> can pin buffers in RAM.
>
> You might look at the swapz patches (we haven't integrated them into our
> kernel yet, but have been looking at it) to compress swap pages and
> store them ... in RAM. This may not work for you, but it could be an
> option.

I wasn't aware of swapz, that sounds really interesting. The codes that run the nodes out of memory tend to be sequencing applications, which seem like good candidates for memory compression.

> Is there any particular reason you can't use a local drive for this
> (such as you don't have local drives, or they aren't big/fast enough)?

We're doing this on diskless nodes. I'm not looking to get a huge amount of swap, just enough to provide a place for the root filesystem to page out of the tmpfs so we can squeeze out all the RAM possible for applications. Since I don't expect it to get heavily used, I'm considering running vblade on a server and carving out small AoE LUNs. It seems logical that if a host can boot off of iSCSI or AoE, you could have swap space there, but I've never tried it with either protocol.

FWIW, mounting a file on Lustre via loopback to provide a local scratch filesystem works really well.

jbh
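A bare-bones sketch of the vblade/AoE idea mentioned above; the shelf and slot numbers, interface name, file path, and size are all illustrative:

    # on the server: export a file as AoE shelf 0, slot 1 over eth0
    dd if=/dev/zero of=/srv/aoe/swap-node01 bs=1M count=4096
    vbladed 0 1 eth0 /srv/aoe/swap-node01

    # on the diskless client: the LUN appears as /dev/etherd/e<shelf>.<slot>
    modprobe aoe
    mkswap /dev/etherd/e0.1
    swapon /dev/etherd/e0.1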
John Hanks wrote:
> FWIW, mounting a file on Lustre via loopback to provide a local
> scratch filesystem works really well.

Mostly, but you probably need a kernel patch. See https://bugzilla.lustre.org/show_bug.cgi?id=22004

Kevin
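For context, the loopback scratch setup being discussed looks roughly like this; the paths, filesystem type, and size are illustrative:

    # create and format a per-node image file on the Lustre mount
    dd if=/dev/zero of=/mnt/lustre/scratch/$(hostname).img bs=1M count=10240
    mkfs.ext3 -F /mnt/lustre/scratch/$(hostname).img

    # mount it as local scratch via the loop driver
    mkdir -p /scratch
    mount -o loop /mnt/lustre/scratch/$(hostname).img /scratch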
On 08/17/2011 11:42 PM, David Dillow wrote:
> On Wed, 2011-08-17 at 22:57 -0400, Joe Landman wrote:
>> The Lustre client (and most NFS or even network block devices) all do
>> memory allocation of buffers ... which is anathema to migrating pages
>> out to disk. You can easily wind up in a "death spiral" race condition
>> (and it sounds like you are there). You might be able to do something
>> with iSCSI or SRP (though these also do block allocations and could
>> trigger death spirals).
>
> Your post is generally correct, but minor nit here: there is no memory
> allocation on the command path of the Linux SRP initiator, so the death
> spiral is not possible there. I suspect the iSCSI initiator takes

Thanks for clarifying that. I know that during startup there is an allocation, but I wasn't sure about what happens after that.

> similar precautions -- or uses mempools -- to avoid this fate, but I'm
> not as familiar with that code.

I think they also try to pre-allocate as much as possible. One issue we've seen in conjunction with these involves skb allocations in some network drivers: in tight memory situations, when there are very few free pages, they can cause problems. Usually we just get a bunch of messages in the logs, but on some occasions the network device shuts down (unable to allocate send/receive buffers). We have seen this on igb, e1000e, and e1000 based networks.

> Cheers,
> Dave

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
On 2011-08-17, at 8:43 PM, John Hanks wrote:
> I've been trying to get swap on lustre to work with not much success
> using blockdev_attach and the resulting lloop0 device and using
> losetup and the resulting loop device. This thread
> (http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg00856.html)
> claims that it works, but in all my attempts almost as soon as swap is
> used (testing with memhog), the host hangs. In some cases it hangs
> hard, but on occasion if I'm patient enough the OOM will eventually
> kill something and the node will become responsive again. If I
> carefully increase memory with each successive memhog run I can get
> some pages to swap, but any real pressure always results in a hang.
> I'm attempting this on Redhat EL 5.6 with lustre 1.8.4 patchless
> client over IB.

Using IB is important for trying this out, since Lustre RDMA uses preallocated pages for the RPC, unlike TCP, where there can be problems allocating the TCP receive buffers.

That said, the swap-on-Lustre code was never really finished. If you are interested in debugging this and have some coding skills you could probably get some help with the debugging on the list. You need to have a serial console attached to the client node, grab the stack traces from the client to see where it is stuck allocating memory, and then remove/avoid/preallocate that allocation.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
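A minimal sketch of capturing those stack traces once a serial console is attached; the console device and baud rate are illustrative:

    # kernel boot parameter so console output goes to the serial line
    console=ttyS0,115200

    # on the client, enable magic SysRq and dump all task stacks when it wedges
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    # if the node is too wedged for a shell, send the SysRq sequence
    # over the serial console instead (break, then 't')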
Hello,

I experimented with swap on Lustre in as many ways as possible (without touching the code), and had the shortest path possible, to no avail. The code is not able to handle it at all, and the system always hung.

Without serious code rewrites, this isn't going to work for you.

-Jason
On Thu, Aug 18, 2011 at 12:36 AM, Temple Jason <jtemple at cscs.ch> wrote:
> Hello,
>
> I experimented with swap on Lustre in as many ways as possible (without
> touching the code), and had the shortest path possible, to no avail. The
> code is not able to handle it at all, and the system always hung.
>
> Without serious code rewrites, this isn't going to work for you.
>
> -Jason

Lacking the skills and time for what Andreas suggested, I think my approach will be to abandon this direction for the moment. Thanks to everyone for your responses. If I find time to learn how to provide useful debugging and feedback, the list will be the first to know :)

jbh
On 2011-08-18, at 12:36 AM, "Temple Jason" <jtemple at cscs.ch> wrote:
> I experimented with swap on Lustre in as many ways as possible (without
> touching the code), and had the shortest path possible, to no avail. The
> code is not able to handle it at all, and the system always hung.

Jason, did you try the lloop device? That was written for swap to use, to avoid the VFS, filesystem, and locking layers. It never made it to production quality, since no customer was interested in completing it, but it is definitely the best starting point.

> Without serious code rewrites, this isn't going to work for you.

That's a difficult assessment to make. A bunch of effort went into removing allocations in the IO path at one time, but it was never a priority to keep lloop working, so things may have regressed over time. IMHO it probably isn't a huge effort to get this working again, but someone in the community would need to invest the time to investigate the problems and fix the code. It would be best to start with just getting the lloop block device to work reliably, and use "lctl set_param debug=+malloc" to find allocations along the IO path, then move on to debugging swap.

Cheers, Andreas
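A rough sketch of that debugging loop; the test file, dd parameters, and dump path are illustrative:

    # enable allocation tracing in the Lustre debug log
    lctl set_param debug=+malloc

    # exercise the lloop block device directly, without swap in the picture
    # (assumes /mnt/lustre/testfile already exists)
    lctl blockdev_attach /mnt/lustre/testfile /dev/lloop0
    dd if=/dev/zero of=/dev/lloop0 bs=1M count=128 oflag=direct

    # dump the kernel debug log and look for allocations on the IO path
    lctl dk /tmp/lustre-debug.txt
    grep -i alloc /tmp/lustre-debug.txt | less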
Since you are considering other options, I would recommend NBD swap for this type of minimal swapping application. Look at the Linux Terminal Server Project (LTSP) implementation using nbdswapd, a simple inetd-driven script that creates and exports NBD swap spaces. We have been using this successfully on our diskless nodes for several years.

You can grab the nbdswapd script for the server side and the ltsp-init-common script for an example client setup from any of the latest LTSP binary distributions.

-Phil
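For comparison, a bare-bones manual equivalent of what nbdswapd automates might look like the following; the server name, port, file path, and size are illustrative, and the nbdswapd script described above handles creating and exporting these per client automatically:

    # on the server: create a sparse backing file and export it over NBD on port 2000
    dd if=/dev/zero of=/srv/nbd/swap-node01 bs=1M count=0 seek=4096
    nbd-server 2000 /srv/nbd/swap-node01

    # on the diskless client
    modprobe nbd
    nbd-client swapserver 2000 /dev/nbd0
    mkswap /dev/nbd0
    swapon /dev/nbd0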