Hi,

I've been trying to get swap on Lustre to work, without much success, using blockdev_attach and the resulting lloop0 device, and using losetup and the resulting loop device. This thread (http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg00856.html) claims that it works, but in all my attempts the host hangs almost as soon as swap is used (testing with memhog). In some cases it hangs hard, but on occasion, if I'm patient enough, the OOM killer will eventually kill something and the node will become responsive again. If I carefully increase memory with each successive memhog run I can get some pages to swap, but any real pressure always results in a hang. I'm attempting this on Red Hat EL 5.6 with a Lustre 1.8.4 patchless client over IB.

Digging around search results for "swap over NFS" I've found a lot of discussion about race conditions and different patches to address this, but CONFIG_NFS_SWAP seems to be missing from the Red Hat kernel. And when trying swap to an NFS server, I see the same behavior. Is swap to a network device doomed to always fail on Red Hat EL 5, and if not, does anyone have a recipe for getting swap on Lustre to work?

I've also fiddled with min_free_kbytes and swappiness in an attempt to induce swapping before the node's memory is actually all gone, but all this results in is an earlier hang with less memory having been used.

Thanks,

jbh
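A rough sketch of the two setups described above, for reference. The mount point, file paths, sizes, and tunable values are illustrative, and the lctl invocation assumes the usual "blockdev_attach <file> <device>" argument order:

    # create a swap file on the Lustre client mount (path and size illustrative)
    dd if=/dev/zero of=/mnt/lustre/swapfile bs=1M count=4096

    # lloop route: attach the file to the Lustre loop block device
    lctl blockdev_attach /mnt/lustre/swapfile /dev/lloop0
    mkswap /dev/lloop0
    swapon /dev/lloop0

    # alternative: plain loop device route
    losetup /dev/loop0 /mnt/lustre/swapfile
    mkswap /dev/loop0
    swapon /dev/loop0

    # VM tunables mentioned above (values illustrative)
    sysctl -w vm.min_free_kbytes=65536
    sysctl -w vm.swappiness=80

    # apply memory pressure
    memhog 4g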
On 08/17/2011 10:43 PM, John Hanks wrote:
> Hi,
>
> I've been trying to get swap on lustre to work with not much success
> using blockdev_attach and the resulting lloop0 device and using
> losetup and the resulting loop device. This thread
> (http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg00856.html)
> claims that it works, but in all my attempts almost as soon as swap is
> used (testing with memhog), the host hangs. In some cases it hangs
> hard, but on occasion if I'm patient enough the OOM will eventually
> kill something and the node will become responsive again. If I
> carefully increase memory with each successive memhog run I can get
> some pages to swap, but any real pressure always results in a hang.
> I'm attempting this on Redhat EL 5.6 with lustre 1.8.4 patchless
> client over IB.

As a rule of thumb, you should try to keep the path to swap as simple as possible. No memory/buffer allocations on the way to a paging event if you can possibly do this.

The Lustre client (and most NFS or even network block devices) all do memory allocation of buffers ... which is anathema to migrating pages out to disk. You can easily wind up in a "death spiral" race condition (and it sounds like you are there). You might be able to do something with iSCSI or SRP (though these also do block allocations and could trigger death spirals). If you can limit the number of buffers they allocate, and then force them to allocate the buffers at startup (by forcing some activity to the block device, and then pin this memory so that it can't be ejected ...), you might have a chance to do it as a block device. I think SRP can do this; I'm not sure if iSCSI initiators can pin buffers in RAM.

You might look at the swapz patches (we haven't integrated them into our kernel yet, but have been looking at it) to compress swap pages and store them ... in RAM. This may not work for you, but it could be an option.

Is there any particular reason you can't use a local drive for this (such as you don't have local drives, or they aren't big/fast enough)?

> Digging around search results for "swap over NFS" I've found a lot of
> discussion about race conditions and different patches to address
> this, but CONFIG_NFS_SWAP seems to be missing from the redhat kernel.
> And upon trying swap to an NFS server, I see the same behavior. Is
> swap to a network device doomed to always fail on Redhat EL 5 and if
> not, does anyone have a recipe for getting swap on lustre to work?
>
> I've also fiddled with min_free_kbytes and swappiness in an attempt to
> induce swapping before the node's memory is actually all gone but all
> this results in is an earlier hang with less memory having been used.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
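The swapz patches themselves are not shown here. As a rough sketch of the same compress-swap-into-RAM idea, the mainline zram module (a descendant of the compcache/ramzswap work) can be set up roughly as follows, assuming a kernel that ships the module; the size and priority are illustrative:

    modprobe zram num_devices=1
    echo $((1024*1024*1024)) > /sys/block/zram0/disksize   # 1 GB, illustrative
    mkswap /dev/zram0
    swapon -p 10 /dev/zram0   # higher priority than any network-backed swap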
On Wed, 2011-08-17 at 22:57 -0400, Joe Landman wrote:
> The Lustre client (and most NFS or even network block devices) all do
> memory allocation of buffers ... which is anathema to migrating pages
> out to disk. You can easily wind up in a "death spiral" race condition
> (and it sounds like you are there). You might be able to do something
> with iSCSI or SRP (though these also do block allocations and could
> trigger death spirals).

Your post is generally correct, but minor nit here: there is no memory allocation on the command path of the Linux SRP initiator, so the death spiral is not possible there. I suspect the iSCSI initiator takes similar precautions -- or uses mempools -- to avoid this fate, but I'm not as familiar with that code.

Cheers,
Dave
On Wed, Aug 17, 2011 at 8:57 PM, Joe Landman <landman at scalableinformatics.com> wrote:
> On 08/17/2011 10:43 PM, John Hanks wrote:
> As a rule of thumb, you should try to keep the path to swap as simple as
> possible. No memory/buffer allocations on the way to a paging event if
> you can possibly do this.

I do have a long path there, will try simplifying that and see if it helps.

> The Lustre client (and most NFS or even network block devices) all do
> memory allocation of buffers ... which is anathema to migrating pages
> out to disk. You can easily wind up in a "death spiral" race condition
> (and it sounds like you are there). You might be able to do something
> with iSCSI or SRP (though these also do block allocations and could
> trigger death spirals). If you can limit the number of buffers they
> allocate, and then force them to allocate the buffers at startup (by
> forcing some activity to the block device, and then pin this memory so
> that it can't be ejected ...), you might have a chance to do it as a
> block device. I think SRP can do this; I'm not sure if iSCSI initiators
> can pin buffers in RAM.
>
> You might look at the swapz patches (we haven't integrated them into our
> kernel yet, but have been looking at it) to compress swap pages and
> store them ... in RAM. This may not work for you, but it could be an
> option.

I wasn't aware of swapz, that sounds really interesting. The codes that run the nodes out of memory tend to be sequencing applications, which seem like good candidates for memory compression.

> Is there any particular reason you can't use a local drive for this
> (such as you don't have local drives, or they aren't big/fast enough)?

We're doing this on diskless nodes. I'm not looking to get a huge amount of swap, just enough to provide a place for the root filesystem to page out of the tmpfs so we can squeeze out all the RAM possible for applications. Since I don't expect it to get heavily used, I'm considering running vblade on a server and carving out small AoE LUNs. It seems logical that if a host can boot off of iSCSI or AoE, you could have swap space there, but I've never tried it with either protocol.

FWIW, mounting a file on Lustre via loopback to provide a local scratch filesystem works really well.

jbh
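A bare-bones sketch of the vblade/AoE idea mentioned above; the shelf and slot numbers, interface name, file path, and size are all illustrative:

    # on the server: export a file as AoE shelf 0, slot 1 over eth0
    dd if=/dev/zero of=/srv/aoe/swap-node01 bs=1M count=4096
    vbladed 0 1 eth0 /srv/aoe/swap-node01

    # on the diskless client: the LUN appears as /dev/etherd/e<shelf>.<slot>
    modprobe aoe
    mkswap /dev/etherd/e0.1
    swapon /dev/etherd/e0.1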
John Hanks wrote:
> FWIW, mounting a file on Lustre via loopback to provide a local
> scratch filesystem works really well.

Mostly, but you probably need a kernel patch. See https://bugzilla.lustre.org/show_bug.cgi?id=22004

Kevin
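For context, the loopback scratch setup being discussed looks roughly like this; the paths, filesystem type, and size are illustrative:

    # create and format a per-node image file on the Lustre mount
    dd if=/dev/zero of=/mnt/lustre/scratch/$(hostname).img bs=1M count=10240
    mkfs.ext3 -F /mnt/lustre/scratch/$(hostname).img

    # mount it as local scratch via the loop driver
    mkdir -p /scratch
    mount -o loop /mnt/lustre/scratch/$(hostname).img /scratch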
On 08/17/2011 11:42 PM, David Dillow wrote:
> On Wed, 2011-08-17 at 22:57 -0400, Joe Landman wrote:
>> The Lustre client (and most NFS or even network block devices) all do
>> memory allocation of buffers ... which is anathema to migrating pages
>> out to disk. You can easily wind up in a "death spiral" race condition
>> (and it sounds like you are there). You might be able to do something
>> with iSCSI or SRP (though these also do block allocations and could
>> trigger death spirals).
>
> Your post is generally correct, but minor nit here: there is no memory
> allocation on the command path of the Linux SRP initiator, so the death
> spiral is not possible there. I suspect the iSCSI initiator takes

Thanks for clarifying that. I know that during startup there is an allocation, but I wasn't sure about what happens after that.

> similar precautions -- or uses mempools -- to avoid this fate, but I'm
> not as familiar with that code.

I think they also try to pre-allocate as much as possible. One issue we've seen in conjunction with these involves skb allocations in some network drivers: in tight memory situations, when there are very few free pages, they can cause problems. Usually we just get a bunch of messages in the logs, but on some occasions the network device shuts down (unable to allocate send/receive buffers). We have seen this on igb, e1000e, and e1000 based networks.

> Cheers,
> Dave

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
On 2011-08-17, at 8:43 PM, John Hanks wrote:
> I've been trying to get swap on lustre to work with not much success
> using blockdev_attach and the resulting lloop0 device and using
> losetup and the resulting loop device. This thread
> (http://www.mail-archive.com/lustre-discuss at lists.lustre.org/msg00856.html)
> claims that it works, but in all my attempts almost as soon as swap is
> used (testing with memhog), the host hangs. In some cases it hangs
> hard, but on occasion if I'm patient enough the OOM will eventually
> kill something and the node will become responsive again. If I
> carefully increase memory with each successive memhog run I can get
> some pages to swap, but any real pressure always results in a hang.
> I'm attempting this on Redhat EL 5.6 with lustre 1.8.4 patchless
> client over IB.

Using IB is important for trying this out, since Lustre RDMA uses preallocated pages for the RPC, unlike TCP, where there can be problems allocating the TCP receive buffers.

That said, the swap-on-Lustre code was never really finished. If you are interested in debugging this and have some coding skills you could probably get some help with the debugging on the list. You need to have a serial console attached to the client node, grab the stack traces from the client to see where it is stuck allocating memory, and then remove/avoid/preallocate that allocation.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
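A minimal sketch of capturing those stack traces once a serial console is attached; the console device and baud rate are illustrative:

    # kernel boot parameter so console output goes to the serial line
    console=ttyS0,115200

    # on the client, enable magic SysRq and dump all task stacks when it wedges
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    # if the node is too wedged for a shell, send the SysRq sequence
    # over the serial console instead (break, then 't')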
Hello,

I experimented with swap on Lustre in as many ways as possible (without touching the code), and had the shortest path possible, to no avail. The code is not able to handle it at all, and the system always hung.

Without serious code rewrites, this isn't going to work for you.

-Jason
On Thu, Aug 18, 2011 at 12:36 AM, Temple Jason <jtemple at cscs.ch> wrote:
> Hello,
>
> I experimented with swap on Lustre in as many ways as possible (without
> touching the code), and had the shortest path possible, to no avail. The
> code is not able to handle it at all, and the system always hung.
>
> Without serious code rewrites, this isn't going to work for you.
>
> -Jason

Lacking the skills and time for what Andreas suggested, I think my approach will be to abandon this direction for the moment. Thanks to everyone for your responses. If I find time to learn how to provide useful debugging and feedback, the list will be the first to know :)

jbh
On 2011-08-18, at 12:36 AM, "Temple Jason" <jtemple at cscs.ch> wrote:
> I experimented with swap on Lustre in as many ways as possible (without
> touching the code), and had the shortest path possible, to no avail. The
> code is not able to handle it at all, and the system always hung.

Jason, did you try the lloop device? That was written for swap to use, to avoid the VFS, filesystem, and locking layers. It never made it to production quality, since no customer was interested in completing it, but it is definitely the best starting point.

> Without serious code rewrites, this isn't going to work for you.

That's a difficult assessment to make. A bunch of effort went into removing allocations in the IO path at one time, but it was never a priority to keep lloop working, so things may have regressed over time. IMHO it probably isn't a huge effort to get this working again, but someone in the community would need to invest the time to investigate the problems and fix the code. It would be best to start with just getting the lloop block device to work reliably, and use "lctl set_param debug=+malloc" to find allocations along the IO path, then move on to debugging swap.

Cheers, Andreas
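A rough sketch of that debugging loop; the test file, dd parameters, and dump path are illustrative:

    # enable allocation tracing in the Lustre debug log
    lctl set_param debug=+malloc

    # exercise the lloop block device directly, without swap in the picture
    # (assumes /mnt/lustre/testfile already exists)
    lctl blockdev_attach /mnt/lustre/testfile /dev/lloop0
    dd if=/dev/zero of=/dev/lloop0 bs=1M count=128 oflag=direct

    # dump the kernel debug log and look for allocations on the IO path
    lctl dk /tmp/lustre-debug.txt
    grep -i alloc /tmp/lustre-debug.txt | less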
Since you are considering other options, I would recommend NBD swap for this type of minimal swapping application. Look at the Linux Terminal Server Project (LTSP) implementation using nbdswapd, a simple inetd-driven script that creates and exports NBD swap spaces. We have been using this successfully on our diskless nodes for several years.

You can grab the nbdswapd script for the server side and the ltsp-init-common script for an example client setup from any of the latest LTSP binary distributions.

-Phil
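For comparison, a bare-bones manual equivalent of what nbdswapd automates might look like the following; the server name, port, file path, and size are illustrative, and the nbdswapd script described above handles creating and exporting these per client automatically:

    # on the server: create a sparse backing file and export it over NBD on port 2000
    dd if=/dev/zero of=/srv/nbd/swap-node01 bs=1M count=0 seek=4096
    nbd-server 2000 /srv/nbd/swap-node01

    # on the diskless client
    modprobe nbd
    nbd-client swapserver 2000 /dev/nbd0
    mkswap /dev/nbd0
    swapon /dev/nbd0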