Jim, I'm forwarding this to lustre-discuss to get broader community input. I'm sure somebody has some experience with this.

Begin forwarded message:

> I am looking for information on how Lustre assigns and holds pages on
> client nodes across jobs. The motivation is that we want to make
> "huge" pages available to users. We have found that it is almost
> impossible to allocate very many "huge" pages, since Lustre holds
> scattered small pages across jobs. In fact, typically about 1/3 of
> compute node memory can be allocated as huge pages.
>
> We have done quite a lot of performance studies which show that a
> substantial percentage of jobs on Ranger have TLB misses as a major
> performance bottleneck. We estimate we might recover as much as an
> additional 5%-10% throughput if users could use huge pages.
>
> Therefore we would like to find a way to minimize the client memory
> which Lustre holds across jobs.
>
> Have you had anyone else mention this situation to you?
>
> Regards,
>
> Jim Browne
An easy way to reduce the client memory used by "Lustre" is to have an epilogue script run by SGE (or whatever scheduler/resource manager you use) that does something like this on every node:

# sync ; sleep 1 ; sync
# echo 3 > /proc/sys/vm/drop_caches

Kevin

Nathan Rutman wrote:
> Jim, I'm forwarding this to lustre-discuss to get broader community
> input. I'm sure somebody has some experience with this.
> [...]
> Therefore we would like to find a way to minimize the client memory
> which Lustre holds across jobs.
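For concreteness, a minimal sketch of the kind of epilogue Kevin describes; the script name and scheduler hookup are hypothetical, so adapt them to your own resource manager:

#!/bin/bash
# epilogue.sh -- hypothetical per-node cleanup run by the scheduler after each job.
# Flush dirty data out to the servers first, then ask the kernel to drop
# clean pagecache plus dentries and inodes ("3" selects both).
sync
sleep 1
sync
echo 3 > /proc/sys/vm/drop_caches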
On 2010-08-19, at 16:44, Kevin Van Maren wrote:
> An easy way to reduce the client memory used by "Lustre" is to have an
> epilogue script run by SGE (or whatever scheduler/resource manager you
> use) that does something like this on every node:
> # sync ; sleep 1 ; sync
> # echo 3 > /proc/sys/vm/drop_caches

Actually, my understanding is that /proc/sys/vm/drop_caches is NOT safe for production usage in all cases (i.e. there are bugs in some kernels, and it isn't actually meant for regular use, from what I've read).

Others use huge pages in their configuration, but they reserve them at node boot time. See https://bugzilla.lustre.org/show_bug.cgi?id=14323 for details.

If you want to flush all the memory used by a Lustre client between jobs, you can run "lctl set_param ldlm.namespaces.*.lru_size=clear". Unlike Kevin's suggestion, this is Lustre-specific; drop_caches will try to flush memory from everything.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Hello!

On Aug 19, 2010, at 7:07 PM, Andreas Dilger wrote:
> If you want to flush all the memory used by a Lustre client between
> jobs, you can run "lctl set_param ldlm.namespaces.*.lru_size=clear".

Actually, there is one extra bit that won't get freed by dropping locks: the Lustre debug logs (assuming a non-zero debug level). They can be cleared with "lctl clear".

Bye,
    Oleg
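Putting Andreas's and Oleg's suggestions together, a Lustre-only cleanup step for an epilogue might look roughly like this (a sketch, not a tested script):

#!/bin/bash
# Drop all cached Lustre locks, and with them the cached pages they cover.
lctl set_param ldlm.namespaces.*.lru_size=clear
# Discard the in-memory Lustre debug log buffer (only matters with a non-zero debug level).
lctl clear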
Last week there was an article on lwn.net about "Transparent hugepages", discussed during "The fourth Linux storage and filesystem summit". According to that article, we might be lucky and those patches might go into RHEL6.

If you do not have an lwn.net account you might need to wait a few weeks:
http://lwn.net/Articles/398846/

It links an older article about it, which should already be available to all:
http://lwn.net/Articles/359158/

And another one:
http://lwn.net/Articles/374424/

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
Hi Andreas,

On 08/19/2010 06:07 PM, Andreas Dilger wrote:
> Actually, my understanding is that /proc/sys/vm/drop_caches is NOT
> safe for production usage in all cases (i.e. there are bugs in some
> kernels, and it isn't actually meant for regular use, from what I've
> read).

That's good to know. But there are two parts to drop_caches, depending on what you write---do you know if the unsafety comes from the part that calls the 'slab' shrinkers or the part that calls invalidate_inode_pages()? I suppose it's the latter.

Do you have a pointer to a more specific description? I'm curious about which kernels are affected. I looked but didn't turn up much.

Thanks,
-John

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu (512) 471-9304
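For readers following the thread, the "two parts" John refers to correspond to the values documented in the kernel's Documentation/sysctl/vm.txt; exact behaviour and safety vary by kernel version:

echo 1 > /proc/sys/vm/drop_caches   # free pagecache only
echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes (via the slab shrinkers)
echo 3 > /proc/sys/vm/drop_caches   # free pagecache, dentries and inodes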
Oleg,

Thanks for the clarification.

Jim Browne

At 11:10 PM 8/19/2010, Oleg Drokin wrote:
> Actually, there is one extra bit that won't get freed by dropping
> locks: the Lustre debug logs (assuming a non-zero debug level).
> They can be cleared with "lctl clear".

James C. Browne
Department of Computer Science
University of Texas
Austin, Texas 78712
Phone - 512-471-9579 Fax - 512-471-8885
browne at cs.utexas.edu
http://www.cs.utexas.edu/users/browne
On 08/19/2010 11:10 PM, Oleg Drokin wrote:
> Actually, there is one extra bit that won't get freed by dropping
> locks: the Lustre debug logs (assuming a non-zero debug level).
> They can be cleared with "lctl clear".

Indeed, thanks. On Ranger, the compute nodes use compact flash drives for /, and so they depend on tmpfs filesystems for /tmp, /var/run, /var/log, and of course /dev/shm. So cleaning up these RAM-backed filesystems as much as practical before asking for any hugepages is also a win.

Also, in imitation of the systems that pre-allocate all needed hugepages at boot time, we are considering the idea of first pre-allocating a large chunk of memory (say 7/8) in hugepages, then mounting the Lustre filesystems, then releasing the hugepages. The hope is that Lustre's persistent structures will thereby fit into a more compact region of memory.

The main obstacle in testing all of this is that benchmarking the gains from a particular approach is difficult, since we have not yet found an easy way of producing external fragmentation of physical memory in short order. Suggestions are welcome.

Best,
-John

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu (512) 471-9304
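A rough sketch of the ordering John describes, using the standard /proc/sys/vm/nr_hugepages interface; the page count, MGS NID, filesystem name, and mount point below are hypothetical:

#!/bin/bash
# Hypothetical boot-time sequence: reserve most of memory as huge pages,
# mount Lustre so its long-lived allocations land in the remaining memory,
# then return the huge pages to the pool for later reservation by jobs.
echo 28000 > /proc/sys/vm/nr_hugepages        # roughly 7/8 of RAM on a hypothetical 64 GB node (2 MB pages)
grep HugePages_Total /proc/meminfo            # check how many were actually obtained
mount -t lustre mgs@o2ib:/scratch /scratch    # hypothetical MGS NID, fsname, and mount point
echo 0 > /proc/sys/vm/nr_hugepages            # release the reservation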
On 2010-08-20, at 07:21, John Hammond wrote:
> Also, in imitation of the systems that pre-allocate all needed
> hugepages at boot time, we are considering the idea of first
> pre-allocating a large chunk of memory (say 7/8) in hugepages, then
> mounting the Lustre filesystems, then releasing the hugepages. The
> hope is that Lustre's persistent structures will thereby fit into a
> more compact region of memory.

As discussed in https://bugzilla.lustre.org/show_bug.cgi?id=14323, which I previously referenced, the Lustre tunables are based on the total number of pages and do not take huge pages into account. Also, if the hugepages are released, there is no guarantee that you will be able to allocate them all again, due to small pinned memory structures _somewhere_ in the middle of each huge page.

If you are running a prologue/epilogue script, then you should tune the Lustre cache size based on the number of huge pages that will be used. The last time this was investigated, there was no way for Lustre to know from within the kernel how many huge pages were allocated without patching it. If that has changed in newer kernels, it would be possible to adjust the cache size dynamically based on this.

> The main obstacle in testing all of this is that benchmarking the
> gains from a particular approach is difficult, since we have not yet
> found an easy way of producing external fragmentation of physical
> memory in short order. Suggestions are welcome.

Running something like "slocate" across multiple filesystems will fill all of RAM with inodes/dentries, and if you pin some of these in memory (e.g. start a shell with some deep directory as its CWD), you should quickly be able to fragment your memory with unfreeable inode/dentry allocations.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
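As an illustration of the prologue-side tuning Andreas describes, one could size the client data cache against the job's huge-page reservation; treat the max_cached_mb tunable and the 1/4 heuristic below as assumptions to verify for your Lustre version:

#!/bin/bash
# Hypothetical prologue fragment: shrink the Lustre client data cache so it
# fits comfortably in the memory left over after the huge-page reservation.
hp_kb=$(awk '/^HugePages_Total/ {n=$2} /^Hugepagesize/ {sz=$2} END {print n*sz}' /proc/meminfo)
mem_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
cache_mb=$(( (mem_kb - hp_kb) / 1024 / 4 ))   # e.g. cap the cache at 1/4 of non-hugepage memory
lctl set_param llite.*.max_cached_mb=$cache_mb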