We have just discovered that a large buffer cache generated from traversing a Lustre file system causes significant system overhead for applications with high memory demands. We have seen a 50% slowdown or worse for applications. Even High Performance Linpack, which has no file I/O whatsoever, is affected. The only remedy seems to be to empty the buffer cache by running "echo 3 > /proc/sys/vm/drop_caches".

Any hints on how to improve the situation are greatly appreciated.

System setup:

Client: dual-socket Sandy Bridge with 32 GB RAM and an InfiniBand connection to the Lustre servers. CentOS 6.4 with kernel 2.6.32-358.11.1.el6.x86_64 and Lustre v2.1.6 RPMs downloaded from the Whamcloud download site.

Lustre: 1 MDS and 4 OSSes running Lustre 2.1.3 (also from the Whamcloud site). Each OSS has 12 OSTs, 1.1 PB storage in total.

How to reproduce:

Traverse the Lustre file system until the buffer cache is large enough. In our case we run

  find . -print0 -type f | xargs -0 cat > /dev/null

on the client until the buffer cache reaches ~15-20 GB. (The Lustre file system has lots of small files, so this takes up to an hour.)

Kill the find process and start a single-node parallel application; we use HPL (High Performance Linpack). We run on all 16 cores of the system with 1 GB RAM per core (a normal run should complete in approx. 150 seconds). The system monitoring shows 10-20% system CPU overhead, and the HPL run takes more than 200 seconds. After running "echo 3 > /proc/sys/vm/drop_caches" the system performance goes back to normal, with a run time of 150 seconds.

I've created an infographic from our Ganglia graphs for the above scenario:

  https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png

Attached is an excerpt from perf top indicating that the kernel routine taking the most time is _spin_lock_irqsave, if that means anything to anyone.

Things tested:

It does not seem to matter whether we mount Lustre over InfiniBand or Ethernet.

Filling the buffer cache with files from an NFS file system does not degrade performance.

Filling the buffer cache with one large file does not degrade performance (tested with iozone).

Again, any hints on how to proceed are greatly appreciated.

Best regards,
Roy.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth-hYqmg196XYc@public.gmane.org
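P.S. To spell out the reproduction as commands (the mount point and the HPL launch line below are illustrative, not our exact setup):

  cd /lustre/fs                                         # example client mount point
  find . -print0 -type f | xargs -0 cat > /dev/null &   # fill the buffer cache in the background
  grep -E 'MemFree|^Cached' /proc/meminfo               # repeat until Cached reaches ~15-20 GB
  kill %1                                               # stop the traversal (interactive shell, job control)
  mpirun -np 16 ./xhpl                                  # run HPL; ~150 s is the clean baseline
  echo 3 > /proc/sys/vm/drop_caches                     # as root; brings run times back to normal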
Is this slowdown due to increased swap activity? If "yes", then try lowering the "swappiness" value. This will sacrifice buffer cache space to lower swap activity.

Take a look at http://en.wikipedia.org/wiki/Swappiness.

Roger S.

On 08/22/2013 08:51 AM, Roy Dragseth wrote:
> [...]
Carlson, Timothy S
2013-Aug-22 14:40 UTC
Re: Lustre buffer cache causes large system overhead.
FWIW, we have seen the same issues with Lustre 1.8.x and a slightly older RHEL 6 kernel. We do the "echo" as part of our Slurm prolog/epilog scripts. Not a fix, but a workaround before/after jobs run. No swap activity, but a very large buffer cache in use.

Tim

-----Original Message-----
From: lustre-discuss-bounces-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org On Behalf Of Roger Sersted
Sent: Thursday, August 22, 2013 7:22 AM
Subject: Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.

[...]
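A rough sketch of what such an epilog can look like (the script path and the slurm.conf hook below are illustrative, adapt to your site):

  #!/bin/bash
  # e.g. /etc/slurm/drop_caches.sh -- run by slurmd as root after each job
  # Flush the page cache plus dentries/inodes so the next job starts with a clean cache.
  sync
  echo 3 > /proc/sys/vm/drop_caches
  exit 0

Hook it up with Epilog=/etc/slurm/drop_caches.sh (or Prolog=) in slurm.conf; writing to drop_caches needs root, which the slurmd-run prolog/epilog normally has.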
Dragseth Roy Einar
2013-Aug-22 15:38 UTC
Re: Lustre buffer cache causes large system overhead.
Yes, we have also started emptying the BC on job startup, but it doesn't seem to cover all cases. We see similar symptoms in applications using NetCDF even if we drop the BC at job startup. The application writes a NetCDF file at 300-500 MB/s for 3-5 seconds; then, after the I/O is done, the client will spend 100% in _spin_lock_irqsave for up to a minute. The data has clearly left the client, as no IB traffic is detected during or after the spin_lock time until the application has completed a new time step and writes a new data chunk. The application uses approx. 1.2 GB per core, so the scenario is quite similar to the synthetic one I reported.

r.

On Thursday 22. August 2013 07.40.01 you wrote:
> [...]
Dragseth Roy Einar
2013-Aug-22 15:38 UTC
Re: Lustre buffer cache causes large system overhead.
No, I cannot detect any swap activity on the system.

r.

On Thursday 22. August 2013 09.21.33 you wrote:
> [...]
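For reference, the generic way to double-check that is to watch the swap columns while a job runs:

  vmstat 5                                    # si/so columns: swap-in/swap-out per interval; all zeros = no swapping
  grep -E 'SwapTotal|SwapFree' /proc/meminfo  # an unchanging SwapFree also means swap is untouched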
Dragseth Roy Einar
2013-Aug-23 11:29 UTC
Re: Lustre buffer cache causes large system overhead.
I tried to change swappiness from 0 to 95 but it did not have any impact on the system overhead.

r.

On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:
> [...]
You might also try increasing vfs_cache_pressure. This will reclaim inode and dentry caches faster. Maybe that's the problem, not page caches.

To be clear - I have no deep insight into Lustre's use of the client cache, but you said you have lots of small files, which, if Lustre uses the cache system like other filesystems, means it may be inodes/dentries. Filling up the page cache with files like you did in your other tests wouldn't have the same effect. Just my guess here.

We had some experience years ago with the opposite sort of problem. We have a big FTP server, and we want to *keep* inode/dentry data in the Linux cache, as there are often stupid numbers of files in directories. Files were always flowing through the server, so the page cache would force out the inode cache. I was surprised to find that with Linux there's no ability to set a fixed inode cache size - the best you can do is "suggest" with the cache pressure tunable.

Scott

On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:
> [...]
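If anyone wants to try that, the knob is an ordinary sysctl (the value 200 below is just an example; the default is 100, and higher values reclaim dentry/inode caches more aggressively):

  sysctl vm.vfs_cache_pressure          # show the current value (default 100)
  sysctl -w vm.vfs_cache_pressure=200   # as root; >100 biases reclaim towards dentries/inodes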
I forgot to add that 'slabtop' is a nice tool for watching this stuff.

Scott

On 8/23/2013 9:36 AM, Scott Nolin wrote:
> [...]
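For a quick one-shot look, sorted by cache size (the slab names to grep for are a guess and vary with the Lustre version):

  slabtop -o -s c | head -25                     # -o: print once and exit, -s c: sort by cache size
  grep -iE 'ldlm|lustre|dentry' /proc/slabinfo   # raw counts for the slab caches of interest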
How much RAM does the application use? Lustre is particularly slow at evicting memory from its caches, so my guess is that's what you're falling victim to:

1) Lustre has a large portion of RAM allocated for the page cache.
2) The application starts and begins to allocate a large portion of RAM.
3) The Linux kernel starts to reclaim from the page cache (i.e. Lustre). This is a sore spot for Lustre, so it causes application allocations to "stall", waiting for Lustre to reclaim.
4) Lustre finally finishes reclaiming from its page cache.
5) The application's allocation succeeds and it proceeds.

Of course, this is all speculation as I don't have much data to go on. How long does the drop_caches command take to complete? I have a feeling the drop_caches command just preloads steps 3 and 4 in the above sequence.

--
Cheers, Prakash

On Thu, Aug 22, 2013 at 03:51:32PM +0200, Roy Dragseth wrote:
> [...]

(excerpt from the perf top output attached to the original message:)
> Samples: 6M of event 'cycles', Event count (approx.): 634546877255
>  62.19%  libmkl_avx.so    [.] mkl_blas_avx_dgemm_kernel_0
>  13.30%  mca_btl_sm.so    [.] mca_btl_sm_component_progress
>   8.80%  libmpi.so.1.0.3  [.] opal_progress
>   5.29%  [kernel]         [k] _spin_lock_irqsave
>   1.41%  libmkl_avx.so    [.] mkl_blas_avx_dgemm_copyan
>   1.17%  mca_pml_ob1.so   [.] mca_pml_ob1_progress
>   0.88%  libmkl_avx.so    [.] mkl_blas_avx_dtrsm_ker_ruu_a4_b8
>   0.41%  [kernel]         [k] compaction_alloc
>   0.38%  [kernel]         [k] _spin_lock_irq
>   0.36%  mca_pml_ob1.so   [.] opal_progress@plt
>   0.33%  xhpl             [.] HPL_dlaswp06T
>   0.28%  libmkl_avx.so    [.] mkl_blas_avx_dgemm_copybt
>   0.24%  mca_pml_ob1.so   [.] mca_pml_ob1_send
>   0.18%  [kernel]         [k] _spin_lock
>   0.17%  [kernel]         [k] __mem_cgroup_commit_charge
>   0.16%  [kernel]         [k] mem_cgroup_lru_del_list
>   0.16%  [kernel]         [k] putback_lru_page
>   0.16%  [kernel]         [k] __mem_cgroup_uncharge_common
>   0.15%  xhpl             [.] HPL_dlatcpy
>   0.15%  xhpl             [.] HPL_dlaswp01T
>   0.15%  [kernel]         [k] clear_page_c
>   0.15%  xhpl             [.] HPL_dlaswp10N
>   0.13%  [kernel]         [k] list_del
>   0.13%  [kernel]         [k] free_hot_cold_page
>   0.13%  [kernel]         [k] free_pcppages_bulk
>   0.13%  [kernel]         [k] release_pages
>   0.13%  mca_pml_ob1.so   [.] mca_pml_ob1_recv
>   0.12%  [kernel]         [k] ____pagevec_lru_add
>   0.12%  [kernel]         [k] copy_user_generic_string
>   0.12%  [kernel]         [k] compact_zone
>   0.10%  xhpl             [.] __intel_ssse3_rep_memcpy
>   0.10%  [kernel]         [k] __list_add
>   0.10%  [kernel]         [k] lookup_page_cgroup
>   0.09%  [kernel]         [k] mem_cgroup_end_migration
>   0.08%  [kernel]         [k] mem_cgroup_prepare_migration
>   0.08%  [kernel]         [k] get_pageblock_flags_group
>   0.08%  [kernel]         [k] page_waitqueue
>   0.07%  [kernel]         [k] migrate_pages
>   0.07%  [kernel]         [k] __wake_up_bit
>   0.07%  [kernel]         [k] get_page
>   0.07%  [kernel]         [k] unlock_page
>   0.07%  [kernel]         [k] mem_cgroup_lru_add_list
>   0.06%  [kernel]         [k] page_fault
>   0.06%  [kernel]         [k] __alloc_pages_nodemask
>   0.06%  [kernel]         [k] put_page
>   0.06%  [kernel]         [k] compact_checklock_irqsave
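One cheap data point for that last question: time the drop itself, after syncing so dirty writeback doesn't get mixed into the number:

  sync
  time sh -c 'echo 3 > /proc/sys/vm/drop_caches'   # as root; the wall time roughly reflects the eviction cost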
Dragseth Roy Einar
2013-Aug-23 20:08 UTC
Re: Lustre buffer cache causes large system overhead.
Thanks for the suggestion! It didn't help, but as I read the documentation on vfs_cache_pressure in the kernel docs I noticed the next parameter, zone_reclaim_mode, which looked like it might be worth fiddling with. And what do you know, changing it from 0 to 1 made the system overhead vanish immediately!

I must admit I do not completely understand why this helps, but it seems to do the trick in my case. We'll put

  vm.zone_reclaim_mode = 1

into /etc/sysctl.conf from now on.

Thanks to all for the hints and comments on this.

A nice weekend to everyone, mine for sure is going to be...

r.

On Friday 23. August 2013 09.36.34 Scott Nolin wrote:
> [...]
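For anyone else doing the same, the immediate and the persistent form of the change (plain sysctl usage):

  sysctl -w vm.zone_reclaim_mode=1                      # takes effect immediately (as root)
  echo 'vm.zone_reclaim_mode = 1' >> /etc/sysctl.conf   # survive reboots
  sysctl -p                                             # re-read /etc/sysctl.conf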
Patrick Shopbell
2013-Aug-23 20:59 UTC
Re: Lustre buffer cache causes large system overhead.
Hi all -

I have watched this thread with much interest, and now I am even more interested/confused. :-)

Several months back, we had a very substantial slowdown on our MDS box. Interactive use of the box was very sluggish, even though the load was quite low. This was eventually solved by setting the opposite value for the variable in question:

  vm.zone_reclaim_mode = 0

And it was equally dramatic in its solution of our problem - the MDS started responding normally immediately afterwards. We went ahead and set the value to zero on all of our NUMA machines. (We are running Lustre 2.3.)

Clearly, I need to do some reading on Lustre and its various caching issues. This has been quite an interesting discussion.

Thanks everyone for such a great list.

--
Patrick

*--------------------------------------------------------------------*
| Patrick Shopbell          Department of Astronomy                   |
| pls-f+Cz5gDlz18dsUksgWXaj4dd74u8MsAO@public.gmane.org               |
| Mail Code 249-17          California Institute of Technology        |
| (626) 395-4097            Pasadena, CA 91125                        |
| (626) 568-9352 (FAX)      WWW: http://www.astro.caltech.edu/~pls/   |
*--------------------------------------------------------------------*

On 8/23/13 1:08 PM, Dragseth Roy Einar wrote:
> [...]
Brian O'Connor
2013-Aug-24 01:58 UTC
Re: Lustre buffer cache causes large system overhead.
Watch for swapping now. Turning zone reclaim on can cause the machine to swap if memory use goes outside of the NUMA node. You don't have much memory, though (which IMHO is the real issue), so this may not affect you.

-----Original Message-----
From: Dragseth Roy Einar [roy.dragseth@uit.no]
Sent: Friday, August 23, 2013 03:09 PM Central Standard Time
To: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.

[...]
Dragseth Roy Einar
2013-Aug-24 08:08 UTC
Re: Lustre buffer cache causes large system overhead.
The kernel docs for zone_reclaim_mode indicates that a value of 0 makes sense on dedicated file servers like MDS/OSS as fetching cached data from another numa domain is much faster than going all the way to the disk. For clients that need the memory for computations a value of 1 seems to be the way to go as (I guess) it reduces the cross-domain traffic. r. On Friday 23. August 2013 13.59.44 Patrick Shopbell wrote:> Hi all - > I have watched this thread with much interest, and now I am even > more interested/confused. :-) > > Several months back, we had a very substantial slowdown on our > MDS box. Interactive use of the box was very sluggish, even > though the load was quite low. This was eventually solved by > setting the opposite value for the variable in question: > > vm.zone_reclaim_mode = 0 > > And it was equally dramatic in its solution of our problem - the MDS > started responding normally immediately afterwards. We went ahead > and set the value to zero on all of our NUMA machines. (We are > running Lustre 2.3.) > > Clearly, I need to do some reading on Lustre and its various caching > issues. This has been a quite interesting discussion. > > Thanks everyone for such a great list. > -- > Patrick > > *--------------------------------------------------------------------* > > | Patrick Shopbell Department of Astronomy | > | pls-f+Cz5gDlz18dsUksgWXaj4dd74u8MsAO@public.gmane.org Mail Code 249-17 | > | (626) 395-4097 California Institute of Technology | > | (626) 568-9352 (FAX) Pasadena, CA 91125 | > | WWW: http://www.astro.caltech.edu/~pls/ | > > *--------------------------------------------------------------------* > > On 8/23/13 1:08 PM, Dragseth Roy Einar wrote: > > Thanks for the suggestion! It didn''t help, but as I read the > > documentation on vfs_cache_pressure in the kernel docs I noticed the next > > parameter, zone_reclaim_mode, which looked like it might be worth > > fiddling with. And what do you know, changing it from 0 to 1 made the > > system overhead vanish immediately! > > > > I must admit I do not completely understand why this helps, but it seems > > to do the trick in my case. We''ll put > > > > vm.zone_reclaim_mode = 1 > > > > into /etc/sysctl.conf from now on. > > > > Thanks to all for the hints and comments on this. > > > > A nice weekend to everyone, mine for sure is going to be... > > r. > > > > On Friday 23. August 2013 09.36.34 Scott Nolin wrote: > >> You might also try increasing the vfs_cache_pressure. > >> > >> This will reclaim inode and dentry caches faster. Maybe that''s the > >> problem, not page caches. > >> > >> To be clear - I have no deep insight into Lustre''s use of the client > >> cache, but you said you has lots of small files, which if lustre uses > >> the cache system like other filesystems means it may be inodes/dentries. > >> Filling up the page cache with files like you did in your other tests > >> wouldn''t have the same effect. Just my guess here. > >> > >> We had some experience years ago with the opposite sort of problem. We > >> have a big ftp server, and we want to *keep* inode/dentry data in the > >> linux cache, as there are often stupid numbers of files in directories. > >> Files were always flowing through the server, so the page cache would > >> force out the inode cache. Was surprised to find with linux there''s no > >> ability to set a fixed inode cache size - the best you can do is > >> "suggest" with the cache pressure tunable. 
> >> > >> Scott > >> > >> On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote: > >>> I tried to change swapiness from 0 to 95 but it did not have any impact > >>> on > >>> the system overhead. > >>> > >>> r. > >>> > >>> On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote: > >>>> No, I cannot detect any swap activity on the system. > >>>> > >>>> r. > >>>> > >>>> On Thursday 22. August 2013 09.21.33 you wrote: > >>>>> Is this slowdown due to increased swap activity? If "yes", then try > >>>>> lowering the "swappiness" value. This will sacrifice buffer cache > >>>>> space > >>>>> to > >>>>> lower swap activity. > >>>>> > >>>>> Take a look at http://en.wikipedia.org/wiki/Swappiness. > >>>>> > >>>>> Roger S. > >>>>> > >>>>> On 08/22/2013 08:51 AM, Roy Dragseth wrote: > >>>>>> We have just discovered that a large buffer cache generated from > >>>>>> traversing a lustre file system will cause a significant system > >>>>>> overhead > >>>>>> for applications with high memory demands. We have seen a 50% > >>>>>> slowdown > >>>>>> or worse for applications. Even High Performance Linpack, that have > >>>>>> no > >>>>>> file IO whatsoever is affected. The only remedy seems to be to empty > >>>>>> the > >>>>>> buffer cache from memory by running "echo 3 > > >>>>>> /proc/sys/vm/drop_caches" > >>>>>> > >>>>>> Any hints on how to improve the situation is greatly appreciated. > >>>>>> > >>>>>> > >>>>>> System setup: > >>>>>> Client: Dual socket Sandy Bridge, with 32GB ram and infiniband > >>>>>> connection > >>>>>> to lustre server. CentOS 6.4, with kernel 2.6.32-358.11.1.el6.x86_64 > >>>>>> and > >>>>>> lustre v2.1.6 rpms downloaded from whamcloud download site. > >>>>>> > >>>>>> Lustre: 1 MDS and 4 OSS running Lustre 2.1.3 (also from whamcloud > >>>>>> site). > >>>>>> Each OSS has 12 OST, total 1.1 PB storage. > >>>>>> > >>>>>> How to reproduce: > >>>>>> > >>>>>> Traverse the lustre file system until the buffer cache is large > >>>>>> enough. > >>>>>> In our case we run > >>>>>> > >>>>>> find . -print0 -type f | xargs -0 cat > /dev/null > >>>>>> > >>>>>> on the client until the buffer cache reaches ~15-20GB. (The lustre > >>>>>> file > >>>>>> system has lots of small files so this takes up to an hour.) > >>>>>> > >>>>>> Kill the find process and start a single node parallel application, > >>>>>> we > >>>>>> use > >>>>>> HPL (high performance linpack). We run on all 16 cores on the system > >>>>>> with 1GB ram per core (a normal run should complete in appr. 150 > >>>>>> seconds.) The system monitoring shows a 10-20% system cpu overhead > >>>>>> and > >>>>>> the HPL run takes more than 200 secs. After running "echo 3 > > >>>>>> /proc/sys/vm/drop_caches" the system performance goes back to normal > >>>>>> with > >>>>>> a run time at 150 secs. > >>>>>> > >>>>>> I''ve created an infographic from our ganglia graphs for the above > >>>>>> scenario. > >>>>>> > >>>>>> https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead. > >>>>>> pn > >>>>>> g > >>>>>> > >>>>>> Attached is an excerpt from perf top indicating that the kernel > >>>>>> routine > >>>>>> taking the most time is _spin_lock_irqsave if that means anything to > >>>>>> anyone. > >>>>>> > >>>>>> > >>>>>> Things tested: > >>>>>> > >>>>>> It does not seem to matter if we mount lustre over infiniband or > >>>>>> ethernet. > >>>>>> > >>>>>> Filling the buffer cache with files from an NFS filesystem does not > >>>>>> degrade > >>>>>> performance. 
--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth-hYqmg196XYc@public.gmane.org
An admin at another site sent me this info (thanks, Hans):

  kernel component, BZ#770545: In Red Hat Enterprise Linux 6.2 and Red Hat
  Enterprise Linux 6.3, the default value for sysctl vm.zone_reclaim_mode is
  now 0, whereas in Red Hat Enterprise Linux 6.1 it was 1.

Just a heads up for anyone planning an upgrade...
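So whether a node shows this behaviour out of the box partly depends on which point release it was installed with. A quick sanity check after an upgrade, using only standard sysctl tools (nothing Lustre-specific), could be:

  # what the running kernel is actually using
  sysctl vm.zone_reclaim_mode

  # is it pinned explicitly, or are we relying on the distro default?
  grep zone_reclaim_mode /etc/sysctl.conf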
r.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss