We have just discovered that a large buffer cache generated from traversing a Lustre file system causes significant system overhead for applications with high memory demands. We have seen a 50% slowdown or worse for applications. Even High Performance Linpack, which has no file I/O whatsoever, is affected. The only remedy seems to be to empty the buffer cache by running "echo 3 > /proc/sys/vm/drop_caches".

Any hints on how to improve the situation are greatly appreciated.

System setup:

Client: dual-socket Sandy Bridge with 32 GB RAM and an InfiniBand connection to the Lustre servers. CentOS 6.4 with kernel 2.6.32-358.11.1.el6.x86_64 and Lustre v2.1.6 RPMs downloaded from the Whamcloud download site.

Lustre: 1 MDS and 4 OSSes running Lustre 2.1.3 (also from the Whamcloud site). Each OSS has 12 OSTs, 1.1 PB storage in total.

How to reproduce:

Traverse the Lustre file system until the buffer cache is large enough. In our case we run

  find . -print0 -type f | xargs -0 cat > /dev/null

on the client until the buffer cache reaches ~15-20 GB. (The Lustre file system has lots of small files, so this takes up to an hour.)

Kill the find process and start a single-node parallel application; we use HPL (High Performance Linpack). We run on all 16 cores of the system with 1 GB RAM per core (a normal run should complete in approx. 150 seconds). The system monitoring shows 10-20% system CPU overhead, and the HPL run takes more than 200 seconds. After running "echo 3 > /proc/sys/vm/drop_caches" the system performance goes back to normal, with a run time of 150 seconds.

I've created an infographic from our Ganglia graphs for the above scenario:

  https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png

Attached is an excerpt from perf top indicating that the kernel routine taking the most time is _spin_lock_irqsave, if that means anything to anyone.

Things tested:

It does not seem to matter whether we mount Lustre over InfiniBand or Ethernet.

Filling the buffer cache with files from an NFS file system does not degrade performance.

Filling the buffer cache with one large file does not degrade performance (tested with iozone).

Again, any hints on how to proceed are greatly appreciated.

Best regards,
Roy.

--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth-hYqmg196XYc@public.gmane.org
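P.S. To spell out the reproduction as commands (the mount point and the HPL launch line below are illustrative, not our exact setup):

  cd /lustre/fs                                         # example client mount point
  find . -print0 -type f | xargs -0 cat > /dev/null &   # fill the buffer cache in the background
  grep -E 'MemFree|^Cached' /proc/meminfo               # repeat until Cached reaches ~15-20 GB
  kill %1                                               # stop the traversal (interactive shell, job control)
  mpirun -np 16 ./xhpl                                  # run HPL; ~150 s is the clean baseline
  echo 3 > /proc/sys/vm/drop_caches                     # as root; brings run times back to normal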
Is this slowdown due to increased swap activity? If "yes", then try lowering the "swappiness" value. This will sacrifice buffer cache space to lower swap activity.

Take a look at http://en.wikipedia.org/wiki/Swappiness.

Roger S.

On 08/22/2013 08:51 AM, Roy Dragseth wrote:
> [...]
Carlson, Timothy S
2013-Aug-22 14:40 UTC
Re: Lustre buffer cache causes large system overhead.
FWIW, we have seen the same issues with Lustre 1.8.x and a slightly older RHEL 6 kernel. We do the "echo" as part of our Slurm prolog/epilog scripts. Not a fix, but a workaround before/after jobs run. No swap activity, but a very large buffer cache in use.

Tim

-----Original Message-----
From: lustre-discuss-bounces-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org On Behalf Of Roger Sersted
Sent: Thursday, August 22, 2013 7:22 AM
Subject: Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.

[...]
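A rough sketch of what such an epilog can look like (the script path and the slurm.conf hook below are illustrative, adapt to your site):

  #!/bin/bash
  # e.g. /etc/slurm/drop_caches.sh -- run by slurmd as root after each job
  # Flush the page cache plus dentries/inodes so the next job starts with a clean cache.
  sync
  echo 3 > /proc/sys/vm/drop_caches
  exit 0

Hook it up with Epilog=/etc/slurm/drop_caches.sh (or Prolog=) in slurm.conf; writing to drop_caches needs root, which the slurmd-run prolog/epilog normally has.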
Dragseth Roy Einar
2013-Aug-22 15:38 UTC
Re: Lustre buffer cache causes large system overhead.
Yes, we have also started emptying the BC on job startup, but it doesn't seem to cover all cases. We see similar symptoms in applications using NetCDF even if we drop the BC at job startup. The application writes a NetCDF file at 300-500 MB/s for 3-5 seconds; then, after the I/O is done, the client will spend 100% in _spin_lock_irqsave for up to a minute. The data has clearly left the client, as no IB traffic is detected during or after the spin_lock time until the application has completed a new time step and writes a new data chunk. The application uses approx. 1.2 GB per core, so the scenario is quite similar to the synthetic one I reported.

r.

On Thursday 22. August 2013 07.40.01 you wrote:
> [...]
Dragseth Roy Einar
2013-Aug-22 15:38 UTC
Re: Lustre buffer cache causes large system overhead.
No, I cannot detect any swap activity on the system.

r.

On Thursday 22. August 2013 09.21.33 you wrote:
> [...]
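For reference, the generic way to double-check that is to watch the swap columns while a job runs:

  vmstat 5                                    # si/so columns: swap-in/swap-out per interval; all zeros = no swapping
  grep -E 'SwapTotal|SwapFree' /proc/meminfo  # an unchanging SwapFree also means swap is untouched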
Dragseth Roy Einar
2013-Aug-23 11:29 UTC
Re: Lustre buffer cache causes large system overhead.
I tried to change swappiness from 0 to 95 but it did not have any impact on the system overhead.

r.

On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:
> [...]
You might also try increasing vfs_cache_pressure. This will reclaim inode and dentry caches faster. Maybe that's the problem, not page caches.

To be clear - I have no deep insight into Lustre's use of the client cache, but you said you have lots of small files, which, if Lustre uses the cache system like other filesystems, means it may be inodes/dentries. Filling up the page cache with files like you did in your other tests wouldn't have the same effect. Just my guess here.

We had some experience years ago with the opposite sort of problem. We have a big FTP server, and we want to *keep* inode/dentry data in the Linux cache, as there are often stupid numbers of files in directories. Files were always flowing through the server, so the page cache would force out the inode cache. I was surprised to find that with Linux there's no ability to set a fixed inode cache size - the best you can do is "suggest" with the cache pressure tunable.

Scott

On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:
> [...]
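If anyone wants to try that, the knob is an ordinary sysctl (the value 200 below is just an example; the default is 100, and higher values reclaim dentry/inode caches more aggressively):

  sysctl vm.vfs_cache_pressure          # show the current value (default 100)
  sysctl -w vm.vfs_cache_pressure=200   # as root; >100 biases reclaim towards dentries/inodes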
I forgot to add that 'slabtop' is a nice tool for watching this stuff.

Scott

On 8/23/2013 9:36 AM, Scott Nolin wrote:
> [...]
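For a quick one-shot look, sorted by cache size (the slab names to grep for are a guess and vary with the Lustre version):

  slabtop -o -s c | head -25                     # -o: print once and exit, -s c: sort by cache size
  grep -iE 'ldlm|lustre|dentry' /proc/slabinfo   # raw counts for the slab caches of interest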
How much RAM does the application use? Lustre is particularly slow at evicting memory from its caches, so my guess is that's what you're falling victim to:

1) Lustre has a large portion of RAM allocated for the page cache.
2) The application starts and begins to allocate a large portion of RAM.
3) The Linux kernel starts to reclaim from the page cache (i.e. Lustre). This is a sore spot for Lustre, so it causes application allocations to "stall", waiting for Lustre to reclaim.
4) Lustre finally finishes reclaiming from its page cache.
5) The application's allocation succeeds and it proceeds.

Of course, this is all speculation as I don't have much data to go on. How long does the drop_caches command take to complete? I have a feeling the drop_caches command just preloads steps 3 and 4 in the above sequence.

--
Cheers, Prakash

On Thu, Aug 22, 2013 at 03:51:32PM +0200, Roy Dragseth wrote:
> [...]

(excerpt from the perf top output attached to the original message:)
> Samples: 6M of event 'cycles', Event count (approx.): 634546877255
>  62.19%  libmkl_avx.so    [.] mkl_blas_avx_dgemm_kernel_0
>  13.30%  mca_btl_sm.so    [.] mca_btl_sm_component_progress
>   8.80%  libmpi.so.1.0.3  [.] opal_progress
>   5.29%  [kernel]         [k] _spin_lock_irqsave
>   1.41%  libmkl_avx.so    [.] mkl_blas_avx_dgemm_copyan
>   1.17%  mca_pml_ob1.so   [.] mca_pml_ob1_progress
>   0.88%  libmkl_avx.so    [.] mkl_blas_avx_dtrsm_ker_ruu_a4_b8
>   0.41%  [kernel]         [k] compaction_alloc
>   0.38%  [kernel]         [k] _spin_lock_irq
>   0.36%  mca_pml_ob1.so   [.] opal_progress@plt
>   0.33%  xhpl             [.] HPL_dlaswp06T
>   0.28%  libmkl_avx.so    [.] mkl_blas_avx_dgemm_copybt
>   0.24%  mca_pml_ob1.so   [.] mca_pml_ob1_send
>   0.18%  [kernel]         [k] _spin_lock
>   0.17%  [kernel]         [k] __mem_cgroup_commit_charge
>   0.16%  [kernel]         [k] mem_cgroup_lru_del_list
>   0.16%  [kernel]         [k] putback_lru_page
>   0.16%  [kernel]         [k] __mem_cgroup_uncharge_common
>   0.15%  xhpl             [.] HPL_dlatcpy
>   0.15%  xhpl             [.] HPL_dlaswp01T
>   0.15%  [kernel]         [k] clear_page_c
>   0.15%  xhpl             [.] HPL_dlaswp10N
>   0.13%  [kernel]         [k] list_del
>   0.13%  [kernel]         [k] free_hot_cold_page
>   0.13%  [kernel]         [k] free_pcppages_bulk
>   0.13%  [kernel]         [k] release_pages
>   0.13%  mca_pml_ob1.so   [.] mca_pml_ob1_recv
>   0.12%  [kernel]         [k] ____pagevec_lru_add
>   0.12%  [kernel]         [k] copy_user_generic_string
>   0.12%  [kernel]         [k] compact_zone
>   0.10%  xhpl             [.] __intel_ssse3_rep_memcpy
>   0.10%  [kernel]         [k] __list_add
>   0.10%  [kernel]         [k] lookup_page_cgroup
>   0.09%  [kernel]         [k] mem_cgroup_end_migration
>   0.08%  [kernel]         [k] mem_cgroup_prepare_migration
>   0.08%  [kernel]         [k] get_pageblock_flags_group
>   0.08%  [kernel]         [k] page_waitqueue
>   0.07%  [kernel]         [k] migrate_pages
>   0.07%  [kernel]         [k] __wake_up_bit
>   0.07%  [kernel]         [k] get_page
>   0.07%  [kernel]         [k] unlock_page
>   0.07%  [kernel]         [k] mem_cgroup_lru_add_list
>   0.06%  [kernel]         [k] page_fault
>   0.06%  [kernel]         [k] __alloc_pages_nodemask
>   0.06%  [kernel]         [k] put_page
>   0.06%  [kernel]         [k] compact_checklock_irqsave
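One cheap data point for that last question: time the drop itself, after syncing so dirty writeback doesn't get mixed into the number:

  sync
  time sh -c 'echo 3 > /proc/sys/vm/drop_caches'   # as root; the wall time roughly reflects the eviction cost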
Dragseth Roy Einar
2013-Aug-23 20:08 UTC
Re: Lustre buffer cache causes large system overhead.
Thanks for the suggestion! It didn't help, but as I read the documentation on vfs_cache_pressure in the kernel docs I noticed the next parameter, zone_reclaim_mode, which looked like it might be worth fiddling with. And what do you know, changing it from 0 to 1 made the system overhead vanish immediately!

I must admit I do not completely understand why this helps, but it seems to do the trick in my case. We'll put

  vm.zone_reclaim_mode = 1

into /etc/sysctl.conf from now on.

Thanks to all for the hints and comments on this.

A nice weekend to everyone, mine for sure is going to be...

r.

On Friday 23. August 2013 09.36.34 Scott Nolin wrote:
> [...]
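For anyone else doing the same, the immediate and the persistent form of the change (plain sysctl usage):

  sysctl -w vm.zone_reclaim_mode=1                      # takes effect immediately (as root)
  echo 'vm.zone_reclaim_mode = 1' >> /etc/sysctl.conf   # survive reboots
  sysctl -p                                             # re-read /etc/sysctl.conf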
Patrick Shopbell
2013-Aug-23 20:59 UTC
Re: Lustre buffer cache causes large system overhead.
Hi all -

I have watched this thread with much interest, and now I am even more interested/confused. :-)

Several months back, we had a very substantial slowdown on our MDS box. Interactive use of the box was very sluggish, even though the load was quite low. This was eventually solved by setting the opposite value for the variable in question:

  vm.zone_reclaim_mode = 0

And it was equally dramatic in its solution of our problem - the MDS started responding normally immediately afterwards. We went ahead and set the value to zero on all of our NUMA machines. (We are running Lustre 2.3.)

Clearly, I need to do some reading on Lustre and its various caching issues. This has been quite an interesting discussion.

Thanks everyone for such a great list.

--
Patrick

*--------------------------------------------------------------------*
| Patrick Shopbell          Department of Astronomy                   |
| pls-f+Cz5gDlz18dsUksgWXaj4dd74u8MsAO@public.gmane.org               |
| Mail Code 249-17          California Institute of Technology        |
| (626) 395-4097            Pasadena, CA 91125                        |
| (626) 568-9352 (FAX)      WWW: http://www.astro.caltech.edu/~pls/   |
*--------------------------------------------------------------------*

On 8/23/13 1:08 PM, Dragseth Roy Einar wrote:
> [...]
Brian O'Connor
2013-Aug-24 01:58 UTC
Re: Lustre buffer cache causes large system overhead.
Watch for swapping now. Turning zone reclaim on can cause the machine to swap if memory use goes outside of the NUMA node. You don't have much memory, though (which IMHO is the real issue), so this may not affect you.

-----Original Message-----
From: Dragseth Roy Einar [roy.dragseth@uit.no]
Sent: Friday, August 23, 2013 03:09 PM Central Standard Time
To: lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.

[...]
Dragseth Roy Einar
2013-Aug-24 08:08 UTC
Re: Lustre buffer cache causes large system overhead.
The kernel docs for zone_reclaim_mode indicates that a value of 0 makes sense on dedicated file servers like MDS/OSS as fetching cached data from another numa domain is much faster than going all the way to the disk. For clients that need the memory for computations a value of 1 seems to be the way to go as (I guess) it reduces the cross-domain traffic. r. On Friday 23. August 2013 13.59.44 Patrick Shopbell wrote:> Hi all - > I have watched this thread with much interest, and now I am even > more interested/confused. :-) > > Several months back, we had a very substantial slowdown on our > MDS box. Interactive use of the box was very sluggish, even > though the load was quite low. This was eventually solved by > setting the opposite value for the variable in question: > > vm.zone_reclaim_mode = 0 > > And it was equally dramatic in its solution of our problem - the MDS > started responding normally immediately afterwards. We went ahead > and set the value to zero on all of our NUMA machines. (We are > running Lustre 2.3.) > > Clearly, I need to do some reading on Lustre and its various caching > issues. This has been a quite interesting discussion. > > Thanks everyone for such a great list. > -- > Patrick > > *--------------------------------------------------------------------* > > | Patrick Shopbell Department of Astronomy | > | pls-f+Cz5gDlz18dsUksgWXaj4dd74u8MsAO@public.gmane.org Mail Code 249-17 | > | (626) 395-4097 California Institute of Technology | > | (626) 568-9352 (FAX) Pasadena, CA 91125 | > | WWW: http://www.astro.caltech.edu/~pls/ | > > *--------------------------------------------------------------------* > > On 8/23/13 1:08 PM, Dragseth Roy Einar wrote: > > Thanks for the suggestion! It didn''t help, but as I read the > > documentation on vfs_cache_pressure in the kernel docs I noticed the next > > parameter, zone_reclaim_mode, which looked like it might be worth > > fiddling with. And what do you know, changing it from 0 to 1 made the > > system overhead vanish immediately! > > > > I must admit I do not completely understand why this helps, but it seems > > to do the trick in my case. We''ll put > > > > vm.zone_reclaim_mode = 1 > > > > into /etc/sysctl.conf from now on. > > > > Thanks to all for the hints and comments on this. > > > > A nice weekend to everyone, mine for sure is going to be... > > r. > > > > On Friday 23. August 2013 09.36.34 Scott Nolin wrote: > >> You might also try increasing the vfs_cache_pressure. > >> > >> This will reclaim inode and dentry caches faster. Maybe that''s the > >> problem, not page caches. > >> > >> To be clear - I have no deep insight into Lustre''s use of the client > >> cache, but you said you has lots of small files, which if lustre uses > >> the cache system like other filesystems means it may be inodes/dentries. > >> Filling up the page cache with files like you did in your other tests > >> wouldn''t have the same effect. Just my guess here. > >> > >> We had some experience years ago with the opposite sort of problem. We > >> have a big ftp server, and we want to *keep* inode/dentry data in the > >> linux cache, as there are often stupid numbers of files in directories. > >> Files were always flowing through the server, so the page cache would > >> force out the inode cache. Was surprised to find with linux there''s no > >> ability to set a fixed inode cache size - the best you can do is > >> "suggest" with the cache pressure tunable. 
> >> > >> Scott > >> > >> On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote: > >>> I tried to change swapiness from 0 to 95 but it did not have any impact > >>> on > >>> the system overhead. > >>> > >>> r. > >>> > >>> On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote: > >>>> No, I cannot detect any swap activity on the system. > >>>> > >>>> r. > >>>> > >>>> On Thursday 22. August 2013 09.21.33 you wrote: > >>>>> Is this slowdown due to increased swap activity? If "yes", then try > >>>>> lowering the "swappiness" value. This will sacrifice buffer cache > >>>>> space > >>>>> to > >>>>> lower swap activity. > >>>>> > >>>>> Take a look at http://en.wikipedia.org/wiki/Swappiness. > >>>>> > >>>>> Roger S. > >>>>> > >>>>> On 08/22/2013 08:51 AM, Roy Dragseth wrote: > >>>>>> We have just discovered that a large buffer cache generated from > >>>>>> traversing a lustre file system will cause a significant system > >>>>>> overhead > >>>>>> for applications with high memory demands. We have seen a 50% > >>>>>> slowdown > >>>>>> or worse for applications. Even High Performance Linpack, that have > >>>>>> no > >>>>>> file IO whatsoever is affected. The only remedy seems to be to empty > >>>>>> the > >>>>>> buffer cache from memory by running "echo 3 > > >>>>>> /proc/sys/vm/drop_caches" > >>>>>> > >>>>>> Any hints on how to improve the situation is greatly appreciated. > >>>>>> > >>>>>> > >>>>>> System setup: > >>>>>> Client: Dual socket Sandy Bridge, with 32GB ram and infiniband > >>>>>> connection > >>>>>> to lustre server. CentOS 6.4, with kernel 2.6.32-358.11.1.el6.x86_64 > >>>>>> and > >>>>>> lustre v2.1.6 rpms downloaded from whamcloud download site. > >>>>>> > >>>>>> Lustre: 1 MDS and 4 OSS running Lustre 2.1.3 (also from whamcloud > >>>>>> site). > >>>>>> Each OSS has 12 OST, total 1.1 PB storage. > >>>>>> > >>>>>> How to reproduce: > >>>>>> > >>>>>> Traverse the lustre file system until the buffer cache is large > >>>>>> enough. > >>>>>> In our case we run > >>>>>> > >>>>>> find . -print0 -type f | xargs -0 cat > /dev/null > >>>>>> > >>>>>> on the client until the buffer cache reaches ~15-20GB. (The lustre > >>>>>> file > >>>>>> system has lots of small files so this takes up to an hour.) > >>>>>> > >>>>>> Kill the find process and start a single node parallel application, > >>>>>> we > >>>>>> use > >>>>>> HPL (high performance linpack). We run on all 16 cores on the system > >>>>>> with 1GB ram per core (a normal run should complete in appr. 150 > >>>>>> seconds.) The system monitoring shows a 10-20% system cpu overhead > >>>>>> and > >>>>>> the HPL run takes more than 200 secs. After running "echo 3 > > >>>>>> /proc/sys/vm/drop_caches" the system performance goes back to normal > >>>>>> with > >>>>>> a run time at 150 secs. > >>>>>> > >>>>>> I''ve created an infographic from our ganglia graphs for the above > >>>>>> scenario. > >>>>>> > >>>>>> https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead. > >>>>>> pn > >>>>>> g > >>>>>> > >>>>>> Attached is an excerpt from perf top indicating that the kernel > >>>>>> routine > >>>>>> taking the most time is _spin_lock_irqsave if that means anything to > >>>>>> anyone. > >>>>>> > >>>>>> > >>>>>> Things tested: > >>>>>> > >>>>>> It does not seem to matter if we mount lustre over infiniband or > >>>>>> ethernet. > >>>>>> > >>>>>> Filling the buffer cache with files from an NFS filesystem does not > >>>>>> degrade > >>>>>> performance. 
--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone: +47 77 64 41 07, fax: +47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth-hYqmg196XYc@public.gmane.org
An admin at another site sent me this info (thanks, Hans):

  kernel component, BZ#770545: In Red Hat Enterprise Linux 6.2 and Red Hat
  Enterprise Linux 6.3, the default value for sysctl vm.zone_reclaim_mode is
  now 0, whereas in Red Hat Enterprise Linux 6.1 it was 1.

Just a heads up for anyone planning an upgrade...
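So whether a node shows this behaviour out of the box partly depends on which point release it was installed with. A quick sanity check after an upgrade, using only standard sysctl tools (nothing Lustre-specific), could be:

  # what the running kernel is actually using
  sysctl vm.zone_reclaim_mode

  # is it pinned explicitly, or are we relying on the distro default?
  grep zone_reclaim_mode /etc/sysctl.conf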
r.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss