Jim, I'm forwarding this to lustre-discuss to get broader community input. I'm sure somebody has some experience with this.

Begin forwarded message:

> I am looking for information on how Lustre assigns and holds pages on
> client nodes across jobs. The motivation is that we want to make
> "huge" pages available to users. We have found that it is almost
> impossible to allocate very many "huge" pages, since Lustre holds
> scattered small pages across jobs. In fact, typically about 1/3 of
> compute node memory can be allocated as huge pages.
>
> We have done quite a lot of performance studies which show that a
> substantial percentage of jobs on Ranger have TLB misses as a major
> performance bottleneck. We estimate we might recover as much as an
> additional 5%-10% throughput if users could use huge pages.
>
> Therefore we would like to find a way to minimize the client memory
> which Lustre holds across jobs.
>
> Have you had anyone else mention this situation to you?
>
> Regards,
>
> Jim Browne
An easy way to reduce the client memory used by "Lustre" is to have an epilogue script run by SGE (or whatever scheduler/resource manager you use) that does something like this on every node:

# sync ; sleep 1 ; sync
# echo 3 > /proc/sys/vm/drop_caches

Kevin

Nathan Rutman wrote:
> Jim, I'm forwarding this to lustre-discuss to get broader community
> input. I'm sure somebody has some experience with this.
> [...]
> Therefore we would like to find a way to minimize the client memory
> which Lustre holds across jobs.
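For concreteness, a minimal sketch of the kind of epilogue Kevin describes; the script name and scheduler hookup are hypothetical, so adapt them to your own resource manager:

#!/bin/bash
# epilogue.sh -- hypothetical per-node cleanup run by the scheduler after each job.
# Flush dirty data out to the servers first, then ask the kernel to drop
# clean pagecache plus dentries and inodes ("3" selects both).
sync
sleep 1
sync
echo 3 > /proc/sys/vm/drop_caches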
On 2010-08-19, at 16:44, Kevin Van Maren wrote:
> An easy way to reduce the client memory used by "Lustre" is to have an
> epilogue script run by SGE (or whatever scheduler/resource manager you
> use) that does something like this on every node:
> # sync ; sleep 1 ; sync
> # echo 3 > /proc/sys/vm/drop_caches

Actually, my understanding is that /proc/sys/vm/drop_caches is NOT safe for production usage in all cases (i.e. there are bugs in some kernels, and it isn't actually meant for regular use, from what I've read).

Others use huge pages in their configuration, but they reserve them at node boot time. See https://bugzilla.lustre.org/show_bug.cgi?id=14323 for details.

If you want to flush all the memory used by a Lustre client between jobs, you can run "lctl set_param ldlm.namespaces.*.lru_size=clear". Unlike Kevin's suggestion, this is Lustre-specific; drop_caches will try to flush memory from everything.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Hello!

On Aug 19, 2010, at 7:07 PM, Andreas Dilger wrote:
> If you want to flush all the memory used by a Lustre client between
> jobs, you can run "lctl set_param ldlm.namespaces.*.lru_size=clear".

Actually, there is one extra bit that won't get freed by dropping locks: the Lustre debug logs (assuming a non-zero debug level). They can be cleared with "lctl clear".

Bye,
    Oleg
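Putting Andreas's and Oleg's suggestions together, a Lustre-only cleanup step for an epilogue might look roughly like this (a sketch, not a tested script):

#!/bin/bash
# Drop all cached Lustre locks, and with them the cached pages they cover.
lctl set_param ldlm.namespaces.*.lru_size=clear
# Discard the in-memory Lustre debug log buffer (only matters with a non-zero debug level).
lctl clear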
Last week there was an article on lwn.net about "Transparent hugepages", discussed during "The fourth Linux storage and filesystem summit". According to that article, we might be lucky and those patches might go into RHEL6.

If you do not have an lwn.net account you might need to wait a few weeks:
http://lwn.net/Articles/398846/

It links an older article about it, which should already be available to all:
http://lwn.net/Articles/359158/

And another one:
http://lwn.net/Articles/374424/

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
Hi Andreas,

On 08/19/2010 06:07 PM, Andreas Dilger wrote:
> Actually, my understanding is that /proc/sys/vm/drop_caches is NOT
> safe for production usage in all cases (i.e. there are bugs in some
> kernels, and it isn't actually meant for regular use, from what I've
> read).

That's good to know. But there are two parts to drop_caches, depending on what you write---do you know if the unsafety comes from the part that calls the 'slab' shrinkers or the part that calls invalidate_inode_pages()? I suppose it's the latter.

Do you have a pointer to a more specific description? I'm curious about which kernels are affected. I looked but didn't turn up much.

Thanks,
-John

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu (512) 471-9304
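For readers following the thread, the "two parts" John refers to correspond to the values documented in the kernel's Documentation/sysctl/vm.txt; exact behaviour and safety vary by kernel version:

echo 1 > /proc/sys/vm/drop_caches   # free pagecache only
echo 2 > /proc/sys/vm/drop_caches   # free dentries and inodes (via the slab shrinkers)
echo 3 > /proc/sys/vm/drop_caches   # free pagecache, dentries and inodes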
Oleg,

Thanks for the clarification.

Jim Browne

At 11:10 PM 8/19/2010, Oleg Drokin wrote:
> Actually, there is one extra bit that won't get freed by dropping
> locks: the Lustre debug logs (assuming a non-zero debug level).
> They can be cleared with "lctl clear".

James C. Browne
Department of Computer Science
University of Texas
Austin, Texas 78712
Phone - 512-471-9579 Fax - 512-471-8885
browne at cs.utexas.edu
http://www.cs.utexas.edu/users/browne
On 08/19/2010 11:10 PM, Oleg Drokin wrote:
> Actually, there is one extra bit that won't get freed by dropping
> locks: the Lustre debug logs (assuming a non-zero debug level).
> They can be cleared with "lctl clear".

Indeed, thanks. On Ranger, the compute nodes use compact flash drives for /, and so they depend on tmpfs filesystems for /tmp, /var/run, /var/log, and of course /dev/shm. So cleaning up these RAM-backed filesystems as much as practical before asking for any hugepages is also a win.

Also, in imitation of the systems that pre-allocate all needed hugepages at boot time, we are considering the idea of first pre-allocating a large chunk of memory (say 7/8) in hugepages, then mounting the Lustre filesystems, then releasing the hugepages. The hope is that Lustre's persistent structures will thereby fit into a more compact region of memory.

The main obstacle in testing all of this is that benchmarking the gains from a particular approach is difficult, since we have not yet found an easy way of producing external fragmentation of physical memory in short order. Suggestions are welcome.

Best,
-John

--
John L. Hammond, Ph.D.
ICES, The University of Texas at Austin
jhammond at ices.utexas.edu (512) 471-9304
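A rough sketch of the ordering John describes, using the standard /proc/sys/vm/nr_hugepages interface; the page count, MGS NID, filesystem name, and mount point below are hypothetical:

#!/bin/bash
# Hypothetical boot-time sequence: reserve most of memory as huge pages,
# mount Lustre so its long-lived allocations land in the remaining memory,
# then return the huge pages to the pool for later reservation by jobs.
echo 28000 > /proc/sys/vm/nr_hugepages        # roughly 7/8 of RAM on a hypothetical 64 GB node (2 MB pages)
grep HugePages_Total /proc/meminfo            # check how many were actually obtained
mount -t lustre mgs@o2ib:/scratch /scratch    # hypothetical MGS NID, fsname, and mount point
echo 0 > /proc/sys/vm/nr_hugepages            # release the reservation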
On 2010-08-20, at 07:21, John Hammond wrote:
> Also, in imitation of the systems that pre-allocate all needed
> hugepages at boot time, we are considering the idea of first
> pre-allocating a large chunk of memory (say 7/8) in hugepages, then
> mounting the Lustre filesystems, then releasing the hugepages. The
> hope is that Lustre's persistent structures will thereby fit into a
> more compact region of memory.

As discussed in https://bugzilla.lustre.org/show_bug.cgi?id=14323, which I previously referenced, the Lustre tunables are based on the total number of pages and do not take huge pages into account. Also, if the hugepages are released, there is no guarantee that you will be able to allocate them all again, due to small pinned memory structures _somewhere_ in the middle of each huge page.

If you are running a prologue/epilogue script, then you should tune the Lustre cache size based on the number of huge pages that will be used. The last time this was investigated, there was no way for Lustre to know from within the kernel how many huge pages were allocated without patching it. If that has changed in newer kernels, it would be possible to adjust the cache size dynamically based on this.

> The main obstacle in testing all of this is that benchmarking the
> gains from a particular approach is difficult, since we have not yet
> found an easy way of producing external fragmentation of physical
> memory in short order. Suggestions are welcome.

Running something like "slocate" across multiple filesystems will fill all of RAM with inodes/dentries, and if you pin some of these in memory (e.g. start a shell with some deep directory as its CWD), you should quickly be able to fragment your memory with unfreeable inode/dentry allocations.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
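As an illustration of the prologue-side tuning Andreas describes, one could size the client data cache against the job's huge-page reservation; treat the max_cached_mb tunable and the 1/4 heuristic below as assumptions to verify for your Lustre version:

#!/bin/bash
# Hypothetical prologue fragment: shrink the Lustre client data cache so it
# fits comfortably in the memory left over after the huge-page reservation.
hp_kb=$(awk '/^HugePages_Total/ {n=$2} /^Hugepagesize/ {sz=$2} END {print n*sz}' /proc/meminfo)
mem_kb=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
cache_mb=$(( (mem_kb - hp_kb) / 1024 / 4 ))   # e.g. cap the cache at 1/4 of non-hugepage memory
lctl set_param llite.*.max_cached_mb=$cache_mb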