Hi,

We have a (small) 30-node SGE-based cluster running CentOS 4, which will grow to a maximum of 50 Core Duo machines. We use custom software based on gmake to launch parallel compilation and computations with lots of small files and some large files. We currently use NFS and have a lot of problems with incoherencies between nodes.

I'm currently evaluating Lustre and have some questions about Lustre's overhead with small files. I successfully installed the RPMs on a test machine and launched the local llmount.sh script. The first thing I tried was an svn checkout into it (lots of small files...). It takes 1m54 from our local svn server, versus 15s into a local ext3 filesystem and 50s over NFS.

During the checkout, the processor (AMD64 3200) is busy with 90% system time.

How come there is so much system time?
Is there something to tweak to lower this overhead?
Is there a specific tweak for small files?
Will the performance be better with multiple server nodes?

Thanks
Hi,

Please read the I/O tunables in the /proc section of the Lustre manual. I tried that with the postmark benchmark and saw some improvement after applying the suggestions there.

Regards,
Balagopal

Joe Barjo wrote:
> [...]
On Feb 29, 2008 15:37 +0100, Joe Barjo wrote:
> We have a (small) 30 node sge based cluster with centos4 which will be
> growing to maximum 50 core duos.
> We use custom software that is based on gmake to launch parallel
> compilation and computations with lot of small files and some large files.
>
> I'm currently evaluating lustre and have some questions about lustre
> overhead with small files.
> I successfully installed the rpms on a test machine and launched the
> local lmount.sh script.

Note that if you are using the unmodified llmount.sh script, this is running on loopback files in /tmp, so the performance is likely quite bad. For a realistic performance measure, put the MDT and OST on separate disks.

> The first thing I tried is to make a svn checkout into it. (lot of small
> files...)
> It takes 1m54 from our local svn server versus 15s into a local ext3
> filesystem and 50s over nfs.
> During the checkout, the processor (amd64 3200) is busy with 90% system.
>
> How come is there so much system process?

Have you turned off debugging (sysctl -w lnet.debug=0)? Have you increased the DLM lock LRU sizes?

    for L in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
        echo 10000 > $L
    done

In 1.6.5/1.8.0 it will be possible to set this kind of parameter more easily with a new command:

    lctl set_param ldlm.namespaces.*.lru_size=10000

> Is there something to tweak to lower this overhead?
> Is there a specific tweak for small files?

Not really; this isn't Lustre's strongest point.

> Using multiple server nodes, will the performance be better?

Partly. There can only be a single MDT per filesystem, but it scales quite well with multiple clients. There can be many OSTs, but it isn't clear whether you are IO bound; it probably wouldn't hurt to have a few to give you a high IOPS rate.

Note that increasing the OST count also, by default, allows clients to cache more dirty data (32MB per OST). You can change this manually; the default is tuned for very large clusters (1000s of nodes):

    for C in /proc/fs/lustre/osc/*/max_dirty_mb; do
        echo 256 > $C
    done

Similarly, in 1.6.5/1.8.0 it will be possible to do:

    lctl set_param osc.*.max_dirty_mb=256

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
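Taken together, the client-side tunings above can be collected into one small script. This is a sketch based only on the commands in this thread; the /proc paths are the Lustre 1.6-era locations and are an assumption for any other version (1.6.5+/1.8.0 can use `lctl set_param` instead). It prints the commands rather than running them, so they can be reviewed first and then piped to a root shell.

```shell
#!/bin/sh
# Emit the client-side tuning commands suggested in this thread.
# Paths are the Lustre 1.6-era /proc locations (an assumption for
# other versions); review the output before applying anything.
tuning_commands() {
    # Disable debug logging (the biggest win reported in this thread)
    echo "sysctl -w lnet.debug=0"
    # Enlarge the DLM lock LRUs
    for L in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
        echo "echo 10000 > $L"
    done
    # Let each OSC cache more dirty data
    for C in /proc/fs/lustre/osc/*/max_dirty_mb; do
        echo "echo 256 > $C"
    done
}

tuning_commands            # review the commands first
# tuning_commands | sh     # then apply them as root on a real client
```

Printing before applying also keeps the script harmless on a machine without Lustre mounted, since the unexpanded globs are only ever echoed.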
Hi,

Turning off debugging made it much better: it went from 1m54 down to 25 seconds, but there is still 85% system time... I really think you should turn off debugging by default, or make it a BIG warning message. People who are trying Lustre for the first time are not going to debug Lustre. Also, while the debugging was documented, it was not clear from the documentation that disabling it would improve performance so much.

I will now make more tests and see how the coherency is...
Thanks for your support.

Andreas Dilger a écrit :
> [...]
I vote for the warning. A first-time user does debug his setup until it all works with confidence, and then should know to turn debugging off. For that you could have a startup warning plus emphasis in the documentation, i.e. mention it in the quickstart section as well.

Michael

-----Original Message-----
From: Joe Barjo [mailto:jobarjo78 at yahoo.fr]
Sent: Monday, March 03, 2008 01:24 AM Pacific Standard Time
To:
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] lustre and small files overhead

> [...]
Joe Barjo <mailto:jobarjo78 at yahoo.fr> wrote:
> Turning off debugging made it much better.
> It went from 1m54 down to 25 seconds, but still 85% of system processing...
> I really think you should turn off debugging by default, or make it appear
> as a BIG warning message.

What version of Lustre are you using? We have turned down the default debugging level in more recent versions of Lustre.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger a écrit :
> What version of Lustre are you using? We have turned down the default
> debugging level in more recent versions of Lustre.

I'm using the latest stable 1.6.4.2 CentOS 4 RPMs.
I made some more tests, and have set up a micro Lustre cluster on LVM volumes:

node a: MDS
nodes b and c: OSTs
nodes a,b,c,d,e,f: clients
Gigabit ethernet network.

Optimizations applied: lnet.debug=0, lru_size set to 10000, max_dirty_mb set to 1024.

The svn checkout takes 50s (15s on a local disk, 25s on a local Lustre demo with debug=0). Watching gkrellm, a single svn checkout consumes about 20% of the MDS system CPU with about 2.4 MB/s of ethernet traffic. There is about 6 MB/s of disk bandwidth on OST1 and up to 12-16 MB/s on OST2; network bandwidth on the OSTs is about 10 to 20 times lower than their disk bandwidth.

I then launched a compilation distributed over the 6 clients: MDS system CPU goes up to 60% (Athlon 64 3500+) with 12 MB/s on the ethernet, and the OSTs go up to the same level as in the previous test.

How come there is so much network communication on the MDT?
Why so much disk bandwidth on the OSTs; is it a readahead problem?

As I understood that the MDS cannot be load balanced, I don't see how Lustre is scalable to thousands of clients... It looks like Lustre is not made for this kind of application.

Best regards.

Andreas Dilger a écrit :
> [...]
On Mar 07, 2008 12:49 +0100, Joe Barjo wrote:
> I made some more tests, and have setup a micro lustre cluster on lvm volumes.
> node a: MDS
> node b and c: OST
> node a,b,c,d,e,f: clients
> Gigabit ethernet network.
> Made the optimizations: lnet.debug=0, lru_size to 10000, max_dirty_mb to 1024

For high RPC-rate operations, an interconnect like InfiniBand is better than ethernet.

> About 6MByte/s disk bandwidth on OST1, up to 12-16MB/s on OST2 disk
> bandwidth, network bandwidth on OST is about 10 to 20 times under disk
> bandwidth.
> Why so much disk bandwidth on OSTs, is it a readahead problem?

That does seem strange; I can't really say why. There is some metadata overhead, and it is higher with small files, but I don't think it would be a 10-20x overhead.

> I launched a compilation distributed on the 6 clients:
> MDS system cpu goes up to 60% system resource (athlon 64 3500+)
> 12MByte/s on the ethernet, OST goes up to the same level as previous test.
>
> How come is there so much network communications on the MDT?

Because every metadata operation currently has to be done on the MDS. We are working toward having a metadata writeback cache on the client, but it doesn't exist yet. For operations like compilation, the cost is basically all metadata overhead.

> As I understood that the MDS can not be load balanced, I don't see how
> lustre is scalable to thousands of clients...

Because in many HPC environments there are very few metadata operations in comparison to the amount of data being read/written. Average file sizes are 20-30MB instead of 20-30kB.

> It looks like lustre is not made for this kind of application

No, it definitely isn't tuned for small files.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
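One way to confirm that a workload like compilation is metadata-bound is to total the per-operation request counters a client keeps for its MDC. The snippet below is only a sketch: the stats text is a hand-made sample in the shape of the 1.6-era /proc stats files, not captured from a real system (on a real client you would read /proc/fs/lustre/mdc/*/stats instead), and the awk one-liner simply sums the sample counts in column two.

```shell
#!/bin/sh
# Sum per-operation request counts from a Lustre-style stats file.
# The heredoc data below is illustrative, not from a real system.
sum_rpcs() {
    # skip the snapshot_time header line; column 2 is the sample count
    awk 'NR > 1 { total += $2 } END { print total }'
}

sum_rpcs <<'EOF'
snapshot_time         1204900000.000000 secs.usecs
open                  1500 samples [reqs]
close                 1500 samples [reqs]
getattr               9000 samples [reqs]
EOF
# prints 12000
```

A checkout or build that moves only a few MB of data but generates tens of thousands of such requests is paying almost entirely for metadata round-trips, which matches the MDS CPU and network load described above.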
Andreas Dilger a écrit :
> For high RPC-rate operations, an interconnect like InfiniBand is
> better than ethernet.

InfiniBand is not in our budget...

> That does seem strange; I can't really say why. There is some metadata
> overhead, and it is higher with small files, but I don't think it
> would be a 10-20x overhead.

The checked-out source is only 65 megabytes, so that much OST disk bandwidth is probably not normal; maybe you should verify this point. Are you sure there isn't an optimization for this? It looks like readahead or something similar.

> No, it definitely isn't tuned for small files.

Could it be tuned for small files one day? Which filesystem would you suggest for me? I have already tried NFS and AFS, and will now try GlusterFS.

Thanks for your support
Hi,

I tried GFS and OCFS2 before I came back to Lustre as the choice for an HPC cluster filesystem, and they were both quite good for small files. I wasn't happy with the stability of either at the time, and I also didn't want to deal with the quorum issues when all I wanted was a cluster filesystem and not the high-availability part of the solution. GFS and OCFS2 are worth benchmarking with small files. Please note that GFS2 is not considered production ready, so you would need to stick with RHEL4 instead of RHEL5 if staying with a Red Hat solution.

Regards,
Balagopal

Joe Barjo wrote:
> Could it be tuned for small files one day?
> Which filesystem would you suggest for me?
> [...]
Wojciech Turek
2008-Jun-10 17:32 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
Hi,

The Lustre operations manual says that the recommended values for the lru_size parameter are in the neighborhood of 1000 per OSC and 2500 per MDC on an interactively used client node. It also states: "Increasing this on a large number of client nodes is not recommended (this parameter controls locks cached on the client), though servers have been tested with up to 150,000 total locks (num_clients * lru_size)."

My question is: why not bigger values? What determines the maximum lru_size? I would like to be able to determine the biggest lru_size I can set on a fixed number of interactive clients (login nodes) with a fixed number of OSCs and MDCs. I found this (old) thread where Oleg Drokin suggests setting lru_size to 41000, which is significantly bigger than the value recommended in the manual:

http://lists.lustre.org/pipermail/lustre-discuss/2005-December/001040.html

Thank you,
Wojciech Turek
Jakob Goldbach
2008-Jun-10 18:18 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
> My question is why not bigger values? What determines the max lru_size.

The number of clients and the RAM on the servers. Locks are held by the server, and each client caches "copies" of locks even if they are unused. A thousand-node cluster, each node with an lru_size of 41000, would mean a lot of locks on the servers.

/Jakob
Wojciech Turek
2008-Jun-10 18:25 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
On 10 Jun 2008, at 19:18, Jakob Goldbach wrote:
>> My question is why not bigger values? What determines the max lru_size.
>
> The number of clients and ram on servers.

Is there a recommendation for how much RAM a Lustre server can spend on locks at most?

> Locks are held by server and client caches "copies" of locks even if
> unused. A thousand node cluster each with a lru_size of 41000 would mean
> a lot of locks on the servers.

I am rather thinking of increasing lru_size only on the few login nodes where users access the filesystem interactively.

Wojciech
Jakob Goldbach
2008-Jun-10 18:36 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
> I am rather thinking of increasing lru_size only on the few login nodes
> where users access the filesystem interactively.

I run with 20k on a few-node cluster without problems. I also increased max_dirty_mb. Both improved small-file I/O.

/Jakob
Cliff White
2008-Jun-12 19:42 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
Jakob Goldbach wrote:
> I run with 20k on a few-node cluster without problems. I also increased
> max_dirty_mb. Both improved small-file I/O.

Basically, you want the total (lru_size summed over all clients) to be less than ~500k. If you stay below that limit, you should have no issues. From a conversation a few years ago:

"If this is going to be a compile client instead of a compute client, the standard is to increase the MDC LRU size to 5000 and the OSC LRU size to 2000. If they don't have thousands of clients, this could be done on a large number of clients if this test represents their normal workload (aim for LRU * clients < 500k, which should be safe)."

cliffw
Andreas Dilger
2008-Jun-12 20:19 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
On Jun 12, 2008 12:42 -0700, Cliff White wrote:
> Basically, you want the total (lru_size summed over all clients) to
> be less than ~500k. If you stay below that limit, you should have no issues.

Exceeding this limit isn't necessarily going to cause problems either; it is really a function of the RAM size on each server. TACC Ranger has 64k cores, so it would have a maximum of 64000 * 100 = 6.4M locks on each MDT/OST (16GB RAM or so?) with the default 100 locks/core on clients. With multiple OSTs per OSS this number would increase correspondingly.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
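The arithmetic behind these limits is simple enough to script. The sketch below uses only numbers quoted in this thread (the ~500k figure is Cliff's rule of thumb, the 64000-core and 30-node counts are the examples discussed above); any other cluster sizes are assumptions to plug in.

```shell
#!/bin/sh
# Back-of-envelope DLM lock budgets from the numbers in this thread.
clients=64000                      # e.g. TACC Ranger's core count
lru_size=100                       # default locks per client
total=$((clients * lru_size))
echo "server locks at defaults: $total"     # prints 6400000 (~6.4M)

budget=500000                      # Cliff's ~500k rule of thumb
small_cluster=30                   # e.g. the 30-node cluster from this thread
echo "safe per-client lru_size:  $((budget / small_cluster))"
```

For a small cluster the budget leaves a lot of headroom per client, which is consistent with the reports earlier in the thread of running lru_size around 10000-20000 on a handful of nodes without problems.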