Hi,

We have a (small) 30-node SGE-based cluster running CentOS 4, which will grow to a maximum of 50 Core Duo machines. We use custom software based on gmake to launch parallel compilation and computations with lots of small files and some large files. We currently use NFS and have a lot of problems with incoherencies between nodes.

I'm currently evaluating Lustre and have some questions about Lustre's overhead with small files. I successfully installed the RPMs on a test machine and launched the local llmount.sh script. The first thing I tried was an svn checkout into it (lots of small files...). It takes 1m54 from our local svn server, versus 15s into a local ext3 filesystem and 50s over NFS.

During the checkout, the processor (AMD64 3200) is busy with 90% system time.

How come there is so much system time?
Is there something to tweak to lower this overhead?
Is there a specific tweak for small files?
Will the performance be better with multiple server nodes?

Thanks
Hi,

Please read the I/O tunables in the /proc section of the Lustre manual. I tried that with the postmark benchmark and saw some improvement after applying the suggestions there.

Regards,
Balagopal

Joe Barjo wrote:
> [...]
On Feb 29, 2008 15:37 +0100, Joe Barjo wrote:
> We have a (small) 30 node sge based cluster with centos4 which will be
> growing to maximum 50 core duos.
> We use custom software that is based on gmake to launch parallel
> compilation and computations with lot of small files and some large files.
>
> I'm currently evaluating lustre and have some questions about lustre
> overhead with small files.
> I successfully installed the rpms on a test machine and launched the
> local lmount.sh script.

Note that if you are using the unmodified llmount.sh script, this is running on loopback files in /tmp, so the performance is likely quite bad. For a realistic performance measure, put the MDT and OST on separate disks.

> The first thing I tried is to make a svn checkout into it. (lot of small
> files...)
> It takes 1m54 from our local svn server versus 15s into a local ext3
> filesystem and 50s over nfs.
> During the checkout, the processor (amd64 3200) is busy with 90% system.
>
> How come is there so much system process?

Have you turned off debugging (sysctl -w lnet.debug=0)? Have you increased the DLM lock LRU sizes?

    for L in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
        echo 10000 > $L
    done

In 1.6.5/1.8.0 it will be possible to set this kind of parameter more easily with a new command:

    lctl set_param ldlm.namespaces.*.lru_size=10000

> Is there something to tweak to lower this overhead?
> Is there a specific tweak for small files?

Not really; this isn't Lustre's strongest point.

> Using multiple server nodes, will the performance be better?

Partly. There can only be a single MDT per filesystem, but it scales quite well with multiple clients. There can be many OSTs, but it isn't clear whether you are IO bound; it probably wouldn't hurt to have a few to give you a high IOPS rate.

Note that increasing the OST count also, by default, allows clients to cache more dirty data (32MB per OST). You can change this manually; the default is tuned for very large clusters (1000s of nodes):

    for C in /proc/fs/lustre/osc/*/max_dirty_mb; do
        echo 256 > $C
    done

Similarly, in 1.6.5/1.8.0 it will be possible to do:

    lctl set_param osc.*.max_dirty_mb=256

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
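Taken together, the client-side tunings above can be collected into one small script. This is a sketch based only on the commands in this thread; the /proc paths are the Lustre 1.6-era locations and are an assumption for any other version (1.6.5+/1.8.0 can use `lctl set_param` instead). It prints the commands rather than running them, so they can be reviewed first and then piped to a root shell.

```shell
#!/bin/sh
# Emit the client-side tuning commands suggested in this thread.
# Paths are the Lustre 1.6-era /proc locations (an assumption for
# other versions); review the output before applying anything.
tuning_commands() {
    # Disable debug logging (the biggest win reported in this thread)
    echo "sysctl -w lnet.debug=0"
    # Enlarge the DLM lock LRUs
    for L in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
        echo "echo 10000 > $L"
    done
    # Let each OSC cache more dirty data
    for C in /proc/fs/lustre/osc/*/max_dirty_mb; do
        echo "echo 256 > $C"
    done
}

tuning_commands            # review the commands first
# tuning_commands | sh     # then apply them as root on a real client
```

Printing before applying also keeps the script harmless on a machine without Lustre mounted, since the unexpanded globs are only ever echoed.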
Hi,

Turning off debugging made it much better: it went from 1m54 down to 25 seconds, but there is still 85% system time... I really think you should turn off debugging by default, or make it a BIG warning message. People who are trying Lustre for the first time are not going to debug Lustre. Also, while the debugging was documented, it was not clear from the documentation that disabling it would improve performance so much.

I will now make more tests and see how the coherency is...
Thanks for your support.

Andreas Dilger a écrit :
> [...]
I vote for the warning. A first-time user does debug his setup until it all works with confidence, and then should know to turn debugging off. For that you could have a startup warning plus emphasis in the documentation, i.e. mention it in the quickstart section as well.

Michael

-----Original Message-----
From: Joe Barjo [mailto:jobarjo78 at yahoo.fr]
Sent: Monday, March 03, 2008 01:24 AM Pacific Standard Time
To:
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] lustre and small files overhead

> [...]
Joe Barjo <mailto:jobarjo78 at yahoo.fr> wrote:
> Turning off debugging made it much better.
> It went from 1m54 down to 25 seconds, but still 85% of system processing...
> I really think you should turn off debugging by default, or make it appear
> as a BIG warning message.

What version of Lustre are you using? We have turned down the default debugging level in more recent versions of Lustre.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger a écrit :
> What version of Lustre are you using? We have turned down the default
> debugging level in more recent versions of Lustre.

I'm using the latest stable 1.6.4.2 CentOS 4 RPMs.
I made some more tests, and have set up a micro Lustre cluster on LVM volumes:

node a: MDS
nodes b and c: OSTs
nodes a,b,c,d,e,f: clients
Gigabit ethernet network.

Optimizations applied: lnet.debug=0, lru_size set to 10000, max_dirty_mb set to 1024.

The svn checkout takes 50s (15s on a local disk, 25s on a local Lustre demo with debug=0). Watching gkrellm, a single svn checkout consumes about 20% of the MDS system CPU with about 2.4 MB/s of ethernet traffic. There is about 6 MB/s of disk bandwidth on OST1 and up to 12-16 MB/s on OST2; network bandwidth on the OSTs is about 10 to 20 times lower than their disk bandwidth.

I then launched a compilation distributed over the 6 clients: MDS system CPU goes up to 60% (Athlon 64 3500+) with 12 MB/s on the ethernet, and the OSTs go up to the same level as in the previous test.

How come there is so much network communication on the MDT?
Why so much disk bandwidth on the OSTs; is it a readahead problem?

As I understood that the MDS cannot be load balanced, I don't see how Lustre is scalable to thousands of clients... It looks like Lustre is not made for this kind of application.

Best regards.

Andreas Dilger a écrit :
> [...]
On Mar 07, 2008 12:49 +0100, Joe Barjo wrote:
> I made some more tests, and have setup a micro lustre cluster on lvm volumes.
> node a: MDS
> node b and c: OST
> node a,b,c,d,e,f: clients
> Gigabit ethernet network.
> Made the optimizations: lnet.debug=0, lru_size to 10000, max_dirty_mb to 1024

For high RPC-rate operations, an interconnect like InfiniBand is better than ethernet.

> About 6MByte/s disk bandwidth on OST1, up to 12-16MB/s on OST2 disk
> bandwidth, network bandwidth on OST is about 10 to 20 times under disk
> bandwidth.
> Why so much disk bandwidth on OSTs, is it a readahead problem?

That does seem strange; I can't really say why. There is some metadata overhead, and it is higher with small files, but I don't think it would be a 10-20x overhead.

> I launched a compilation distributed on the 6 clients:
> MDS system cpu goes up to 60% system resource (athlon 64 3500+)
> 12MByte/s on the ethernet, OST goes up to the same level as previous test.
>
> How come is there so much network communications on the MDT?

Because every metadata operation currently has to be done on the MDS. We are working toward having a metadata writeback cache on the client, but it doesn't exist yet. For operations like compilation, the cost is basically all metadata overhead.

> As I understood that the MDS can not be load balanced, I don't see how
> lustre is scalable to thousands of clients...

Because in many HPC environments there are very few metadata operations in comparison to the amount of data being read/written. Average file sizes are 20-30MB instead of 20-30kB.

> It looks like lustre is not made for this kind of application

No, it definitely isn't tuned for small files.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
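One way to confirm that a workload like compilation is metadata-bound is to total the per-operation request counters a client keeps for its MDC. The snippet below is only a sketch: the stats text is a hand-made sample in the shape of the 1.6-era /proc stats files, not captured from a real system (on a real client you would read /proc/fs/lustre/mdc/*/stats instead), and the awk one-liner simply sums the sample counts in column two.

```shell
#!/bin/sh
# Sum per-operation request counts from a Lustre-style stats file.
# The heredoc data below is illustrative, not from a real system.
sum_rpcs() {
    # skip the snapshot_time header line; column 2 is the sample count
    awk 'NR > 1 { total += $2 } END { print total }'
}

sum_rpcs <<'EOF'
snapshot_time         1204900000.000000 secs.usecs
open                  1500 samples [reqs]
close                 1500 samples [reqs]
getattr               9000 samples [reqs]
EOF
# prints 12000
```

A checkout or build that moves only a few MB of data but generates tens of thousands of such requests is paying almost entirely for metadata round-trips, which matches the MDS CPU and network load described above.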
Andreas Dilger a écrit :
> For high RPC-rate operations, an interconnect like InfiniBand is
> better than ethernet.

InfiniBand is not in our budget...

> That does seem strange; I can't really say why. There is some metadata
> overhead, and it is higher with small files, but I don't think it
> would be a 10-20x overhead.

The checked-out source is only 65 megabytes, so that much OST disk bandwidth is probably not normal; maybe you should verify this point. Are you sure there isn't an optimization for this? It looks like readahead or something similar.

> No, it definitely isn't tuned for small files.

Could it be tuned for small files one day? Which filesystem would you suggest for me? I have already tried NFS and AFS, and will now try GlusterFS.

Thanks for your support
Hi,

I tried GFS and OCFS2 before I came back to Lustre as the choice for an HPC cluster filesystem, and they were both quite good for small files. I wasn't happy with the stability of either at the time, and I also didn't want to deal with the quorum issues when all I wanted was a cluster filesystem and not the high-availability part of the solution. GFS and OCFS2 are worth benchmarking with small files. Please note that GFS2 is not considered production ready, so you would need to stick with RHEL4 instead of RHEL5 if staying with a Red Hat solution.

Regards,
Balagopal

Joe Barjo wrote:
> Could it be tuned for small files one day?
> Which filesystem would you suggest for me?
> [...]
Wojciech Turek
2008-Jun-10 17:32 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
Hi,

The Lustre operations manual says that the recommended values for the lru_size parameter are in the neighborhood of 1000 per OSC and 2500 per MDC on an interactively used client node. It also states: "Increasing this on a large number of client nodes is not recommended (this parameter controls locks cached on the client), though servers have been tested with up to 150,000 total locks (num_clients * lru_size)."

My question is: why not bigger values? What determines the maximum lru_size? I would like to be able to determine the biggest lru_size I can set on a fixed number of interactive clients (login nodes) with a fixed number of OSCs and MDCs. I found this (old) thread where Oleg Drokin suggests setting lru_size to 41000, which is significantly bigger than the value recommended in the manual:

http://lists.lustre.org/pipermail/lustre-discuss/2005-December/001040.html

Thank you,
Wojciech Turek
Jakob Goldbach
2008-Jun-10 18:18 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
> My question is why not bigger values? What determines the max lru_size.

The number of clients and the RAM on the servers. Locks are held by the server, and each client caches "copies" of locks even if they are unused. A thousand-node cluster, each node with an lru_size of 41000, would mean a lot of locks on the servers.

/Jakob
Wojciech Turek
2008-Jun-10 18:25 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
On 10 Jun 2008, at 19:18, Jakob Goldbach wrote:
>> My question is why not bigger values? What determines the max lru_size.
>
> The number of clients and ram on servers.

Is there a recommendation for how much RAM a Lustre server can spend on locks at most?

> Locks are held by server and client caches "copies" of locks even if
> unused. A thousand node cluster each with a lru_size of 41000 would mean
> a lot of locks on the servers.

I am rather thinking of increasing lru_size only on the few login nodes where users access the filesystem interactively.

Wojciech
Jakob Goldbach
2008-Jun-10 18:36 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
> I am rather thinking of increasing lru_size only on the few login nodes
> where users access the filesystem interactively.

I run with 20k on a few-node cluster without problems. I also increased max_dirty_mb. Both improved small-file I/O.

/Jakob
Cliff White
2008-Jun-12 19:42 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
Jakob Goldbach wrote:
> I run with 20k on a few-node cluster without problems. I also increased
> max_dirty_mb. Both improved small-file I/O.

Basically, you want the total (lru_size summed over all clients) to be less than ~500k. If you stay below that limit, you should have no issues. From a conversation a few years ago:

"If this is going to be a compile client instead of a compute client, the standard is to increase the MDC LRU size to 5000 and the OSC LRU size to 2000. If they don't have thousands of clients, this could be done on a large number of clients if this test represents their normal workload (aim for LRU * clients < 500k, which should be safe)."

cliffw
Andreas Dilger
2008-Jun-12 20:19 UTC
[Lustre-discuss] lustre and small files overhead lru_size question
On Jun 12, 2008 12:42 -0700, Cliff White wrote:
> Basically, you want the total (lru_size summed over all clients) to
> be less than ~500k. If you stay below that limit, you should have no issues.

Exceeding this limit isn't necessarily going to cause problems either; it is really a function of the RAM size on each server. TACC Ranger has 64k cores, so it would have a maximum of 64000 * 100 = 6.4M locks on each MDT/OST (16GB RAM or so?) with the default 100 locks/core on clients. With multiple OSTs per OSS this number would increase correspondingly.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
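The arithmetic behind these limits is simple enough to script. The sketch below uses only numbers quoted in this thread (the ~500k figure is Cliff's rule of thumb, the 64000-core and 30-node counts are the examples discussed above); any other cluster sizes are assumptions to plug in.

```shell
#!/bin/sh
# Back-of-envelope DLM lock budgets from the numbers in this thread.
clients=64000                      # e.g. TACC Ranger's core count
lru_size=100                       # default locks per client
total=$((clients * lru_size))
echo "server locks at defaults: $total"     # prints 6400000 (~6.4M)

budget=500000                      # Cliff's ~500k rule of thumb
small_cluster=30                   # e.g. the 30-node cluster from this thread
echo "safe per-client lru_size:  $((budget / small_cluster))"
```

For a small cluster the budget leaves a lot of headroom per client, which is consistent with the reports earlier in the thread of running lru_size around 10000-20000 on a handful of nodes without problems.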