On Nov 25, 2008  16:53 -0500, Ben Evans wrote:
> I've only got a few physical clients (the most I've thrown at it is 11).
> I've tried multiple processes on them, but it only makes a moderate
> difference.
I'm assuming you also ran each process in a separate mountpoint.
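If not, something along these lines on each client would do it (the MGS nid
"mgsnode@o2ib" and fsname "testfs" below are only placeholders for your
actual configuration):

    # mount the same filesystem at two separate mount points,
    # then run one load process against each of them
    mkdir -p /mnt/lustre1 /mnt/lustre2
    mount -t lustre mgsnode@o2ib:/testfs /mnt/lustre1
    mount -t lustre mgsnode@o2ib:/testfs /mnt/lustre2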
> I've used a differing number of thread allocations, but regardless of
> how many I use, I still see them all go idle for 3-4 seconds. Sometimes
> there is a u-shaped graph in the number of threads that are running.
> The kjournald thread is only occasionally running during this interval.
How many OSTs do you have in the filesystem and how wide is the striping?
It is possible that the MDS is running out of precreated objects and is
being blocked waiting for the OST to return more.  If you have several
OSTs in the filesystem (and you are not striping over too many of them),
the object creates are spread across the OSTs, which should keep the MDS
supplied with precreated objects.
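To check quickly, something like the following shows the default stripe
count on the test directory and the OSTs in the filesystem (a sketch,
assuming a client mount at /mnt/lustre and a test directory
/mnt/lustre/testdir, adjust the paths to your setup):

    # default striping for new files created in the test directory
    lfs getstripe -d /mnt/lustre/testdir
    # per-OST usage, which also lists how many OSTs are in the filesystem
    lfs df /mnt/lustre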
While there is a pause, would it be possible to get a stack trace
(e.g. "echo t > /proc/sysrq-trigger") to see where the threads are
stuck?
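To capture that on the MDS during a stall (standard sysrq, nothing Lustre
specific):

    # make sure sysrq is enabled, dump all task stacks to the kernel log,
    # then save the log to a file
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/mds-stacks.txt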
You could also monitor the /proc/fs/lustre/osc/*/prealloc_* values on the
MDS to see if they are ever very close to each other (i.e. difference is
very low), and see what /proc/fs/lustre/osc/*/create_count is. This
shows how many objects the MDS is creating in a batch.
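A simple sampling loop for this would be something like the following
(a sketch; the exact filenames under the osc directories vary a bit between
releases, so adjust the globs to whatever you actually have on the MDS):

    # print the precreate counters for every OST once per second
    while true; do
        grep . /proc/fs/lustre/osc/*/prealloc_* \
               /proc/fs/lustre/osc/*/create_count
        sleep 1
    done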
> I have one solid state drive in-house, and will most likely be receiving
> a few more to use in my testing in the next few weeks, so any insight
> you have into tuning them for maximum performance would be greatly
> appreciated.
There is a patch in bugzilla (search for "SSD") that should help
performance in some cases with SSDs.  It shouldn't make a difference
with a 3-4s wait though.  In the meantime you could try running with
the MDS filesystem on
a ramdisk, just to see whether the slowdown is related to the disk or if
it is something else.
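A rough sketch of that experiment (the device, size, and fsname are only
examples, and this reformats the MDT, so do it on a scratch test filesystem
only):

    # create a 2GB ram block device (brd/rd ramdisk driver, size in KB)
    modprobe brd rd_nr=1 rd_size=2097152
    # format it as a combined MGS/MDT and mount it on the MDS node
    mkfs.lustre --reformat --fsname=testfs --mgs --mdt /dev/ram0
    mount -t lustre /dev/ram0 /mnt/mdt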
> If I greatly increase the caching on the server sides, I get very high
> performance while the cache is filling up, but if I exceed that,
> performance drops off quickly.
What do you mean by "increase the caching"?
> -----Original Message-----
> From: Andreas.Dilger at sun.com [mailto:Andreas.Dilger at sun.com] On Behalf
> Of Andreas Dilger
> Sent: Thursday, November 13, 2008 5:05 PM
> To: Ben Evans
> Cc: lustre-discuss at lists.lustre.org; Atul Vidwansa
> Subject: Re: [Lustre-discuss] Metadata performance question
>
> On Nov 12, 2008 15:46 -0500, Ben Evans wrote:
> > My MDS has ib on the front end, and a FC array on the back end in a
> > RAID 10 configuration.
> >
> >
> > I have a number of clients creating 0 byte files in different
> > directories. As the test runs, I can see the threads on the MDS
> > idling for 3-4 seconds before they all wake up, process whatever
> > they need to and go back to sleep.
>
> What is important to know is the number of clients. One of the current
> limitations on Lustre metadata scalability is that each client can only
> have a single _filesystem_modifying_ MDS request in flight at a time
> (i.e. create, rename, unlink, setattr). This is to make recovery of
> the filesystem manageable in the face of client asynchronous operations
> being replayed on the MDS.
>
> It sounds like the number of clients is lower than the number of MDS
> service threads, and the MDS threads are just sitting idle until the
> kernel wakes them up again to handle an incoming request.
>
> If ALL of the threads are idle for the same 3-4 seconds, it is possible
> they are waiting on the journal commit. One change that might improve
> performance here dramatically is to use SSD devices for the MDS storage,
> which has MUCH lower latency. If you decide to try that, give me a
> shout and I can advise you on further MDS filesystem tuning.
>
> > I haven't been able to find a bottleneck in the system.  There's
> > plenty of memory, changing various parameters in the /proc directory
> > on the clients and on the MDS doesn't help much at all.  There's
> > plenty of bandwidth on the ib network and on the FC.  CPU and memory
> > aren't overly taxed (~50% max).  Yet I can only get about 10k file
> > creates per second, once I've created enough files so that caching
>
> You could try either increasing the number of clients (if you have
> more), or as a temporary experiment try mounting the filesystem multiple
> times on each client (in a different location, e.g. /mnt/lustre1,
> /mnt/lustre2, etc) and run your load once for each mount point on the
> client.
>
> If either of these improves the performance then the bottleneck is in
> this 1-RPC-per-client limitation.  If the performance does not increase
> there is some other limitation.
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.