Okay, so since bug #12329 isn't really fixed yet, and we've been trying to do big I/O runs on our cluster, we keep bumping up against this issue. So I was thinking of some hacks to get around it with prep work prior to obtaining the big cluster to do whatever we want.

Just a little background on what we've been doing. PNL (Pacific Northwest National Labs) has a cluster that is unusual in the amount of local scratch disk we require on our compute nodes. Because of this we can occasionally take the big machine to do large scale I/O tests with Lustre. This helps us understand how Lustre scales for future use and how we might have to change the way we deal with our production Lustre systems when upgrading them. We have about 1TB/s of theoretical local scratch bandwidth across all of our compute nodes; so far we've only gotten about a quarter of that with a Lustre file system using half the system. We've since tried 4 times to put that image on the entire cluster and see how well things work, without getting any numbers so far.

The file system created has the following characteristics:

1) 700TB as reported by df -h
2) 4600 OSTs (well within the max of 8192)
3) 2300 OSSs

We could have an alternate configuration where we break the raid arrays and go a bit wider, 18000 OSTs on 2300 OSSs. But this would require modifications to lustre source code to make that happen.

The bug above really hits us hard on this system, even on lustre 1.8.0. It took 5 hours just to get 4096 OSTs mounted. As this is going on everything is in a constant state of reconnect, since the mgs/mdt is busy handling new incoming OSTs. I'm glad to say that the reconnects keep up and everything goes through the recovery as expected. When we've paused the mount process, the cluster settles back down and df -h returns about 5 minutes afterwards, which is very acceptable. However, there's a linear increase in the amount of reconnects and traffic associated with those reconnects as the number of OSTs increases during the mounting. This causes an increase in time for the next OST that has to mount. Keep in mind that this is on a brand new file system, not upgrading, not currently running. I would expect this behavior wouldn't happen (or would be slightly different) if the file system was already created.

Which leads me into the hack to get around the bug. I'm just wondering about thoughts or ideas as to what to watch for (or if it would even work).

Precreate mdt/mgs and ost images in a small form factor prior to production cluster time:

1) pick a system and put lustre on it.
2) setup an mdt/mgs combo and mount it
3) create an ost and mount it
4) umount it and save the image (should only be 10M or so; not sure what the smallest size would be).
5) deactivate the new ost
6) go to step 3 with the same disk you used before

You'd end up with pre-created images of a lustre file system prior to deployment that you could dd onto all the drives in parallel quite fast. You could then run resize2fs on the file systems to grow the OST to the appropriate size for that device (not sure how long this would take). Then you would run tunefs.lustre to change the mgsnode and fsname for that file system. Then all you'd have to do is mount, and the bug may be averted, right?

Just wondering if anyone has any thoughts or ideas on how to get around this issue.

Thanks,
- David Brown
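For concreteness, a rough shell sketch of the procedure described above. This is a sketch of the idea, not a tested recipe: the fsname ("scratch"), MGS NID, device names, OST count, image size, and the later fsname/NID are all placeholders, and the deactivation step is only noted in a comment because the osc device number has to be looked up on the MDS.

    #!/bin/bash
    # -- prep on a small scratch node --
    mkfs.lustre --fsname=scratch --mgs --mdt /dev/sda
    mkdir -p /mnt/mdt && mount -t lustre /dev/sda /mnt/mdt

    for i in $(seq 0 4599); do
        # small OST, reusing the same spare disk each pass (size is a guess;
        # --reformat is needed once the disk already holds a Lustre target)
        mkfs.lustre --reformat --fsname=scratch --ost --index=$i \
            --mgsnode=192.168.0.1@o2ib --device-size=262144 /dev/sdb
        mkdir -p /mnt/ost && mount -t lustre /dev/sdb /mnt/ost  # registers OST $i with the MGS
        umount /mnt/ost
        dd if=/dev/sdb of=/images/ost-$i.img bs=1M count=256    # save the image
        # then deactivate the new OST on the MDS so it isn't used for allocations,
        # e.g. "lctl dl" to find the osc device, then "lctl --device <n> deactivate"
    done

    # -- later, on each real OSS, for each OST device (run in parallel) --
    i=0; dev=/dev/sdc                                    # example values
    dd if=/images/ost-$i.img of=$dev bs=1M
    e2fsck -fy $dev                                      # resize2fs wants a clean fsck first
    resize2fs $dev                                       # grow the ldiskfs to fill the device
    tunefs.lustre --fsname=bigfs --mgsnode=10.0.0.1@o2ib $dev  # real fsname/MGS NID; may also need --writeconf
    mkdir -p /mnt/ost$i && mount -t lustre $dev /mnt/ost$i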
On May 13, 2009 11:06 -0700, David Brown wrote:
> The file system created has the following characteristics:
>
> 1) 700TB as reported by df -h
> 2) 4600 OSTs (well within the max of 8192)
> 3) 2300 OSSs
>
> We could have an alternate configuration where we break the raid
> arrays and go a bit wider, 18000 OSTs on 2300 OSSs. But this would
> require modifications to lustre source code to make that happen.

Hmm, even projecting out to the future I'm not certain we will get to systems with 18000 OSTs. The capacity of the disks is growing continuously, and with ZFS we can have very large individual OSTs, so even 2-3 years from now we're only looking at 1200 OSTs on 400 OSS nodes (115 PB filesystem with 30x4TB disks/OST in RAID-6 8+2 = 96TB/OST).

> The bug above really hits us hard on this system, even on lustre
> 1.8.0. It took 5 hours just to get 4096 OSTs mounted.

Ouch.

> As this is going on everything is in a constant state of reconnect,
> since the mgs/mdt is busy handling new incoming OSTs. I'm glad to say
> that the reconnects keep up and everything goes through the recovery
> as expected. When we've paused the mount process, the cluster settles
> back down and df -h returns about 5 minutes afterwards, which is very
> acceptable. However, there's a linear increase in the amount of
> reconnects and traffic associated with those reconnects as the number
> of OSTs increases during the mounting. This causes an increase in time
> for the next OST that has to mount. Keep in mind that this is on a
> brand new file system, not upgrading, not currently running. I would
> expect this behavior wouldn't happen (or would be slightly different)
> if the file system was already created.

Sounds unpleasant. I wonder if this is driven by the fact that the MGS clients (OSTs are also MGS clients) don't expect a huge amount of change at any one time, so they try to refetch the updated config in an eager manner. This probably increases the queue of requests on the MGS linearly with the number of OSTs, and new OST connections are getting backed up behind this. I was going to say that getting some RPC stats from the MGS service would be informative, but I can't see any MGS RPC stats file on my system...

I wonder if a longer-term solution is to have the MGS push the config log changes to the clients, instead of having the clients pull them.

> Precreate mdt/mgs and ost images in a small form factor prior to
> production cluster time:
>
> 1) pick a system and put lustre on it.
> 2) setup an mdt/mgs combo and mount it
> 3) create an ost and mount it
> 4) umount it and save the image (should only be 10M or so; not sure
> what the smallest size would be).

You only really need to mount the OST filesystem with "-t ldiskfs", tar up the contents of the OST filesystem, and save the filesystem label ({fsname}-OSTnnnn).

> 5) deactivate the new ost
> 6) go to step 3 with the same disk you used before
>
> You'd end up with pre-created images of a lustre file system prior to
> deployment that you could dd onto all the drives in parallel quite
> fast.
>
> You could then run resize2fs on the file systems to grow the OST to
> the appropriate size for that device (not sure how long this would
> take).

If you only save the tar image, then you can just do a normal format of the OST filesystem (using the mke2fs options as reported during the mkfs.lustre run, with the proper label "-L {fsname}-OSTnnnn" for that OST index), mount it with "-t ldiskfs", and then extract the tarball into it.

> Then you would run tunefs.lustre to change the mgsnode and fsname for
> that file system.
>
> Then all you'd have to do is mount and the bug may be averted, right?

Probably, yes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
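A similarly hedged sketch of the tar-based variant described above: the label, paths and OST index are examples, and the mke2fs options shown are only stand-ins for whatever mkfs.lustre actually reports on the real hardware.

    # save: after the small OST has registered with the MGS and been unmounted
    mount -t ldiskfs /dev/sdb /mnt/ost
    tar czf /images/ost-0005.tgz -C /mnt/ost .      # CONFIGS/, last_rcvd, O/, etc.
    e2label /dev/sdb > /images/ost-0005.label       # e.g. "scratch-OST0005"
    umount /mnt/ost

    # restore, on the real OSS: format natively at full size, then unpack
    mke2fs -j -I 256 -L scratch-OST0005 /dev/sdc    # plus the other options mkfs.lustre printed
    mount -t ldiskfs /dev/sdc /mnt/ost
    tar xzpf /images/ost-0005.tgz -C /mnt/ost
    umount /mnt/ost
    tunefs.lustre --mgsnode=10.0.0.1@o2ib /dev/sdc  # point at the real MGS
    mkdir -p /mnt/ost0005 && mount -t lustre /dev/sdc /mnt/ost0005

Formatting natively and unpacking a tarball avoids both the dd of a full image and the resize2fs pass, at the cost of having to reproduce the mke2fs options exactly.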
Hello!

On May 14, 2009, at 2:22 AM, Andreas Dilger wrote:
> Sounds unpleasant. I wonder if this is driven by the fact that the
> MGS clients (OSTs are also MGS clients) don't expect a huge amount of
> change at any one time, so they try to refetch the updated config in
> an eager manner. This probably increases the queue of requests on
> the MGS linearly with the number of OSTs, and new OST connections are
> getting backed up behind this.

Actually, just to combat situations like this, MGCs pause for a few seconds before refetching the config; I remember there was a bug and this measure was introduced as a fix.

What's interesting is that I actually have a 1200-OST system on a single node and the mount (even format & mount) takes nowhere near 5 hours. In fact I am up to ~950 OSTs mounted in around 20 minutes or so, I think, at which point the node usually OOMs (and it's all in a heavily swapping vmware too). And this setup does have some extra complications, like hitting a bug in the MGC where every target establishes its own connection to the MGS when only one connection for the entire node is needed; and then there is no dynamic lru enabled, so mgc locks are just pushed out of the lru and I see constant attempts to requeue the locks even after the mount is finished. Of course, on the other hand, even if I run mounts in parallel (and I do), the MGC still does not rush in with all of the requests in parallel, I think.

Bye,
Oleg
Hello!

On May 14, 2009, at 10:25 AM, Oleg Drokin wrote:
> Actually, just to combat situations like this, MGCs pause for a few
> seconds before refetching the config; I remember there was a bug
> and this measure was introduced as a fix.

Nic actually tuned in and said that the backoff (set at 3 seconds now) is certainly not enough, since it takes that long just to mount the actual on-disk fs.

Anyway, that got me thinking that we have a "coarse-grained" locking problem. Since OSTs don't connect to other OSTs, they do not care about OST connections, and perhaps if we introduce bit-locks to MGS locks as well to indicate client type, then locks from OSTs would only be revoked when the MDS connects or disconnects, MDS locks would only be revoked when OSTs connect or disconnect, and client locks would be revoked always.

Or alternatively we can split our single resource into a few separate ones: one for OSTs, one for MDSes, for example. Sure, that would mean clients would now have to take two locks, but on the other hand there would supposedly be less information to reparse when one of those locks is invalidated.

Nathan, what do you think?

Bye,
Oleg
> Hmm, even projecting out to the future I'm not certain we will get
> to systems with 18000 OSTs. The capacity of the disks is growing
> continuously, and with ZFS we can have very large individual OSTs,
> so even 2-3 years from now we're only looking at 1200 OSTs on 400 OSS
> nodes (115 PB filesystem with 30x4TB disks/OST in RAID-6 8+2 = 96TB/OST).

So how fast is this theoretical 115PB system?

With innovations like SSDs getting cheaper and larger, we may get faster and better performance. However, I'm still not sure we're going to keep up with the curve that processors are on.

However, I guess you guys really aren't too concerned about processing power increasing in particular. Just make a file system that can save all the memory in a supercomputer in N minutes, really.

Also, I'll be getting back with some interesting statistics on what each system is doing (memory used, processors working), and on an infiniband soft lockup detection that forces all ko2iblnd and ptlrpc kernel procs to go full bore, putting the load average at 16+ for a single OSS (this does sound like a bug; not sure if it's already been noticed, I'll have to do a bugzilla search).

Thanks,
- David Brown
On May 14, 2009 09:05 -0700, David Brown wrote:
> > Hmm, even projecting out to the future I'm not certain we will get
> > to systems with 18000 OSTs. The capacity of the disks is growing
> > continuously, and with ZFS we can have very large individual OSTs,
> > so even 2-3 years from now we're only looking at 1200 OSTs on 400 OSS
> > nodes (115 PB filesystem with 30x4TB disks/OST in RAID-6 8+2 = 96TB/OST).
>
> So how fast is this theoretical 115PB system?

About 1.5TB/s fast, if the math holds true.

> With innovations like SSDs getting cheaper and larger, we may get
> faster and better performance. However, I'm still not sure we're going
> to keep up with the curve that processors are on.

Sure, we are already testing SSDs and what we can do to improve the filesystem to take advantage of them. They aren't quite at the price point where people will be jumping on them, but I expect in a year or two they will become much more affordable.

> However, I guess you guys really aren't too concerned about processing
> power increasing in particular. Just make a file system that can save
> all the memory in a supercomputer in N minutes, really.

Well, CPU power -> more RAM to feed the CPU -> more disk to feed/purge RAM. So even though CPU/RAM size is increasing exponentially, disk size and, more importantly, disk performance are not growing so quickly. Even worse, seeks are almost static for spinning disks (8ms or so => 125 IOPS), so SSDs are really the only hope to keep up with filesystem requirements.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
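For reference, the back-of-envelope arithmetic behind those figures, as a quick shell check. The per-OST rate at the end is only inferred by dividing the stated 1.5TB/s total, not a number given above.

    # 3 x (8+2) RAID-6 groups of 4TB disks => 24 data disks per OST
    echo "TB per OST:   $((24 * 4))"            # 96
    echo "PB total:     $((1200 * 96 / 1000))"  # ~115
    echo "IOPS/spindle: $((1000 / 8))"          # ~8ms per seek => ~125
    # 1.5TB/s across 1200 OSTs works out to roughly 1.25GB/s per OST, ~3.75GB/s per OSS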
On May 14, 2009 11:48 -0400, Oleg Drokin wrote:
> On May 14, 2009, at 10:25 AM, Oleg Drokin wrote:
>> Actually, just to combat situations like this, MGCs pause for a few
>> seconds before refetching the config; I remember there was a bug
>> and this measure was introduced as a fix.
>
> Nic actually tuned in and said that the backoff (set at 3 seconds now)
> is certainly not enough, since it takes that long just to mount the
> actual on-disk fs.
>
> Anyway, that got me thinking that we have a "coarse-grained" locking
> problem. Since OSTs don't connect to other OSTs, they do not care about
> OST connections, and perhaps if we introduce bit-locks to MGS locks as
> well to indicate client type, then locks from OSTs would only be revoked
> when the MDS connects or disconnects, MDS locks would only be revoked
> when OSTs connect or disconnect, and client locks would be revoked always.
>
> Or alternatively we can split our single resource into a few separate
> ones: one for OSTs, one for MDSes, for example. Sure, that would mean
> clients would now have to take two locks, but on the other hand there
> would supposedly be less information to reparse when one of those locks
> is invalidated.

I would tend to prefer the latter. Having separate resource IDs for the different llogs makes it a lot cleaner in the end. Ideally, picking a relatively unique resource ID for that config log would allow us to separate the configs between different filesystems.

The OSTs in fact don't really need to read the same llog as the client for very many things (some shared tunables, perhaps), and there also isn't a big problem IMHO with storing the same tunables in two different config llogs (one for servers and one for clients). Generally, the server-side tunables are not used by the client, and vice versa. Probably the only place that would need to read two config llogs is the MDS, which is both a server and a client of the OSTs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Sorry for taking so long to respond...

Andreas Dilger wrote:
> On May 14, 2009 11:48 -0400, Oleg Drokin wrote:
>> On May 14, 2009, at 10:25 AM, Oleg Drokin wrote:
>>> Actually, just to combat situations like this, MGCs pause for a few
>>> seconds before refetching the config; I remember there was a bug
>>> and this measure was introduced as a fix.
>>
>> Nic actually tuned in and said that the backoff (set at 3 seconds now)
>> is certainly not enough, since it takes that long just to mount the
>> actual on-disk fs.

This is probably the easiest thing to try out for fixing bug 12329. Put this up at 30s or 60s or something -- it's just the amount of time it takes to update after a config change. These will be rare and asynchronous, so there's no real penalty for waiting. Preventing thousands of clients from trying to re-read the config log every few seconds seems like a no-brainer. See mgc_requeue_thread.

>> Anyway, that got me thinking that we have a "coarse-grained" locking
>> problem. Since OSTs don't connect to other OSTs, they do not care about
>> OST connections, and perhaps if we introduce bit-locks to MGS locks as
>> well to indicate client type, then locks from OSTs would only be revoked
>> when the MDS connects or disconnects, MDS locks would only be revoked
>> when OSTs connect or disconnect, and client locks would be revoked always.
>>
>> Or alternatively we can split our single resource into a few separate
>> ones: one for OSTs, one for MDSes, for example. Sure, that would mean
>> clients would now have to take two locks, but on the other hand there
>> would supposedly be less information to reparse when one of those locks
>> is invalidated.
>
> I would tend to prefer the latter. Having separate resource IDs for
> the different llogs makes it a lot cleaner in the end. Ideally,
> picking a relatively unique resource ID for that config log would
> allow us to separate the configs between different filesystems.
>
> The OSTs in fact don't really need to read the same llog as the client
> for very many things (some shared tunables, perhaps), and there also
> isn't a big problem IMHO with storing the same tunables in two different
> config llogs (one for servers and one for clients). Generally, the
> server-side tunables are not used by the client, and vice versa.
> Probably the only place that would need to read two config llogs is the
> MDS, which is both a server and a client of the OSTs.

The OSTs in fact do read a separate llog from the client. But there is still a single config lock per fs on the MGS, so that doesn't really matter. Revoking the lock causes everybody in the fs to try to update, even if there's nothing new in their particular log. Oleg's fine-grained idea, or simply separate locks, would help in this case.

But I think the big win is backing off the requeue time for big clusters. We could even automate this a bit: increase the requeue time on the clients as the number of OSTs increases.
On Jun 16, 2009 20:04 -0700, Nathaniel Rutman wrote:
> Andreas Dilger wrote:
> > On May 14, 2009 11:48 -0400, Oleg Drokin wrote:
> >> On May 14, 2009, at 10:25 AM, Oleg Drokin wrote:
> >>> Actually, just to combat situations like this, MGCs pause for a few
> >>> seconds before refetching the config; I remember there was a bug
> >>> and this measure was introduced as a fix.
> >>
> >> Nic actually tuned in and said that the backoff (set at 3 seconds now)
> >> is certainly not enough, since it takes that long just to mount the
> >> actual on-disk fs.
>
> This is probably the easiest thing to try out for fixing bug 12329.
> Put this up at 30s or 60s or something -- it's just the amount of time
> it takes to update after a config change. These will be rare and
> asynchronous, so there's no real penalty for waiting. Preventing
> thousands of clients from trying to re-read the config log every few
> seconds seems like a no-brainer. See mgc_requeue_thread.

The one potential problem with this is that if the MDS creates a file on one of the newly-added OSTs, but the clients don't see this OST for 60s, then there is a big window for hitting an IO error due to the "bad" striping (from the POV of that client). I don't _think_ there is any coordination between refreshing the config lock and doing other IO...

> >> Anyway, that got me thinking that we have a "coarse-grained" locking
> >> problem. Since OSTs don't connect to other OSTs, they do not care about
> >> OST connections, and perhaps if we introduce bit-locks to MGS locks as
> >> well to indicate client type, then locks from OSTs would only be revoked
> >> when the MDS connects or disconnects, MDS locks would only be revoked
> >> when OSTs connect or disconnect, and client locks would be revoked always.
> >>
> >> Or alternatively we can split our single resource into a few separate
> >> ones: one for OSTs, one for MDSes, for example. Sure, that would mean
> >> clients would now have to take two locks, but on the other hand there
> >> would supposedly be less information to reparse when one of those locks
> >> is invalidated.
> >
> > I would tend to prefer the latter. Having separate resource IDs for
> > the different llogs makes it a lot cleaner in the end. Ideally,
> > picking a relatively unique resource ID for that config log would
> > allow us to separate the configs between different filesystems.
> >
> > The OSTs in fact don't really need to read the same llog as the client
> > for very many things (some shared tunables, perhaps), and there also
> > isn't a big problem IMHO with storing the same tunables in two different
> > config llogs (one for servers and one for clients). Generally, the
> > server-side tunables are not used by the client, and vice versa.
> > Probably the only place that would need to read two config llogs is the
> > MDS, which is both a server and a client of the OSTs.
>
> The OSTs in fact do read a separate llog from the client. But there is
> still a single config lock per fs on the MGS, so that doesn't really
> matter. Revoking the lock causes everybody in the fs to try to update,
> even if there's nothing new in their particular log. Oleg's fine-grained
> idea, or simply separate locks, would help in this case.

Yes, I think this is fairly important for very large filesystems, since there appears to be some kind of O(n^2) behaviour going on with respect to many OSTs connecting, trying to get the single config lock (which they don't really need in the end), and then slowing everything down.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.