Okay, so since bug #12329 isn't really fixed yet, and we've been trying to do big I/O runs on our cluster, we keep bumping up against this issue. So I was thinking of some hacks to get around it with prep work prior to obtaining the big cluster to do whatever we want.

Just a little background on what we've been doing. PNL (Pacific Northwest National Labs) has a cluster that is unusual in the amount of local scratch disk we require on our compute nodes. Because of this we can occasionally take the big machine to do large scale I/O tests with Lustre. This helps us understand how Lustre scales for future use and how we might have to change the way we deal with our production Lustre systems when upgrading them. We have about 1TB/s of theoretical local scratch bandwidth across all of our compute nodes; so far we've only gotten about a quarter of that with a Lustre file system using half the system. We've since tried 4 times to put that image on the entire cluster and see how well things work, without getting any numbers so far.

The file system created has the following characteristics:

1) 700TB as reported by df -h
2) 4600 OSTs (well within the max of 8192)
3) 2300 OSSs

We could have an alternate configuration where we break the raid arrays and go a bit wider, 18000 OSTs on 2300 OSSs. But this would require modifications to lustre source code to make that happen.

The bug above really hits us hard on this system, even on lustre 1.8.0. It took 5 hours just to get 4096 OSTs mounted. As this is going on everything is in a constant state of reconnect, since the mgs/mdt is busy handling new incoming OSTs. I'm glad to say that the reconnects keep up and everything goes through the recovery as expected. When we've paused the mount process, the cluster settles back down and df -h returns about 5 minutes afterwards, which is very acceptable. However, there's a linear increase in the amount of reconnects and traffic associated with those reconnects as the number of OSTs increases during the mounting. This causes an increase in time for the next OST that has to mount. Keep in mind that this is on a brand new file system, not upgrading, not currently running. I would expect this behavior wouldn't happen (or would be slightly different) if the file system was already created.

Which leads me into the hack to get around the bug. I'm just wondering about thoughts or ideas as to what to watch for (or if it would even work).

Precreate mdt/mgs and ost images in a small form factor prior to production cluster time:

1) pick a system and put lustre on it.
2) setup an mdt/mgs combo and mount it
3) create an ost and mount it
4) umount it and save the image (should only be 10M or so; not sure what the smallest size would be).
5) deactivate the new ost
6) go to step 3 with the same disk you used before

You'd end up with pre-created images of a lustre file system prior to deployment that you could dd onto all the drives in parallel quite fast. You could then run resize2fs on the file systems to grow the OST to the appropriate size for that device (not sure how long this would take). Then you would run tunefs.lustre to change the mgsnode and fsname for that file system. Then all you'd have to do is mount, and the bug may be averted, right?

Just wondering if anyone has any thoughts or ideas on how to get around this issue.

Thanks,
- David Brown
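For concreteness, a rough shell sketch of the procedure described above. This is a sketch of the idea, not a tested recipe: the fsname ("scratch"), MGS NID, device names, OST count, image size, and the later fsname/NID are all placeholders, and the deactivation step is only noted in a comment because the osc device number has to be looked up on the MDS.

    #!/bin/bash
    # -- prep on a small scratch node --
    mkfs.lustre --fsname=scratch --mgs --mdt /dev/sda
    mkdir -p /mnt/mdt && mount -t lustre /dev/sda /mnt/mdt

    for i in $(seq 0 4599); do
        # small OST, reusing the same spare disk each pass (size is a guess;
        # --reformat is needed once the disk already holds a Lustre target)
        mkfs.lustre --reformat --fsname=scratch --ost --index=$i \
            --mgsnode=192.168.0.1@o2ib --device-size=262144 /dev/sdb
        mkdir -p /mnt/ost && mount -t lustre /dev/sdb /mnt/ost  # registers OST $i with the MGS
        umount /mnt/ost
        dd if=/dev/sdb of=/images/ost-$i.img bs=1M count=256    # save the image
        # then deactivate the new OST on the MDS so it isn't used for allocations,
        # e.g. "lctl dl" to find the osc device, then "lctl --device <n> deactivate"
    done

    # -- later, on each real OSS, for each OST device (run in parallel) --
    i=0; dev=/dev/sdc                                    # example values
    dd if=/images/ost-$i.img of=$dev bs=1M
    e2fsck -fy $dev                                      # resize2fs wants a clean fsck first
    resize2fs $dev                                       # grow the ldiskfs to fill the device
    tunefs.lustre --fsname=bigfs --mgsnode=10.0.0.1@o2ib $dev  # real fsname/MGS NID; may also need --writeconf
    mkdir -p /mnt/ost$i && mount -t lustre $dev /mnt/ost$i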
On May 13, 2009 11:06 -0700, David Brown wrote:
> The file system created has the following characteristics:
>
> 1) 700TB as reported by df -h
> 2) 4600 OSTs (well within the max of 8192)
> 3) 2300 OSSs
>
> We could have an alternate configuration where we break the raid
> arrays and go a bit wider, 18000 OSTs on 2300 OSSs. But this would
> require modifications to lustre source code to make that happen.

Hmm, even projecting out to the future I'm not certain we will get to systems with 18000 OSTs. The capacity of the disks is growing continuously, and with ZFS we can have very large individual OSTs, so even 2-3 years from now we're only looking at 1200 OSTs on 400 OSS nodes (115 PB filesystem with 30x4TB disks/OST in RAID-6 8+2 = 96TB/OST).

> The bug above really hits us hard on this system, even on lustre
> 1.8.0. It took 5 hours just to get 4096 OSTs mounted.

Ouch.

> As this is going on everything is in a constant state of reconnect,
> since the mgs/mdt is busy handling new incoming OSTs. I'm glad to say
> that the reconnects keep up and everything goes through the recovery
> as expected. When we've paused the mount process, the cluster settles
> back down and df -h returns about 5 minutes afterwards, which is very
> acceptable. However, there's a linear increase in the amount of
> reconnects and traffic associated with those reconnects as the number
> of OSTs increases during the mounting. This causes an increase in time
> for the next OST that has to mount. Keep in mind that this is on a
> brand new file system, not upgrading, not currently running. I would
> expect this behavior wouldn't happen (or would be slightly different)
> if the file system was already created.

Sounds unpleasant. I wonder if this is driven by the fact that the MGS clients (OSTs are also MGS clients) don't expect a huge amount of change at any one time, so they try to refetch the updated config in an eager manner. This probably increases the queue of requests on the MGS linearly with the number of OSTs, and new OST connections are getting backed up behind this. I was going to say that getting some RPC stats from the MGS service would be informative, but I can't see any MGS RPC stats file on my system...

I wonder if a longer-term solution is to have the MGS push the config log changes to the clients, instead of having the clients pull them.

> Precreate mdt/mgs and ost images in a small form factor prior to
> production cluster time:
>
> 1) pick a system and put lustre on it.
> 2) setup an mdt/mgs combo and mount it
> 3) create an ost and mount it
> 4) umount it and save the image (should only be 10M or so; not sure
> what the smallest size would be).

You only really need to mount the OST filesystem with "-t ldiskfs", tar up the contents of the OST filesystem, and save the filesystem label ({fsname}-OSTnnnn).

> 5) deactivate the new ost
> 6) go to step 3 with the same disk you used before
>
> You'd end up with pre-created images of a lustre file system prior to
> deployment that you could dd onto all the drives in parallel quite
> fast.
>
> You could then run resize2fs on the file systems to grow the OST to
> the appropriate size for that device (not sure how long this would
> take).

If you only save the tar image, then you can just do a normal format of the OST filesystem (using the mke2fs options as reported during the mkfs.lustre run, with the proper label "-L {fsname}-OSTnnnn" for that OST index), mount it with "-t ldiskfs", and then extract the tarball into it.

> Then you would run tunefs.lustre to change the mgsnode and fsname for
> that file system.
>
> Then all you'd have to do is mount and the bug may be averted, right?

Probably, yes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
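A similarly hedged sketch of the tar-based variant described above: the label, paths and OST index are examples, and the mke2fs options shown are only stand-ins for whatever mkfs.lustre actually reports on the real hardware.

    # save: after the small OST has registered with the MGS and been unmounted
    mount -t ldiskfs /dev/sdb /mnt/ost
    tar czf /images/ost-0005.tgz -C /mnt/ost .      # CONFIGS/, last_rcvd, O/, etc.
    e2label /dev/sdb > /images/ost-0005.label       # e.g. "scratch-OST0005"
    umount /mnt/ost

    # restore, on the real OSS: format natively at full size, then unpack
    mke2fs -j -I 256 -L scratch-OST0005 /dev/sdc    # plus the other options mkfs.lustre printed
    mount -t ldiskfs /dev/sdc /mnt/ost
    tar xzpf /images/ost-0005.tgz -C /mnt/ost
    umount /mnt/ost
    tunefs.lustre --mgsnode=10.0.0.1@o2ib /dev/sdc  # point at the real MGS
    mkdir -p /mnt/ost0005 && mount -t lustre /dev/sdc /mnt/ost0005

Formatting natively and unpacking a tarball avoids both the dd of a full image and the resize2fs pass, at the cost of having to reproduce the mke2fs options exactly.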
Hello!

On May 14, 2009, at 2:22 AM, Andreas Dilger wrote:
> Sounds unpleasant. I wonder if this is driven by the fact that the
> MGS clients (OSTs are also MGS clients) don't expect a huge amount of
> change at any one time, so they try to refetch the updated config in
> an eager manner. This probably increases the queue of requests on
> the MGS linearly with the number of OSTs, and new OST connections are
> getting backed up behind this.

Actually, just to combat situations like this, MGCs pause for a few seconds before refetching the config; I remember there was a bug and this measure was introduced as a fix.

What's interesting is that I actually have a 1200-OST system on a single node and the mount (even format & mount) takes nowhere near 5 hours. In fact I am up to ~950 OSTs mounted in around 20 minutes or so, I think, at which point the node usually OOMs (and it's all in a heavily swapping vmware too). And this setup does have some extra complications, like hitting a bug in the MGC where every target establishes its own connection to the MGS when only one connection for the entire node is needed; and then there is no dynamic lru enabled, so mgc locks are just pushed out of the lru and I see constant attempts to requeue the locks even after the mount is finished. Of course, on the other hand, even if I run mounts in parallel (and I do), the MGC still does not rush in with all of the requests in parallel, I think.

Bye,
Oleg
Hello!

On May 14, 2009, at 10:25 AM, Oleg Drokin wrote:
> Actually, just to combat situations like this, MGCs pause for a few
> seconds before refetching the config; I remember there was a bug
> and this measure was introduced as a fix.

Nic actually tuned in and said that the backoff (set at 3 seconds now) is certainly not enough, since it takes that long just to mount the actual on-disk fs.

Anyway, that got me thinking that we have a "coarse-grained" locking problem. Since OSTs don't connect to other OSTs, they do not care about OST connections, and perhaps if we introduce bit-locks to MGS locks as well to indicate client type, then locks from OSTs would only be revoked when the MDS connects or disconnects, MDS locks would only be revoked when OSTs connect or disconnect, and client locks would be revoked always.

Or alternatively we can split our single resource into a few separate ones: one for OSTs, one for MDSes, for example. Sure, that would mean clients would now have to take two locks, but on the other hand there would supposedly be less information to reparse when one of those locks is invalidated.

Nathan, what do you think?

Bye,
Oleg
> Hmm, even projecting out to the future I'm not certain we will get
> to systems with 18000 OSTs. The capacity of the disks is growing
> continuously, and with ZFS we can have very large individual OSTs,
> so even 2-3 years from now we're only looking at 1200 OSTs on 400 OSS
> nodes (115 PB filesystem with 30x4TB disks/OST in RAID-6 8+2 = 96TB/OST).

So how fast is this theoretical 115PB system?

With innovations like SSDs getting cheaper and larger, we may get faster and better performance. However, I'm still not sure we're going to keep up with the curve that processors are on.

However, I guess you guys really aren't too concerned about processing power increasing in particular. Just make a file system that can save all the memory in a supercomputer in N minutes, really.

Also, I'll be getting back with some interesting statistics on what each system is doing (memory used, processors working), and on an infiniband soft lockup detection that forces all ko2iblnd and ptlrpc kernel procs to go full bore, putting the load average at 16+ for a single OSS (this does sound like a bug; not sure if it's already been noticed, I'll have to do a bugzilla search).

Thanks,
- David Brown
On May 14, 2009 09:05 -0700, David Brown wrote:
> > Hmm, even projecting out to the future I'm not certain we will get
> > to systems with 18000 OSTs. The capacity of the disks is growing
> > continuously, and with ZFS we can have very large individual OSTs,
> > so even 2-3 years from now we're only looking at 1200 OSTs on 400 OSS
> > nodes (115 PB filesystem with 30x4TB disks/OST in RAID-6 8+2 = 96TB/OST).
>
> So how fast is this theoretical 115PB system?

About 1.5TB/s fast, if the math holds true.

> With innovations like SSDs getting cheaper and larger, we may get
> faster and better performance. However, I'm still not sure we're going
> to keep up with the curve that processors are on.

Sure, we are already testing SSDs and what we can do to improve the filesystem to take advantage of them. They aren't quite at the price point where people will be jumping on them, but I expect in a year or two they will become much more affordable.

> However, I guess you guys really aren't too concerned about processing
> power increasing in particular. Just make a file system that can save
> all the memory in a supercomputer in N minutes, really.

Well, CPU power -> more RAM to feed the CPU -> more disk to feed/purge RAM. So even though CPU/RAM size is increasing exponentially, disk size and, more importantly, disk performance are not growing so quickly. Even worse, seeks are almost static for spinning disks (8ms or so => 125 IOPS), so SSDs are really the only hope to keep up with filesystem requirements.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
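For reference, the back-of-envelope arithmetic behind those figures, as a quick shell check. The per-OST rate at the end is only inferred by dividing the stated 1.5TB/s total, not a number given above.

    # 3 x (8+2) RAID-6 groups of 4TB disks => 24 data disks per OST
    echo "TB per OST:   $((24 * 4))"            # 96
    echo "PB total:     $((1200 * 96 / 1000))"  # ~115
    echo "IOPS/spindle: $((1000 / 8))"          # ~8ms per seek => ~125
    # 1.5TB/s across 1200 OSTs works out to roughly 1.25GB/s per OST, ~3.75GB/s per OSS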
On May 14, 2009 11:48 -0400, Oleg Drokin wrote:
> On May 14, 2009, at 10:25 AM, Oleg Drokin wrote:
>> Actually, just to combat situations like this, MGCs pause for a few
>> seconds before refetching the config; I remember there was a bug
>> and this measure was introduced as a fix.
>
> Nic actually tuned in and said that the backoff (set at 3 seconds now)
> is certainly not enough, since it takes that long just to mount the
> actual on-disk fs.
>
> Anyway, that got me thinking that we have a "coarse-grained" locking
> problem. Since OSTs don't connect to other OSTs, they do not care about
> OST connections, and perhaps if we introduce bit-locks to MGS locks as
> well to indicate client type, then locks from OSTs would only be revoked
> when the MDS connects or disconnects, MDS locks would only be revoked
> when OSTs connect or disconnect, and client locks would be revoked always.
>
> Or alternatively we can split our single resource into a few separate
> ones: one for OSTs, one for MDSes, for example. Sure, that would mean
> clients would now have to take two locks, but on the other hand there
> would supposedly be less information to reparse when one of those locks
> is invalidated.

I would tend to prefer the latter. Having separate resource IDs for the different llogs makes it a lot cleaner in the end. Ideally, picking a relatively unique resource ID for that config log would allow us to separate the configs between different filesystems.

The OSTs in fact don't really need to read the same llog as the client for very many things (some shared tunables, perhaps), and there also isn't a big problem IMHO with storing the same tunables in two different config llogs (one for servers and one for clients). Generally, the server-side tunables are not used by the client, and vice versa. Probably the only place that would need to read two config llogs is the MDS, which is both a server and a client of the OSTs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Sorry for taking so long to respond...

Andreas Dilger wrote:
> On May 14, 2009 11:48 -0400, Oleg Drokin wrote:
>> On May 14, 2009, at 10:25 AM, Oleg Drokin wrote:
>>> Actually, just to combat situations like this, MGCs pause for a few
>>> seconds before refetching the config; I remember there was a bug
>>> and this measure was introduced as a fix.
>>
>> Nic actually tuned in and said that the backoff (set at 3 seconds now)
>> is certainly not enough, since it takes that long just to mount the
>> actual on-disk fs.

This is probably the easiest thing to try out for fixing bug 12329. Put this up at 30s or 60s or something -- it's just the amount of time it takes to update after a config change. These will be rare and asynchronous, so there's no real penalty for waiting. Preventing thousands of clients from trying to re-read the config log every few seconds seems like a no-brainer. See mgc_requeue_thread.

>> Anyway, that got me thinking that we have a "coarse-grained" locking
>> problem. Since OSTs don't connect to other OSTs, they do not care about
>> OST connections, and perhaps if we introduce bit-locks to MGS locks as
>> well to indicate client type, then locks from OSTs would only be revoked
>> when the MDS connects or disconnects, MDS locks would only be revoked
>> when OSTs connect or disconnect, and client locks would be revoked always.
>>
>> Or alternatively we can split our single resource into a few separate
>> ones: one for OSTs, one for MDSes, for example. Sure, that would mean
>> clients would now have to take two locks, but on the other hand there
>> would supposedly be less information to reparse when one of those locks
>> is invalidated.
>
> I would tend to prefer the latter. Having separate resource IDs for
> the different llogs makes it a lot cleaner in the end. Ideally,
> picking a relatively unique resource ID for that config log would
> allow us to separate the configs between different filesystems.
>
> The OSTs in fact don't really need to read the same llog as the client
> for very many things (some shared tunables, perhaps), and there also
> isn't a big problem IMHO with storing the same tunables in two different
> config llogs (one for servers and one for clients). Generally, the
> server-side tunables are not used by the client, and vice versa.
> Probably the only place that would need to read two config llogs is the
> MDS, which is both a server and a client of the OSTs.

The OSTs in fact do read a separate llog from the client. But there is still a single config lock per fs on the MGS, so that doesn't really matter. Revoking the lock causes everybody in the fs to try to update, even if there's nothing new in their particular log. Oleg's fine-grained idea, or simply separate locks, would help in this case.

But I think the big win is backing off the requeue time for big clusters. We could even automate this a bit: increase the requeue time on the clients as the number of OSTs increases.
On Jun 16, 2009 20:04 -0700, Nathaniel Rutman wrote:
> Andreas Dilger wrote:
> > On May 14, 2009 11:48 -0400, Oleg Drokin wrote:
> >> On May 14, 2009, at 10:25 AM, Oleg Drokin wrote:
> >>> Actually, just to combat situations like this, MGCs pause for a few
> >>> seconds before refetching the config; I remember there was a bug
> >>> and this measure was introduced as a fix.
> >>
> >> Nic actually tuned in and said that the backoff (set at 3 seconds now)
> >> is certainly not enough, since it takes that long just to mount the
> >> actual on-disk fs.
>
> This is probably the easiest thing to try out for fixing bug 12329.
> Put this up at 30s or 60s or something -- it's just the amount of time
> it takes to update after a config change. These will be rare and
> asynchronous, so there's no real penalty for waiting. Preventing
> thousands of clients from trying to re-read the config log every few
> seconds seems like a no-brainer. See mgc_requeue_thread.

The one potential problem with this is that if the MDS creates a file on one of the newly-added OSTs, but the clients don't see this OST for 60s, then there is a big window for hitting an IO error due to the "bad" striping (from the POV of that client). I don't _think_ there is any coordination between refreshing the config lock and doing other IO...

> >> Anyway, that got me thinking that we have a "coarse-grained" locking
> >> problem. Since OSTs don't connect to other OSTs, they do not care about
> >> OST connections, and perhaps if we introduce bit-locks to MGS locks as
> >> well to indicate client type, then locks from OSTs would only be revoked
> >> when the MDS connects or disconnects, MDS locks would only be revoked
> >> when OSTs connect or disconnect, and client locks would be revoked always.
> >>
> >> Or alternatively we can split our single resource into a few separate
> >> ones: one for OSTs, one for MDSes, for example. Sure, that would mean
> >> clients would now have to take two locks, but on the other hand there
> >> would supposedly be less information to reparse when one of those locks
> >> is invalidated.
> >
> > I would tend to prefer the latter. Having separate resource IDs for
> > the different llogs makes it a lot cleaner in the end. Ideally,
> > picking a relatively unique resource ID for that config log would
> > allow us to separate the configs between different filesystems.
> >
> > The OSTs in fact don't really need to read the same llog as the client
> > for very many things (some shared tunables, perhaps), and there also
> > isn't a big problem IMHO with storing the same tunables in two different
> > config llogs (one for servers and one for clients). Generally, the
> > server-side tunables are not used by the client, and vice versa.
> > Probably the only place that would need to read two config llogs is the
> > MDS, which is both a server and a client of the OSTs.
>
> The OSTs in fact do read a separate llog from the client. But there is
> still a single config lock per fs on the MGS, so that doesn't really
> matter. Revoking the lock causes everybody in the fs to try to update,
> even if there's nothing new in their particular log. Oleg's fine-grained
> idea, or simply separate locks, would help in this case.

Yes, I think this is fairly important for very large filesystems, since there appears to be some kind of O(n^2) behaviour going on with respect to many OSTs connecting, trying to get the single config lock (which they don't really need in the end), and then slowing everything down.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.