I've seen a lot of issues with mounting all of our OSTs on an OSS taking an excessive amount of time. Most of the individual OST mount time was related to bug 18456, but we still see mount times of minutes per OST even with the relevant patches. At mount time the llog does a small write which ends up scanning nearly our entire 7+ TB OSTs to find the desired block and complete the write.

To reduce startup time, mounting multiple OSTs simultaneously would help, but during that process it looks like the code path is still holding the big kernel lock from the mount system call. During that time all other mount commands are in an uninterruptible sleep (D state). Based on the discussions from bug 23790 it doesn't appear that Lustre relies on the BKL, so would it be reasonable to call unlock_kernel in lustre_fill_super, or at least before lustre_start_mgc, and lock it again before the return so multiple OSTs could be mounting at the same time? I think the same thing would apply to unmounting, but I haven't looked at the code path there.

Jeremy
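For reference, a minimal sketch of the bracketing being proposed here, assuming a 1.8-era lustre_fill_super() and the old BKL API from <linux/smp_lock.h>; the elided setup is a placeholder and the exact placement of the unlock/relock is an assumption, not a tested patch:

    #include <linux/smp_lock.h>     /* lock_kernel() / unlock_kernel() */

    static int lustre_fill_super(struct super_block *sb, void *data, int silent)
    {
            int rc = 0;

            /* ... parse mount options, allocate the lustre_sb_info ... */

            /* mount(2) enters fill_super with the BKL held; drop it here so
             * other OSTs on the same OSS can run their own fill_super in
             * parallel instead of sleeping in D state. */
            unlock_kernel();

            /* ... lustre_start_mgc() and the rest of server/client setup,
             * which is where the minutes-long llog write happens ... */

            /* Retake the BKL before returning, since the caller still
             * expects to hold it when fill_super comes back. */
            lock_kernel();

            return rc;
    }

Whether the VFS tolerates the lock being dropped behind its back is exactly the question raised later in this thread.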
On 2010-10-28, at 21:07, Jeremy Filizetti wrote:
> I've seen a lot of issues with mounting all of our OSTs on an OSS taking an excessive amount of time. Most of the individual OST mount time was related to bug 18456, but we still see mount times take minutes per OST with the relevant patches. At mount time the llog does a small write which ends up scanning nearly our entire 7+ TB OSTs to find the desired block and complete the write.
>
> To reduce startup time mounting multiple OSTs simultaneously would help, but during that process it looks like the code path is still holding the big kernel lock from the mount system call. During that time all other mount commands are in an uninterruptible sleep (D state). Based on the discussions from bug 23790 it doesn't appear that Lustre relies on the BKL so would it be reasonable to call unlock_kernel in lustre_fill_super or at least before lustre_start_mgc and lock it again before the return so multiple OSTs could be mounting at the same time? I think the same thing would apply to unmounting but I haven't looked at the code path there.

IIRC, the BKL is held at mount time to avoid potential races with mounting the same device multiple times. However, the risk of this is pretty small, and can be controlled on an OSS, which has limited access. Also, this code is being removed in newer kernels, as I don't think it is needed by most filesystems.

I _think_ it should be OK, but YMMV.

Cheers, Andreas
On 2 Nov 2010, at 07:40, Andreas Dilger wrote:
> On 2010-10-28, at 21:07, Jeremy Filizetti wrote:
>> I've seen a lot of issues with mounting all of our OSTs on an OSS taking an excessive amount of time. Most of the individual OST mount time was related to bug 18456, but we still see mount times take minutes per OST with the relevant patches. At mount time the llog does a small write which ends up scanning nearly our entire 7+ TB OSTs to find the desired block and complete the write.
>>
>> To reduce startup time mounting multiple OSTs simultaneously would help, but during that process it looks like the code path is still holding the big kernel lock from the mount system call. During that time all other mount commands are in an uninterruptible sleep (D state). Based on the discussions from bug 23790 it doesn't appear that Lustre relies on the BKL so would it be reasonable to call unlock_kernel in lustre_fill_super or at least before lustre_start_mgc and lock it again before the return so multiple OSTs could be mounting at the same time? I think the same thing would apply to unmounting but I haven't looked at the code path there.
>
> IIRC, the BKL is held at mount time to avoid potential races with mounting the same device multiple times. However, the risk of this is pretty small, and can be controlled on an OSS, which has limited access. Also, this code is being removed in newer kernels, as I don't think it is needed by most filesystems.
>
> I _think_ it should be OK, but YMMV.

I've been thinking about this and can't make up my mind on whether it's a good idea or not. We often see mount times in the ten minute region, so anything we can do to speed them up is a good thing, but I find it hard to believe the core kernel mount code would accept you doing this behind its back, and I'd be surprised if it worked.

Then again - when we were discussing this yesterday, is the mount command *really* holding the BKL for the entire duration? Surely if this lock is being held for minutes we'd notice it in other ways, because other kernel paths that require this lock would block?

Ashley.
Are you seeing individual OST mount times in the 10 minute region, or is that the time for all OSTs?

Releasing the lock is more common than you think. Both ext3 and ext4 release the lock in their ext[34]_fill_super and reacquire it before exiting. So it gets released when Lustre does the pre-mount, and again when it does the real mount for ldiskfs, but after that, during that first llog write when the buddy allocator is initializing, I don't see where it can be getting released. I was going to try to confirm things 100% with systemtap, but the version I have doesn't seem to pick up the Lustre modules (or any additionally added ones, for that matter). I don't have any easy way to upgrade it either.

Jeremy

> I've been thinking about this and can't make up my mind on whether it's a good idea or not. We often see mount times in the ten minute region, so anything we can do to speed them up is a good thing, but I find it hard to believe the core kernel mount code would accept you doing this behind its back, and I'd be surprised if it worked.
>
> Then again - when we were discussing this yesterday, is the mount command *really* holding the BKL for the entire duration? Surely if this lock is being held for minutes we'd notice it in other ways, because other kernel paths that require this lock would block?
>
> Ashley.
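For reference, roughly the pattern being referred to, as it appeared in the 2.6.2x/2.6.3x ext3/ext4 fill_super paths: the mount path enters fill_super holding the BKL and the filesystem drops it for the slow part. This is a simplified sketch from memory, not verbatim kernel source:

    static int ext4_fill_super(struct super_block *sb, void *data, int silent)
            __releases(kernel_lock)
            __acquires(kernel_lock)
    {
            int ret = 0;

            /* Caller (the do_kern_mount path) holds the BKL; drop it because
             * reading the superblock and recovering the journal can sleep
             * for a long time. */
            unlock_kernel();

            /* ... read the superblock, set up the journal, etc.; on error,
             * set ret and fall through ... */

            /* Caller expects the BKL to still be held on return. */
            lock_kernel();
            return ret;
    }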
That is per OST; it's at the outside of the times that we see, but it's not uncommon. We still suffer from bug 18456, and this mainly happens when OSTs get uncomfortably full.

Ashley.

On 3 Nov 2010, at 20:51, Jeremy Filizetti wrote:
> Are you seeing individual OST mount times in the 10 minute region, or is that the time for all OSTs?
>
> Releasing the lock is more common than you think. Both ext3 and ext4 release the lock in their ext[34]_fill_super and reacquire it before exiting. So it gets released when Lustre does the pre-mount, and again when it does the real mount for ldiskfs, but after that, during that first llog write when the buddy allocator is initializing, I don't see where it can be getting released. I was going to try to confirm things 100% with systemtap, but the version I have doesn't seem to pick up the Lustre modules (or any additionally added ones, for that matter). I don't have any easy way to upgrade it either.
>
> Jeremy
>
>> I've been thinking about this and can't make up my mind on whether it's a good idea or not. We often see mount times in the ten minute region, so anything we can do to speed them up is a good thing, but I find it hard to believe the core kernel mount code would accept you doing this behind its back, and I'd be surprised if it worked.
>>
>> Then again - when we were discussing this yesterday, is the mount command *really* holding the BKL for the entire duration? Surely if this lock is being held for minutes we'd notice it in other ways, because other kernel paths that require this lock would block?
>>
>> Ashley.
I've had a chance to take a longer look at this and I think I was wrong about the BKL. I still don't see where it would be getting released, but the problem appears to be that all OBDs are using the same MGC from a server.

In server_start_targets, server_mgc_set_fs acquires the cl_mgc_sem, holds it through lustre_process_log, and releases it with server_mgc_clear_fs after that. As a result, all of our mounts that are started at the same time are waiting for the cl_mgc_sem semaphore, and each OBD has to process its llog one at a time. When you have OSTs near capacity, as in bug 18456, the first write when processing the llog can take minutes to complete. (A sketch of this serialization follows the quoted text below.)

I don't see any easy way to fix this because they are all using the same sb->lsi->lsi_mgc. I was thinking maybe some of these structures could just modify a copy of that data instead of the actual structure itself, but there are so many functions called it's hard to see if anything would be using it.

Any ideas for a way to work around this?

Jeremy

On Wed, Nov 3, 2010 at 11:57 AM, Ashley Pittman <apittman at ddn.com> wrote:
> On 2 Nov 2010, at 07:40, Andreas Dilger wrote:
>> On 2010-10-28, at 21:07, Jeremy Filizetti wrote:
>>> I've seen a lot of issues with mounting all of our OSTs on an OSS taking an excessive amount of time. Most of the individual OST mount time was related to bug 18456, but we still see mount times take minutes per OST with the relevant patches. At mount time the llog does a small write which ends up scanning nearly our entire 7+ TB OSTs to find the desired block and complete the write.
>>>
>>> To reduce startup time mounting multiple OSTs simultaneously would help, but during that process it looks like the code path is still holding the big kernel lock from the mount system call. During that time all other mount commands are in an uninterruptible sleep (D state). Based on the discussions from bug 23790 it doesn't appear that Lustre relies on the BKL so would it be reasonable to call unlock_kernel in lustre_fill_super or at least before lustre_start_mgc and lock it again before the return so multiple OSTs could be mounting at the same time? I think the same thing would apply to unmounting but I haven't looked at the code path there.
>>
>> IIRC, the BKL is held at mount time to avoid potential races with mounting the same device multiple times. However, the risk of this is pretty small, and can be controlled on an OSS, which has limited access. Also, this code is being removed in newer kernels, as I don't think it is needed by most filesystems.
>>
>> I _think_ it should be OK, but YMMV.
>
> I've been thinking about this and can't make up my mind on whether it's a good idea or not. We often see mount times in the ten minute region, so anything we can do to speed them up is a good thing, but I find it hard to believe the core kernel mount code would accept you doing this behind its back, and I'd be surprised if it worked.
>
> Then again - when we were discussing this yesterday, is the mount command *really* holding the BKL for the entire duration? Surely if this lock is being held for minutes we'd notice it in other ways, because other kernel paths that require this lock would block?
>
> Ashley.
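A sketch of the serialization described above, paraphrased from this message rather than copied from the Lustre tree; the helper name and the exact signatures are assumptions, and only the functions and fields named in the thread (server_mgc_set_fs, lustre_process_log, server_mgc_clear_fs, cl_mgc_sem, lsi_mgc) are taken as given:

    /* Hypothetical condensation of the path through server_start_targets(). */
    static int start_one_target(struct super_block *sb, char *logname,
                                struct config_llog_instance *cfg)
    {
            struct lustre_sb_info *lsi = s2lsi(sb);
            struct obd_device *mgc = lsi->lsi_mgc;  /* one MGC shared by every
                                                     * target on this server */
            int rc;

            /* Takes cl_mgc_sem on the shared MGC; a second OST mounting on
             * the same OSS sleeps here until the first finishes its llog. */
            rc = server_mgc_set_fs(mgc, sb);
            if (rc)
                    return rc;

            /* Processes the config llog; the first small write can take
             * minutes on a nearly full OST (bug 18456). */
            rc = lustre_process_log(sb, logname, cfg);

            /* Drops cl_mgc_sem; only now can the next target start. */
            server_mgc_clear_fs(mgc);
            return rc;
    }

With N OSTs mounted in parallel this still serializes the llog processing, so the aggregate startup time is roughly the sum of the individual llog times rather than their maximum.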
On 2010-11-08, at 15:16, Jeremy Filizetti wrote:
> I've had a chance to take a longer look at this and I think I was wrong about the BKL. I still don't see where it would be getting released, but the problem appears to be that all OBDs are using the same MGC from a server.
>
> In server_start_targets, server_mgc_set_fs acquires the cl_mgc_sem, holds it through lustre_process_log, and releases it with server_mgc_clear_fs after that. As a result, all of our mounts that are started at the same time are waiting for the cl_mgc_sem semaphore, and each OBD has to process its llog one at a time. When you have OSTs near capacity, as in bug 18456, the first write when processing the llog can take minutes to complete.
>
> I don't see any easy way to fix this because they are all using the same sb->lsi->lsi_mgc. I was thinking maybe some of these structures could just modify a copy of that data instead of the actual structure itself, but there are so many functions called it's hard to see if anything would be using it.
>
> Any ideas for a way to work around this?

The first thing I always think about when seeing a problem like this is not "how to reduce this contention" but "do we need to be doing this at all?"

Without having looked at that code in a long time, I'm having a hard time thinking why the OST needs to allocate a new block for the config during mount. It is probably worthwhile to investigate why this is happening in the first place, and possibly just eliminate useless work, rather than making it slightly less slow.

Unfortunately, I don't have the bandwidth to look at this, but maybe Nathan or someone with more familiarity with the config code can chime in.
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
> The first thing I always think about when seeing a problem like this is not "how to reduce this contention" but "do we need to be doing this at all?"
>
> Without having looked at that code in a long time, I'm having a hard time thinking why the OST needs to allocate a new block for the config during mount. It is probably worthwhile to investigate why this is happening in the first place, and possibly just eliminate useless work, rather than making it slightly less slow.

I can't really answer whether "we need to do it", but I can elaborate on what is happening. The actual write that is being done during lustre_process_log is when the llog is being copied from the remote server to the local server. I assume this is at least a necessary step and that there's no getting rid of the llog without some sort of overhaul. I don't really have an idea of how large the llog can get, but at the sizes I've seen it does seem reasonable that it could be copied from the remote MGS into memory, the lock released, and the log then written out to disk.

> Unfortunately, I don't have the bandwidth to look at this, but maybe Nathan or someone with more familiarity with the config code can chime in.
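To illustrate the idea, here is a rough sketch of that split, with entirely hypothetical helper names (mgc_fetch_config_llog, local_write_config_llog); whether the llog code can actually be separated into a "fetch into memory" step and a "write locally" step is precisely the open question, so this is an illustration of the suggestion, not a proposed patch:

    static int process_log_less_serialized(struct super_block *sb, char *logname,
                                           struct config_llog_instance *cfg)
    {
            struct lustre_sb_info *lsi = s2lsi(sb);
            void *buf = NULL;
            int size = 0;
            int rc;

            rc = server_mgc_set_fs(lsi->lsi_mgc, sb);
            if (rc)
                    return rc;

            /* Hypothetical: pull the config llog over the network into a
             * memory buffer only; this still needs the MGC fs setup, so the
             * semaphore is held just for the (fast) network copy. */
            rc = mgc_fetch_config_llog(lsi->lsi_mgc, logname, &buf, &size);

            /* Drop cl_mgc_sem so the next OST on this OSS can start. */
            server_mgc_clear_fs(lsi->lsi_mgc);

            if (rc == 0) {
                    /* Hypothetical: the slow block search and write on a
                     * nearly full OST now happens outside the shared
                     * semaphore. */
                    rc = local_write_config_llog(sb, logname, buf, size, cfg);
                    OBD_FREE(buf, size);
            }
            return rc;
    }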
On 2010-11-09, at 08:55, Jeremy Filizetti wrote:
>> The first thing I always think about when seeing a problem like this is not "how to reduce this contention" but "do we need to be doing this at all?"
>
> I can't really answer whether "we need to do it", but I can elaborate on what is happening. The actual write that is being done during lustre_process_log is when the llog is being copied from the remote server to the local server. I assume this is at least a necessary step and that there's no getting rid of the llog without some sort of overhaul.

In fact, this config llog copying should only be done on the first mount of the OST, or if the configuration has changed.

We've actually removed this entirely for the 2.4 release, though that doesn't help you now.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On Wed, Nov 10, 2010 at 5:40 AM, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> On 2010-11-09, at 08:55, Jeremy Filizetti wrote:
>>> The first thing I always think about when seeing a problem like this is not "how to reduce this contention" but "do we need to be doing this at all?"
>>
>> I can't really answer whether "we need to do it", but I can elaborate on what is happening. The actual write that is being done during lustre_process_log is when the llog is being copied from the remote server to the local server. I assume this is at least a necessary step and that there's no getting rid of the llog without some sort of overhaul.
>
> In fact, this config llog copying should only be done on the first mount of the OST, or if the configuration has changed.
>
> We've actually removed this entirely for the 2.4 release, though that doesn't help you now.

Does that mean the llog component of Lustre is completely removed? Is 2.4 an Oracle-only release?
On 2010-12-16, at 6:47, Jeremy Filizetti <jeremy.filizetti at gmail.com> wrote:
> On Wed, Nov 10, 2010 at 5:40 AM, Andreas Dilger <andreas.dilger at oracle.com> wrote:
>> In fact, this config llog copying should only be done on the first mount of the OST, or if the configuration has changed.
>>
>> We've actually removed this entirely for the 2.4 release, though that doesn't help you now.
>
> Does that mean the llog component of Lustre is completely removed? Is 2.4 an Oracle-only release?

No, only that the copying of the config llog from the MGS to the OST has been removed. The llog subsystem is still used to maintain distributed operation consistency.

"Lustre 2.4" is the anticipated release number when that change might become available. As yet it is still pre-alpha code, and while this change is available in bugzilla as a series of patches, it would need some effort to port it to 1.8.

Cheers, Andreas