I did some btrfs RTFS over the weeking and I have a hard time understanding what this code is attempting to do: 28 int btrfs_tree_lock(struct extent_buffer *eb) 29 { 30 int i; 31 32 if (mutex_trylock(&eb->mutex)) 33 return 0; 34 for (i = 0; i < 512; i++) { 35 cpu_relax(); 36 if (mutex_trylock(&eb->mutex)) 37 return 0; 38 } 39 cpu_relax(); 40 mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(e b)); 41 return 0; 42 } The trylocks seem pretty pointless. I presume it can be all replaced with the mutex_lock_nested() in line 40. Also the return value seems pointless because noone checks it. Like in the appended patch. Or do I miss something? --- Remove unneeded trylocking and unused return value in btrfs_tree_lock Signed-off-by: Andi Kleen <ak@linux.intel.com> diff -r 417d87e57364 locking.c --- a/locking.c Wed Aug 20 13:39:41 2008 -0400 +++ b/locking.c Mon Sep 08 13:09:21 2008 +0200 @@ -25,20 +25,9 @@ #include "extent_io.h" #include "locking.h" -int btrfs_tree_lock(struct extent_buffer *eb) +void btrfs_tree_lock(struct extent_buffer *eb) { - int i; - - if (mutex_trylock(&eb->mutex)) - return 0; - for (i = 0; i < 512; i++) { - cpu_relax(); - if (mutex_trylock(&eb->mutex)) - return 0; - } - cpu_relax(); mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(eb)); - return 0; } int btrfs_try_tree_lock(struct extent_buffer *eb) diff -r 417d87e57364 locking.h --- a/locking.h Wed Aug 20 13:39:41 2008 -0400 +++ b/locking.h Mon Sep 08 13:09:21 2008 +0200 @@ -19,7 +19,7 @@ #ifndef __BTRFS_LOCKING_ #define __BTRFS_LOCKING_ -int btrfs_tree_lock(struct extent_buffer *eb); +void btrfs_tree_lock(struct extent_buffer *eb); int btrfs_tree_unlock(struct extent_buffer *eb); int btrfs_tree_locked(struct extent_buffer *eb); int btrfs_try_tree_lock(struct extent_buffer *eb); -- ak@linux.intel.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2008-09-08 at 13:10 +0200, Andi Kleen wrote:> I did some btrfs RTFS over the weeking and I have a hard time understanding > what this code is attempting to do: > > 28 int btrfs_tree_lock(struct extent_buffer *eb) > 29 { > 30 int i; > 31 > 32 if (mutex_trylock(&eb->mutex)) > 33 return 0; > 34 for (i = 0; i < 512; i++) { > 35 cpu_relax(); > 36 if (mutex_trylock(&eb->mutex)) > 37 return 0; > 38 } > 39 cpu_relax(); > 40 mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(e b)); > 41 return 0; > 42 } > > The trylocks seem pretty pointless. > I presume it can be all replaced with the mutex_lock_nested() in line 40. > Also the return value seems pointless because noone checks it. Like > in the appended patch. Or do I miss something?The idea is to try to spin for a bit to avoid scheduling away, which is especially important for the high levels. Most holders of the mutex let it go very quickly. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> The idea is to try to spin for a bit to avoid scheduling away, which is > especially important for the high levels. Most holders of the mutex > let it go very quickly.Ok but that surely should be implemented in the general mutex code then or at least in a standard adaptive mutex wrapper? -Andi -- ak@linux.intel.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote:> > The idea is to try to spin for a bit to avoid scheduling away, which is > > especially important for the high levels. Most holders of the mutex > > let it go very quickly. > > Ok but that surely should be implemented in the general mutex code then > or at least in a standard adaptive mutex wrapper?That depends, am I the only one crazy enough to think its a good idea? If it were in the general mutex code, that''s probably a faster/smarter way to spin. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote:> On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote: > > > The idea is to try to spin for a bit to avoid scheduling away, which is > > > especially important for the high levels. Most holders of the mutex > > > let it go very quickly. > > > > Ok but that surely should be implemented in the general mutex code then > > or at least in a standard adaptive mutex wrapper? > > That depends, am I the only one crazy enough to think its a good idea?Adaptive mutexes are classic, a lot of other OS have it. Gregory et.al. also saw big improvements in the RT kernel (they posted a patchkit a few times) But a lot of people don''t like them for some reason. Anyways hiding them in a fs is probably wrong. -Andi -- ak@linux.intel.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 8 Sep 2008 16:20:52 +0200 Andi Kleen <andi@firstfloor.org> wrote:> On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote: > > On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote: > > > > The idea is to try to spin for a bit to avoid scheduling away, which is > > > > especially important for the high levels. Most holders of the mutex > > > > let it go very quickly. > > > > > > Ok but that surely should be implemented in the general mutex code then > > > or at least in a standard adaptive mutex wrapper? > > > > That depends, am I the only one crazy enough to think its a good idea? > > Adaptive mutexes are classic, a lot of other OS have it.The problem is that they are a nuisance. It is impossible to choose the right trade off between spin an no-spin, also they optimize for a case that doesn''t occur often enough to be justified. People seem to repeatedly come up with adaptive mutex based on intuitive hunch, and never do much analysis before or afterwards. You need some facts to come up with a useful model: % of time lock is contended average lock hold time overhead of entry-exit for lock primitive (spin time) overhead of the adaptive version either pure spin or pure mutex Also, adaptive locks have even worse unfairness than spin locks under contention. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2008-09-08 at 08:07 -0700, Stephen Hemminger wrote:> On Mon, 8 Sep 2008 16:20:52 +0200 > Andi Kleen <andi@firstfloor.org> wrote: > > > On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote: > > > On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote: > > > > > The idea is to try to spin for a bit to avoid scheduling away, which is > > > > > especially important for the high levels. Most holders of the mutex > > > > > let it go very quickly. > > > > > > > > Ok but that surely should be implemented in the general mutex code then > > > > or at least in a standard adaptive mutex wrapper? > > > > > > That depends, am I the only one crazy enough to think its a good idea? > > > > Adaptive mutexes are classic, a lot of other OS have it. > > The problem is that they are a nuisance. It is impossible to choose > the right trade off between spin an no-spin, also they optimize for > a case that doesn''t occur often enough to be justified. > > People seem to repeatedly come up with adaptive mutex based on intuitive > hunch, and never do much analysis before or afterwards. >In my case, it very easy to measure. Just watch the context switch rate on any metadata intensive workload. The current code scores higher on benchmarks and uses less system time overall. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 08, 2008 at 08:07:51AM -0700, Stephen Hemminger wrote:> On Mon, 8 Sep 2008 16:20:52 +0200 > Andi Kleen <andi@firstfloor.org> wrote: > > > On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote: > > > On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote: > > > > > The idea is to try to spin for a bit to avoid scheduling away, which is > > > > > especially important for the high levels. Most holders of the mutex > > > > > let it go very quickly. > > > > > > > > Ok but that surely should be implemented in the general mutex code then > > > > or at least in a standard adaptive mutex wrapper? > > > > > > That depends, am I the only one crazy enough to think its a good idea? > > > > Adaptive mutexes are classic, a lot of other OS have it. > > The problem is that they are a nuisance. It is impossible to choose > the right trade off between spin an no-spin, also they optimize for > a case that doesn''t occur often enough to be justified.At least the numbers done by Gregory et.al. were dramatic improvements. Given that was an extreme case in that the rt kernel does everything with mutexes, but it was still a very clear win on a wide range of workloads. -Andi -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 8 Sep 2008 17:47:14 +0200 Andi Kleen <andi@firstfloor.org> wrote:> On Mon, Sep 08, 2008 at 08:07:51AM -0700, Stephen Hemminger wrote: > > On Mon, 8 Sep 2008 16:20:52 +0200 > > Andi Kleen <andi@firstfloor.org> wrote: > > > > > On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote: > > > > On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote: > > > > > > The idea is to try to spin for a bit to avoid scheduling away, which is > > > > > > especially important for the high levels. Most holders of the mutex > > > > > > let it go very quickly. > > > > > > > > > > Ok but that surely should be implemented in the general mutex code then > > > > > or at least in a standard adaptive mutex wrapper? > > > > > > > > That depends, am I the only one crazy enough to think its a good idea? > > > > > > Adaptive mutexes are classic, a lot of other OS have it. > > > > The problem is that they are a nuisance. It is impossible to choose > > the right trade off between spin an no-spin, also they optimize for > > a case that doesn''t occur often enough to be justified. > > At least the numbers done by Gregory et.al. were dramatic improvements. > Given that was an extreme case in that the rt kernel does everything > with mutexes, but it was still a very clear win on a wide range > of workloads. > > -AndiMy guess is that the improvement happens mostly from the first couple of tries, not from repeated spinning. And since it is a mutex, you could even do: if (mutex_trylock(&eb->mutex)) return 0; cpu_relax(); if (mutex_trylock(&eb->mutex)) return 0; yield(); return mutex_lock(&eb->mutex); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2008-09-08 at 08:50 -0700, Stephen Hemminger wrote:> On Mon, 8 Sep 2008 17:47:14 +0200 > Andi Kleen <andi@firstfloor.org> wrote: > > > On Mon, Sep 08, 2008 at 08:07:51AM -0700, Stephen Hemminger wrote: > > > On Mon, 8 Sep 2008 16:20:52 +0200 > > > Andi Kleen <andi@firstfloor.org> wrote: > > > > > > > On Mon, Sep 08, 2008 at 10:02:30AM -0400, Chris Mason wrote: > > > > > On Mon, 2008-09-08 at 15:54 +0200, Andi Kleen wrote: > > > > > > > The idea is to try to spin for a bit to avoid scheduling away, which is > > > > > > > especially important for the high levels. Most holders of the mutex > > > > > > > let it go very quickly. > > > > > > > > > > > > Ok but that surely should be implemented in the general mutex code then > > > > > > or at least in a standard adaptive mutex wrapper? > > > > > > > > > > That depends, am I the only one crazy enough to think its a good idea? > > > > > > > > Adaptive mutexes are classic, a lot of other OS have it. > > > > > > The problem is that they are a nuisance. It is impossible to choose > > > the right trade off between spin an no-spin, also they optimize for > > > a case that doesn''t occur often enough to be justified. > > > > At least the numbers done by Gregory et.al. were dramatic improvements. > > Given that was an extreme case in that the rt kernel does everything > > with mutexes, but it was still a very clear win on a wide range > > of workloads. > > > > -Andi > > My guess is that the improvement happens mostly from the first couple of tries, > not from repeated spinning. And since it is a mutex, you could even do:I started with lower spin counts, I really didn''t want to spin at all but the current values came from trial and error. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Chris Mason wrote:>> My guess is that the improvement happens mostly from the first couple of tries, >> not from repeated spinning. And since it is a mutex, you could even do: > > I started with lower spin counts, I really didn''t want to spin at all > but the current values came from trial and error.Exactly the problem Steven is saying about adaptive locking. Using benchmarks (or any test), on a small sample of systems leads you to conclude "this design/tuning combination is better". I''ve been burned repeatedly by that... ugly things happen as you move away from your design testing center. I''m not saying your code does not work, just that we need a lot more proof with different configurations and loads to see that is "at least no worse". jim -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2008-09-08 at 12:13 -0400, jim owens wrote:> Chris Mason wrote: > >> My guess is that the improvement happens mostly from the first couple of tries, > >> not from repeated spinning. And since it is a mutex, you could even do: > > > > I started with lower spin counts, I really didn''t want to spin at all > > but the current values came from trial and error. > > Exactly the problem Steven is saying about adaptive locking. > > Using benchmarks (or any test), on a small sample of systems > leads you to conclude "this design/tuning combination is better". > > I''ve been burned repeatedly by that... ugly things happen > as you move away from your design testing center. > > I''m not saying your code does not work, just that we need > a lot more proof with different configurations and loads > to see that is "at least no worse".Oh, I completely agree. This is tuned on just one CPU in a handful of workloads. In general, it makes sense to spin for about as long as it takes someone to do a btree search in the block, which we could benchmark up front at mount time. I could also get better results from an API where the holder of the lock indicates it is going to hold on to things for a while, which might happen right before doing an IO. Over the long term these are important issues, but for today I''m focused on the disk format ;) -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 08 Sep 2008 12:20:32 -0400 Chris Mason <chris.mason@oracle.com> wrote:> On Mon, 2008-09-08 at 12:13 -0400, jim owens wrote: > > Chris Mason wrote: > > >> My guess is that the improvement happens mostly from the first couple of tries, > > >> not from repeated spinning. And since it is a mutex, you could even do: > > > > > > I started with lower spin counts, I really didn''t want to spin at all > > > but the current values came from trial and error. > > > > Exactly the problem Steven is saying about adaptive locking. > > > > Using benchmarks (or any test), on a small sample of systems > > leads you to conclude "this design/tuning combination is better". > > > > I''ve been burned repeatedly by that... ugly things happen > > as you move away from your design testing center. > > > > I''m not saying your code does not work, just that we need > > a lot more proof with different configurations and loads > > to see that is "at least no worse". > > Oh, I completely agree. This is tuned on just one CPU in a handful of > workloads. In general, it makes sense to spin for about as long as it > takes someone to do a btree search in the block, which we could > benchmark up front at mount time. > > I could also get better results from an API where the holder of the lock > indicates it is going to hold on to things for a while, which might > happen right before doing an IO. > > Over the long term these are important issues, but for today I''m focused > on the disk format ;) > > -chris > >Not to mention the problem that developers seem to have faster machines than average user, but slower than the enterprise and future generation CPU''s. So any tuning value seems to get out of date fast. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2008-Sep-08 17:16 UTC
adaptive mutexes, was Re: btrfs_tree_lock & trylock
On Mon, Sep 08, 2008 at 08:07:51AM -0700, Stephen Hemminger wrote:> The problem is that they are a nuisance. It is impossible to choose > the right trade off between spin an no-spin, also they optimize for > a case that doesn''t occur often enough to be justified. > > People seem to repeatedly come up with adaptive mutex based on intuitive > hunch, and never do much analysis before or afterwards. > > You need some facts to come up with a useful model: > % of time lock is contended > average lock hold time > overhead of entry-exit for lock primitive (spin time) > overhead of the adaptive version either pure spin or pure mutex > > Also, adaptive locks have even worse unfairness than spin locks under > contention.Well, the traditional wisdom in kernel land is that you want a spinlock if the contention phases are short. But we grow more an more places where we might do sleep under the lock. One optimization would be to spin, but only if the mutex holder is not sleeping. Or we make the spinning a completely different API, mutex_lock_adaptive() -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Sep 08, 2008 at 09:49:42AM -0700, Stephen Hemminger wrote:> Not to mention the problem that developers seem to have faster machines than > average user, but slower than the enterprise and future generation CPU''s. > So any tuning value seems to get out of date fast.So where do my fellow developers get these fast systems from? :) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig wrote:> On Mon, Sep 08, 2008 at 09:49:42AM -0700, Stephen Hemminger wrote: > >> Not to mention the problem that developers seem to have faster machines than >> average user, but slower than the enterprise and future generation CPU''s. >> So any tuning value seems to get out of date fast. >> > > So where do my fellow developers get these fast systems from? :) > >Actually, my experience is that most linux file system developers have really poking, aged hardware with little to no cutting edge storage (or even cutting edge commercial class storage). Testing file systems on lap tops and desktops with old CPUs, no DRAM and 40GB drives does not even reflect a typical home user these days :-) ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Monday 08 September 2008 16:28:34 Chris Mason wrote:> > People seem to repeatedly come up with adaptive mutex based on intuitive > > hunch, and never do much analysis before or afterwards. > > In my case, it very easy to measure. Just watch the context switch rate > on any metadata intensive workload. The current code scores higher on > benchmarks and uses less system time overall. >Given that it''s system-dependent, perhaps making the 512 a #define''d value (defaulting to 512) while it''s current, might be worth doing? That would allow users to test via different values passed to make, and so gather more data, or just use the setting they prefer, if they''re bothered. At least until the adaptive mutex API is worked out (which might well take the number of attempts to spin/relax as a parameter. I''d add a shouldYield [for before final acquisition] and keep the ifSleeping as implicit in another fn, based on what I''ve read, but thankfully that''s not my problem.) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Monday 08 September 2008 18:32:14 Ric Wheeler wrote:> Christoph Hellwig wrote: > > On Mon, Sep 08, 2008 at 09:49:42AM -0700, Stephen Hemminger wrote: > >> Not to mention the problem that developers seem to have faster machines > >> than average user, but slower than the enterprise and future generation > >> CPU''s. So any tuning value seems to get out of date fast. > > > > So where do my fellow developers get these fast systems from? :) > > Actually, my experience is that most linux file system developers have > really poking, aged hardware with little to no cutting edge storage (or > even cutting edge commercial class storage). Testing file systems on lap > tops and desktops with old CPUs, no DRAM and 40GB drives does not even > reflect a typical home user these days :-) >Yeah but it means they really notice when the algorithm has been improved. It''s character-building, or summat. ;-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Possibly Parallel Threads
- [PATCH] speed up snapshot dropping
- [PATCH] Btrfs: fix passing wrong arg gfp_t to decide the correct allocation mode
- [PATCH] Btrfs: make some functions return void
- [PATCH v0 00/18] btfs: Subvolume Quota Groups
- [PATCH V2] Btrfs: add direct I/O helper to process inline compressed extents.