Hello everyone,

I've done some testing against Linus' git tree from last night and the
current btrfs trees still work well.

There are a few bug fixes that I need to include from while I was on
vacation, but I haven't made any large changes since early in December.

Btrfs details and usage information can be found here:

http://btrfs.wiki.kernel.org/

The btrfs kernel code is here:

http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=summary

And the utilities are here:

http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git;a=summary

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 31 Dec 2008 06:28:55 -0500 Chris Mason <chris.mason@oracle.com> wrote:

> Hello everyone,

Hi!

> I've done some testing against Linus' git tree from last night and the
> current btrfs trees still work well.

what's btrfs? I think I've heard the name before, but I've never
seen the patches :)
On Wed, 2008-12-31 at 10:45 -0800, Andrew Morton wrote:
> On Wed, 31 Dec 2008 06:28:55 -0500 Chris Mason <chris.mason@oracle.com> wrote:
>
> > Hello everyone,
>
> Hi!
>
> > I've done some testing against Linus' git tree from last night and the
> > current btrfs trees still work well.
>
> what's btrfs? I think I've heard the name before, but I've never
> seen the patches :)

The source is up to around 38k loc, I thought it better to use that http
thing for people who were interested in the code.

There is also a standalone git repo:

http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable-standalone.git;a=summary

This has only btrfs as a module and would be the fastest way to see
the .c files. btrfs doesn't have any changes outside of fs/Makefile and
fs/Kconfig (happy new year ;)

-chris
Hi,

On Wed, 31 Dec 2008 18:19:09 -0500, Chris Mason <chris.mason@oracle.com> wrote:
>
> This has only btrfs as a module and would be the fastest way to see
> the .c files. btrfs doesn't have any changes outside of fs/Makefile and
> fs/Kconfig

I found some overlapping (or cloned) functions in
btrfs-unstable.git/fs/btrfs, for example:

- Declarations to apply hardware crc32c in fs/btrfs/crc32c.h:
  The same code is found in arch/x86/crypto/crc32c-intel.c

- btrfs_wait_on_page_writeback_range() and btrfs_fdatawrite_range():
  These are clones of wait_on_page_writeback_range() and
  __filemap_fdatawrite_range() respectively, and can be removed if those
  are just exported.

- Copies of add_to_page_cache_lru() found in compression.c and extent_io.c
  (can be replaced if it's exported)

How about including patches to resolve these in the btrfs kernel tree
(or a patchset to be posted)?

In addition, there seem to be well-separated reusable routines such as
async-thread (an enhanced workqueue) and extent_map. Do you intend to
move these into lib/ or so?

I also tried scripts/checkpatch.pl against btrfs, and it detected
45 ERRORs and 93 WARNINGs. I think it's a good opportunity to clean
up these violations.

With regards,
Ryusuke Konishi
On Sat, 2009-01-03 at 01:37 +0900, Ryusuke Konishi wrote:
> I found some overlapping (or cloned) functions in
> btrfs-unstable.git/fs/btrfs, for example:
>
> - Declarations to apply hardware crc32c in fs/btrfs/crc32c.h:
>   The same code is found in arch/x86/crypto/crc32c-intel.c

Yes, I can just remove the btrfs version of this for now.

> - btrfs_wait_on_page_writeback_range() and btrfs_fdatawrite_range():
>   These are clones of wait_on_page_writeback_range() and
>   __filemap_fdatawrite_range() respectively, and can be removed if those
>   are just exported.
>
> - Copies of add_to_page_cache_lru() found in compression.c and extent_io.c
>   (can be replaced if it's exported)
>
> How about including patches to resolve these in the btrfs kernel tree
> (or a patchset to be posted)?

My plan was to export those after btrfs was actually in. But on Monday
I'll send along a patch to export them and make compat functions in btrfs.

> In addition, there seem to be well-separated reusable routines such as
> async-thread (an enhanced workqueue) and extent_map. Do you intend to
> move these into lib/ or so?
>
> I also tried scripts/checkpatch.pl against btrfs, and it detected
> 45 ERRORs and 93 WARNINGs. I think it's a good opportunity to clean
> up these violations.

Good point, thanks for looking at the code.

-chris
Chris Mason <chris.mason@oracle.com> writes:
> On Wed, 2008-12-31 at 10:45 -0800, Andrew Morton wrote:
>> what's btrfs? I think I've heard the name before, but I've never
>> seen the patches :)
>
> The source is up to around 38k loc, I thought it better to use that http
> thing for people who were interested in the code.
>
> There is also a standalone git repo:
>
> http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable-standalone.git;a=summary

Some items I remember from my last look at the code that should
be cleaned up before a mainline merge (that wasn't a full in-depth review):

- locking.c needs a lot of cleanup.
  If combination spinlocks/mutexes are really a win they should be
  in the generic mutex framework. And I'm still dubious on the hardcoded
  numbers.
- compat.h needs to go
- there's various copy'n'pasted code from the VFS (like may_create)
  that needs to be cleaned up.
- there should be manpages for all the ioctls and other interfaces.
- ioctl.c was not explicitly root protected. security issues?
- some code was severely undercommented.
  e.g. each file should at least have a one-liner describing what it
  does (ideally at least a paragraph). Bad examples are export.c or
  free-space-cache.c, but also others.
- ENOMEM checks are still missing all over (e.g. with most of the
  btrfs_alloc_path callers). If you keep it that way you would need
  at least XFS-style "loop for ever" alloc wrappers, but better just
  fix all the callers. Also there used to be a lot of BUG_ON()s on
  memory allocation failure even.
- In general BUG_ONs need review I think. Lots of externally triggerable
  ones.
- various checkpatch.pl level problems I think (e.g. printk levels)
- the printks should all include which file system they refer to

In general I think the whole thing needs more review.

-Andi
--
ak@linux.intel.com
On Fri, 2009-01-02 at 20:05 +0100, Andi Kleen wrote:
> Some items I remember from my last look at the code that should
> be cleaned up before a mainline merge (that wasn't a full in-depth review):

Hi Andi, thanks for looking at things.

> - locking.c needs a lot of cleanup.

Grin, lots of the code needs to be cleaned, but locking.c is really just
a few wrappers around the mutex calls. I don't think I had it at the
top of my to-be-cleaned list ;)

> If combination spinlocks/mutexes are really a win they should be
> in the generic mutex framework. And I'm still dubious on the hardcoded
> numbers.

Sure, I'm happy to use a generic framework there (or help create one).
They are definitely a win for btrfs, and show up in most benchmarks.

> - compat.h needs to go

Projects that are out of mainline have a difficult task of making sure
development can continue until they are in mainline and being clean
enough to merge. I'd rather get rid of the small amount of compat code
that I have left after btrfs is in (compat.h is 32 lines). It isn't
hurting anything, and taking it out makes it much more difficult for our
current users.

> - there's various copy'n'pasted code from the VFS (like may_create)
> that needs to be cleaned up.

Yes, I tried to mark those as I did them (a very small number of
functions). In general they were copied to avoid adding exports, and
that is easily fixed.

> - there should be manpages for all the ioctls and other interfaces.
> - ioctl.c was not explicitly root protected. security issues?

Christoph added a CAP_SYS_ADMIN check to the trans start ioctl, but I do
need to add one to the device add/remove/balance code as well. The
subvol/snapshot creation is meant to be user callable (controlled by
something similar to quotas later on).

> - some code was severely undercommented.
> e.g. each file should at least have a one-liner describing what it
> does (ideally at least a paragraph). Bad examples are export.c or
> free-space-cache.c, but also others.
> - ENOMEM checks are still missing all over (e.g. with most of the
> btrfs_alloc_path callers). If you keep it that way you would need
> at least XFS-style "loop for ever" alloc wrappers, but better just
> fix all the callers.

Yes, there's quite some work to do in the error handling paths.

> Also there used to be a lot of BUG_ON()s on
> memory allocation failure even.
> - In general BUG_ONs need review I think. Lots of externally triggerable
> ones.
> - various checkpatch.pl level problems I think (e.g. printk levels)
> - the printks should all include which file system they refer to
>
> In general I think the whole thing needs more review.

I don't disagree; please do keep in mind that I'm not suggesting anyone
use this in production yet.

-chris
On Sat, 2009-01-03 at 01:37 +0900, Ryusuke Konishi wrote:
> Hi,
> On Wed, 31 Dec 2008 18:19:09 -0500, Chris Mason <chris.mason@oracle.com> wrote:
> >
> > This has only btrfs as a module and would be the fastest way to see
> > the .c files. btrfs doesn't have any changes outside of fs/Makefile and
> > fs/Kconfig

[ ... ]

> In addition, there seem to be well-separated reusable routines such as
> async-thread (an enhanced workqueue) and extent_map. Do you intend to
> move these into lib/ or so?

Sorry, looks like I hit send too soon that time. The async-thread code
is very self-contained, and was intended for generic use. Pushing that
into lib is probably a good idea.

The extent_map and extent_buffer code was also intended for generic use.
It needs some love and care (making it work for blocksize != pagesize)
before I'd suggest moving it out of fs/btrfs.

-chris
On Fri, Jan 02, 2009 at 02:32:29PM -0500, Chris Mason wrote:
> > If combination spinlocks/mutexes are really a win they should be
> > in the generic mutex framework. And I'm still dubious on the hardcoded
> > numbers.
>
> Sure, I'm happy to use a generic framework there (or help create one).
> They are definitely a win for btrfs, and show up in most benchmarks.

If they are such a big win then likely they will help other users
too and should be generic in some form.

> > - compat.h needs to go
>
> Projects that are out of mainline have a difficult task of making sure
> development can continue until they are in mainline and being clean
> enough to merge. I'd rather get rid of the small amount of compat code
> that I have left after btrfs is in (compat.h is 32 lines).

It's fine for an out-of-tree variant, but the in-tree version shouldn't
have compat.h. For out of tree you just apply a patch that adds the
includes; e.g. compat-wireless and lots of other projects do it this
way.

> > - there should be manpages for all the ioctls and other interfaces.
> > - ioctl.c was not explicitly root protected. security issues?
>
> Christoph added a CAP_SYS_ADMIN check to the trans start ioctl, but I do
> need to add one to the device add/remove/balance code as well.

Ok. Didn't see that. It still needs to be carefully audited for
security holes even with root checks.

Another thing is that once auto mounting is enabled, each usb stick with
btrfs on it could be a root hole if you have buffer overflows somewhere
triggerable by disk data. I guess that would need some checking too.

> The subvol/snapshot creation is meant to be user callable (controlled by
> something similar to quotas later on).

But right now that's not there, so it should be root only.

> > Also there used to be a lot of BUG_ON()s on
> > memory allocation failure even.
> > - In general BUG_ONs need review I think. Lots of externally triggerable
> > ones.
> > - various checkpatch.pl level problems I think (e.g. printk levels)
> > - the printks should all include which file system they refer to
> >
> > In general I think the whole thing needs more review.
>
> I don't disagree; please do keep in mind that I'm not suggesting anyone
> use this in production yet.

When it's in mainline I suspect people will start using it for that.

-Andi
--
ak@linux.intel.com
On Fri, 2009-01-02 at 22:01 +0100, Andi Kleen wrote:
> On Fri, Jan 02, 2009 at 02:32:29PM -0500, Chris Mason wrote:
> > Sure, I'm happy to use a generic framework there (or help create one).
> > They are definitely a win for btrfs, and show up in most benchmarks.
>
> If they are such a big win then likely they will help other users
> too and should be generic in some form.

I don't disagree. It's about 6 lines of code though, and just hasn't
been at the top of my list. I'm sure the generic version will be
faster, as it could add checks to see if the holder of the lock was
actually running.

> It's fine for an out-of-tree variant, but the in-tree version shouldn't
> have compat.h. For out of tree you just apply a patch that adds the
> includes; e.g. compat-wireless and lots of other projects do it this
> way.

It helps debugging that my standalone tree is generated from, and
exactly the same as, fs/btrfs in the full kernel tree. I'll switch to a
pull and merge system for the standalone tree.

> Ok. Didn't see that. It still needs to be carefully audited for
> security holes even with root checks.

Yes, the most important one is the device scan ioctl (also missing the
root check, will fix).

> Another thing is that once auto mounting is enabled, each usb stick
> with btrfs on it could be a root hole if you have buffer overflows
> somewhere triggerable by disk data. I guess that would need some
> checking too.
>
> > The subvol/snapshot creation is meant to be user callable (controlled by
> > something similar to quotas later on).
>
> But right now that's not there, so it should be root only.

I'll switch to checking against directory permissions for now.
subvol/snapshot creation are basically mkdir anyway, so this fits well.

> > I don't disagree; please do keep in mind that I'm not suggesting anyone
> > use this in production yet.
>
> When it's in mainline I suspect people will start using it for that.

I think the larger question here is where we want development to happen.
I'm definitely not pretending that btrfs is perfect, but I strongly
believe that it will be a better filesystem if the development moves to
mainline where it will attract more eyeballs and more testers.

-chris
> > > I don't disagree; please do keep in mind that I'm not suggesting anyone
> > > use this in production yet.
> >
> > When it's in mainline I suspect people will start using it for that.
>
> I think the larger question here is where we want development to happen.
> I'm definitely not pretending that btrfs is perfect, but I strongly
> believe that it will be a better filesystem if the development moves to
> mainline where it will attract more eyeballs and more testers.

One possibility would be to mimic ext4 and register the fs as "btrfsdev"
until it's considered stable enough for production. I agree with the
consensus that we want to use the upstream kernel as a nexus for
coordinating btrfs development, so I don't think it's worth waiting a
release or two to merge something.

- R.
On Fri, 02 Jan 2009 14:38:07 -0500, Chris Mason <chris.mason@oracle.com> wrote:
> On Sat, 2009-01-03 at 01:37 +0900, Ryusuke Konishi wrote:
> > In addition, there seem to be well-separated reusable routines such as
> > async-thread (an enhanced workqueue) and extent_map. Do you intend to
> > move these into lib/ or so?
>
> Sorry, looks like I hit send too soon that time. The async-thread code
> is very self-contained, and was intended for generic use. Pushing that
> into lib is probably a good idea.

As for async-thread, kernel/ seems to be a better place (sorry, I also
hit send too soon ;)

Anyway, I think it should be reviewed deeply by scheduler people and a
wider range of people. So it's a good idea to put it out in order to
arouse interest.

> The extent_map and extent_buffer code was also intended for generic use.
> It needs some love and care (making it work for blocksize != pagesize)
> before I'd suggest moving it out of fs/btrfs.
>
> -chris

The extent_map itself seemed independent of that problem to me, but I
understand your plan.

Btrfs seems to have other helpful code, including pages/bio
compression, which may become separable too. And this may be the same
for the pages/bio encryption/decryption code which would come next.
(I don't mention the volume management/raid feature here to avoid
getting off the subject, but it's likewise.) I think it's wonderful if
they can be well integrated into sublayers or libraries.

Regards,
Ryusuke
On Fri, Jan 02, 2009 at 08:05:50PM +0100, Andi Kleen wrote:
> Some items I remember from my last look at the code that should
> be cleaned up before a mainline merge (that wasn't a full in-depth review):
>
> - locking.c needs a lot of cleanup.
> If combination spinlocks/mutexes are really a win they should be
> in the generic mutex framework. And I'm still dubious on the hardcoded
> numbers.

I don't think this needs to be cleaned up before merge. I've spent an
hour or two looking at it, and while we can do a somewhat better job as
part of the generic mutex framework, it's quite tricky (due to the
different <asm/mutex.h> implementations). It has the potential to
introduce some hard-to-hit bugs in the generic mutexes, and there's some
API discussions to have.

It's no worse than XFS (which still has its own implementation of
'synchronisation variables', a (very thin) wrapper around mutexes, a
(thin) wrapper around rwsems, and wrappers around kmalloc and
kmem_cache).

> - compat.h needs to go

Later. It's still there for XFS.

> - there's various copy'n'pasted code from the VFS (like may_create)
> that needs to be cleaned up.

No urgency here.

> - there should be manpages for all the ioctls and other interfaces.

I wonder if Michael Kerrisk has time to help with that. Cc'd.

> - ioctl.c was not explicitly root protected. security issues?

This does need auditing.

> - some code was severely undercommented.
> e.g. each file should at least have a one-liner describing what it
> does (ideally at least a paragraph). Bad examples are export.c or
> free-space-cache.c, but also others.

Nice to have, but generally not required.

> - ENOMEM checks are still missing all over (e.g. with most of the
> btrfs_alloc_path callers). If you keep it that way you would need
> at least XFS-style "loop for ever" alloc wrappers, but better just
> fix all the callers. Also there used to be a lot of BUG_ON()s on
> memory allocation failure even.
> - In general BUG_ONs need review I think. Lots of externally triggerable
> ones.

Agreed on these two.

> - various checkpatch.pl level problems I think (e.g. printk levels)

Can be fixed up later.

> - the printks should all include which file system they refer to

Ditto.

--
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
On Sat, Jan 03, 2009 at 12:17:06PM -0700, Matthew Wilcox wrote:
> It's no worse than XFS (which still has its own implementation of
> 'synchronisation variables',

Which are a trivial wrapper around wait queues. I have patches to kill
them, but I'm not entirely sure it's worth it.

> a (very thin) wrapper around mutexes,

nope.

> a (thin) wrapper around rwsems,

Which are needed so we can have asserts about the lock state, which
generic rwsems still don't have. At some point Peter looked into it,
and once we have that we can kill the wrapper.

> and wrappers around kmalloc and kmem_cache.
>
> > - compat.h needs to go
>
> Later. It's still there for XFS.

?

> > - there should be manpages for all the ioctls and other interfaces.
>
> I wonder if Michael Kerrisk has time to help with that. Cc'd.

Actually a lot of the ioctl APIs don't just need documentation but a
complete redo. That's true at least for the physical device management
and subvolume/snapshot ones.

> > - various checkpatch.pl level problems I think (e.g. printk levels)
>
> Can be fixed up later.
>
> > - the printks should all include which file system they refer to
>
> Ditto.

From painful experience with a lot of things, including a filesystem
you keep on mentioning, it's clear that once stuff is upstream there is
very little to no incentive to fix these things up.
On Sat, 2009-01-03 at 14:50 -0500, Christoph Hellwig wrote:
> On Sat, Jan 03, 2009 at 12:17:06PM -0700, Matthew Wilcox wrote:
> > > - compat.h needs to go
> >
> > Later. It's still there for XFS.
>
> ?
>
> > > - there should be manpages for all the ioctls and other interfaces.
> >
> > I wonder if Michael Kerrisk has time to help with that. Cc'd.
>
> Actually a lot of the ioctl APIs don't just need documentation but a
> complete redo. That's true at least for the physical device management
> and subvolume/snapshot ones.

The ioctl interface is definitely not finalized. Adding more vs
replacing the existing ones is an open question.

> > > - various checkpatch.pl level problems I think (e.g. printk levels)
> >
> > Can be fixed up later.
> >
> > > - the printks should all include which file system they refer to
> >
> > Ditto.
>
> From painful experience with a lot of things, including a filesystem
> you keep on mentioning, it's clear that once stuff is upstream there
> is very little to no incentive to fix these things up.

I'd disagree here. Cleanup incentive is a mixture of the people
involved and the attention the project has.

-chris
On Sat, Jan 03, 2009 at 02:50:34PM -0500, Christoph Hellwig wrote:
> On Sat, Jan 03, 2009 at 12:17:06PM -0700, Matthew Wilcox wrote:
> > It's no worse than XFS (which still has its own implementation of
> > 'synchronisation variables',
>
> Which are a trivial wrapper around wait queues. I have patches to kill
> them, but I'm not entirely sure it's worth it.

I'm not sure it's worth it either.

> > a (very thin) wrapper around mutexes,
>
> nope.

It's down to:

    typedef struct mutex mutex_t;

but it's still there.

> > a (thin) wrapper around rwsems,
>
> Which are needed so we can have asserts about the lock state, which
> generic rwsems still don't have. At some point Peter looked into it,
> and once we have that we can kill the wrapper.

Good to know. Rather like btrfs's wrappers around mutexes then ...

> > > - compat.h needs to go
> >
> > Later. It's still there for XFS.
>
> ?

XFS still has 'fs/xfs/linux-2.6'. It's a little bigger than compat.h,
for sure, and doesn't contain code for supporting different Linux
versions, sure. But it's still a compat layer.

> > > - there should be manpages for all the ioctls and other interfaces.
> >
> > I wonder if Michael Kerrisk has time to help with that. Cc'd.
>
> Actually a lot of the ioctl APIs don't just need documentation but a
> complete redo. That's true at least for the physical device management
> and subvolume/snapshot ones.

That's a more important critique than Andi's. Let's take care of that.

> From painful experience with a lot of things, including a filesystem
> you keep on mentioning, it's clear that once stuff is upstream there
> is very little to no incentive to fix these things up.

I don't think that's as true of btrfs as it was of XFS -- for example,
Chris has no incentive to keep compatibility with IRIX, or continue to
support CXFS. I don't think 'getting included in the kernel' is Chris's
goal, so much as it is a step towards making btrfs better.

--
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Hi Chris,

Does this mean that the disk format is finalised, or at least backward
compatible?

On Wed, 2008-12-31 at 06:28 -0500, Chris Mason wrote:
> Hello everyone,
>
> I've done some testing against Linus' git tree from last night and the
> current btrfs trees still work well.
>
> There are a few bug fixes that I need to include from while I was on
> vacation, but I haven't made any large changes since early in December.
>
> Btrfs details and usage information can be found here:
>
> http://btrfs.wiki.kernel.org/
>
> The btrfs kernel code is here:
>
> http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=summary
>
> And the utilities are here:
>
> http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git;a=summary
>
> -chris
Hi,

> One possibility would be to mimic ext4 and register the fs as "btrfsdev"
> until it's considered stable enough for production. I agree with the
> consensus that we want to use the upstream kernel as a nexus for
> coordinating btrfs development, so I don't think it's worth waiting a
> release or two to merge something.

I like this idea. I also want to test btrfs, but I'm not interested in
out-of-tree code.
On January 4, 2009, KOSAKI Motohiro wrote:
> Hi,
>
> > One possibility would be to mimic ext4 and register the fs as "btrfsdev"
> > until it's considered stable enough for production. I agree with the
> > consensus that we want to use the upstream kernel as a nexus for
> > coordinating btrfs development, so I don't think it's worth waiting a
> > release or two to merge something.
>
> I like this idea. I also want to test btrfs, but I'm not interested in
> out-of-tree code.

I'll second this. Please get btrfsdev into mainline asap.

TIA
Ed Tomlinson
On Sat, 2009-01-03 at 12:17 -0700, Matthew Wilcox wrote:
> > - locking.c needs a lot of cleanup.
> > If combination spinlocks/mutexes are really a win they should be
> > in the generic mutex framework. And I'm still dubious on the hardcoded
> > numbers.
>
> I don't think this needs to be cleaned up before merge. I've spent an
> hour or two looking at it, and while we can do a somewhat better job
> as part of the generic mutex framework, it's quite tricky (due to the
> different <asm/mutex.h> implementations). It has the potential to
> introduce some hard-to-hit bugs in the generic mutexes, and there's some
> API discussions to have.

I'm really opposed to having this in some filesystem. Please remove it
before merging it.

The -rt tree has adaptive spin patches for the rtmutex code; it's really
not all that hard to do -- the rtmutex code is way more tricky than the
regular mutexes due to all the PI fluff.

For kernel-only locking, the simple rule "spin iff the lock holder is
running" proved to be simple enough. Any added heuristics like max spin
count etc. only made things worse. The whole idea though did make sense
and certainly improved performance.

We've also been looking at doing adaptive spins for futexes, although
that does get a little more complex; furthermore, we've never gotten
around to actually doing any code on that.
On Sun, Jan 04, 2009 at 07:21:50PM +0100, Peter Zijlstra wrote:
> The -rt tree has adaptive spin patches for the rtmutex code, its really
> not all that hard to do -- the rtmutex code is way more tricky than the
> regular mutexes due to all the PI fluff.
>
> For kernel only locking the simple rule: spin iff the lock holder is
> running proved to be simple enough. Any added heuristics like max spin
> count etc. only made things worse. The whole idea though did make sense
> and certainly improved performance.

That implies moving

	struct thread_info *owner;

out from under the CONFIG_DEBUG_MUTEXES code. One of the original
justifications for mutexes was:

 - 'struct mutex' is smaller on most architectures: e.g. on x86,
   'struct semaphore' is 20 bytes, 'struct mutex' is 16 bytes.
   A smaller structure size means less RAM footprint, and better
   CPU-cache utilization.

I'd be reluctant to reverse that decision just for btrfs.

Benchmarking required! Maybe I can put a patch together that implements
the simple 'spin if it's running' heuristic and throw it at our testing
guys on Monday ...

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
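As a rough user-space illustration of the "spin iff the lock holder is
running" rule Peter describes (not the kernel code -- the struct, names,
and the explicit owner_on_cpu flag are all invented for the sketch; in the
kernel the scheduler answers that question):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space analogue of the adaptive-wait decision.
 * 'owner_on_cpu' stands in for a task_is_current()-style check; the
 * real loop would also watch for pending signals and call cpu_relax(). */
struct waiter_view {
	_Atomic(int) *lock_owner;    /* current owner id, 0 = free      */
	_Atomic(bool) *owner_on_cpu; /* is the holder running right now? */
	int owner_seen;              /* owner we decided to spin on      */
};

/* Returns true if the caller should go to sleep, false if it should
 * retry the lock (the owner changed or released while we watched). */
static bool adaptive_wait(struct waiter_view *w)
{
	for (;;) {
		if (atomic_load(w->lock_owner) != w->owner_seen)
			return false;   /* owner changed: retry the lock */
		if (!atomic_load(w->owner_on_cpu))
			return true;    /* holder was preempted: sleep   */
		/* a cpu_relax() equivalent would go here */
	}
}
```

The point of the rule is that spinning only pays while the holder is
making progress on another CPU; once it is scheduled out, sleeping is
strictly better.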
On Saturday 03 January 2009, Chris Mason wrote:
> > Actually a lot of the ioctl API don't just need documentation but
> > a complete redo. That's true at least for the physical device
> > management and subvolume / snaphot ones.
>
> The ioctl interface is definitely not finalized. Adding more vs
> replacing the existing ones is an open question.

As long as that's an open question, the ioctl interface shouldn't get
merged into the kernel, or should get in as btrfsdev, otherwise you get
stuck with the current ABI forever.

Is it possible to separate out the nonstandard ioctls into a patch that
can get merged when the interface is final, or will that make btrfs
unusable?

	Arnd <><
On Sat, 3 Jan 2009 8:01:04 am Andi Kleen wrote:
> When it's in mainline I suspect people will start using it for that.

Some people don't even wait for that. ;-)

Seriously though, if that is a concern can I suggest taking the btrfsdev
route and, if you want a real belt and braces approach, perhaps require
it to have a mandatory mount option specified to successfully mount,
maybe "eat_my_data" ?

cheers,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
On Saturday 03 January 2009 06:38:07 Chris Mason wrote:
> On Sat, 2009-01-03 at 01:37 +0900, Ryusuke Konishi wrote:
> > Hi,
> >
> > On Wed, 31 Dec 2008 18:19:09 -0500, Chris Mason <chris.mason@oracle.com> wrote:
> > > This has only btrfs as a module and would be the fastest way to see
> > > the .c files. btrfs doesn't have any changes outside of fs/Makefile
> > > and fs/Kconfig
>
> [ ... ]
>
> > In addition, there seem to be well-separated reusable routines such as
> > async-thread (enhanced workqueue) and extent_map. Do you intend to
> > move these into lib/ or so?
>
> Sorry, looks like I hit send too soon that time. The async-thread code
> is very self contained, and was intended for generic use. Pushing that
> into lib is probably a good idea.
>
> The extent_map and extent_buffer code was also intended for generic use.
> It needs some love and care (making it work for blocksize != pagesize)
> before I'd suggest moving it out of fs/btrfs.

I'm yet to be convinced it is a good idea to use extents for this. Been a
long time since we visited the issue, but when you converted ext2 to use
the extent mapping stuff, it actually went slower, and complexity went up
a lot (IIRC possibly required allocations in the writeback path).

So I think it is a fine idea to live in btrfs until it is more proven and
found useful elsewhere.
On Mon, 2009-01-05 at 21:07 +1100, Chris Samuel wrote:
> On Sat, 3 Jan 2009 8:01:04 am Andi Kleen wrote:
>
> > When it's in mainline I suspect people will start using it for that.
>
> Some people don't even wait for that. ;-)
>
> Seriously though, if that is a concern can I suggest taking the btrfsdev route
> and, if you want a real belt and braces approach, perhaps require it to have a
> mandatory mount option specified to successfully mount, maybe "eat_my_data" ?

I think ext4dev made more sense for ext4 because people generally expect
ext* to be stable. Btrfs doesn't quite have the reputation for stability
yet, so I don't feel we need a special -dev name for it.

But, if Andrew/Linus prefer that unstable filesystems are tagged with
-dev, I'm happy to do it.

-chris
On Mon, 2009-01-05 at 21:32 +1100, Nick Piggin wrote:
> On Saturday 03 January 2009 06:38:07 Chris Mason wrote:
> > The extent_map and extent_buffer code was also intended for generic use.
> > It needs some love and care (making it work for blocksize != pagesize)
> > before I'd suggest moving it out of fs/btrfs.
>
> I'm yet to be convinced it is a good idea to use extents for this. Been a
> long time since we visited the issue, but when you converted ext2 to use
> the extent mapping stuff, it actually went slower, and complexity went up
> a lot (IIRC possibly required allocations in the writeback path).
>
> So I think it is a fine idea to live in btrfs until it is more proven and
> found useful elsewhere.

It has gotten faster since then, but it makes sense to wait on moving
extent_* code.

-chris
On Sun, 2009-01-04 at 22:52 +0100, Arnd Bergmann wrote:
> On Saturday 03 January 2009, Chris Mason wrote:
> > > Actually a lot of the ioctl API don't just need documentation but
> > > a complete redo. That's true at least for the physical device
> > > management and subvolume / snaphot ones.
> >
> > The ioctl interface is definitely not finalized. Adding more vs
> > replacing the existing ones is an open question.
>
> As long as that's an open question, the ioctl interface shouldn't get
> merged into the kernel, or should get in as btrfsdev, otherwise you
> get stuck with the current ABI forever.

Maintaining the current ioctls isn't a problem. There aren't very many
and they do very discrete things.

The big part that may change is the device scanning, which may get more
integrated into udev and mount (see other threads about this). But, that
is one very simple ioctl, and most of the code it uses is going to stay
regardless of how the device scanning is done.

-chris
On Sat, 2009-01-03 at 18:44 +0900, Ryusuke Konishi wrote:
> On Fri, 02 Jan 2009 14:38:07 -0500, Chris Mason
>
> Btrfs seems to have other helpful code including pages/bio compression
> which may become separable, too. And, this may be the same for
> pages/bio encryption/decryption code which would come next. (I don't
> mention about the volume management/raid feature here to avoid getting
> off the subject, but it's likewise).

The compression code is somewhat tied to the btrfs internals, but it
could be pulled out without too much trouble. The big question there is
if other filesystems are interested in transparent compression support.

But, at the end of the day, most of the work is still done by the zlib
code. The btrfs bits just organize pages to send down to zlib.

-chris
On Sun, 2009-01-04 at 19:21 +0100, Peter Zijlstra wrote:
> On Sat, 2009-01-03 at 12:17 -0700, Matthew Wilcox wrote:
> > > - locking.c needs a lot of cleanup.
> > > If combination spinlocks/mutexes are really a win they should be
> > > in the generic mutex framework. And I'm still dubious on the hardcoded
> > > numbers.
> >
> > I don't think this needs to be cleaned up before merge. I've spent
> > an hour or two looking at it, and while we can do a somewhat better
> > job as part of the generic mutex framework, it's quite tricky (due to
> > the different <asm/mutex.h> implementations). It has the potential to
> > introduce some hard-to-hit bugs in the generic mutexes, and there's some
> > API discussions to have.
>
> I'm really opposed to having this in some filesystem. Please remove it
> before merging it.

It is 5 lines in a single function that is local to btrfs. I'll be happy
to take it out when a clear path to a replacement is in. I know people
have been doing work in this area for -rt, and do not want to start a
parallel effort to change things.

I'm not trying to jump into the design discussions because there are
people already working on it who know the issues much better than I do.

But, if anyone working on adaptive mutexes is looking for a coder,
tester, use case, or benchmark for their locking scheme, my hand is up.
Until then, this is my for loop, there are many like it, but this one is
mine.

-chris
On Sun, 2009-01-04 at 09:32 +0100, Gabor MICSKO wrote:
> Hi Chris,
>
> Does this means that disk format finalised or at least backward
> compatible?

We're making every effort to avoid new disk format changes now (that are
not backward compatible). Only a critical bug would result in a disk
format change now.

-chris
Nick Piggin
2009-Jan-05 14:39 UTC
generic pagecache to block mapping layer (was Re: Btrfs for mainline)
[trim ccs]

Feel free to ignore this diversion ;) I'd like to see btrfs go upstream
sooner rather than later. But eventually we'll have to resurrect the
fsblock vs extent map discussion.

On Tuesday 06 January 2009 00:21:43 Chris Mason wrote:
> On Mon, 2009-01-05 at 21:32 +1100, Nick Piggin wrote:
> > On Saturday 03 January 2009 06:38:07 Chris Mason wrote:
> > > The extent_map and extent_buffer code was also intended for generic
> > > use. It needs some love and care (making it work for blocksize !=
> > > pagesize) before I'd suggest moving it out of fs/btrfs.
> >
> > I'm yet to be convinced it is a good idea to use extents for this. Been a
> > long time since we visited the issue, but when you converted ext2 to use
> > the extent mapping stuff, it actually went slower, and complexity went up
> > a lot (IIRC possibly required allocations in the writeback path).
> >
> > So I think it is a fine idea to live in btrfs until it is more proven and
> > found useful elsewhere.
>
> It has gotten faster since then, but it makes sense to wait on moving
> extent_* code.

Faster than it was, or faster than buffer heads now?

fsblock is faster than buffer heads, robust WRT memory allocation,
supports smaller and larger blocks than pagecache, and does locking
solely on a per-page basis. I added a module that can cache block
mapping (but not pagecache state mapping, importantly) in extents for
filesystems that don't have a good in-memory data structure (although
this has a per-inode lock course).

I agree that using extents for this makes perfect sense, but I've just
never thought pagecache state extents are a good idea. I don't think
this will be too easy to beat with state extents. I haven't looked
closely at your implementation for quite a while, but last I did, I
couldn't imagine it being easy to make fast+scalable or rework it to
have good memory allocation behaviour.
On Monday 05 January 2009 05:41:03 Matthew Wilcox wrote:
> On Sun, Jan 04, 2009 at 07:21:50PM +0100, Peter Zijlstra wrote:
> > The -rt tree has adaptive spin patches for the rtmutex code, its really
> > not all that hard to do -- the rtmutex code is way more tricky than the
> > regular mutexes due to all the PI fluff.
> >
> > For kernel only locking the simple rule: spin iff the lock holder is
> > running proved to be simple enough. Any added heuristics like max spin
> > count etc. only made things worse. The whole idea though did make sense
> > and certainly improved performance.
>
> That implies moving
>
> 	struct thread_info *owner;
>
> out from under the CONFIG_DEBUG_MUTEXES code. One of the original
> justifications for mutexes was:
>
>  - 'struct mutex' is smaller on most architectures: e.g. on x86,
>    'struct semaphore' is 20 bytes, 'struct mutex' is 16 bytes.
>    A smaller structure size means less RAM footprint, and better
>    CPU-cache utilization.
>
> I'd be reluctant to reverse that decision just for btrfs.
>
> Benchmarking required! Maybe I can put a patch together that implements
> the simple 'spin if it's running' heuristic and throw it at our testing
> guys on Monday ...

Adaptive locks have traditionally (read: Linus says) indicated the
locking is suboptimal from a performance perspective and should be
reworked. This is definitely the case for the -rt patchset, because it
deliberately trades performance by changing even very-short-held
spinlocks to sleeping locks.

So I don't really know if -rt justifies adaptive locks in mainline/btrfs.
Is there no way for the short critical sections to be decoupled from the
long/sleeping ones?
On Mon, Jan 05, 2009 at 09:35:53AM -0500, Chris Mason wrote:
> On Sun, 2009-01-04 at 09:32 +0100, Gabor MICSKO wrote:
> > Hi Chris,
> >
> > Does this means that disk format finalised or at least backward
> > compatible?
>
> We're making every effort to avoid new disk format changes now (that are
> not backward compatible). Only a critical bug would result in a disk
> format change now.

Does this mean the new format is ready for a RAID5-like disk layout?

-- 
Tomasz Torcz                Only gods can safely risk perfection,
zdzichu@irc.-nie.spam-.pl   it's a dangerous thing for a man.  -- Alia
On Tue, Jan 06, 2009 at 01:47:23AM +1100, Nick Piggin wrote:
> adaptive locks have traditionally (read: Linus says) indicated the locking
> is suboptimal from a performance perspective and should be reworked. This
> is definitely the case for the -rt patchset, because they deliberately
> trade performance by change even very short held spinlocks to sleeping locks.
>
> So I don't really know if -rt justifies adaptive locks in mainline/btrfs.
> Is there no way for the short critical sections to be decoupled from the
> long/sleeping ones?

I wondered about that option too. Let's see if we have other users that
will benefit from adaptive locks -- my gut says that Linus is right, but
then there's a lot of lazy programmers out there using mutexes when they
should be using spinlocks.

I wonder about a new lockdep-style debugging option that adds a bit per
mutex class to determine whether the holder ever slept while holding it.
Then a periodic check to determine which mutexes were needlessly held
would find one style of bad lock management.

The comment in btrfs certainly indicates that locking redesign is a
potential solution:

 * locks the per buffer mutex in an extent buffer. This uses adaptive locks
 * and the spin is not tuned very extensively. The spinning does make a big
 * difference in almost every workload, but spinning for the right amount of
 * time needs some help.
 *
 * In general, we want to spin as long as the lock holder is doing btree
 * searches, and we should give up if they are in more expensive code.

btrfs almost wants its own hybrid locks (like lock_sock(), to choose a
new in-tree example). One where it will spin, unless a flag is set to
not spin, in which case it sleeps. Then the 'more expensive code' can
set the flag to not bother spinning.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
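Matthew's hybrid idea -- spin unless the holder has flagged that it is
entering expensive code -- might look roughly like this in user-space C.
Everything here (the type, function names, the sleep fallback) is
invented for the sketch; lock_sock() itself is implemented differently.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical hybrid lock: waiters spin while 'held' is set, unless the
 * holder has also set 'dont_spin' to announce it is about to block, in
 * which case waiters back off and sleep instead of burning CPU. */
struct hybrid_lock {
	_Atomic(bool) held;
	_Atomic(bool) dont_spin;
};

static void hybrid_lock_acquire(struct hybrid_lock *l)
{
	bool expected = false;
	while (!atomic_compare_exchange_weak(&l->held, &expected, true)) {
		if (atomic_load(&l->dont_spin)) {
			/* holder is in expensive code: sleep briefly */
			struct timespec ts = { 0, 100000 };  /* 100us */
			nanosleep(&ts, NULL);
		}
		/* else: keep spinning (a real version would cpu_relax()) */
		expected = false;
	}
}

/* The holder calls this before blocking work, e.g. a disk read. */
static void hybrid_lock_set_blocking(struct hybrid_lock *l)
{
	atomic_store(&l->dont_spin, true);
}

static void hybrid_lock_release(struct hybrid_lock *l)
{
	atomic_store(&l->dont_spin, false);
	atomic_store(&l->held, false);
}
```

This maps directly onto the btrfs comment above: spin while the holder is
doing cheap btree searches, give up once it announces expensive work.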
On Tue, 2009-01-06 at 01:47 +1100, Nick Piggin wrote:

[ adaptive locking in btrfs ]

> adaptive locks have traditionally (read: Linus says) indicated the locking
> is suboptimal from a performance perspective and should be reworked. This
> is definitely the case for the -rt patchset, because they deliberately
> trade performance by change even very short held spinlocks to sleeping locks.
>
> So I don't really know if -rt justifies adaptive locks in mainline/btrfs.
> Is there no way for the short critical sections to be decoupled from the
> long/sleeping ones?

Yes and no. The locks are used here to control access to the btree
leaves and nodes. Some of these are very hot and tend to stay in cache
all the time, while others have to be read from the disk.

As the btree search walks down the tree, access to the hot nodes is best
controlled by a spinlock. Some operations (like a balance) will need to
read other blocks from the disk and keep the node/leaf locked. So it
also needs to be able to sleep.

I try to drop the locks where it makes sense before sleeping operations,
but in some corner cases it isn't practical.

For leaves, once the code has found the item in the btree it was looking
for, it wants to go off and do something useful (insert an inode etc
etc). Those operations also tend to block, and the lock needs to be held
to keep the tree block from changing.

All of this is a long way of saying the btrfs locking scheme is far from
perfect. I'll look harder at the loop and ways to get rid of it.

-chris
On Mon, 2009-01-05 at 16:01 +0100, Tomasz Torcz wrote:
> On Mon, Jan 05, 2009 at 09:35:53AM -0500, Chris Mason wrote:
> > On Sun, 2009-01-04 at 09:32 +0100, Gabor MICSKO wrote:
> > > Hi Chris,
> > >
> > > Does this means that disk format finalised or at least backward
> > > compatible?
> >
> > We're making every effort to avoid new disk format changes now (that are
> > not backward compatible). Only a critical bug would result in a disk
> > format change now.
>
> Is this means that new format is ready for RAID5-like disk layout?

That will be done through forward compat bits. So you won't be able to
use today's code to mount a RAID5 FS, but the RAID5 FS will understand
today's FS.

-chris
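The forward-compat scheme Chris describes can be sketched abstractly: the
superblock carries an "incompat" bitmask, and a mount refuses if it sees
any bit it does not understand. The names and flags below are invented
for the sketch (btrfs's real flags live in its on-disk superblock); the
mechanism mirrors the ext4-style feature-bit idea.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical incompat feature bits. */
#define FEATURE_INCOMPAT_RAID5  (1ULL << 0)  /* e.g. a future RAID5 layout */

/* Everything today's driver knows how to handle: nothing extra. */
#define FEATURES_SUPPORTED_NOW  0ULL

struct fake_super {
	uint64_t incompat_flags;
};

/* Old code mounting a new filesystem: refuse if the superblock carries
 * any incompat bit we do not know about.  New code mounting an old
 * filesystem sees no unknown bits and succeeds -- exactly the asymmetry
 * Chris describes above. */
static bool can_mount(const struct fake_super *sb, uint64_t supported)
{
	return (sb->incompat_flags & ~supported) == 0;
}
```

So a RAID5-aware kernel still mounts today's filesystems, while today's
kernel cleanly rejects a filesystem with the RAID5 bit set instead of
corrupting it.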
On Mon, Jan 05, 2009 at 08:18:58AM -0500, Chris Mason wrote:
> On Mon, 2009-01-05 at 21:07 +1100, Chris Samuel wrote:
> > On Sat, 3 Jan 2009 8:01:04 am Andi Kleen wrote:
> >
> > > When it's in mainline I suspect people will start using it for that.
> >
> > Some people don't even wait for that. ;-)
> >
> > Seriously though, if that is a concern can I suggest taking the btrfsdev route
> > and, if you want a real belt and braces approach, perhaps require it to have a
> > mandatory mount option specified to successfully mount, maybe "eat_my_data" ?
>
> I think ext4dev made more sense for ext4 because people generally expect
> ext* to be stable. Btrfs doesn't quite have the reputation for
> stability yet, so I don't feel we need a special -dev name for it.

Old kernel versions may still get booted after btrfs has gotten a
reputation for stability. E.g. if I move my / to btrfs in 2.6.34, then
one day need to boot back to 2.6.30 to track down some regression, the
reminder that I'm moving back to some sort of btrfs dark-ages might be
welcome.

(Not that I have particularly strong feelings about this.)

--b.

> But, if Andrew/Linus prefer that unstable filesystems are tagged with
> -dev, I'm happy to do it.
>
> -chris
Chris Mason
2009-Jan-05 16:37 UTC
Re: generic pagecache to block mapping layer (was Re: Btrfs for mainline)
On Tue, 2009-01-06 at 01:39 +1100, Nick Piggin wrote:
> [trim ccs]
>
> Feel free to ignore this diversion ;) I'd like to see btrfs go upstream
> sooner rather than later. But eventually we'll have to resurrect fsblock
> vs extent map discussion.

There's extent_map, extent_state and extent_buffer. I'd expect the
mapping code to beat fsblock, since it more closely models the
conditions of the disk format. This is a very thin layer of code to
figure out which file offset goes to which block on disk.

extent_state is a different beast, since it is trying to track state
across extents. It is entirely possible that we're better off keeping
the state in the pages, aside from the part where we're running out of
bits.

extent_buffers are an api to access/modify the contents of ranges of
bytes, supporting larger and smaller blocksizes than the page. I'd be
really interested in comparing this to fsblock, but I need to first fix
it to actually support larger and smaller blocksizes than the page ;)

So, long term we can have a benchmarking contest, but I've got a little
ways to go before that is a good idea.

-chris
On Mon, 05 Jan 2009 09:14:56 -0500, Chris Mason <chris.mason@oracle.com> wrote:
> On Sat, 2009-01-03 at 18:44 +0900, Ryusuke Konishi wrote:
> > On Fri, 02 Jan 2009 14:38:07 -0500, Chris Mason
> >
> > Btrfs seems to have other helpful code including pages/bio compression
> > which may become separable, too. And, this may be the same for
> > pages/bio encryption/decryption code which would come next. (I don't
> > mention about the volume management/raid feature here to avoid getting
> > off the subject, but it's likewise).
>
> The compression code is somewhat tied to the btrfs internals, but it
> could be pulled out without too much trouble. The big question there is
> if other filesystems are interested in transparent compression support.

It was so attractive to me ;) though I don't know if it's applicable.
Anyway, making this common can be left as an exercise for whoever tries
to reuse it, and I won't press the point in order not to get off the
subject. I just wanted to note the presence of other candidates.

> But, at the end of the day, most of the work is still done by the zlib
> code. The btrfs bits just organize pages to send down to zlib.
>
> -chris

Yes, but that's the interesting point; it provides ways to apply
compression through an array of pages, or a bio - I like it.

Ryusuke
Nick Piggin
2009-Jan-05 17:10 UTC
Re: generic pagecache to block mapping layer (was Re: Btrfs for mainline)
On Tuesday 06 January 2009 03:37:33 Chris Mason wrote:
> On Tue, 2009-01-06 at 01:39 +1100, Nick Piggin wrote:
> > [trim ccs]
> >
> > Feel free to ignore this diversion ;) I'd like to see btrfs go upstream
> > sooner rather than later. But eventually we'll have to resurrect fsblock
> > vs extent map discussion.
>
> There's extent_map, extent_state and extent_buffer. I'd expect the
> mapping code to beat fsblock, since it more closely models the
> conditions of the disk format. This is a very thin layer of code to
> figure out which file offset goes to which block on disk.

It looks somewhat similar to the optional extent mapping layer I added
in front of fsblock (which works very nicely for ext2, but may not be
bloated^W enough for btrfs :P).

#define FE_mapped	0x1
#define FE_hole		0x2
#define FE_new		0x4

struct fsb_extent {
	struct rb_node rb_node;
	sector_t offset;
	sector_t block;
	unsigned int size;
	unsigned int flags;
};

But I have a feeling that it might be better to not have such a layer
and go direct to the filesystem in cases where they have good in-memory
data structures for mapping themselves. btrfs for example has some
non-generic looking data in its mapping. But... we'll see. If we can
distill the common goodness from different places and make it more
usable, it would definitely be a good idea.

> extent_state is a different beast, since it is trying to track state
> across extents. It is entirely possible that we're better off keeping
> the state in the pages, aside from the part where we're running out of
> bits.

OK, I haven't really understood how that works.

> extent_buffers are an api to access/modify the contents of ranges of
> bytes, supporting larger and smaller blocksizes than the page. I'd be
> really interested in comparing this to fsblock, but I need to first fix
> it to actually support larger and smaller blocksizes than the page ;)

Yes, this area is where we have a difference of opinion I think ;)

> So, long term we can have a benchmarking contest, but I've got a little
> ways to go before that is a good idea.

That would be good.

Thanks,
Nick
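To make the extent-mapping idea concrete, here is a hypothetical sketch
of the core lookup such a layer performs: translate a file block to a
disk block through a list of extents. A flat sorted array stands in for
the rb-tree in Nick's fsb_extent, and all names here are invented for
the sketch.

```c
#include <stddef.h>

typedef unsigned long long sector_t;

/* Simplified extent: file blocks [offset, offset+size) map to disk
 * blocks [block, block+size).  fsblock keeps these in an rb-tree keyed
 * by offset; a linear scan over a sorted array is enough to show the
 * translation. */
struct extent {
	sector_t offset;    /* starting file block */
	sector_t block;     /* starting disk block */
	unsigned int size;  /* length in blocks    */
};

/* Map a file block to a disk block, or (sector_t)-1 for a hole.
 * This is the "very thin layer of code to figure out which file offset
 * goes to which block on disk" that Chris describes. */
static sector_t map_block(const struct extent *map, size_t n,
			  sector_t file_block)
{
	for (size_t i = 0; i < n; i++) {
		const struct extent *e = &map[i];
		if (file_block >= e->offset &&
		    file_block < e->offset + e->size)
			return e->block + (file_block - e->offset);
	}
	return (sector_t)-1;  /* hole: no extent covers this block */
}
```

The appeal of extents over per-block buffer heads is visible even here:
one small record covers an arbitrarily long contiguous run of blocks.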
Subject: mutex: adaptive spin
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Jan 06 12:32:12 CET 2009

Based on the code in -rt, provide adaptive spins on generic mutexes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h |    4 ++--
 include/linux/sched.h |    1 +
 kernel/mutex-debug.c  |   11 ++---------
 kernel/mutex-debug.h  |    8 --------
 kernel/mutex.c        |   46 +++++++++++++++++++++++++++++++++++++++-------
 kernel/mutex.h        |    2 --
 kernel/sched.c        |    5 +++++
 7 files changed, 49 insertions(+), 28 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
+	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-	struct thread_info	*owner;
 	const char		*name;
 	void			*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
-#ifdef CONFIG_DEBUG_MUTEXES
 	struct mutex		*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
 #endif
 };

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -249,6 +249,7 @@ extern void init_idle(struct task_struct
 extern void init_idle_bootup_task(struct task_struct *idle);

 extern int runqueue_is_locked(void);
+extern int task_is_current(struct task_struct *p);
 extern void task_rq_unlock_wait(struct task_struct *p);

 extern cpumask_var_t nohz_cpu_mask;

Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-	lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 	/* Mark the current thread as blocked on the lock: */
 	ti->task->blocked_on = waiter;
-	waiter->lock = lock;
 }

 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,9 @@ void debug_mutex_unlock(struct mutex *lo
 		return;

 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(lock->owner != current);
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(lock->owner != current);
 }

 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +89,6 @@ void debug_mutex_init(struct mutex *lock
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
 	lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-	lock->owner = NULL;
 	lock->magic = lock;
 }

Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-	lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,

Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
+	lock->owner = NULL;

 	debug_mutex_init(lock, name, key);
 }
@@ -120,6 +121,28 @@ void __sched mutex_unlock(struct mutex *
 EXPORT_SYMBOL(mutex_unlock);

+#ifdef CONFIG_SMP
+static int adaptive_wait(struct mutex_waiter *waiter,
+			 struct task_struct *owner, long state)
+{
+	for (;;) {
+		if (signal_pending_state(state, waiter->task))
+			return 0;
+		if (waiter->lock->owner != owner)
+			return 0;
+		if (!task_is_current(owner))
+			return 1;
+		cpu_relax();
+	}
+}
+#else
+static int adaptive_wait(struct mutex_waiter *waiter,
+			 struct task_struct *owner, long state)
+{
+	return 1;
+}
+#endif
+
 /*
  * Lock a mutex (possibly interruptible), slowpath:
  */
@@ -127,7 +150,7 @@ static inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		    unsigned long ip)
 {
-	struct task_struct *task = current;
+	struct task_struct *owner, *task = current;
 	struct mutex_waiter waiter;
 	unsigned int old_val;
 	unsigned long flags;
@@ -141,6 +164,7 @@ __mutex_lock_common(struct mutex *lock,
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
+	waiter.lock = lock;

 	old_val = atomic_xchg(&lock->count, -1);
 	if (old_val == 1)
@@ -175,11 +199,19 @@ __mutex_lock_common(struct mutex *lock,
 			debug_mutex_free_waiter(&waiter);
 			return -EINTR;
 		}
-		__set_task_state(task, state);

-		/* didnt get the lock, go to sleep: */
+		owner = lock->owner;
+		get_task_struct(owner);
 		spin_unlock_mutex(&lock->wait_lock, flags);
-		schedule();
+
+		if (adaptive_wait(&waiter, owner, state)) {
+			put_task_struct(owner);
+			__set_task_state(task, state);
+			/* didnt get the lock, go to sleep: */
+			schedule();
+		} else
+			put_task_struct(owner);
+
 		spin_lock_mutex(&lock->wait_lock, flags);
 	}

@@ -187,7 +219,7 @@ done:
 	lock_acquired(&lock->dep_map, ip);
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
-	debug_mutex_set_owner(lock, task_thread_info(task));
+	lock->owner = task;

 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
@@ -260,7 +292,7 @@ __mutex_unlock_common_slowpath(atomic_t
 		wake_up_process(waiter->task);
 	}

-	debug_mutex_clear_owner(lock);
+	lock->owner = NULL;

 	spin_unlock_mutex(&lock->wait_lock, flags);
 }
@@ -352,7 +384,7 @@ static inline int __mutex_trylock_slowpa
 	prev = atomic_xchg(&lock->count, -1);
 	if (likely(prev == 1)) {
-		debug_mutex_set_owner(lock, current_thread_info());
+		lock->owner = current;
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 	}
 	/* Set it back to 0 if there are no waiters: */

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -697,6 +697,11 @@ int runqueue_is_locked(void)
 	return ret;
 }

+int task_is_current(struct task_struct *p)
+{
+	return task_rq(p)->curr == p;
+}
+
 /*
  * Debugging: various feature bits
  */

Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -16,8 +16,6 @@
 #define mutex_remove_waiter(lock, waiter, ti) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)

-#define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
-#define debug_mutex_clear_owner(lock)			do { } while (0)
 #define debug_mutex_wake_waiter(lock, waiter)
do { } while (0) #define debug_mutex_free_waiter(waiter) do { } while (0) #define debug_mutex_add_waiter(lock, waiter, ti) do { } while (0) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Peter Zijlstra <peterz@infradead.org> wrote:

> +++ linux-2.6/kernel/mutex.c
> @@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
>  	atomic_set(&lock->count, 1);
>  	spin_lock_init(&lock->wait_lock);
>  	INIT_LIST_HEAD(&lock->wait_list);
> +	lock->owner = NULL;
> 
>  	debug_mutex_init(lock, name, key);
>  }
> @@ -120,6 +121,28 @@ void __sched mutex_unlock(struct mutex *
> 
>  EXPORT_SYMBOL(mutex_unlock);
> 
> +#ifdef CONFIG_SMP
> +static int adaptive_wait(struct mutex_waiter *waiter,
> +			 struct task_struct *owner, long state)
> +{
> +	for (;;) {
> +		if (signal_pending_state(state, waiter->task))
> +			return 0;
> +		if (waiter->lock->owner != owner)
> +			return 0;
> +		if (!task_is_current(owner))
> +			return 1;
> +		cpu_relax();
> +	}
> +}
> +#else

Linus, what do you think about this particular approach of spin-mutexes? 
It's not the typical spin-mutex, i think.

The thing i like most about Peter's patch (compared to most other adaptive 
spinning approaches i've seen, which all sucked as they included various 
ugly heuristics complicating the whole thing) is that it solves the "how 
long should we spin" question elegantly: we spin until the owner runs on a 
CPU.

So on shortly held locks we degenerate to spinlock behavior, and only on 
long-held blocking locks [with little CPU time spent while holding the 
lock - say we wait for IO] do we degenerate to classic mutex behavior.

There's no time or spin-rate based heuristics in this at all (i.e. these 
mutexes are not 'adaptive' at all!), and it degenerates to our primary and 
well-known locking behavior in the important boundary situations.

A couple of other properties i like about it:

 - A spinlock user can be changed to a mutex with no runtime impact. (no 
   increase in scheduling) This might enable us to convert/standardize 
   some of the uglier locking constructs within ext2/3/4?

 - This mutex modification would probably be a win for workloads where 
   mutexes are held briefly - we'd never schedule.

 - If the owner is preempted, we fall back to proper blocking behavior. 
   This might reduce the cost of preemptive kernels in general.

The flip side:

 - The slight increase in the hotpath - we now maintain the 'owner' 
   field. That's cached in a register on most platforms anyway so it's 
   not too big a deal - if the general win justifies it.

   ( This reminds me: why not flip over all the task_struct uses in 
     mutex.c to thread_info? thread_info is faster to access [on x86] 
     than current. )

 - The extra mutex->owner pointer data overhead.

 - It could possibly increase spinning overhead (and waste CPU time) on 
   workloads where locks are held and contended for. OTOH, such cases are 
   probably a prime target for improvements anyway. It would probably be 
   near-zero-impact for workloads where mutexes are held for a very long 
   time and where most of the time is spent blocking.

It's hard to tell how it would impact inbetween workloads - i guess it 
needs to be measured on a couple of workloads.

	Ingo
On Tue, 2009-01-06 at 13:10 +0100, Ingo Molnar wrote:

> The thing i like most about Peter's patch (compared to most other adaptive 
> spinning approaches i've seen, which all sucked as they included various 
> ugly heuristics complicating the whole thing) is that it solves the "how 
> long should we spin" question elegantly: we spin until the owner runs on a 
> CPU.

s/until/as long as/

> So on shortly held locks we degenerate to spinlock behavior, and only 
> long-held blocking locks [with little CPU time spent while holding the 
> lock - say we wait for IO] we degenerate to classic mutex behavior.
> 
> There's no time or spin-rate based heuristics in this at all (i.e. these 
> mutexes are not 'adaptive' at all!), and it degenerates to our primary and 
> well-known locking behavior in the important boundary situations.

Well, it adapts to the situation, choosing between spinning vs blocking. 
But what's in a name ;-)

> A couple of other properties i like about it:
> 
>  - A spinlock user can be changed to a mutex with no runtime impact. (no 
>    increase in scheduling) This might enable us to convert/standardize 
>    some of the uglier locking constructs within ext2/3/4?

I think a lot of stuff there is bit (spin) locks.

> It's hard to tell how it would impact inbetween workloads - i guess it 
> needs to be measured on a couple of workloads.

Matthew volunteered to run something IIRC.
Ingo Molnar wrote:
> There's no time or spin-rate based heuristics in this at all (i.e. these 
> mutexes are not 'adaptive' at all!),

FYI: The original "adaptive" name was chosen in the -rt implementation to 
reflect that the locks can adaptively spin or sleep, depending on 
conditions. I realize this is in contrast to the typical usage of the 
term, where it refers to the spin-time being based on some empirical 
heuristics, etc. as you mentioned. Sorry for the confusion.

Regards,
-Greg
* Gregory Haskins <ghaskins@novell.com> wrote:

> Ingo Molnar wrote:
> > There's no time or spin-rate based heuristics in this at all (i.e. these 
> > mutexes are not 'adaptive' at all!),
> 
> FYI: The original "adaptive" name was chosen in the -rt implementation 
> to reflect that the locks can adaptively spin or sleep, depending on 
> conditions. I realize this is in contrast to the typical usage of the 
> term when it is in reference to the spin-time being based on some 
> empirical heuristics, etc as you mentioned. Sorry for the confusion.

the current -rt spinny-mutexes bits were mostly written by Steve, right? 
Historically it all started out with a more classic "adaptive mutexes" 
patchset, so the name stuck i guess.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Tue, 2009-01-06 at 14:16 +0100, Ingo Molnar wrote:

> * Gregory Haskins <ghaskins@novell.com> wrote:
> 
> > FYI: The original "adaptive" name was chosen in the -rt implementation 
> > to reflect that the locks can adaptively spin or sleep, depending on 
> > conditions.
> 
> the current version of the -rt spinny-mutexes bits were mostly written by 
> Steve, right? Historically it all started out with a more classic 
> "adaptive mutexes" patchset so the name stuck i guess.

Yeah, Gregory and co. started the whole thing and showed there was 
significant performance to be gained; after that Steve rewrote it from 
scratch, reducing it to this minimalist heuristic, with help from Greg.

(At least, that is how I remember it, please speak up if I got things 
wrong)
On Tue, Jan 06, 2009 at 12:40:31PM +0100, Peter Zijlstra wrote:
> Subject: mutex: adaptive spin
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Tue Jan 06 12:32:12 CET 2009
> 
> Based on the code in -rt, provide adaptive spins on generic mutexes.

I guess it would be nice to add another type, so you can test/convert 
callsites individually.

I've got no objections to improving synchronisation primitives, but I 
would be interested to see good results from some mutex that can't be 
achieved by improving the locking (by improving I don't mean inventing 
some crazy lockless algorithm, but simply making it reasonably sane and 
scalable).

Good area to investigate though, I think.

> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

[ rest of the quoted patch snipped ]
Peter Zijlstra wrote:
> On Tue, 2009-01-06 at 14:16 +0100, Ingo Molnar wrote:
>> the current version of the -rt spinny-mutexes bits were mostly written 
>> by Steve, right? Historically it all started out with a more classic 
>> "adaptive mutexes" patchset so the name stuck i guess.
>
> Yeah, Gregory and co. started with the whole thing and showed there was 
> significant performance to be gained, after that Steve rewrote it from 
> scratch reducing it to this minimalist heuristic, with help from Greg.
>
> (At least, that is how I remember it, please speak up if I got things 
> wrong)

That's pretty accurate, IIUC. The concept and original patches were 
written by myself, Peter Morreale and Sven Dietrich (cc'd). However, 
Steve cleaned up our patch before accepting it into -rt (we had extra 
provisions for things like handling conditional compilation and run-time 
disablement which he did not care for), but it's otherwise the precise 
core concept we introduced. I think Steven gave a nice attribution to 
that fact in the prologue, and I also ACKed his cleanup, so I think all 
is well from my perspective.

As a historical note: Steven also introduced a really brilliant 
optimization to use RCU for the owner tracking that the original patch as 
submitted by my team did not have. However, it turned out to 
inadvertently regress performance due to the way preempt-rcu's 
rcu_read_lock() works, so it had to be reverted a few weeks later to the 
original logic that we submitted (even though on paper his ideas in that 
area were superior to ours). So what's in -rt now really is more or less 
our patch, sans the conditional crap, etc.

Hope that helps!

-Greg
On Tue, Jan 06, 2009 at 01:21:41PM +0100, Peter Zijlstra wrote:
> > It's hard to tell how it would impact inbetween workloads - i guess it 
> > needs to be measured on a couple of workloads.
> 
> Matthew volunteered to run something IIRC.

I've sent it off to two different teams at Intel for benchmarking. One is 
a database load, and one is a mixed load.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
On Tue, 6 Jan 2009, Ingo Molnar wrote:
> 
> The thing i like most about Peter's patch (compared to most other adaptive 
> spinning approaches i've seen, which all sucked as they included various 
> ugly heuristics complicating the whole thing) is that it solves the "how 
> long should we spin" question elegantly: we spin until the owner runs on a 
> CPU.

The other way around, you mean: we spin until the owner is no longer 
holding a cpu.

I agree that it's better than the normal "spin for some random time" 
model, but I can't say I like the "return 0" cases where it just retries 
the whole loop if the semaphore was gotten by somebody else instead. 
Sounds like an easyish live-lock to me.

I also still strongly suspect that whatever lock actually needs this 
should be seriously re-thought.

But apart from the "return 0" craziness I at least don't _hate_ this 
patch. Do we have numbers? Do we know which locks this matters on?

		Linus
On Tue, 6 Jan 2009, Linus Torvalds wrote:
> 
> The other way around, you mean: we spin until the owner is no longer 
> holding a cpu.

Btw, I hate the name of the helper function for that. 
"task_is_current()"? "current" means something totally different in the 
linux kernel: it means that the task is _this_ task.

So the only sensible implementation of "task_is_current(task)" is to just 
make it return "task == current", but that's obviously not what the 
function wants to do.

So it should be renamed. Something like "task_is_oncpu()" or whatever.

I realize that the scheduler internally has that whole "rq->curr" thing, 
but that's an internal scheduler thing, and should not be confused with 
the overall kernel model of "current".

		Linus
On Tue, 2009-01-06 at 07:55 -0800, Linus Torvalds wrote:
> I also still strongly suspect that whatever lock actually needs this 
> should be seriously re-thought.
> 
> But apart from the "return 0" craziness I at least don't _hate_ this 
> patch. Do we have numbers? Do we know which locks this matters on?

This discussion was kicked off by an unconditional spin (512 tries) 
against mutex_trylock in the btrfs tree locking code. Btrfs is using 
mutexes to protect the btree blocks, and btree searching often hits hot 
nodes that are always in cache. For these nodes the spinning is much 
faster, but btrfs also needs to be able to sleep with the locks held so 
it can read from the disk and do other complex operations.

For btrfs, dbench 50 performance doubles with the unconditional spin, 
mostly because that workload is almost all in ram. For 50 procs creating 
4k files in parallel, the spin is 30-50% faster. This workload is a 
mixture of disk bound and CPU bound.

Yes, there is definitely some low hanging fruit to tune the btrfs btree 
searches and locking. But I think the adaptive model is a good fit for 
on-disk btrees in general.
-chris
On Tue, 2009-01-06 at 12:40 +0100, Peter Zijlstra wrote:
> Subject: mutex: adaptive spin
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Tue Jan 06 12:32:12 CET 2009
> 
> Based on the code in -rt, provide adaptive spins on generic mutexes.

I already sent Peter details, but the patch has problems without mutex 
debugging turned on. So it isn't quite ready for benchmarking yet.

-chris
On Tue, 6 Jan 2009, Linus Torvalds wrote:
> Btw, I hate the name of the helper function for that. 
> "task_is_current()"? "current" means something totally different in the 
> linux kernel: it means that the task is _this_ task.
> 
> So the only sensible implementation of "task_is_current(task)" is to just 
> make it return "task == current", but that's obviously not what the 
> function wants to do.
> 
> So it should be renamed. Something like "task_is_oncpu()" or whatever.
> 
> I realize that the scheduler internally has that whole "rq->curr" thing, 
> but that's an internal scheduler thing, and should not be confused with 
> the overall kernel model of "current".

I totally agree that we should change the name of that. I never thought 
about the confusion it would cause for someone not working so heavily on 
the scheduler. Thanks for the perspective.

Sometimes us scheduler guys have our heads so far up the scheduler's ass 
that all we see is scheduler crap ;-)

-- Steve
On Tue, 6 Jan 2009, Linus Torvalds wrote:
> 
> So it should be renamed. Something like "task_is_oncpu()" or whatever.

Another complaint, which is tangentially related in that it actually 
concerns "current".

Right now, if some process deadlocks on a mutex, we get a hung process, 
but with a nice backtrace and hopefully other things (that don't need 
that lock) still continue to work.

But if I read it correctly, the adaptive spin code will instead just 
hang. Exactly because "task_is_current()" will also trigger for that 
case, and now you get an infinite loop, with the process spinning until 
it loses its own CPU, which obviously will never happen.

Yes, this is the behavior we get with spinlocks too, and yes, lock 
debugging will talk about it, but it's a regression. We've historically 
had a _lot_ more bad deadlocks on mutexes than on spinlocks, exactly 
because mutexes can be held over much more complex code. So regressing on 
it and making it less debuggable is bad.

IOW, if we do this, then I think we need a

	BUG_ON(task == owner);

in the waiting slow-path. I realize the test already exists for the DEBUG 
case, but I think we want it even for production kernels, especially 
since we'd only ever need it in the slow-path.

		Linus
On Tue, 6 Jan 2009, Linus Torvalds wrote:
> 
> Right now, if some process deadlocks on a mutex, we get a hung process, 
> but with a nice backtrace and hopefully other things (that don't need 
> that lock) still continue to work.

Clarification: the "nice backtrace" we only get with something like 
sysrq-W, of course. We don't get a backtrace _automatically_, but with an 
otherwise live machine, there's a better chance that people do get wchan 
or other info.

IOW, it's at least a fairly debuggable situation.

		Linus
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> IOW, if we do this, then I think we need a
> 
> 	BUG_ON(task == owner);
> 
> in the waiting slow-path. I realize the test already exists for the 
> DEBUG case, but I think we want it even for production kernels, 
> especially since we'd only ever need it in the slow-path.

yeah, sounds good. One thought:

BUG_ON()'s do_exit() shows a slightly misleading failure pattern to 
users: instead of a 'hanging' task, we'd get a misbehaving app due to one 
of its tasks exiting spuriously. It can even go completely unnoticed 
[users don't look at kernel logs normally] - while a hanging task 
generally does get noticed, because there's no progress in processing.

So instead of the BUG_ON() we could emit a WARN_ONCE() perhaps, plus not 
do any spinning and just block - resulting in an uninterruptible task 
(that the user will probably notice) and a scary message in the syslog? 
[all in the slowpath]

So in this case WARN_ONCE() is both more passive (it does not run 
do_exit()) and shows the more intuitive failure pattern to users.

No strong feelings though.

	Ingo
* Ingo Molnar <mingo@elte.hu> wrote:

> One thought:
>
> BUG_ON()'s do_exit() shows a slightly misleading failure pattern to
> users: instead of a 'hanging' task, we'd get a misbehaving app due to
> one of its tasks exiting spuriously. It can even go completely unnoticed
> [users don't look at kernel logs normally] - while a hanging task
> generally does get noticed. (because there's no progress in processing)
>
> So instead of the BUG_ON() we could emit a WARN_ONCE() perhaps, plus not
> do any spinning and just block - resulting in an uninterruptible task
> (that the user will probably notice) and a scary message in the syslog?
> [all in the slowpath]

And we'd strictly do an uninterruptible sleep here, unconditionally: even
if this is within mutex_lock_interruptible() - we don't want a Ctrl-C or
a SIGKILL to allow to 'break out' the app from the deadlock.

	Ingo
On Tue, 6 Jan 2009, Ingo Molnar wrote:
>
> So instead of the BUG_ON() we could emit a WARN_ONCE() perhaps, plus not
> do any spinning and just block - resulting in an uninterruptible task
> (that the user will probably notice) and a scary message in the syslog?
> [all in the slowpath]

Sure. You could put it in the adaptive function thing, with something like

	if (WARN_ONCE(waiter == owner))
		return 1;

which should fall back on the old behavior and do the one-time warning.

		Linus
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 6 Jan 2009, Linus Torvalds wrote:
> >
> > Right now, if some process deadlocks on a mutex, we get hung process,
> > but with a nice backtrace and hopefully other things (that don't need
> > that lock) still continue to work.
>
> Clarification: the "nice backtrace" we only get with something like
> sysrq-W, of course. We don't get a backtrace _automatically_, but with
> an otherwise live machine, there's a better chance that people do get
> wchan or other info. IOW, it's at least a fairly debuggable situation.

btw., the softlockup watchdog detects non-progressing uninterruptible
tasks (regardless of whether they locked up due to mutexes or any other
reason). This does occasionally help in debugging deadlocks:

   http://marc.info/?l=linux-mm&m=122889587725061&w=2

but it would indeed be also good to have the most common self-deadlock
case checked unconditionally in the mutex slowpath.

	Ingo
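[Editor's sketch] The warn-once-then-block behavior being discussed can be modeled in
userspace. This is an illustrative stand-in, not the kernel code: the names
(xmutex, self_deadlock_check, deadlock_warned) are made up, pthread_self()
stands in for current, and the caller is assumed to block uninterruptibly
when the check fires:

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical userspace model of the self-deadlock check: if the
 * acquiring thread is already the recorded owner, warn once and report
 * that the caller should simply block (never spin), since the "owner"
 * can make no progress toward releasing the lock. */
struct xmutex {
	pthread_t owner;	/* thread that last took the lock */
	int owned;		/* is 'owner' valid? */
};

static int deadlock_warned;	/* models the WARN_ONCE() latch */

/* Return 1 when the caller is about to deadlock on its own lock. */
static int self_deadlock_check(struct xmutex *lock)
{
	if (lock->owned && pthread_equal(lock->owner, pthread_self())) {
		if (!deadlock_warned) {	/* warn only on first occurrence */
			deadlock_warned = 1;
			fprintf(stderr, "mutex: recursive lock attempt\n");
		}
		return 1;	/* caller should block, uninterruptibly */
	}
	return 0;		/* normal contention: spin or sleep as usual */
}
```

The point of the WARN_ONCE-style latch is that the syslog gets exactly one
scary message while the task itself stays visibly hung, which is the failure
pattern Ingo argues users actually notice.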
On Tue, 06 Jan 2009 12:40:31 +0100 Peter Zijlstra <peterz@infradead.org> wrote:

> Subject: mutex: adaptive spin
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Tue Jan 06 12:32:12 CET 2009
>
> Based on the code in -rt, provide adaptive spins on generic mutexes.

How dumb is it to send a lump of uncommented, unchangelogged code as an
rfc, only to have Ingo reply, providing the changelog for you?

Sigh.

> --- linux-2.6.orig/kernel/mutex.c
> +++ linux-2.6/kernel/mutex.c
> @@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
> 	atomic_set(&lock->count, 1);
> 	spin_lock_init(&lock->wait_lock);
> 	INIT_LIST_HEAD(&lock->wait_list);
> +	lock->owner = NULL;
>
> 	debug_mutex_init(lock, name, key);
> }
> @@ -120,6 +121,28 @@ void __sched mutex_unlock(struct mutex *
>
> EXPORT_SYMBOL(mutex_unlock);
>
> +#ifdef CONFIG_SMP
> +static int adaptive_wait(struct mutex_waiter *waiter,
> +			 struct task_struct *owner, long state)
> +{
> +	for (;;) {
> +		if (signal_pending_state(state, waiter->task))
> +			return 0;
> +		if (waiter->lock->owner != owner)
> +			return 0;
> +		if (!task_is_current(owner))
> +			return 1;
> +		cpu_relax();
> +	}
> +}

Each of the tests in this function should be carefully commented. It's
really the core piece of the design.

> +#else
> +static int adaptive_wait(struct mutex_waiter *waiter,
> +			 struct task_struct *owner, long state)
> +{
> +	return 1;
> +}
> +#endif
> +
> /*
>  * Lock a mutex (possibly interruptible), slowpath:
>  */
> @@ -127,7 +150,7 @@ static inline int __sched
> __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
> 		unsigned long ip)
> {
> -	struct task_struct *task = current;
> +	struct task_struct *owner, *task = current;
> 	struct mutex_waiter waiter;
> 	unsigned int old_val;
> 	unsigned long flags;
> @@ -141,6 +164,7 @@ __mutex_lock_common(struct mutex *lock,
> 	/* add waiting tasks to the end of the waitqueue (FIFO): */
> 	list_add_tail(&waiter.list, &lock->wait_list);
> 	waiter.task = task;
> +	waiter.lock = lock;
>
> 	old_val = atomic_xchg(&lock->count, -1);
> 	if (old_val == 1)
> @@ -175,11 +199,19 @@ __mutex_lock_common(struct mutex *lock,
> 			debug_mutex_free_waiter(&waiter);
> 			return -EINTR;
> 		}
> -		__set_task_state(task, state);
>
> -		/* didnt get the lock, go to sleep: */
> +		owner = lock->owner;

What prevents *owner from exiting right here?

> +		get_task_struct(owner);
> 		spin_unlock_mutex(&lock->wait_lock, flags);
> -		schedule();
> +
> +		if (adaptive_wait(&waiter, owner, state)) {
> +			put_task_struct(owner);
> +			__set_task_state(task, state);
> +			/* didnt get the lock, go to sleep: */
> +			schedule();
> +		} else
> +			put_task_struct(owner);
> +
> 		spin_lock_mutex(&lock->wait_lock, flags);
> 	}
>
> @@ -187,7 +219,7 @@ done:
> 	lock_acquired(&lock->dep_map, ip);
> 	/* got the lock - rejoice! */
> 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
> -	debug_mutex_set_owner(lock, task_thread_info(task));
> +	lock->owner = task;
>
> 	/* set it to 0 if there are no waiters left: */
> 	if (likely(list_empty(&lock->wait_list)))
> @@ -260,7 +292,7 @@ __mutex_unlock_common_slowpath(atomic_t
> 		wake_up_process(waiter->task);
> 	}
>
> -	debug_mutex_clear_owner(lock);
> +	lock->owner = NULL;
>
> 	spin_unlock_mutex(&lock->wait_lock, flags);
> }
> @@ -352,7 +384,7 @@ static inline int __mutex_trylock_slowpa
>
> 	prev = atomic_xchg(&lock->count, -1);
> 	if (likely(prev == 1)) {
> -		debug_mutex_set_owner(lock, current_thread_info());
> +		lock->owner = current;
> 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
> 	}
> 	/* Set it back to 0 if there are no waiters: */
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -697,6 +697,11 @@ int runqueue_is_locked(void)
> 	return ret;
> }
>
> +int task_is_current(struct task_struct *p)
> +{
> +	return task_rq(p)->curr == p;
> +}

Please don't add kernel-wide infrastructure and leave it completely
undocumented. Particularly functions which are as vague and dangerous as
this one.

What locking must the caller provide? What are the semantics of the
return value?

What must the caller do to avoid oopses if *p is concurrently exiting?

etc.

The overall design intent seems very smart to me, as long as the races
can be plugged, if they're indeed present.
Nick Piggin
2009-Jan-06 17:20 UTC
Re: generic pagecache to block mapping layer (was Re: Btrfs for mainline)
On Tuesday 06 January 2009 04:10:42 Nick Piggin wrote:
> On Tuesday 06 January 2009 03:37:33 Chris Mason wrote:
> > So, long term we can have a benchmarking contest, but I've got a little
> > ways to go before that is a good idea.
>
> That would be good.

This got me motivated to rebase fsblock to current again. I finally
switched it to create a private inode for the metadata linear mapping,
which makes it incredibly cleaner (doesn't require the new page flag,
lives much more happily beside buffer_heads etc).

Another big thing I did a few months back is to add a ->data pointer to
metadata fsblocks (like bh->b_data), which makes a quick fsblock
conversion much more trivial (although it wouldn't support super-page
blocks without further work).

I'd also converted XFS to fsblock, which involved adding support for
delalloc and unwritten blocks. Although that has a couple of unfinished
bits, it works pretty well and helps prove fsblock working with an
advanced filesystem.

Given these points (particularly the first one), I'm going to try to find
time in the next few months to work on fsblock with a view to submitting
it.
On Tue, 6 Jan 2009, Andrew Morton wrote:
> On Tue, 06 Jan 2009 12:40:31 +0100 Peter Zijlstra <peterz@infradead.org> wrote:
>
> > Subject: mutex: adaptive spin
> > From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Date: Tue Jan 06 12:32:12 CET 2009
> >
> > Based on the code in -rt, provide adaptive spins on generic mutexes.
>
> How dumb is it to send a lump of uncommented, unchangelogged code as an
> rfc, only to have Ingo reply, providing the changelog for you?

Sounds smart to me. He was able to get someone else to do the work. ;-)

>
> Sigh.
>
> > --- linux-2.6.orig/kernel/mutex.c
> > +++ linux-2.6/kernel/mutex.c
> > @@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
> > 	atomic_set(&lock->count, 1);
> > 	spin_lock_init(&lock->wait_lock);
> > 	INIT_LIST_HEAD(&lock->wait_list);
> > +	lock->owner = NULL;
> >
> > 	debug_mutex_init(lock, name, key);
> > }
> > @@ -120,6 +121,28 @@ void __sched mutex_unlock(struct mutex *
> >
> > EXPORT_SYMBOL(mutex_unlock);
> >
> > +#ifdef CONFIG_SMP
> > +static int adaptive_wait(struct mutex_waiter *waiter,
> > +			 struct task_struct *owner, long state)
> > +{
> > +	for (;;) {
> > +		if (signal_pending_state(state, waiter->task))
> > +			return 0;
> > +		if (waiter->lock->owner != owner)
> > +			return 0;
> > +		if (!task_is_current(owner))
> > +			return 1;
> > +		cpu_relax();
> > +	}
> > +}
>
> Each of the tests in this function should be carefully commented. It's
> really the core piece of the design.

Yep, I agree here too.

>
> > @@ -175,11 +199,19 @@ __mutex_lock_common(struct mutex *lock,
> > 			debug_mutex_free_waiter(&waiter);
> > 			return -EINTR;
> > 		}
> > -		__set_task_state(task, state);
> >
> > -		/* didnt get the lock, go to sleep: */
> > +		owner = lock->owner;
>
> What prevents *owner from exiting right here?

Yeah, this should be commented. Why this is not a race is because the
owner has the mutex here, and we have the lock->wait_lock. When the
owner releases the mutex it must go into the slow unlock path and grab
the lock->wait_lock spinlock. Thus it will block and not go away.

Of course if there's a bug elsewhere in the kernel that lets the task
exit without releasing the mutex, this will fail. But in that case,
there's bigger problems that need to be fixed.

>
> > +		get_task_struct(owner);
> > 		spin_unlock_mutex(&lock->wait_lock, flags);

Here, we get the owner before releasing the wait_lock.

> > -		schedule();
> > +
> > +		if (adaptive_wait(&waiter, owner, state)) {
> > +			put_task_struct(owner);
> > +			__set_task_state(task, state);
> > +			/* didnt get the lock, go to sleep: */
> > +			schedule();
> > +		} else
> > +			put_task_struct(owner);
> > +
> > 		spin_lock_mutex(&lock->wait_lock, flags);
> > 	}
> >
> > @@ -187,7 +219,7 @@ done:
> > 	lock_acquired(&lock->dep_map, ip);
> > 	/* got the lock - rejoice! */
> > 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
> > -	debug_mutex_set_owner(lock, task_thread_info(task));
> > +	lock->owner = task;
> >
> > 	/* set it to 0 if there are no waiters left: */
> > 	if (likely(list_empty(&lock->wait_list)))
> > @@ -260,7 +292,7 @@ __mutex_unlock_common_slowpath(atomic_t
> > 		wake_up_process(waiter->task);
> > 	}
> >
> > -	debug_mutex_clear_owner(lock);
> > +	lock->owner = NULL;
> >
> > 	spin_unlock_mutex(&lock->wait_lock, flags);
> > }
> > @@ -352,7 +384,7 @@ static inline int __mutex_trylock_slowpa
> >
> > 	prev = atomic_xchg(&lock->count, -1);
> > 	if (likely(prev == 1)) {
> > -		debug_mutex_set_owner(lock, current_thread_info());
> > +		lock->owner = current;
> > 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
> > 	}
> > 	/* Set it back to 0 if there are no waiters: */
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -697,6 +697,11 @@ int runqueue_is_locked(void)
> > 	return ret;
> > }
> >
> > +int task_is_current(struct task_struct *p)
> > +{
> > +	return task_rq(p)->curr == p;
> > +}
>
> Please don't add kernel-wide infrastructure and leave it completely
> undocumented. Particularly functions which are as vague and dangerous
> as this one.
>
> What locking must the caller provide? What are the semantics of the
> return value?
>
> What must the caller do to avoid oopses if *p is concurrently exiting?
>
> etc.
>
> The overall design intent seems very smart to me, as long as the races
> can be plugged, if they're indeed present.

Thanks for the compliment on the design. It came about from lots of work
by many people in the -rt tree. What we need here is to comment the code
for those that did not see the evolution of the changes that were made.

-- Steve
Ok, last comment, I promise.

On Tue, 6 Jan 2009, Peter Zijlstra wrote:
> @@ -175,11 +199,19 @@ __mutex_lock_common(struct mutex *lock,
> 			debug_mutex_free_waiter(&waiter);
> 			return -EINTR;
> 		}
> -		__set_task_state(task, state);
>
> -		/* didnt get the lock, go to sleep: */
> +		owner = lock->owner;
> +		get_task_struct(owner);
> 		spin_unlock_mutex(&lock->wait_lock, flags);
> -		schedule();
> +
> +		if (adaptive_wait(&waiter, owner, state)) {
> +			put_task_struct(owner);
> +			__set_task_state(task, state);
> +			/* didnt get the lock, go to sleep: */
> +			schedule();
> +		} else
> +			put_task_struct(owner);
> +
> 		spin_lock_mutex(&lock->wait_lock, flags);

So I really dislike the whole get_task_struct/put_task_struct thing. It
seems very annoying. And as far as I can tell, it's there _only_ to
protect "task->rq" and nothing else (ie to make sure that the task
doesn't exit and get freed and the pointer now points to la-la-land).

Wouldn't it be much nicer to just cache the rq pointer (take it while
still holding the spinlock), and then pass it in to adaptive_wait()?

Then, adaptive_wait() can just do

	if (lock->owner != owner)
		return 0;

	if (rq->task != owner)
		return 1;

Sure - the owner may have rescheduled to another CPU, but if it did that,
then we really might as well sleep. So we really don't need to
dereference that (possibly stale) owner task_struct at all - because we
don't care. All we care about is whether the owner is still busy on that
other CPU that it was on.

Hmm? So it looks to me that we don't really need that annoying "try to
protect the task pointer" crud. We can do the sufficient (and limited)
sanity checking without the task even existing, as long as we originally
load the ->rq pointer at a point where it was stable (ie inside the
spinlock, when we know that the task must be still alive since it owns
the lock).

		Linus
On Tue, 6 Jan 2009, Linus Torvalds wrote:
>
> Ok, last comment, I promise.
>
> On Tue, 6 Jan 2009, Peter Zijlstra wrote:
> > @@ -175,11 +199,19 @@ __mutex_lock_common(struct mutex *lock,
> > 			debug_mutex_free_waiter(&waiter);
> > 			return -EINTR;
> > 		}
> > -		__set_task_state(task, state);
> >
> > -		/* didnt get the lock, go to sleep: */
> > +		owner = lock->owner;
> > +		get_task_struct(owner);
> > 		spin_unlock_mutex(&lock->wait_lock, flags);
> > -		schedule();
> > +
> > +		if (adaptive_wait(&waiter, owner, state)) {
> > +			put_task_struct(owner);
> > +			__set_task_state(task, state);
> > +			/* didnt get the lock, go to sleep: */
> > +			schedule();
> > +		} else
> > +			put_task_struct(owner);
> > +
> > 		spin_lock_mutex(&lock->wait_lock, flags);
>
> So I really dislike the whole get_task_struct/put_task_struct thing. It
> seems very annoying. And as far as I can tell, it's there _only_ to
> protect "task->rq" and nothing else (ie to make sure that the task
> doesn't exit and get freed and the pointer now points to la-la-land).

Yeah, that was not one of the things that we liked either. We tried other
ways to get around the get_task_struct but ended up with the
get_task_struct in the end anyway.

>
> Wouldn't it be much nicer to just cache the rq pointer (take it while
> still holding the spinlock), and then pass it in to adaptive_wait()?
>
> Then, adaptive_wait() can just do
>
> 	if (lock->owner != owner)
> 		return 0;
>
> 	if (rq->task != owner)
> 		return 1;
>
> Sure - the owner may have rescheduled to another CPU, but if it did
> that, then we really might as well sleep. So we really don't need to
> dereference that (possibly stale) owner task_struct at all - because we
> don't care. All we care about is whether the owner is still busy on
> that other CPU that it was on.
>
> Hmm? So it looks to me that we don't really need that annoying "try to
> protect the task pointer" crud. We can do the sufficient (and limited)
> sanity checking without the task even existing, as long as we
> originally load the ->rq pointer at a point where it was stable (ie
> inside the spinlock, when we know that the task must be still alive
> since it owns the lock).

Caching the rq is an interesting idea. But since the rq struct is local
to sched.c, what would be a good API to do this?

in mutex.c:

	void *rq;

	[...]

	rq = get_task_rq(owner);
	spin_unlock(&lock->wait_lock);

	[...]

	if (!task_running_on_rq(rq, owner))

in sched.c:

	void *get_task_rq(struct task_struct *p)
	{
		return task_rq(p);
	}

	int task_running_on_rq(void *r, struct task_struct *p)
	{
		struct rq *rq = r;

		return rq->curr == p;
	}

??

-- Steve
On Tue, 6 Jan 2009, Linus Torvalds wrote:
>
> Sure - the owner may have rescheduled to another CPU, but if it did
> that, then we really might as well sleep.

.. or at least re-try the loop. It might be worth it to move the whole
sleeping behavior into that helper function (mutex_spin_or_sleep()), and
just make the logic be:

 - if we _enter_ the function with the owner not on a runqueue, then we
   sleep

 - otherwise, we spin as long as the owner stays on the same runqueue.

and then we loop over the whole "get mutex spinlock and revalidate it
all" after either sleeping or spinning for a while.

That seems much simpler in all respects. And it should simplify the
whole patch too, because it literally means that in the main
__mutex_lock_common(), the only real difference is

-		__set_task_state(task, state);
-		spin_unlock_mutex(&lock->wait_lock, flags);
-		schedule();

becoming instead

+		mutex_spin_or_schedule(owner, task, lock, state);

and then all the "spin-or-schedule" logic is in just that one simple
routine (that has to look up the rq and drop the lock and re-take it).
The "mutex_spin_or_schedule()" code would literally look something like

	void mutex_spin_or_schedule(owner, task, lock, state)
	{
		struct rq *owner_rq = owner->rq;

		if (owner_rq->curr != owner) {
			__set_task_state(task, state);
			spin_unlock_mutex(&lock->wait_lock, flags);
			schedule();
		} else {
			spin_unlock_mutex(&lock->wait_lock, flags);
			do {
				cpu_relax();
			} while (lock->owner == owner &&
				 owner_rq->curr == owner);
		}
		spin_lock_mutex(&lock->wait_lock, flags);
	}

or something.

Btw, this also fixes a bug: your patch did

+			__set_task_state(task, state);
+			/* didnt get the lock, go to sleep: */
+			schedule();

for the schedule case without holding the mutex spinlock. And that seems
very buggy and racy indeed: since it doesn't hold the mutex lock, if the
old owner releases the mutex at just the right point (yeah, yeah, it
requires a scheduling event on another CPU in order to also miss the
whole "task_is_current()" logic), the wakeup can get lost, because you
set the state to sleeping perhaps _after_ the task just got woken up. So
we stay sleeping even though the mutex is clear.

So I'm going to NAK the original patch, and just -require- the cleanup
I'm suggesting as also fixing what looks like a bug. Of course, once I
see the actual patch, maybe I'm going to find something _else_ to kwetch
about.

		Linus
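[Editor's sketch] The consolidated helper can be sketched in userspace, with a pthread
mutex standing in for lock->wait_lock and an enum result replacing the real
schedule() call. All names here are illustrative stand-ins; the one property
the sketch preserves is that the decision (and, in the kernel, the
__set_task_state()) happens while the wait_lock is still held, which is
what closes the lost-wakeup window described above:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Illustrative stand-ins for the kernel structures. */
struct xtask { _Atomic int on_cpu; };
struct xlock {
	pthread_mutex_t wait_lock;	/* models lock->wait_lock */
	_Atomic(struct xtask *) owner;
};

enum spin_result { WENT_TO_SLEEP, SPUN_AND_RETRY };

/* Called with lock->wait_lock held; returns with it held again. */
static enum spin_result
mutex_spin_or_schedule(struct xlock *lock, struct xtask *owner)
{
	if (!atomic_load(&owner->on_cpu)) {
		/* Owner is not running: in the kernel the sleeping
		 * state is set here, *before* the wait_lock is dropped,
		 * so a wakeup between unlock and schedule() is not
		 * lost. */
		pthread_mutex_unlock(&lock->wait_lock);
		/* ... schedule() would run here ... */
		pthread_mutex_lock(&lock->wait_lock);
		return WENT_TO_SLEEP;
	}
	/* Owner is running: drop the lock and busy-wait for as long as
	 * the same owner still holds the mutex and stays on its CPU. */
	pthread_mutex_unlock(&lock->wait_lock);
	while (atomic_load(&lock->owner) == owner &&
	       atomic_load(&owner->on_cpu))
		;	/* cpu_relax() in the kernel */
	pthread_mutex_lock(&lock->wait_lock);
	return SPUN_AND_RETRY;
}
```

Either way the caller re-enters the "revalidate everything under the
wait_lock" loop, which is what keeps the helper simple.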
On Tue, 6 Jan 2009, Steven Rostedt wrote:
>
> Caching the rq is an interesting idea. But since the rq struct is local
> to sched.c, what would be a good API to do this?

Just move the whole "spin_or_schedule()" into sched.c, and you're all
done. Yeah, that requires having sched.c know a bit about mutex locking
rules, but the data structures are already exported in <linux/mutex.h>,
and while _normal_ code should never care or know exactly how it is
used, I think the scheduler might as well. Mutexes are clearly very much
a scheduling entity, so it's not too much of a break with internal
knowledge.

		Linus
>
> Btw, this also fixes a bug: your patch did
>
> +			__set_task_state(task, state);
> +			/* didnt get the lock, go to sleep: */
> +			schedule();
>
> for the schedule case without holding the mutex spinlock.
>
> And that seems very buggy and racy indeed: since it doesn't hold the
> mutex lock, if the old owner releases the mutex at just the right point
> (yeah, yeah, it requires a scheduling event on another CPU in order to
> also miss the whole "task_is_current()" logic), the wakeup can get
> lost, because you set the state to sleeping perhaps _after_ the task
> just got woken up. So we stay sleeping even though the mutex is clear.

That is indeed a bug. Peter, why did you need to move __set_task_state
down here? The -rt patch does not do this.

-- Steve
Hello everyone,

Thanks for all of the comments so far. I've pushed out a number of fixes
for btrfs mainline, covering most of the comments from this thread.

* All LINUX_KERNEL_VERSION checks are gone.
* checkpatch.pl fixes
* Extra permission checks on the ioctls
* Some important bug fixes from the btrfs list
* Andi found a buggy use of kmap_atomic during checksum verification
* Drop EXPORT_SYMBOLs from extent_io.c

Unresolved from this reviewing thread:

* Should it be named btrfsdev? My vote is no, it is extra work for the
  distros when we finally do rename it, and I don't think btrfs really
  has the reputation for stability right now. But if Linus or Andrew
  would prefer the dev on there, I'll do it.

* My ugly mutex_trylock spin. It's a hefty performance gain so I'm
  hoping to keep it until there is a generic adaptive lock.

The full kernel tree is here:

http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=summary

The standalone tree just has the btrfs files and commits, reviewers may
find it easier:

http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable-standalone.git;a=summary

The utilities can be found here:

http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git;a=summary

-chris
On Tue, Jan 06, 2009 at 09:15:53AM -0500, Gregory Haskins wrote:
> Peter Zijlstra wrote:
> > On Tue, 2009-01-06 at 14:16 +0100, Ingo Molnar wrote:
> >
> >> * Gregory Haskins <ghaskins@novell.com> wrote:
> >>
> >>> Ingo Molnar wrote:
> >>>> There's no time or spin-rate based heuristics in this at all (i.e.
> >>>> these mutexes are not 'adaptive' at all!),
> >>>>
> >>> FYI: The original "adaptive" name was chosen in the -rt
> >>> implementation to reflect that the locks can adaptively spin or
> >>> sleep, depending on conditions. I realize this is in contrast to the
> >>> typical usage of the term when it is in reference to the spin-time
> >>> being based on some empirical heuristics, etc as you mentioned.
> >>> Sorry for the confusion.
> >>>
> >> the current version of the -rt spinny-mutexes bits were mostly
> >> written by Steve, right? Historically it all started out with a more
> >> classic "adaptive mutexes" patchset so the name stuck i guess.
> >>
> >
> > Yeah, Gregory and co. started with the whole thing and showed there
> > was significant performance to be gained, after that Steve rewrote it
> > from scratch reducing it to this minimalist heuristic, with help from
> > Greg.
> >
> > (At least, that is how I remember it, please speak up if I got things
> > wrong)
>
> Thats pretty accurate IIUC. The concept and original patches were
> written by myself, Peter Morreale and Sven Dietrich (cc'd). However,
> Steve cleaned up our patch before accepting it into -rt (we had extra
> provisions for things like handling conditional compilation and
> run-time disablement which he did not care for), but its otherwise the
> precise core concept we introduced. I think Steven gave a nice
> attribution to that fact in the prologue, however. And I also ACKed his
> cleanup, so I think all is well from my perspective.
>
> As a historical note: It should be mentioned that Steven also
> introduced a really brilliant optimization to use RCU for the owner
> tracking that the original patch as submitted by my team did not have.
> However, it turned out to inadvertently regress performance due to the
> way preempt-rcu's rcu_read_lock() works so it had to be reverted a few
> weeks later to the original logic that we submitted (even though on
> paper his ideas in that area were superior to ours). So whats in -rt
> now really is more or less our patch sans the conditional crap, etc.

Preemptable RCU needs to be faster. Got it -- and might have a way to do
it by eliminating the irq disabling and cutting way back on the number
of operations that must be performed. It would probably still be
necessary to access the task structure.

Or is something other than the raw performance of rcu_read_lock() and
rcu_read_unlock() at issue here?

							Thanx, Paul

> Hope that helps!
>
> -Greg
On Tue, 2009-01-06 at 13:42 -0800, Paul E. McKenney wrote:
> Preemptable RCU needs to be faster. Got it -- and might have a way
> to do it by eliminating the irq disabling and cutting way back on the
> number of operations that must be performed. It would probably still
> be necessary to access the task structure.
>
> Or is something other than the raw performance of rcu_read_lock() and
> rcu_read_unlock() at issue here?

With Linus' mutex_spin_or_schedule() function the whole - keeping
owner's task_struct alive issue goes away,.. now if only the thing would
boot...
On Tue, 6 Jan 2009, Peter Zijlstra wrote:
>
> With Linus' mutex_spin_or_schedule() function the whole - keeping
> owner's task_struct alive issue goes away,.. now if only the thing
> would boot...

Can you post the patch, so that we can see if we can find some silly
error that we can ridicule you over?

		Linus "always willing to 'help'" Torvalds
On Tue, 2009-01-06 at 13:50 -0800, Linus Torvalds wrote:> > On Tue, 6 Jan 2009, Peter Zijlstra wrote: > > > > With Linus'' mutex_spin_or_schedule() function the whole - keeping > > owner''s task_struct alive issue goes away,.. now if only the thing would > > boot... > > Can you post the patch, so that we can see if we can find some silly error > that we can ridicule you over?Sure, see below.. I think I''m seeing why it deadlocks.. One of the things the previous patch did wrong was that it never tracked the owner in the mutex fast path -- I initially didn''t notice because I had all debugging infrastructure enabled, and that short circuits all the fast paths. So I added a lame fast path owner tracking: preempt_disable(); mutex_fast_path_lock(); lock->owner = current; preempt_enable(); and a similar clear on the unlock side. Which is exactly what causes the deadlock -- or livelock more accurately. Since the contention code spins while !owner, and the unlock code clears owner before release, we spin so hard we never release the cacheline to the other cpu and therefore get stuck. Now I just looked what kernel/rtmutex.c does, it keeps track of the owner in the lock field and uses cmpxchg to change ownership. Regular mutexes however use atomic_t (not wide enough for void *) and hand crafted assembly fast paths for all archs. I think I''ll just hack up a atomic_long_t atomic_lock_cmpxchg mutex implementation -- so we can at least test this on x86 to see if its worth continuing this way. 
Converting all the hand-crafted asm only to find out it degrades
performance doesn't sound too cool :-)

---
 include/linux/mutex.h |    4 +-
 include/linux/sched.h |    1 
 kernel/mutex-debug.c  |   10 ------
 kernel/mutex-debug.h  |    8 -----
 kernel/mutex.c        |   80 ++++++++++++++++++++++++++++++++++++++------------
 kernel/mutex.h        |    2 -
 kernel/sched.c        |   44 +++++++++++++++++++++++++++
 7 files changed, 110 insertions(+), 39 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
+	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-	struct thread_info	*owner;
 	const char		*name;
 	void			*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
-#ifdef CONFIG_DEBUG_MUTEXES
 	struct mutex		*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,6 +329,7 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *, long state, unsigned long *flags);
 asmlinkage void schedule(void);

 struct nsproxy;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-	lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex

 	/* Mark the current thread as blocked on the lock: */
 	ti->task->blocked_on = waiter;
-	waiter->lock = lock;
 }

 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lo
 		return;

 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	/* DEBUG_LOCKS_WARN_ON(lock->owner != current); */
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }

 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
 	lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-	lock->owner = NULL;
 	lock->magic = lock;
 }

Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-	lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
+	lock->owner = NULL;

 	debug_mutex_init(lock, name, key);
 }
@@ -90,7 +91,10 @@ void inline __sched mutex_lock(struct mu
 	 * The locking fastpath is the 1->0 transition from
 	 * 'unlocked' into 'locked' state.
 	 */
+	preempt_disable();
 	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+	lock->owner = current;
+	preempt_enable();
 }

 EXPORT_SYMBOL(mutex_lock);
@@ -115,7 +119,10 @@ void __sched mutex_unlock(struct mutex *
 	 * The unlocking fastpath is the 0->1 transition from 'locked'
 	 * into 'unlocked' state:
 	 */
+	preempt_disable();
+	lock->owner = NULL;
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
+	preempt_enable();
 }

 EXPORT_SYMBOL(mutex_unlock);
@@ -127,7 +134,7 @@ static inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		    unsigned long ip)
 {
-	struct task_struct *task = current;
+	struct task_struct *owner, *task = current;
 	struct mutex_waiter waiter;
 	unsigned int old_val;
 	unsigned long flags;
@@ -141,6 +148,7 @@ __mutex_lock_common(struct mutex *lock,
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
+	waiter.lock = lock;

 	old_val = atomic_xchg(&lock->count, -1);
 	if (old_val == 1)
@@ -175,19 +183,16 @@ __mutex_lock_common(struct mutex *lock,
 			debug_mutex_free_waiter(&waiter);
 			return -EINTR;
 		}
-		__set_task_state(task, state);

-		/* didnt get the lock, go to sleep: */
-		spin_unlock_mutex(&lock->wait_lock, flags);
-		schedule();
-		spin_lock_mutex(&lock->wait_lock, flags);
+		preempt_enable();
+		mutex_spin_or_schedule(&waiter, state, &flags);
+		preempt_disable();
 	}

 done:
 	lock_acquired(&lock->dep_map, ip);
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
-	debug_mutex_set_owner(lock, task_thread_info(task));

 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
@@ -205,7 +210,10 @@ void __sched
 mutex_lock_nested(struct mutex *lock, unsigned int subclass)
 {
 	might_sleep();
+	preempt_disable();
 	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, _RET_IP_);
+	lock->owner = current;
+	preempt_enable();
 }

 EXPORT_SYMBOL_GPL(mutex_lock_nested);
@@ -213,16 +221,32 @@ EXPORT_SYMBOL_GPL(mutex_lock_nested);
 int __sched
 mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	preempt_disable();
+	ret = __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	if (!ret)
+		lock->owner = current;
+	preempt_enable();
+
+	return ret;
 }

 EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);

 int __sched
 mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, subclass, _RET_IP_);
+	preempt_disable();
+	ret = __mutex_lock_common(lock, TASK_INTERRUPTIBLE, subclass, _RET_IP_);
+	if (!ret)
+		lock->owner = current;
+	preempt_enable();
+
+	return ret;
 }

 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);

@@ -260,8 +284,6 @@ __mutex_unlock_common_slowpath(atomic_t
 		wake_up_process(waiter->task);
 	}

-	debug_mutex_clear_owner(lock);
-
 	spin_unlock_mutex(&lock->wait_lock, flags);
 }

@@ -298,18 +320,34 @@ __mutex_lock_interruptible_slowpath(atom
  */
 int __sched mutex_lock_interruptible(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	preempt_disable();
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_interruptible_slowpath);
+	if (!ret)
+		lock->owner = current;
+	preempt_enable();
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_lock_interruptible);

 int __sched mutex_lock_killable(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	preempt_disable();
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_killable_slowpath);
+	if (!ret)
+		lock->owner = current;
+	preempt_enable();
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_lock_killable);

@@ -351,10 +389,9 @@ static inline int __mutex_trylock_slowpa
 	spin_lock_mutex(&lock->wait_lock, flags);

 	prev = atomic_xchg(&lock->count, -1);
-	if (likely(prev == 1)) {
-		debug_mutex_set_owner(lock, current_thread_info());
+	if (likely(prev == 1))
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
-	}
+
 	/* Set it back to 0 if there are no waiters: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -380,8 +417,15 @@ static inline int __mutex_trylock_slowpa
  */
 int __sched mutex_trylock(struct mutex *lock)
 {
-	return __mutex_fastpath_trylock(&lock->count,
-					__mutex_trylock_slowpath);
+	int ret;
+
+	preempt_disable();
+	ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath);
+	if (ret)
+		lock->owner = current;
+	preempt_enable();
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_trylock);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4596,6 +4596,50 @@ pick_next_task(struct rq *rq, struct tas
 	}
 }

+#ifdef CONFIG_DEBUG_MUTEXES
+# include "mutex-debug.h"
+#else
+# include "mutex.h"
+#endif
+
+void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, unsigned long *flags)
+{
+	struct mutex *lock = waiter->lock;
+	struct task_struct *task = waiter->task;
+	struct task_struct *owner = lock->owner;
+	struct rq *rq;
+
+	/* Lack of ownership demands we try again to obtain it */
+	if (!owner)
+		return;
+
+	rq = task_rq(owner);
+
+	if (rq->curr != owner) {
+		__set_task_state(task, state);
+		spin_unlock_mutex(&lock->wait_lock, *flags);
+		schedule();
+	} else {
+		spin_unlock_mutex(&lock->wait_lock, *flags);
+		for (;;) {
+			/* Stop spinning when there's a pending signal. */
+			if (signal_pending_state(state, task))
+				break;
+
+			/* Owner changed, bail to revalidate state */
+			if (lock->owner != owner)
+				break;
+
+			/* Owner stopped running, bail to revalidate state */
+			if (rq->curr != owner)
+				break;
+
+			cpu_relax();
+		}
+	}
+	spin_lock_mutex(&lock->wait_lock, *flags);
+}
+
 /*
  * schedule() is the main scheduler function.
  */
Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -16,8 +16,6 @@
 #define mutex_remove_waiter(lock, waiter, ti) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)

-#define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
-#define debug_mutex_clear_owner(lock)			do { } while (0)
 #define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
 #define debug_mutex_free_waiter(waiter)			do { } while (0)
 #define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
J. Bruce Fields wrote:
> Old kernel versions may still get booted after brtfs has gotten a
> reputation for stability. E.g. if I move my / to brtfs in 2.6.34, then
> one day need to boot back to 2.6.30 to track down some regression, the
> reminder that I'm moving back to some sort of brtfs dark-ages might be
> welcome.

Require a mount option "allow_unstable_version" until it's stable? The
stable version can ignore the option. In your example, you wouldn't use
the option, and in the btrfs dark ages it would refuse to mount.

-- Jamie
On Tue, 6 Jan 2009, Peter Zijlstra wrote:
>
> One of the things the previous patch did wrong was that it never tracked
> the owner in the mutex fast path -- I initially didn't notice because I
> had all debugging infrastructure enabled, and that short circuits all
> the fast paths.

Why even worry?

Just set the owner non-atomically. We can consider a NULL owner in the
slow-path to be a "let's schedule" case. After all, it's just a
performance tuning thing, and the window is going to be very small, so
it won't even be relevant from a performance tuning angle.

So there's _no_ reason why the fast-path can't just set owner, and no
reason to disable preemption.

I also think the whole preempt_disable/enable around the
mutex_spin_or_schedule() is pure garbage.

In fact, I suspect that's the real bug you're hitting: you're enabling
preemption while holding a spinlock. That is NOT a good idea.

So

 - remove all the "preempt_disable/enable" crud. It's wrong.

 - remove the whole

	+	if (!owner)
	+		return;

   thing. It's also wrong. Just go to the schedule case for that.

It all looks like unnecessary complexity that you added, and I think
_that_ is what bites you. Aim for _simple_. Not clever. Not complex.

What you should aim for is to keep the _exact_ same code that we had
before in mutex, and the absolute *only* change is replacing that
"unlock+schedule()+relock" sequence with "mutex_spin_or_schedule()".

That should be your starting point. Yeah, you'll need to set the owner
for the fast case, but there are no locking or preemption issues there.
Just forget them. You're confusing yourself and the code by trying to
look for problems that aren't relevant.

		Linus
On Tue, Jan 06, 2009 at 10:44:35PM +0100, Peter Zijlstra wrote:
> On Tue, 2009-01-06 at 13:42 -0800, Paul E. McKenney wrote:
> > Preemptable RCU needs to be faster. Got it -- and might have a way
> > to do it by eliminating the irq disabling and cutting way back on the
> > number of operations that must be performed. It would probably still
> > be necessary to access the task structure.
> >
> > Or is something other than the raw performance of rcu_read_lock() and
> > rcu_read_unlock() at issue here?
>
> With Linus' mutex_spin_or_schedule() function the whole - keeping
> owner's task_struct alive issue goes away,.. now if only the thing would
> boot...

Cool! And I can relate to the "if only the thing would boot" part. ;-)

							Thanx, Paul
On Tue, 2009-01-06 at 14:22 -0800, Linus Torvalds wrote:
> On Tue, 6 Jan 2009, Peter Zijlstra wrote:
> >
> > One of the things the previous patch did wrong was that it never tracked
> > the owner in the mutex fast path -- I initially didn't notice because I
> > had all debugging infrastructure enabled, and that short circuits all
> > the fast paths.
>
> Why even worry?

Wrong mind-set, as you rightly point out below.

> Just set the owner non-atomically. We can consider a NULL owner in the
> slow-path to be a "let's schedule" case. After all, it's just a
> performance tuning thing, and the window is going to be very small, so
> it won't even be relevant from a performance tuning angle.
>
> So there's _no_ reason why the fast-path can't just set owner, and no
> reason to disable preemption.

Agreed, when viewed from that angle all the little races don't matter.

> I also think the whole preempt_disable/enable around the
> mutex_spin_or_schedule() is pure garbage.
>
> In fact, I suspect that's the real bug you're hitting: you're enabling
> preemption while holding a spinlock. That is NOT a good idea.

spinlocks also fiddle with preempt_count, that should all work out --
although granted, it does look funny.
Indeed, the below does boot -- which means I get to sleep now ;-)

---
 include/linux/mutex.h |    4 +--
 include/linux/sched.h |    1 
 kernel/mutex-debug.c  |   10 --------
 kernel/mutex-debug.h  |    8 ------
 kernel/mutex.c        |   62 +++++++++++++++++++++++++++++++++++---------------
 kernel/mutex.h        |    2 -
 kernel/sched.c        |   44 +++++++++++++++++++++++++++++
 7 files changed, 92 insertions(+), 39 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
+	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-	struct thread_info	*owner;
 	const char		*name;
 	void			*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
-#ifdef CONFIG_DEBUG_MUTEXES
 	struct mutex		*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,6 +329,7 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *, long state, unsigned long *flags);
 asmlinkage void schedule(void);

 struct nsproxy;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-	lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex

 	/* Mark the current thread as blocked on the lock: */
 	ti->task->blocked_on = waiter;
-	waiter->lock = lock;
 }

 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lo
 		return;

 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	/* DEBUG_LOCKS_WARN_ON(lock->owner != current); */
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }

 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
 	lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-	lock->owner = NULL;
 	lock->magic = lock;
 }

Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-	lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -46,6 +46,7 @@ __mutex_init(struct mutex *lock, const c
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
+	lock->owner = NULL;

 	debug_mutex_init(lock, name, key);
 }
@@ -91,6 +92,7 @@ void inline __sched mutex_lock(struct mu
 	 * 'unlocked' into 'locked' state.
 	 */
 	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+	lock->owner = current;
 }

 EXPORT_SYMBOL(mutex_lock);
@@ -115,6 +117,7 @@ void __sched mutex_unlock(struct mutex *
 	 * The unlocking fastpath is the 0->1 transition from 'locked'
 	 * into 'unlocked' state:
 	 */
+	lock->owner = NULL;
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
 }

@@ -127,7 +130,7 @@ static inline int __sched
 __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		    unsigned long ip)
 {
-	struct task_struct *task = current;
+	struct task_struct *owner, *task = current;
 	struct mutex_waiter waiter;
 	unsigned int old_val;
 	unsigned long flags;
@@ -141,6 +144,7 @@ __mutex_lock_common(struct mutex *lock,
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
+	waiter.lock = lock;

 	old_val = atomic_xchg(&lock->count, -1);
 	if (old_val == 1)
@@ -175,19 +179,14 @@ __mutex_lock_common(struct mutex *lock,
 			debug_mutex_free_waiter(&waiter);
 			return -EINTR;
 		}
-		__set_task_state(task, state);

-		/* didnt get the lock, go to sleep: */
-		spin_unlock_mutex(&lock->wait_lock, flags);
-		schedule();
-		spin_lock_mutex(&lock->wait_lock, flags);
+		mutex_spin_or_schedule(&waiter, state, &flags);
 	}

 done:
 	lock_acquired(&lock->dep_map, ip);
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
-	debug_mutex_set_owner(lock, task_thread_info(task));

 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
@@ -206,6 +205,7 @@ mutex_lock_nested(struct mutex *lock, un
 {
 	might_sleep();
 	__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, subclass, _RET_IP_);
+	lock->owner = current;
 }

 EXPORT_SYMBOL_GPL(mutex_lock_nested);
@@ -213,16 +213,28 @@ EXPORT_SYMBOL_GPL(mutex_lock_nested);
 int __sched
 mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	ret = __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
+	if (!ret)
+		lock->owner = current;
+
+	return ret;
 }

 EXPORT_SYMBOL_GPL(mutex_lock_killable_nested);

 int __sched
 mutex_lock_interruptible_nested(struct mutex *lock, unsigned int subclass)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_lock_common(lock, TASK_INTERRUPTIBLE, subclass, _RET_IP_);
+	ret = __mutex_lock_common(lock, TASK_INTERRUPTIBLE, subclass, _RET_IP_);
+	if (!ret)
+		lock->owner = current;
+
+	return ret;
 }

 EXPORT_SYMBOL_GPL(mutex_lock_interruptible_nested);

@@ -260,8 +272,6 @@ __mutex_unlock_common_slowpath(atomic_t
 		wake_up_process(waiter->task);
 	}

-	debug_mutex_clear_owner(lock);
-
 	spin_unlock_mutex(&lock->wait_lock, flags);
 }

@@ -298,18 +308,30 @@ __mutex_lock_interruptible_slowpath(atom
  */
 int __sched mutex_lock_interruptible(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_interruptible_slowpath);
+	if (!ret)
+		lock->owner = current;
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_lock_interruptible);

 int __sched mutex_lock_killable(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_killable_slowpath);
+	if (!ret)
+		lock->owner = current;
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_lock_killable);

@@ -351,10 +373,9 @@ static inline int __mutex_trylock_slowpa
 	spin_lock_mutex(&lock->wait_lock, flags);

 	prev = atomic_xchg(&lock->count, -1);
-	if (likely(prev == 1)) {
-		debug_mutex_set_owner(lock, current_thread_info());
+	if (likely(prev == 1))
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
-	}
+
 	/* Set it back to 0 if there are no waiters: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -380,8 +401,13 @@ static inline int __mutex_trylock_slowpa
  */
 int __sched mutex_trylock(struct mutex *lock)
 {
-	return __mutex_fastpath_trylock(&lock->count,
-					__mutex_trylock_slowpath);
+	int ret;
+
+	ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath);
+	if (ret)
+		lock->owner = current;
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_trylock);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4596,6 +4596,50 @@ pick_next_task(struct rq *rq, struct tas
 	}
 }

+#ifdef CONFIG_DEBUG_MUTEXES
+# include "mutex-debug.h"
+#else
+# include "mutex.h"
+#endif
+
+void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, unsigned long *flags)
+{
+	struct mutex *lock = waiter->lock;
+	struct task_struct *task = waiter->task;
+	struct task_struct *owner = lock->owner;
+	struct rq *rq;
+
+	if (!owner)
+		goto do_schedule;
+
+	rq = task_rq(owner);
+
+	if (rq->curr != owner) {
+do_schedule:
+		__set_task_state(task, state);
+		spin_unlock_mutex(&lock->wait_lock, *flags);
+		schedule();
+	} else {
+		spin_unlock_mutex(&lock->wait_lock, *flags);
+		for (;;) {
+			/* Stop spinning when there's a pending signal. */
+			if (signal_pending_state(state, task))
+				break;
+
+			/* Owner changed, bail to revalidate state */
+			if (lock->owner != owner)
+				break;
+
+			/* Owner stopped running, bail to revalidate state */
+			if (rq->curr != owner)
+				break;
+
+			cpu_relax();
+		}
+	}
+	spin_lock_mutex(&lock->wait_lock, *flags);
+}
+
 /*
  * schedule() is the main scheduler function.
  */
Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -16,8 +16,6 @@
 #define mutex_remove_waiter(lock, waiter, ti) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)

-#define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
-#define debug_mutex_clear_owner(lock)			do { } while (0)
 #define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
 #define debug_mutex_free_waiter(waiter)			do { } while (0)
 #define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
On Tue, 6 Jan 2009, Peter Zijlstra wrote:
> >
> > In fact, I suspect that's the real bug you're hitting: you're enabling
> > preemption while holding a spinlock. That is NOT a good idea.
>
> spinlocks also fiddle with preempt_count, that should all work out -
> although granted, it does look funny.

It most certainly doesn't always work out. For example, the irq-disabling
ones do *not* fiddle with preempt_count, because they disable preemption
by just disabling interrupts.

So doing preempt_enable() inside such a spinlock is almost guaranteed to
lock up: because the preempt_enable() will now potentially call the
scheduler with a spinlock held and with interrupts disabled.

That, in turn, can cause any number of problems - deadlocks with other
processes that then try to take the spinlock that didn't get released,
but also deadlocks with interrupts, since the scheduler will enable
interrupts again.

So mixing preemption and spinlocks is almost always a bug. Yes, _some_
cases work out ok, but I'd call those the odd ones.

		Linus
On Tue, 6 Jan 2009, Peter Zijlstra wrote:
>
> Indeed, the below does boot -- which means I get to sleep now ;-)

Well, if you didn't go to sleep, a few more questions..

> int __sched
> mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
> {
> +	int ret;
> +
> 	might_sleep();
> -	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
> +	ret = __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
> +	if (!ret)
> +		lock->owner = current;
> +
> +	return ret;

This looks ugly. Why doesn't __mutex_lock_common() just set the lock
owner? Hate seeing it done in the caller that has to re-compute common
(yeah, yeah, it's cheap) and just looks ugly.

IOW, why didn't this just get done with something like

--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -186,6 +186,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 done:
 	lock_acquired(&lock->dep_map, ip);
 	/* got the lock - rejoice! */
+	lock->owner = task;
 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
 	debug_mutex_set_owner(lock, task_thread_info(task));

instead? That takes care of all callers, including the conditional thing
(since the error case is a totally different path).

		Linus
On Tue, Jan 06, 2009 at 03:00:47PM -0800, Linus Torvalds wrote:
> Well, if you didn't go to sleep, a few more questions..

I know this one! Me sir, me me me!

> > int __sched
> > mutex_lock_killable_nested(struct mutex *lock, unsigned int subclass)
> > {
> > +	int ret;
> > +
> > 	might_sleep();
> > -	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
> > +	ret = __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);
> > +	if (!ret)
> > +		lock->owner = current;
> > +
> > +	return ret;
>
> This looks ugly. Why doesn't __mutex_lock_common() just set the lock
> owner? Hate seeing it done in the caller that has to re-compute common
> (yeah, yeah, it's cheap) and just looks ugly.

Because __mutex_lock_common() is the slow path. The fast path is a
couple of assembly instructions in asm/mutex.h. If the lock isn't
contended, it will never call __mutex_lock_common().

That would make the whole exercise rather pointless; the only time worth
spinning really is if you're the only other one waiting for it ... if
there's already a waiter, you might as well go to sleep.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
On Tue, 2009-01-06 at 23:43 +0100, Peter Zijlstra wrote:
> @@ -115,6 +117,7 @@ void __sched mutex_unlock(struct mutex *
> 	 * The unlocking fastpath is the 0->1 transition from 'locked'
> 	 * into 'unlocked' state:
> 	 */
> +	lock->owner = NULL;
> 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
> }

> +void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, unsigned long *flags)
> +{
> +	struct mutex *lock = waiter->lock;
> +	struct task_struct *task = waiter->task;
> +	struct task_struct *owner = lock->owner;
> +	struct rq *rq;
> +
> +	if (!owner)
> +		goto do_schedule;
> +
> +	rq = task_rq(owner);
> +
> +	if (rq->curr != owner) {
> +do_schedule:
> +		__set_task_state(task, state);
> +		spin_unlock_mutex(&lock->wait_lock, *flags);
> +		schedule();
> +	} else {
> +		spin_unlock_mutex(&lock->wait_lock, *flags);
> +		for (;;) {
> +			/* Stop spinning when there's a pending signal. */
> +			if (signal_pending_state(state, task))
> +				break;
> +
> +			/* Owner changed, bail to revalidate state */
> +			if (lock->owner != owner)
> +				break;
> +
> +			/* Owner stopped running, bail to revalidate state */
> +			if (rq->curr != owner)
> +				break;
> +
> +			cpu_relax();
> +		}
> +	}
> +	spin_lock_mutex(&lock->wait_lock, *flags);
> +}

That's not going to work: we set owner to NULL, which means pending
spinners get schedule()ed out instead of racing to acquire.

I suppose the below would fix that... really sleep time now

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4626,8 +4626,8 @@ do_schedule:
 			if (signal_pending_state(state, task))
 				break;

-			/* Owner changed, bail to revalidate state */
-			if (lock->owner != owner)
+			/* Mutex got unlocked, race to acquire. */
+			if (!mutex_is_locked(lock))
 				break;

 			/* Owner stopped running, bail to revalidate state */
On Tue, 6 Jan 2009, Matthew Wilcox wrote:
> >
> > This looks ugly. Why doesn't __mutex_lock_common() just set the lock
> > owner? Hate seeing it done in the caller that has to re-compute common
> > (yeah, yeah, it's cheap) and just looks ugly.
>
> Because __mutex_lock_common() is the slow path. The fast path is a
> couple of assembly instructions in asm/mutex.h. If the lock isn't
> contended, it will never call __mutex_lock_common().

No, that's not it. Look at the callers. They are _all_ the slow path.
They looked like this:

	might_sleep();
	return __mutex_lock_common(lock, TASK_KILLABLE, subclass, _RET_IP_);

Yes, you _also_ need to set the owner in the fast-path, but that's all
entirely different. This is the debug case, which _always_ calls the
slow-path.

So what I'm saying is that the slow-path should just set it. And then
yes, we _also_ need to set it in the fast-path, but at least we don't
need to set it in all the debug versions that just call the slow path!

		Linus
Peter Zijlstra wrote:
> +void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, unsigned long *flags)
> +{
> +	struct mutex *lock = waiter->lock;
> +	struct task_struct *task = waiter->task;
> +	struct task_struct *owner = lock->owner;
> +	struct rq *rq;
> +
> +	if (!owner)
> +		goto do_schedule;
> +
> +	rq = task_rq(owner);
> +
> +	if (rq->curr != owner) {
> +do_schedule:
> +		__set_task_state(task, state);
> +		spin_unlock_mutex(&lock->wait_lock, *flags);
> +		schedule();
> +	} else {
> +		spin_unlock_mutex(&lock->wait_lock, *flags);
> +		for (;;) {
> +			/* Stop spinning when there's a pending signal. */
> +			if (signal_pending_state(state, task))
> +				break;
> +
> +			/* Owner changed, bail to revalidate state */
> +			if (lock->owner != owner)
> +				break;
> +
> +			/* Owner stopped running, bail to revalidate state */
> +			if (rq->curr != owner)
> +				break;
> +

Two questions from my immature thought:

1) Do we need to keep gcc from optimizing when we access lock->owner
   and rq->curr in the loop?

2) "if (rq->curr != owner)" needs to become smarter.

	schedule()
	{
		select_next
		rq->curr = next;
		context_switch
	}

   We also spin while the owner is selecting the next task in
   schedule(), but select_next is not fast enough.

Lai.
On Wed, 2009-01-07 at 11:57 +0800, Lai Jiangshan wrote:
> Peter Zijlstra wrote:
> > +void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, unsigned long *flags)
> > [...]
>
> 2 questions from my initial reading:
>
> 1) Do we need to keep gcc from optimizing the accesses to lock->owner
>    and rq->curr in the loop?

cpu_relax() is a compiler barrier, iirc.

> 2) "if (rq->curr != owner)" needs to become smarter:
>
> 	schedule()
> 	{
> 		select_next
> 		rq->curr = next;
> 		context_switch
> 	}
>
>    We also spin while the owner is selecting the next task inside
>    schedule(), and select_next is not fast enough.

I'm not sure what you're saying here...
Peter Zijlstra wrote:
> On Wed, 2009-01-07 at 11:57 +0800, Lai Jiangshan wrote:
>> [...]
>> 2) "if (rq->curr != owner)" needs to become smarter:
>>
>> 	schedule()
>> 	{
>> 		select_next
>> 		rq->curr = next;
>> 		context_switch
>> 	}
>>
>>    We also spin while the owner is selecting the next task inside
>>    schedule(), and select_next is not fast enough.
>
> I'm not sure what you're saying here...

I mean that when the mutex owner calls schedule(), the current task keeps spinning until rq->curr is changed. I think that spin is unnecessary: it does nothing but waste time. The spin period is not short, and by the time it ends rq->curr has changed anyway, so the current task has to sleep. So I think the current task should sleep earlier, as soon as it detects that the mutex owner has started schedule().
On Tue, 2009-01-06 at 14:41 -0500, Chris Mason wrote:
> Hello everyone,
>
> Thanks for all of the comments so far. I've pushed out a number of
> fixes for btrfs mainline, covering most of the comments from this
> thread.
>
> * All LINUX_KERNEL_VERSION checks are gone.
> * checkpatch.pl fixes
> * Extra permission checks on the ioctls
> * Some important bug fixes from the btrfs list
> * Andi found a buggy use of kmap_atomic during checksum verification
> * Drop EXPORT_SYMBOLs from extent_io.c

Looking good...

> Unresolved from this reviewing thread:
>
> * Should it be named btrfsdev? My vote is no, it is extra work for the
>   distros when we finally do rename it, and I don't think btrfs really has
>   the reputation for stability right now. But if Linus or Andrew would
>   prefer the dev on there, I'll do it.

I agree; I don't think there's any particular need for the 'dev' suffix. It's already dependent on CONFIG_EXPERIMENTAL, after all.

> * My ugly mutex_trylock spin. It's a hefty performance gain so I'm
>   hoping to keep it until there is a generic adaptive lock.

If a better option is forthcoming, by all means use it -- but I wouldn't see the existing version as a barrier to merging.

One more thing I'd suggest is removing the INSTALL file. The parts about having to build libcrc32c aren't relevant when it's part of the kernel tree and you have 'select LIBCRC32C', and the documentation on the userspace utilities probably lives _with_ the userspace repo. It might be worth adding a pointer to the userspace utilities, though, in Documentation/filesystems/btrfs.txt.

I think you can drop your own copy of the GPL too.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
Change mutex contention behaviour such that it will sometimes busy-wait on acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was originally implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50) gave an 8% boost for VFS scalability on my testbox:

 # echo MUTEX_SPIN > /debug/sched_features
 # ./test-mutex V 16 10
 2 CPUs, running 16 parallel test-tasks.
 checking VFS performance.

 avg ops/sec: 74910

 # echo NO_MUTEX_SPIN > /debug/sched_features
 # ./test-mutex V 16 10
 2 CPUs, running 16 parallel test-tasks.
 checking VFS performance.

 avg ops/sec: 68804

The key criterion for the busy wait is that the lock owner has to be running on a (different) cpu. The idea is that as long as the owner is running, there is a fair chance it'll release the lock soon, and thus we'll be better off spinning instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the owner, we add the owner in a non-atomic fashion and deal with the races in the slowpath.
Furthermore, to ease testing of the performance impact of this new code, there is a means to disable the behaviour at runtime (without having to reboot the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y), by issuing the following command:

 # echo NO_MUTEX_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo MUTEX_SPIN > /debug/sched_features

There are also a few new statistics fields in /proc/sched_debug (available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):

 # grep mtx /proc/sched_debug
 .mtx_spin                      : 2387
 .mtx_sched                     : 2283
 .mtx_spin                      : 1277
 .mtx_sched                     : 1700

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-and-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/mutex.h   |    4 +-
 include/linux/sched.h   |    2 +
 kernel/mutex-debug.c    |   10 +------
 kernel/mutex-debug.h    |    8 -----
 kernel/mutex.c          |   46 ++++++++++++++++++++++--------
 kernel/mutex.h          |    2 -
 kernel/sched.c          |   73 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_debug.c    |    2 +
 kernel/sched_features.h |    1 +
 9 files changed, 115 insertions(+), 33 deletions(-)

diff --git a/include/linux/mutex.h b/include/linux/mutex.h
index 7a0e5c4..c007b4e 100644
--- a/include/linux/mutex.h
+++ b/include/linux/mutex.h
@@ -50,8 +50,8 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
+	struct task_struct	*owner;
 #ifdef CONFIG_DEBUG_MUTEXES
-	struct thread_info	*owner;
 	const char		*name;
 	void			*magic;
 #endif
@@ -67,8 +67,8 @@ struct mutex {
 struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
-#ifdef CONFIG_DEBUG_MUTEXES
 	struct mutex		*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
 #endif
 };
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cae9b8..d8fa96b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -328,6 +328,8 @@ extern signed long schedule_timeout(signed long timeout);
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state,
+				   unsigned long *flags);
 asmlinkage void schedule(void);

 struct nsproxy;
diff --git a/kernel/mutex-debug.c b/kernel/mutex-debug.c
index 1d94160..0564680 100644
--- a/kernel/mutex-debug.c
+++ b/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-	lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,

 	/* Mark the current thread as blocked on the lock: */
 	ti->task->blocked_on = waiter;
-	waiter->lock = lock;
 }

 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lock)
 		return;

 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	/* DEBUG_LOCKS_WARN_ON(lock->owner != current); */
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }

 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock, const char *name,
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
 	lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-	lock->owner = NULL;
 	lock->magic = lock;
 }

diff --git a/kernel/mutex-debug.h b/kernel/mutex-debug.h
index babfbdf..42eab06 100644
--- a/kernel/mutex-debug.h
+++ b/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-	lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 4f45d4b..089b46b 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -10,6 +10,10 @@
  * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
  * David Howells for suggestions and improvements.
  *
+ *  - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ *    from the -rt tree, where it was originally implemented for rtmutexes
+ *    by Steven Rostedt, based on work by Gregory Haskins.)
+ *
  * Also see Documentation/mutex-design.txt.
  */
 #include <linux/mutex.h>
@@ -46,6 +50,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
+	lock->owner = NULL;

 	debug_mutex_init(lock, name, key);
 }
@@ -91,6 +96,7 @@ void inline __sched mutex_lock(struct mutex *lock)
 	 * 'unlocked' into 'locked' state.
 	 */
 	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+	lock->owner = current;
 }

 EXPORT_SYMBOL(mutex_lock);
@@ -115,6 +121,7 @@ void __sched mutex_unlock(struct mutex *lock)
 	 * The unlocking fastpath is the 0->1 transition from 'locked'
 	 * into 'unlocked' state:
 	 */
+	lock->owner = NULL;
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
 }

@@ -141,6 +148,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
+	waiter.lock = lock;

 	old_val = atomic_xchg(&lock->count, -1);
 	if (old_val == 1)
@@ -175,19 +183,15 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 			debug_mutex_free_waiter(&waiter);
 			return -EINTR;
 		}
-		__set_task_state(task, state);

-		/* didnt get the lock, go to sleep: */
-		spin_unlock_mutex(&lock->wait_lock, flags);
-		schedule();
-		spin_lock_mutex(&lock->wait_lock, flags);
+		mutex_spin_or_schedule(&waiter, state, &flags);
 	}

 done:
 	lock_acquired(&lock->dep_map, ip);
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
-	debug_mutex_set_owner(lock, task_thread_info(task));
+	lock->owner = task;

 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
@@ -260,7 +264,7 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested)
 		wake_up_process(waiter->task);
 	}

-	debug_mutex_clear_owner(lock);
+	lock->owner = NULL;

 	spin_unlock_mutex(&lock->wait_lock, flags);
 }
@@ -298,18 +302,30 @@ __mutex_lock_interruptible_slowpath(atomic_t *lock_count);
  */
 int __sched mutex_lock_interruptible(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_interruptible_slowpath);
+	if (!ret)
+		lock->owner = current;
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_lock_interruptible);

 int __sched mutex_lock_killable(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_killable_slowpath);
+	if (!ret)
+		lock->owner = current;
+
+	return ret;
 }
 EXPORT_SYMBOL(mutex_lock_killable);

@@ -352,9 +368,10 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count)

 	prev = atomic_xchg(&lock->count, -1);
 	if (likely(prev == 1)) {
-		debug_mutex_set_owner(lock, current_thread_info());
+		lock->owner = current;
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 	}
+
 	/* Set it back to 0 if there are no waiters: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -380,8 +397,13 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count)
  */
 int __sched mutex_trylock(struct mutex *lock)
 {
-	return __mutex_fastpath_trylock(&lock->count,
-					__mutex_trylock_slowpath);
+	int ret;
+
+	ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath);
+	if (ret)
+		lock->owner = current;
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_trylock);
diff --git a/kernel/mutex.h b/kernel/mutex.h
index a075daf..55e1986 100644
--- a/kernel/mutex.h
+++ b/kernel/mutex.h
@@ -16,8 +16,6 @@
 #define mutex_remove_waiter(lock, waiter, ti) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)

-#define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
-#define debug_mutex_clear_owner(lock)			do { } while (0)
 #define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
 #define debug_mutex_free_waiter(waiter)			do { } while (0)
 #define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
diff --git a/kernel/sched.c b/kernel/sched.c
index 2e3545f..c189597 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -631,6 +631,10 @@ struct rq {

 	/* BKL stats */
 	unsigned int bkl_count;
+
+	/* mutex spin stats */
+	unsigned int mtx_spin;
+	unsigned int mtx_sched;
 #endif
 };

@@ -4527,6 +4531,75 @@ pick_next_task(struct rq *rq, struct task_struct *prev)
 	}
 }

+#ifdef CONFIG_DEBUG_MUTEXES
+# include "mutex-debug.h"
+#else
+# include "mutex.h"
+#endif
+
+void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state,
+			    unsigned long *flags)
+{
+	struct mutex *lock = waiter->lock;
+	struct task_struct *task = waiter->task;
+	struct task_struct *owner = lock->owner;
+	struct rq *rq;
+	int spin = 0;
+
+	if (likely(sched_feat(MUTEX_SPIN) && owner)) {
+		rq = task_rq(owner);
+		spin = (rq->curr == owner);
+	}
+
+	if (!spin) {
+		schedstat_inc(this_rq(), mtx_sched);
+		__set_task_state(task, state);
+		spin_unlock_mutex(&lock->wait_lock, *flags);
+		schedule();
+		spin_lock_mutex(&lock->wait_lock, *flags);
+		return;
+	}
+
+	schedstat_inc(this_rq(), mtx_spin);
+	spin_unlock_mutex(&lock->wait_lock, *flags);
+	for (;;) {
+		struct task_struct *l_owner;
+
+		/* Stop spinning when there's a pending signal. */
+		if (signal_pending_state(state, task))
+			break;
+
+		/* Mutex got unlocked, try to acquire. */
+		if (!mutex_is_locked(lock))
+			break;
+
+		/*
+		 * Owner changed, bail to re-assess state.
+		 *
+		 * We ignore !owner because that would break us out of
+		 * the spin too early -- see mutex_unlock() -- and make
+		 * us schedule -- see the !owner case on top -- at the
+		 * worst possible moment.
+		 */
+		l_owner = ACCESS_ONCE(lock->owner);
+		if (l_owner && l_owner != owner)
+			break;
+
+		/* Owner stopped running, bail to re-assess state. */
+		if (rq->curr != owner)
+			break;
+
+		/*
+		 * cpu_relax() provides a compiler barrier that ensures we
+		 * reload everything every time. SMP barriers are not strictly
+		 * required as the worst case is we'll spin a bit more before
+		 * we observe the right values.
+		 */
+		cpu_relax();
+	}
+	spin_lock_mutex(&lock->wait_lock, *flags);
+}
+
 /*
  * schedule() is the main scheduler function.
  */
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 4293cfa..3dec83a 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -288,6 +288,8 @@ static void print_cpu(struct seq_file *m, int cpu)

 	P(bkl_count);
+	P(mtx_spin);
+	P(mtx_sched);
 #undef P
 #endif
 	print_cfs_stats(m, cpu);
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index da5d93b..f548627 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -13,3 +13,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
 SCHED_FEAT(ASYM_EFF_LOAD, 1)
 SCHED_FEAT(WAKEUP_OVERLAP, 0)
 SCHED_FEAT(LAST_BUDDY, 1)
+SCHED_FEAT(MUTEX_SPIN, 1)
* Chris Mason <chris.mason@oracle.com> wrote:

> On Tue, 2009-01-06 at 01:47 +1100, Nick Piggin wrote:
> > [ adaptive locking in btrfs ]
> >
> > Adaptive locks have traditionally (read: Linus says) indicated the locking
> > is suboptimal from a performance perspective and should be reworked. This
> > is definitely the case for the -rt patchset, because it deliberately
> > trades performance by changing even very short-held spinlocks to
> > sleeping locks.
> >
> > So I don't really know if -rt justifies adaptive locks in mainline/btrfs.
> > Is there no way for the short critical sections to be decoupled from the
> > long/sleeping ones?
>
> Yes and no. The locks are used here to control access to the btree
> leaves and nodes. Some of these are very hot and tend to stay in cache
> all the time, while others have to be read from the disk.
>
> As the btree search walks down the tree, access to the hot nodes is best
> controlled by a spinlock. Some operations (like a balance) will need to
> read other blocks from the disk and keep the node/leaf locked. So it
> also needs to be able to sleep.
>
> I try to drop the locks where it makes sense before sleeping operations,
> but in some corner cases it isn't practical.
>
> For leaves, once the code has found the item in the btree it was looking
> for, it wants to go off and do something useful (insert an inode, etc.).
> Those operations also tend to block, and the lock needs to be held
> to keep the tree block from changing.
>
> All of this is a long way of saying the btrfs locking scheme is far from
> perfect. I'll look harder at the loop and ways to get rid of it.

<ob'plug>

Adaptive spinning mutexes, perhaps? Such as: http://lkml.org/lkml/2009/1/7/119 (also pullable via the URI below). If you have a btrfs performance test where you know such details matter, you might want to try Peter's patch and send us the test results.
	Ingo

------------->

You can pull the latest core/locking git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core/locking

------------------>

Peter Zijlstra (1):
      mutex: implement adaptive spinning

 include/linux/mutex.h   |    4 +-
 include/linux/sched.h   |    2 +
 kernel/mutex-debug.c    |   10 +------
 kernel/mutex-debug.h    |    8 -----
 kernel/mutex.c          |   46 ++++++++++++++++++++++--------
 kernel/mutex.h          |    2 -
 kernel/sched.c          |   73 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_debug.c    |    2 +
 kernel/sched_features.h |    1 +
 9 files changed, 115 insertions(+), 33 deletions(-)
On Wed, Jan 07, 2009 at 02:07:42PM +0100, Ingo Molnar wrote:
> * Chris Mason <chris.mason@oracle.com> wrote:
> > All of this is a long way of saying the btrfs locking scheme is far from
> > perfect. I'll look harder at the loop and ways to get rid of it.
>
> <ob'plug>
>
> Adaptive spinning mutexes, perhaps? Such as:

Um, I don't know how your mail client does threading, but mine shows Peter's message introducing the adaptive spinning mutexes as a reply to one of Chris' messages in the btrfs thread. Chris is just saying he'll look at other ways to not need the spinning mutexes.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
On Wed, Jan 07, 2009 at 03:34:47PM +0800, Lai Jiangshan wrote:
> So I think the current task should sleep earlier, when it detects that
> the mutex owner has started schedule().

How do you propose it detects this?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
From: Frédéric Weisbecker
Date: 2009-Jan-07 14:50 UTC
Subject: Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
2009/1/7 Peter Zijlstra <peterz@infradead.org>:> Change mutex contention behaviour such that it will sometimes busy wait on > acquisition - moving its behaviour closer to that of spinlocks. > > This concept got ported to mainline from the -rt tree, where it was originally > implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins. > > Testing with Ingo''s test-mutex application (http://lkml.org/lkml/2006/1/8/50) > gave a 8% boost for VFS scalability on my testbox: > > # echo MUTEX_SPIN > /debug/sched_features > # ./test-mutex V 16 10 > 2 CPUs, running 16 parallel test-tasks. > checking VFS performance. > > avg ops/sec: 74910 > > # echo NO_MUTEX_SPIN > /debug/sched_features > # ./test-mutex V 16 10 > 2 CPUs, running 16 parallel test-tasks. > checking VFS performance. > > avg ops/sec: 68804 > > The key criteria for the busy wait is that the lock owner has to be running on > a (different) cpu. The idea is that as long as the owner is running, there is a > fair chance it''ll release the lock soon, and thus we''ll be better off spinning > instead of blocking/scheduling. > > Since regular mutexes (as opposed to rtmutexes) do not atomically track the > owner, we add the owner in a non-atomic fashion and deal with the races in > the slowpath. 
> > Furthermore, to ease the testing of the performance impact of this new code, > there is means to disable this behaviour runtime (without having to reboot > the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y), > by issuing the following command: > > # echo NO_MUTEX_SPIN > /debug/sched_features > > This command re-enables spinning again (this is also the default): > > # echo MUTEX_SPIN > /debug/sched_features > > There''s also a few new statistic fields in /proc/sched_debug > (available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y): > > # grep mtx /proc/sched_debug > .mtx_spin : 2387 > .mtx_sched : 2283 > .mtx_spin : 1277 > .mtx_sched : 1700 > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > Reviewed-and-signed-off-by: Ingo Molnar <mingo@elte.hu> > --- > include/linux/mutex.h | 4 +- > include/linux/sched.h | 2 + > kernel/mutex-debug.c | 10 +------ > kernel/mutex-debug.h | 8 ----- > kernel/mutex.c | 46 ++++++++++++++++++++++-------- > kernel/mutex.h | 2 - > kernel/sched.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++ > kernel/sched_debug.c | 2 + > kernel/sched_features.h | 1 + > 9 files changed, 115 insertions(+), 33 deletions(-) > > diff --git a/include/linux/mutex.h b/include/linux/mutex.h > index 7a0e5c4..c007b4e 100644 > --- a/include/linux/mutex.h > +++ b/include/linux/mutex.h > @@ -50,8 +50,8 @@ struct mutex { > atomic_t count; > spinlock_t wait_lock; > struct list_head wait_list; > + struct task_struct *owner; > #ifdef CONFIG_DEBUG_MUTEXES > - struct thread_info *owner; > const char *name; > void *magic; > #endif > @@ -67,8 +67,8 @@ struct mutex { > struct mutex_waiter { > struct list_head list; > struct task_struct *task; > -#ifdef CONFIG_DEBUG_MUTEXES > struct mutex *lock; > +#ifdef CONFIG_DEBUG_MUTEXES > void *magic; > #endif > }; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 4cae9b8..d8fa96b 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -328,6 +328,8 @@ extern 
signed long schedule_timeout(signed long timeout); > extern signed long schedule_timeout_interruptible(signed long timeout); > extern signed long schedule_timeout_killable(signed long timeout); > extern signed long schedule_timeout_uninterruptible(signed long timeout); > +extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, > + unsigned long *flags); > asmlinkage void schedule(void); > > struct nsproxy; > diff --git a/kernel/mutex-debug.c b/kernel/mutex-debug.c > index 1d94160..0564680 100644 > --- a/kernel/mutex-debug.c > +++ b/kernel/mutex-debug.c > @@ -26,11 +26,6 @@ > /* > * Must be called with lock->wait_lock held. > */ > -void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner) > -{ > - lock->owner = new_owner; > -} > - > void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter) > { > memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter)); > @@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter, > > /* Mark the current thread as blocked on the lock: */ > ti->task->blocked_on = waiter; > - waiter->lock = lock; > } > > void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter, > @@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lock) > return; > > DEBUG_LOCKS_WARN_ON(lock->magic != lock); > - DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); > + /* DEBUG_LOCKS_WARN_ON(lock->owner != current); */ > DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next); > - DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); > } > > void debug_mutex_init(struct mutex *lock, const char *name, > @@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock, const char *name, > debug_check_no_locks_freed((void *)lock, sizeof(*lock)); > lockdep_init_map(&lock->dep_map, name, key, 0); > #endif > - lock->owner = NULL; > lock->magic = lock; > } > > diff --git a/kernel/mutex-debug.h b/kernel/mutex-debug.h > index babfbdf..42eab06 100644 > --- 
a/kernel/mutex-debug.h > +++ b/kernel/mutex-debug.h > @@ -13,14 +13,6 @@ > /* > * This must be called with lock->wait_lock held. > */ > -extern void > -debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner); > - > -static inline void debug_mutex_clear_owner(struct mutex *lock) > -{ > - lock->owner = NULL; > -} > - > extern void debug_mutex_lock_common(struct mutex *lock, > struct mutex_waiter *waiter); > extern void debug_mutex_wake_waiter(struct mutex *lock, > diff --git a/kernel/mutex.c b/kernel/mutex.c > index 4f45d4b..089b46b 100644 > --- a/kernel/mutex.c > +++ b/kernel/mutex.c > @@ -10,6 +10,10 @@ > * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and > * David Howells for suggestions and improvements. > * > + * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline > + * from the -rt tree, where it was originally implemented for rtmutexes > + * by Steven Rostedt, based on work by Gregory Haskins.) > + * > * Also see Documentation/mutex-design.txt. > */ > #include <linux/mutex.h> > @@ -46,6 +50,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key) > atomic_set(&lock->count, 1); > spin_lock_init(&lock->wait_lock); > INIT_LIST_HEAD(&lock->wait_list); > + lock->owner = NULL; > > debug_mutex_init(lock, name, key); > } > @@ -91,6 +96,7 @@ void inline __sched mutex_lock(struct mutex *lock) > * ''unlocked'' into ''locked'' state. 
> */ > __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath); > + lock->owner = current; > } > > EXPORT_SYMBOL(mutex_lock); > @@ -115,6 +121,7 @@ void __sched mutex_unlock(struct mutex *lock) > * The unlocking fastpath is the 0->1 transition from ''locked'' > * into ''unlocked'' state: > */ > + lock->owner = NULL; > __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath); > } > > @@ -141,6 +148,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, > /* add waiting tasks to the end of the waitqueue (FIFO): */ > list_add_tail(&waiter.list, &lock->wait_list); > waiter.task = task; > + waiter.lock = lock; > > old_val = atomic_xchg(&lock->count, -1); > if (old_val == 1) > @@ -175,19 +183,15 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, > debug_mutex_free_waiter(&waiter); > return -EINTR; > } > - __set_task_state(task, state); > > - /* didnt get the lock, go to sleep: */ > - spin_unlock_mutex(&lock->wait_lock, flags); > - schedule(); > - spin_lock_mutex(&lock->wait_lock, flags); > + mutex_spin_or_schedule(&waiter, state, &flags); > } > > done: > lock_acquired(&lock->dep_map, ip); > /* got the lock - rejoice! 
*/ > mutex_remove_waiter(lock, &waiter, task_thread_info(task)); > - debug_mutex_set_owner(lock, task_thread_info(task)); > + lock->owner = task; > > /* set it to 0 if there are no waiters left: */ > if (likely(list_empty(&lock->wait_list))) > @@ -260,7 +264,7 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested) > wake_up_process(waiter->task); > } > > - debug_mutex_clear_owner(lock); > + lock->owner = NULL; > > spin_unlock_mutex(&lock->wait_lock, flags); > } > @@ -298,18 +302,30 @@ __mutex_lock_interruptible_slowpath(atomic_t *lock_count); > */ > int __sched mutex_lock_interruptible(struct mutex *lock) > { > + int ret; > + > might_sleep(); > - return __mutex_fastpath_lock_retval > + ret = __mutex_fastpath_lock_retval > (&lock->count, __mutex_lock_interruptible_slowpath); > + if (!ret) > + lock->owner = current; > + > + return ret; > } > > EXPORT_SYMBOL(mutex_lock_interruptible); > > int __sched mutex_lock_killable(struct mutex *lock) > { > + int ret; > + > might_sleep(); > - return __mutex_fastpath_lock_retval > + ret = __mutex_fastpath_lock_retval > (&lock->count, __mutex_lock_killable_slowpath); > + if (!ret) > + lock->owner = current; > + > + return ret; > } > EXPORT_SYMBOL(mutex_lock_killable); > > @@ -352,9 +368,10 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count) > > prev = atomic_xchg(&lock->count, -1); > if (likely(prev == 1)) { > - debug_mutex_set_owner(lock, current_thread_info()); > + lock->owner = current; > mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); > } > + > /* Set it back to 0 if there are no waiters: */ > if (likely(list_empty(&lock->wait_list))) > atomic_set(&lock->count, 0); > @@ -380,8 +397,13 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count) > */ > int __sched mutex_trylock(struct mutex *lock) > { > - return __mutex_fastpath_trylock(&lock->count, > - __mutex_trylock_slowpath); > + int ret; > + > + ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath); > + if (ret) > + 
lock->owner = current; > + > + return ret; > } > > EXPORT_SYMBOL(mutex_trylock); > diff --git a/kernel/mutex.h b/kernel/mutex.h > index a075daf..55e1986 100644 > --- a/kernel/mutex.h > +++ b/kernel/mutex.h > @@ -16,8 +16,6 @@ > #define mutex_remove_waiter(lock, waiter, ti) \ > __list_del((waiter)->list.prev, (waiter)->list.next) > > -#define debug_mutex_set_owner(lock, new_owner) do { } while (0) > -#define debug_mutex_clear_owner(lock) do { } while (0) > #define debug_mutex_wake_waiter(lock, waiter) do { } while (0) > #define debug_mutex_free_waiter(waiter) do { } while (0) > #define debug_mutex_add_waiter(lock, waiter, ti) do { } while (0) > diff --git a/kernel/sched.c b/kernel/sched.c > index 2e3545f..c189597 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -631,6 +631,10 @@ struct rq { > > /* BKL stats */ > unsigned int bkl_count; > + > + /* mutex spin stats */ > + unsigned int mtx_spin; > + unsigned int mtx_sched; > #endif > }; > > @@ -4527,6 +4531,75 @@ pick_next_task(struct rq *rq, struct task_struct *prev) > } > } > > +#ifdef CONFIG_DEBUG_MUTEXES > +# include "mutex-debug.h" > +#else > +# include "mutex.h" > +#endif > + > +void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, > + unsigned long *flags) > +{ > + struct mutex *lock = waiter->lock; > + struct task_struct *task = waiter->task; > + struct task_struct *owner = lock->owner; > + struct rq *rq; > + int spin = 0; > + > + if (likely(sched_feat(MUTEX_SPIN) && owner)) { > + rq = task_rq(owner); > + spin = (rq->curr == owner); > + } > + > + if (!spin) { > + schedstat_inc(this_rq(), mtx_sched); > + __set_task_state(task, state); > + spin_unlock_mutex(&lock->wait_lock, *flags); > + schedule(); > + spin_lock_mutex(&lock->wait_lock, *flags); > + return; > + } > + > + schedstat_inc(this_rq(), mtx_spin); > + spin_unlock_mutex(&lock->wait_lock, *flags); > + for (;;) { > + struct task_struct *l_owner; > + > + /* Stop spinning when there''s a pending signal. 
 */
> +		if (signal_pending_state(state, task))
> +			break;
> +
> +		/* Mutex got unlocked, try to acquire. */
> +		if (!mutex_is_locked(lock))
> +			break;
> +
> +		/*
> +		 * Owner changed, bail to re-assess state.
> +		 *
> +		 * We ignore !owner because that would break us out of
> +		 * the spin too early -- see mutex_unlock() -- and make
> +		 * us schedule -- see the !owner case on top -- at the
> +		 * worst possible moment.
> +		 */
> +		l_owner = ACCESS_ONCE(lock->owner);
> +		if (l_owner && l_owner != owner)
> +			break;
> +
> +		/* Owner stopped running, bail to re-assess state. */
> +		if (rq->curr != owner)
> +			break;
> +
> +		/*
> +		 * cpu_relax() provides a compiler barrier that ensures we
> +		 * reload everything every time. SMP barriers are not strictly
> +		 * required as the worst case is we'll spin a bit more before
> +		 * we observe the right values.
> +		 */
> +		cpu_relax();
> +	}
> +	spin_lock_mutex(&lock->wait_lock, *flags);
> +}

Hi Peter,

Sorry I haven't read all the previous talk about the older version.
But it is possible that, in hopefully rare cases, you enter
mutex_spin_or_schedule multiple times, and try to spin for the same lock
each of these times.

For each of the above breaks:

_ if you exit the spin because the mutex is unlocked, and someone else
  grabs it before you,
_ or simply the owner changed...

then you will enter mutex_spin_or_schedule again, you have some chances
that rq->curr == the new owner, and then you will spin again. And this
situation can almost really make you behave like a spinlock...

Shouldn't it actually try only one time to spin, and if it calls
mutex_spin_or_schedule() again then it would be better to schedule()?
Or did I misunderstand something...?

Thanks.

> +
>  /*
>   * schedule() is the main scheduler function.
> */ > diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c > index 4293cfa..3dec83a 100644 > --- a/kernel/sched_debug.c > +++ b/kernel/sched_debug.c > @@ -288,6 +288,8 @@ static void print_cpu(struct seq_file *m, int cpu) > > P(bkl_count); > > + P(mtx_spin); > + P(mtx_sched); > #undef P > #endif > print_cfs_stats(m, cpu); > diff --git a/kernel/sched_features.h b/kernel/sched_features.h > index da5d93b..f548627 100644 > --- a/kernel/sched_features.h > +++ b/kernel/sched_features.h > @@ -13,3 +13,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1) > SCHED_FEAT(ASYM_EFF_LOAD, 1) > SCHED_FEAT(WAKEUP_OVERLAP, 0) > SCHED_FEAT(LAST_BUDDY, 1) > +SCHED_FEAT(MUTEX_SPIN, 1) > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
* Matthew Wilcox <matthew@wil.cx> wrote:

> On Wed, Jan 07, 2009 at 02:07:42PM +0100, Ingo Molnar wrote:
> > * Chris Mason <chris.mason@oracle.com> wrote:
> > > All of this is a long way of saying the btrfs locking scheme is far from
> > > perfect. I'll look harder at the loop and ways to get rid of it.
> >
> > <ob'plug>
> >
> > adaptive spinning mutexes perhaps? Such as:
>
> Um, I don't know how your mail client does threading, but mine shows
> Peter's message introducing the adaptive spinning mutexes as a reply to
> one of Chris' messages in the btrfs thread.
>
> Chris is just saying he'll look at other ways to not need the spinning
> mutexes.

But those are not the same spinning mutexes. Chris wrote his mail on Jan
05, Peter his first mail about spin-mutexes on Jan 06, as a reaction to
Chris's mail. My reply links the discussion back to the original
analysis from Chris, pointing out that it would be nice to try btrfs
with plain mutexes plus Peter's patch - instead of throwing away btrfs's
locking design or anything intrusive like that.

Where's the problem? :)

	Ingo

-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter Zijlstra
2009-Jan-07 14:58 UTC
Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 15:50 +0100, Frédéric Weisbecker wrote:> 2009/1/7 Peter Zijlstra <peterz@infradead.org>: > > Change mutex contention behaviour such that it will sometimes busy wait on > > acquisition - moving its behaviour closer to that of spinlocks. > > > > This concept got ported to mainline from the -rt tree, where it was originally > > implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins. > > > > Testing with Ingo''s test-mutex application (http://lkml.org/lkml/2006/1/8/50) > > gave a 8% boost for VFS scalability on my testbox: > > > > # echo MUTEX_SPIN > /debug/sched_features > > # ./test-mutex V 16 10 > > 2 CPUs, running 16 parallel test-tasks. > > checking VFS performance. > > > > avg ops/sec: 74910 > > > > # echo NO_MUTEX_SPIN > /debug/sched_features > > # ./test-mutex V 16 10 > > 2 CPUs, running 16 parallel test-tasks. > > checking VFS performance. > > > > avg ops/sec: 68804 > > > > The key criteria for the busy wait is that the lock owner has to be running on > > a (different) cpu. The idea is that as long as the owner is running, there is a > > fair chance it''ll release the lock soon, and thus we''ll be better off spinning > > instead of blocking/scheduling. > > > > Since regular mutexes (as opposed to rtmutexes) do not atomically track the > > owner, we add the owner in a non-atomic fashion and deal with the races in > > the slowpath. 
> >
> > Furthermore, to ease the testing of the performance impact of this new code,
> > there is means to disable this behaviour runtime (without having to reboot
> > the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
> > by issuing the following command:
> >
> > # echo NO_MUTEX_SPIN > /debug/sched_features
> >
> > This command re-enables spinning again (this is also the default):
> >
> > # echo MUTEX_SPIN > /debug/sched_features
> >
> > There's also a few new statistic fields in /proc/sched_debug
> > (available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):
> >
> > # grep mtx /proc/sched_debug
> > .mtx_spin  : 2387
> > .mtx_sched : 2283
> > .mtx_spin  : 1277
> > .mtx_sched : 1700
> >
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Reviewed-and-signed-off-by: Ingo Molnar <mingo@elte.hu>
> > ---

> Sorry I haven't read all the previous talk about the older version.
> But it is possible that, in hopefully rare cases, you enter
> mutex_spin_or_schedule multiple times, and try to spin for the same
> lock each of these times.
>
> For each of the above breaks:
>
> _ if you exit the spin because the mutex is unlocked, and someone else
>   grabs it before you,
> _ or simply the owner changed...
>
> then you will enter mutex_spin_or_schedule again, you have some chances
> that rq->curr == the new owner, and then you will spin again. And this
> situation can almost really make you behave like a spinlock...

You understand correctly, that is indeed possible.

> Shouldn't it actually try only one time to spin, and if it calls
> mutex_spin_or_schedule() again then it would be better to schedule()?

I don't know, maybe code it up and find a benchmark where it makes a
difference. :-)
Steven Rostedt
2009-Jan-07 15:22 UTC
Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
Peter, nice work! On Wed, 7 Jan 2009, Peter Zijlstra wrote:> Change mutex contention behaviour such that it will sometimes busy wait on > acquisition - moving its behaviour closer to that of spinlocks. > > This concept got ported to mainline from the -rt tree, where it was originally > implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins. > > Testing with Ingo''s test-mutex application (http://lkml.org/lkml/2006/1/8/50) > gave a 8% boost for VFS scalability on my testbox: > > # echo MUTEX_SPIN > /debug/sched_features > # ./test-mutex V 16 10 > 2 CPUs, running 16 parallel test-tasks. > checking VFS performance. > > avg ops/sec: 74910 > > # echo NO_MUTEX_SPIN > /debug/sched_features > # ./test-mutex V 16 10 > 2 CPUs, running 16 parallel test-tasks. > checking VFS performance. > > avg ops/sec: 68804 > > The key criteria for the busy wait is that the lock owner has to be running on > a (different) cpu. The idea is that as long as the owner is running, there is a > fair chance it''ll release the lock soon, and thus we''ll be better off spinning > instead of blocking/scheduling. > > Since regular mutexes (as opposed to rtmutexes) do not atomically track the > owner, we add the owner in a non-atomic fashion and deal with the races in > the slowpath. 
> > Furthermore, to ease the testing of the performance impact of this new code, > there is means to disable this behaviour runtime (without having to reboot > the system), when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y), > by issuing the following command: > > # echo NO_MUTEX_SPIN > /debug/sched_features > > This command re-enables spinning again (this is also the default): > > # echo MUTEX_SPIN > /debug/sched_features > > There''s also a few new statistic fields in /proc/sched_debug > (available if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y): > > # grep mtx /proc/sched_debug > .mtx_spin : 2387 > .mtx_sched : 2283 > .mtx_spin : 1277 > .mtx_sched : 1700 > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > Reviewed-and-signed-off-by: Ingo Molnar <mingo@elte.hu> > --- > include/linux/mutex.h | 4 +- > include/linux/sched.h | 2 + > kernel/mutex-debug.c | 10 +------ > kernel/mutex-debug.h | 8 ----- > kernel/mutex.c | 46 ++++++++++++++++++++++-------- > kernel/mutex.h | 2 - > kernel/sched.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++ > kernel/sched_debug.c | 2 + > kernel/sched_features.h | 1 + > 9 files changed, 115 insertions(+), 33 deletions(-) > > diff --git a/include/linux/mutex.h b/include/linux/mutex.h > index 7a0e5c4..c007b4e 100644 > --- a/include/linux/mutex.h > +++ b/include/linux/mutex.h > @@ -50,8 +50,8 @@ struct mutex { > atomic_t count; > spinlock_t wait_lock; > struct list_head wait_list; > + struct task_struct *owner; > #ifdef CONFIG_DEBUG_MUTEXES > - struct thread_info *owner; > const char *name; > void *magic; > #endif > @@ -67,8 +67,8 @@ struct mutex { > struct mutex_waiter { > struct list_head list; > struct task_struct *task; > -#ifdef CONFIG_DEBUG_MUTEXES > struct mutex *lock; > +#ifdef CONFIG_DEBUG_MUTEXES > void *magic; > #endif > }; > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 4cae9b8..d8fa96b 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -328,6 +328,8 @@ extern 
signed long schedule_timeout(signed long timeout); > extern signed long schedule_timeout_interruptible(signed long timeout); > extern signed long schedule_timeout_killable(signed long timeout); > extern signed long schedule_timeout_uninterruptible(signed long timeout); > +extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, > + unsigned long *flags); > asmlinkage void schedule(void); > > struct nsproxy; > diff --git a/kernel/mutex-debug.c b/kernel/mutex-debug.c > index 1d94160..0564680 100644 > --- a/kernel/mutex-debug.c > +++ b/kernel/mutex-debug.c > @@ -26,11 +26,6 @@ > /* > * Must be called with lock->wait_lock held. > */ > -void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner) > -{ > - lock->owner = new_owner; > -} > - > void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter) > { > memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter)); > @@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter, > > /* Mark the current thread as blocked on the lock: */ > ti->task->blocked_on = waiter; > - waiter->lock = lock; > } > > void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter, > @@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lock) > return; > > DEBUG_LOCKS_WARN_ON(lock->magic != lock); > - DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); > + /* DEBUG_LOCKS_WARN_ON(lock->owner != current); */ > DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next); > - DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); > } > > void debug_mutex_init(struct mutex *lock, const char *name, > @@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock, const char *name, > debug_check_no_locks_freed((void *)lock, sizeof(*lock)); > lockdep_init_map(&lock->dep_map, name, key, 0); > #endif > - lock->owner = NULL; > lock->magic = lock; > } > > diff --git a/kernel/mutex-debug.h b/kernel/mutex-debug.h > index babfbdf..42eab06 100644 > --- 
a/kernel/mutex-debug.h > +++ b/kernel/mutex-debug.h > @@ -13,14 +13,6 @@ > /* > * This must be called with lock->wait_lock held. > */ > -extern void > -debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner); > - > -static inline void debug_mutex_clear_owner(struct mutex *lock) > -{ > - lock->owner = NULL; > -} > - > extern void debug_mutex_lock_common(struct mutex *lock, > struct mutex_waiter *waiter); > extern void debug_mutex_wake_waiter(struct mutex *lock, > diff --git a/kernel/mutex.c b/kernel/mutex.c > index 4f45d4b..089b46b 100644 > --- a/kernel/mutex.c > +++ b/kernel/mutex.c > @@ -10,6 +10,10 @@ > * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and > * David Howells for suggestions and improvements. > * > + * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline > + * from the -rt tree, where it was originally implemented for rtmutexes > + * by Steven Rostedt, based on work by Gregory Haskins.)I feel guilty with my name being the only one there for the rtmutexes. Thomas Gleixner, Ingo Molnar and Esben Nielsen also played large roles in that code. And Peter Morreale and Sven Dietrich might also be mentioned next to Gregory''s name.> + * > * Also see Documentation/mutex-design.txt. > */ > #include <linux/mutex.h> > @@ -46,6 +50,7 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key) > atomic_set(&lock->count, 1); > spin_lock_init(&lock->wait_lock); > INIT_LIST_HEAD(&lock->wait_list); > + lock->owner = NULL; > > debug_mutex_init(lock, name, key); > } > @@ -91,6 +96,7 @@ void inline __sched mutex_lock(struct mutex *lock) > * ''unlocked'' into ''locked'' state. 
> */ > __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath); > + lock->owner = current; > } > > EXPORT_SYMBOL(mutex_lock); > @@ -115,6 +121,7 @@ void __sched mutex_unlock(struct mutex *lock) > * The unlocking fastpath is the 0->1 transition from ''locked'' > * into ''unlocked'' state: > */ > + lock->owner = NULL; > __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath); > } > > @@ -141,6 +148,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, > /* add waiting tasks to the end of the waitqueue (FIFO): */ > list_add_tail(&waiter.list, &lock->wait_list); > waiter.task = task; > + waiter.lock = lock; > > old_val = atomic_xchg(&lock->count, -1); > if (old_val == 1) > @@ -175,19 +183,15 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, > debug_mutex_free_waiter(&waiter); > return -EINTR; > } > - __set_task_state(task, state); > > - /* didnt get the lock, go to sleep: */ > - spin_unlock_mutex(&lock->wait_lock, flags); > - schedule(); > - spin_lock_mutex(&lock->wait_lock, flags); > + mutex_spin_or_schedule(&waiter, state, &flags); > } > > done: > lock_acquired(&lock->dep_map, ip); > /* got the lock - rejoice! 
*/ > mutex_remove_waiter(lock, &waiter, task_thread_info(task)); > - debug_mutex_set_owner(lock, task_thread_info(task)); > + lock->owner = task; > > /* set it to 0 if there are no waiters left: */ > if (likely(list_empty(&lock->wait_list))) > @@ -260,7 +264,7 @@ __mutex_unlock_common_slowpath(atomic_t *lock_count, int nested) > wake_up_process(waiter->task); > } > > - debug_mutex_clear_owner(lock); > + lock->owner = NULL; > > spin_unlock_mutex(&lock->wait_lock, flags); > } > @@ -298,18 +302,30 @@ __mutex_lock_interruptible_slowpath(atomic_t *lock_count); > */ > int __sched mutex_lock_interruptible(struct mutex *lock) > { > + int ret; > + > might_sleep(); > - return __mutex_fastpath_lock_retval > + ret = __mutex_fastpath_lock_retval > (&lock->count, __mutex_lock_interruptible_slowpath); > + if (!ret) > + lock->owner = current; > + > + return ret; > } > > EXPORT_SYMBOL(mutex_lock_interruptible); > > int __sched mutex_lock_killable(struct mutex *lock) > { > + int ret; > + > might_sleep(); > - return __mutex_fastpath_lock_retval > + ret = __mutex_fastpath_lock_retval > (&lock->count, __mutex_lock_killable_slowpath); > + if (!ret) > + lock->owner = current; > + > + return ret; > } > EXPORT_SYMBOL(mutex_lock_killable); > > @@ -352,9 +368,10 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count) > > prev = atomic_xchg(&lock->count, -1); > if (likely(prev == 1)) { > - debug_mutex_set_owner(lock, current_thread_info()); > + lock->owner = current; > mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); > } > + > /* Set it back to 0 if there are no waiters: */ > if (likely(list_empty(&lock->wait_list))) > atomic_set(&lock->count, 0); > @@ -380,8 +397,13 @@ static inline int __mutex_trylock_slowpath(atomic_t *lock_count) > */ > int __sched mutex_trylock(struct mutex *lock) > { > - return __mutex_fastpath_trylock(&lock->count, > - __mutex_trylock_slowpath); > + int ret; > + > + ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath); > + if (ret) > + 
lock->owner = current; > + > + return ret; > } > > EXPORT_SYMBOL(mutex_trylock); > diff --git a/kernel/mutex.h b/kernel/mutex.h > index a075daf..55e1986 100644 > --- a/kernel/mutex.h > +++ b/kernel/mutex.h > @@ -16,8 +16,6 @@ > #define mutex_remove_waiter(lock, waiter, ti) \ > __list_del((waiter)->list.prev, (waiter)->list.next) > > -#define debug_mutex_set_owner(lock, new_owner) do { } while (0) > -#define debug_mutex_clear_owner(lock) do { } while (0) > #define debug_mutex_wake_waiter(lock, waiter) do { } while (0) > #define debug_mutex_free_waiter(waiter) do { } while (0) > #define debug_mutex_add_waiter(lock, waiter, ti) do { } while (0) > diff --git a/kernel/sched.c b/kernel/sched.c > index 2e3545f..c189597 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -631,6 +631,10 @@ struct rq { > > /* BKL stats */ > unsigned int bkl_count; > + > + /* mutex spin stats */ > + unsigned int mtx_spin; > + unsigned int mtx_sched; > #endif > }; > > @@ -4527,6 +4531,75 @@ pick_next_task(struct rq *rq, struct task_struct *prev) > } > } > > +#ifdef CONFIG_DEBUG_MUTEXES > +# include "mutex-debug.h" > +#else > +# include "mutex.h" > +#endif > + > +void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, > + unsigned long *flags) > +{ > + struct mutex *lock = waiter->lock; > + struct task_struct *task = waiter->task; > + struct task_struct *owner = lock->owner; > + struct rq *rq; > + int spin = 0; > + > + if (likely(sched_feat(MUTEX_SPIN) && owner)) { > + rq = task_rq(owner); > + spin = (rq->curr == owner); > + } > + > + if (!spin) { > + schedstat_inc(this_rq(), mtx_sched); > + __set_task_state(task, state);I still do not know why you set state here instead of in the mutex code. Yes, you prevent changing the state if we do not schedule, but there''s nothing wrong with setting it before hand. We may even be able to cache the owner and keep the locking of the wait_lock out of here. But then I see that it may be used to protect the sched_stat counters. 
-- Steve> + spin_unlock_mutex(&lock->wait_lock, *flags); > + schedule(); > + spin_lock_mutex(&lock->wait_lock, *flags); > + return; > + } > + > + schedstat_inc(this_rq(), mtx_spin); > + spin_unlock_mutex(&lock->wait_lock, *flags); > + for (;;) { > + struct task_struct *l_owner; > + > + /* Stop spinning when there''s a pending signal. */ > + if (signal_pending_state(state, task)) > + break; > + > + /* Mutex got unlocked, try to acquire. */ > + if (!mutex_is_locked(lock)) > + break; > + > + /* > + * Owner changed, bail to re-assess state. > + * > + * We ignore !owner because that would break us out of > + * the spin too early -- see mutex_unlock() -- and make > + * us schedule -- see the !owner case on top -- at the > + * worst possible moment. > + */ > + l_owner = ACCESS_ONCE(lock->owner); > + if (l_owner && l_owner != owner) > + break; > + > + /* Owner stopped running, bail to re-assess state. */ > + if (rq->curr != owner) > + break; > + > + /* > + * cpu_relax() provides a compiler barrier that ensures we > + * reload everything every time. SMP barriers are not strictly > + * required as the worst case is we''ll spin a bit more before > + * we observe the right values. > + */ > + cpu_relax(); > + } > + spin_lock_mutex(&lock->wait_lock, *flags); > +} > + > /* > * schedule() is the main scheduler function. 
> */ > diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c > index 4293cfa..3dec83a 100644 > --- a/kernel/sched_debug.c > +++ b/kernel/sched_debug.c > @@ -288,6 +288,8 @@ static void print_cpu(struct seq_file *m, int cpu) > > P(bkl_count); > > + P(mtx_spin); > + P(mtx_sched); > #undef P > #endif > print_cfs_stats(m, cpu); > diff --git a/kernel/sched_features.h b/kernel/sched_features.h > index da5d93b..f548627 100644 > --- a/kernel/sched_features.h > +++ b/kernel/sched_features.h > @@ -13,3 +13,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1) > SCHED_FEAT(ASYM_EFF_LOAD, 1) > SCHED_FEAT(WAKEUP_OVERLAP, 0) > SCHED_FEAT(LAST_BUDDY, 1) > +SCHED_FEAT(MUTEX_SPIN, 1) > > >-- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Steven Rostedt
2009-Jan-07 15:29 UTC
Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Steven Rostedt wrote:

> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> > --- a/kernel/mutex.c
> > +++ b/kernel/mutex.c
> > @@ -10,6 +10,10 @@
> >   * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
> >   * David Howells for suggestions and improvements.
> >   *
> > + * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
> > + *   from the -rt tree, where it was originally implemented for rtmutexes
> > + *   by Steven Rostedt, based on work by Gregory Haskins.)
>
> I feel guilty with my name being the only one there for the rtmutexes.
> Thomas Gleixner, Ingo Molnar and Esben Nielsen also played large roles in
> that code.
>
> And Peter Morreale and Sven Dietrich might also be mentioned next to
> Gregory's name.

If you are not talking about rtmutexes in general, and are just
referencing the spinning part, then just group us all together: Gregory
Haskins, Steven Rostedt, Peter Morreale and Sven Dietrich.

Thanks,

-- Steve
Peter Zijlstra
2009-Jan-07 15:54 UTC
Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 10:22 -0500, Steven Rostedt wrote:> Peter, nice work!Thanks!> > + } > > + > > + if (!spin) { > > + schedstat_inc(this_rq(), mtx_sched); > > + __set_task_state(task, state); > > I still do not know why you set state here instead of in the mutex code. > Yes, you prevent changing the state if we do not schedule, but there''s > nothing wrong with setting it before hand. We may even be able to cache > the owner and keep the locking of the wait_lock out of here. But then I > see that it may be used to protect the sched_stat counters.I was about to say because we need task_rq(owner) and can only deref owner while holding that lock, but I found a way around it by using task_cpu() which is exported. Compile tested only so far... --- Index: linux-2.6/include/linux/sched.h ==================================================================--- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -329,8 +329,8 @@ extern signed long schedule_timeout(sign extern signed long schedule_timeout_interruptible(signed long timeout); extern signed long schedule_timeout_killable(signed long timeout); extern signed long schedule_timeout_uninterruptible(signed long timeout); -extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, - unsigned long *flags); +extern void mutex_spin_or_schedule(struct mutex_waiter *waiter, + struct task_struct *owner, int cpu); asmlinkage void schedule(void); struct nsproxy; Index: linux-2.6/kernel/mutex.c ==================================================================--- linux-2.6.orig/kernel/mutex.c +++ linux-2.6/kernel/mutex.c @@ -12,7 +12,8 @@ * * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline * from the -rt tree, where it was originally implemented for rtmutexes - * by Steven Rostedt, based on work by Gregory Haskins.) + * by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale + * and Sven Dietrich. * * Also see Documentation/mutex-design.txt. 
*/ @@ -157,6 +158,9 @@ __mutex_lock_common(struct mutex *lock, lock_contended(&lock->dep_map, ip); for (;;) { + int cpu = 0; + struct task_struct *l_owner; + /* * Lets try to take the lock again - this is needed even if * we get here for the first time (shortly after failing to @@ -184,8 +188,15 @@ __mutex_lock_common(struct mutex *lock, return -EINTR; } - mutex_spin_or_schedule(&waiter, state, &flags); + __set_task_state(task, state); + l_owner = ACCESS_ONCE(lock->owner); + if (l_owner) + cpu = task_cpu(l_owner); + spin_unlock_mutex(&lock->wait_lock, flags); + mutex_spin_or_schedule(&waiter, l_owner, cpu); + spin_lock_mutex(&lock->wait_lock, flags); } + __set_task_state(task, TASK_RUNNING); done: lock_acquired(&lock->dep_map, ip); Index: linux-2.6/kernel/sched.c ==================================================================--- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -4600,42 +4600,37 @@ pick_next_task(struct rq *rq, struct tas } } -#ifdef CONFIG_DEBUG_MUTEXES -# include "mutex-debug.h" -#else -# include "mutex.h" -#endif - -void mutex_spin_or_schedule(struct mutex_waiter *waiter, long state, - unsigned long *flags) +void mutex_spin_or_schedule(struct mutex_waiter *waiter, + struct task_struct *owner, int cpu) { - struct mutex *lock = waiter->lock; struct task_struct *task = waiter->task; - struct task_struct *owner = lock->owner; + struct mutex *lock = waiter->lock; struct rq *rq; int spin = 0; if (likely(sched_feat(MUTEX_SPIN) && owner)) { - rq = task_rq(owner); + rq = cpu_rq(cpu); spin = (rq->curr == owner); } if (!spin) { + preempt_disable(); schedstat_inc(this_rq(), mtx_sched); - __set_task_state(task, state); - spin_unlock_mutex(&lock->wait_lock, *flags); + preempt_enable(); + schedule(); - spin_lock_mutex(&lock->wait_lock, *flags); return; } + preempt_disable(); schedstat_inc(this_rq(), mtx_spin); - spin_unlock_mutex(&lock->wait_lock, *flags); + preempt_enable(); + for (;;) { struct task_struct *l_owner; /* Stop spinning when 
there''s a pending signal. */ - if (signal_pending_state(state, task)) + if (signal_pending_state(task->state, task)) break; /* Mutex got unlocked, try to acquire. */ @@ -4666,7 +4661,6 @@ void mutex_spin_or_schedule(struct mutex */ cpu_relax(); } - spin_lock_mutex(&lock->wait_lock, *flags); } /* -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Torvalds
2009-Jan-07 16:25 UTC
Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> >
> > Change mutex contention behaviour such that it will sometimes busy wait on
> > acquisition - moving its behaviour closer to that of spinlocks.

Ok, this one looks _almost_ ok.

The only problem is that I think you've lost the UP case.

In UP, you shouldn't have the code to spin, and the "spin_or_schedule()"
should fall back to just the schedule case.

It might also be worthwhile to try to not set the owner, and re-organize
that a bit (by making it an inline function that sets the owner only for
CONFIG_SMP or lockdep/debug).

		Linus
On Wed, 2009-01-07 at 08:25 -0800, Linus Torvalds wrote:
> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> >
> > Change mutex contention behaviour such that it will sometimes busy wait on
> > acquisition - moving its behaviour closer to that of spinlocks.
>
> Ok, this one looks _almost_ ok.
>
> The only problem is that I think you've lost the UP case.
>
> In UP, you shouldn't have the code to spin, and the "spin_or_schedule()"
> should fall back to just the schedule case.
>
> It might also be worthwhile to try to not set the owner, and re-organize
> that a bit (by making it an inline function that sets the owner only for
> CONFIG_SMP or lockdep/debug).

As you wish ;-)

---
Subject: mutex: implement adaptive spinning
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue, 6 Jan 2009 12:32:12 +0100

Change mutex contention behaviour such that it will sometimes busy wait on
acquisition - moving its behaviour closer to that of spinlocks.

This concept got ported to mainline from the -rt tree, where it was
originally implemented for rtmutexes by Steven Rostedt, based on work by
Gregory Haskins.

Testing with Ingo's test-mutex application
(http://lkml.org/lkml/2006/1/8/50) gave an 8% boost for VFS scalability on
my testbox:

 # echo MUTEX_SPIN > /debug/sched_features
 # ./test-mutex V 16 10
 2 CPUs, running 16 parallel test-tasks.
 checking VFS performance.

 avg ops/sec:        74910

 # echo NO_MUTEX_SPIN > /debug/sched_features
 # ./test-mutex V 16 10
 2 CPUs, running 16 parallel test-tasks.
 checking VFS performance.

 avg ops/sec:        68804

The key criteria for the busy wait is that the lock owner has to be running
on a (different) cpu. The idea is that as long as the owner is running,
there is a fair chance it'll release the lock soon, and thus we'll be
better off spinning instead of blocking/scheduling.

Since regular mutexes (as opposed to rtmutexes) do not atomically track the
owner, we add the owner in a non-atomic fashion and deal with the races in
the slowpath.

Furthermore, to ease the testing of the performance impact of this new
code, there is a means to disable this behaviour at runtime (without having
to reboot the system), when scheduler debugging is enabled
(CONFIG_SCHED_DEBUG=y), by issuing the following command:

 # echo NO_MUTEX_SPIN > /debug/sched_features

This command re-enables spinning again (this is also the default):

 # echo MUTEX_SPIN > /debug/sched_features

There are also a few new statistic fields in /proc/sched_debug (available
if CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y):

 # grep mtx /proc/sched_debug
  .mtx_spin                      : 2387
  .mtx_sched                     : 2283
  .mtx_spin                      : 1277
  .mtx_sched                     : 1700

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-and-signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/mutex.h   |    6 ++--
 include/linux/sched.h   |    2 +
 kernel/mutex-debug.c    |   10 -------
 kernel/mutex-debug.h    |   13 +++------
 kernel/mutex.c          |   66 +++++++++++++++++++++++++++++++++++++++++------
 kernel/mutex.h          |   13 ++++++++-
 kernel/sched.c          |   63 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_debug.c    |    2 +
 kernel/sched_features.h |    1
 9 files changed, 146 insertions(+), 30 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
+	struct task_struct	*owner;
+#endif
 #ifdef CONFIG_DEBUG_MUTEXES
-	struct thread_info	*owner;
 	const char		*name;
 	void			*magic;
 #endif
@@ -67,8 +69,8 @@ struct mutex {
 struct mutex_waiter {
 	struct list_head	list;
 	struct task_struct	*task;
-#ifdef CONFIG_DEBUG_MUTEXES
 	struct mutex		*lock;
+#ifdef CONFIG_DEBUG_MUTEXES
 	void			*magic;
 #endif
 };
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -329,6 +329,8 @@ extern signed long schedule_timeout(sign
 extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern void mutex_spin_or_schedule(struct mutex_waiter *waiter,
+				   struct task_struct *owner, int cpu);
 asmlinkage void schedule(void);

 struct nsproxy;
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-	lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -59,7 +54,6 @@ void debug_mutex_add_waiter(struct mutex
 	/* Mark the current thread as blocked on the lock: */
 	ti->task->blocked_on = waiter;
-	waiter->lock = lock;
 }

 void mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
@@ -80,9 +74,8 @@ void debug_mutex_unlock(struct mutex *lo
 		return;

 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+	DEBUG_LOCKS_WARN_ON(lock->owner != current);
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }

 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +88,6 @@ void debug_mutex_init(struct mutex *lock
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
 	lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-	lock->owner = NULL;
 	lock->magic = lock;
 }
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-	lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
@@ -35,6 +27,11 @@ extern void debug_mutex_unlock(struct mu
 extern void debug_mutex_init(struct mutex *lock, const char *name,
 			     struct lock_class_key *key);

+static inline void mutex_set_owner(struct mutex *lock, struct task_struct *owner)
+{
+	lock->owner = owner;
+}
+
 #define spin_lock_mutex(lock, flags)			\
 	do {						\
 		struct mutex *l = container_of(lock, struct mutex, wait_lock); \
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -10,6 +10,11 @@
  * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
  * David Howells for suggestions and improvements.
  *
+ *  - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ *    from the -rt tree, where it was originally implemented for rtmutexes
+ *    by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ *    and Sven Dietrich.)
+ *
  * Also see Documentation/mutex-design.txt.
  */
 #include <linux/mutex.h>
@@ -46,6 +51,7 @@ __mutex_init(struct mutex *lock, const c
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
+	mutex_set_owner(lock, NULL);

 	debug_mutex_init(lock, name, key);
 }
@@ -91,6 +97,7 @@ void inline __sched mutex_lock(struct mu
 	 * 'unlocked' into 'locked' state.
 	 */
 	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+	mutex_set_owner(lock, current);
 }

 EXPORT_SYMBOL(mutex_lock);
@@ -115,6 +122,14 @@ void __sched mutex_unlock(struct mutex *
 	 * The unlocking fastpath is the 0->1 transition from 'locked'
 	 * into 'unlocked' state:
 	 */
+#ifndef CONFIG_DEBUG_MUTEXES
+	/*
+	 * When debugging is enabled we must not clear the owner before time,
+	 * the slow path will always be taken, and that clears the owner field
+	 * after verifying that it was indeed current.
+	 */
+	mutex_set_owner(lock, NULL);
+#endif
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
 }
@@ -141,6 +156,7 @@ __mutex_lock_common(struct mutex *lock,
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
+	waiter.lock = lock;

 	old_val = atomic_xchg(&lock->count, -1);
 	if (old_val == 1)
@@ -149,6 +165,11 @@ __mutex_lock_common(struct mutex *lock,
 	lock_contended(&lock->dep_map, ip);

 	for (;;) {
+#ifdef CONFIG_SMP
+		int cpu = 0;
+		struct task_struct *l_owner;
+#endif
+
 		/*
 		 * Lets try to take the lock again - this is needed even if
 		 * we get here for the first time (shortly after failing to
@@ -175,19 +196,28 @@ __mutex_lock_common(struct mutex *lock,
 			debug_mutex_free_waiter(&waiter);
 			return -EINTR;
 		}
-		__set_task_state(task, state);

-		/* didnt get the lock, go to sleep: */
+		__set_task_state(task, state);
+#ifdef CONFIG_SMP
+		l_owner = ACCESS_ONCE(lock->owner);
+		if (l_owner)
+			cpu = task_cpu(l_owner);
+		spin_unlock_mutex(&lock->wait_lock, flags);
+		mutex_spin_or_schedule(&waiter, l_owner, cpu);
+		spin_lock_mutex(&lock->wait_lock, flags);
+#else
 		spin_unlock_mutex(&lock->wait_lock, flags);
 		schedule();
 		spin_lock_mutex(&lock->wait_lock, flags);
+#endif
 	}
+	__set_task_state(task, TASK_RUNNING);

 done:
 	lock_acquired(&lock->dep_map, ip);
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
-	debug_mutex_set_owner(lock, task_thread_info(task));
+	mutex_set_owner(lock, task);

 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
@@ -260,7 +290,7 @@ __mutex_unlock_common_slowpath(atomic_t
 		wake_up_process(waiter->task);
 	}

-	debug_mutex_clear_owner(lock);
+	mutex_set_owner(lock, NULL);

 	spin_unlock_mutex(&lock->wait_lock, flags);
 }
@@ -298,18 +328,30 @@ __mutex_lock_interruptible_slowpath(atom
  */
 int __sched mutex_lock_interruptible(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_interruptible_slowpath);
+	if (!ret)
+		mutex_set_owner(lock, current);
+
+	return ret;
 }

 EXPORT_SYMBOL(mutex_lock_interruptible);

 int __sched mutex_lock_killable(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_killable_slowpath);
+	if (!ret)
+		mutex_set_owner(lock, current);
+
+	return ret;
 }
 EXPORT_SYMBOL(mutex_lock_killable);
@@ -352,9 +394,10 @@ static inline int __mutex_trylock_slowpa
 	prev = atomic_xchg(&lock->count, -1);
 	if (likely(prev == 1)) {
-		debug_mutex_set_owner(lock, current_thread_info());
+		mutex_set_owner(lock, current);
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 	}
+
 	/* Set it back to 0 if there are no waiters: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -380,8 +423,13 @@ static inline int __mutex_trylock_slowpa
  */
 int __sched mutex_trylock(struct mutex *lock)
 {
-	return __mutex_fastpath_trylock(&lock->count,
-					__mutex_trylock_slowpath);
+	int ret;
+
+	ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath);
+	if (ret)
+		mutex_set_owner(lock, current);
+
+	return ret;
 }
 EXPORT_SYMBOL(mutex_trylock);
Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -16,8 +16,17 @@
 #define mutex_remove_waiter(lock, waiter, ti) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)

-#define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
-#define debug_mutex_clear_owner(lock)			do { } while (0)
+#ifdef CONFIG_SMP
+static inline void mutex_set_owner(struct mutex *lock, struct task_struct *owner)
+{
+	lock->owner = owner;
+}
+#else
+static inline void mutex_set_owner(struct mutex *lock, struct task_struct *owner)
+{
+}
+#endif
+
 #define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
 #define debug_mutex_free_waiter(waiter)			do { } while (0)
 #define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -631,6 +631,10 @@ struct rq {

 	/* BKL stats */
 	unsigned int bkl_count;
+
+	/* mutex spin stats */
+	unsigned int mtx_spin;
+	unsigned int mtx_sched;
 #endif
 };
@@ -4596,6 +4600,65 @@ pick_next_task(struct rq *rq, struct tas
 	}
 }

+#ifdef CONFIG_SMP
+void mutex_spin_or_schedule(struct mutex_waiter *waiter,
+			    struct task_struct *owner, int cpu)
+{
+	struct task_struct *task = waiter->task;
+	struct mutex *lock = waiter->lock;
+	struct rq *rq;
+	int spin = 0;
+
+	if (likely(sched_feat(MUTEX_SPIN) && owner)) {
+		rq = cpu_rq(cpu);
+		spin = (rq->curr == owner);
+	}
+
+	if (!spin) {
+		schedstat_inc(cpu_rq(raw_smp_processor_id()), mtx_sched);
+		schedule();
+		return;
+	}
+
+	schedstat_inc(cpu_rq(raw_smp_processor_id()), mtx_spin);
+	for (;;) {
+		struct task_struct *l_owner;
+
+		/* Stop spinning when there's a pending signal. */
+		if (signal_pending_state(task->state, task))
+			break;
+
+		/* Mutex got unlocked, try to acquire. */
+		if (!mutex_is_locked(lock))
+			break;
+
+		/*
+		 * Owner changed, bail to re-assess state.
+		 *
+		 * We ignore !owner because that would break us out of
+		 * the spin too early -- see mutex_unlock() -- and make
+		 * us schedule -- see the !owner case on top -- at the
+		 * worst possible moment.
+		 */
+		l_owner = ACCESS_ONCE(lock->owner);
+		if (l_owner && l_owner != owner)
+			break;
+
+		/* Owner stopped running, bail to re-assess state. */
+		if (rq->curr != owner)
+			break;
+
+		/*
+		 * cpu_relax() provides a compiler barrier that ensures we
+		 * reload everything every time. SMP barriers are not strictly
+		 * required as the worst case is we'll spin a bit more before
+		 * we observe the right values.
+		 */
+		cpu_relax();
+	}
+}
+#endif
+
 /*
  * schedule() is the main scheduler function.
  */
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -288,6 +288,8 @@ static void print_cpu(struct seq_file *m

 	P(bkl_count);

+	P(mtx_spin);
+	P(mtx_sched);
 #undef P
 #endif
 	print_cfs_stats(m, cpu);
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -13,3 +13,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
 SCHED_FEAT(ASYM_EFF_LOAD, 1)
 SCHED_FEAT(WAKEUP_OVERLAP, 0)
 SCHED_FEAT(LAST_BUDDY, 1)
+SCHED_FEAT(MUTEX_SPIN, 1)
Chris Mason
2009-Jan-07 17:20 UTC
Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 08:25 -0800, Linus Torvalds wrote:
> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> >
> > Change mutex contention behaviour such that it will sometimes busy wait on
> > acquisition - moving its behaviour closer to that of spinlocks.
>
> Ok, this one looks _almost_ ok.
>
> The only problem is that I think you've lost the UP case.
>
> In UP, you shouldn't have the code to spin, and the "spin_or_schedule()"
> should fall back to just the schedule case.
>
> It might also be worthwhile to try to not set the owner, and re-organize
> that a bit (by making it an inline function that sets the owner only for
> CONFIG_SMP or lockdep/debug).

So far I haven't found any btrfs benchmarks where this is slower than
mutexes without any spinning. But, it isn't quite as fast as the btrfs
spin.

I'm using three different benchmarks, and they hammer on different things.
All against btrfs and a single sata drive.

* dbench -t 30 50, which means run dbench 50 for 30 seconds. It is a broad
workload that hammers on lots of code, but it also tends to go faster when
things are less fair. These numbers are stable across runs. Some IO is
done at the very start of the run, but the bulk of the run is CPU bound in
various btrfs btrees.

  Plain mutex: dbench reports 240MB/s
  Simple spin: dbench reports 560MB/s
  Peter's v4:  dbench reports 388MB/s

* 50 procs creating 10,000 files (4k each), one dir per proc. The result
is 50 dirs and each dir has 10,000 files. This is mostly CPU bound for the
procs, but pdflush and friends are doing lots of IO.

  Plain mutex: avg: 115 files/s  avg system time for each proc: 1.6s
  Simple spin: avg: 152 files/s  avg system time for each proc: 2s
  Peter's v4:  avg: 130 files/s  avg system time for each proc: 2.9s

I would have expected Peter's patch to use less system time than my spin.
If I change his patch to limit the spin to 512 iterations (same as my
code), the system time goes back down to 1.7s, but the files/s doesn't
improve.

* Parallel stat: the last benchmark is the most interesting, since it
really hammers on the btree locking speed. I take the directory tree from
the file creation run (50 dirs, 10,000 files each) and have 50 procs
running stat on all the files in parallel. Before the run I clear the
inode cache but leave the page cache. So, everything is hot in the btree
but there are no inodes in cache.

  Plain mutex: 9.488s real, 8.6s sys
  Simple spin: 3.8s real, 13.8s sys
  Peter's v4:  7.9s real, 8.5s sys

-chris
Linus Torvalds
2009-Jan-07 17:50 UTC
Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Chris Mason wrote:
>
> So far I haven't found any btrfs benchmarks where this is slower than
> mutexes without any spinning. But, it isn't quite as fast as the btrfs
> spin.

Quite frankly, from our history with ext3 and other filesystems, using a
mutex in the filesystem is generally the wrong thing to do anyway.

Are you sure you can't just use a spinlock, and just release it over IO?
The "have to do IO or extend the btree" case is usually pretty damn clear.

Because it really sounds like you're lock-limited, and you should just try
to clean it up. A pure "just spinlock" in the hotpath is always going to
be better.

		Linus
Chris Mason
2009-Jan-07 18:00 UTC
Re: [PATCH -v4][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 09:50 -0800, Linus Torvalds wrote:
> On Wed, 7 Jan 2009, Chris Mason wrote:
> >
> > So far I haven't found any btrfs benchmarks where this is slower than
> > mutexes without any spinning. But, it isn't quite as fast as the btrfs
> > spin.
>
> Quite frankly, from our history with ext3 and other filesystems, using a
> mutex in the filesystem is generally the wrong thing to do anyway.
>
> Are you sure you can't just use a spinlock, and just release it over IO?
> The "have to do IO or extend the btree" case is usually pretty damn clear.
>
> Because it really sounds like you're lock-limited, and you should just try
> to clean it up. A pure "just spinlock" in the hotpath is always going to
> be better.

There are definitely ways I can improve performance for contention in the
hot btree nodes, and I think it would be a mistake to tune the generic
adaptive locks just for my current code.

But, it isn't a bad test case to compare the spin with the new patch and
with the plain mutex. If the adaptive code gets in, I think it would be
best for me to drop the spin. Either way there's more work to be done in
the btrfs locking code.

-chris
On Wed, 2009-01-07 at 09:33 +0000, David Woodhouse wrote:
> On Tue, 2009-01-06 at 14:41 -0500, Chris Mason wrote:
>
> One more thing I'd suggest is removing the INSTALL file. The parts about
> having to build libcrc32c aren't relevant when it's part of the kernel
> tree and you have 'select LIBCRC32C', and the documentation on the
> userspace utilities probably lives _with_ the userspace repo. Might be
> worth adding a pointer to the userspace utilities though, in
> Documentation/filesystems/btrfs.txt
>
> I think you can drop your own copy of the GPL too.

I've pushed out your patch for this, thanks.

-chris
Linus Torvalds
2009-Jan-07 18:55 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
Ok, a few more issues.

This never stops. Here's the basic spinloop:

On Wed, 7 Jan 2009, Peter Zijlstra wrote:
>
> +	for (;;) {
> +		struct task_struct *l_owner;
> +
> +		/* Stop spinning when there's a pending signal. */
> +		if (signal_pending_state(task->state, task))
> +			break;

This just can't be right. If we have signal latency issues, we have way
way _way_ bigger issues.

So the correct test is to just break out if we have any work at all,
whether it's signals or rescheduling. Yes, you probably never saw that in
RT, but that was because you have preemption on etc.

And obviously usually there isn't enough constant contention to trigger
anything like this anyway, so usually the wait is short either because the
owner ends up sleeping, or because the owner releases the semaphore. But
you could have starvation issues where everybody is all on CPU, and other
processes constantly get the semaphore, and latency is absolutely HORRIBLE
because of this guy just busy-looping all the time.

So at a minimum, add a couple of "if (need_resched())" calls.

However, the other thing that really strikes me is that we've done all
that insane work to add ourselves to the waiter lists etc, and _then_ we
start spinning. Again, that can't be right: if there are other people on
the waiter list, we should not spin - because it's going to be really
really unfair to take the lock from them!

So we should do all this spinning ONLY IF there are no other waiters, ie
everybody else is spinning too! Anything else sounds just horribly broken
due to the extreme unfairness of it all. Those things are supposed to be
fair.

What does that mean? It means that the spinning loop should also check
that the wait-queue was empty. But it can't do that, because the way this
whole adaptive spinning was done is _inside_ the slow path that already
added itself to the waiting list - so you cannot tell if there are other
spinners.

So I think that the spinning is actually done in the wrong place. It
_should_ be done at the very top of __mutex_lock_common, _before_ we've
done that lock->wait_list thing etc.

That also makes us only spin at the beginning, and if we start sleeping,
we don't suddenly go back to spinning again.

That in turn would mean that we should try to do this spinning without any
locks. I think it's doable. I think it's also doable to spin _without_
dirtying the cacheline further and bouncing it due to the spinlock. We can
really do the spinning by just reading that lock, and the only time we
want to write is when we see the lock releasing.

Something like this..

NOTE NOTE NOTE! This does _not_ implement "spin_on_owner(lock, owner);".
That's the thing that the scheduler needs to do, with the extra
interesting part of needing to be able to access the thread_info struct
without knowing whether it might possibly have exited already.

But we can do that with __get_user(thread_info->cpu) (very unlikely page
fault protection due to the possibility of CONFIG_DEBUG_PAGEALLOC) and
then validating the cpu. If it's in range, we can use it and verify
whether cpu_rq(cpu)->curr has that thread_info.

So we can do all that locklessly and optimistically, just going back and
verifying the results later. This is why "thread_info" is actually a
better thing to use than "task_struct" - we can look up the cpu in it with
a simple dereference. We knew the pointer _used_ to be valid, so in any
normal situation, it will never page fault (and if you have
CONFIG_DEBUG_PAGEALLOC and hit a very unlucky race, then performance isn't
your concern anyway: we just need to make the page fault be non-lethal ;)

		Linus

---
 kernel/mutex.c |   30 ++++++++++++++++++++++++++++--
 1 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index 4f45d4b..65525d0 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -120,6 +120,8 @@ void __sched mutex_unlock(struct mutex *lock)

 EXPORT_SYMBOL(mutex_unlock);

+#define MUTEX_SLEEPERS	(-1000)
+
 /*
  * Lock a mutex (possibly interruptible), slowpath:
  */
@@ -132,6 +134,30 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	unsigned int old_val;
 	unsigned long flags;

+#ifdef CONFIG_SMP
+	/* Optimistic spinning.. */
+	for (;;) {
+		struct thread_info *owner;
+		int oldval = atomic_read(&lock->count);
+
+		if (oldval <= MUTEX_SLEEPERS)
+			break;
+		if (oldval == 1) {
+			oldval = atomic_cmpxchg(&lock->count, oldval, 0);
+			if (oldval == 1) {
+				lock->owner = task_thread_info(task);
+				return 0;
+			}
+		} else {
+			/* See who owns it, and spin on him if anybody */
+			owner = lock->owner;
+			if (owner)
+				spin_on_owner(lock, owner);
+		}
+		cpu_relax();
+	}
+#endif
+
 	spin_lock_mutex(&lock->wait_lock, flags);

 	debug_mutex_lock_common(lock, &waiter);
@@ -142,7 +168,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;

-	old_val = atomic_xchg(&lock->count, -1);
+	old_val = atomic_xchg(&lock->count, MUTEX_SLEEPERS);
 	if (old_val == 1)
 		goto done;

@@ -158,7 +184,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		 * that when we release the lock, we properly wake up the
 		 * other waiters:
 		 */
-		old_val = atomic_xchg(&lock->count, -1);
+		old_val = atomic_xchg(&lock->count, MUTEX_SLEEPERS);
 		if (old_val == 1)
 			break;
Steven Rostedt
2009-Jan-07 20:40 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Linus Torvalds wrote:
>
> So we can do all that locklessly and optimistically, just going back and
> verifying the results later. This is why "thread_info" is actually a
> better thing to use than "task_struct" - we can look up the cpu in it with
> a simple dereference. We knew the pointer _used_ to be valid, so in any
> normal situation, it will never page fault (and if you have
> CONFIG_DEBUG_PAGEALLOC and hit a very unlucky race, then performance isn't
> your concern anyway: we just need to make the page fault be non-lethal ;)

Wow, and I thought I do some crazy things with side effects of different
kernel characteristics.

So basically, since the owner used to be valid and we take the cpu number
directly from the thread_info struct, we do not need to worry about page
faulting. Next comes the issue of knowing if the owner is still running.
Wouldn't we need to do something like

	if (task_thread_info(cpu_rq(cpu)->curr) == owner)

I guess that would have the same characteristic: even if the task struct
of cpu_rq(cpu)->curr was freed, we can still reference the thread_info.
We might get garbage, but we don't care.

I understand that this should not be a problem, but I'm afraid it will
give me nightmares at night. ;-)

God that code had better be commented well.

-- Steve
Linus Torvalds
2009-Jan-07 20:55 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Steven Rostedt wrote:
>
> Next comes the issue of knowing if the owner is still running. Wouldn't
> we need to do something like
>
>	if (task_thread_info(cpu_rq(cpu)->curr) == owner)

Yes. After verifying that "cpu" is in a valid range.

> I understand that this should not be a problem, but I'm afraid it will
> give me nightmares at night. ;-)
>
> God that code had better be commented well.

Well, the good news is that it really would be just a few - admittedly
very subtle - lines, each basically generating just a couple of machine
instructions. So we'd be looking at code where the actual assembly output
should hopefully be in the ten-to-twenty instruction range, and the C code
itself would be about five times as many comments as actual real lines.

So the code really shouldn't be much worse than

	/*
	 * Look out! "thread" is an entirely speculative pointer
	 * access and not reliable.
	 */
	void loop_while_oncpu(struct mutex *lock, struct thread_info *thread)
	{
		for (;;) {
			unsigned cpu;
			struct rq *rq;

			if (lock->owner != thread)
				break;

			/*
			 * Need to access the cpu field knowing that
			 * DEBUG_PAGEALLOC could have unmapped it if
			 * the mutex owner just released it and exited.
			 */
			if (__get_user(cpu, &thread->cpu))
				break;

			/*
			 * Even if the access succeeded (likely case),
			 * the cpu field may no longer be valid. FIXME:
			 * this needs to validate that we can do a
			 * get_cpu() and that we have the percpu area.
			 */
			if (cpu >= NR_CPUS)
				break;

			if (!cpu_online(cpu))
				break;

			/*
			 * Is that thread really running on that cpu?
			 */
			rq = cpu_rq(cpu);
			if (task_thread_info(rq->curr) != thread)
				break;

			cpu_relax();
		}
	}

and it all looks like it shouldn't be all that bad. Yeah, it's like 50
lines of C code, but it's mostly comments about subtle one-liners that
really expand to almost no real code at all.

		Linus
Matthew Wilcox
2009-Jan-07 21:09 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, Jan 07, 2009 at 12:55:49PM -0800, Linus Torvalds wrote:
> void loop_while_oncpu(struct mutex *lock, struct thread_info *thread)
> {
>	for (;;) {
>		unsigned cpu;
>		struct rq *rq;
>
>		if (lock->owner != thread)
>			break;
>
>		/*
>		 * Need to access the cpu field knowing that
>		 * DEBUG_PAGEALLOC could have unmapped it if
>		 * the mutex owner just released it and exited.
>		 */
>		if (__get_user(cpu, &thread->cpu))
>			break;

I appreciate this is sample code, but using __get_user() on non-userspace
pointers messes up architectures which have separate user/kernel spaces
(eg the old 4G/4G split for x86-32). Do we have an appropriate function
for kernel space pointers? Is this a good reason to add one?

--
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Linus Torvalds
2009-Jan-07 21:24 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Matthew Wilcox wrote:
>
> I appreciate this is sample code, but using __get_user() on
> non-userspace pointers messes up architectures which have separate
> user/kernel spaces (eg the old 4G/4G split for x86-32). Do we have an
> appropriate function for kernel space pointers? Is this a good reason
> to add one?

Yes, you're right.

We could do the whole "oldfs = get_fs(); set_fs(KERNEL_DS); ..
set_fs(oldfs);" crud, but it would probably be better to just add an
architected accessor. Especially since it's going to generally just be a

	#define get_kernel_careful(val,p) __get_user(val,p)

for most architectures. We've needed that before (and yes, we've simply
mis-used __get_user() on x86 before rather than add it).

		Linus
Ingo Molnar
2009-Jan-07 21:28 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

>			/*
>			 * Even if the access succeeded (likely case),
>			 * the cpu field may no longer be valid. FIXME:
>			 * this needs to validate that we can do a
>			 * get_cpu() and that we have the percpu area.

s/get_cpu/cpu_rq ?

>			 */
>			if (cpu >= NR_CPUS)
>				break;
>
>			if (!cpu_online(cpu))
>				break;

Regarding the FIXME, this should be safe already - at least on x86 we set
up the per CPU areas for all CPUs in the possible-cpus mask during bootup.
So any CPU that is online will have a percpu area. (even in the most racy
case)

	Ingo
Ingo Molnar
2009-Jan-07 21:32 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 7 Jan 2009, Matthew Wilcox wrote:
> >
> > I appreciate this is sample code, but using __get_user() on
> > non-userspace pointers messes up architectures which have separate
> > user/kernel spaces (eg the old 4G/4G split for x86-32). Do we have an
> > appropriate function for kernel space pointers? Is this a good reason
> > to add one?
>
> Yes, you're right.
>
> We could do the whole "oldfs = get_fs(); set_fs(KERNEL_DS); ..
> set_fs(oldfs);" crud, but it would probably be better to just add an
> architected accessor. Especially since it's going to generally just be a
>
>	#define get_kernel_careful(val,p) __get_user(val,p)
>
> for most architectures.
>
> We've needed that before (and yes, we've simply mis-used __get_user() on
> x86 before rather than add it).

For the oldfs stuff we already have probe_kernel_read(). OTOH, that
involves pagefault_disable(), which is an atomic op, so
__get_user_careful() should be much more lightweight - and we already know
that the memory range at least _used to_ be a valid kernel address.

(Theoretical race: with memory hotplug that kernel pointer address could
have gotten unmapped and we could get device memory there - with
side-effects if accessed. Won't happen in practice.)

	Ingo
Andrew Morton
2009-Jan-07 21:35 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009 22:37:40 +0100 Andi Kleen <andi@firstfloor.org> wrote:

> > But we can do that with __get_user(thread_info->cpu) (very unlikely page
> > fault protection due to the possibility of CONFIG_DEBUG_PAGEALLOC) and
> > then validating the cpu. If it's in range, we can use it and verify
> > whether cpu_rq(cpu)->curr has that thread_info.
> >
> > So we can do all that locklessly and optimistically, just going back and
> > verifying the results later. This is why "thread_info" is actually a
> > better thing to use than "task_struct" - we can look up the cpu in it
> > with a simple dereference. We knew the pointer _used_ to be valid, so in
> > any normal situation, it will never page fault (and if you have
> > CONFIG_DEBUG_PAGEALLOC and hit a very unlucky race, then performance
> > isn't your concern anyway: we just need to make the page fault be
> > non-lethal ;)
>
> The problem with probe_kernel_address() is that it does lots of
> operations around the access in the hot path (set_fs, pagefault_disable
> etc.), so I'm not sure that's a good idea.

probe_kernel_address() isn't tooooo bad - a few reads and writes into the
task_struct and thread_struct.  And we're on the slow, contended path here
anyway..
> But we can do that with __get_user(thread_info->cpu) (very unlikely page
> fault protection due to the possibility of CONFIG_DEBUG_PAGEALLOC) and
> then validating the cpu. If it's in range, we can use it and verify
> whether cpu_rq(cpu)->curr has that thread_info.
>
> So we can do all that locklessly and optimistically, just going back and
> verifying the results later. This is why "thread_info" is actually a
> better thing to use than "task_struct" - we can look up the cpu in it with
> a simple dereference. We knew the pointer _used_ to be valid, so in any
> normal situation, it will never page fault (and if you have
> CONFIG_DEBUG_PAGEALLOC and hit a very unlucky race, then performance isn't
> your concern anyway: we just need to make the page fault be non-lethal ;)

The problem with probe_kernel_address() is that it does lots of operations
around the access in the hot path (set_fs, pagefault_disable etc.), so I'm
not sure that's a good idea.

Sure, you can probably do better, but that would involve patching all
architectures, wouldn't it?  Ok, I suppose you could make an
ARCH_HAS_blabla white list, but that wouldn't be exactly pretty.

-Andi
Linus Torvalds
2009-Jan-07 21:39 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Linus Torvalds wrote:
>
> We've needed that before (and yes, we've simply mis-used __get_user() on
> x86 before rather than add it).

Ahh, yeah, we have a really broken form of this in probe_kernel_address(),
but it's disgustingly slow. And it actually does a whole lot more, since
it is designed for addresses that we can't trust.

The thing about "thread->cpu" is that we can actually _trust_ the address,
so we know it's not a user address or anything like that. So we don't need
the whole pagefault_disable/enable around it. At least the x86 page fault
handler knows that it absolutely must not take the mm semaphore for these
things, for example (because it would be an insta-deadlock for the vmalloc
space and thus any mm fault handlers in filesystems that are modules).

So we would be better off without that, but we could certainly also
improve probe_kernel_address() if we had a better model.

IOW, we *could* do something like

	#define get_kernel_mode(val, addr) __get_user(val, addr)

on the simple platforms, and then in the generic <linux/uaccess.h> code we
can just do

	#ifndef get_kernel_mode
	#define get_kernel_mode(val, addr) ({		\
		long _err;				\
		mm_segment_t old_fs = get_fs();		\
		set_fs(KERNEL_DS);			\
		_err = __get_user(val, addr);		\
		set_fs(old_fs);				\
		_err; })
	#endif

and then probe_kernel_address becomes

	#define probe_kernel_address(addr, val) ({	\
		long _err;				\
		pagefault_disable();			\
		_err = get_kernel_mode(val, addr);	\
		pagefault_enable();			\
		_err; })

which leaves all architectures with the trivial option to just define
their own 'get_kernel_mode()' thing that _often_ is exactly the same as
__get_user().

Hmm? Anybody want to test it?

		Linus
> I appreciate this is sample code, but using __get_user() on
> non-userspace pointers messes up architectures which have separate
> user/kernel spaces (eg the old 4G/4G split for x86-32).  Do we have an
> appropriate function for kernel space pointers?

probe_kernel_address(). But it's slow.

-Andi

--
ak@linux.intel.com
Andrew Morton
2009-Jan-07 21:47 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009 22:32:22 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> > We could do the whole "oldfs = get_fs(); set_fs(KERNEL_DS); ..
> > set_fs(oldfs);" crud, but it would probably be better to just add an
> > architected accessor. Especially since it's going to generally just be a
> >
> > 	#define get_kernel_careful(val,p) __get_user(val,p)
> >
> > for most architectures.
> >
> > We've needed that before (and yes, we've simply mis-used __get_user() on
> > x86 before rather than add it).
>
> for the oldfs stuff we already have probe_kernel_read(). OTOH, that
> involves pagefault_disable() which is an atomic op

tisn't.  pagefault_disable() is just preempt_count()+=1;barrier() ?

Am suspecting that you guys might be over-optimising this
contended-path-we're-going-to-spin-anyway code?
Peter Zijlstra
2009-Jan-07 21:51 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 12:55 -0800, Linus Torvalds wrote:

> 	/*
> 	 * Look out! "thread" is an entirely speculative pointer
> 	 * access and not reliable.
> 	 */
> 	void loop_while_oncpu(struct mutex *lock, struct thread_struct *thread)
> 	{
> 		for (;;) {
> 			unsigned cpu;
> 			struct runqueue *rq;
>
> 			if (lock->owner != thread)
> 				break;
>
> 			/*
> 			 * Need to access the cpu field knowing that
> 			 * DEBUG_PAGEALLOC could have unmapped it if
> 			 * the mutex owner just released it and exited.
> 			 */
> 			if (__get_user(cpu, &thread->cpu))
> 				break;
>
> 			/*
> 			 * Even if the access succeeded (likely case),
> 			 * the cpu field may no longer be valid. FIXME:
> 			 * this needs to validate that we can do a
> 			 * get_cpu() and that we have the percpu area.
> 			 */
> 			if (cpu >= NR_CPUS)
> 				break;
>
> 			if (!cpu_online(cpu))
> 				break;
>
> 			/*
> 			 * Is that thread really running on that cpu?
> 			 */
> 			rq = cpu_rq(cpu);
> 			if (task_thread_info(rq->curr) != thread)
> 				break;
>
> 			cpu_relax();
> 		}
> 	}

Do we really have to re-do all that code every loop? Something like this,
with the cpu lookup hoisted out of the spin loop (the hoisted checks now
return instead of break, since they are outside the loop):

void loop_while_oncpu(struct mutex *lock, struct thread_struct *thread)
{
	unsigned cpu;
	struct runqueue *rq;

	/*
	 * Need to access the cpu field knowing that
	 * DEBUG_PAGEALLOC could have unmapped it if
	 * the mutex owner just released it and exited.
	 */
	if (__get_user(cpu, &thread->cpu))
		return;

	/*
	 * Even if the access succeeded (likely case),
	 * the cpu field may no longer be valid. FIXME:
	 * this needs to validate that we can do a
	 * get_cpu() and that we have the percpu area.
	 */
	if (cpu >= NR_CPUS)
		return;

	if (!cpu_online(cpu))
		return;

	rq = cpu_rq(cpu);
	for (;;) {
		if (lock->owner != thread)
			break;

		/*
		 * Is that thread really running on that cpu?
		 */
		if (task_thread_info(rq->curr) != thread)
			break;

		cpu_relax();
	}
}

Also, it would still need to do the funny:

	l_owner = ACCESS_ONCE(lock->owner);
	if (l_owner && l_owner != thread)
		break;

thing, to handle the premature non-atomic lock->owner tracking.
Ingo Molnar
2009-Jan-07 21:57 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 7 Jan 2009 22:32:22 +0100
> Ingo Molnar <mingo@elte.hu> wrote:
>
> > for the oldfs stuff we already have probe_kernel_read(). OTOH, that
> > involves pagefault_disable() which is an atomic op
>
> tisn't.  pagefault_disable() is just preempt_count()+=1;barrier() ?

okay. Not an atomic op (which is plenty fast on Nehalem with 20 cycles
anyway), but probe_kernel_read() is expensive nevertheless:

ffffffff8027c092 <probe_kernel_read>:
ffffffff8027c092:	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
ffffffff8027c099:	00 00
ffffffff8027c09b:	53                   	push   %rbx
ffffffff8027c09c:	48 8b 98 48 e0 ff ff 	mov    -0x1fb8(%rax),%rbx
ffffffff8027c0a3:	48 c7 80 48 e0 ff ff 	movq   $0xffffffffffffffff,-0x1fb8(%rax)
ffffffff8027c0aa:	ff ff ff ff
ffffffff8027c0ae:	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
ffffffff8027c0b5:	00 00
ffffffff8027c0b7:	ff 80 44 e0 ff ff    	incl   -0x1fbc(%rax)
ffffffff8027c0bd:	e8 0e dd 0d 00       	callq  ffffffff80359dd0 <__copy_from_user_inatomic>
ffffffff8027c0c2:	65 48 8b 14 25 10 00 	mov    %gs:0x10,%rdx
ffffffff8027c0c9:	00 00
ffffffff8027c0cb:	ff 8a 44 e0 ff ff    	decl   -0x1fbc(%rdx)
ffffffff8027c0d1:	65 48 8b 14 25 10 00 	mov    %gs:0x10,%rdx
ffffffff8027c0d8:	00 00
ffffffff8027c0da:	48 83 f8 01          	cmp    $0x1,%rax
ffffffff8027c0de:	48 89 9a 48 e0 ff ff 	mov    %rbx,-0x1fb8(%rdx)
ffffffff8027c0e5:	48 19 c0             	sbb    %rax,%rax
ffffffff8027c0e8:	48 f7 d0             	not    %rax
ffffffff8027c0eb:	48 83 e0 f2          	and    $0xfffffffffffffff2,%rax
ffffffff8027c0ef:	5b                   	pop    %rbx
ffffffff8027c0f0:	c3                   	retq
ffffffff8027c0f1:	90                   	nop

where __copy_from_user_inatomic() goes into the full
__copy_generic_unrolled(). Not pretty.

> Am suspecting that you guys might be over-optimising this
> contended-path-we're-going-to-spin-anyway code?

not sure. Especially for 'good' locking usage - where there are shortly
held locks and the spin times are short, the average time to get _out_ of
the spinning section is a kind of secondary fastpath as well.

	Ingo
Linus Torvalds
2009-Jan-07 21:58 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Peter Zijlstra wrote:
>
> Do we really have to re-do all that code every loop?

No, you're right, we can just look up the cpu once. Which makes Andrew's
argument that "probe_kernel_address()" isn't in any hot path even more
true.

> Also, it would still need to do the funny:
>
> 	l_owner = ACCESS_ONCE(lock->owner);
> 	if (l_owner && l_owner != thread)
> 		break;

Why? That would fall out of the

	if (lock->owner != thread)
		break;

part. We don't actually care that it only happens once: this all has
_known_ races, and the "cpu_relax()" is a barrier.

And notice how the _caller_ handles the "owner == NULL" case by not even
calling this, and looping over just the state in the lock itself. That was
in the earlier emails. So this approach is actually pretty different from
the case that depended on the whole spinlock thing.

		Linus
Linus Torvalds
2009-Jan-07 22:06 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Linus Torvalds wrote:
>
> We don't actually care that it only happens once: this all has _known_
> races, and the "cpu_relax()" is a barrier.

I phrased that badly. It's not that it has "known races", it's really that
the whole code sequence is very much written and intended to be
optimistic. So whatever code motion or whatever CPU memory ordering motion
happens, we don't really care, because none of the tests are final.

We do need to make sure that the compiler doesn't optimize the loads out
of the loops _entirely_, but the "cpu_relax()" things that we need for
other reasons guarantee that part.

One related issue: since we avoid the spinlock, we now suddenly end up
relying on the "atomic_cmpxchg()" having lock-acquire memory ordering
semantics. Because _that_ is the one non-speculative thing we do end up
doing in the whole loop.

But atomic_cmpxchg() is currently defined to be a full memory barrier, so
we should be ok. The only issue might be that it's _too_ much of a memory
barrier for some architectures, but this is not the pure fastpath, so I
think we're all good.

		Linus
Peter Zijlstra
2009-Jan-07 22:18 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 13:58 -0800, Linus Torvalds wrote:

> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> >
> > Do we really have to re-do all that code every loop?
>
> No, you're right, we can just look up the cpu once. Which makes Andrew's
> argument that "probe_kernel_address()" isn't in any hot path even more
> true.
>
> > Also, it would still need to do the funny:
> >
> > 	l_owner = ACCESS_ONCE(lock->owner);
> > 	if (l_owner && l_owner != thread)
> > 		break;
>
> Why? That would fall out of the
>
> 	if (lock->owner != thread)
> 		break;
>
> part. We don't actually care that it only happens once: this all has
> _known_ races, and the "cpu_relax()" is a barrier.
>
> And notice how the _caller_ handles the "owner == NULL" case by not even
> calling this, and looping over just the state in the lock itself.

Ah, so now you do loop on !owner; previously you insisted we'd go to sleep
on !owner. Yes, with !owner spinning that is indeed not needed.

> +#ifdef CONFIG_SMP
> +	/* Optimistic spinning.. */
> +	for (;;) {
> +		struct thread_struct *owner;
> +		int oldval = atomic_read(&lock->count);
> +
> +		if (oldval <= MUTEX_SLEEPERS)
> +			break;
> +		if (oldval == 1) {
> +			oldval = atomic_cmpxchg(&lock->count, oldval, 0);
> +			if (oldval == 1) {
> +				lock->owner = task_thread_info(task);
> +				return 0;
> +			}
> +		} else {
> +			/* See who owns it, and spin on him if anybody */
> +			owner = lock->owner;
> +			if (owner)
> +				spin_on_owner(lock, owner);
> +		}
> +		cpu_relax();
> +	}
> +#endif

Hmm, still wouldn't the spin_on_owner() loopyness and the above need that
need_resched() check you mentioned, so that it can fall into the slow path
and go to sleep?
Gregory Haskins
2009-Jan-07 22:28 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
Andi Kleen wrote:
>> I appreciate this is sample code, but using __get_user() on
>> non-userspace pointers messes up architectures which have separate
>> user/kernel spaces (eg the old 4G/4G split for x86-32).  Do we have an
>> appropriate function for kernel space pointers?
>
> probe_kernel_address().
>
> But it's slow.
>
> -Andi

Can I ask a simple question in light of all this discussion?

"Is get_task_struct() really that bad?"

I have to admit you guys have somewhat lost me on some of the more recent
discussion, so it's probably just a case of being naive on my part...but
this whole thing seems to have become way more complex than it needs to
be.  Let's boil this down to the core requirements: we need to know if the
owner task is still running somewhere in the system as a predicate to
whether we should sleep or spin, period.  Now the question is how to do
that.

The get/put task is the obvious answer to me (as an aside, we looked at
task->oncpu rather than the rq->curr stuff, which I believe was better),
and I am inclined to think that is a perfectly reasonable way to do this:
after all, even if acquiring a reference is somewhat expensive (which I
don't really think it is on a modern processor), we are already in the
slowpath as it is and would sleep otherwise.

Steve proposed a really cool trick with RCU, since we know that the task
cannot release while holding the lock, and the pointer cannot go away
without waiting for a grace period.  It turned out to introduce latency
side-effects so it ultimately couldn't be used (and this was in no way a
knock against RCU or you, Paul..just wasn't the right tool for the job, it
turned out).

Ok, so onto other ideas.  What if we simply look at something like a
scheduling sequence id?  If we know (within the wait-lock) that task X is
the owner and it's on CPU A, then we can simply monitor whether A context
switches.  Having something like rq[A]->seq++ every time we schedule()
would suffice, and you wouldn't need to hold a task reference...just note
A=X->cpu from inside the wait-lock.  I guess the downside there is putting
that extra increment in the schedule() hotpath even if no-one cares, but I
would surmise that should be reasonably cheap when no-one is pulling the
cacheline around other than A (i.e. no observers).

But anyway, my impression from observing the direction this discussion has
taken is that it is being way over-optimized before we even know if a) the
adaptive stuff helps, and b) the get/put-ref hurts.  Food for thought.

-Greg
Ingo Molnar
2009-Jan-07 22:33 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
* Gregory Haskins <ghaskins@novell.com> wrote:

> Can I ask a simple question in light of all this discussion?
>
> "Is get_task_struct() really that bad?"

it dirties a cacheline and it also involves atomics.

Also, it's a small design cleanliness issue to me: get_task_struct()
impacts the lifetime of an object - and if a locking primitive has
side-effects on object lifetimes that's never nice.

	Ingo
Peter W. Morreale
2009-Jan-07 22:51 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 23:33 +0100, Ingo Molnar wrote:
> * Gregory Haskins <ghaskins@novell.com> wrote:
>
> > Can I ask a simple question in light of all this discussion?
> >
> > "Is get_task_struct() really that bad?"
>
> it dirties a cacheline and it also involves atomics.
>
> Also, it's a small design cleanliness issue to me: get_task_struct()
> impacts the lifetime of an object - and if a locking primitive has
> side-effects on object lifetimes that's never nice.

True, but it's for one iteration * NR_CPUS, max.

Best,
-PWM
Dave Kleikamp
2009-Jan-07 22:54 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 13:58 -0800, Linus Torvalds wrote:
>
> On Wed, 7 Jan 2009, Peter Zijlstra wrote:
> >
> > Do we really have to re-do all that code every loop?
>
> No, you're right, we can just look up the cpu once. Which makes Andrew's
> argument that "probe_kernel_address()" isn't in any hot path even more
> true.

Do you need to even do that if CONFIG_DEBUG_PAGEALLOC is unset?
Something like:

#ifdef CONFIG_DEBUG_PAGEALLOC
	/*
	 * Need to access the cpu field knowing that
	 * DEBUG_PAGEALLOC could have unmapped it if
	 * the mutex owner just released it and exited.
	 */
	if (probe_kernel_address(&thread->cpu, cpu))
		break;
#else
	cpu = thread->cpu;
#endif

Shaggy
--
David Kleikamp
IBM Linux Technology Center
Gregory Haskins
2009-Jan-07 22:56 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
Hi Ingo,

Ingo Molnar wrote:
> * Gregory Haskins <ghaskins@novell.com> wrote:
>
>> Can I ask a simple question in light of all this discussion?
>>
>> "Is get_task_struct() really that bad?"
>
> it dirties a cacheline and it also involves atomics.

Yes, understood.  But we should note we are always going to be talking
about thrashing caches here, since we are ultimately having one CPU
observe another.  There's no way to get around that.  I understand that
there are various degrees of this occurring, and I have no doubt that the
proposed improvements strive to achieve a reduction of that.  My question
is really targeted at "at what cost".

Don't get me wrong.  I am not advocating going back to get/put-task per
se.  I am simply asking the question of whether we have taken the design
off into the weeds, having lost sight of the actual requirements and/or
results.  It's starting to smell like we have.  This is just a friendly
reality check.  Feel free to disregard. ;)

-Greg
Steven Rostedt
2009-Jan-07 23:09 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Gregory Haskins wrote:
>
> Don't get me wrong.  I am not advocating going back to get/put-task per
> se.  I am simply asking the question of whether we have taken the design
> off into the weeds, having lost sight of the actual requirements and/or
> results.  It's starting to smell like we have.  This is just a friendly
> reality check.  Feel free to disregard. ;)

What would be interesting is various benchmarks against all three.

1) no mutex spinning.
2) get_task_struct() implementation.
3) spin_or_sched implementation.

I believe that 2 happens to be the easiest to understand. No need to know
about the behavior of freed objects.

If we see no or negligible improvement between 2 and 3 on any benchmark,
then I say we stick with 2.

-- Steve
Linus Torvalds
2009-Jan-07 23:10 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Peter Zijlstra wrote:
>
> Hmm, still wouldn't the spin_on_owner() loopyness and the above need
> that need_resched() check you mentioned, so that it can fall into the
> slow path and go to sleep?

Yes, I do think that the outer loop at least should have a test for
need_resched().

Please take all my patches to be pseudo-code. They've neither been
compiled nor tested, and I'm just posting them in the hope that somebody
else will then do things in the direction I think is the proper one ;)

		Linus
Peter W. Morreale
2009-Jan-07 23:14 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 15:51 -0700, Peter W. Morreale wrote:
> On Wed, 2009-01-07 at 23:33 +0100, Ingo Molnar wrote:
> > it dirties a cacheline and it also involves atomics.
> >
> > Also, it's a small design cleanliness issue to me: get_task_struct()
> > impacts the lifetime of an object - and if a locking primitive has
> > side-effects on object lifetimes that's never nice.
>
> True, but it's for one iteration * NR_CPUS, max.

Never mind.  Bogus argument.  That's why we have you Big Guns out
there... to keep us riff-raff in line... :-)

Best,
-PWM
Linus Torvalds
2009-Jan-07 23:15 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Gregory Haskins wrote:
>
> Can I ask a simple question in light of all this discussion?
>
> "Is get_task_struct() really that bad?"

Yes. It's an atomic access (two, in fact, since you need to release it
too), which is a huge deal if we're talking about a timing-critical
section of code.

And this is timing-critical, or we wouldn't even care - even in the
contention case. Admittedly btrfs apparently makes it more so than it
_should_ be, but Peter had some timings that happened with just regular
create/unlink that showed a big difference.

So the whole and only point of spinning mutexes is to get rid of the
scheduler overhead, but to also not replace it with some other thing ;)

		Linus
Linus Torvalds
2009-Jan-07 23:18 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Linus Torvalds wrote:
> >
> > "Is get_task_struct() really that bad?"
>
> Yes. It's an atomic access (two, in fact, since you need to release it
> too), which is a huge deal if we're talking about a timing-critical
> section of code.

There's another issue: you also need to lock the thing that gives you the
task pointer in the first place.

So it's not sufficient to do get_task_struct(), you also need to do it
within a context where you know that the pointer is not going away _while_
you do it. And with the mutexes clearing the ->owner field without even
holding the spinlock, that is not a guarantee we can easily get any way.
Maybe we'd need to hold the tasklist_lock or something.

		Linus
Linus Torvalds
2009-Jan-07 23:19 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Dave Kleikamp wrote:
>
> Do you need to even do that if CONFIG_DEBUG_PAGEALLOC is unset?

No.

> Something like:
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> 	/*
> 	 * Need to access the cpu field knowing that
> 	 * DEBUG_PAGEALLOC could have unmapped it if
> 	 * the mutex owner just released it and exited.
> 	 */
> 	if (probe_kernel_address(&thread->cpu, cpu))
> 		break;
> #else
> 	cpu = thread->cpu;
> #endif

yes. That would work fine.

		Linus
Paul E. McKenney
2009-Jan-07 23:23 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, Jan 07, 2009 at 05:28:12PM -0500, Gregory Haskins wrote:
>
> Steve proposed a really cool trick with RCU, since we know that the task
> cannot release while holding the lock, and the pointer cannot go away
> without waiting for a grace period.  It turned out to introduce latency
> side-effects so it ultimately couldn't be used (and this was in no way a
> knock against RCU or you, Paul..just wasn't the right tool for the job,
> it turned out).

Too late...  I already figured out a way to speed up preemptable RCU's
read-side primitives (to about as fast as CONFIG_PREEMPT RCU's read-side
primitives) and also its grace-period latency.  And it is making it quite
clear that it won't let go of my brain until I implement it...  ;-)

							Thanx, Paul
Linus Torvalds
2009-Jan-07 23:32 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Steven Rostedt wrote:
>
> What would be interesting is various benchmarks against all three.
>
> 1) no mutex spinning.
> 2) get_task_struct() implementation.
> 3) spin_or_sched implementation.

One of the issues is that I cannot convince myself that (2) is even
necessarily correct. At least not without having all cases happen under
the mutex spinlock - which they don't. Even with the original patch, the
uncontended cases set and cleared the owner field outside the lock.

		Linus
Steven Rostedt
2009-Jan-07 23:46 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Linus Torvalds wrote:
> On Wed, 7 Jan 2009, Steven Rostedt wrote:
> >
> > What would be interesting is various benchmarks against all three.
> >
> > 1) no mutex spinning.
> > 2) get_task_struct() implementation.
> > 3) spin_or_sched implementation.
>
> One of the issues is that I cannot convince myself that (2) is even
> necessarily correct. At least not without having all cases happen under
> the mutex spinlock - which they don't. Even with the original patch, the
> uncontended cases set and cleared the owner field outside the lock.

True. I need to keep looking at the code that is posted. In -rt, we force
the fast path into the slowpath as soon as another task fails to get the
lock. Without that, as you pointed out, the code can be racy.

-- Steve
Steven Rostedt
2009-Jan-07 23:47 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Steven Rostedt wrote:
>
> True. I need to keep looking at the code that is posted. In -rt, we force
> the fast path into the slowpath as soon as another task fails to get the
> lock. Without that, as you pointed out, the code can be racy.

I mean we force the fast unlock path into the slow path as soon as another
task fails to get the lock.

-- Steve
Steven Rostedt
2009-Jan-07 23:49 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Linus Torvalds wrote:
> On Wed, 7 Jan 2009, Dave Kleikamp wrote:
> >
> > Do you need to even do that if CONFIG_DEBUG_PAGEALLOC is unset?
>
> No.
>
> > Something like:
> >
> > #ifdef CONFIG_DEBUG_PAGEALLOC
> > 	/*
> > 	 * Need to access the cpu field knowing that
> > 	 * DEBUG_PAGEALLOC could have unmapped it if
> > 	 * the mutex owner just released it and exited.
> > 	 */
> > 	if (probe_kernel_address(&thread->cpu, cpu))
> > 		break;
> > #else
> > 	cpu = thread->cpu;
> > #endif
>
> yes. That would work fine.

What about memory hotplug, as Ingo mentioned?

Should that be:

	#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_MEMORY_HOTPLUG)

??

-- Steve
Linus Torvalds
2009-Jan-07 23:52 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Steven Rostedt wrote:
>
> On Wed, 7 Jan 2009, Steven Rostedt wrote:
> >
> > True. I need to keep looking at the code that is posted. In -rt, we force
> > the fast path into the slowpath as soon as another task fails to get the
> > lock. Without that, as you pointed out, the code can be racy.
>
> I mean we force the fast unlock path into the slow path as soon as another
> task fails to get the lock.

I think the mainline mutex code does that all right too - we keep the
counter negative all the time when it has contention. So the original
model, where the spinning was done only after we'd gotten the spinlock,
probably was correct. However, it _is_ a lot more expensive than the
"optimistic spin" model.

		Linus
Linus Torvalds
2009-Jan-07 23:57 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Steven Rostedt wrote:
>
> Should that be:
>
> #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_MEMORY_HOTPLUG)

Well, probably CONFIG_MEMORY_HOTREMOVE, no? And I'd actually suggest that
unplugging should have a stop-machine if it doesn't already, just because
it's such a special case - like module removal.

		Linus
KAMEZAWA Hiroyuki
2009-Jan-08 02:18 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009 15:57:06 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Wed, 7 Jan 2009, Steven Rostedt wrote:
> >
> > Should that be:
> >
> > #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_MEMORY_HOTPLUG)
>
> Well, probably CONFIG_MEMORY_HOTREMOVE, no? And I'd actually suggest that
> unplugging should have a stop-machine if it doesn't already, just because
> it's such a special case - like module removal.

I looked at the memory hotplug code again. Then:

1. stop_machine() is not used.
2. The x86-64 code doesn't "unmap physical memory from the kernel space".
   (Because ia64, my (old) target machine, doesn't need it.)
   I'm not sure about powerpc.

I'd like to look into this for x86-64 when I can.

BTW, may I ask a question?
==
	if (lock->owner != thread)
		break;

	/*
	 * Need to access the cpu field knowing that
	 * DEBUG_PAGEALLOC could have unmapped it if
	 * the mutex owner just released it and exited.
	 */
	if (__get_user(cpu, &thread->cpu))
		break;
==
Can "thread" become stale while lock->owner == thread? Doesn't this
depend on CONFIG_DEBUG_MUTEXES?

Thanks,
-Kame
Steven Rostedt
2009-Jan-08 02:33 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Linus Torvalds wrote:
> > Should that be:
> >
> > #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_MEMORY_HOTPLUG)
>
> Well, probably CONFIG_MEMORY_HOTREMOVE, no? And I'd actually suggest that
> unplugging should have a stop-machine if it doesn't already, just because
> it's such a special case - like module removal.

I do not think stop-machine will help unless the spinning is protected
by preempt-disable. If the task gets preempted after grabbing the owner
thread_info, and then stop-machine runs, the memory disappears, the task
is scheduled back, accesses the owner thread_info and then page-faults.

-- Steve
KAMEZAWA Hiroyuki
2009-Jan-08 02:49 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009 21:33:31 -0500 (EST)
Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 7 Jan 2009, Linus Torvalds wrote:
> > > Should that be:
> > >
> > > #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_MEMORY_HOTPLUG)
> >
> > Well, probably CONFIG_MEMORY_HOTREMOVE, no? And I'd actually suggest that
> > unplugging should have a stop-machine if it doesn't already, just because
> > it's such a special case - like module removal.
>
> I do not think stop-machine will help, unless that spinning is protected
> by preempt-disable. If the task gets preempted after grabbing the owner
> thread_info, and then stop-machine runs, the memory disappears, the task
> is scheduled back, accesses the owner thread_info and then page-faults.

How about explicitly recording "cpu" in the mutex at lock time? Too bad
an idea?

-Kame
Gregory Haskins
2009-Jan-08 03:27 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
[resend: I fat-fingered the reply-to-all for a few messages]

>>> Linus Torvalds <torvalds@linux-foundation.org> 01/07/09 6:20 PM >>>
> On Wed, 7 Jan 2009, Linus Torvalds wrote:
> >
> > > "Is get_task_struct() really that bad?"
> >
> > Yes. It's an atomic access (two, in fact, since you need to release it
> > too), which is a huge deal if we're talking about a timing-critical
> > section of code.
>
> There's another issue: you also need to lock the thing that gives you the
> task pointer in the first place. So it's not sufficient to do
> get_task_struct(), you also need to do it within a context where you know
> that the pointer is not going away _while_ you do it.

In my defense, the -rt versions of the patches guarantee this is ok based
on a little hack: "if the owner holds the mutex, and you hold the
wait-lock, you guarantee the owner cannot exit". Therefore
get_task_struct() is guaranteed not to race if you take it while holding
the wait-lock. (This is, in fact, why the original design had a loop
within a loop... the inner loop would break when it wanted the outer loop
to reacquire the wait-lock for things such as re-acquiring owner/get-task.)

This is in part enforced by the fact that in -rt, pending waiters force
the mutex release to hit the slow path (and therefore require that the
wait-lock be reacquired by the releaser). I am under the impression
(because I haven't had a chance to really review the new patches yet)
that this is not necessarily the case here for the mainline patch, since
the owner field is not managed atomically, IIUC.

Again, not advocating get/put task per se. Just wanted to point out that
I didn't flub this up on the original -rt design ;)

-Greg
Gregory Haskins
2009-Jan-08 03:28 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
[resend: restore CC list]

>>> Linus Torvalds <torvalds@linux-foundation.org> 01/07/09 6:33 PM >>>
> On Wed, 7 Jan 2009, Steven Rostedt wrote:
> >
> > What would be interesting is various benchmarks against all three.
> >
> > 1) no mutex spinning.
> > 2) get_task_struct() implementation.
> > 3) spin_or_sched implementation.
>
> One of the issues is that I cannot convince myself that (2) is even
> necessarily correct. At least not without having all cases happen under
> the mutex spinlock - which they don't. Even with the original patch, the
> uncontended cases set and cleared the owner field outside the lock.

Yeah, you are right (see my last post). Without the cmpxchg trickery (*)
on the owner field (like we have in -rt) with the pending bits, (2)
cannot work.

(*) For what it's worth, I am impressed with the brilliance of the
cmpxchg fast path stuff in the rtmutex code. Whoever came up with that,
my hat is off to you.

-Greg
Steven Rostedt
2009-Jan-08 03:38 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 7 Jan 2009, Gregory Haskins wrote:
>
> In my defense, the -rt versions of the patches guarantee this is ok
> based on a little hack:

The -rt versions worry about much more than what the mutex code in
mainline does. Linus is correct in his arguments. The adaptive mutex (as
opposed to what -rt has) is only there to help performance. There are a
lot of races that can happen in the mainline version where lock taking may
not be FIFO, or where we might start to schedule when we could have taken
the lock. These races are not in -rt, but that is because -rt cares about
them. Mainline cares more about performance than determinism. This means
that we have to look at the current code that Peter is submitting with a
different perspective than we do in -rt.

-- Steve
Gregory Haskins
2009-Jan-08 04:00 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
Steven Rostedt wrote:
> On Wed, 7 Jan 2009, Gregory Haskins wrote:
>
> > In my defense, the -rt versions of the patches guarantee this is ok
> > based on a little hack:
>
> The -rt versions worry about much more than what the mutex code in
> mainline does. Linus is correct in his arguments. The adaptive mutex (as
> opposed to what -rt has) is only there to help performance. There are a
> lot of races that can happen in the mainline version where lock taking
> may not be FIFO, or where we might start to schedule when we could have
> taken the lock. These races are not in -rt, but that is because -rt
> cares about them. Mainline cares more about performance than
> determinism. This means that we have to look at the current code that
> Peter is submitting with a different perspective than we do in -rt.

Hey Steve,

Understood, and agreed. I only mentioned it because I wanted to clear the
record that I did not (to my knowledge) mess up the protocol design which
first introduced the get/put-task pattern under discussion ;). I am
fairly confident that at least the -rt version does not have any race
conditions such as the one Linus mentioned in the mainline version. I am
not advocating that the full protocol we use in -rt should be carried
forward, per se, or anything like that.

-Greg
> What about memory hotplug as Ingo mentioned?
>
> Should that be:
>
> #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_MEMORY_HOTPLUG)

We expect memory hotunplug to only really work in movable zones (all
others should at least have one kernel object somewhere that prevents
unplug), and you can't have task structs in movable zones, obviously.

So it's probably a non-issue in practice.

-Andi

--
ak@linux.intel.com
Peter Zijlstra
2009-Jan-08 07:10 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Wed, 2009-01-07 at 15:32 -0800, Linus Torvalds wrote:
>
> On Wed, 7 Jan 2009, Steven Rostedt wrote:
> >
> > What would be interesting is various benchmarks against all three.
> >
> > 1) no mutex spinning.
> > 2) get_task_struct() implementation.
> > 3) spin_or_sched implementation.
>
> One of the issues is that I cannot convince myself that (2) is even
> necessarily correct. At least not without having all cases happen under
> the mutex spinlock - which they don't. Even with the original patch, the
> uncontended cases set and cleared the owner field outside the lock.

Yes, (2) isn't feasible for regular mutexes, as we have non-atomic owner
tracking. I've since realized the whole rtmutex thing is fundamentally
different on a few levels:

 a) we have atomic owner tracking (that's the lock itself; it holds the
    task pointer as a cookie), and

 b) we need to do that whole enqueue on the waitlist thing because we
    need to do the PI propagation and such to figure out if the current
    task is even allowed to acquire -- that is, only the highest waiting
    and/or lateral steal candidates are allowed to spin acquire.
On Wed, 2009-01-07 at 15:10 -0800, Linus Torvalds wrote:

> Please take all my patches to be pseudo-code. They've neither been
> compiled nor tested, and I'm just posting them in the hope that somebody
> else will then do things in the direction I think is the proper one ;)

Linux opteron 2.6.28-tip #585 SMP PREEMPT Thu Jan 8 10:38:09 CET 2009 x86_64 x86_64 x86_64 GNU/Linux

[root@opteron bench]# echo NO_OWNER_SPIN > /debug/sched_features; ./timec -e -5,-4,-3,-2 ./test-mutex V 16 10
2 CPUs, running 16 parallel test-tasks.
checking VFS performance.

| loops/sec:                69415
avg ops/sec:               74996

average cost per op:        0.00 usecs
average cost per lock:      0.00 usecs
average cost per unlock:    0.00 usecs
max cost per op:            0.00 usecs
max cost per lock:          0.00 usecs
max cost per unlock:        0.00 usecs
average deviance per op:    0.00 usecs

 Performance counter stats for './test-mutex':

   12098.324578  task clock ticks     (msecs)

           1081  CPU migrations       (events)
           7102  context switches     (events)
           2763  pagefaults           (events)
   12098.324578  task clock ticks     (msecs)

 Wall-clock time elapsed:  12026.804839 msecs

[root@opteron bench]# echo OWNER_SPIN > /debug/sched_features; ./timec -e -5,-4,-3,-2 ./test-mutex V 16 10
2 CPUs, running 16 parallel test-tasks.
checking VFS performance.
| loops/sec: 208147 avg ops/sec: 228126 average cost per op: 0.00 usecs average cost per lock: 0.00 usecs average cost per unlock: 0.00 usecs max cost per op: 0.00 usecs max cost per lock: 0.00 usecs max cost per unlock: 0.00 usecs average deviance per op: 0.00 usecs Performance counter stats for ''./test-mutex'': 22280.283224 task clock ticks (msecs) 117 CPU migrations (events) 5711 context switches (events) 2781 pagefaults (events) 22280.283224 task clock ticks (msecs) Wall-clock time elapsed: 12307.053737 msecs * WOW * --- Subject: mutex: implement adaptive spin From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Thu Jan 08 09:41:22 CET 2009 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/mutex.h | 4 +- include/linux/sched.h | 1 kernel/mutex-debug.c | 7 ---- kernel/mutex-debug.h | 18 +++++----- kernel/mutex.c | 81 ++++++++++++++++++++++++++++++++++++++++++------ kernel/mutex.h | 22 +++++++++++-- kernel/sched.c | 63 +++++++++++++++++++++++++++++++++++++ kernel/sched_features.h | 1 8 files changed, 170 insertions(+), 27 deletions(-) Index: linux-2.6/include/linux/mutex.h ==================================================================--- linux-2.6.orig/include/linux/mutex.h +++ linux-2.6/include/linux/mutex.h @@ -50,8 +50,10 @@ struct mutex { atomic_t count; spinlock_t wait_lock; struct list_head wait_list; -#ifdef CONFIG_DEBUG_MUTEXES +#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP) struct thread_info *owner; +#endif +#ifdef CONFIG_DEBUG_MUTEXES const char *name; void *magic; #endif Index: linux-2.6/kernel/mutex-debug.c ==================================================================--- linux-2.6.orig/kernel/mutex-debug.c +++ linux-2.6/kernel/mutex-debug.c @@ -26,11 +26,6 @@ /* * Must be called with lock->wait_lock held. 
*/ -void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner) -{ - lock->owner = new_owner; -} - void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter) { memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter)); @@ -82,7 +77,6 @@ void debug_mutex_unlock(struct mutex *lo DEBUG_LOCKS_WARN_ON(lock->magic != lock); DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next); - DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info()); } void debug_mutex_init(struct mutex *lock, const char *name, @@ -95,7 +89,6 @@ void debug_mutex_init(struct mutex *lock debug_check_no_locks_freed((void *)lock, sizeof(*lock)); lockdep_init_map(&lock->dep_map, name, key, 0); #endif - lock->owner = NULL; lock->magic = lock; } Index: linux-2.6/kernel/mutex-debug.h ==================================================================--- linux-2.6.orig/kernel/mutex-debug.h +++ linux-2.6/kernel/mutex-debug.h @@ -13,14 +13,6 @@ /* * This must be called with lock->wait_lock held. 
*/ -extern void -debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner); - -static inline void debug_mutex_clear_owner(struct mutex *lock) -{ - lock->owner = NULL; -} - extern void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter); extern void debug_mutex_wake_waiter(struct mutex *lock, @@ -35,6 +27,16 @@ extern void debug_mutex_unlock(struct mu extern void debug_mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key); +static inline void mutex_set_owner(struct mutex *lock) +{ + lock->owner = current_thread_info(); +} + +static inline void mutex_clear_owner(struct mutex *lock) +{ + lock->owner = NULL; +} + #define spin_lock_mutex(lock, flags) \ do { \ struct mutex *l = container_of(lock, struct mutex, wait_lock); \ Index: linux-2.6/kernel/mutex.c ==================================================================--- linux-2.6.orig/kernel/mutex.c +++ linux-2.6/kernel/mutex.c @@ -10,6 +10,11 @@ * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and * David Howells for suggestions and improvements. * + * - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline + * from the -rt tree, where it was originally implemented for rtmutexes + * by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale + * and Sven Dietrich.) + * * Also see Documentation/mutex-design.txt. */ #include <linux/mutex.h> @@ -46,6 +51,7 @@ __mutex_init(struct mutex *lock, const c atomic_set(&lock->count, 1); spin_lock_init(&lock->wait_lock); INIT_LIST_HEAD(&lock->wait_list); + mutex_clear_owner(lock); debug_mutex_init(lock, name, key); } @@ -91,6 +97,7 @@ void inline __sched mutex_lock(struct mu * 'unlocked' into 'locked' state. 
*/ __mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath); + mutex_set_owner(lock); } EXPORT_SYMBOL(mutex_lock); @@ -115,11 +122,21 @@ void __sched mutex_unlock(struct mutex * * The unlocking fastpath is the 0->1 transition from ''locked'' * into ''unlocked'' state: */ +#ifndef CONFIG_DEBUG_MUTEXES + /* + * When debugging is enabled we must not clear the owner before time, + * the slow path will always be taken, and that clears the owner field + * after verifying that it was indeed current. + */ + mutex_clear_owner(lock); +#endif __mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath); } EXPORT_SYMBOL(mutex_unlock); +#define MUTEX_SLEEPERS (-1000) + /* * Lock a mutex (possibly interruptible), slowpath: */ @@ -132,6 +149,34 @@ __mutex_lock_common(struct mutex *lock, unsigned int old_val; unsigned long flags; +#ifdef CONFIG_SMP + /* Optimistic spinning.. */ + for (;;) { + struct thread_info *owner; + int oldval = atomic_read(&lock->count); + + if (oldval <= MUTEX_SLEEPERS) + break; + if (oldval == 1) { + oldval = atomic_cmpxchg(&lock->count, oldval, 0); + if (oldval == 1) { + mutex_set_owner(lock); + return 0; + } + } else { + /* See who owns it, and spin on him if anybody */ + owner = ACCESS_ONCE(lock->owner); + if (owner && !spin_on_owner(lock, owner)) + break; + } + + if (need_resched()) + break; + + cpu_relax(); + } +#endif + spin_lock_mutex(&lock->wait_lock, flags); debug_mutex_lock_common(lock, &waiter); @@ -142,7 +187,7 @@ __mutex_lock_common(struct mutex *lock, list_add_tail(&waiter.list, &lock->wait_list); waiter.task = task; - old_val = atomic_xchg(&lock->count, -1); + old_val = atomic_xchg(&lock->count, MUTEX_SLEEPERS); if (old_val == 1) goto done; @@ -158,7 +203,7 @@ __mutex_lock_common(struct mutex *lock, * that when we release the lock, we properly wake up the * other waiters: */ - old_val = atomic_xchg(&lock->count, -1); + old_val = atomic_xchg(&lock->count, MUTEX_SLEEPERS); if (old_val == 1) break; @@ -187,7 +232,7 @@ done: 
lock_acquired(&lock->dep_map, ip); /* got the lock - rejoice! */ mutex_remove_waiter(lock, &waiter, task_thread_info(task)); - debug_mutex_set_owner(lock, task_thread_info(task)); + mutex_set_owner(lock); /* set it to 0 if there are no waiters left: */ if (likely(list_empty(&lock->wait_list))) @@ -260,7 +305,7 @@ __mutex_unlock_common_slowpath(atomic_t wake_up_process(waiter->task); } - debug_mutex_clear_owner(lock); + mutex_clear_owner(lock); spin_unlock_mutex(&lock->wait_lock, flags); } @@ -298,18 +343,30 @@ __mutex_lock_interruptible_slowpath(atom */ int __sched mutex_lock_interruptible(struct mutex *lock) { + int ret; + might_sleep(); - return __mutex_fastpath_lock_retval + ret = __mutex_fastpath_lock_retval (&lock->count, __mutex_lock_interruptible_slowpath); + if (!ret) + mutex_set_owner(lock); + + return ret; } EXPORT_SYMBOL(mutex_lock_interruptible); int __sched mutex_lock_killable(struct mutex *lock) { + int ret; + might_sleep(); - return __mutex_fastpath_lock_retval + ret = __mutex_fastpath_lock_retval (&lock->count, __mutex_lock_killable_slowpath); + if (!ret) + mutex_set_owner(lock); + + return ret; } EXPORT_SYMBOL(mutex_lock_killable); @@ -352,9 +409,10 @@ static inline int __mutex_trylock_slowpa prev = atomic_xchg(&lock->count, -1); if (likely(prev == 1)) { - debug_mutex_set_owner(lock, current_thread_info()); + mutex_set_owner(lock); mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_); } + /* Set it back to 0 if there are no waiters: */ if (likely(list_empty(&lock->wait_list))) atomic_set(&lock->count, 0); @@ -380,8 +438,13 @@ static inline int __mutex_trylock_slowpa */ int __sched mutex_trylock(struct mutex *lock) { - return __mutex_fastpath_trylock(&lock->count, - __mutex_trylock_slowpath); + int ret; + + ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath); + if (ret) + mutex_set_owner(lock); + + return ret; } EXPORT_SYMBOL(mutex_trylock); Index: linux-2.6/kernel/mutex.h 
==================================================================--- linux-2.6.orig/kernel/mutex.h +++ linux-2.6/kernel/mutex.h @@ -16,8 +16,26 @@ #define mutex_remove_waiter(lock, waiter, ti) \ __list_del((waiter)->list.prev, (waiter)->list.next) -#define debug_mutex_set_owner(lock, new_owner) do { } while (0) -#define debug_mutex_clear_owner(lock) do { } while (0) +#ifdef CONFIG_SMP +static inline void mutex_set_owner(struct mutex *lock) +{ + lock->owner = current_thread_info(); +} + +static inline void mutex_clear_owner(struct mutex *lock) +{ + lock->owner = NULL; +} +#else +static inline void mutex_set_owner(struct mutex *lock) +{ +} + +static inline void mutex_clear_owner(struct mutex *lock) +{ +} +#endif + #define debug_mutex_wake_waiter(lock, waiter) do { } while (0) #define debug_mutex_free_waiter(waiter) do { } while (0) #define debug_mutex_add_waiter(lock, waiter, ti) do { } while (0) Index: linux-2.6/kernel/sched.c ==================================================================--- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -4672,6 +4672,69 @@ need_resched_nonpreemptible: } EXPORT_SYMBOL(schedule); +#ifdef CONFIG_SMP +/* + * Look out! "owner" is an entirely speculative pointer + * access and not reliable. + */ +int spin_on_owner(struct mutex *lock, struct thread_info *owner) +{ + unsigned int cpu; + struct rq *rq; + + if (unlikely(!sched_feat(OWNER_SPIN))) + return 0; + + preempt_disable(); +#ifdef CONFIG_DEBUG_PAGEALLOC + /* + * Need to access the cpu field knowing that + * DEBUG_PAGEALLOC could have unmapped it if + * the mutex owner just released it and exited. + */ + if (probe_kernel_address(&owner->cpu, cpu)) + goto out; +#else + cpu = owner->cpu; +#endif + + /* + * Even if the access succeeded (likely case), + * the cpu field may no longer be valid. + */ + if (cpu >= nr_cpumask_bits) + goto out; + + /* + * We need to validate that we can do a + * get_cpu() and that we have the percpu area. 
+ */ + if (!cpu_online(cpu)) + goto out; + + rq = cpu_rq(cpu); + + for (;;) { + if (lock->owner != owner) + break; + + /* + * Is that owner really running on that cpu? + */ + if (task_thread_info(rq->curr) != owner) + break; + + if (need_resched()) + break; + + cpu_relax(); + } +out: + preempt_enable_no_resched(); + return 1; +} +#endif + #ifdef CONFIG_PREEMPT /* * this is the entry point to schedule() from in-kernel preemption Index: linux-2.6/include/linux/sched.h ==================================================================--- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -330,6 +330,7 @@ extern signed long schedule_timeout_inte extern signed long schedule_timeout_killable(signed long timeout); extern signed long schedule_timeout_uninterruptible(signed long timeout); asmlinkage void schedule(void); +extern int spin_on_owner(struct mutex *lock, struct thread_info *owner); struct nsproxy; struct user_namespace; Index: linux-2.6/kernel/sched_features.h ==================================================================--- linux-2.6.orig/kernel/sched_features.h +++ linux-2.6/kernel/sched_features.h @@ -13,3 +13,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1) SCHED_FEAT(ASYM_EFF_LOAD, 1) SCHED_FEAT(WAKEUP_OVERLAP, 0) SCHED_FEAT(LAST_BUDDY, 1) +SCHED_FEAT(OWNER_SPIN, 1)
Ingo Molnar
2009-Jan-08 14:18 UTC
Re: [PATCH -v6][RFC]: mutex: implement adaptive spinning
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, 2009-01-07 at 15:10 -0800, Linus Torvalds wrote:
>
> > Please take all my patches to be pseudo-code. They've neither been
> > compiled nor tested, and I'm just posting them in the hope that somebody
> > else will then do things in the direction I think is the proper one ;)
>
> Linux opteron 2.6.28-tip #585 SMP PREEMPT Thu Jan 8 10:38:09 CET 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> [root@opteron bench]# echo NO_OWNER_SPIN > /debug/sched_features; ./timec -e -5,-4,-3,-2 ./test-mutex V 16 10
> 2 CPUs, running 16 parallel test-tasks.
> checking VFS performance.
> avg ops/sec: 74996
>
> Performance counter stats for './test-mutex':
>
>   12098.324578  task clock ticks     (msecs)
>
>           1081  CPU migrations       (events)
>           7102  context switches     (events)
>           2763  pagefaults           (events)
>
> Wall-clock time elapsed: 12026.804839 msecs
>
> [root@opteron bench]# echo OWNER_SPIN > /debug/sched_features; ./timec -e -5,-4,-3,-2 ./test-mutex V 16 10
> 2 CPUs, running 16 parallel test-tasks.
> checking VFS performance.
> avg ops/sec: 228126
>
> Performance counter stats for './test-mutex':
>
>   22280.283224  task clock ticks     (msecs)
>
>            117  CPU migrations       (events)
>           5711  context switches     (events)
>           2781  pagefaults           (events)
>
> Wall-clock time elapsed: 12307.053737 msecs
>
> * WOW *

WOW indeed - and i can see a similar _brutal_ speedup on two separate
16-way boxes as well:

  16 CPUs, running 128 parallel test-tasks.

  NO_OWNER_SPIN:
  avg ops/sec: 281595

  OWNER_SPIN:
  avg ops/sec: 524791

Da Killer!

Look at the performance counter stats:

>   12098.324578  task clock ticks     (msecs)
>
>           1081  CPU migrations       (events)
>           7102  context switches     (events)
>           2763  pagefaults           (events)

>   22280.283224  task clock ticks     (msecs)
>
>            117  CPU migrations       (events)
>           5711  context switches     (events)
>           2781  pagefaults           (events)

We were able to spend twice as much CPU time and efficiently so - and we
did about 10% of the cross-CPU migrations as before (!).

My (wild) guess is that the biggest speedup factor was perhaps this
little trick:

+		if (need_resched())
+			break;

this allows the spin-mutex to only waste CPU time if there's no work
around on that CPU (i.e. if there's no other task that wants to run). The
moment there's some other task, we context-switch to it. Very elegant
concept i think.

[ A detail, -tip testing found that the patch breaks mutex debugging:

  ====================================
  [ BUG: bad unlock balance detected! ]
  -------------------------------------
  swapper/0 is trying to release lock (cpu_add_remove_lock) at:
  [<ffffffff8089f540>] mutex_unlock+0xe/0x10
  but there are no more locks to release!

  but that's a detail for -v7 ;-) ]

	Ingo
Steven Rostedt
2009-Jan-08 14:24 UTC
Re: [PATCH -v5][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Andi Kleen wrote:
> > What about memory hotplug as Ingo mentioned?
> >
> > Should that be:
> >
> > #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_MEMORY_HOTPLUG)
>
> We expect memory hotunplug to only really work in movable zones (all
> others should at least have one kernel object somewhere that prevents
> unplug), and you can't have task structs in movable zones, obviously.
>
> So it's probably a non-issue in practice.

Sure, it probably is a non-issue, but I'm afraid that non-issues of today
might become issues of tomorrow. Where does it say that we can never put a
task struct in a movable zone? Perhaps we could someday have a CPU with
memory local to it, and tasks pinned to that CPU in that memory. Then
there can be cases where we remove the CPU and memory together. Because of
preemption in the mutex spin part, there's no guarantee that the task in
that removed memory will not be referenced again.

Of course this thought is purely theoretical, but I like to solve bugs
that might happen tomorrow too. ;-)

-- Steve
Gregory Haskins
2009-Jan-08 14:33 UTC
Re: [PATCH -v6][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
>
>> * WOW *
>
> WOW indeed - and i can see a similar _brutal_ speedup on two separate
> 16-way boxes as well:
>
> 16 CPUs, running 128 parallel test-tasks.
>
> NO_OWNER_SPIN:
> avg ops/sec:        281595
>
> OWNER_SPIN:
> avg ops/sec:        524791
>
> Da Killer!

This jives with our findings back when we first looked at this (200%-300% speedups in most benchmarks), so it is excellent that it is yielding boosts here as well.

> Look at the performance counter stats:
>
>>  12098.324578  task clock ticks  (msecs)
>>
>>          1081  CPU migrations    (events)
>>          7102  context switches  (events)
>>          2763  pagefaults        (events)
>
>>  22280.283224  task clock ticks  (msecs)
>>
>>           117  CPU migrations    (events)
>>          5711  context switches  (events)
>>          2781  pagefaults        (events)
>
> We were able to spend twice as much CPU time and efficiently so - and we
> did about 10% of the cross-CPU migrations as before (!).
>
> My (wild) guess is that the biggest speedup factor was perhaps this little
> trick:
>
> +		if (need_resched())
> +			break;
>
> this allows the spin-mutex to only waste CPU time if there's no work
> around on that CPU. (i.e. if there's no other task that wants to run) The
> moment there's some other task, we context-switch to it.

Well, IIUC that's only true if the other task happens to preempt current, which may not always be the case, right? For instance, if current still has timeslice left, etc. I think the primary difference is actually the reduction in the ctx switch rate, but it's hard to say without looking at detailed traces and more stats.

Either way, woohoo!

-Greg
> Sure, it probably is a non issue, but I'm afraid that non issues of today
> might become issues of tomorrow. Where does it say that we can never put a
> task struct in a movable zone?

Task structs are not movable, so by definition they do not belong in movable zones.

> memory local to it, and pinned tasks to that CPU in that memory. Then
> there can be cases where we remove the CPU and memory together.

If you did that you would need to redesign so much of the kernel that changing the mutex code too would be the smallest of your worries.

-Andi

--
ak@linux.intel.com
On Thu, 2009-01-08 at 15:18 +0100, Ingo Molnar wrote:
> [ A detail, -tip testing found that the patch breaks mutex debugging:
>
>   =====================================
>   [ BUG: bad unlock balance detected! ]
>   -------------------------------------
>   swapper/0 is trying to release lock (cpu_add_remove_lock) at:
>   [<ffffffff8089f540>] mutex_unlock+0xe/0x10
>   but there are no more locks to release!
>
>   but that's a detail for -v7 ;-) ]

Here it is..

I changed the optimistic spin cmpxchg trickery to not require the -1000 state; please double- and triple-check the code. This was done because the interaction between trylock_slowpath and that -1000 state hurt my head.

---
Subject:
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu Jan 08 09:41:22 CET 2009

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/mutex.h   |    4 +-
 include/linux/sched.h   |    1
 kernel/mutex-debug.c    |   24 ++++++++++-----
 kernel/mutex-debug.h    |   18 ++++++-----
 kernel/mutex.c          |   76 ++++++++++++++++++++++++++++++++++++++++++------
 kernel/mutex.h          |   22 ++++++++++++-
 kernel/sched.c          |   66 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched_features.h |    1
 8 files changed, 185 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/mutex.h
===================================================================
--- linux-2.6.orig/include/linux/mutex.h
+++ linux-2.6/include/linux/mutex.h
@@ -50,8 +50,10 @@ struct mutex {
 	atomic_t		count;
 	spinlock_t		wait_lock;
 	struct list_head	wait_list;
-#ifdef CONFIG_DEBUG_MUTEXES
+#if defined(CONFIG_DEBUG_MUTEXES) || defined(CONFIG_SMP)
 	struct thread_info	*owner;
+#endif
+#ifdef CONFIG_DEBUG_MUTEXES
 	const char		*name;
 	void			*magic;
 #endif
Index: linux-2.6/kernel/mutex-debug.c
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.c
+++ linux-2.6/kernel/mutex-debug.c
@@ -26,11 +26,6 @@
 /*
  * Must be called with lock->wait_lock held.
  */
-void debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner)
-{
-	lock->owner = new_owner;
-}
-
 void debug_mutex_lock_common(struct mutex *lock, struct mutex_waiter *waiter)
 {
 	memset(waiter, MUTEX_DEBUG_INIT, sizeof(*waiter));
@@ -80,9 +75,25 @@ void debug_mutex_unlock(struct mutex *lo
 		return;
 
 	DEBUG_LOCKS_WARN_ON(lock->magic != lock);
+#if 0
+	/*
+	 * XXX something is iffy with owner tracking, but lockdep - which has
+	 * similar checks - finds everything dandy, so disable for now.
+	 */
+	if (lock->owner != current_thread_info()) {
+		printk(KERN_ERR "%p %p\n", lock->owner, current_thread_info());
+		if (lock->owner) {
+			printk(KERN_ERR "%d %s\n",
+				lock->owner->task->pid,
+				lock->owner->task->comm);
+		}
+		printk(KERN_ERR "%d %s\n",
+			current_thread_info()->task->pid,
+			current_thread_info()->task->comm);
+	}
 	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
+#endif
 	DEBUG_LOCKS_WARN_ON(!lock->wait_list.prev && !lock->wait_list.next);
-	DEBUG_LOCKS_WARN_ON(lock->owner != current_thread_info());
 }
 
 void debug_mutex_init(struct mutex *lock, const char *name,
@@ -95,7 +106,6 @@ void debug_mutex_init(struct mutex *lock
 	debug_check_no_locks_freed((void *)lock, sizeof(*lock));
 	lockdep_init_map(&lock->dep_map, name, key, 0);
 #endif
-	lock->owner = NULL;
 	lock->magic = lock;
 }
Index: linux-2.6/kernel/mutex-debug.h
===================================================================
--- linux-2.6.orig/kernel/mutex-debug.h
+++ linux-2.6/kernel/mutex-debug.h
@@ -13,14 +13,6 @@
 /*
  * This must be called with lock->wait_lock held.
  */
-extern void
-debug_mutex_set_owner(struct mutex *lock, struct thread_info *new_owner);
-
-static inline void debug_mutex_clear_owner(struct mutex *lock)
-{
-	lock->owner = NULL;
-}
-
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
 extern void debug_mutex_wake_waiter(struct mutex *lock,
@@ -35,6 +27,16 @@ extern void debug_mutex_unlock(struct mu
 extern void debug_mutex_init(struct mutex *lock, const char *name,
 			     struct lock_class_key *key);
 
+static inline void mutex_set_owner(struct mutex *lock)
+{
+	lock->owner = current_thread_info();
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+	lock->owner = NULL;
+}
+
 #define spin_lock_mutex(lock, flags)			\
 	do {						\
 		struct mutex *l = container_of(lock, struct mutex, wait_lock); \
Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -10,6 +10,11 @@
  * Many thanks to Arjan van de Ven, Thomas Gleixner, Steven Rostedt and
  * David Howells for suggestions and improvements.
  *
+ *  - Adaptive spinning for mutexes by Peter Zijlstra. (Ported to mainline
+ *    from the -rt tree, where it was originally implemented for rtmutexes
+ *    by Steven Rostedt, based on work by Gregory Haskins, Peter Morreale
+ *    and Sven Dietrich.
+ *
  * Also see Documentation/mutex-design.txt.
  */
 #include <linux/mutex.h>
@@ -46,6 +51,7 @@ __mutex_init(struct mutex *lock, const c
 	atomic_set(&lock->count, 1);
 	spin_lock_init(&lock->wait_lock);
 	INIT_LIST_HEAD(&lock->wait_list);
+	mutex_clear_owner(lock);
 
 	debug_mutex_init(lock, name, key);
 }
@@ -91,6 +97,7 @@ void inline __sched mutex_lock(struct mu
 	 * 'unlocked' into 'locked' state.
 	 */
 	__mutex_fastpath_lock(&lock->count, __mutex_lock_slowpath);
+	mutex_set_owner(lock);
 }
 
 EXPORT_SYMBOL(mutex_lock);
@@ -115,6 +122,14 @@ void __sched mutex_unlock(struct mutex *
 	 * The unlocking fastpath is the 0->1 transition from 'locked'
 	 * into 'unlocked' state:
 	 */
+#ifndef CONFIG_DEBUG_MUTEXES
+	/*
+	 * When debugging is enabled we must not clear the owner before time,
+	 * the slow path will always be taken, and that clears the owner field
+	 * after verifying that it was indeed current.
+	 */
+	mutex_clear_owner(lock);
+#endif
 	__mutex_fastpath_unlock(&lock->count, __mutex_unlock_slowpath);
 }
 
@@ -129,13 +144,38 @@ __mutex_lock_common(struct mutex *lock,
 {
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
-	unsigned int old_val;
 	unsigned long flags;
+	int old_val;
+
+	mutex_acquire(&lock->dep_map, subclass, 0, ip);
+
+#ifdef CONFIG_SMP
+	/* Optimistic spinning.. */
+	for (;;) {
+		struct thread_info *owner;
+
+		old_val = atomic_cmpxchg(&lock->count, 1, 0);
+		if (old_val == 1) {
+			lock_acquired(&lock->dep_map, ip);
+			mutex_set_owner(lock);
+			return 0;
+		}
+
+		if (old_val < 0 && !list_empty(&lock->wait_list))
+			break;
+
+		/* See who owns it, and spin on him if anybody */
+		owner = ACCESS_ONCE(lock->owner);
+		if (owner && !spin_on_owner(lock, owner))
+			break;
+
+		cpu_relax();
+	}
+#endif
 
 	spin_lock_mutex(&lock->wait_lock, flags);
 
 	debug_mutex_lock_common(lock, &waiter);
-	mutex_acquire(&lock->dep_map, subclass, 0, ip);
 	debug_mutex_add_waiter(lock, &waiter, task_thread_info(task));
 
 	/* add waiting tasks to the end of the waitqueue (FIFO): */
@@ -187,7 +227,7 @@ done:
 	lock_acquired(&lock->dep_map, ip);
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
-	debug_mutex_set_owner(lock, task_thread_info(task));
+	mutex_set_owner(lock);
 
 	/* set it to 0 if there are no waiters left: */
 	if (likely(list_empty(&lock->wait_list)))
@@ -260,7 +300,7 @@ __mutex_unlock_common_slowpath(atomic_t
 		wake_up_process(waiter->task);
 	}
 
-	debug_mutex_clear_owner(lock);
+	mutex_clear_owner(lock);
 
 	spin_unlock_mutex(&lock->wait_lock, flags);
 }
@@ -298,18 +338,30 @@ __mutex_lock_interruptible_slowpath(atom
  */
 int __sched mutex_lock_interruptible(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_interruptible_slowpath);
+	if (!ret)
+		mutex_set_owner(lock);
+
+	return ret;
 }
 
 EXPORT_SYMBOL(mutex_lock_interruptible);
 
 int __sched mutex_lock_killable(struct mutex *lock)
 {
+	int ret;
+
 	might_sleep();
-	return __mutex_fastpath_lock_retval
+	ret = __mutex_fastpath_lock_retval
 			(&lock->count, __mutex_lock_killable_slowpath);
+	if (!ret)
+		mutex_set_owner(lock);
+
+	return ret;
 }
 EXPORT_SYMBOL(mutex_lock_killable);
 
@@ -352,9 +404,10 @@ static inline int __mutex_trylock_slowpa
 
 	prev = atomic_xchg(&lock->count, -1);
 	if (likely(prev == 1)) {
-		debug_mutex_set_owner(lock, current_thread_info());
+		mutex_set_owner(lock);
 		mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
 	}
+
 	/* Set it back to 0 if there are no waiters: */
 	if (likely(list_empty(&lock->wait_list)))
 		atomic_set(&lock->count, 0);
@@ -380,8 +433,13 @@ static inline int __mutex_trylock_slowpa
  */
 int __sched mutex_trylock(struct mutex *lock)
 {
-	return __mutex_fastpath_trylock(&lock->count,
-					__mutex_trylock_slowpath);
+	int ret;
+
+	ret = __mutex_fastpath_trylock(&lock->count, __mutex_trylock_slowpath);
+	if (ret)
+		mutex_set_owner(lock);
+
+	return ret;
 }
 
 EXPORT_SYMBOL(mutex_trylock);
Index: linux-2.6/kernel/mutex.h
===================================================================
--- linux-2.6.orig/kernel/mutex.h
+++ linux-2.6/kernel/mutex.h
@@ -16,8 +16,26 @@
 #define mutex_remove_waiter(lock, waiter, ti) \
 		__list_del((waiter)->list.prev, (waiter)->list.next)
 
-#define debug_mutex_set_owner(lock, new_owner)		do { } while (0)
-#define debug_mutex_clear_owner(lock)			do { } while (0)
+#ifdef CONFIG_SMP
+static inline void mutex_set_owner(struct mutex *lock)
+{
+	lock->owner = current_thread_info();
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+	lock->owner = NULL;
+}
+#else
+static inline void mutex_set_owner(struct mutex *lock)
+{
+}
+
+static inline void mutex_clear_owner(struct mutex *lock)
+{
+}
+#endif
+
 #define debug_mutex_wake_waiter(lock, waiter)		do { } while (0)
 #define debug_mutex_free_waiter(waiter)			do { } while (0)
 #define debug_mutex_add_waiter(lock, waiter, ti)	do { } while (0)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -4672,6 +4672,72 @@ need_resched_nonpreemptible:
 }
 EXPORT_SYMBOL(schedule);
 
+#ifdef CONFIG_SMP
+/*
+ * Look out! "owner" is an entirely speculative pointer
+ * access and not reliable.
+ */
+int spin_on_owner(struct mutex *lock, struct thread_info *owner)
+{
+	unsigned int cpu;
+	struct rq *rq;
+	int ret = 1;
+
+	if (unlikely(!sched_feat(OWNER_SPIN)))
+		return 0;
+
+	preempt_disable();
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	/*
+	 * Need to access the cpu field knowing that
+	 * DEBUG_PAGEALLOC could have unmapped it if
+	 * the mutex owner just released it and exited.
+	 */
+	if (probe_kernel_address(&owner->cpu, cpu))
+		goto out;
+#else
+	cpu = owner->cpu;
+#endif
+
+	/*
+	 * Even if the access succeeded (likely case),
+	 * the cpu field may no longer be valid.
+	 */
+	if (cpu >= nr_cpumask_bits)
+		goto out;
+
+	/*
+	 * We need to validate that we can do a
+	 * get_cpu() and that we have the percpu area.
+	 */
+	if (!cpu_online(cpu))
+		goto out;
+
+	rq = cpu_rq(cpu);
+
+	for (;;) {
+		if (lock->owner != owner)
+			break;
+
+		/*
+		 * Is that owner really running on that cpu?
+		 */
+		if (task_thread_info(rq->curr) != owner)
+			break;
+
+		if (need_resched()) {
+			ret = 0;
+			break;
+		}
+
+		cpu_relax();
+	}
+out:
+	preempt_enable_no_resched();
+	return ret;
+}
+#endif
+
 #ifdef CONFIG_PREEMPT
 /*
  * this is the entry point to schedule() from in-kernel preemption
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -330,6 +330,7 @@ extern signed long schedule_timeout_inte
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
+extern int spin_on_owner(struct mutex *lock, struct thread_info *owner);
 
 struct nsproxy;
 struct user_namespace;
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -13,3 +13,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
 SCHED_FEAT(ASYM_EFF_LOAD, 1)
 SCHED_FEAT(WAKEUP_OVERLAP, 0)
 SCHED_FEAT(LAST_BUDDY, 1)
+SCHED_FEAT(OWNER_SPIN, 1)
Steven Rostedt
2009-Jan-08 15:09 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Peter Zijlstra wrote:> Index: linux-2.6/kernel/sched.c > ==================================================================> --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -4672,6 +4672,72 @@ need_resched_nonpreemptible: > } > EXPORT_SYMBOL(schedule); > > +#ifdef CONFIG_SMP > +/* > + * Look out! "owner" is an entirely speculative pointer > + * access and not reliable. > + */ > +int spin_on_owner(struct mutex *lock, struct thread_info *owner) > +{ > + unsigned int cpu; > + struct rq *rq; > + int ret = 1; > + > + if (unlikely(!sched_feat(OWNER_SPIN)))I would remove the "unlikely", if someone turns OWNER_SPIN off, then you have the wrong decision being made. Choices by users should never be in a "likely" or "unlikely" annotation. It''s discrimination ;-)> + return 0; > + > + preempt_disable(); > +#ifdef CONFIG_DEBUG_PAGEALLOC > + /* > + * Need to access the cpu field knowing that > + * DEBUG_PAGEALLOC could have unmapped it if > + * the mutex owner just released it and exited. > + */ > + if (probe_kernel_address(&owner->cpu, cpu)) > + goto out; > +#else > + cpu = owner->cpu; > +#endif > + > + /* > + * Even if the access succeeded (likely case), > + * the cpu field may no longer be valid. > + */ > + if (cpu >= nr_cpumask_bits) > + goto out; > + > + /* > + * We need to validate that we can do a > + * get_cpu() and that we have the percpu area. > + */ > + if (!cpu_online(cpu)) > + goto out;Should we need to do a "get_cpu" or something? Couldn''t the CPU disappear between these two calls. Or does it do a stop-machine and the preempt disable will protect us? -- Steve> + > + rq = cpu_rq(cpu); > + > + for (;;) { > + if (lock->owner != owner) > + break; > + > + /* > + * Is that owner really running on that cpu? 
> + */ > + if (task_thread_info(rq->curr) != owner) > + break; > + > + if (need_resched()) { > + ret = 0; > + break; > + } > + > + cpu_relax(); > + } > +out: > + preempt_enable_no_resched(); > + return ret; > +} > +#endif > + > #ifdef CONFIG_PREEMPT > /* > * this is the entry point to schedule() from in-kernel preemption > Index: linux-2.6/include/linux/sched.h > ==================================================================> --- linux-2.6.orig/include/linux/sched.h > +++ linux-2.6/include/linux/sched.h > @@ -330,6 +330,7 @@ extern signed long schedule_timeout_inte > extern signed long schedule_timeout_killable(signed long timeout); > extern signed long schedule_timeout_uninterruptible(signed long timeout); > asmlinkage void schedule(void); > +extern int spin_on_owner(struct mutex *lock, struct thread_info *owner); > > struct nsproxy; > struct user_namespace; > Index: linux-2.6/kernel/sched_features.h > ==================================================================> --- linux-2.6.orig/kernel/sched_features.h > +++ linux-2.6/kernel/sched_features.h > @@ -13,3 +13,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1) > SCHED_FEAT(ASYM_EFF_LOAD, 1) > SCHED_FEAT(WAKEUP_OVERLAP, 0) > SCHED_FEAT(LAST_BUDDY, 1) > +SCHED_FEAT(OWNER_SPIN, 1) > > >-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Peter Zijlstra
2009-Jan-08 15:23 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 10:09 -0500, Steven Rostedt wrote:
> On Thu, 8 Jan 2009, Peter Zijlstra wrote:
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -4672,6 +4672,72 @@ need_resched_nonpreemptible:
> >  }
> >  EXPORT_SYMBOL(schedule);
> >
> > +#ifdef CONFIG_SMP
> > +/*
> > + * Look out! "owner" is an entirely speculative pointer
> > + * access and not reliable.
> > + */
> > +int spin_on_owner(struct mutex *lock, struct thread_info *owner)
> > +{
> > +	unsigned int cpu;
> > +	struct rq *rq;
> > +	int ret = 1;
> > +
> > +	if (unlikely(!sched_feat(OWNER_SPIN)))
>
> I would remove the "unlikely", if someone turns OWNER_SPIN off, then you
> have the wrong decision being made. Choices by users should never be in a
> "likely" or "unlikely" annotation. It's discrimination ;-)

in the unlikely case we schedule(), that seems expensive enough to want to make the spin case ever so slightly faster.

> > +		return 0;
> > +
> > +	preempt_disable();
> > +#ifdef CONFIG_DEBUG_PAGEALLOC
> > +	/*
> > +	 * Need to access the cpu field knowing that
> > +	 * DEBUG_PAGEALLOC could have unmapped it if
> > +	 * the mutex owner just released it and exited.
> > +	 */
> > +	if (probe_kernel_address(&owner->cpu, cpu))
> > +		goto out;
> > +#else
> > +	cpu = owner->cpu;
> > +#endif
> > +
> > +	/*
> > +	 * Even if the access succeeded (likely case),
> > +	 * the cpu field may no longer be valid.
> > +	 */
> > +	if (cpu >= nr_cpumask_bits)
> > +		goto out;
> > +
> > +	/*
> > +	 * We need to validate that we can do a
> > +	 * get_cpu() and that we have the percpu area.
> > +	 */
> > +	if (!cpu_online(cpu))
> > +		goto out;
>
> Should we need to do a "get_cpu" or something? Couldn't the CPU disappear
> between these two calls? Or does it do a stop-machine and the preempt
> disable will protect us?

Did you miss the preempt_disable() a bit up?

> > +
> > +	rq = cpu_rq(cpu);
> > +
> > +	for (;;) {
> > +		if (lock->owner != owner)
> > +			break;
> > +
> > +		/*
> > +		 * Is that owner really running on that cpu?
> > +		 */
> > +		if (task_thread_info(rq->curr) != owner)
> > +			break;
> > +
> > +		if (need_resched()) {
> > +			ret = 0;
> > +			break;
> > +		}
> > +
> > +		cpu_relax();
> > +	}
> > +out:
> > +	preempt_enable_no_resched();
> > +	return ret;
> > +}
> > +#endif
Steven Rostedt
2009-Jan-08 15:28 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Peter Zijlstra wrote:
> On Thu, 2009-01-08 at 10:09 -0500, Steven Rostedt wrote:
> > On Thu, 8 Jan 2009, Peter Zijlstra wrote:
> > > +int spin_on_owner(struct mutex *lock, struct thread_info *owner)
> > > +{
> > > +	unsigned int cpu;
> > > +	struct rq *rq;
> > > +	int ret = 1;
> > > +
> > > +	if (unlikely(!sched_feat(OWNER_SPIN)))
> >
> > I would remove the "unlikely", if someone turns OWNER_SPIN off, then you
> > have the wrong decision being made. Choices by users should never be in a
> > "likely" or "unlikely" annotation. It's discrimination ;-)
>
> in the unlikely case we schedule(), that seems expensive enough to want
> to make the spin case ever so slightly faster.

OK, that makes sense, but I would comment that. Otherwise, it just looks like another misuse of the unlikely annotation.

> > > +		return 0;
> > > +
> > > +	preempt_disable();
> > > +#ifdef CONFIG_DEBUG_PAGEALLOC
> > > +	/*
> > > +	 * Need to access the cpu field knowing that
> > > +	 * DEBUG_PAGEALLOC could have unmapped it if
> > > +	 * the mutex owner just released it and exited.
> > > +	 */
> > > +	if (probe_kernel_address(&owner->cpu, cpu))
> > > +		goto out;
> > > +#else
> > > +	cpu = owner->cpu;
> > > +#endif
> > > +
> > > +	/*
> > > +	 * Even if the access succeeded (likely case),
> > > +	 * the cpu field may no longer be valid.
> > > +	 */
> > > +	if (cpu >= nr_cpumask_bits)
> > > +		goto out;
> > > +
> > > +	/*
> > > +	 * We need to validate that we can do a
> > > +	 * get_cpu() and that we have the percpu area.
> > > +	 */
> > > +	if (!cpu_online(cpu))
> > > +		goto out;
> >
> > Should we need to do a "get_cpu" or something? Couldn't the CPU disappear
> > between these two calls? Or does it do a stop-machine and the preempt
> > disable will protect us?
>
> Did you miss the preempt_disable() a bit up?

No, let me rephrase it better. Does the preempt_disable protect against another CPU from going off line? Does taking a CPU off line do a stop_machine?

-- Steve

> > > +
> > > +	rq = cpu_rq(cpu);
> > > +
> > > +	for (;;) {
> > > +		if (lock->owner != owner)
> > > +			break;
> > > +
> > > +		/*
> > > +		 * Is that owner really running on that cpu?
> > > +		 */
> > > +		if (task_thread_info(rq->curr) != owner)
> > > +			break;
> > > +
> > > +		if (need_resched()) {
> > > +			ret = 0;
> > > +			break;
> > > +		}
> > > +
> > > +		cpu_relax();
> > > +	}
> > > +out:
> > > +	preempt_enable_no_resched();
> > > +	return ret;
> > > +}
> > > +#endif
Peter Zijlstra
2009-Jan-08 15:30 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 10:28 -0500, Steven Rostedt wrote:
> On Thu, 8 Jan 2009, Peter Zijlstra wrote:
> > in the unlikely case we schedule(), that seems expensive enough to want
> > to make the spin case ever so slightly faster.
>
> OK, that makes sense, but I would comment that. Otherwise, it just looks
> like another misuse of the unlikely annotation.

OK, sensible enough.

> > > Should we need to do a "get_cpu" or something? Couldn't the CPU disappear
> > > between these two calls? Or does it do a stop-machine and the preempt
> > > disable will protect us?
> >
> > Did you miss the preempt_disable() a bit up?
>
> No, let me rephrase it better. Does the preempt_disable protect against
> another CPU from going off line? Does taking a CPU off line do a
> stop_machine?

Yes and yes.
Steven Rostedt
2009-Jan-08 15:30 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Steven Rostedt wrote:
> > > > +	/*
> > > > +	 * We need to validate that we can do a
> > > > +	 * get_cpu() and that we have the percpu area.
> > > > +	 */
> > > > +	if (!cpu_online(cpu))
> > > > +		goto out;
> > >
> > > Should we need to do a "get_cpu" or something? Couldn't the CPU disappear
> > > between these two calls? Or does it do a stop-machine and the preempt
> > > disable will protect us?
> >
> > Did you miss the preempt_disable() a bit up?
>
> No, let me rephrase it better. Does the preempt_disable protect against
> another CPU from going off line? Does taking a CPU off line do a
> stop_machine?

I just looked at the cpu hotplug code, and it does call stop_machine. All is in order ;-)

-- Steve
Linus Torvalds
2009-Jan-08 16:11 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Peter Zijlstra wrote:
>
> This was done because the interaction between trylock_slowpath and that
> -1000 state hurt my head.

Yeah, it was a stupid hacky thing to avoid the "list_empty()", but doing it explicitly is fine. (We don't hold the lock, so the list isn't necessarily stable, but doing "list_empty()" is fine because we don't ever dereference the pointers; we just compare the pointers themselves.)

I shouldn't have done that hacky thing. It wasn't worth it.

		Linus
Linus Torvalds
2009-Jan-08 16:58 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ok, I've gone through -v7, and I'm sure you're all shocked to hear it, but I have no complaints. Except that you dropped all the good commit commentary you had earlier ;)

The patch looks pretty good (except for the big "#if 0" block in mutex-debug.c that I hope gets fixed, but I can't even really claim that I can be bothered), the locking looks fine (ie no locking at all), and the numbers seem pretty convincing.

Oh, and I think the open-coded

	atomic_cmpxchg(count, 1, 0) == 1

could possibly just be replaced with a simple __mutex_fastpath_trylock(). I dunno.

IOW, I'd actually like to take it, but let's give it at least a day or two. Do people have any concerns?

And as far as I'm concerned, the nice part about not having any locking there is that now the spinning has no impact what-so-ever on the rest of the mutex logic. There are no subtleties about any of that - it's literally about falling back to a (fairly educated) "try a few trylocks if you fail".

So it _looks_ pretty robust. I don't think there should be any subtle interactions with anything else. If the old mutexes worked, then the spinning should work.

Discussion?

		Linus
Chris Mason
2009-Jan-08 17:08 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 08:58 -0800, Linus Torvalds wrote:
>
> Ok, I've gone through -v7, and I'm sure you're all shocked to hear it, but
> I have no complaints. Except that you dropped all the good commit
> commentary you had earlier ;)

Seems to get stuck under load. I've hit it with make -j 50 on ext3 and with my btrfs benchmarking. This was against the latest git from about 5 minutes ago.

-chris

BUG: soft lockup - CPU#3 stuck for 61s! [python:3970]
CPU 3:
Modules linked in: netconsole configfs btrfs zlib_deflate loop e1000e 3w_9xxx
Pid: 3970, comm: python Not tainted 2.6.28 #1
RIP: 0010:[<ffffffff8024f4de>]  [<ffffffff8024f4de>] __cmpxchg+0x36/0x3f
RSP: 0018:ffff880148073be8  EFLAGS: 00000217
RAX: 00000000fffffff8 RBX: ffff880148073be8 RCX: 0000000000000004
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801498efdf0
RBP: ffffffff8020ce0e R08: 00000000007563b7 R09: 0000000000000000
R10: ffff880148073c88 R11: 0000000000000001 R12: ffff880148073c88
R13: 0000000000000001 R14: ffff880148073be8 R15: ffffffff8020ce0e
FS:  00007f5fb455a6e0(0000) GS:ffff88014dd368c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f5fb38181a0 CR3: 00000001483a0000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff8024f4b1>] ? __cmpxchg+0x9/0x3f
 [<ffffffff805e0068>] ? __mutex_lock_common+0x3d/0x178
 [<ffffffff805e106c>] ? _spin_lock+0x9/0x1f
 [<ffffffff802af31b>] ? __d_lookup+0x98/0xdb
 [<ffffffff805e01f2>] ? __mutex_lock_slowpath+0x19/0x1b
 [<ffffffff805e025b>] ? mutex_lock+0x23/0x3a
 [<ffffffff802a7f99>] ? do_lookup+0x85/0x162
 [<ffffffff802a993d>] ? __link_path_walk+0x4db/0x620
 [<ffffffff802a9ad5>] ? path_walk+0x53/0x9a
 [<ffffffff802a9c68>] ? do_path_lookup+0x107/0x126
 [<ffffffff802a8f74>] ? getname+0x16b/0x1ad
 [<ffffffff802aa693>] ? user_path_at+0x57/0x98
 [<ffffffff802aa6a5>] ? user_path_at+0x69/0x98
 [<ffffffff802a3773>] ? vfs_stat_fd+0x26/0x53
 [<ffffffff802b46b2>] ? mntput_no_expire+0x2f/0x125
 [<ffffffff802a3963>] ? sys_newstat+0x27/0x41
 [<ffffffff802a7cae>] ? mntput+0x1d/0x1f
 [<ffffffff802a7d3e>] ? path_put+0x22/0x26
 [<ffffffff802a36b7>] ? sys_readlinkat+0x7b/0x89
 [<ffffffff8020c4eb>] ? system_call_fastpath+0x16/0x1b
Peter Zijlstra
2009-Jan-08 17:23 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 08:58 -0800, Linus Torvalds wrote:
>
> Ok, I've gone through -v7, and I'm sure you're all shocked to hear it, but
> I have no complaints.

*cheer*, except I guess we need to figure out what goes bad for Chris.

> Except that you dropped all the good commit
> commentary you had earlier ;)

Yeah, I've yet to add that back, will do.

> The patch looks pretty good (except for the big "#if 0" block in
> mutex-debug.c that I hope gets fixed, but I can't even really claim that I
> can be bothered), the locking looks fine (ie no locking at all), and the
> numbers seem pretty convincing.
>
> Oh, and I think the open-coded
>
> 	atomic_cmpxchg(count, 1, 0) == 1
>
> could possibly just be replaced with a simple __mutex_fastpath_trylock().
> I dunno.

__mutex_fastpath_trylock() isn't always that neat -- see include/asm-generic/mutex-xchg.h -- and it's a NOP on DEBUG_MUTEXES.

Note how I used old_val for the list_empty() thing as well; we could possibly drop that extra condition though.
Steven Rostedt
2009-Jan-08 17:33 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Chris Mason wrote:
> On Thu, 2009-01-08 at 08:58 -0800, Linus Torvalds wrote:
> >
> > Ok, I've gone through -v7, and I'm sure you're all shocked to hear it, but
> > I have no complaints. Except that you dropped all the good commit
> > commentary you had earlier ;)
> >
>
> Seems to get stuck under load. I've hit it with make -j 50 on ext3 and
> with my btrfs benchmarking. This was against the latest git from about
> 5 minutes ago.
>
> -chris
>
> BUG: soft lockup - CPU#3 stuck for 61s! [python:3970]
> Modules linked in: netconsole configfs btrfs zlib_deflate loop e1000e 3w_9xxx
> Pid: 3970, comm: python Not tainted 2.6.28 #1
> Call Trace:
>  [<ffffffff8024f4b1>] ? __cmpxchg+0x9/0x3f
>  [<ffffffff805e0068>] ? __mutex_lock_common+0x3d/0x178

Hmm, looking at the code...

mutex.c:

	for (;;) {
		struct thread_info *owner;

		old_val = atomic_cmpxchg(&lock->count, 1, 0);
		if (old_val == 1) {
			lock_acquired(&lock->dep_map, ip);
			mutex_set_owner(lock);
			return 0;
		}

		if (old_val < 0 && !list_empty(&lock->wait_list))
			break;

		/* See who owns it, and spin on him if anybody */
		owner = ACCESS_ONCE(lock->owner);
		if (owner && !spin_on_owner(lock, owner))
			break;

		cpu_relax();
	}

and in sched.c:

	int spin_on_owner(struct mutex *lock, struct thread_info *owner)
	{
		unsigned int cpu;
		struct rq *rq;
		int ret = 1;
	[...]
		if (lock->owner != owner)
			break;

We keep spinning if the owner changes. I wonder, if you have many CPUs
(Chris, how many cpus did this box have?), whether you could get one task
constantly spinning while the mutex keeps changing owners on the other
CPUs.

Perhaps we should do something like:

	for (;;) {
		struct thread_info *owner = NULL;

		old_val = atomic_cmpxchg(&lock->count, 1, 0);
		if (old_val == 1) {
			lock_acquired(&lock->dep_map, ip);
			mutex_set_owner(lock);
			return 0;
		}

		if (old_val < 0 && !list_empty(&lock->wait_list))
			break;

		/* See who owns it, and spin on him if anybody */
		if (!owner)
			owner = ACCESS_ONCE(lock->owner);
		if (owner && !spin_on_owner(lock, owner))
			break;

		cpu_relax();
	}

Or just pull the assigning of the owner out of the loop. This way, we go
to sleep if the owner changes and is not NULL.

Just a thought,

-- Steve
Linus Torvalds
2009-Jan-08 17:52 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Steven Rostedt wrote:
>
> We keep spinning if the owner changes.

I think we want to - if you have multiple CPU's and a heavily contended
lock that acts as a spinlock, we still _do_ want to keep spinning even if
another CPU gets the lock.

And I don't even believe that is the bug. I suspect the bug is simpler.

I think the "need_resched()" needs to go in the outer loop, or at least
happen in the "!owner" case. Because at least with preemption, what can
happen otherwise is

 - process A gets the lock, but gets preempted before it sets lock->owner.

   End result: count = 0, owner = NULL.

 - processes B/C go into the spin loop, filling up all CPU's (assuming
   dual-core here), and will now both loop forever if they hold the kernel
   lock (or have some other preemption-disabling thing over their down()).

And all the while, process A would _happily_ set ->owner, and eventually
release the mutex, but it never gets to run to do either of them.

In fact, you might not even need a process C: all you need is for B to be
on the same runqueue as A, and having enough load on the other CPU's that
A never gets migrated away. So "C" might be in user space.

I dunno. There are probably variations on the above.

		Linus
Linus Torvalds
2009-Jan-08 18:00 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Unrelated:

On Thu, 8 Jan 2009, Chris Mason wrote:
>
> RIP: 0010:[<ffffffff8024f4de>] [<ffffffff8024f4de>] __cmpxchg+0x36/0x3f

Ouch. HOW THE HELL DID THAT NOT GET INLINED?

cmpxchg() is a _single_ instruction if it's inlined, but it's a horrible
mess of dynamic conditionals on the (constant - per call-site) size
argument if it isn't.

It looks like you probably enabled the "let gcc mess up inlining" config
option.

Ingo - I think we need to remove that crap again. Because gcc gets the
inlining horribly horribly wrong. As usual.

		Linus
Steven Rostedt
2009-Jan-08 18:03 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Linus Torvalds wrote:
>
> And I don't even believe that is the bug. I suspect the bug is simpler.
>
> I think the "need_resched()" needs to go in the outer loop, or at least
> happen in the "!owner" case. Because at least with preemption, what can
> happen otherwise is
>
>  - process A gets the lock, but gets preempted before it sets lock->owner.
>
>    End result: count = 0, owner = NULL.
>
>  - processes B/C go into the spin loop, filling up all CPU's (assuming
>    dual-core here), and will now both loop forever if they hold the kernel
>    lock (or have some other preemption-disabling thing over their down()).
>
> And all the while, process A would _happily_ set ->owner, and eventually
> release the mutex, but it never gets to run to do either of them.
>
> In fact, you might not even need a process C: all you need is for B to be
> on the same runqueue as A, and having enough load on the other CPU's that
> A never gets migrated away. So "C" might be in user space.
>
> I dunno. There are probably variations on the above.

Ouch! I think you are on to something:

	for (;;) {
		struct thread_info *owner;

		old_val = atomic_cmpxchg(&lock->count, 1, 0);
		if (old_val == 1) {
			lock_acquired(&lock->dep_map, ip);
			mutex_set_owner(lock);
			return 0;
		}

		if (old_val < 0 && !list_empty(&lock->wait_list))
			break;

		/* See who owns it, and spin on him if anybody */
		owner = ACCESS_ONCE(lock->owner);

The owner was preempted before assigning lock->owner (as you stated).

		if (owner && !spin_on_owner(lock, owner))
			break;

We just spin :-(

		cpu_relax();
	}

I think adding:

+		if (need_resched())
+			break;

would solve the problem.

Thanks,

-- Steve
Steven Rostedt
2009-Jan-08 18:14 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Steven Rostedt wrote:
> > In fact, you might not even need a process C: all you need is for B to be
> > on the same runqueue as A, and having enough load on the other CPU's that
> > A never gets migrated away. So "C" might be in user space.

You're right about not needing process C.

> >
> > I dunno. There are probably variations on the above.
>
> Ouch! I think you are on to something:
>
> 	for (;;) {
> 		struct thread_info *owner;
>
> 		old_val = atomic_cmpxchg(&lock->count, 1, 0);
> 		if (old_val == 1) {
> 			lock_acquired(&lock->dep_map, ip);
> 			mutex_set_owner(lock);
> 			return 0;
> 		}
>
> 		if (old_val < 0 && !list_empty(&lock->wait_list))
> 			break;
>
> 		/* See who owns it, and spin on him if anybody */
> 		owner = ACCESS_ONCE(lock->owner);
>
> The owner was preempted before assigning lock->owner (as you stated).

If it was the current process that preempted the owner, and these are RT
tasks pinned to the same CPU, and the owner is of lower priority than the
spinner, we have a deadlock!

Hmm, I do not think the need_resched() here will even fix that :-/

-- Steve
Linus Torvalds
2009-Jan-08 18:16 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Steven Rostedt wrote:
>
> Ouch! I think you are on to something:

Yeah, there's something there, but looking at Chris' backtrace, there's
nothing there to disable preemption. So if it was this simple case, it
should still have preempted him to let the other process run and finish
up.

So I don't think Chris' softlockup is at least _exactly_ that case.
There's something else going on too.

That said, I do think it's a mistake for us to care about the value of
"spin_on_owner()". I suspect v8 should

 - always have

	if (need_resched())
		break

   in the outer loop.

 - drop the return value from "spin_on_owner()", and just break out if
   anything changes (including the need_resched() flag).

 - I'd also drop the "old_val < 0 &&" test, and just test the
   list_empty() unconditionally.

Aim for really simple.

As to what to do about the "!owner" case - we do want to spin on it, but
the interaction with preemption is kind of nasty. I'd hesitate to make the
mutex_[un]lock() use preempt_disable() to avoid scheduling in between
getting the lock and setting the owner, though - because that would slow
down the normal fast-path case.

Maybe we should just limit the "spin on !owner" to some maximal count.

		Linus
Chris Mason
2009-Jan-08 18:27 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 10:16 -0800, Linus Torvalds wrote:
>
> On Thu, 8 Jan 2009, Steven Rostedt wrote:
> >
> > Ouch! I think you are on to something:
>
> Yeah, there's something there, but looking at Chris' backtrace, there's
> nothing there to disable preemption. So if it was this simple case, it
> should still have preempted him to let the other process run and finish
> up.
>

My .config has no lockdep or schedule debugging, and voluntary preempt.
I do have CONFIG_OPTIMIZE_INLINING on; it's a good name for trusting gcc,
I guess.

> So I don't think Chris' softlockup is at least _exactly_ that case.
> There's something else going on too.
>
> That said, I do think it's a mistake for us to care about the value of
> "spin_on_owner()". I suspect v8 should
>
>  - always have
>
> 	if (need_resched())
> 		break
>
>    in the outer loop.
>
>  - drop the return value from "spin_on_owner()", and just break out if
>    anything changes (including the need_resched() flag).
>
>  - I'd also drop the "old_val < 0 &&" test, and just test the
>    list_empty() unconditionally.
>

I'll give the above a shot, and we can address the preempt + !owner case
in another rev.

-chris
Ingo Molnar
2009-Jan-08 18:33 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Unrelated:
>
> On Thu, 8 Jan 2009, Chris Mason wrote:
> >
> > RIP: 0010:[<ffffffff8024f4de>] [<ffffffff8024f4de>] __cmpxchg+0x36/0x3f
>
> Ouch. HOW THE HELL DID THAT NOT GET INLINED?
>
> cmpxchg() is a _single_ instruction if it's inlined, but it's a horrible
> mess of dynamic conditionals on the (constant - per call-site) size
> argument if it isn't.
>
> It looks like you probably enabled the "let gcc mess up inlining" config
> option.
>
> Ingo - I think we need to remove that crap again. Because gcc gets the
> inlining horribly horribly wrong. As usual.

Apparently it messes up with asm()s: it doesn't know the contents of the
asm() and hence it over-estimates the size [based on string heuristics]
...

Which is bad - asm()s tend to be the most important entities to inline -
all over our fastpaths.

Despite that messup it's still a 1% net size win:

      text     data     bss      dec     hex filename
   7109652  1464684  802888  9377224  8f15c8 vmlinux.always-inline
   7046115  1465324  802888  9314327  8e2017 vmlinux.optimized-inlining

That win is mixed in slowpath and fastpath as well.

I see three options:

 - Disable CONFIG_OPTIMIZE_INLINING=y altogether (it's already
   default-off)

 - Change the asm() inline markers to something new like asm_inline, which
   defaults to __always_inline.

 - Just mark all asm() inline markers as __always_inline - realizing that
   these should never ever be out of line.

We might still try the second or third options, as I think we shouldn't go
back into the business of managing the inline attributes of ~100,000
kernel functions.

I'll try to annotate the inline asms (there's not _that_ many of them),
and measure what the size impact is.

	Ingo
H. Peter Anvin
2009-Jan-08 18:41 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote:
>
> Apparently it messes up with asm()s: it doesn't know the contents of the
> asm() and hence it over-estimates the size [based on string heuristics]
> ...

Right. gcc simply doesn't have any way to know how heavyweight an asm()
statement is, and it WILL do the wrong thing in many cases -- especially
the ones which involve an out-of-line recovery stub.

This is due to a fundamental design decision in gcc to not integrate the
compiler and assembler (which some compilers do.)

> Which is bad - asm()s tend to be the most important entities to inline -
> all over our fastpaths.
>
> Despite that messup it's still a 1% net size win:
>
>       text     data     bss      dec     hex filename
>    7109652  1464684  802888  9377224  8f15c8 vmlinux.always-inline
>    7046115  1465324  802888  9314327  8e2017 vmlinux.optimized-inlining
>
> That win is mixed in slowpath and fastpath as well.

The good part here is that the assembly ones really don't have much
subtlety -- a function call is at least five bytes, usually more once you
count in the register spill penalties -- so __always_inline-ing them
should still end up with numbers looking very much like the above.

> I see three options:
>
>  - Disable CONFIG_OPTIMIZE_INLINING=y altogether (it's already
>    default-off)
>
>  - Change the asm() inline markers to something new like asm_inline, which
>    defaults to __always_inline.
>
>  - Just mark all asm() inline markers as __always_inline - realizing that
>    these should never ever be out of line.
>
> We might still try the second or third options, as I think we shouldn't go
> back into the business of managing the inline attributes of ~100,000
> kernel functions.
>
> I'll try to annotate the inline asms (there's not _that_ many of them),
> and measure what the size impact is.

The main reason to do #2 over #3 would be for programmer documentation.
There simply should be no reason to ever out-of-line these. However,
documenting the reason to the programmer is a valuable thing in itself.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
> I'll try to annotate the inline asms (there's not _that_ many of them),
> and measure what the size impact is.

You can just use the patch I submitted and that you rejected for
most of them :)

-Andi

--
ak@linux.intel.com
Chris Mason
2009-Jan-08 19:02 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 13:27 -0500, Chris Mason wrote:
> On Thu, 2009-01-08 at 10:16 -0800, Linus Torvalds wrote:
> >
> > On Thu, 8 Jan 2009, Steven Rostedt wrote:
> > >
> > > Ouch! I think you are on to something:
> >
> > Yeah, there's something there, but looking at Chris' backtrace, there's
> > nothing there to disable preemption. So if it was this simple case, it
> > should still have preempted him to let the other process run and finish
> > up.
> >
>
> My .config has no lockdep or schedule debugging, and voluntary preempt.
> I do have CONFIG_OPTIMIZE_INLINING on; it's a good name for trusting gcc,
> I guess.

The patch below isn't quite what Linus suggested, but it is working here
at least. In every test I've tried so far, this is faster than the ugly
btrfs spin.

dbench v7.1: 789MB/s
dbench simple spin: 566MB/s

50 proc parallel creates v7.1: 162 files/s, avg sys: 1.6
50 proc parallel creates simple spin: 152 files/s, avg sys: 2

50 proc parallel stat v7.1: 2.3s total
50 proc parallel stat simple spin: 3.8s total

It is less fair though; the 50 proc parallel creates had a much bigger
span between the first and last proc's exit time. This isn't a huge
shock, I think it shows the hot path is closer to a real spin lock.

Here's the incremental I was using. It looks to me like most of the
things that could change inside spin_on_owner() mean we still want to
spin. The only exception is the need_resched() flag.

-chris

diff --git a/kernel/mutex.c b/kernel/mutex.c
index bd6342a..8936410 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -161,11 +161,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 			return 0;
 		}
 
-		if (old_val < 0 && !list_empty(&lock->wait_list))
+		if (!list_empty(&lock->wait_list))
 			break;
 
 		/* See who owns it, and spin on him if anybody */
 		owner = ACCESS_ONCE(lock->owner);
+		if (need_resched())
+			break;
 		if (owner && !spin_on_owner(lock, owner))
 			break;
Linus Torvalds
2009-Jan-08 19:13 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Chris Mason wrote:
>
> It is less fair though; the 50 proc parallel creates had a much bigger
> span between the first and last proc's exit time. This isn't a huge
> shock, I think it shows the hot path is closer to a real spin lock.

Actually, the real spin locks are now fair. We use ticket locks on x86.

Well, at least we do unless you enable that broken paravirt support. I'm
not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I
don't much care.

We _could_ certainly aim for using ticket locks for mutexes too, that
might be quite nice. But yes, from a throughput standpoint fairness is
almost always a bad thing, so your numbers could easily go down if we did.

		Linus
Chris Mason
2009-Jan-08 19:17 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 13:14 -0500, Steven Rostedt wrote:
>
> On Thu, 8 Jan 2009, Steven Rostedt wrote:
> > > In fact, you might not even need a process C: all you need is for B to be
> > > on the same runqueue as A, and having enough load on the other CPU's that
> > > A never gets migrated away. So "C" might be in user space.
>
> You're right about not needing process C.
>
> > >
> > > I dunno. There are probably variations on the above.
> >
> > Ouch! I think you are on to something:
> >
> > 	for (;;) {
> > 		struct thread_info *owner;
> >
> > 		old_val = atomic_cmpxchg(&lock->count, 1, 0);
> > 		if (old_val == 1) {
> > 			lock_acquired(&lock->dep_map, ip);
> > 			mutex_set_owner(lock);
> > 			return 0;
> > 		}
> >
> > 		if (old_val < 0 && !list_empty(&lock->wait_list))
> > 			break;
> >
> > 		/* See who owns it, and spin on him if anybody */
> > 		owner = ACCESS_ONCE(lock->owner);
> >
> > The owner was preempted before assigning lock->owner (as you stated).
>
> If it was the current process that preempted the owner, and these are RT
> tasks pinned to the same CPU, and the owner is of lower priority than the
> spinner, we have a deadlock!
>
> Hmm, I do not think the need_resched() here will even fix that :-/

RT tasks could go directly to sleeping. The spinner would see them on
the list and break out.

-chris
Peter Zijlstra
2009-Jan-08 19:23 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 11:13 -0800, Linus Torvalds wrote:
>
> Well, at least we do unless you enable that broken paravirt support. I'm
> not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I
> don't much care.

Because the virtual cpu that has the ticket might not get scheduled for a
while, even though another vcpu with a spinner is scheduled.

The whole (para)virt is a nightmare in that respect.
Steven Rostedt
2009-Jan-08 19:45 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Chris Mason wrote:
> On Thu, 2009-01-08 at 13:14 -0500, Steven Rostedt wrote:
> >
> > If it was the current process that preempted the owner, and these are RT
> > tasks pinned to the same CPU, and the owner is of lower priority than the
> > spinner, we have a deadlock!
> >
> > Hmm, I do not think the need_resched() here will even fix that :-/
>
> RT tasks could go directly to sleeping. The spinner would see them on
> the list and break out.

True. We could do:

	if (owner) {
		if (!spin_on_owner(lock, owner))
			break;
	} else if (rt_task(current))
		break;

That would at least solve the issue in the short term.

-- Steve
Linus Torvalds
2009-Jan-08 19:54 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Chris Mason wrote:
>
> The patch below isn't quite what Linus suggested, but it is working here
> at least. In every test I've tried so far, this is faster than the ugly
> btrfs spin.

Sadly, I don't think it's really working.

I was pretty sure that adding the unlocked loop should provably not change
the mutex lock semantics. Why? Because it's just basically equivalent to
just doing the mutex_trylock() without really changing anything really
fundamental in the mutex logic.

And that argument is sadly totally bogus. The thing is, we used to have
this guarantee that any contention would always go into the slowpath, and
then in the slow-path we serialize using the spinlock.

So I think the bug is still there, we just hid it better by breaking out
of the loop with that "if (need_resched())" always eventually triggering.
And it would be ok if it really is guaranteed to _eventually_ trigger, and
I guess with timeslices it eventually always will, but I suspect we could
have some serious latency spikes.

The problem? Setting "lock->count" to 0. That will mean that the next
"mutex_unlock()" will not necessarily enter the slowpath at all, and won't
necessarily wake things up like it should.

Normally we set lock->count to 0 after getting the lock, and only _inside_
the spinlock, and then we check the waiters after that. The comment says
it all:

	/* set it to 0 if there are no waiters left: */
	if (likely(list_empty(&lock->wait_list)))
		atomic_set(&lock->count, 0);

and the spinning case violates that rule.

Now, the spinning case only sets it to 0 if we saw it set to 1, so I think
the argument can go something like:

 - if it was 1, and we _have_ seen contention, then we know that at
   least _one_ person that set it to 1 must have gone through the unlock
   slowpath (ie it wasn't just a normal "locked increment").

 - So even if _we_ (in the spinning part of stealing that lock) didn't
   wake the waiter up, the slowpath wakeup case (that did _not_ wake
   us up, since we were spinning and hadn't added ourselves to the wait
   list) must have done so.

So maybe it's all really really safe, and we're still guaranteed to have
as many wakeups as we had go-to-sleeps. But I have to admit that my brain
hurts a bit from worrying about this.

Sleeping mutexes are not ever simple.

		Linus
Harvey Harrison
2009-Jan-08 19:59 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 19:33 +0100, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:

<snip>

> > Ingo - I think we need to remove that crap again. Because gcc gets the
> > inlining horribly horribly wrong. As usual.
>
> Apparently it messes up with asm()s: it doesn't know the contents of the
> asm() and hence it over-estimates the size [based on string heuristics]
> ...

<snip>

> That win is mixed in slowpath and fastpath as well.
>
> I see three options:
>
>  - Disable CONFIG_OPTIMIZE_INLINING=y altogether (it's already
>    default-off)

I'd like to see this; leave all the heuristics out of it. If I say inline,
I don't mean _maybe_, I mean you'd better damn well inline it.

On the other hand, alpha seems to be hand-defining inline to mean
__always_inline in their arch headers, so if this option is kept, it
should be enabled on alpha, as that is the current state of play there,
and the arch-private bit can be removed.

>  - Change the asm() inline markers to something new like asm_inline, which
>    defaults to __always_inline.

I'd suggest just making inline always mean __always_inline, and getting to
work removing inline from functions in .c files. This is probably also the
better choice to keep older gccs producing decent code.

>  - Just mark all asm() inline markers as __always_inline - realizing that
>    these should never ever be out of line.
>
> We might still try the second or third options, as I think we shouldn't go
> back into the business of managing the inline attributes of ~100,000
> kernel functions.

Or just make it clear that inline shouldn't (unless for a very good
reason) _ever_ be used in a .c file.

Cheers,

Harvey
Steven Rostedt
2009-Jan-08 20:12 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
>
> The problem? Setting "lock->count" to 0. That will mean that the next
> "mutex_unlock()" will not necessarily enter the slowpath at all, and won't
> necessarily wake things up like it should.
>
> Normally we set lock->count to 0 after getting the lock, and only _inside_
> the spinlock, and then we check the waiters after that. The comment says
> it all:
>
> 	/* set it to 0 if there are no waiters left: */
> 	if (likely(list_empty(&lock->wait_list)))
> 		atomic_set(&lock->count, 0);
>
> and the spinning case violates that rule.

The difference is that here we have it set to -1 (in the non-patched
code), and we have to decide if we should change that to 0. Changing from
-1 to 0 needs the protection of the spin locks. In the loop, we only
change from 1 to 0, which is the same as the fast path, and should not
cause any problems.

> Now, the spinning case only sets it to 0 if we saw it set to 1, so I think
> the argument can go something like:

Yep.

>  - if it was 1, and we _have_ seen contention, then we know that at
>    least _one_ person that set it to 1 must have gone through the unlock
>    slowpath (ie it wasn't just a normal "locked increment").

Correct.

>  - So even if _we_ (in the spinning part of stealing that lock) didn't
>    wake the waiter up, the slowpath wakeup case (that did _not_ wake
>    us up, since we were spinning and hadn't added ourselves to the wait
>    list) must have done so.

Agreed.

> So maybe it's all really really safe, and we're still guaranteed to have
> as many wakeups as we had go-to-sleeps. But I have to admit that my brain
> hurts a bit from worrying about this.

I do not think that the issue with the previous bug that Chris showed had
anything to do with the actual sleepers. The slow path never changes the
lock to '1', so it should not affect the spinners.

We can think of the spinners as not having true contention with the lock,
and being just like a:

	while (cond) {
		if (mutex_trylock(lock))
			goto got_the_lock;
	}

> Sleeping mutexes are not ever simple.

Now you see why in -rt we did all this in the slow path ;-)

-- Steve
Chris Mason wrote:
>
> Unresolved from this reviewing thread:
>
> * Should it be named btrfsdev? My vote is no, it is extra work for the
> distros when we finally do rename it, and I don't think btrfs really has
> the reputation for stability right now. But if Linus or Andrew would
> prefer the dev on there, I'll do it.

We know who has the last word on this. This is just additional background
for those who commented.

Using tricks such as btrfsdev, mount "unsafe", or kernel messages won't
provide a guarantee that btrfs is only used in appropriate and risk-free
ways. Those tricks also won't prevent all problems caused by booting a
broken old (or new) release. And the perceived quality and performance of
any filesystem release, even between stable versions, depends very much
on individual system configuration and use.

Before Chris posted the code, we had some btrfs concall discussions about
the best way to set user expectations on btrfs mainline 1.0. Consensus
was that the best way we could do this was to warn users when they run
mkfs.btrfs. Today I sent a patch to Chris (which he may ignore/change) so
mkfs.btrfs will say:

	WARNING! - Btrfs v0.16-39-gf9972b4 IS EXPERIMENTAL
	WARNING! - see http://btrfs.wiki.kernel.org before using

with a blank line before and after. They don't have to confirm, since
they can just mkfs a different filesystem if I scared them away. The
version is auto-generated.

jim
H. Peter Anvin
2009-Jan-09 01:42 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Andi Kleen wrote:
>> I'll try to annotate the inline asms (there's not _that_ many of them),
>> and measure what the size impact is.
>
> You can just use the patch I submitted and that you rejected for
> most of them :)

I just ran a sample build for x86-64 with gcc 4.3.0; these are all
allyesconfig builds (modulo the inlining option):

	: voreg 64 ; size o.*/vmlinux
	     text      data      bss       dec      hex filename
	 57590217  24940519 15560504  98091240  5d8c0e8 o.andi/vmlinux
	 59421552  24912223 15560504  99894279  5f44407 o.noopt/vmlinux
	 57700527  24950719 15560504  98211750  5da97a6 o.opty/vmlinux

A 3% code size difference even on allyesconfig (1.8 MB!) is nothing to
sneeze at. As shown by the delta from Andi's patch, these small assembly
stubs really do need to be annotated, since gcc simply has no way to do
anything sane with them -- it just doesn't know.

Personally, I'd like to see __asm_inline as opposed to __always_inline
for these, though, as a documentation issue: __always_inline implies to
me that this function needs to be inlined for correctness, and this could
be highly relevant if someone, for example, recodes the routine in C or
decides to bloat it out (e.g. paravirt_ops).

It's not a perfect solution even then, because gcc may choose to not
inline a higher level of inline functions for the same bogus reason.
There isn't much we can do about that, though, unless gcc either
integrates the assembler or gives us some way of injecting the actual
weight of the asm statement...

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
H. Peter Anvin
2009-Jan-09 01:44 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote:
>>
>> We might still try the second or third options, as i think we shouldnt go
>> back into the business of managing the inline attributes of ~100,000
>> kernel functions.
>
> Or just make it clear that inline shouldn't (unless for a very good reason)
> _ever_ be used in a .c file.

The question is if that would produce acceptable quality code. In theory
it should, but I'm more than wondering if it really will.

It would be ideal, of course, as it would mean less typing. I guess we
could try it out by disabling any "inline" in the current code that isn't
"__always_inline"...

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Harvey Harrison
2009-Jan-09 02:24 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 17:44 -0800, H. Peter Anvin wrote:
> Harvey Harrison wrote:
> >>
> >> We might still try the second or third options, as i think we shouldnt go
> >> back into the business of managing the inline attributes of ~100,000
> >> kernel functions.
> >
> > Or just make it clear that inline shouldn't (unless for a very good reason)
> > _ever_ be used in a .c file.
>
> The question is if that would produce acceptable quality code. In
> theory it should, but I'm more than wondering if it really will.
>
> It would be ideal, of course, as it would mean less typing. I guess we
> could try it out by disabling any "inline" in the current code that
> isn't "__always_inline"...

A lot of code was written assuming inline means __always_inline. I'd
suggest keeping that assumption and working on removing inlines that
aren't strictly necessary, as there's no way to know which inlines meant
'try to inline' and which ones really should have been __always_inline.

Not that I feel _that_ strongly about it.

Cheers,

Harvey
On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote:
> Harvey Harrison wrote:
> >>
> >> We might still try the second or third options, as i think we shouldnt go
> >> back into the business of managing the inline attributes of ~100,000
> >> kernel functions.
> >
> > Or just make it clear that inline shouldn't (unless for a very good reason)
> > _ever_ be used in a .c file.
>
> The question is if that would produce acceptable quality code. In
> theory it should, but I'm more than wondering if it really will.

I actually often use noinline when developing code simply because it
makes it easier to read oopses when gcc doesn't inline every static
(which it normally does if it only has a single caller). You know roughly
where it crashed without having to decode the line number.

I believe others do that too; I notice it's all over btrfs, for example.

-Andi

-- 
ak@linux.intel.com
Linus Torvalds
2009-Jan-09 03:42 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Andi Kleen wrote:
>
> I actually often use noinline when developing code simply because it
> makes it easier to read oopses when gcc doesn't inline every static
> (which it normally does if it only has a single caller)

Yes. Gcc inlining is a total piece of sh*t.

Gcc doesn't inline enough when we ask it to, and inlines too damn
aggressively when we don't. It seems to almost totally ignore the inline
hint.

Oh, well. The best option tends to be

 - mark things "noinline" just to make sure gcc doesn't screw up.

 - make "inline" mean "must_inline".

 - maybe add a new "maybe_inline" to be the "inline" hint that gcc uses.

because quite frankly, depending on gcc to do the right thing is not
working out.

			Linus
Linus Torvalds
2009-Jan-09 03:46 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, H. Peter Anvin wrote:
>
> Right. gcc simply doesn't have any way to know how heavyweight an
> asm() statement is

I don't think that's relevant.

First off, gcc _does_ have a perfectly fine notion of how heavy-weight an
"asm" statement is: just count it as a single instruction (and count the
argument setup cost that gcc _can_ estimate).

That would be perfectly fine. If people use inline asms, they tend to use
it for a reason.

However, I doubt that it's the inline asm that was the biggest reason why
gcc decided not to inline - it was probably the constant "switch()"
statement. The inline function actually looks pretty large, if it wasn't
for the fact that we have a constant argument, and that one makes the
switch statement go away.

I suspect gcc has some pre-inlining heuristics that don't take constant
folding and simplification into account - if you look at just the raw
tree of the function without taking the optimization into account, it
will look big.

			Linus
David Miller
2009-Jan-09 04:59 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 8 Jan 2009 19:46:30 -0800 (PST)

> First off, gcc _does_ have a perfectly fine notion of how heavy-weight an
> "asm" statement is: just count it as a single instruction (and count the
> argument setup cost that gcc _can_ estimate).

Actually, at least at one point, it counted the number of newline
characters in the assembly string to estimate how many instructions are
contained inside.

It actually needs to know exactly how many instructions are in there, to
emit proper far branches and stuff like that, for some cpus.

Since they never added an (optional) way to actually tell the compiler
this critical piece of information, I guess the newline hack is the best
they could come up with.
H. Peter Anvin
2009-Jan-09 05:00 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
>
> First off, gcc _does_ have a perfectly fine notion of how heavy-weight an
> "asm" statement is: just count it as a single instruction (and count the
> argument setup cost that gcc _can_ estimate).

True. It's not what it's doing, though. It looks for '\n' and ';'
characters, and counts the maximum instruction size for each possible
instruction.

The reason why is that gcc's size estimation is partially designed to
select what kind of branches it needs to use on architectures which have
more than one type of branches. As a result, it tends to drastically
overestimate, on purpose.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
H. Peter Anvin
2009-Jan-09 05:05 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote:
>
> A lot of code was written assuming inline means __always_inline, I'd suggest
> keeping that assumption and working on removing inlines that aren't
> strictly necessary as there's no way to know what inlines meant 'try to inline'
> and what ones really should have been __always_inline.
>
> Not that I feel _that_ strongly about it.

Actually, we have that reasonably well in hand by now. There seem to be a
couple of minor tweaks still necessary, but I think we're 90-95% there
already.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Andrew Morton
2009-Jan-09 06:00 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009 04:35:31 +0100 Andi Kleen <andi@firstfloor.org> wrote:

> On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote:
> > Harvey Harrison wrote:
> > >>
> > >> We might still try the second or third options, as i think we shouldnt go
> > >> back into the business of managing the inline attributes of ~100,000
> > >> kernel functions.
> > >
> > > Or just make it clear that inline shouldn't (unless for a very good reason)
> > > _ever_ be used in a .c file.
> > >
> >
> > The question is if that would produce acceptable quality code. In
> > theory it should, but I'm more than wondering if it really will.
>
> I actually often use noinline when developing code simply because it
> makes it easier to read oopses when gcc doesn't inline every static
> (which it normally does if it only has a single caller). You know
> roughly where it crashed without having to decode the line number.
>
> I believe others do that too, I notice it's all over btrfs for example.

Plus there is the problem where

	foo()
	{
		char a[1000];
	}

	bar()
	{
		char a[1000];
	}

	zot()
	{
		foo();
		bar();
	}

uses 2000 bytes of stack once gcc inlines both functions into zot().
Fortunately scripts/checkstack.pl can find these. If someone runs it.
With the right kconfig settings. And with the right compiler version.
On Thu, Jan 08, 2009 at 07:42:48PM -0800, Linus Torvalds wrote:
> > I actually often use noinline when developing code simply because it
> > makes it easier to read oopses when gcc doesn't inline every static
> > (which it normally does if it only has a single caller)
>
> Yes. Gcc inlining is a total piece of sh*t.

The static inlining by default (unfortunately) saves a lot of text size.
For testing I built an x86-64 allyesconfig kernel with
-fno-inline-functions-called-once (which disables the default static
inlining), and it increased text size by ~4.1% (over 2MB for an
allyesconfig kernel).

So I think we have to keep that; dropping it would cost too much :/

-Andi

-- 
ak@linux.intel.com
> foo()
> {
> 	char a[1000];
> }
>
> bar()
> {
> 	char a[1000];
> }
>
> zot()
> {
> 	foo();
> 	bar();
> }
>
> uses 2000 bytes of stack.
> And with the right compiler version.

I believe that's fixed in newer gcc versions. For old gccs we might
indeed need to add noinlines, though.

-Andi

-- 
ak@linux.intel.com
Peter Zijlstra
2009-Jan-09 09:28 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 11:13 -0800, Linus Torvalds wrote:
>
> On Thu, 8 Jan 2009, Chris Mason wrote:
> >
> > It is less fair though, the 50 proc parallel creates had a much bigger
> > span between the first and last proc's exit time. This isn't a huge
> > shock, I think it shows the hot path is closer to a real spin lock.
>
> Actually, the real spin locks are now fair. We use ticket locks on x86.
>
> We _could_ certainly aim for using ticket locks for mutexes too, that
> might be quite nice.

Not really, ticket locks cannot handle a spinner going away - and we
need that here.

I've googled around a bit and MCS locks
(http://www.cs.rice.edu/~johnmc/papers/asplos91.pdf) look like a viable
way to gain fairness in our situation.
Peter Zijlstra
2009-Jan-09 10:47 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 11:54 -0800, Linus Torvalds wrote:
> I was pretty sure that adding the unlocked loop should provably not change
> the mutex lock semantics. Why? Because it's just basically equivalent to
> just doing the mutex_trylock() without really changing anything really
> fundamental in the mutex logic.
>
> And that argument is sadly totally bogus.

It fails for the RT case, yes. It should still be true for regular tasks
- if the owner tracking was accurate.

> The thing is, we used to have this guarantee that any contention would
> always go into the slowpath, and then in the slow-path we serialize using
> the spinlock.
>
> So I think the bug is still there, we just hid it better by breaking out
> of the loop with that "if (need_resched())" always eventually triggering.
> And it would be ok if it really is guaranteed to _eventually_ trigger, and
> I guess with timeslices it eventually always will, but I suspect we could
> have some serious latency spikes.

Yes, the owner getting preempted after acquiring the lock, but before
setting the owner, can give some nasties :-(

I initially did that preempt_disable/enable around the fast path, but I
agree that slowing down the fast path is unwelcome.

Alternatively we could go back to blocking on !owner, with the added
complexity of not breaking out of the spin on lock->owner != owner when
!lock->owner, so that the premature owner clearing of the unlock fast
path will not force a schedule right before we get a chance to acquire
the lock.

Let me do that..

> The problem? Setting "lock->count" to 0. That will mean that the next
> "mutex_unlock()" will not necessarily enter the slowpath at all, and won't
> necessarily wake things up like it should.

That's exactly what __mutex_fastpath_trylock() does (or can do, depending
on the implementation), so if regular mutexes are correct in the face of
a trylock stealing the lock in front of a woken-up waiter, then we're
still good.
That said, I'm not seeing how mutexes aren't broken already:

  say A locks it:  counter 1 -> 0
  then B contends: counter 0 -> -1, added to wait list
  then C contends: counter -1, added to wait list
  then A releases: counter -1 -> 1, wake someone up, say B
  then D trylocks: counter 1 -> 0
                   so B is back on the wait list
  then D releases: 0 -> 1, no wakeup

Aaah, B going back to sleep sets it to -1.

Therefore, I think we're good.
Ingo Molnar
2009-Jan-09 13:00 UTC
[patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, 8 Jan 2009, H. Peter Anvin wrote:
> >
> > Right. gcc simply doesn't have any way to know how heavyweight an
> > asm() statement is
>
> I don't think that's relevant.
>
> First off, gcc _does_ have a perfectly fine notion of how heavy-weight
> an "asm" statement is: just count it as a single instruction (and count
> the argument setup cost that gcc _can_ estimate).
>
> That would be perfectly fine. If people use inline asms, they tend to
> use it for a reason.
>
> However, I doubt that it's the inline asm that was the biggest reason
> why gcc decided not to inline - it was probably the constant "switch()"
> statement. The inline function actually looks pretty large, if it wasn't
> for the fact that we have a constant argument, and that one makes the
> switch statement go away.
>
> I suspect gcc has some pre-inlining heuristics that don't take constant
> folding and simplification into account - if you look at just the raw
> tree of the function without taking the optimization into account, it
> will look big.

Yeah. In my tests GCC 4.3.2 does properly inline that particular asm.

But because we cannot really trust GCC in this area yet (or at all),
today i've conducted extensive tests measuring GCC's interaction with
inlined asm statements. I built hundreds of vmlinux's and distilled the
numbers. Here are my findings.

Firstly, i've written 40 patches that gradually add an __asm_inline
annotation to all inline functions of arch/x86/include/asm/*.h - i've
annotated 224 inline functions that way.

Then i've conducted an x86 defconfig and an x86 allyesconfig test for
each of the 40 patches. Realizing that there's asymmetry in the inlining
practices of drivers and kernel code, i've also conducted a separate
series of tests: measuring the size increase/decrease of the core
kernel, kernel/built-in.o.
The first table contains the size numbers of kernel/built-in.o, on
allyesconfig builds, and the size delta (and percentage):

( Object size tests on kernel: v2.6.28-7939-g2150edc
(          using: gcc (GCC) 4.3.2 20081007 (Red Hat 4.3.2-6)
(         target: kernel/ kernel/built-in.o
(         config: x86.64.allyes

 name                                   text-size  (  delta) (     pct)
 -----------------------------------------------------------------------------
 always-inline.patch:                     1039591  (      0) (  0.000%)
 optimize-inlining.patch:                  967724  ( -71867) ( -7.426%)
 asm-inline-0.patch:                       967724  (      0) (  0.000%)
 asm-inline-bitops-simple.patch:           967691  (    -33) ( -0.003%)
 asm-inline-bitops-constant-set.patch:     966947  (   -744) ( -0.077%)
 asm-inline-bitops-test-and-set.patch:     966735  (   -212) ( -0.022%)
 asm-inline-bitops-ffs.patch:              966735  (      0) (  0.000%)
 asm-inline-__ffs.h:                       966735  (      0) (  0.000%)
 asm-inline-__fls.h:                       966735  (      0) (  0.000%)
 asm-inline-fls.h:                         966735  (      0) (  0.000%)
 asm-inline-fls64.h:                       966735  (      0) (  0.000%)
 asm-inline-apic.h:                        966735  (      0) (  0.000%)
 asm-inline-atomic_32.h:                   966735  (      0) (  0.000%)
 asm-inline-atomic_64.h:                   966735  (      0) (  0.000%)
 asm-inline-checksum_32.h:                 966735  (      0) (  0.000%)
 asm-inline-checksum_64.h:                 966735  (      0) (  0.000%)
 asm-inline-cmpxchg_32.h:                  966735  (      0) (  0.000%)
 asm-inline-cmpxchg_64.h:                  966735  (      0) (  0.000%)
 asm-inline-desc.h:                        966735  (      0) (  0.000%)
 asm-inline-futex.h:                       966735  (      0) (  0.000%)
 asm-inline-i387.h:                        966735  (      0) (  0.000%)
 asm-inline-io.h:                          966735  (      0) (  0.000%)
 asm-inline-io_32.h:                       966735  (      0) (  0.000%)
 asm-inline-io_64.h:                       966735  (      0) (  0.000%)
 asm-inline-irqflags.h:                    966735  (      0) (  0.000%)
 asm-inline-kexec.h:                       966735  (      0) (  0.000%)
 asm-inline-kgdb.h:                        966735  (      0) (  0.000%)
 asm-inline-kvm_host.h:                    966735  (      0) (  0.000%)
 asm-inline-kvm_para.h:                    966735  (      0) (  0.000%)
 asm-inline-lguest_hcall.h:                966735  (      0) (  0.000%)
 asm-inline-local.h:                       966735  (      0) (  0.000%)
 asm-inline-msr.h:                         966735  (      0) (  0.000%)
 asm-inline-paravirt.h:                    966735  (      0) (  0.000%)
 asm-inline-pci_x86.h:                     966735  (      0) (  0.000%)
 asm-inline-processor.h:                   966735  (      0) (  0.000%)
 asm-inline-rwsem.h:                       966735  (      0) (  0.000%)
 asm-inline-signal.h:                      966735  (      0) (  0.000%)
 asm-inline-spinlock.h:                    966735  (      0) (  0.000%)
 asm-inline-string_32.h:                   966735  (      0) (  0.000%)
 asm-inline-swab.h:                        966735  (      0) (  0.000%)
 asm-inline-sync_bitops.h:                 966735  (      0) (  0.000%)
 asm-inline-system.h:                      966735  (      0) (  0.000%)
 asm-inline-system_64.h:                   966735  (      0) (  0.000%)
 asm-inline-thread_info.h:                 966735  (      0) (  0.000%)
 asm-inline-tlbflush.h:                    966735  (      0) (  0.000%)
 asm-inline-xcr.h:                         966735  (      0) (  0.000%)
 asm-inline-xsave.h:                       966735  (      0) (  0.000%)

[ The patch names reflect the include file names where i did the changes. ]

There are two surprising results. Firstly, the CONFIG_OPTIMIZE_INLINING=y
build is 7.4% more compact than the "always inline as instructed by
kernel hackers" build:

 optimize-inlining.patch:                  967724  ( -71867) ( -7.426%)

Secondly: out of 40 patches, only three make an actual vmlinux object
size difference (!):

 asm-inline-bitops-simple.patch:           967691  (    -33) ( -0.003%)
 asm-inline-bitops-constant-set.patch:     966947  (   -744) ( -0.077%)
 asm-inline-bitops-test-and-set.patch:     966735  (   -212) ( -0.022%)

And those three have a combined effect of 0.1% - noise compared to the
7.4% that inline optimization already brought us. [it's still worth
improving.]
I have also conducted full x86.64.defconfig vmlinux builds for each of
the patches, and tabulated the size changes:

( Object size tests on kernel: v2.6.28-7939-g2150edc
(          using: gcc (GCC) 4.3.2 20081007 (Red Hat 4.3.2-6)
(         target: vmlinux vmlinux
(         config: x86.64.defconfig

 name                                   text-size  (  delta) (     pct)
 -----------------------------------------------------------------------------
 always-inline.patch:                     7045187  (      0) (  0.000%)
 optimize-inlining.patch:                 6987053  ( -58134) ( -0.832%)
 asm-inline-0.patch:                      6987053  (      0) (  0.000%)
 asm-inline-bitops-simple.patch:          6987053  (      0) (  0.000%)
 asm-inline-bitops-constant-set.patch:    6961126  ( -25927) ( -0.372%)
 asm-inline-bitops-test-and-set.patch:    6961126  (      0) (  0.000%)
 asm-inline-bitops-ffs.patch:             6961126  (      0) (  0.000%)
 asm-inline-__ffs.h:                      6961126  (      0) (  0.000%)
 asm-inline-__fls.h:                      6961126  (      0) (  0.000%)
 asm-inline-fls.h:                        6961126  (      0) (  0.000%)
 asm-inline-fls64.h:                      6961126  (      0) (  0.000%)
 asm-inline-apic.h:                       6961126  (      0) (  0.000%)
 asm-inline-atomic_32.h:                  6961126  (      0) (  0.000%)
 asm-inline-atomic_64.h:                  6961126  (      0) (  0.000%)
 asm-inline-checksum_32.h:                6961126  (      0) (  0.000%)
 asm-inline-checksum_64.h:                6961126  (      0) (  0.000%)
 asm-inline-cmpxchg_32.h:                 6961126  (      0) (  0.000%)
 asm-inline-cmpxchg_64.h:                 6961126  (      0) (  0.000%)
 asm-inline-desc.h:                       6961126  (      0) (  0.000%)
 asm-inline-futex.h:                      6961126  (      0) (  0.000%)
 asm-inline-i387.h:                       6961126  (      0) (  0.000%)
 asm-inline-io.h:                         6961126  (      0) (  0.000%)
 asm-inline-io_32.h:                      6961126  (      0) (  0.000%)
 asm-inline-io_64.h:                      6961126  (      0) (  0.000%)
 asm-inline-irqflags.h:                   6961126  (      0) (  0.000%)
 asm-inline-kexec.h:                      6961126  (      0) (  0.000%)
 asm-inline-kgdb.h:                       6961126  (      0) (  0.000%)
 asm-inline-kvm_host.h:                   6961126  (      0) (  0.000%)
 asm-inline-kvm_para.h:                   6961126  (      0) (  0.000%)
 asm-inline-lguest_hcall.h:               6961126  (      0) (  0.000%)
 asm-inline-local.h:                      6961126  (      0) (  0.000%)
 asm-inline-msr.h:                        6961126  (      0) (  0.000%)
 asm-inline-paravirt.h:                   6961126  (      0) (  0.000%)
 asm-inline-pci_x86.h:                    6961126  (      0) (  0.000%)
 asm-inline-processor.h:                  6961126  (      0) (  0.000%)
 asm-inline-rwsem.h:                      6961126  (      0) (  0.000%)
 asm-inline-signal.h:                     6961126  (      0) (  0.000%)
 asm-inline-spinlock.h:                   6961126  (      0) (  0.000%)
 asm-inline-string_32.h:                  6961126  (      0) (  0.000%)
 asm-inline-swab.h:                       6961126  (      0) (  0.000%)
 asm-inline-sync_bitops.h:                6961126  (      0) (  0.000%)
 asm-inline-system.h:                     6961126  (      0) (  0.000%)
 asm-inline-system_64.h:                  6961126  (      0) (  0.000%)
 asm-inline-thread_info.h:                6961126  (      0) (  0.000%)
 asm-inline-tlbflush.h:                   6961126  (      0) (  0.000%)
 asm-inline-xcr.h:                        6961126  (      0) (  0.000%)
 asm-inline-xsave.h:                      6961126  (      0) (  0.000%)

This is a surprising result too: the manual annotation of asm inlines
(totally against my expectations) made even less of a difference.

Here are the allyesconfig vmlinux builds as well:

( Object size tests on kernel: v2.6.28-7939-g2150edc
(          using: gcc (GCC) 4.3.2 20081007 (Red Hat 4.3.2-6)
(       building: vmlinux vmlinux
(         config: x86.64.allyes

 name                                   text-size  (   delta) (     pct)
 -----------------------------------------------------------------------------
 always-inline.patch:                    59089458  (       0) (  0.000%)
 optimize-inlining.patch:                57363721  (-1725737) ( -3.008%)
 asm-inline-combo.patch:                 57254254  ( -109467) ( -0.191%)

Similar pattern: optimize-inlining helps a lot, the manual annotations
only a tiny bit more. [no finegrained results here - it would take me a
week to finish a hundred allyesconfig builds. But the results line up
with the kernel/built-in.o allyesconfig results.]

Another surprise to me was that, according to the numbers, the core
kernel appears to be much worse with its inlining practices (at least in
terms of absolute text size) than the full kernel.

I found these numbers hard to believe, so i double-checked them against
source-level metrics: i compared the number of 'inline' keywords against
object size, for kernel/built-in.o versus drivers/built-in.o (on
allyesconfig). The results are:

                      nr-inlines    .o size   inline-frequency (bytes)
 kernel/built-in.o:          529    1177230       2225
 drivers/built-in.o:        9105   27933205       3067

i.e. in the driver space we get an inline function for every ~3K bytes
of code, while in the core kernel we get an inline function for every
~2K bytes of code.

One interpretation of the numbers would be that core kernel hackers are
more inline-happy, maybe because they think that their functions are
more important to inline. Which is generally a fair initial assumption,
but according to the numbers it does not appear to pay off in practice,
as it does not result in a smaller kernel image. (Driver authors tend to
be either more disciplined - or more lazy in that respect.)

One final note. One might ask which patch it was that made the largest
measurable impact on the allyesconfig build and also showed up in the
defconfig builds:

 asm-inline-bitops-constant-set.patch:    6961126  ( -25927) ( -0.372%)

See that patch below; it is another surprise: a single inline function.
[ GCC uninlines this one as its inliner probably got lost in the
__builtin_constant_p() trick we are doing there. ]

It is a central bitop primitive (test_bit()), so it made a relatively
large difference of 0.3% to the defconfig build. So out of the 224
functions i annotated manually, only one was inlined incorrectly by GCC
(that matters to x86 defconfig).

Note that these numbers match hpa's 4.3.0-based allyesconfig numbers
very closely:

| : voreg 64 ; size o.*/vmlinux
|     text     data      bss       dec      hex filename
| 57590217 24940519 15560504  98091240  5d8c0e8 o.andi/vmlinux
| 59421552 24912223 15560504  99894279  5f44407 o.noopt/vmlinux
| 57700527 24950719 15560504  98211750  5da97a6 o.opty/vmlinux

So the conclusion: GCC 4.3.x appears to do an acceptable job at inlining
(if not, i'd like to see the specific .config where it screws up). We've
also seen a few clear examples of earlier GCC versions screwing up with
inlining.

So my inclination would be: either mark CONFIG_OPTIMIZE_INLINING as
CONFIG_BROKEN, or limit CONFIG_OPTIMIZE_INLINING to GCC 4.3.x and later
versions only.
I'm no fan of the GCC inliner, but the latter seems to be the more
rational choice to me - the size win from inline-optimization is
significant: 1% for defconfig, 3% for allyesconfig and 7.5% for the core
kernel (allyesconfig).

	Ingo

---------------------->
Subject: asm: inline bitops constant set
From: Ingo Molnar <mingo@elte.hu>
Date: Fri Jan 09 11:43:20 CET 2009

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/bitops.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/arch/x86/include/asm/bitops.h
===================================================================
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -300,7 +300,8 @@ static inline int test_and_change_bit(in
 	return oldbit;
 }
 
-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static __asm_inline int
+constant_test_bit(int nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr % BITS_PER_LONG)) &
 		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
Chris Mason
2009-Jan-09 13:05 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 04:35 +0100, Andi Kleen wrote:
> On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote:
> > Harvey Harrison wrote:
> > >>
> > >> We might still try the second or third options, as i think we shouldnt go
> > >> back into the business of managing the inline attributes of ~100,000
> > >> kernel functions.
> > >
> > > Or just make it clear that inline shouldn't (unless for a very good reason)
> > > _ever_ be used in a .c file.
> > >
> >
> > The question is if that would produce acceptable quality code. In
> > theory it should, but I'm more than wondering if it really will.
>
> I actually often use noinline when developing code simply because it
> makes it easier to read oopses when gcc doesn't inline every static
> (which it normally does if it only has a single caller). You know
> roughly where it crashed without having to decode the line number.
>
> I believe others do that too, I notice it's all over btrfs for example.

For btrfs it was mostly about stack size at first. I'd use checkstack.pl
and then run through the big funcs and figure out how they got so huge.
It was almost always because gcc was inlining something it shouldn't, so
I started using it on most funcs.

-chris
Ingo Molnar
2009-Jan-09 13:37 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* H. Peter Anvin <hpa@zytor.com> wrote:

> Andi Kleen wrote:
> >> I'll try to annotate the inline asms (there's not _that_ many of them),
> >> and measure what the size impact is.
> >
> > You can just use the patch I submitted and that you rejected for
> > most of them :)
>
> I just ran a sample build for x86-64 with gcc 4.3.0, these all
> allyesconfig builds (modulo the inlining option):
>
> : voreg 64 ; size o.*/vmlinux
>     text     data      bss       dec      hex filename
> 59421552 24912223 15560504  99894279  5f44407 o.noopt/vmlinux
> 57700527 24950719 15560504  98211750  5da97a6 o.opty/vmlinux
> 57590217 24940519 15560504  98091240  5d8c0e8 o.andi/vmlinux
>
> A 3% code size difference even on allyesconfig (1.8 MB!) is nothing to
> sneeze at. As shown by the delta from Andi's patch, these small
> assembly stubs really do need to be annotated, since gcc simply has no
> way to do anything sane with them -- it just doesn't know.

I've done a finegrained size analysis today (see my other mail in this
thread), and it turns out that on gcc 4.3.x the main (and pretty much
only) inlining annotation that matters in arch/x86/include/asm/*.h is
the one-liner patch attached below, annotating constant_test_bit().

That change is included in Andi's patch too AFAICS - i.e. just that
single hunk from Andi's patch would have given you 90% of the size win:
an additional 0.17% size win on top of the 3.00% that
CONFIG_OPTIMIZE_INLINING=y already brings.

The second patch below had some (much smaller, 0.01%) impact too.

All the other annotations i did to hundreds of inlined asm()s had no
measurable effect on GCC 4.3.2. (i.e. gcc appears to inline
single-statement asms correctly)

[ On older GCC it might matter more, but there we can/should turn off
  CONFIG_OPTIMIZE_INLINING. ]

> Personally, I'd like to see __asm_inline as opposed to __always_inline
> for these, though, as a documentation issue: __always_inline implies to
> me that this function needs to be inlined for correctness, and this
> could be highly relevant if someone, for example, recodes the routine in
> C or decides to bloat it out (e.g. paravirt_ops).

Yeah. I've implemented __asm_inline today. It indeed documents the
reason for the annotation in a cleaner way than slapping __always_inline
around and diluting the quality of __always_inline annotations.

> It's not a perfect solution even then, because gcc may choose to not
> inline a higher level of inline functions for the same bogus reason.
> There isn't much we can do about that, though, unless gcc either
> integrates the assembler, or gives us some way of injecting the actual
> weight of the asm statement...

Yeah.

	Ingo

---
 arch/x86/include/asm/bitops.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/arch/x86/include/asm/bitops.h
===================================================================
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -300,7 +300,8 @@ static inline int test_and_change_bit(in
 	return oldbit;
 }
 
-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static __asm_inline int
+constant_test_bit(int nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr % BITS_PER_LONG)) &
 		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;

---
 arch/x86/include/asm/bitops.h |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/arch/x86/include/asm/bitops.h
===================================================================
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -53,7 +53,7 @@
  * Note that @nr may be almost arbitrarily large; this function is not
  * restricted to acting on a single-word quantity.
  */
-static inline void set_bit(unsigned int nr, volatile unsigned long *addr)
+static __asm_inline void set_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	if (IS_IMMEDIATE(nr)) {
 		asm volatile(LOCK_PREFIX "orb %1,%0"
@@ -75,7 +75,7 @@ static inline void set_bit(unsigned int
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static inline void __set_bit(int nr, volatile unsigned long *addr)
+static __asm_inline void __set_bit(int nr, volatile unsigned long *addr)
 {
 	asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
 }
@@ -90,7 +90,7 @@ static inline void __set_bit(int nr, vol
  * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
  * in order to ensure changes are visible on other processors.
  */
-static inline void clear_bit(int nr, volatile unsigned long *addr)
+static __asm_inline void clear_bit(int nr, volatile unsigned long *addr)
 {
 	if (IS_IMMEDIATE(nr)) {
 		asm volatile(LOCK_PREFIX "andb %1,%0"
@@ -117,7 +117,7 @@ static inline void clear_bit_unlock(unsi
 	clear_bit(nr, addr);
 }
 
-static inline void __clear_bit(int nr, volatile unsigned long *addr)
+static __asm_inline void __clear_bit(int nr, volatile unsigned long *addr)
 {
 	asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
 }
@@ -152,7 +152,7 @@ static inline void __clear_bit_unlock(un
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static inline void __change_bit(int nr, volatile unsigned long *addr)
+static __asm_inline void __change_bit(int nr, volatile unsigned long *addr)
 {
 	asm volatile("btc %1,%0" : ADDR : "Ir" (nr));
 }
@@ -166,7 +166,7 @@ static inline void __change_bit(int nr,
  * Note that @nr may be almost arbitrarily large; this function is not
  * restricted to acting on a single-word quantity.
  */
-static inline void change_bit(int nr, volatile unsigned long *addr)
+static __asm_inline void change_bit(int nr, volatile unsigned long *addr)
 {
 	if (IS_IMMEDIATE(nr)) {
 		asm volatile(LOCK_PREFIX "xorb %1,%0"
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
jim owens
2009-Jan-09 14:03 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Ingo Molnar wrote:
> One interpretation of the numbers would be that core kernel hackers are
> more inline-happy, maybe because they think that their functions are more
> important to inline.
>
> Which is generally a fair initial assumption, but according to the numbers
> it does not appear to pay off in practice as it does not result in a
> smaller kernel image.

I think people over-use inline for the opposite reason. They are taught:

 - use inline functions instead of macros
 - inlining functions makes your code run faster

They also know inlining may increase program object size. That inlining
will reduce object size on many architectures if the function is small
is just a happy side effect to them.

jim
Peter Zijlstra
2009-Jan-09 15:06 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 11:47 +0100, Peter Zijlstra wrote:
> > So I think the bug is still there, we just hid it better by breaking out
> > of the loop with that "if (need_resched())" always eventually triggering.
> > And it would be ok if it really is guaranteed to _eventually_ trigger, and
> > I guess with timeslices it eventually always will, but I suspect we could
> > have some serious latency spikes.
>
> Yes, the owner getting preempted after acquiring the lock, but before
> setting the owner can give some nasties :-(
>
> I initially did that preempt_disable/enable around the fast path, but I
> agree that slowing down the fast path is unwelcome.
>
> Alternatively we could go back to block on !owner, with the added
> complexity of not breaking out of the spin on lock->owner != owner
> when !lock->owner, so that the premature owner clearing of the unlock
> fast path will not force a schedule right before we get a chance to
> acquire the lock.
>
> Let me do that..

Ok, a few observations..

Adding that need_resched() in the outer loop utterly destroys the
performance gain for PREEMPT=y. Voluntary preemption is mostly good, but
somewhat unstable results.

Adding that blocking on !owner utterly destroys everything.

Going to look into where that extra preemption comes from.
Chris Mason
2009-Jan-09 15:11 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 16:06 +0100, Peter Zijlstra wrote:
> On Fri, 2009-01-09 at 11:47 +0100, Peter Zijlstra wrote:
> > > So I think the bug is still there, we just hid it better by breaking out
> > > of the loop with that "if (need_resched())" always eventually triggering.
> > > And it would be ok if it really is guaranteed to _eventually_ trigger, and
> > > I guess with timeslices it eventually always will, but I suspect we could
> > > have some serious latency spikes.
> >
> > Yes, the owner getting preempted after acquiring the lock, but before
> > setting the owner can give some nasties :-(
> >
> > I initially did that preempt_disable/enable around the fast path, but I
> > agree that slowing down the fast path is unwelcome.
> >
> > Alternatively we could go back to block on !owner, with the added
> > complexity of not breaking out of the spin on lock->owner != owner
> > when !lock->owner, so that the premature owner clearing of the unlock
> > fast path will not force a schedule right before we get a chance to
> > acquire the lock.
> >
> > Let me do that..
>
> Ok a few observations..
>
> Adding that need_resched() in the outer loop utterly destroys the
> performance gain for PREEMPT=y. Voluntary preemption is mostly good, but
> somewhat unstable results.

How about:

	if (!owner && need_resched())
		break;

instead of the unconditional need_resched()? That should solve the race
that Linus saw and hurt PREEMPT less.

> Adding that blocking on !owner utterly destroys everything.
>
> Going to look into where that extra preemption comes from.

-chris
Ingo Molnar
2009-Jan-09 15:25 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
* Ingo Molnar <mingo@elte.hu> wrote:

> > I suspect gcc has some pre-inlining heuristics that don't take
> > constant folding and simplification into account - if you look at just
> > the raw tree of the function without taking the optimization into
> > account, it will look big.
>
> Yeah. In my tests GCC 4.3.2 does properly inline that particular asm.

Here is how that function ended up looking like here (with Peter's v7,
which should be close to what Chris tried, looking at his crash dump):

ffffffff81491347 <__mutex_lock_common>:
ffffffff81491347:	55                   	push   %rbp
ffffffff81491348:	48 89 e5             	mov    %rsp,%rbp
ffffffff8149134b:	41 57                	push   %r15
ffffffff8149134d:	4c 8d 7f 08          	lea    0x8(%rdi),%r15
ffffffff81491351:	41 56                	push   %r14
ffffffff81491353:	49 89 f6             	mov    %rsi,%r14
ffffffff81491356:	41 55                	push   %r13
ffffffff81491358:	41 54                	push   %r12
ffffffff8149135a:	4c 8d 67 18          	lea    0x18(%rdi),%r12
ffffffff8149135e:	53                   	push   %rbx
ffffffff8149135f:	48 89 fb             	mov    %rdi,%rbx
ffffffff81491362:	48 83 ec 38          	sub    $0x38,%rsp
ffffffff81491366:	65 4c 8b 2c 25 00 00 	mov    %gs:0x0,%r13
ffffffff8149136d:	00 00
ffffffff8149136f:	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff81491374:	31 d2                	xor    %edx,%edx
ffffffff81491376:	f0 0f b1 13          	lock cmpxchg %edx,(%rbx)
ffffffff8149137a:	83 f8 01             	cmp    $0x1,%eax
ffffffff8149137d:	75 18                	jne    ffffffff81491397 <__mutex_lock_common+0x50>
ffffffff8149137f:	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
ffffffff81491386:	00 00
ffffffff81491388:	48 2d d8 1f 00 00    	sub    $0x1fd8,%rax
ffffffff8149138e:	48 89 43 18          	mov    %rax,0x18(%rbx)
ffffffff81491392:	e9 dd 00 00 00       	jmpq   ffffffff81491474 <__mutex_lock_common+0x12d>
ffffffff81491397:	85 c0                	test   %eax,%eax
ffffffff81491399:	79 06                	jns    ffffffff814913a1 <__mutex_lock_common+0x5a>
ffffffff8149139b:	4c 39 7b 08          	cmp    %r15,0x8(%rbx)
ffffffff8149139f:	75 19                	jne    ffffffff814913ba <__mutex_lock_common+0x73>
ffffffff814913a1:	49 8b 34 24          	mov    (%r12),%rsi
ffffffff814913a5:	48 85 f6             	test   %rsi,%rsi
ffffffff814913a8:	74 0c                	je     ffffffff814913b6 <__mutex_lock_common+0x6f>
ffffffff814913aa:	48 89 df             	mov    %rbx,%rdi
ffffffff814913ad:	e8 6c dd b9 ff       	callq  ffffffff8102f11e <spin_on_owner>
ffffffff814913b2:	85 c0                	test   %eax,%eax
ffffffff814913b4:	74 04                	je     ffffffff814913ba <__mutex_lock_common+0x73>
ffffffff814913b6:	f3 90                	pause
ffffffff814913b8:	eb b5                	jmp    ffffffff8149136f <__mutex_lock_common+0x28>
ffffffff814913ba:	4c 8d 63 04          	lea    0x4(%rbx),%r12
ffffffff814913be:	4c 89 e7             	mov    %r12,%rdi
ffffffff814913c1:	e8 d2 0f 00 00       	callq  ffffffff81492398 <_spin_lock>
ffffffff814913c6:	48 8b 53 10          	mov    0x10(%rbx),%rdx
ffffffff814913ca:	48 8d 45 b0          	lea    -0x50(%rbp),%rax
ffffffff814913ce:	4c 89 7d b0          	mov    %r15,-0x50(%rbp)
ffffffff814913d2:	48 89 43 10          	mov    %rax,0x10(%rbx)
ffffffff814913d6:	48 89 02             	mov    %rax,(%rdx)
ffffffff814913d9:	48 89 55 b8          	mov    %rdx,-0x48(%rbp)
ffffffff814913dd:	48 83 ca ff          	or     $0xffffffffffffffff,%rdx
ffffffff814913e1:	4c 89 6d c0          	mov    %r13,-0x40(%rbp)
ffffffff814913e5:	48 89 d0             	mov    %rdx,%rax
ffffffff814913e8:	87 03                	xchg   %eax,(%rbx)
ffffffff814913ea:	ff c8                	dec    %eax
ffffffff814913ec:	74 55                	je     ffffffff81491443 <__mutex_lock_common+0xfc>
ffffffff814913ee:	44 88 f0             	mov    %r14b,%al
ffffffff814913f1:	44 89 f2             	mov    %r14d,%edx
ffffffff814913f4:	83 e0 81             	and    $0xffffffffffffff81,%eax
ffffffff814913f7:	83 e2 01             	and    $0x1,%edx
ffffffff814913fa:	88 45 af             	mov    %al,-0x51(%rbp)
ffffffff814913fd:	89 55 a8             	mov    %edx,-0x58(%rbp)
ffffffff81491400:	48 83 c8 ff          	or     $0xffffffffffffffff,%rax
ffffffff81491404:	87 03                	xchg   %eax,(%rbx)
ffffffff81491406:	ff c8                	dec    %eax
ffffffff81491408:	74 39                	je     ffffffff81491443 <__mutex_lock_common+0xfc>
ffffffff8149140a:	80 7d af 00          	cmpb   $0x0,-0x51(%rbp)
ffffffff8149140e:	74 1c                	je     ffffffff8149142c <__mutex_lock_common+0xe5>
ffffffff81491410:	49 8b 45 08          	mov    0x8(%r13),%rax
ffffffff81491414:	f6 40 10 04          	testb  $0x4,0x10(%rax)
ffffffff81491418:	74 12                	je     ffffffff8149142c <__mutex_lock_common+0xe5>
ffffffff8149141a:	83 7d a8 00          	cmpl   $0x0,-0x58(%rbp)
ffffffff8149141e:	75 65                	jne    ffffffff81491485 <__mutex_lock_common+0x13e>
ffffffff81491420:	4c 89 ef             	mov    %r13,%rdi
ffffffff81491423:	e8 e1 2e bb ff       	callq  ffffffff81044309 <__fatal_signal_pending>
ffffffff81491428:	85 c0                	test   %eax,%eax
ffffffff8149142a:	75 59                	jne    ffffffff81491485 <__mutex_lock_common+0x13e>
ffffffff8149142c:	4d 89 75 00          	mov    %r14,0x0(%r13)
ffffffff81491430:	41 fe 04 24          	incb   (%r12)
ffffffff81491434:	e8 e9 ef ff ff       	callq  ffffffff81490422 <schedule>
ffffffff81491439:	4c 89 e7             	mov    %r12,%rdi
ffffffff8149143c:	e8 57 0f 00 00       	callq  ffffffff81492398 <_spin_lock>
ffffffff81491441:	eb bd                	jmp    ffffffff81491400 <__mutex_lock_common+0xb9>
ffffffff81491443:	48 8b 45 b8          	mov    -0x48(%rbp),%rax
ffffffff81491447:	48 8b 55 b0          	mov    -0x50(%rbp),%rdx
ffffffff8149144b:	48 89 10             	mov    %rdx,(%rax)
ffffffff8149144e:	48 89 42 08          	mov    %rax,0x8(%rdx)
ffffffff81491452:	65 48 8b 04 25 10 00 	mov    %gs:0x10,%rax
ffffffff81491459:	00 00
ffffffff8149145b:	48 2d d8 1f 00 00    	sub    $0x1fd8,%rax
ffffffff81491461:	4c 39 7b 08          	cmp    %r15,0x8(%rbx)
ffffffff81491465:	48 89 43 18          	mov    %rax,0x18(%rbx)
ffffffff81491469:	75 06                	jne    ffffffff81491471 <__mutex_lock_common+0x12a>
ffffffff8149146b:	c7 03 00 00 00 00    	movl   $0x0,(%rbx)
ffffffff81491471:	fe 43 04             	incb   0x4(%rbx)
ffffffff81491474:	31 c0                	xor    %eax,%eax
ffffffff81491476:	48 83 c4 38          	add    $0x38,%rsp
ffffffff8149147a:	5b                   	pop    %rbx
ffffffff8149147b:	41 5c                	pop    %r12
ffffffff8149147d:	41 5d                	pop    %r13
ffffffff8149147f:	41 5e                	pop    %r14
ffffffff81491481:	41 5f                	pop    %r15
ffffffff81491483:	c9                   	leaveq
ffffffff81491484:	c3                   	retq
ffffffff81491485:	48 8b 55 b0          	mov    -0x50(%rbp),%rdx
ffffffff81491489:	48 8b 45 b8          	mov    -0x48(%rbp),%rax
ffffffff8149148d:	48 89 42 08          	mov    %rax,0x8(%rdx)
ffffffff81491491:	48 89 10             	mov    %rdx,(%rax)
ffffffff81491494:	fe 43 04             	incb   0x4(%rbx)
ffffffff81491497:	b8 fc ff ff ff       	mov    $0xfffffffc,%eax
ffffffff8149149c:	eb d8                	jmp    ffffffff81491476 <__mutex_lock_common+0x12f>

The "lock cmpxchg" ended up being inlined properly at ffffffff81491376,
and the whole assembly sequence looks pretty compact to me.
That does not let GCC off the hook though, and i'd be the last one to
defend it when it messes up.

	Ingo
Ingo Molnar
2009-Jan-09 15:35 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
* jim owens <jowens@hp.com> wrote:

> Ingo Molnar wrote:
> >
> > One interpretation of the numbers would be that core kernel hackers are
> > more inline-happy, maybe because they think that their functions are
> > more important to inline.
> >
> > Which is generally a fair initial assumption, but according to the
> > numbers it does not appear to pay off in practice as it does not result
> > in a smaller kernel image.
>
> I think people over-use inline for the opposite reason.

Note that i talked about the core kernel (kernel/*.c) specifically.

> They are taught:
> - use inline functions instead of macros
> - inlining functions makes your code run faster
>
> They also know inlining may increase program object size. That inlining
> will reduce object size on many architectures if the function is small
> is just a happy side effect to them.

Core kernel developers tend to be quite inline-conscious and generally
do not believe that making something inline will make it go faster.
That's why i picked kernel/built-in.o as a good "best of breed" entity
to measure - if anything, then that is an area where we have at least
the chance to do a "kernel coders know best when to inline" manual
inlining job. But despite a decade of tuning and systematic effort in
that area, the numbers suggest that we dont.

(If someone has different numbers or a different interpretation, please
share it with us.)

My goal is to make the kernel smaller and faster, and as far as the
placement of 'inline' keywords goes, i dont have too strong feelings
about how it's achieved: they have a certain level of documentation
value [signalling that a function is _intended_ to be lightweight] but
otherwise they are pretty neutral attributes to me.

So we want all the mechanisms in place to constantly press towards a
smaller and faster kernel, with the most efficient use of development
resources. Some techniques work in practice despite looking problematic,
some dont, despite looking good on paper. This might be one of those
cases.

Or not :-)

	Ingo
Steven Rostedt
2009-Jan-09 15:59 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Peter Zijlstra wrote:
> On Fri, 2009-01-09 at 11:47 +0100, Peter Zijlstra wrote:
> > > So I think the bug is still there, we just hid it better by breaking out
> > > of the loop with that "if (need_resched())" always eventually triggering.
> > > And it would be ok if it really is guaranteed to _eventually_ trigger, and
> > > I guess with timeslices it eventually always will, but I suspect we could
> > > have some serious latency spikes.
> >
> > Yes, the owner getting preempted after acquiring the lock, but before
> > setting the owner can give some nasties :-(
> >
> > I initially did that preempt_disable/enable around the fast path, but I
> > agree that slowing down the fast path is unwelcome.
> >
> > Alternatively we could go back to block on !owner, with the added
> > complexity of not breaking out of the spin on lock->owner != owner
> > when !lock->owner, so that the premature owner clearing of the unlock
> > fast path will not force a schedule right before we get a chance to
> > acquire the lock.
> >
> > Let me do that..
>
> Ok a few observations..
>
> Adding that need_resched() in the outer loop utterly destroys the
> performance gain for PREEMPT=y. Voluntary preemption is mostly good, but
> somewhat unstable results.

I was going to say a while ago...

In PREEMPT=y the need_resched() is not needed at all. If you have
preemption enabled, you will get preempted in that loop. No need for the
need_resched() in the outer loop. Although I'm not sure how it would
even hit the "need_resched". If it was set, then it is most likely going
to be cleared when coming back from being preempted.

> Adding that blocking on !owner utterly destroys everything.

I was going to warn you about that ;-)

Without the check for !owner, you are almost guaranteed to go to sleep
every time. Here's why:

You are spinning and thus have a hot cache on that CPU.

The owner goes to unlock but will be in a cold cache. It sets
lock->owner to NULL, but is still in cold cache so it is a bit slower.

Once the spinner sees the NULL, it shoots out of the spin but sees the
lock is still not available then goes to sleep. All before the owner
could release it. This could probably happen at every contention. Thus,
you lose the benefit of spinning. You probably make things worse because
you add a spin before every sleep.

-- Steve
Peter Zijlstra
2009-Jan-09 16:03 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 10:59 -0500, Steven Rostedt wrote:
> > Adding that blocking on !owner utterly destroys everything.
>
> I was going to warn you about that ;-)
>
> Without the check for !owner, you are almost guaranteed to go to sleep
> every time. Here's why:
>
> You are spinning and thus have a hot cache on that CPU.
>
> The owner goes to unlock but will be in a cold cache. It sets lock->owner
> to NULL, but is still in cold cache so it is a bit slower.
>
> Once the spinner sees the NULL, it shoots out of the spin but sees the
> lock is still not available then goes to sleep. All before the owner could
> release it. This could probably happen at every contention. Thus, you lose
> the benefit of spinning. You probably make things worse because you add a
> spin before every sleep.

Which is why I changed the inner loop to:

	l_owner = ACCESS_ONCE(lock->owner);
	if (l_owner && l_owner != owner)
		break;

so that it would continue spinning.
Linus Torvalds
2009-Jan-09 16:09 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Ingo Molnar wrote:
>
> -static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
> +static __asm_inline int
> +constant_test_bit(int nr, const volatile unsigned long *addr)
> {
> 	return ((1UL << (nr % BITS_PER_LONG)) &
> 		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;

This makes absolutely no sense.

It's called "__always_inline", not __asm_inline.

Why add a new nonsensical annotation like that?

Also, the very fact that gcc gets that function wrong WHEN 'nr' IS
CONSTANT (which is when it is called) just shows what kind of crap gcc
is!

Ingo, the fact is, I care about size, but I care about debuggability and
sanity more. I don't care one _whit_ about 3% size differences, if they
are insane and cause idiotic per-compiler differences.

And you haven't done any interesting analysis per-file etc. It should be
almost _trivial_ to do CONFIG_OPTIMIZE_INLINING on/off tests for the
whole tree, and then compare sizes of individual object files, and see
if we find some obvious _bug_ where we just inline too much.

In fact, we shouldn't even do that - we should try to find a mode where
gcc simply refuses to inline at all, and compare that to one where it
_only_ inlines the things we ask it to. Because that's the more relevant
test.

The problem with gcc inlining is actually two-fold:

 - gcc doesn't inline things we ask for.

   Here the sub-problem is that we ask for this too much, but see above
   on how to figure -that- out!

 - gcc _does_ inline things that we haven't marked at all, causing too
   much stack space to be used, and causing debugging problems. And here
   the problem is that gcc should damn well not do that, at least not as
   aggressively as it does!

IT DOES NOT MATTER if something is called in just one place and inlining
makes things smaller!

If it's not a clear performance win (and it almost never is, unless the
function is really small), the inlining of especially functions that
aren't even hot in the cache is ONLY a negative thing.

		Linus
H. Peter Anvin
2009-Jan-09 16:23 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
> On Fri, 9 Jan 2009, Ingo Molnar wrote:
> >
> > -static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
> > +static __asm_inline int
> > +constant_test_bit(int nr, const volatile unsigned long *addr)
> > {
> > 	return ((1UL << (nr % BITS_PER_LONG)) &
> > 		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
>
> This makes absolutely no sense.
>
> It's called "__always_inline", not __asm_inline.
>
> Why add a new nonsensical annotation like that?

__asm_inline was my suggestion, to distinguish "inline this
unconditionally because gcc screws up in the presence of asm()" versus
"inline this unconditionally because the world ends if it isn't" -- to
tell the human reader, not gcc.

I guess the above is a good indicator that __asm_inline might have been
a bad name.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Linus Torvalds
2009-Jan-09 16:28 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Ingo Molnar wrote:
>
> Core kernel developers tend to be quite inline-conscious and generally do
> not believe that making something inline will make it go faster.

Some of us core kernel developers tend to believe that:

 - inlining is supposed to work like macros, and should make the
   compiler do decisions BASED ON CALL-SITE. This is one of the most
   _common_ reasons for inlining: making the compiler select static code
   rather than dynamic code, and using inlining as a nice macro. We can
   pass in a flag with a constant value, and only the case that matters
   will be compiled.

   It's not about size - or necessarily even performance - at all. It's
   about abstraction, and a way of writing code.

And the thing is, as long as gcc does what we ask, we can notice when
_we_ did something wrong. We can say "ok, we should just remove the
inline" etc. But when gcc then essentially flips a coin, and inlines
things we don't want to, it dilutes the whole value of inlining -
because now gcc does things that actually do hurt us.

We get oopses that have a nice symbolic back-trace, and it reports an
error IN TOTALLY THE WRONG FUNCTION, because gcc "helpfully" inlined
things to the point that only an expert can realize "oh, the bug was
actually five hundred lines up, in that other function that was just
called once, so gcc inlined it even though it is huge".

See? THIS is the problem with gcc heuristics. It's not about quality of
code, it's about RELIABILITY of code.

The reason people use C for system programming is because the language
is a reasonably portable way to get the expected end results WITHOUT the
compiler making a lot of semantic changes behind your back.

Inlining is also the wrong thing to do _even_ if it makes code smaller
and faster, if you inline the unlikely case, or if inlining causes more
live variables that cause stack pressure. And we KNOW this happens.

Again, I'd be much happier if we had a compiler option that just does
"do what I _say_, dammit", and then we can fix up the mistakes. Because
then they are _our_ mistakes, not some random compiler version that
throws a dice!

		Linus
Linus Torvalds
2009-Jan-09 16:34 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Steven Rostedt wrote:
>
> I was going to say a while ago...
> In PREEMPT=y the need_resched() is not needed at all. If you have
> preemption enabled, you will get preempted in that loop. No need for the
> need_resched() in the outer loop. Although I'm not sure how it would even
> hit the "need_resched". If it was set, then it is most likely going to be
> cleared when coming back from being preempted.

No, no, you miss the point entirely.

It's not about correctness. Remember: the whole (and only) point of
spinning is about performance. And the thing is, we should only spin if
it makes sense. So the

	if (need_resched())
		break;

is not there because of any "ok, I need to sleep now", it's there
because of something TOTALLY DIFFERENT, namely "ok, it makes no sense to
spin now, since I should be sleeping".

See? WE DO NOT WANT TO BE PREEMPTED in this region, because that totally
destroys the whole point of the spinning. If we go through the
scheduler, then we should go through the scheduler AND GO TO SLEEP, so
that we don't go through the scheduler any more than absolutely
necessary.

So this code - by design - is always only going to get worse if you have
involuntary preemption. The preemption is going to do _two_ bad things:

 - it's going to call the scheduler at the wrong point, meaning that we
   now schedule _more_ (or at least not less) than if we didn't have
   that spin-loop in the first place.

 - .. and to make things worse, since it scheduled "for us", it is going
   to clear that "need_resched()" flag, so we'll _stay_ in the bad
   spinning loop too long!

So quite frankly, if you have CONFIG_PREEMPT, then the spinning really
is the wrong thing to do, or the whole mutex slow-path thing should be
done with preemption disabled so that we only schedule where we _should_
be scheduling.

		Linus
H. Peter Anvin
2009-Jan-09 16:34 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Ingo Molnar wrote:
>
> My goal is to make the kernel smaller and faster, and as far as the
> placement of 'inline' keywords goes, i dont have too strong feelings
> about how it's achieved: they have a certain level of documentation
> value [signalling that a function is _intended_ to be lightweight] but
> otherwise they are pretty neutral attributes to me.

As far as naming is concerned, gcc effectively supports four levels,
which *currently* map onto macros as follows:

	__always_inline		Inline unconditionally
	inline			Inlining hint
	<nothing>		Standard heuristics
	noinline		Uninline unconditionally

A lot of noise is being made about the naming of the levels (and I
personally believe we should have a different annotation for "inline
unconditionally for correctness" and "inline unconditionally for
performance", as a documentation issue), but those are the four we get.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Linus Torvalds
2009-Jan-09 16:37 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, H. Peter Anvin wrote:
>
> __asm_inline was my suggestion, to distinguish "inline this
> unconditionally because gcc screws up in the presence of asm()"

THERE IS NO ASM IN THERE!

Guys, look at the code. No asm. The whole notion that gcc gets confused
by inline asms IS BOGUS. It's simply not TRUE. Gcc gets confused because
gcc is confused, and it has NOTHING to do with inline asms.

So please don't confuse things further.

		Linus
Steven Rostedt
2009-Jan-09 16:44 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Linus Torvalds wrote:
> On Fri, 9 Jan 2009, Steven Rostedt wrote:
> >
> > I was going to say a while ago...
> > In PREEMPT=y the need_resched() is not needed at all. If you have
> > preemption enabled, you will get preempted in that loop. No need for the
> > need_resched() in the outer loop. Although I'm not sure how it would even
> > hit the "need_resched". If it was set, then it is most likely going to be
> > cleared when coming back from being preempted.
>
> No, no, you miss the point entirely.

No, I did not miss your point. I was commenting on the current code ;-)

> So quite frankly, if you have CONFIG_PREEMPT, then the spinning really is
> the wrong thing to do, or the whole mutex slow-path thing should be done
> with preemption disabled so that we only schedule where we _should_ be
> scheduling.

I agree here. I was going to recommend adding a preempt_disable in the
spinner, and keeping the need_resched test. Then we should not allow
preemption until we get all the way to the point of the schedule in the
contention case, or until we get the lock. When we get to the schedule()
it then needs to be a:

	preempt_enable_no_resched();
	schedule();

-- Steve
Linus Torvalds
2009-Jan-09 16:44 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, H. Peter Anvin wrote:
> As far as naming is concerned, gcc effectively supports four levels,
> which *currently* map onto macros as follows:
>
> 	__always_inline		Inline unconditionally
> 	inline			Inlining hint
> 	<nothing>		Standard heuristics
> 	noinline		Uninline unconditionally
>
> A lot of noise is being made about the naming of the levels

The biggest problem is the <nothing>.

The standard heuristics for that are broken, in particular for the
"single call-site static function" case.

If gcc only inlined truly trivial functions for that case, I'd already
be much happier. Size be damned.

		Linus
Dirk Hohndel
2009-Jan-09 16:46 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 09 Jan 2009 08:34:57 -0800
"H. Peter Anvin" <hpa@zytor.com> wrote:
>
> As far as naming is concerned, gcc effectively supports four levels,
> which *currently* map onto macros as follows:
>
> 	__always_inline		Inline unconditionally
> 	inline			Inlining hint
> 	<nothing>		Standard heuristics
> 	noinline		Uninline unconditionally
>
> A lot of noise is being made about the naming of the levels (and I
> personally believe we should have a different annotation for "inline
> unconditionally for correctness" and "inline unconditionally for
> performance", as a documentation issue), but those are the four we
> get.

Does gcc actually follow the "promise"? If that's the case (and if it's
considered a bug when it doesn't), then we can get what Linus wants by
annotating EVERY function with either __always_inline or noinline.

/D

--
Dirk Hohndel
Intel Open Source Technology Center
Dirk Hohndel
2009-Jan-09 16:47 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009 08:44:47 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, 9 Jan 2009, H. Peter Anvin wrote:
> > As far as naming is concerned, gcc effectively supports four levels,
> > which *currently* map onto macros as follows:
> >
> > __always_inline	Inline unconditionally
> > inline		Inlining hint
> > <nothing>	Standard heuristics
> > noinline	Uninline unconditionally
> >
> > A lot of noise is being made about the naming of the levels
>
> The biggest problem is the <nothing>.
>
> The standard heuristics for that are broken, in particular for the
> "single call-site static function" case.
>
> If gcc only inlined truly trivial functions for that case, I'd
> already be much happier. Size be damned.

See my other email. Maybe we should just stop trusting gcc and annotate every single function call. Ugly, but effective.

/D

--
Dirk Hohndel
Intel Open Source Technology Center
H. Peter Anvin
2009-Jan-09 16:51 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Dirk Hohndel wrote:
> Does gcc actually follow the "promise"? If that's the case (and if it's
> considered a bug when it doesn't), then we can get what Linus wants by
> annotating EVERY function with either __always_inline or noinline.

__always_inline and noinline do work.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Steven Rostedt
2009-Jan-09 17:07 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, H. Peter Anvin wrote:
> Dirk Hohndel wrote:
> > Does gcc actually follow the "promise"? If that's the case (and if it's
> > considered a bug when it doesn't), then we can get what Linus wants by
> > annotating EVERY function with either __always_inline or noinline.
>
> __always_inline and noinline do work.

I vote for: get rid of the current inline, rename __always_inline to inline, and then remove all non-needed inlines from the kernel. We'll probably start adding a lot more noinlines.

-- Steve
Linus Torvalds
2009-Jan-09 17:11 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Andi Kleen wrote:
> There's also one alternative: gcc's inlining algorithms are extensively
> tunable with --param. We might be able to find a set of numbers that
> make it roughly work like we want it by default.

We tried that. IIRC, the numbers mean different things for different versions of gcc, and I think using the parameters was very strongly discouraged by gcc developers. IOW, they were meant for gcc developers' internal tuning efforts, not really for external people. Which means that using them would put us _more_ at the mercy of random compiler versions rather than less.

		Linus
Christoph Hellwig
2009-Jan-09 17:13 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 09:03:12AM -0500, jim owens wrote:
> They also know inlining may increase program object size.
> That inlining will reduce object size on many architectures
> if the function is small is just a happy side effect to them.

The problem is that the threshold for that is architecture specific. While e.g. x86 has relatively low overhead of prologue/epilogue, other architectures like s390 have enormous overhead. So handling this in the compiler would be optimal, but it would need at least whole-program optimization and a compiler aware of the inline assembly to get it half-way right.
Linus Torvalds
2009-Jan-09 17:14 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Steven Rostedt wrote:
> I vote for the, get rid of the current inline, rename __always_inline to
> inline, and then remove all non needed inlines from the kernel.

This is what we do all the time, and historically have always done. But

 - CONFIG_OPTIMIZE_INLINING=y screws that up, and

 - gcc still inlines even big static functions that have no markings at all.

> We'll probably start adding a lot more noinlines.

That's going to be very painful. Especially since the cases we really want to not inline are the random drivers etc - generally not "hot in the cache", but they are the ones that cause the most oopses (not per line, but globally - because there's just so many drivers).

		Linus
Andi Kleen
2009-Jan-09 17:20 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 08:46:20AM -0800, Dirk Hohndel wrote:
> On Fri, 09 Jan 2009 08:34:57 -0800
> "H. Peter Anvin" <hpa@zytor.com> wrote:
> > As far as naming is concerned, gcc effectively supports four levels,
> > which *currently* map onto macros as follows:
> >
> > __always_inline	Inline unconditionally
> > inline		Inlining hint
> > <nothing>	Standard heuristics
> > noinline	Uninline unconditionally
> >
> > A lot of noise is being made about the naming of the levels (and I
> > personally believe we should have a different annotation for "inline
> > unconditionally for correctness" and "inline unconditionally for
> > performance", as a documentation issue), but those are the four we
> > get.
>
> Does gcc actually follow the "promise"? If that's the case (and if it's
> considered a bug when it doesn't), then we can get what Linus wants by
> annotating EVERY function with either __always_inline or noinline.

There's also one alternative: gcc's inlining algorithms are extensively tunable with --param. We might be able to find a set of numbers that make it roughly work like we want it by default.

Disadvantage: the whole thing will be compiler version dependent, so we might need different numbers for different compiler versions, and it will be an area that needs constant maintenance in the future. I'm not sure that's really a good path to walk down.

Also cc Honza in case he has comments (you might want to review more of the thread in the archives).

-Andi

--
ak@linux.intel.com
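A sketch of what such a per-compiler-version --param table might look like in the top-level Makefile. The parameter names (max-inline-insns-single, max-inline-insns-auto, inline-unit-growth) are real gcc --param knobs, but the values and the version split below are invented purely for illustration, not tuned numbers; cc-option and cc-ifversion are the usual Kbuild helpers.

```make
# Illustration only: the values here are made up, not measured.
# gcc >= 4.3: tighten the generic inlining limits a bit.
inline-params-y := $(call cc-ifversion, -ge, 0403, \
		--param max-inline-insns-single=40 \
		--param max-inline-insns-auto=20 \
		--param inline-unit-growth=10)

# cc-option drops the whole set silently on compilers that reject it.
KBUILD_CFLAGS	+= $(call cc-option, $(inline-params-y))
```

This is exactly the maintenance burden Andi describes: each new compiler release potentially needs its own row in this table.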
Matthew Wilcox
2009-Jan-09 17:28 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 06:20:11PM +0100, Andi Kleen wrote:
> Also cc Honza in case he has comments (you might want
> to review more of the thread in the archives)

I think this particular bug is already known and discussed:

http://gcc.gnu.org/ml/gcc/2008-12/msg00365.html

and it hints at being fixed with gcc 4.4. Does anyone want to test that?

--
Matthew Wilcox
Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Andi Kleen
2009-Jan-09 17:32 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> I vote for the, get rid of the current inline, rename __always_inline to

There is some code that absolutely requires inline for correctness, like the x86-64 vsyscall code. I would advocate keeping the explicit __always_inline at least there to make it very clear.

> inline, and then remove all non needed inlines from the kernel.

Most inlines in .c files should probably be dropped.

> We'll probably start adding a lot more noinlines.

That would cost you, see the numbers I posted (~4.1% text increase).

-Andi

--
ak@linux.intel.com
Matthew Wilcox
2009-Jan-09 17:39 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 06:47:19PM +0100, Andi Kleen wrote:
> On Fri, Jan 09, 2009 at 10:28:01AM -0700, Matthew Wilcox wrote:
> > On Fri, Jan 09, 2009 at 06:20:11PM +0100, Andi Kleen wrote:
> > > Also cc Honza in case he has comments (you might want
> > > to review more of the thread in the archives)
> >
> > I think this particular bug is already known and discussed:
>
> I thought so initially too, but:
>
> > http://gcc.gnu.org/ml/gcc/2008-12/msg00365.html
> >
> > and it hints at being fixed with gcc 4.4. Does anyone want to test
> > that?
>
> Hugh already tested with 4.4 and it didn't work well. At least
> a lot of the low level asm inlines were not inlined.
> So it looks like it's still mistuned for the kernel.

That seems like valuable feedback to give to the GCC developers. Richi, you can find the whole thread at http://marc.info/?l=linux-fsdevel&m=123150610901773&w=2 and http://marc.info/?l=linux-fsdevel&m=123150834405285&w=2 is also relevant.

--
Matthew Wilcox
Intel Open Source Technology Centre
Jiri Kosina
2009-Jan-09 17:40 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Peter Zijlstra wrote:
> > Well, at least we do unless you enable that broken paravirt support.
> > I'm not at all clear on why CONFIG_PARAVIRT wants to use inferior
> > locks, but I don't much care.
>
> Because the virtual cpu that has the ticket might not get scheduled for
> a while, even though another vcpu with a spinner is scheduled.
> The whole (para)virt is a nightmare in that respect.

Hmm, are we in fact really using byte locks in the CONFIG_PARAVIRT situation? Where are we actually setting the pv_lock_ops.spin_lock pointer to point to __byte_spin_lock? Such initialization seems to happen only in paravirt_use_bytelocks(), but my blind eyes prevent me from finding a callsite from which this function would eventually get called.

--
Jiri Kosina
SUSE Labs
Dirk Hohndel
2009-Jan-09 17:41 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009 18:47:19 +0100 Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, Jan 09, 2009 at 10:28:01AM -0700, Matthew Wilcox wrote:
> > I think this particular bug is already known and discussed:
> >
> > http://gcc.gnu.org/ml/gcc/2008-12/msg00365.html
> >
> > and it hints at being fixed with gcc 4.4. Does anyone want to test
> > that?
>
> I thought so initially too, but Hugh already tested with 4.4 and it
> didn't work well. At least a lot of the low level asm inlines were
> not inlined. So it looks like it's still mistuned for the kernel.

I think that's the point. gcc will not get it right, so we need to do it ourselves in the kernel sources. We may not like it, but it's the only way to guarantee reproducible, reliable inline / noinline decisions.

/D

--
Dirk Hohndel
Intel Open Source Technology Center
Andi Kleen
2009-Jan-09 17:46 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 09:11:47AM -0800, Linus Torvalds wrote:
> IIRC, the numbers mean different things for different versions of gcc, and
> I think using the parameters was very strongly discouraged by gcc
> developers. IOW, they were meant for gcc developers internal tuning
> efforts, not really for external people. Which means that using them would

When I asked last time, that was not what I heard. Apparently at least some --params are considered ready for user consumption these days.

> put us _more_ at the mercy of random compiler versions rather than less.

Yes, it would basically be a list in the Makefile keyed on compiler version, giving different options, and someone would need to do that work for each new compiler version. That would be some work, but it might be less work than going all over 9.7 MLOC and changing inlines around manually. Also the advantage is that you wouldn't need to teach the rules to hundreds of new driver programmers.

Anyways, I'm not very strongly wedded to this idea, but I think it's an alternative that should at least be considered before doing anything else drastic.

-Andi

--
ak@linux.intel.com
Andi Kleen
2009-Jan-09 17:47 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 10:28:01AM -0700, Matthew Wilcox wrote:
> On Fri, Jan 09, 2009 at 06:20:11PM +0100, Andi Kleen wrote:
> > Also cc Honza in case he has comments (you might want
> > to review more of the thread in the archives)
>
> I think this particular bug is already known and discussed:

I thought so initially too, but:

> http://gcc.gnu.org/ml/gcc/2008-12/msg00365.html
>
> and it hints at being fixed with gcc 4.4. Does anyone want to test
> that?

Hugh already tested with 4.4 and it didn't work well. At least a lot of the low level asm inlines were not inlined. So it looks like it's still mistuned for the kernel.

-Andi

--
ak@linux.intel.com
Linus Torvalds
2009-Jan-09 17:54 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Matthew Wilcox wrote:
> That seems like valuable feedback to give to the GCC developers.

Well, one thing we should remember is that the kernel really _is_ special.

The kernel not only does things no other program tends to do (inline asms are unusual in the first place - many of them are literally due to system issues like atomic accesses and interrupts that simply aren't an issue in user space, or that need so much abstraction that they aren't inlinable anyway). But the kernel also has totally different requirements in other ways.

When was the last time you did user space programming and needed to get a backtrace from a user with register info because you simply don't have the hardware that he has? IOW, debugging in user space tends to be much more about trying to reproduce the bug - in a way that we often cannot in the kernel. User space in general is much more reproducible, since it's seldom as hardware- or timing-dependent (threading does change the latter, but usually user space threading is not _nearly_ as aggressive as the kernel has to be).

So the thing is, even if gcc was "perfect", it would likely be perfect for a different audience than the kernel. Do you think _any_ user space programmer worries about the stack space being a few hundred bytes larger because the compiler inlined two functions, and caused stack usage to be the sum of them instead of just the maximum of the two?

So we do have special issues. And exactly _because_ we have special issues we should also expect that some compiler defaults simply won't ever really be appropriate for us.

		Linus
Linus Torvalds
2009-Jan-09 17:57 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Andi Kleen wrote:
> Universal noinline would also be a bad idea because of its
> costs (4.1% text size increase). Perhaps should make it
> a CONFIG option for debugging though.

That's _totally_ the wrong way.

If you can reproduce an issue on your machine, you generally don't care about inline, because you can see the stack, do the whole "gdb vmlinux" thing, and you generally have tools to help you decode things. Including just recompiling the kernel with an added noinline.

But _users_ just get their oopses sent automatically. So it's not about "debugging kernels", it's about _normal_ kernels. They are the ones that need to be debuggable, and the ones that care most about things like the symbolic EIP being as helpful as possible.

		Linus
Andi Kleen
2009-Jan-09 18:02 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> I think that's the point. gcc will not get it right.

I don't think that's necessarily a universal truth. It can probably be fixed.

> So we need to do it ourselves in the kernel sources.
> We may not like it, but it's the only way to guarantee reproducible,
> reliable inline / noinline decisions.

For most things we don't really need it to be reproducible; the main exception is the inlines in headers.

Universal noinline would also be a bad idea because of its costs (4.1% text size increase). Perhaps we should make it a CONFIG option for debugging though.

-Andi

--
ak@linux.intel.com
H. Peter Anvin
2009-Jan-09 18:07 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Linus Torvalds wrote:
> So we do have special issues. And exactly _because_ we have special issues
> we should also expect that some compiler defaults simply won't ever really
> be appropriate for us.

That is, of course, true. However, the Linux kernel (and quite a few other kernels) is a very important customer of gcc, and adding sustainable modes for the kernel that we can rely on is probably something we can work with them on.

I think the relationship between the gcc and Linux kernel people is unnecessarily infected, and cultivating a more constructive relationship would be good. I suspect a big part of the reason for the oddities is that the timeline for the kernel community from making a request into gcc until we can actually rely on it is *very* long, and so we end up having to work things around no matter what (usually with copious invective), and the gcc people have other customers with shorter lead times which therefore drive their development more.

	-hpa
Peter Zijlstra
2009-Jan-09 18:16 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 11:44 -0500, Steven Rostedt wrote:
> When we get to the schedule() it then needs to be a:
>
> preempt_enable_no_resched();
> schedule();

On that note:

Index: linux-2.6/kernel/mutex.c
===================================================================
--- linux-2.6.orig/kernel/mutex.c
+++ linux-2.6/kernel/mutex.c
@@ -220,7 +220,9 @@ __mutex_lock_common(struct mutex *lock,
 		__set_task_state(task, state);
 
 		/* didnt get the lock, go to sleep: */
+		preempt_disable();
 		spin_unlock_mutex(&lock->wait_lock, flags);
+		preempt_enable_no_resched();
 		schedule();
 		spin_lock_mutex(&lock->wait_lock, flags);
 	}

actually improves mutex performance on PREEMPT=y
Andi Kleen
2009-Jan-09 18:19 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> So we do have special issues. And exactly _because_ we have special issues
> we should also expect that some compiler defaults simply won't ever really
> be appropriate for us.

I agree that the kernel needs quite different inlining heuristics than, let's say, a template heavy C++ program. I guess that is also where our trouble comes from -- gcc is more tuned for the latter. Perhaps because the C++ programmers are better at working with the gcc developers?

But it's also not inconceivable that gcc adds a -fkernel-inlining or similar that changes the parameters if we ask nicely. I suppose such a parameter would actually be useful for far more programs than the kernel.

-Andi

--
ak@linux.intel.com
Linus Torvalds
2009-Jan-09 18:24 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Peter Zijlstra wrote:
> On that note:
>
> Index: linux-2.6/kernel/mutex.c
> ===================================================================
> --- linux-2.6.orig/kernel/mutex.c
> +++ linux-2.6/kernel/mutex.c
> @@ -220,7 +220,9 @@ __mutex_lock_common(struct mutex *lock,
>  		__set_task_state(task, state);
>
>  		/* didnt get the lock, go to sleep: */
> +		preempt_disable();
>  		spin_unlock_mutex(&lock->wait_lock, flags);
> +		preempt_enable_no_resched();
>  		schedule();

Yes. I think this is a generic issue independently of the whole adaptive thing. In fact, I think we could make the mutex code use explicit preemption and then the __raw spinlocks to make this more obvious.

Because now there's a hidden "preempt_enable()" in that spin_unlock_mutex, and anybody looking at the code and not realizing it is going to just say "Whaa? Who is this crazy Peter Zijlstra guy, and what drugs is he on? I want me some!".

Because your patch really doesn't make much sense unless you know how spinlocks work, and if you _do_ know how spinlocks work, you go "eww, that's doing extra preemption crud in order to just disable the _automatic_ preemption crud".

		Linus
Andi Kleen
2009-Jan-09 18:55 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> But _users_ just get their oopses sent automatically. So it's not about

If they send it from distro kernels, the automated oops sender could just fetch the debuginfo rpm and decode it down to a line. My old automatic user segfault uploader I did originally for the core pipe code did that too.

> "debugging kernels", it's about _normal_ kernels. They are the ones that
> need to be debuggable, and the ones that care most about things like the
> symbolic EIP being as helpful as possible.

Ok, you're saying we should pay the 4.1% by default for this? If you want that you can apply the appended patch. Not sure if it's really a good idea though. 4.1% is a lot.

-Andi

---

Disable inlining of functions called once by default

This makes oopses easier to read because it's clearer in which
function the problem is.

Disadvantage: costs ~4.1% of text size (measured with allyesconfig
on gcc 4.1 on x86-64)

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---
 Makefile | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

Index: linux-2.6.28-git12/Makefile
===================================================================
--- linux-2.6.28-git12.orig/Makefile	2009-01-09 07:06:44.000000000 +0100
+++ linux-2.6.28-git12/Makefile	2009-01-09 07:16:21.000000000 +0100
@@ -546,10 +546,8 @@
 KBUILD_CFLAGS	+= -pg
 endif
 
-# We trigger additional mismatches with less inlining
-ifdef CONFIG_DEBUG_SECTION_MISMATCH
+# Disable too aggressive inlining because it makes oopses harder to read
 KBUILD_CFLAGS += $(call cc-option, -fno-inline-functions-called-once)
-endif
 
 # arch Makefile may override CC so keep this after arch Makefile is included
 NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) -print-file-name=include)

--
ak@linux.intel.com
Richard Guenther
2009-Jan-09 18:59 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 9, 2009 at 7:19 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > So we do have special issues. And exactly _because_ we have special issues
> > we should also expect that some compiler defaults simply won't ever really
> > be appropriate for us.
>
> I agree that the kernel needs quite different inlining heuristics
> than let's say a template heavy C++ program. I guess that is
> also where our trouble comes from -- gcc is more tuned for the
> latter. Perhaps because the C++ programmers are better at working
> with the gcc developers?
>
> But it's also not inconceivable that gcc adds a -fkernel-inlining or
> similar that changes the parameters if we ask nicely. I suppose
> actually such a parameter would be useful for far more programs
> than the kernel.

I think that the kernel is a perfect target to optimize default -Os behavior for (whereas template heavy C++ programs are a target to optimize -O2 for). And I think we did a good job in listening to kernel developers if once in time they tried to talk to us - GCC 4.3 should be good at compiling the kernel with default -Os settings. We, unfortunately, cannot retroactively fix old versions that kernel developers happen to like and still use.

Richard.
Linus Torvalds
2009-Jan-09 19:00 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Andi Kleen wrote:
> Ok you're saying we should pay the 4.1% by default for this?

The thing is, YOU ARE MAKING THAT NUMBER UP!

First off, the size increase only matters if it actually increases the cache footprint. And it may, but..

Secondly, my whole point here has been that we should not rely on gcc doing things behind our back, because gcc will generally do the wrong thing. If we decided to be more active about this, we could just choose to find the places that matter (in hot code) and fix _those_.

Thirdly, you're just replacing _one_ random gcc choice with _another_ random one. What happens when you say -fno-inline-functions-called-once? Does it disable inlining for those functions IN GENERAL, or just for the LARGE ones?

See?

		Linus
Richard Guenther
2009-Jan-09 19:09 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 9, 2009 at 6:54 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, 9 Jan 2009, Matthew Wilcox wrote:
> > That seems like valuable feedback to give to the GCC developers.
>
> Well, one thing we should remember is that the kernel really _is_ special.

(sorry for not threading properly here)

Linus writes:

"Thirdly, you're just replacing _one_ random gcc choice with _another_ random one. What happens when you say -fno-inline-functions-called-once? Does it disable inlining for those functions IN GENERAL, or just for the LARGE ones? See?"

-fno-inline-functions-called-once disables the heuristic that always inlines (static!) functions that are called once. Other heuristics still apply, like inlining the static function if it is small. Everything else would be totally stupid - which seems to be the "default mode" you think GCC developers are in.

Richard.
Richard Guenther
2009-Jan-09 19:10 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 9, 2009 at 8:21 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > GCC 4.3 should be good in compiling the
> > kernel with default -Os settings.
>
> It's unfortunately not. It doesn't inline a lot of simple asm() inlines
> for example.

Reading Ingo's posting with the actual numbers states the opposite.

Richard.
H. Peter Anvin
2009-Jan-09 19:13 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Richard Guenther wrote:
> > But it's also not inconceivable that gcc adds a -fkernel-inlining or
> > similar that changes the parameters if we ask nicely. I suppose
> > actually such a parameter would be useful for far more programs
> > than the kernel.
>
> I think that the kernel is a perfect target to optimize default -Os behavior for
> (whereas template heavy C++ programs are a target to optimize -O2 for).
> And I think we did a good job in listening to kernel developers if once in
> time they tried to talk to us - GCC 4.3 should be good in compiling the
> kernel with default -Os settings. We, unfortunately, cannot retroactively
> fix old versions that kernel developers happen to like and still use.

Unfortunately I think there have been a lot of "we can't talk to them" on both sides of the kernel-gcc interface, which is incredibly unfortunate. I personally try to at least observe gcc development, including monitoring #gcc and knowing enough about gcc internals to write a (crappy) port, but I can hardly call myself a gcc expert. Still, I am willing to spend some significant time interfacing with anyone in the gcc community willing to spend the effort. I think we can do good stuff.

	-hpa
H. Peter Anvin
2009-Jan-09 19:17 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Richard Guenther wrote:
> On Fri, Jan 9, 2009 at 8:21 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > > GCC 4.3 should be good in compiling the
> > > kernel with default -Os settings.
> >
> > It's unfortunately not. It doesn't inline a lot of simple asm() inlines
> > for example.
>
> Reading Ingos posting with the actual numbers states the opposite.

Well, Andi's patch forcing inlining of the bitops chops quite a bit of size off the kernel, so there is clearly room for improvement. From my post yesterday:

: voreg 64 ; size o.*/vmlinux
     text	    data	     bss	     dec	     hex	filename
 57590217	24940519	15560504	98091240	 5d8c0e8	o.andi/vmlinux
 59421552	24912223	15560504	99894279	 5f44407	o.noopt/vmlinux
 57700527	24950719	15560504	98211750	 5da97a6	o.opty/vmlinux

110 KB of code size reduction by force-inlining the small bitops.

	-hpa
Andi Kleen
2009-Jan-09 19:21 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> GCC 4.3 should be good in compiling the
> kernel with default -Os settings.

It's unfortunately not.  It doesn't inline a lot of simple asm() inlines
for example.

-Andi
--
ak@linux.intel.com
Matthew Wilcox
2009-Jan-09 19:29 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 08:35:06PM +0100, Andi Kleen wrote:
> - Also inline everything static that is only called once
>   [on the theory that this shrinks code size, which is true
>   according to my measurements]
>
> -fno-inline-functions-called-once disables this new rule.
> It's very well and clearly defined.

It's also not necessarily what we want.  For example, in fs/direct-io.c,
we have:

static ssize_t
direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode,
	const struct iovec *iov, loff_t offset, unsigned long nr_segs,
	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
	struct dio *dio)
{
	[150 lines]
}

ssize_t
__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
	struct block_device *bdev, const struct iovec *iov, loff_t offset,
	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
	int dio_lock_type)
{
	[100 lines]
	retval = direct_io_worker(rw, iocb, inode, iov, offset,
				nr_segs, blkbits, get_block, end_io, dio);
	[10 lines]
}

Now, I'm not going to argue the directIO code is a shining example of
how we want things to look, but we don't really want ten arguments
being marshalled into a function call; we want gcc to inline
direct_io_worker() and do its best to optimise the whole thing.

--
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
Richard Guenther
2009-Jan-09 19:32 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 9, 2009 at 8:40 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, Jan 09, 2009 at 08:10:20PM +0100, Richard Guenther wrote:
>> On Fri, Jan 9, 2009 at 8:21 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>>> GCC 4.3 should be good in compiling the
>>>> kernel with default -Os settings.
>>>
>>> It's unfortunately not.  It doesn't inline a lot of simple asm() inlines
>>> for example.
>>
>> Reading Ingo's posting with the actual numbers states the opposite.
>
> Hugh had some numbers up to 4.4.0 20090102 in
>
> http://thread.gmane.org/gmane.linux.kernel/775254/focus=777231
>
> which demonstrated the problem.

How about GCC bugzillas with testcases that we can fix and enter into
the testsuite to make sure future GCC versions won't regress?

Richard.
Andi Kleen
2009-Jan-09 19:35 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> What happens when you say -fno-inline-functions-called-once?  Does it
> disable inlining for those functions IN GENERAL, or just for the LARGE

It does disable it in general, unless they're marked inline explicitly:

The traditional gcc 2.x rules were:

I)  Only inline what is marked inline (but it can decide not to inline)
II) Also inline others heuristically, but only with -O3 /
    -finline-functions [which we don't set in the kernel]

Then at some point this additional rule was added:

- Also inline everything static that is only called once
  [on the theory that this shrinks code size, which is true
  according to my measurements]

-fno-inline-functions-called-once disables this new rule.  It's very
well and clearly defined.

-Andi
--
ak@linux.intel.com
Andi Kleen
2009-Jan-09 19:40 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 08:10:20PM +0100, Richard Guenther wrote:
> On Fri, Jan 9, 2009 at 8:21 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>> GCC 4.3 should be good in compiling the
>>> kernel with default -Os settings.
>>
>> It's unfortunately not.  It doesn't inline a lot of simple asm() inlines
>> for example.
>
> Reading Ingo's posting with the actual numbers states the opposite.

Hugh had some numbers up to 4.4.0 20090102 in

http://thread.gmane.org/gmane.linux.kernel/775254/focus=777231

which demonstrated the problem.

-Andi
--
ak@linux.intel.com
Linus Torvalds
2009-Jan-09 19:44 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Richard Guenther wrote:
>
> -fno-inline-functions-called-once disables the heuristic that always
> inlines (static!) functions that are called once.  Other heuristics
> still apply, like inlining the static function if it is small.
> Everything else would be totally stupid - which seems to be the
> "default mode" you think GCC developers are in.

Well, I don't know about you, but the "don't inline a single
instruction" sounds a bit stupid to me.  And yes, that's exactly what
triggered this whole thing.

We have two examples of gcc doing that, one of which was even a modern
version of gcc, where we had done absolutely _everything_ on a source
level to make sure that gcc could not possibly screw up.  Yet it did:

  static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
  {
	return ((1UL << (nr % BITS_PER_LONG)) &
		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
  }

  #define test_bit(nr, addr)			\
	(__builtin_constant_p((nr))		\
	 ? constant_test_bit((nr), (addr))	\
	 : variable_test_bit((nr), (addr)))

in this case, Ingo said that changing that _single_ inline to forcing
inlining made a difference.

That's CRAZY.  The thing isn't even called unless "nr" is constant, so
absolutely _everything_ optimizes away, and that whole function was
designed to give us a single instruction:

	testl $constant,constant_offset(addr)

and nothing else.

Maybe there was something else going on, and maybe Ingo's tests were
off, but this is an example of gcc not inlining WHEN WE TOLD IT TO, and
when the function was a single instruction.

How can anybody possibly not consider that to be "stupid"?

The other case (with a single "cmpxchg" inline asm instruction) was at
least _slightly_ more understandable, in that (a) Ingo claims modern
gcc's did inline it and (b) the original function actually has a
"switch()" statement that depends on the argument that is constant, so
a stupid inliner might believe that it's a big function.  But again, we
_told_ the compiler to inline the damn thing, because we knew better.
But gcc didn't.

The other part that is crazy is when gcc inlines large functions that
aren't even called most of the time (the "ioctl()" switch statements
tend to be a great example of this - gcc inlines ten or twenty
functions, and we can guarantee that only one of them is ever called).
Yes, maybe it makes the code smaller, but it makes the code also
undebuggable and often BUGGY, because we now have the stack frame of
all ten-to-twenty functions to contend with.

And notice how "static" has absolutely _zero_ meaning for the above
example.  Yes, the thing is called just from one place - that's how
something like that very much works.  It's a special case.  It's not
_worth_ inlining, especially if it causes bugs.  So "called once" or
"static" is actually totally irrelevant.

And no, they are not marked "inline" (although they are clearly also
not marked "uninline", until we figure out that gcc is causing system
crashes, and we add the thing).

If these two small problems were fixed, gcc inlining would work much
better.  But the first one, in particular, means that the "do I inline
or not" decision would have to happen after expanding and simplifying
constants.  And then, if the end result is big, the inlining gets
aborted.

		Linus
Linus Torvalds
2009-Jan-09 19:48 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Matthew Wilcox wrote:
>
> Now, I'm not going to argue the directIO code is a shining example of
> how we want things to look, but we don't really want ten arguments
> being marshalled into a function call; we want gcc to inline
> direct_io_worker() and do its best to optimise the whole thing.

Well, except we quite probably would be happier with gcc not doing
that, than with gcc doing that too often.

There are exceptions.  If the caller is really small (ie a pure wrapper
that perhaps just does some locking around the call), then sure,
inlining a large function that only gets called from one place does
make sense.

But if both the caller and the callee are large, like in your example,
then no.  DON'T INLINE IT.  Unless we _tell_ you, of course, which we
probably shouldn't do.

Why?  Because debugging is more important.  And deciding to inline
that, you probably decided to inline something _else_ too.  And now
you've quite possibly blown your stackspace.

		Linus
Theodore Tso
2009-Jan-09 19:52 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 07:55:09PM +0100, Andi Kleen wrote:
>> But _users_ just get their oopses sent automatically.  So it's not
>> about
>
> If they send it from distro kernels the automated oops sender could
> just fetch the debuginfo rpm and decode it down to a line.
> My old automatic user segfault uploader I did originally
> for the core pipe code did that too.

Fetch a gigabyte's worth of data for the debuginfo RPM?  Um, I think
most users would not be happy with that, especially if they are behind
a slow network.  Including the necessary information for someone who
wants to investigate the oops, or having kerneloops.org pull apart the
oops, makes more sense, I think, and is already done.

Something that would be **really** useful would be a web page where, if
someone sends me an oops message from a Fedora or OpenSUSE kernel to
linux-kernel or linux-ext4, I could take the oops message and cut and
paste it into a web page, along with the kernel version information,
and get back decoded oops information with line numbers.

Kerneloops.org does this, so the code is mostly written; but it does
this in a blinded fashion, so it only makes sense for oopses which are
very common and for which we don't need to ask the user, "so what were
you doing at the time".  In cases where the user has already stepped up
and reported the oops on a mailing list, it would be nice if
kerneloops.org had a way of decoding the oops via some web page.

Arjan, would something like this be doable, hopefully without too much
effort?

					- Ted
H. Peter Anvin
2009-Jan-09 19:56 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Linus Torvalds wrote:
> Because then they are _our_ mistakes, not some random compiler version
> that throws a dice!

This does bring up the idea of including a compiler with the kernel
sources again, doesn't it?

	-hpa

(ducks & runs)
Richard Guenther
2009-Jan-09 20:14 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 9, 2009 at 8:44 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, 9 Jan 2009, Richard Guenther wrote:
>>
>> -fno-inline-functions-called-once disables the heuristic that always
>> inlines (static!) functions that are called once.  Other heuristics
>> still apply, like inlining the static function if it is small.
>> Everything else would be totally stupid - which seems to be the
>> "default mode" you think GCC developers are in.
>
> Well, I don't know about you, but the "don't inline a single
> instruction" sounds a bit stupid to me.  And yes, that's exactly what
> triggered this whole thing.
>
> We have two examples of gcc doing that, one of which was even a modern
> version of gcc, where we had done absolutely _everything_ on a source
> level to make sure that gcc could not possibly screw up.  Yet it did:
>
>   static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
>   {
>	return ((1UL << (nr % BITS_PER_LONG)) &
>		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
>   }
>
>   #define test_bit(nr, addr)			\
>	(__builtin_constant_p((nr))		\
>	 ? constant_test_bit((nr), (addr))	\
>	 : variable_test_bit((nr), (addr)))
>
> in this case, Ingo said that changing that _single_ inline to forcing
> inlining made a difference.
>
> That's CRAZY.  The thing isn't even called unless "nr" is constant, so
> absolutely _everything_ optimizes away, and that whole function was
> designed to give us a single instruction:
>
>	testl $constant,constant_offset(addr)
>
> and nothing else.

This is a case where the improved IPA-CP (interprocedural constant
propagation) of GCC 4.4 may help.  In general GCC cannot say how a call
argument may affect optimization if the function were inlined, so the
size estimates are done by just looking at the function body, not the
arguments (well, for GCC 4.4 this is not completely true, there is now
some "heuristics").

With IPA-CP GCC will clone the function for the constant arguments,
optimize it and eventually inline it if it is small enough.  At the
moment this happens only if all callers call the function with the same
constant though (at least I think so).

The above is definitely one case where using a macro or forced inlining
is a better idea than to trust a compiler to figure out that it can
optimize the function to a size suitable for inlining if called with a
constant parameter.

> Maybe there was something else going on, and maybe Ingo's tests were
> off, but this is an example of gcc not inlining WHEN WE TOLD IT TO,
> and when the function was a single instruction.
>
> How can anybody possibly not consider that to be "stupid"?

Because it's a hard problem, it's not stupid to fail here - you didn't
tell the compiler the function optimizes!

> The other case (with a single "cmpxchg" inline asm instruction) was at
> least _slightly_ more understandable, in that (a) Ingo claims modern
> gcc's did inline it and (b) the original function actually has a
> "switch()" statement that depends on the argument that is constant, so
> a stupid inliner might believe that it's a big function.  But again,
> we _told_ the compiler to inline the damn thing, because we knew
> better.  But gcc didn't.

Experience tells us that people do not know better.  Maybe the kernel
is an exception here, but generally trusting "inline" up to an absolute
is a bad idea (we stretch heuristics if you specify "inline" though).
We can, for the kernel's purpose and maybe other clever developers,
invent a -fobey-inline mode: only inline functions marked inline, and
inline all of them (if possible - which would be the key difference to
always_inline).

But would you still want small functions to be inlined even if they are
not marked inline?

> The other part that is crazy is when gcc inlines large functions that
> aren't even called most of the time (the "ioctl()" switch statements
> tend to be a great example of this - gcc inlines ten or twenty
> functions, and we can guarantee that only one of them is ever called).
> Yes, maybe it makes the code smaller, but it makes the code also
> undebuggable and often BUGGY, because we now have the stack frame of
> all ten-to-twenty functions to contend with.

Use -fno-inline-functions-called-once then.  But if you ask for -Os you
get -Os.  Also, recent GCC estimates and limits stack growth - which
you can tune reliably with --param large-stack-frame-growth and
--param large-stack-frame.

> And notice how "static" has absolutely _zero_ meaning for the above
> example.  Yes, the thing is called just from one place - that's how
> something like that very much works.  It's a special case.  It's not
> _worth_ inlining, especially if it causes bugs.  So "called once" or
> "static" is actually totally irrelevant.

Static makes inlining a single call cheap, because the out-of-line body
can be reclaimed.  If you do not like that, turn it off.

> And no, they are not marked "inline" (although they are clearly also
> not marked "uninline", until we figure out that gcc is causing system
> crashes, and we add the thing).
>
> If these two small problems were fixed, gcc inlining would work much
> better.  But the first one, in particular, means that the "do I inline
> or not" decision would have to happen after expanding and simplifying
> constants.  And then, if the end result is big, the inlining gets
> aborted.

They do - just constant arguments are obviously not used for optimizing
before inlining.  Otherwise you'd scream bloody murder at us for all
the increase in compile-time ;)

Richard.
Nicholas Miell
2009-Jan-09 20:17 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 2009-01-09 at 08:28 -0800, Linus Torvalds wrote:
>
> We get oopses that have a nice symbolic back-trace, and it reports an
> error IN TOTALLY THE WRONG FUNCTION, because gcc "helpfully" inlined
> things to the point that only an expert can realize "oh, the bug was
> actually five hundred lines up, in that other function that was just
> called once, so gcc inlined it even though it is huge".
>
> See?  THIS is the problem with gcc heuristics.  It's not about quality
> of code, it's about RELIABILITY of code.

[bt]$ cat backtrace.c
#include <stdlib.h>

static void called_once()
{
	abort();
}

int main(int argc, char* argv[])
{
	called_once();
	return 0;
}
[bt]$ gcc -Wall -O2 -g backtrace.c -o backtrace
[bt]$ gdb --quiet backtrace
(gdb) disassemble main
Dump of assembler code for function main:
0x00000000004004d0 <main+0>:	sub    $0x8,%rsp
0x00000000004004d4 <called_once+0>:	callq  0x4003b8 <abort@plt>
End of assembler dump.
(gdb) run
Starting program: /home/nicholas/src/bitbucket/bt/backtrace

Program received signal SIGABRT, Aborted.
0x0000003d9dc32f05 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x0000003d9dc32f05 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003d9dc34a73 in abort () at abort.c:88
#2  0x00000000004004d9 in called_once () at backtrace.c:5
#3  main (argc=3989, argv=0xf95) at backtrace.c:10
(gdb)

Maybe the kernel's backtrace code should be fixed instead of blaming
gcc.

--
Nicholas Miell <nmiell@comcast.net>
Linus Torvalds
2009-Jan-09 20:26 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Richard Guenther wrote:
>
> This is a case where the improved IPA-CP (interprocedural constant
> propagation) of GCC 4.4 may help.  In general GCC cannot say how a
> call argument may affect optimization if the function were inlined, so
> the size estimates are done by just looking at the function body, not
> the arguments (well, for GCC 4.4 this is not completely true, there is
> now some "heuristics").  With IPA-CP GCC will clone the function for
> the constant arguments, optimize it and eventually inline it if it is
> small enough.  At the moment this happens only if all callers call the
> function with the same constant though (at least I think so).

Ok, that's useless.  The whole point is that everybody gives different
- but still constant - arguments.

> The above is definitely one case where using a macro or forced
> inlining is a better idea than to trust a compiler to figure out that
> it can optimize the function to a size suitable for inlining if called
> with a constant parameter.

.. and forced inlining is what we default to.  But that's when "let's
try letting gcc optimize this" fails.  And macros get really
unreadable, really quickly.

>> Maybe there was something else going on, and maybe Ingo's tests were
>> off, but this is an example of gcc not inlining WHEN WE TOLD IT TO,
>> and when the function was a single instruction.
>>
>> How can anybody possibly not consider that to be "stupid"?
>
> Because it's a hard problem, it's not stupid to fail here - you didn't
> tell the compiler the function optimizes!

Well, actually we did.  It's that "inline" there.  That's how things
used to work.  It's like "no".  It means "no".  It doesn't mean "yes, I
really want to s*ck your d*ck, but I'm just screaming no at the top of
my lungs because I think I should do so".

See?

And you do have to realize that Linux has been using gcc for a _loong_
while.  You can talk all you want about how "inline" is just a hint,
but the fact is, it didn't use to be.  gcc people _made_ it so, and are
having a damn hard time admitting that it's causing problems.

> Experience tells us that people do not know better.  Maybe the kernel
> is an exception here

Oh, I can well believe it.

And I don't even think that kernel people get it right nearly enough,
but since for the kernel it can even be a _correctness_ issue, at least
if we get it wrong, everybody sees it.

When _some_ compiler versions get it wrong, it's a disaster.

> But would you still want small functions to be inlined even if they
> are not marked inline?

If you can really tell that they are that small, yes.

> They do - just constant arguments are obviously not used for
> optimizing before inlining.  Otherwise you'd scream bloody murder at
> us for all the increase in compile-time ;)

A large portion of that has gone away now that everybody uses ccache.
And if you only did it for functions that we _mark_ inline, it wouldn't
even be true.  Because those are the ones that presumably really should
be inlined.

So no, I don't believe you.  You much too easily dismiss the fact that
we've explicitly marked these functions for inlining, and then you say
"but we were too stupid".

If you cannot afford to do the real job, then trust the user.  Don't
guess.

		Linus
Linus Torvalds
2009-Jan-09 20:29 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Nicholas Miell wrote:
>
> Maybe the kernel's backtrace code should be fixed instead of blaming
> gcc.

And maybe people who don't know what they are talking about shouldn't
speak?

You just loaded the whole f*cking debug info just to do that exact
analysis.  Guess how big it is for the kernel?

Did you even read this discussion?  Did you see my comments about why
kernel backtrace debugging is different from regular user mode
debugging?

		Linus
Steven Rostedt
2009-Jan-09 20:29 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Nicholas Miell wrote:
> On Fri, 2009-01-09 at 08:28 -0800, Linus Torvalds wrote:
>>
>> We get oopses that have a nice symbolic back-trace, and it reports an
>> error IN TOTALLY THE WRONG FUNCTION, because gcc "helpfully" inlined
>> things to the point that only an expert can realize "oh, the bug was
>> actually five hundred lines up, in that other function that was just
>> called once, so gcc inlined it even though it is huge".
>>
>> See?  THIS is the problem with gcc heuristics.  It's not about
>> quality of code, it's about RELIABILITY of code.
>
> [bt]$ cat backtrace.c
> #include <stdlib.h>
>
> static void called_once()
> {
>	abort();
> }
>
> int main(int argc, char* argv[])
> {
>	called_once();
>	return 0;
> }
> [bt]$ gcc -Wall -O2 -g backtrace.c -o backtrace
> [bt]$ gdb --quiet backtrace
> (gdb) disassemble main
> Dump of assembler code for function main:
> 0x00000000004004d0 <main+0>:	sub    $0x8,%rsp
> 0x00000000004004d4 <called_once+0>:	callq  0x4003b8 <abort@plt>
> End of assembler dump.
> (gdb) run
> Starting program: /home/nicholas/src/bitbucket/bt/backtrace
>
> Program received signal SIGABRT, Aborted.
> 0x0000003d9dc32f05 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> 64	  return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) bt
> #0  0x0000003d9dc32f05 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x0000003d9dc34a73 in abort () at abort.c:88
> #2  0x00000000004004d9 in called_once () at backtrace.c:5
> #3  main (argc=3989, argv=0xf95) at backtrace.c:10
> (gdb)
>
> Maybe the kernel's backtrace code should be fixed instead of blaming
> gcc.

Try doing the same without compiling with -g.  I believe Andi has a
patch to use the DWARF markings for backtrace (I'm sure he'll correct
me if I'm wrong ;-), but things like ftrace that use kallsyms to get
the names of functions and such do better when the functions are not
inlined.

Not to mention that the function tracer does not trace inlined
functions, so that's another downside of inlining.

-- Steve
Richard Guenther
2009-Jan-09 20:37 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 9, 2009 at 9:26 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, 9 Jan 2009, Richard Guenther wrote:
>>
>> This is a case where the improved IPA-CP (interprocedural constant
>> propagation) of GCC 4.4 may help.  In general GCC cannot say how a
>> call argument may affect optimization if the function were inlined,
>> so the size estimates are done by just looking at the function body,
>> not the arguments (well, for GCC 4.4 this is not completely true,
>> there is now some "heuristics").  With IPA-CP GCC will clone the
>> function for the constant arguments, optimize it and eventually
>> inline it if it is small enough.  At the moment this happens only if
>> all callers call the function with the same constant though (at
>> least I think so).
>
> Ok, that's useless.  The whole point is that everybody gives different
> - but still constant - arguments.

Btw, both GCC 4.3 and upcoming GCC 4.4 inline the bit-test.  This is
what I used as a testcase (to avoid the single-call and
single-constant cases):

#define BITS_PER_LONG 32

static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
{
	return ((1UL << (nr % BITS_PER_LONG)) &
		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
}

#define test_bit(nr, addr)			\
	(__builtin_constant_p((nr))		\
	 ? constant_test_bit((nr), (addr))	\
	 : variable_test_bit((nr), (addr)))

int foo(unsigned long *addr)
{
	return test_bit (5, addr);
}

int bar(unsigned long *addr)
{
	return test_bit (6, addr);
}

at -Os even.

>> The above is definitely one case where using a macro or forced
>> inlining is a better idea than to trust a compiler to figure out
>> that it can optimize the function to a size suitable for inlining if
>> called with a constant parameter.
>
> .. and forced inlining is what we default to.  But that's when "let's
> try letting gcc optimize this" fails.  And macros get really
> unreadable, really quickly.

As it happens to work with your simple case, it may still apply for
more complex (thus apparently big) cases.

>>> Maybe there was something else going on, and maybe Ingo's tests
>>> were off, but this is an example of gcc not inlining WHEN WE TOLD
>>> IT TO, and when the function was a single instruction.
>>>
>>> How can anybody possibly not consider that to be "stupid"?
>>
>> Because it's a hard problem, it's not stupid to fail here - you
>> didn't tell the compiler the function optimizes!
>
> Well, actually we did.  It's that "inline" there.  That's how things
> used to work.  It's like "no".  It means "no".  It doesn't mean "yes,
> I really want to s*ck your d*ck, but I'm just screaming no at the top
> of my lungs because I think I should do so".
>
> See?

See below.

> And you do have to realize that Linux has been using gcc for a
> _loong_ while.  You can talk all you want about how "inline" is just
> a hint, but the fact is, it didn't use to be.  gcc people _made_ it
> so, and are having a damn hard time admitting that it's causing
> problems.

We made it so 10 years ago.

>> Experience tells us that people do not know better.  Maybe the
>> kernel is an exception here
                        ^^^
> Oh, I can well believe it.
>
> And I don't even think that kernel people get it right nearly enough,
> but since for the kernel it can even be a _correctness_ issue, at
> least if we get it wrong, everybody sees it.
>
> When _some_ compiler versions get it wrong, it's a disaster.

Of course.  If you use always_inline then it's even a compiler bug.

>> But would you still want small functions to be inlined even if they
>> are not marked inline?
>
> If you can really tell that they are that small, yes.
>
>> They do - just constant arguments are obviously not used for
>> optimizing before inlining.  Otherwise you'd scream bloody murder at
>> us for all the increase in compile-time ;)
>
> A large portion of that has gone away now that everybody uses ccache.
> And if you only did it for functions that we _mark_ inline, it
> wouldn't even be true.  Because those are the ones that presumably
> really should be inlined.
>
> So no, I don't believe you.  You much too easily dismiss the fact
> that we've explicitly marked these functions for inlining, and then
> you say "but we were too stupid".
>
> If you cannot afford to do the real job, then trust the user.  Don't
> guess.

We're guessing way better than the average programmer.  But if you are
asking for a compiler option to disable guessing you can have it (you
can already use #define inline always_inline and -fno-inline to get
it).

Richard.
Ingo Molnar
2009-Jan-09 20:41 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:> On Fri, 9 Jan 2009, Ingo Molnar wrote: > > > > -static inline int constant_test_bit(int nr, const volatile unsigned long *addr) > > +static __asm_inline int > > +constant_test_bit(int nr, const volatile unsigned long *addr) > > { > > return ((1UL << (nr % BITS_PER_LONG)) & > > (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0; > > Thios makes absolutely no sense. > > It''s called "__always_inline", not __asm_inline.yeah. Note that meanwhile i also figured out why gcc got the inlining wrong there: the ''int nr'' combined with the ''% BITS_PER_LONG'' signed arithmetics was too much for it to figure out at the inlining stage - it generated IDIV instructions, etc. With forced inlining later optimization stages managed to prove that the expression can be simplified. The second patch below that changes ''int nr'' to ''unsigned nr'' solves that problem, without the need to mark the function __always_inline. How did i end up with __asm_inline? The thing is, i started the day under the assumption that there''s some big practical problem here. I expected to find a lot of places in need of annotation, so i introduced hpa''s suggestion and added the __asm_inline (via the patch attached below). I wrote 40 patches that annotated 200+ asm inline functions, and i was fully expected to find that GCC made a mess, and i also wrote a patch to disable CONFIG_OPTIMIZE_INLINING on those grounds. The irony is that indeed pretty much the _only_ annotation that made a difference was the one that isnt even an asm() inline (as you noted). So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct one would be to mark it __always_inline [__asm_inline is senseless there], or the second patch below that changes the bit parameter to unsigned int. 
	Ingo

---
 include/linux/compiler.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: linux/include/linux/compiler.h
===================================================================
--- linux.orig/include/linux/compiler.h
+++ linux/include/linux/compiler.h
@@ -223,7 +223,11 @@ void ftrace_likely_update(struct ftrace_
 #define noinline_for_stack noinline
 
 #ifndef __always_inline
-#define __always_inline inline
+# define __always_inline inline
+#endif
+
+#ifndef __asm_inline
+# define __asm_inline __always_inline
 #endif
 
 #endif /* __KERNEL__ */

Index: linux/arch/x86/include/asm/bitops.h
===================================================================
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -300,7 +300,7 @@ static inline int test_and_change_bit(in
 	return oldbit;
 }
 
-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static int constant_test_bit(unsigned int nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr % BITS_PER_LONG)) &
 		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
Ingo Molnar
2009-Jan-09 20:56 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Ingo Molnar <mingo@elte.hu> wrote:

> Note that meanwhile i also figured out why gcc got the inlining wrong
> there: the 'int nr' combined with the '% BITS_PER_LONG' signed
> arithmetics was too much for it to figure out at the inlining stage -
> it generated IDIV instructions, etc. With forced inlining later
> optimization stages managed to prove that the expression can be
> simplified.
>
> The second patch below that changes 'int nr' to 'unsigned nr' solves
> that problem, without the need to mark the function __always_inline.

The patch below changes all the 'int nr' arguments to 'unsigned int nr'
in bitops.h and gives us a 0.3% size win (and all the right inlining
behavior) on x86 defconfig:

      text	   data	    bss	    dec	    hex	filename
   6813470	1453188	 801096	9067754	8a5cea	vmlinux.before
   6792602	1453188	 801096	9046886	8a0b66	vmlinux.after

i checked other architectures and i can see many cases where the bitops
'nr' parameter is defined as unsigned - maybe they noticed this. This
change makes some sense anyway as a cleanup: a negative 'nr' bitop
argument does not make much sense IMO.

	Ingo

---
 arch/x86/include/asm/bitops.h | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

Index: linux/arch/x86/include/asm/bitops.h
===================================================================
--- linux.orig/arch/x86/include/asm/bitops.h
+++ linux/arch/x86/include/asm/bitops.h
@@ -75,7 +75,7 @@ static inline void set_bit(unsigned int
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static inline void __set_bit(int nr, volatile unsigned long *addr)
+static inline void __set_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
 }
@@ -90,7 +90,7 @@ static inline void __set_bit(int nr, vol
  * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
  * in order to ensure changes are visible on other processors.
  */
-static inline void clear_bit(int nr, volatile unsigned long *addr)
+static inline void clear_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	if (IS_IMMEDIATE(nr)) {
 		asm volatile(LOCK_PREFIX "andb %1,%0"
@@ -117,7 +117,7 @@ static inline void clear_bit_unlock(unsi
 	clear_bit(nr, addr);
 }
 
-static inline void __clear_bit(int nr, volatile unsigned long *addr)
+static inline void __clear_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
 }
@@ -152,7 +152,7 @@ static inline void __clear_bit_unlock(un
  * If it's called on the same region of memory simultaneously, the effect
  * may be that only one operation succeeds.
  */
-static inline void __change_bit(int nr, volatile unsigned long *addr)
+static inline void __change_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	asm volatile("btc %1,%0" : ADDR : "Ir" (nr));
 }
@@ -166,7 +166,7 @@ static inline void __change_bit(int nr,
  * Note that @nr may be almost arbitrarily large; this function is not
  * restricted to acting on a single-word quantity.
  */
-static inline void change_bit(int nr, volatile unsigned long *addr)
+static inline void change_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	if (IS_IMMEDIATE(nr)) {
 		asm volatile(LOCK_PREFIX "xorb %1,%0"
@@ -187,7 +187,7 @@ static inline void change_bit(int nr, vo
  * This operation is atomic and cannot be reordered.
  * It also implies a memory barrier.
  */
-static inline int test_and_set_bit(int nr, volatile unsigned long *addr)
+static inline int test_and_set_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	int oldbit;
 
@@ -204,7 +204,7 @@ static inline int test_and_set_bit(int n
  *
  * This is the same as test_and_set_bit on x86.
  */
-static inline int test_and_set_bit_lock(int nr, volatile unsigned long *addr)
+static inline int test_and_set_bit_lock(unsigned int nr, volatile unsigned long *addr)
 {
 	return test_and_set_bit(nr, addr);
 }
@@ -218,7 +218,7 @@ static inline int test_and_set_bit_lock(
  * If two examples of this operation race, one can appear to succeed
 * but actually fail. You must protect multiple accesses with a lock.
  */
-static inline int __test_and_set_bit(int nr, volatile unsigned long *addr)
+static inline int __test_and_set_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	int oldbit;
 
@@ -237,7 +237,7 @@ static inline int __test_and_set_bit(int
  * This operation is atomic and cannot be reordered.
 * It also implies a memory barrier.
  */
-static inline int test_and_clear_bit(int nr, volatile unsigned long *addr)
+static inline int test_and_clear_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	int oldbit;
 
@@ -257,7 +257,7 @@ static inline int test_and_clear_bit(int
  * If two examples of this operation race, one can appear to succeed
 * but actually fail. You must protect multiple accesses with a lock.
  */
-static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
+static inline int __test_and_clear_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	int oldbit;
 
@@ -269,7 +269,7 @@ static inline int __test_and_clear_bit(i
 }
 
 /* WARNING: non atomic and it can be reordered! */
-static inline int __test_and_change_bit(int nr, volatile unsigned long *addr)
+static inline int __test_and_change_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	int oldbit;
 
@@ -289,7 +289,7 @@ static inline int __test_and_change_bit(
  * This operation is atomic and cannot be reordered.
 * It also implies a memory barrier.
  */
-static inline int test_and_change_bit(int nr, volatile unsigned long *addr)
+static inline int test_and_change_bit(unsigned int nr, volatile unsigned long *addr)
 {
 	int oldbit;
 
@@ -300,13 +300,14 @@ static inline int test_and_change_bit(in
 	return oldbit;
 }
 
-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static inline int
+constant_test_bit(unsigned int nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr % BITS_PER_LONG)) &
 		(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
 }
 
-static inline int variable_test_bit(int nr, volatile const unsigned long *addr)
+static inline int variable_test_bit(unsigned int nr, volatile const unsigned long *addr)
 {
 	int oldbit;
 
@@ -324,7 +325,7 @@ static inline int variable_test_bit(int
  * @nr: bit number to test
  * @addr: Address to start counting from
  */
-static int test_bit(int nr, const volatile unsigned long *addr);
+static int test_bit(unsigned int nr, const volatile unsigned long *addr);
 #endif
 
 #define test_bit(nr, addr) \
Linus Torvalds
2009-Jan-09 20:56 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Ingo Molnar wrote:
>
> So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct one
> would be to mark it __always_inline [__asm_inline is senseless there], or
> the second patch below that changes the bit parameter to unsigned int.

Well, I certainly don't want to _remove_ the "inline" like your patch did.
Other gcc versions will care. But I committed the pure "change to
unsigned" part.

But we should fix the cmpxchg (and perhaps plain xchg too), shouldn't we?
That your gcc version gets it right doesn't change the fact that Chris'
gcc version didn't, and out-of-lined it all. So we'll need some
__always_inlines there too..

And no, I don't think it makes any sense to call them "__asm_inline". Even
when there are asms hidden in between the C statements, what's the
difference between "always" and "asm"? None, really.

			Linus
Sam Ravnborg
2009-Jan-09 20:58 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
>
> And you do have to realize that Linux has been using gcc for a _loong_
> while. You can talk all you want about how "inline" is just a hint, but
> the fact is, it didn't use to be. gcc people _made_ it so, and are having
> a damn hard time admitting that it's causing problems.

The kernel has used:

    # define inline inline __attribute__((always_inline))

for a looong time. So anyone in the kernel who said "inline" actually said
to gcc: if you have any possible way to do so, inline this sucker.

Now we have a config option that changes this so inline is only a hint.
gcc does not pay enough attention to the hint, especially compared to the
days when the hint was actually a command.

	Sam
> I've done a fine-grained size analysis today (see my other mail in this
> thread), and it turns out that on gcc 4.3.x the main (and pretty much
> only) inlining annotation that matters in arch/x86/include/asm/*.h is the
> one-liner patch attached below, annotating constant_test_bit().

That's pretty cool. You should definitely file a gcc bug report for that,
though, so that they can fix gcc. Did you already do that or should I?

-Andi
Theodore Tso
2009-Jan-09 21:23 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
I'm beginning to think that for the kernel, we should simply remove
CONFIG_OPTIMIZE_INLINING (so that inline means "always_inline") and add
-fno-inline-functions -fno-inline-functions-called-once (so that gcc never
inlines functions behind our back) --- and then we create tools that count
how many times functions get used, and how big functions are, so that we
can flag when some function really should be marked inline when it isn't,
or vice versa.

But given that this is a very hard thing for an automated program to do,
let's write some tools so we can easily put a human in the loop, who can
add or remove inline keywords where it makes sense, and let's give up on
gcc being able to "guess" correctly.

For some things, like register allocation, I can accept that the compiler
will usually get these things right. But whether or not to inline a
function seems to be one of those things that humans (perhaps with some
tools assist) can still do a better job at than compilers.

						- Ted
Steven Rostedt
2009-Jan-09 21:33 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Theodore Tso wrote:

> I'm beginning to think that for the kernel, we should just simply
> remove CONFIG_OPTIMIZE_INLINING (so that inline means
> "always_inline"), and -fno-inline-functions
> -fno-inline-functions-called-once (so that gcc never inlines functions
> behind our back) --- and then we create tools that count how many times
> functions get used, and how big functions are, so that we can flag if
> some function really should be marked inline when it isn't or vice
> versa.
>
> But given that this is a very hard thing for an automated program
> to do, let's write some tools so we can easily put a human in the loop,
> who can add or remove inline keywords where it makes sense, and let's
> give up on gcc being able to "guess" correctly.
>
> For some things, like register allocation, I can accept that the
> compiler will usually get these things right. But whether or not to
> inline a function seems to be one of those things that humans (perhaps
> with some tools assist) can still do a better job than compilers.

Adding a function histogram in ftrace should be trivial. I can write one
up if you want. It will only count the functions not inlined.

-- Steve
Ingo Molnar
2009-Jan-09 21:34 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 9 Jan 2009, Ingo Molnar wrote:
> >
> > So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct
> > one would be to mark it __always_inline [__asm_inline is senseless
> > there], or the second patch below that changes the bit parameter to
> > unsigned int.
>
> Well, I certainly don't want to _remove_ the "inline" like your patch
> did. Other gcc versions will care. But I committed the pure "change to
> unsigned" part.
>
> But we should fix the cmpxchg (and perhaps plain xchg too), shouldn't
> we?
>
> That your gcc version gets it right doesn't change the fact that Chris'
> gcc version didn't, and out-of-lined it all. So we'll need some
> __always_inlines there too..

Yeah. I'll dig out an older version of gcc (latest distros are all 4.3.x
based) and run the checks to see which inlines make a difference.

> And no, I don't think it makes any sense to call them "__asm_inline".
> Even when there are asms hidden in between the C statements, what's the
> difference between "always" and "asm"? None, really.

Well, the difference is small, nitpicky and insignificant: the thing is,
there are two logically separate categories of __always_inline:

 1) the places where __always_inline means that in this universe no sane
    compiler could ever end up thinking to move that function out of
    line.

 2) inlining for _correctness_ reasons: things like vreads or certain
    paravirt items. Stuff where the kernel actually crashes if we don't
    inline. Here if we do not inline we've got a materially crashy
    kernel.

The original intention of __always_inline was to only cover the second
category above - and thus self-document all the 'correctness inlines'.
This notion has become bitrotten somewhat: we do use __always_inline in a
few other places like the ticket spinlock inlines for non-correctness
reasons. That bitrot happened because we simply have no separate symbol
for the first category.
So hpa suggested __asm_inline (yesterday, well before all the analysis was
conducted) under the assumption that there would be many such annotations
needed and that they would all be about cases where GCC's inliner gets
confused by inline assembly.

This theory turned out to be a red herring today - asm()s do not seem to
confuse the latest GCC. (Although they certainly confuse earlier versions,
so it's still a practical issue, and i agree that we do need to annotate a
few more places.)

In any case, the __asm_inline name - even if it made some marginal sense
originally - is totally moot now, no argument about that.

The naming problem remains though:

 - Perhaps we could introduce a name for the first category:
   __must_inline? __should_inline? Not because it wouldn't mean 'always',
   but because it is 'always inline' for another reason than the
   correctness __always_inline.

 - Another possible approach would be to rename the second category to
   __force_inline. That would signal rather forcefully that the inlining
   there is an absolute correctness issue.

 - Or we could go with the status quo and just conflate those two
   categories (as is happening currently) and document the correctness
   inlines via in-source comments?

But these are really nuances that pale in comparison to the fundamental
questions that were asked in this thread, about the pure existence of this
feature. If the optimize-inlining feature looks worthwhile and
maintainable enough to remain upstream then i'd simply like to see the
information of these two categories preserved in a structured way (in 5
years i'm not sure i'd remember all the paravirt inlining details), and i
don't feel too strongly about the style in which we preserve that
information.

	Ingo
Harvey Harrison
2009-Jan-09 21:41 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 22:34 +0100, Ingo Molnar wrote:

> The naming problem remains though:
>
> - Perhaps we could introduce a name for the first category:
>   __must_inline? __should_inline? Not because it wouldn't mean 'always',
>   but because it is 'always inline' for another reason than the
>   correctness __always_inline.
>
> - Another possible approach would be to rename the second category to
>   __force_inline. That would signal rather forcefully that the inlining
>   there is an absolute correctness issue.

__needs_inline? That would imply that it's there for correctness reasons.
Then __always_inline is left to mean that it doesn't _need_ to be inline,
but we _want_ it inline regardless of what gcc thinks?

$0.02

Harvey
Linus Torvalds
2009-Jan-09 21:46 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Ingo Molnar wrote:
>
> - Perhaps we could introduce a name for the first category:
>   __must_inline? __should_inline? Not because it wouldn't mean 'always',
>   but because it is 'always inline' for another reason than the
>   correctness __always_inline.

I think you're thinking about this the wrong way. "inline" is a pretty
damn strong hint already. If you want a weaker one, make it _weaker_
instead of trying to use superlatives like "super_inline" or
"must_inline" or whatever.

So I'd suggest:

 - keep "inline" as being a strong hint. In fact, I'd suggest it not be a
   hint at all - when we say "inline", we mean it. No ambiguity
   _anywhere_, and no need for idiotic "I really really REALLY mean it"
   versions.

 - add a "maybe_inline" or "inline_hint" to mean that "ok, compiler,
   maybe this is worth inlining, but I'll leave the final choice to you".

That would get rid of the whole rationale for OPTIMIZE_INLINING=y, because
at that point, it's no longer potentially a correctness issue. At that
point, if we let gcc optimize things, it was a per-call-site conscious
decision.

			Linus
Linus Torvalds
2009-Jan-09 21:50 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Harvey Harrison wrote:
>
> __needs_inline? That would imply that it's for correctness reasons.

.. but the point is, we have _thousands_ of inlines, and do you know which
is which? We've historically forced them to be inlined, and every time
somebody does that "OPTIMIZE_INLINE=y", something simply _breaks_.

So instead of continually hitting our head against this wall because some
people seem to be convinced that gcc can do a good job, just do it the
other way around. Make the new one be "inline_hint" (no underscores
needed, btw), and there is absolutely ZERO confusion about what it means.
At that point, everybody knows why it's there, and it's clearly not a
correctness issue or anything else.

Of course, at that point you might as well argue that the thing should not
exist at all, and that such a flag should just be removed entirely. Which
I certainly agree with - I think the only flag we need is "inline", and I
think it should mean what it damn well says.

			Linus
Ingo Molnar
2009-Jan-09 21:58 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 9 Jan 2009, Ingo Molnar wrote:
> >
> > So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct
> > one would be to mark it __always_inline [__asm_inline is senseless
> > there], or the second patch below that changes the bit parameter to
> > unsigned int.
>
> Well, I certainly don't want to _remove_ the "inline" like your patch
> did.

hm, that was a bug that i noticed and fixed in the second, fuller version
of the patch i sent - which converts all the 'int nr' instances in
bitops.h to 'unsigned int nr'. This is the only instance where the integer
type of 'nr' matters in practice though, due to the modulo arithmetics.
But for cleanliness reasons we want to do the full patch, to have a
standard type signature for these bitop methods.

> Other gcc versions will care. But I committed the pure "change to
> unsigned" part.

thanks! I'll clean up the rest - the second patch will now conflict
(trivially). I also wanted to check the whole file more fully; there might
be other details. [ So many files, so few nights ;-) ]

We also might need more __always_inlines here and in other places, to
solve the nonsensical inlining problems that Chris's case showed with
earlier GCCs, for example.

Another option would be to not trust earlier GCCs at all with this
feature - to define inline to not-__always_inline only on the latest 4.3.x
GCC, the only one that seems to at least not mess up royally. Thus
CONFIG_OPTIMIZE_INLINING=y would have no effect on older GCCs. That would
quarantine the problem (and the impact) sufficiently, i think. And if
future GCCs start messing up in this area we could zap the whole feature
in a heartbeat.

( Although now that we have this feature it gives an incentive to compiler
  folks to tune their inliner on the Linux kernel - for a decade we never
  allowed them to do that.
  The kernel clearly has some of the trickiest (and ugliest) inlining
  smarts in its headers - and we never really exposed compilers to those
  things, so i'm not surprised at all that they mess up in some cases.
  Unfortunately the version lifecycle of most compiler projects is
  measured in years, not in months like that of the kernel. There are many
  reasons for that - and not all of those reasons are strictly their
  fault. )

	Ingo
Harvey Harrison
2009-Jan-09 21:59 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 13:50 -0800, Linus Torvalds wrote:
>
> On Fri, 9 Jan 2009, Harvey Harrison wrote:
> >
> > __needs_inline? That would imply that it's for correctness reasons.
>
> .. but the point is, we have _thousands_ of inlines, and do you know
> which is which? We've historically forced them to be inlined, and every
> time somebody does that "OPTIMIZE_INLINE=y", something simply _breaks_.

My suggestion was just an alternative to __force_inline as a naming; I
agree that inline should mean __always_inline... always.

> So instead of just continually hitting our head against this wall
> because some people seem to be convinced that gcc can do a good job,
> just do it the other way around. Make the new one be "inline_hint" (no
> underscores needed, btw), and there is absolutely ZERO confusion about
> what it means.

agreed.

> At that point, everybody knows why it's there, and it's clearly not a
> correctness issue or anything else.
>
> Of course, at that point you might as well argue that the thing should
> not exist at all, and that such a flag should just be removed entirely.
> Which I certainly agree with - I think the only flag we need is
> "inline", and I think it should mean what it damn well says.

Also agreed, but there needs to start being some education about _not_
using inline so much in the kernel.

Harvey
Arjan van de Ven
2009-Jan-09 22:07 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009 14:52:33 -0500 Theodore Tso <tytso@mit.edu> wrote:

> On Fri, Jan 09, 2009 at 07:55:09PM +0100, Andi Kleen wrote:
> > > But _users_ just get their oopses sent automatically. So it's not
> > > about
> >
> > If they send it from distro kernels the automated oops sender could
> > just fetch the debuginfo rpm and decode it down to a line.
> > My old automatic user segfault uploader I did originally
> > for the core pipe code did that too.
>
> Fetch a gigabyte's worth of data for the debuginfo RPM? Um, I think
> most users would not be happy with that, especially if they are behind
> a slow network. Including the necessary information so someone who
> wants to investigate the oops, or having kerneloops.org pull apart the
> oops, makes more sense, I think, and is already done.

it is.

> Something that would be **really** useful would be a web page where if
> someone sends me an oops message from a Fedora or Open SUSE kernel to
> linux-kernel or linux-ext4, I could take the oops message, cut and
> paste it into a web page, along with the kernel version information,
> and the kernel developer could get back decoded oops information
> with line numbers.
>
> Kerneloops.org does this, so the code is mostly written; but it does
> this in a blinded fashion, so it only makes sense for oopses which are
> very common and for which we don't need to ask the user, "so what were
> you doing at the time". In cases where the user has already stepped
> up and reported the oops on a mailing list, it would be nice if
> kerneloops.org had a way of decoding the oops via some web page.
> Arjan, would something like this be doable, hopefully without too much
> effort?

I suppose it could be done if the exact idea is there, and it would be
nice if I'd get help from the distro kernel maintainers so that they'll
send me the vmlinux'n ;)

-- 
Arjan van de Ven
Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Linus Torvalds
2009-Jan-09 22:09 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Harvey Harrison wrote:
> >
> > Of course, at that point you might as well argue that the thing should
> > not exist at all, and that such a flag should just be removed entirely.
> > Which I certainly agree with - I think the only flag we need is
> > "inline", and I think it should mean what it damn well says.
>
> Also agreed, but there needs to start being some education about _not_
> using inline so much in the kernel.

Actually, the nice part about "inline_hint" would be that then we could
have some nice config option like

	#ifdef CONFIG_FULL_CALL_TRACE
	#define inline_hint noinline
	#elif defined(CONFIG_TRUST_COMPILER)
	#define inline_hint /* */
	#else
	#define inline_hint __inline
	#endif

and now the _only_ thing we need to do is to remove the

	#define __inline __force_inline

thing, and just agree that "__inline" is the "native compiler meaning".

We have a few users of "__inline", but not very many. We can leave them
alone, or just convert them to __inline__ or inline.

			Linus
Harvey Harrison
2009-Jan-09 22:13 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 14:09 -0800, Linus Torvalds wrote:

> We have a few users of "__inline", but not very many. We can leave them
> alone, or just convert them to __inline__ or inline.

Actually, I sent out a series of patches, most of which went in in the
2.6.27-28 timeframe; that's why there are a lot fewer __inline/__inline__
uses now. Other than one more block in scsi which has been hanging out in
-mm for a while, eliminating them should be pretty easy now.

Harvey
Harvey Harrison
2009-Jan-09 22:25 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 14:09 -0800, Linus Torvalds wrote:

> Actually, the nice part about "inline_hint" would be that then we could
> have some nice config option like
>
> 	#ifdef CONFIG_FULL_CALL_TRACE
> 	#define inline_hint noinline
> 	#elif defined(CONFIG_TRUST_COMPILER)
> 	#define inline_hint /* */
> 	#else
> 	#define inline_hint __inline
> 	#endif
>
> and now the _only_ thing we need to do is to remove the
>
> 	#define __inline __force_inline
>
> thing, and just agree that "__inline" is the "native compiler meaning".
>
> We have a few users of "__inline", but not very many. We can leave them
> alone, or just convert them to __inline__ or inline.

Oh yeah, and figure out what actually breaks on alpha such that they added
the following (arch/alpha/include/asm/compiler.h):

	#ifdef __KERNEL__
	/* Some idiots over in <linux/compiler.h> thought inline should
	   imply always_inline.  This breaks stuff.  We'll include this
	   file whenever we run into such problems. */

	#include <linux/compiler.h>
	#undef inline
	#undef __inline__
	#undef __inline
	#undef __always_inline
	#define __always_inline		inline __attribute__((always_inline))

	#endif /* __KERNEL__ */

Cheers,

Harvey
H. Peter Anvin
2009-Jan-09 22:35 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Andi Kleen wrote:
>> Fetch a gigabyte's worth of data for the debuginfo RPM?
>
> The suse 11.0 kernel debuginfo is ~120M.

Still, though, it's hardly worth doing client-side when it can be done
server-side for all the common distro kernels. For custom kernels, not so,
but there you should already have the debuginfo locally.

And yes, there are probably residual holes, but it's questionable whether
it matters.

	-hpa
Andi Kleen
2009-Jan-09 22:44 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> Fetch a gigabyte's worth of data for the debuginfo RPM?

The suse 11.0 kernel debuginfo is ~120M.

-Andi
Arjan van de Ven
2009-Jan-09 22:55 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 09 Jan 2009 14:35:29 -0800 "H. Peter Anvin" <hpa@zytor.com> wrote:
> Andi Kleen wrote:
> >> Fetch a gigabyte's worth of data for the debuginfo RPM?
> >
> > The suse 11.0 kernel debuginfo is ~120M.
>
> Still, though, hardly worth doing client-side when it can be done
> server-side for all the common distro kernels. For custom kernels,
> not so, but there you should already have the debuginfo locally.

and if you have the debug info locally, all you need is

    dmesg | scripts/markup_oops.pl vmlinux

and it nicely decodes it for you

-- 
Arjan van de Ven
Intel Open Source Technology Centre
For development, discussion and tips for power savings, visit http://www.lesswatts.org
Ingo Molnar
2009-Jan-09 23:12 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 9 Jan 2009, Harvey Harrison wrote:
> >
> > __needs_inline? That would imply that it's for correctness reasons.
>
> .. but the point is, we have _thousands_ of inlines, and do you know
> which is which? We've historically forced them to be inlined, and every
> time somebody does that "OPTIMIZE_INLINE=y", something simply _breaks_.

Having watched all the inline and anti-inline activities and patches of the past decade (and having participated in many of them), my strong impression is that any non-automated approach is a fundamentally inhuman Don Quixote fight. The inlining numbers I and others posted seem to support that impression.

Today we have in excess of thirty thousand 'inline' keyword uses in the kernel, and in excess of one hundred thousand kernel functions. We had a decade of hundreds of inline-tuning patches that flipped inline attributes on and off, with the goal of doing that job better than the compiler.

Still a sucky compiler that was never faced with this level of inlining complexity before (up to a few short months ago, when we released the first kernel with a non-CONFIG_BROKEN-marked CONFIG_OPTIMIZE_INLINING feature in it) manages to do a better job at judging inlining than a decade of human optimizations managed to do. (If you accept that 1% - 3% - 7.5% code size reduction in important areas of the kernel is an improvement.)

That improvement is systematic: it happens regardless of whether it's core kernel developers who wrote the code, with years of kernel experience - or driver developers who came from Windows, might be inexperienced about it all, and might slap 'inline' on every second random function.

And it's not like the compiler was not allowed to inline important functions before: all static functions in .c files it can (and does) inline if it sees fit. Tens of thousands of them.

If we change 'inline' back to mean 'must inline' again, we have not changed the human dynamics of inlines at all and are back to square one. 'should_inline' or 'may_inline' will be an opt-in hint that will be subject to the same kind of misjudgements that resulted in the inlining situation to begin with. In .c files it's already possible to do that: by not placing an 'inline' keyword at all, just leaving the function 'static'. may_inline/inline_hint is a longer, less known and uglier keyword. So all the cards are stacked against this new 'may inline' mechanism, and in all likelihood it will fizzle and never reach any sort of critical mass to truly matter. Nor should it - why should humans do this if a silly tool can achieve something rather acceptable?

So such a change will in essence amount to the effective removal of CONFIG_OPTIMIZE_INLINING. If we want to do that then we should do it honestly - and remove it altogether and not pretend to care.

Fedora has CONFIG_OPTIMIZE_INLINING=y enabled today - distros are always on the lookout for kernel image reduction features. As of today i'm not aware of a single Fedora bugzilla that was caused by it. The upstream kernel did have bugs due to it - we had the UML breakage for example, and an older 3.x gcc threw an internal error on one of the (stale) isdn telephony drivers. Was Chris's crash actually caused by gcc's inlining decisions? I don't think it was. Historically we had far more compiler problems with CONFIG_CC_OPTIMIZE_SIZE=y - optimizing for size is a subtly complex and non-trivial compiler pass.

	Ingo
Linus Torvalds
2009-Jan-09 23:24 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009, Ingo Molnar wrote:
>
> may_inline/inline_hint is a longer, less known and uglier keyword.

Hey, your choice, should you decide to accept it, is to just get rid of them entirely.

You claim that we're back to square one, but that's simply the way things are. Either "inline" means something, or it doesn't. You argue for it meaning nothing. I argue for it meaning something.

If you want to argue for it meaning nothing, then REMOVE it, instead of breaking it.

It really is that simple. Remove the inlines you think are wrong. Instead of trying to change the meaning of them.

		Linus
Nicholas Miell
2009-Jan-09 23:28 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 2009-01-09 at 12:29 -0800, Linus Torvalds wrote:
>
> On Fri, 9 Jan 2009, Nicholas Miell wrote:
> >
> > Maybe the kernel's backtrace code should be fixed instead of blaming
> > gcc.
>
> And maybe people who don't know what they are talking about shouldn't
> speak?

I may not know what I'm talking about, but you don't have to be rude about it, at least not until I've made a nuisance of myself, and a single mail surely isn't yet a nuisance.

> You just loaded the whole f*cking debug info just to do that exact
> analysis. Guess how big it is for the kernel?

It's huge, I know. My understanding is that the DWARF size could be reduced rather dramatically if redundant DIEs were coalesced into a single entry. Even then it would still be too large to load into the kernel for runtime stack trace generation, but that's what offline analysis of crash dumps is for.

> Did you even read this discussion?

No, I didn't read most of the discussion; I only started skimming it after I saw the gcc people talking negatively about the latest brouhaha. (And this thread being an offshoot of a large thread that is itself an offshoot of another large thread certainly didn't help me find the relevant parts of the discussion, either.) So I'm sorry that I missed whatever comments you made on the subject.

(As an aside, I'm amazed that anything works at all when the kernel, compiler and runtime library people all seem to mutually loathe each other to the point of generally refusing to communicate at all.)

> Did you see my comments about why
> kernel backtrace debugging is different from regular user mode debugging?

I think I found the mail in question [1], and after reading it, I found that it touches on one of the things I was thinking about while looking through the thread.

The majority of code built by gcc was, is and always will be userspace code. The kernel was, is and always will be of minor importance to gcc. gcc will never be perfect for kernel use.

So take your complaint about gcc's decision to inline functions called once. Ignore for the moment the separate issue of stack growth and let's talk about what it does to debugging, which was the bulk of your complaint that I originally responded to. In the general case it does nothing at all to debugging (beyond the usual weird control flow you get from any optimized code) - the compiler generates line number information for the inlined functions, the debugger interprets that information, and your backtrace is accurate.

It is only in the specific case of the kernel's broken backtrace code that this becomes an issue. Its failure to function correctly is the direct result of a failure to keep up with modern compiler changes that everybody else in the toolchain has dealt with. So putting "helpfully" in sarcastic quotes, or calling what gcc does a gcc problem, is outright wrong. For most things gcc's behavior does actually help, and it is a kernel problem (by virtue of the kernel being different and strange), not gcc's.

The question then becomes: how does the kernel deal with the fact that it is of minor importance to gcc, and significantly different from the bulk of gcc's consumers, to the point where those differences become serious problems?

I think that the answer is that the kernel should do its best to be as much like userspace apps as it can, because insisting on special treatment doesn't seem to be working. In this specific case, that would mean making kernel debugging as much like userspace debugging as you can - stop pretending that stack traces generated by the kernel at runtime are adequate, get kernel crash dumps enabled and working 100% of the time, and then use a debugger to examine the kernel's core dump and find your stack trace. As an added bonus, real crash dumps aren't limited just to backtraces, so you'd have even more information to work with to find the root cause of the failure.

[1] Message ID: alpine.LFD.2.00.0901090947080.6528@localhost.localdomain

-- 
Nicholas Miell <nmiell@comcast.net>
Sam Ravnborg
2009-Jan-09 23:32 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, Jan 09, 2009 at 11:44:10PM +0100, Andi Kleen wrote:
> > Fetch a gigabyte's worth of data for the debuginfo RPM?
>
> The suse 11.0 kernel debuginfo is ~120M.

How is this debuginfo generated?

Someone posted the following patch, which I did not apply for lack of any real need. But maybe distros have something similar already, so it makes sense to apply it.

	Sam

Subject: [PATCH] Kbuild: generate debug info in building

This patch will generate kernel debuginfo in Kbuild when invoking "make debug_info". The separate debug files are in .debug under the build tree. They can help in cases requiring debug info for tracing/debug tools, especially cross-compilation. Moreover, it can simplify or standardize the packaging process for distributions that provide kernel-debuginfo.

Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
---
 Makefile                 |   14 ++++++++++++++
 scripts/Makefile.modpost |   14 ++++++++++++++
 2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/Makefile b/Makefile
index 7f9ff9b..eed7510 100644
--- a/Makefile
+++ b/Makefile
@@ -814,6 +814,20 @@ define rule_vmlinux-modpost
 	$(Q)echo 'cmd_$@ := $(cmd_vmlinux-modpost)' > $(dot-target).cmd
 endef
 
+ifdef CONFIG_DEBUG_INFO
+quiet_cmd_vmlinux_debug = GEN     $<.debug
+      cmd_vmlinux_debug = mkdir -p .debug; \
+                          $(OBJCOPY) --only-keep-debug \
+                          $< .debug/$<.debug
+targets += vmlinux.debug
+endif
+
+debug_info: vmlinux FORCE
+ifdef CONFIG_DEBUG_INFO
+	$(call if_changed,vmlinux_debug)
+	$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.modpost $@
+endif
+
 # vmlinux image - including updated kernel symbols
 vmlinux: $(vmlinux-lds) $(vmlinux-init) $(vmlinux-main) vmlinux.o $(kallsyms.o) FORCE
 ifdef CONFIG_HEADERS_CHECK
diff --git a/scripts/Makefile.modpost b/scripts/Makefile.modpost
index f4053dc..0df73b2 100644
--- a/scripts/Makefile.modpost
+++ b/scripts/Makefile.modpost
@@ -137,6 +137,20 @@ $(modules): %.ko :%.o %.mod.o FORCE
 
 targets += $(modules)
 
+modules-debug := $(modules:.ko=.ko.debug)
+ifdef CONFIG_DEBUG_INFO
+quiet_cmd_debug_ko = GEN     $@
+      cmd_debug_ko = mkdir -p .debug/`dirname $@`; \
+                     $(OBJCOPY) --only-keep-debug $< .debug/$@
+targets += $(modules-debug)
+endif
+
+debug_info: $(modules-debug) FORCE
+
+$(modules-debug): $(modules) FORCE
+ifdef CONFIG_DEBUG_INFO
+	$(call if_changed,debug_ko)
+endif
 
 # Add FORCE to the prequisites of a target to force it to be always rebuilt.
 # ---------------------------------------------------------------------------
Linus Torvalds
2009-Jan-10 00:05 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Nicholas Miell wrote:
>
> So take your complaint about gcc's decision to inline functions called
> once.

Actually, the "called once" really is a red herring. The big complaint is "too aggressively when not asked for". It just so happens that the called-once logic is right now the main culprit.

> Ignore for the moment the separate issue of stack growth and let's
> talk about what it does to debugging, which was the bulk of your
> complaint that I originally responded to.

Actually, stack growth is the one that ends up being a correctness issue. But:

> In the general case it does nothing at all to debugging (beyond the
> usual weird control flow you get from any optimized code) - the
> compiler generates line number information for the inlined functions,
> the debugger interprets that information, and your backtrace is
> accurate.

The thing is, we do not use line number information, and never will - because it's too big. MUCH too big. We do end up saving function start information (although even that is actually disabled if you're doing embedded development), so that we can at least tell which function something happened in.

> It is only in the specific case of the kernel's broken backtrace code
> that this becomes an issue. Its failure to function correctly is the
> direct result of a failure to keep up with modern compiler changes that
> everybody else in the toolchain has dealt with.

Umm. You can say that. But the fact is, most others care a whole lot _less_ about those "modern compiler changes". In user space, when you debug something, you generally just stop optimizing. In the kernel, we've tried to balance the "optimize vs debug info" thing.

> I think that the answer to that is that the kernel should do its best to
> be as much like userspace apps as it can, because insisting on special
> treatment doesn't seem to be working.

The problem with that is that the kernel _isn't_ a normal app. And it _definitely_ isn't a normal app when it comes to debugging.

You can hand-wave and talk about it all you want, but it's just not going to happen. A kernel is special. We don't get dumps, and only crazy people even ask for them.

The fact that you seem to think that we should get them just shows that you either don't understand the problems, or you live in some sheltered environment where crash dumps _could_ work, but by definition those environments also aren't where they buy kernel developers anything.

The thing is, a crash dump in an "enterprise environment" (and that is the only kind where you can reasonably dump more than the minimal stuff we do now) is totally useless - because such kernels are usually at least a year old, often more. As such, debug information from enterprise users is almost totally worthless - if we relied on it, we'd never get anything done.

And outside of those kinds of very rare niches, big kernel dumps simply are not an option. Writing to disk when things go hay-wire in the kernel is the _last_ thing you must ever do. People can't have dedicated dump partitions or network dumps.

That's the reality. I'm not making it up. We can give a simple trace, and yes, we can try to do some off-line improvement on it (and kerneloops.org to some degree does), but that's just about it.

But debugging isn't even the only issue. It's just that debuggability is more important than a DUBIOUS improvement in code quality. See? Note the DUBIOUS.

Let's take a very practical example on a number that has been floated around here: letting gcc do inlining decisions apparently can help for up to about 4% of code size. Fair enough - I happen to believe that we could cut that down a bit by just doing things manually with a checker, but that's neither here nor there.

What's the cost/benefit of that 4%? Does it actually improve performance? Especially if you then want to keep DWARF unwind information in memory in order to fix up some of the problems it causes? At that point, you lost all the memory you won, and then some.

Does it help I$ utilization (which can speed things up a lot more, and is probably the main reason -Os actually tends to perform better)? Likely not. Sure, shrinking code is good for I$, but on the other hand inlining can actually be bad for I$ density, because if you inline a function that doesn't get called, you have now fragmented your footprint a lot more.

So aggressive inlining has to be shown to be a real _win_. You try to say "well, do better debug info", but that turns inlining into a _loss_, so then the proper response is "don't inline".

So when is inlining a win?

It's a win when the thing you inline is clearly not bigger than the call site. Then it's totally unambiguous.

It's also often a win if it's an unconditional call from a single site, and you only inline one such, so that you avoid all of the downsides (you may be able to _shrink_ stack usage, and you're hopefully making I$ accesses _denser_ rather than fragmenting them).

And if you can seriously simplify the code by taking advantage of constant arguments, it can be an absolutely _huge_ win. Except, as we've seen in this discussion, gcc currently doesn't apparently even consider this case before it makes the inlining decision.

But if we're just looking at code size, then no, it's _not_ a win. Code size can be a win (4% denser I$ is good), but in a lot of the cases I've seen (which is often the _bad_ cases, since I end up looking at them because we are chasing bugs due to things like stack usage), it's actually just fragmenting the function and making everybody lose.

Oh, and yes, it does depend on architectures. Some architectures suck at function calls. That's why being able to trust the compiler _would_ be a good thing, no question about that. But yes, we do need to be able to trust it to make sense.

		Linus
Andi Kleen
2009-Jan-10 00:37 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> What's the cost/benefit of that 4%? Does it actually improve performance?
> Especially if you then want to keep DWARF unwind information in memory in
> order to fix up some of the problems it causes? At that point, you lost

dwarf unwind information has nothing to do with this; it doesn't tell you anything about inlining or not inlining. It just gives you finished frames after all of that has been done.

Full line number information would help, but I don't think anyone proposed to keep that in memory.

> Does it help I$ utilization (which can speed things up a lot more, and is
> probably the main reason -Os actually tends to perform better)? Likely
> not. Sure, shrinking code is good for I$, but on the other hand inlining
> can actually be bad for I$ density because if you inline a function that
> doesn't get called, you now fragmented your footprint a lot more.

Not sure that is always true; the gcc basic-block reordering based on its standard branch prediction heuristics (e.g. < 0 or == NULL unlikely, or the unlikely macro) might well put it all out of line.

-Andi
Linus Torvalds
2009-Jan-10 00:41 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Sat, 10 Jan 2009, Andi Kleen wrote:
>
> > What's the cost/benefit of that 4%? Does it actually improve performance?
> > Especially if you then want to keep DWARF unwind information in memory in
> > order to fix up some of the problems it causes? At that point, you lost
>
> dwarf unwind information has nothing to do with this; it doesn't tell
> you anything about inlining or not inlining. It just gives you
> finished frames after all of that has been done.
>
> Full line number information would help, but I don't think anyone
> proposed to keep that in memory.

Yeah, true. Although one of the reasons inlining actually ends up causing problems is because of the bigger stack frames. That leaves a lot of space for old stale function pointers to peek through. With denser stack frames, the stack dumps look better, even without an unwinder.

> > Does it help I$ utilization (which can speed things up a lot more, and is
> > probably the main reason -Os actually tends to perform better)? Likely
> > not. Sure, shrinking code is good for I$, but on the other hand inlining
> > can actually be bad for I$ density because if you inline a function that
> > doesn't get called, you now fragmented your footprint a lot more.
>
> Not sure that is always true; the gcc basic block reordering
> based on its standard branch prediction heuristics (e.g. < 0 or
> == NULL unlikely or the unlikely macro) might well put it all out of line.

I thought -Os actually disabled the basic-block reordering, doesn't it?

And I thought it did that exactly because it generates bigger code and much worse I$ patterns (ie you have a lot of "conditional branch to other place and then unconditional branch back" instead of "conditional branch over the non-taken code").

Also, I think we've had about as much good luck with guessing "likely/unlikely" as we've had with "inline" ;)

Sadly, apart from some of the "never happens" error cases, the kernel doesn't tend to have lots of nice patterns. We have almost no loops (well, there are loops all over, but most of them we hopefully just loop over once or twice in any good situation), and few really predictable things. Or rather, they can easily be very predictable under one particular load, and totally the other way around under another..

		Linus
Jamie Lokier
2009-Jan-10 00:53 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote:
> Oh yeah, and figure out what actually breaks on alpha such that they added
> the following (arch/alpha/include/asm/compiler.h)
>
> #ifdef __KERNEL__
> /* Some idiots over in <linux/compiler.h> thought inline should imply
>    always_inline. This breaks stuff. We'll include this file whenever
>    we run into such problems. */

Does "always_inline" complain if the function isn't inlinable, while "inline" allows it? That would explain the alpha comment.

-- Jamie
Ingo Molnar
2009-Jan-10 01:01 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 10 Jan 2009, Ingo Molnar wrote:
> >
> > may_inline/inline_hint is a longer, less known and uglier keyword.
>
> Hey, your choice, should you decide to accept it, is to just get rid of
> them entirely.
>
> You claim that we're back to square one, but that's simply the way
> things are. Either "inline" means something, or it doesn't. You argue
> for it meaning nothing. I argue for it meaning something.
>
> If you want to argue for it meaning nothing, then REMOVE it, instead of
> breaking it.
>
> It really is that simple. Remove the inlines you think are wrong.
> Instead of trying to change the meaning of them.

Well, it's not totally meaningless. To begin with, defining 'inline' to mean 'always inline' is a Linux kernel definition. So we already changed the behavior - in the hope of getting it right most of the time and in the hope of thus improving the kernel.

And now it appears that in our quest of improving the kernel we can further tweak that (already non-standard) meaning to a weak "inline if the compiler agrees too" hint. That gives us an even more compact kernel. It also moves the meaning of 'inline' closer to what the typical programmer expects it to be - for better or worse.

We could remove them completely, but there are a couple of practical problems with that:

- In this cycle alone, in the past ~2 weeks, we added another 1300 inlines to the kernel. Do we really want periodic postings of:

      [PATCH 0/135] inline removal cleanups

  ... in the next 10 years? We have about 20% of all functions in the kernel marked with 'inline'. It is a _very_ strong habit. Is it worth fighting against it?

- Headers could probably go back to 'extern inline' again. At no small expense - we just finished moving to 'static inline'. We'd need to guarantee a library instantiation for every header include file - this is an additional mechanism with additional introduction complexities and an ongoing maintenance cost.

- 'static inline' functions in .c files that are not used cause no build warnings - while if we change them to 'static', we get a 'defined but not used' warning. Hundreds of new warnings in the allyesconfig builds.

I know that because I have just removed all variants of 'inline' from all .c files of the kernel; it's a 3.5MB patch:

    3263 files changed, 12409 insertions(+), 12409 deletions(-)

x86 defconfig comparisons:

       text  filename
    6875817  vmlinux.always-inline                  (  0.000% )
    6838290  vmlinux.always-inline+remove-c-inlines ( -0.548% )
    6794474  vmlinux.optimize-inlining              ( -1.197% )

So the kernel's size improved by half a percent. Should i submit it?

	Ingo
Linus Torvalds
2009-Jan-10 01:04 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009, Jamie Lokier wrote:
>
> Does "always_inline" complain if the function isn't inlinable, while
> "inline" allows it? That would explain the alpha comment.

I suspect it dates back to gcc-3.1 days. It's from 2004.

And the author of that comment is a part-time gcc hacker who was probably offended by the fact that we thought (correctly) that a lot of gcc inlining was totally broken. Since he was the main alpha maintainer, he got to do things his way there..

		Linus
Linus Torvalds
2009-Jan-10 01:06 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009, Ingo Molnar wrote:
>
> Well, it's not totally meaningless. To begin with, defining 'inline' to
> mean 'always inline' is a Linux kernel definition. So we already changed
> the behavior - in the hope of getting it right most of the time and in the
> hope of thus improving the kernel.

Umm. No we didn't. We've never changed it. It was "always inline" back in the old days, and then we had to keep it "always inline", which is why we override the default gcc meaning with the preprocessor.

Now, OPTIMIZE_INLINING _tries_ to change the semantics, and people are complaining..

		Linus
Harvey Harrison
2009-Jan-10 01:08 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 2009-01-10 at 02:01 +0100, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> - Headers could probably go back to 'extern inline' again. At no small
>   expense - we just finished moving to 'static inline'. We'd need to
>   guarantee a library instantiation for every header include file - this
>   is an additional mechanism with additional introduction complexities
>   and an ongoing maintenance cost.

Puzzled? What benefit is there to going back to extern inline in headers?

Harvey
Andi Kleen
2009-Jan-10 01:08 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
> I thought -Os actually disabled the basic-block reordering, doesn't it?

Not in current gcc head, no (just verified by stepping through).

> And I thought it did that exactly because it generates bigger code and
> much worse I$ patterns (ie you have a lot of "conditional branch to other
> place and then unconditional branch back" instead of "conditional branch
> over the non-taken code").
>
> Also, I think we've had about as much good luck with guessing
> "likely/unlikely" as we've had with "inline" ;)

That's true. But if you look at the default heuristics that gcc has (gcc/predict.def in the gcc sources), like == NULL, < 0, branch guarding etc., I would expect a lot of them to DTRT for the kernel. Honza at some point even fixed goto to be unlikely after I complained :)

> Sadly, apart from some of the "never happens" error cases, the kernel
> doesn't tend to have lots of nice patterns. We have almost no loops (well,
> there are loops all over, but most of them we hopefully just loop over
> once or twice in any good situation), and few really predictable things.

That actually makes us well suited to gcc; it has a relatively poor loop optimizer compared to other compilers ;-)

> Or rather, they can easily be very predictable under one particular load,
> and totally the other way around under another ..

Yes, that is why we got good branch predictors in CPUs, I guess.

-Andi
-- 
ak@linux.intel.com
On Sat, Jan 10, 2009 at 12:53:42AM +0000, Jamie Lokier wrote:
> Harvey Harrison wrote:
> > Oh yeah, and figure out what actually breaks on alpha such that they added
> > the following (arch/alpha/include/asm/compiler.h)
> >
> > #ifdef __KERNEL__
> > /* Some idiots over in <linux/compiler.h> thought inline should imply
> >    always_inline. This breaks stuff. We'll include this file whenever
> >    we run into such problems. */
>
> Does "always_inline" complain if the function isn't inlinable, while

Yes, it does.

> "inline" allows it?

(unless you set -Winline, which the kernel doesn't)

-Andi
-- 
ak@linux.intel.com
Steven Rostedt
2009-Jan-10 01:18 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
> - Headers could probably go back to 'extern inline' again. At no small
>   expense - we just finished moving to 'static inline'. We'd need to
>   guarantee a library instantiation for every header include file - this
>   is an additional mechanism with additional introduction complexities
>   and an ongoing maintenance cost.

I thought the "static inline" in headers should be more of an "always inline". As Andrew Morton keeps yelling at me to use static inline instead of macros ;-)

I do not see the point in the functions in the headers needing to have their "inlines" removed.

> - 'static inline' functions in .c files that are not used cause no build
>   warnings - while if we change them to 'static', we get a 'defined but
>   not used' warning. Hundreds of new warnings in the allyesconfig builds.

Perhaps that's a good thing, to see what functions are unused in the source.

> I know that because I have just removed all variants of 'inline' from all
> .c files of the kernel; it's a 3.5MB patch:
>
>     3263 files changed, 12409 insertions(+), 12409 deletions(-)
>
> x86 defconfig comparisons:
>
>        text  filename
>     6875817  vmlinux.always-inline                  (  0.000% )
>     6838290  vmlinux.always-inline+remove-c-inlines ( -0.548% )
>     6794474  vmlinux.optimize-inlining              ( -1.197% )
>
> So the kernel's size improved by half a percent. Should i submit it?

Are there cases that are "must inline" in that patch?

Also, what is the difference if you do vmlinux.optimize-remove-c-inlines? Is there a difference there?

-- Steve
Ingo Molnar
2009-Jan-10 01:20 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 10 Jan 2009, Ingo Molnar wrote:
> >
> > Well, it's not totally meaningless. To begin with, defining 'inline' to
> > mean 'always inline' is a Linux kernel definition. So we already changed
> > the behavior - in the hope of getting it right most of the time and in the
> > hope of thus improving the kernel.
>
> Umm. No we didn't. We've never changed it. It was "always inline" back in

s/changed the behavior/overrode the behavior

> the old days, and then we had to keep it "always inline", which is why
> we override the default gcc meaning with the preprocessor.
>
> Now, OPTIMIZE_INLINING _tries_ to change the semantics, and people are
> complaining..

But I'd definitely argue that the Linux kernel definition of 'inline' was always more consistent than GCC's. That was rather easy as well: it doesn't get any more clear-cut than 'always inline'.

Nevertheless the question remains: is 3% on allyesconfig and 1% on defconfig (7.5% in kernel/built-in.o) worth changing the kernel definition for?

I think it is axiomatic that improving the kernel means changing it - sometimes that means changing deep details. (And if you see us ignoring complaints, let us know, it must not happen.) So the question isn't whether to change, the question is: does the kernel get 'better' after that change - and could the same be achieved realistically via other means?

If you accept us turning all 30,000 inlines in the kernel upside down, we might be able to get the same end result differently. You can definitely be sure that if people complained about this ~5 line feature they will complain about a tens-of-thousands-of-lines patch (and the follow-up changed regime) ten or a hundred times more fiercely.

And just in case it was not clear, I'm not a GCC apologist - to the contrary. I don't really care which tool makes the kernel better, and I won't stop looking at a quantified possibility to improve the kernel just because it happens to be GCC that offers a solution.

	Ingo
Linus Torvalds
2009-Jan-10 01:30 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Harvey Harrison wrote:
> On Sat, 2009-01-10 at 02:01 +0100, Ingo Molnar wrote:
> > - Headers could probably go back to 'extern inline' again. At not small
> >   expense - we just finished moving to 'static inline'. We'd need to
> >   guarantee a library instantiation for every header include file - this
> >   is an additional mechanism with additional introduction complexities
> >   and an ongoing maintenance cost.
>
> Puzzled? What benefit is there to going back to extern inline in headers?

There's none. In fact, it's wrong, unless you _also_ have an extern definition (according to the "new" gcc rules as of back in the days).

Of course, as long as "inline" really means _always_ inline, it won't matter. So in that sense Ingo is right - we _could_. Which has no bearing on whether we _should_, of course.

In fact, the whole mess with "extern inline" is a perfect example of why an inlining hint should be called "may_inline" or "inline_hint" or something like that. Because then it actually makes sense to have "extern may_inline" with one definition, and another definition for the non-inline version. And it's very clear what the deal is about, and why we literally have two versions of the same function.

But again, that's very much not a "let's use 'extern' instead of 'static'". It's a totally different issue.

		Linus

[ A third reason to use "extern inline" is actually a really evil one: we
  could do it for our unrelated issue with system call definitions on
  architectures that require the caller to sign-extend the arguments.

  Since we don't control the callers of system calls, we can't do that,
  and architectures like s390 actually have potential security holes due
  to callers that don't "follow the rules". So there are different needs
  for trusted - in-kernel - system call users that we know do the sign
  extension correctly, and untrusted - user-mode callers that just call
  through the system call function table.

  What we _could_ do is for the wrappers to use

	extern inline int sys_open(const char *pathname, int flags, mode_t mode)
	{
		return SYS_open(pathname, flags, mode);
	}

  which gives the C callers the right interface without any unnecessary
  wrapping, and then

	long WRAP_open(const char *pathname, long flags, long mode)
	{
		return SYS_open(pathname, flags, mode);
	}
	asm("\t.globl sys_open\n\t.set sys_open, WRAP_open");

  which is the one that gets linked from any asm code. So now asm code
  and C code get two different functions, even though they use the same
  system call name - one with inline expansion, one with linker games.
  Whee. The games we can play (and the odd reasons we must play them). ]
Linus Torvalds
2009-Jan-10 01:34 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Sat, 10 Jan 2009, Ingo Molnar wrote:
> > >
> > > Well, it's not totally meaningless. To begin with, defining 'inline' to
> > > mean 'always inline' is a Linux kernel definition. So we already changed
> > > the behavior - in the hope of getting it right most of the time and in the
> > > hope of thus improving the kernel.
> >
> > Umm. No we didn't. We've never changed it. It was "always inline" back in
>
> s/changed the behavior/overrode the behavior

The point is, as far as the kernel has been concerned, "inline" has always meant the same thing. Not just for the last few weeks, but for the last 18 _years_. It's always meant "always inline".

> But I'd definitely argue that the Linux kernel definition of 'inline' was
> always more consistent than GCC's. That was rather easy as well: it doesn't
> get any more clear-cut than 'always inline'.

Exactly. And I argue that we shouldn't change it.

If we have too many "inline"s, and if gcc inlines for us _anyway_, then the answer is not to change the meaning of "inline", but simply say "ok, we have too many inlines. Let's remove the ones we don't care about".

> I think it is axiomatic that improving the kernel means changing it -
> sometimes that means changing deep details. (And if you see us ignoring
> complaints, let us know, it must not happen.)

And I'm agreeing with you. What I'm _not_ agreeing with is how you want to change the semantics of something we've had for 18 years. YOU are the one who wants to make "inline" mean "maybe". I'm against it. I'm against it because it makes no sense. It's not what we've ever done.

		Linus
Linus Torvalds
2009-Jan-10 01:39 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009, Ingo Molnar wrote:
>
> - 'static inline' functions in .c files that are not used cause no build
>   warnings - while if we change them to 'static', we get a 'defined but
>   not used' warning. Hundreds of new warnings in the allyesconfig builds.

Well, duh. Maybe they shouldn't be marked "inline", and maybe they should be marked with "__maybe_unused" instead.

I do not think it makes sense to use "inline" as a way to say "maybe I won't use this function". Yes, it's true that "static inline" won't warn, but hey, as a way to avoid a warning it's a pretty bad one.

		Linus
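A minimal user-space sketch of the distinction. In the kernel, __maybe_unused comes from the compiler headers; it is defined locally here so the example is self-contained:

```c
#include <assert.h>

/* In the kernel this is provided by the compiler headers;
   defined locally for this sketch. */
#define __maybe_unused __attribute__((unused))

/* A plain "static" function would trigger "defined but not used" if
   nothing called it; __maybe_unused silences that warning explicitly
   while saying nothing about inlining... */
static int __maybe_unused helper(int x)
{
	return x + 1;
}

/* ...whereas "static inline" silences it only as a side effect of the
   inline hint - which is exactly what Linus objects to above. */
static inline int quietly_unused(int x)
{
	return x - 1;
}
```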
Andrew Morton
2009-Jan-10 01:41 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009 02:01:25 +0100 Ingo Molnar <mingo@elte.hu> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Sat, 10 Jan 2009, Ingo Molnar wrote:
> > >
> > > may_inline/inline_hint is a longer, less known and uglier keyword.
> >
> > Hey, your choice, should you decide to accept it, is to just get rid of
> > them entirely.
> >
> > You claim that we're back to square one, but that's simply the way
> > things are. Either "inline" means something, or it doesn't. You argue
> > for it meaning nothing. I argue for it meaning something.
> >
> > If you want to argue for it meaning nothing, then REMOVE it, instead of
> > breaking it.
> >
> > It really is that simple. Remove the inlines you think are wrong.
> > Instead of trying to change the meaning of them.
>
> Well, it's not totally meaningless. To begin with, defining 'inline' to
> mean 'always inline' is a Linux kernel definition. So we already changed
> the behavior - in the hope of getting it right most of the time and in the
> hope of thus improving the kernel.
>
> And now it appears that in our quest of improving the kernel we can
> further tweak that (already non-standard) meaning to a weak "inline if the
> compiler agrees too" hint. That gives us an even more compact kernel. It
> also moves the meaning of 'inline' closer to what the typical programmer
> expects it to be - for better or worse.
>
> We could remove them completely, but there are a couple of practical
> problems with that:
>
> - In this cycle alone, in the past ~2 weeks we added another 1300 inlines
>   to the kernel.

Who "reviewed" all that?

> Do we really want periodic postings of:
>
>   [PATCH 0/135] inline removal cleanups
>
> ... in the next 10 years? We have about 20% of all functions in the
> kernel marked with 'inline'. It is a _very_ strong habit. Is it worth
> fighting against it?

A side-effect of the inline fetish is that a lot of it goes into header files, thus requiring that those header files #include lots of other headers, thus leading to, well, the current mess.
Nicholas Miell
2009-Jan-10 02:12 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 2009-01-09 at 16:05 -0800, Linus Torvalds wrote:
>
> On Fri, 9 Jan 2009, Nicholas Miell wrote:
>
> > In the general case it does nothing at all to debugging (beyond the
> > usual weird control flow you get from any optimized code) -- the
> > compiler generates line number information for the inlined functions,
> > the debugger interprets that information, and your backtrace is
> > accurate.
>
> The thing is, we do not use line number information, and never will -
> because it's too big. MUCH too big.

It's only too big if you always keep it in memory, and I wasn't suggesting that.

My point was that you can get completely accurate stack traces in the face of gcc's inlining, and that blaming gcc because you can't get good stack traces - when the kernel's debugging infrastructure isn't up to snuff - isn't exactly fair.

> We do end up saving function start information (although even that is
> actually disabled if you're doing embedded development), so that we can at
> least tell which function something happened in.
>
> > It is only in the specific case of the kernel's broken backtrace code
> > that this becomes an issue. Its failure to function correctly is the
> > direct result of a failure to keep up with modern compiler changes that
> > everybody else in the toolchain has dealt with.
>
> Umm. You can say that. But the fact is, most others care a whole lot
> _less_ about those "modern compiler changes". In user space, when you
> debug something, you generally just stop optimizing. In the kernel, we've
> tried to balance the "optimize vs debug info" thing.

The bulk of the libraries that your userspace app links to are distro-built with full optimization, and figuring out why an optimized build crashed does come up from time to time, so those changes still matter.

> > I think that the answer to that is that the kernel should do its best to
> > be as much like userspace apps as it can, because insisting on special
> > treatment doesn't seem to be working.
>
> The problem with that is that the kernel _isn't_ a normal app. And it
> _definitely_ isn't a normal app when it comes to debugging.
>
> You can hand-wave and talk about it all you want, but it's just not going
> to happen. A kernel is special. We don't get dumps, and only crazy people
> even ask for them.
>
> The fact that you seem to think that we should get them just shows that
> you either don't understand the problems, or you live in some sheltered
> environment where crash-dumps _could_ work, but also by definition those
> environments aren't where they buy kernel developers anything.
>
> The thing is, a crash dump in an "enterprise environment" (and that is the
> only kind where you can reasonably dump more than the minimal stuff we do
> now) is totally useless - because such kernels are usually at least a year
> old, often more. As such, debug information from enterprise users is
> almost totally worthless - if we relied on it, we'd never get anything
> done.
>
> And outside of those kinds of very rare niches, big kernel dumps simply
> are not an option. Writing to disk when things go hay-wire in the kernel
> is the _last_ thing you must ever do. People can't have dedicated dump
> partitions or network dumps.
>
> That's the reality. I'm not making it up. We can give a simple trace, and
> yes, we can try to do some off-line improvement on it (and kerneloops.org
> to some degree does), but that's just about it.

And this is where we disagree. I believe that crash dumps should be the norm, and all the reasons you have against crash dumps in general are in fact reasons against Linux's sub-par implementation of crash dumps in specific.

I can semi-reliably panic my kernel, and it's fairly recent, too (2.6.27.9-159.fc10.x86_64). Naturally, I run X, so the result of a panic is a frozen screen, blinking keyboard lights, the occasional endless audio loop, and no useful information whatsoever.

I looked into kdump, only to discover that it doesn't work (but it could, it's a simple matter of fixing the initrd's script to support LVM), but I've already found a workaround, and after fiddling with kdump, I just don't care anymore.

So, here I am, a non-enterprise end user with a non-stale kernel who'd love to be able to give you a crash dump (or, more likely, a stack trace created from that crash dump), but I can't because Linux crash dumps are stuck in the enterprise ghetto.

You're right, the bulk of the people who do use kdump these days are enterprise people running ancient enterprise kernels, but that has more to do with kdump being an unusable bastard red-headed left-handed stepchild that you only use if you can't avoid it than it has to do with the crash dump concept being useless.

Hell, I'd be happy if I could get the normal panic text written to disk, but since the hard part is the actual writing to disk, there's no reason not to do the full crash dump if you can.

--
Nicholas Miell <nmiell@comcast.net>
> A side-effect of the inline fetish is that a lot of it goes into header
> files, thus requiring that those header files #include lots of other
> headers, thus leading to, well, the current mess.

I personally also always found it annoying while grepping that part of the code is in a completely different directory. e.g. try to go through TCP code without jumping to .h files all the time.

Long term that problem will hopefully disappear, as gcc learns to do cross-source-file inlining (like a lot of other compilers already do).

-Andi

--
ak@linux.intel.com
Linus Torvalds
2009-Jan-10 04:05 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009, Nicholas Miell wrote:
>
> It's only too big if you always keep it in memory, and I wasn't
> suggesting that.

Umm. We're talking kernel panics here. If it's not in memory, it doesn't exist as far as the kernel is concerned.

If it doesn't exist, it cannot be reported.

> My point was that you can get completely accurate stack traces in the
> face of gcc's inlining, and that blaming gcc because you can't get good
> stack traces because the kernel's debugging infrastructure isn't up to
> snuff isn't exactly fair.

No. I'm blaming inlining for making debugging harder. And that's ok - IF IT IS WORTH IT.

It's not. Gcc inlining decisions suck. gcc inlines stuff that doesn't really help from being inlined, and doesn't inline stuff that _does_.

What's so hard to accept in that?

> And this is where we disagree. I believe that crash dumps should be the
> norm and all the reasons you have against crash dumps in general are in
> fact reasons against Linux's sub-par implementation of crash dumps in
> specific.

Good luck with that. Go ahead and try it. You'll find it wasn't so easy after all.

> So, here I am, a non-enterprise end user with a non-stale kernel who'd
> love to be able to give you a crash dump (or, more likely, a stack trace
> created from that crash dump), but I can't because Linux crash dumps are
> stuck in the enterprise ghetto.

No, you're stuck because you apparently have your mind stuck on a crash-dump, and aren't willing to look at alternatives.

You could use a network console. Trust me - if you can't set up a network console, you have no business mucking around with crash dumps. And if the crash is hard enough that you can't get any output from that, again, a crash dump wouldn't exactly help, would it?

> Hell, I'd be happy if I could get the normal panic text written to
> disk, but since the hard part is the actual writing to disk, there's no
> reason not to do the full crash dump if you can.

Umm. And why do you think the two have anything to do with each other?

Only insane people want the kernel to write to disk when it has problems. Sane people try to write to something that doesn't potentially overwrite their data. Like the network.

Which is there. Try it. Trust me, it's a _hell_ of a lot more likely to work than a crash dump.

		Linus
H. Peter Anvin
2009-Jan-10 05:03 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
>
> There's none. In fact, it's wrong, unless you _also_ have an extern
> definition (according to the "new" gcc rules as of back in the days).
>
> Of course, as long as "inline" really means _always_ inline, it won't
> matter. So in that sense Ingo is right - we _could_. Which has no bearing
> on whether we _should_, of course.

I was thinking about experimenting with this, to see what level of upside it might add. Ingo showed me numbers which indicate that a fairly significant fraction of the cases where removing inline helps is in .h files, which would require code movement to fix. Hence to see if it can be automated.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Linus Torvalds
2009-Jan-10 05:28 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, H. Peter Anvin wrote:
>
> I was thinking about experimenting with this, to see what level of
> upside it might add. Ingo showed me numbers which indicate that a
> fairly significant fraction of the cases where removing inline helps is
> in .h files, which would require code movement to fix. Hence to see if
> it can be automated.

We _definitely_ have too many inline functions in headers. They usually start out small, and then they grow. And even after they've grown big, it's usually not at all clear exactly where else they should go, so even when you realize that "that shouldn't be inlined", moving them and making them uninlined is not obvious.

And quite often, some of them go away - or at least shrink a lot - when some config option or other isn't set. So sometimes it's an inline because a certain class of people really want it inlined, simply because for _them_ it makes sense, but when you enable debugging or something, it absolutely explodes.

		Linus
Arjan van de Ven
2009-Jan-10 05:29 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 9 Jan 2009 14:52:33 -0500 Theodore Tso <tytso@mit.edu> wrote:
> Kerneloops.org does this, so the code is mostly written; but it does
> this in a blinded fashion, so it only makes sense for oopses which are
> very common and for which we don't need to ask the user, "so what were
> you doing at the time". In cases where the user has already stepped
> up and reported the oops on a mailing list, it would be nice if
> kerneloops.org had a way of decoding the oops via some web page.
>
> Arjan, would something like this be doable, hopefully without too much
> effort?

Thinking about this... making a "pastebin"-like thing for oopses is relatively trivial for me; all the building blocks I have already.

The hard part is getting the vmlinux files in place. Right now I do this manually for popular released kernels. If the fedora/suse guys would help to at least have the vmlinux for their released updates easily available, that would be a huge help... without that it's going to suck.

--
Arjan van de Ven
Intel Open Source Technology Centre
For development, discussion and tips for power savings, visit http://www.lesswatts.org
H. Peter Anvin
2009-Jan-10 05:57 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
Arjan van de Ven wrote:
>
> thinking about this.. making a "pastebin" like thing for oopses is
> relatively trivial for me; all the building blocks I have already.
>
> The hard part is getting the vmlinux files in place. Right now I do
> this manually for popular released kernels.. if the fedora/suse guys
> would help to at least have the vmlinux for their released updates
> easily available that would be a huge help.... without that it's going
> to suck.

We could just pick them up automatically from the kernel.org mirrors with a little bit of scripting.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
H. Peter Anvin
2009-Jan-10 05:57 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
>
> And quite often, some of them go away - or at least shrink a lot - when
> some config option or other isn't set. So sometimes it's an inline because
> a certain class of people really want it inlined, simply because for
> _them_ it makes sense, but when you enable debugging or something, it
> absolutely explodes.

And this is really why getting static inline annotations right is really hard if not impossible in the general case (especially when considering the sheer number of architectures we compile on.)

So making it possible for the compiler to do the right thing for at least this class of functions really does seem like a good idea.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
H. Peter Anvin
2009-Jan-10 06:44 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote:
>
> - Headers could probably go back to 'extern inline' again. At not small
>   expense - we just finished moving to 'static inline'. We'd need to
>   guarantee a library instantiation for every header include file - this
>   is an additional mechanism with additional introduction complexities
>   and an ongoing maintenance cost.

I think I have a pretty clean idea for how to do this. I'm going to experiment with it over the next few days.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Nicholas Miell
2009-Jan-10 06:44 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 2009-01-09 at 20:05 -0800, Linus Torvalds wrote:
>
> On Fri, 9 Jan 2009, Nicholas Miell wrote:
> >
> > It's only too big if you always keep it in memory, and I wasn't
> > suggesting that.
>
> Umm. We're talking kernel panics here. If it's not in memory, it doesn't
> exist as far as the kernel is concerned.
>
> If it doesn't exist, it cannot be reported.

The idea was that the kernel would generate a crash dump and then, after the reboot, a post-processing tool would do something with it (e.g. run the dump through crash to get a stack trace using separate debug info, or ship the entire dump off to a collection server or something).

> > And this is where we disagree. I believe that crash dumps should be the
> > norm and all the reasons you have against crash dumps in general are in
> > fact reasons against Linux's sub-par implementation of crash dumps in
> > specific.
>
> Good luck with that. Go ahead and try it. You'll find it wasn't so easy
> after all.
>
> > So, here I am, a non-enterprise end user with a non-stale kernel who'd
> > love to be able to give you a crash dump (or, more likely, a stack trace
> > created from that crash dump), but I can't because Linux crash dumps are
> > stuck in the enterprise ghetto.
>
> No, you're stuck because you apparently have your mind stuck on a
> crash-dump, and aren't willing to look at alternatives.
>
> You could use a network console. Trust me - if you can't set up a network
> console, you have no business mucking around with crash dumps.

netconsole requires a second computer. Feel free to mail me one. :)

> And if the crash is hard enough that you can't get any output from that,
> again, a crash dump wouldn't exactly help, would it?
>
> > Hell, I'd be happy if I could get the normal panic text written to
> > disk, but since the hard part is the actual writing to disk, there's no
> > reason not to do the full crash dump if you can.
>
> Umm. And why do you think the two have anything to do with each other?
>
> Only insane people want the kernel to write to disk when it has problems.
> Sane people try to write to something that doesn't potentially overwrite
> their data. Like the network.
>
> Which is there. Try it. Trust me, it's a _hell_ of a lot more likely to
> work than a crash dump.

Well, yes, but that has everything to do with how terrible kdump is and nothing to do with the idea of crash dumps in general.

Anyway, we've strayed off topic long enough; I'm sure everyone on the Cc list would be happy to stop getting an earful about the merits of crash dumps.

--
Nicholas Miell <nmiell@comcast.net>
Chris Samuel
2009-Jan-10 08:55 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Sat, 10 Jan 2009 6:56:02 am H. Peter Anvin wrote:
> This does bring up the idea of including a compiler with the kernel
> sources again, doesn't it?

Oh please don't give him any more ideas; first it was a terminal emulator that had delusions of grandeur, then something to help him manage that code. Do we really want lcc too? ;-)

--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
> Linus, what do you think about this particular approach of spin-mutexes?
> It's not the typical spin-mutex I think.
>
> The thing I like most about Peter's patch (compared to most other adaptive
> spinning approaches I've seen, which all sucked as they included various
> ugly heuristics complicating the whole thing) is that it solves the "how
> long should we spin" question elegantly: we spin until the owner runs on a
> CPU.

Well, if there's a timeout, that's obviously safe. But this has no timeout, and Linus wants to play games with accessing 'does owner run on a cpu?' locklessly.

Now, can it mistakenly spin when the owner is scheduled away? That would deadlock, and without locking, I'm not sure if we prevent that....

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
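A user-space model of the adaptive-spinning idea under discussion may make the "spin until the owner runs on a CPU" rule concrete. This is illustrative only - the names and layout are not the kernel patch, and the lockless owner read is exactly the part Pavel is worried about:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
	_Atomic bool on_cpu;		/* is this task running right now? */
};

struct adaptive_mutex {
	atomic_flag locked;
	_Atomic(struct task *) owner;	/* read locklessly by spinners */
};

static bool mutex_trylock(struct adaptive_mutex *m, struct task *self)
{
	if (atomic_flag_test_and_set(&m->locked))
		return false;
	atomic_store(&m->owner, self);
	return true;
}

static void mutex_unlock(struct adaptive_mutex *m)
{
	atomic_store(&m->owner, NULL);
	atomic_flag_clear(&m->locked);
}

/* The "how long should we spin" answer: only while the owner is seen
   running. Once the owner is preempted, the waiter must block rather
   than spin, or it can burn CPU (or, worst case, deadlock). */
static bool worth_spinning(struct adaptive_mutex *m)
{
	struct task *owner = atomic_load(&m->owner);

	return owner != NULL && atomic_load(&owner->on_cpu);
}
```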
Jeremy Fitzhardinge
2009-Jan-10 23:59 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
> Actually, the real spin locks are now fair. We use ticket locks on x86.
>
> Well, at least we do unless you enable that broken paravirt support. I'm
> not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I
> don't much care.

No, it will continue to use ticket locks, but there's the option to switch to byte locks or something else. Ticket locks are awesomely bad when the VCPU scheduler fights with the run-order required by the ticket order.

	J
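For reference, a minimal single-threaded model of a ticket lock - a sketch of the concept, not the kernel's x86 implementation. Each acquirer takes the next ticket and waits until that ticket is being served, which enforces the FIFO order that makes the lock fair, and which is exactly the ordering a VCPU scheduler can frustrate:

```c
#include <assert.h>
#include <stdatomic.h>

/* Conceptual ticket spinlock (illustrative, not kernel code). */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint serving;	/* ticket currently allowed to hold the lock */
};

static void ticket_lock(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	/* Strict FIFO: we proceed only when our ticket comes up. A VCPU
	   scheduler that runs waiters out of ticket order leaves every
	   VCPU spinning behind the one whose turn it actually is. */
	while (atomic_load(&l->serving) != me)
		; /* spin */
}

static void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add(&l->serving, 1);	/* pass to the next ticket */
}
```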
Ingo Molnar
2009-Jan-11 00:54 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:> On Fri, 9 Jan 2009, H. Peter Anvin wrote: > > > > I was thinking about experimenting with this, to see what level of > > upside it might add. Ingo showed me numbers which indicate that a > > fairly significant fraction of the cases where removing inline helps > > is in .h files, which would require code movement to fix. Hence to > > see if it can be automated. > > We _definitely_ have too many inline functions in headers. They usually > start out small, and then they grow. And even after they''ve grown big, > it''s usually not at all clear exactly where else they should go, so even > when you realize that "that shouldn''t be inlined", moving them and > making them uninlined is not obvious. > > And quite often, some of them go away - or at least shrink a lot - when > some config option or other isn''t set. So sometimes it''s an inline > because a certain class of people really want it inlined, simply because > for _them_ it makes sense, but when you enable debugging or something, > it absolutely explodes.IMO it''s all quite dynamic when it comes to inlining. Beyond the .config variances (which alone is enough degrees of freedom to make this non-static, it at least is a complexity we can control in the kernel to a certain degree) it also depends on the platform, the CPU type, the compiler version - factors which we dont (and probably dont want to) control. There''s also the in-source variance of how many times an inline function is used within a .c file, and that factor is not easily tracked. If it''s used once in a single .c file it should be inlined even if it''s large. If it''s used twice in a .c file it might be put out of line. Transition between those states is not obvious in all cases. There''s certainly clear-cut cases: the very small/constant ones that must be short and inlined in any environment, and the very large/complex ones that must not be inlined under any circumstance. 
But there's a lot of shades of grey in between - and that's where the size wins come from.

I'm not sure we can (or should) generally expect kernel coders to continuously maintain the 30,000+ inline attributes in the kernel that involve 100,000+ functions:

 - Nothing breaks if it's there, nothing breaks if it's not there. It's a
   completely opaque, transparent entity that never pushes itself to the
   foreground of human attention.

 - It's so easy to add an inline function call site to a .c file without
   noticing that it should not be inlined anymore.

 - It's so easy to _remove_ a usage site from a .c file without noticing
   that something should be inlined. I.e. local changes will have an effect
   on the inline attribute _elsewhere_ - and this link is not obvious and
   not tooled when editing the code.

 - The mapping from C statements to assembly can be non-obvious even to
   experienced developers. Data type details (signed/unsigned, width, etc.)
   can push an inline function over the (very hard to define) boundary.

I.e. IMO it's all very dynamic, it's opaque, it's not visualized and it's hard to track - so it's very fundamentally not for humans to maintain [except for the really clear-cut cases].

Add to that that in _theory_ the decision to inline or not is boringly mechanic, and tools ought to be able to do a near-perfect job with it and just adapt to whatever environment the kernel is in at a given moment when it's built. GCC limps along with its annoyingly mis-designed inlining heuristics; hopefully LLVM will become a real compiler that is aware of little details like instruction size and has a built-in assembler ...

So IMO all the basic psychological mechanics are missing from the picture that would result in really good, "self-maintained" inline attributes.
We can try to inject the requirement to have good inline attributes as an external rule, as a principle we want to see met - but we cannot expect it to really be followed in its current form, as it goes subtly against human nature on various levels.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Woodhouse
2009-Jan-11 12:26 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 2009-01-10 at 04:02 +0100, Andi Kleen wrote:

> Long term that problem will hopefully disappear, as gcc learns to do cross
> source file inlining (like a lot of other compilers already do)

We've already been able to get GCC doing this for the kernel, in fact (the --combine -fwhole-program stuff I was working on a while back). It gives an interesting size reduction, especially in file systems and other places where we tend to have functions with a single call site... but in a different file.

Linus argues that we don't want that kind of inlining because it harms debuggability, but that isn't _always_ true. Sometimes you weren't going to get a backtrace if something goes wrong _anyway_. And even if the size reduction doesn't necessarily give a corresponding performance improvement, we might not care. In the embedded world, size does matter too. And the numbers are such that you can happily keep debuginfo for the shipped kernel builds and postprocess any backtraces you get. Just as we can for distros.

In general, I would much prefer being able to trust the compiler, rather than disabling its heuristics completely. We might not be able to trust it right now, but we should be working towards that state. Not just declaring that we know best, even though _sometimes_ we do.

I think we should:

 - Unconditionally have 'inline' meaning 'always_inline'. If we say it,
   we should mean it.

 - Resist the temptation to use -fno-inline-functions. Allow GCC to
   inline other things if it wants to.

 - Reduce the number of unnecessary 'inline' markers, and have a policy
   that the use of 'inline' should be accompanied by either a GCC PR#
   or an explanation of why we couldn't reasonably have expected GCC to
   get this particular case right.

 - Have a similar policy of PR# or explanation for 'uninline' too.

I don't think we should just give up on GCC ever getting it right. That way lies madness. As we've often found in the past.
--
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
On Sun, Jan 11, 2009 at 12:26:41PM +0000, David Woodhouse wrote:

> - Unconditionally have 'inline' meaning 'always_inline'. If we say it,
>   we should mean it.
>
> - Resist the temptation to use -fno-inline-functions. Allow GCC to
>   inline other things if it wants to.

The proposal was to use -fno-inline-functions-called-once (but the resulting numbers were not promising). We've never allowed gcc to inline any other functions not marked inline explicitly, because that's not included in -O2.

> - Reduce the number of unnecessary 'inline' markers, and have a policy
>   that the use of 'inline' should be accompanied by either a GCC PR#
>   or an explanation of why we couldn't reasonably have expected GCC to
>   get this particular case right.
>
> - Have a similar policy of PR# or explanation for 'uninline' too.
>
> I don't think we should just give up on GCC ever getting it right. That
> way lies madness. As we've often found in the past.

It sounds like you're advocating setting -O3/-finline-functions by default. Not sure that's a good idea.

-Andi
--
ak@linux.intel.com
Linus Torvalds
2009-Jan-11 19:25 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, 11 Jan 2009, Andi Kleen wrote:
>
> The proposal was to use -fno-inline-functions-called-once (but
> the resulting numbers were not promising)

Well, the _optimal_ situation would be to not need it, because gcc does a good job without it. That implies trying to find a better balance between "worth it" and "causes problems".

Right now, it does sound like gcc simply doesn't try to balance AT ALL, or balances only when we add some very version-specific random options (ie the stack-usage one). And even those options don't actually make much sense - yes, they "balance" things, but they don't do it in a sensible manner.

For example: stack usage is undeniably a problem (we've hit it over and over again), but it's not about "stack must not be larger than X bytes". If the call is done unconditionally, then inlining _one_ function will grow the static stack usage of the function we inline into, but it will _not_ grow the dynamic stack usage one whit - so deciding to not inline because of stack usage is pointless.

See? So "stop inlining when you hit a stack limit" IS THE WRONG THING TO DO TOO! Because it just means that the compiler continues to do bad inlining decisions until it hits some magical limit - but since the problem isn't the static stack size of any _single_ function, but the combined stack size of a dynamic chain of them, that's totally idiotic. You still grew the dynamic stack, and you have no way of knowing by how much - the limit on the static one simply has zero bearing what-so-ever on the dynamic one.

So no, "limit static stack usage" is not a good option, because it stops inlining when it doesn't matter (single unconditional call), and doesn't stop inlining when it might (lots of sequential calls, in a deep chain).

The other alternative is to let gcc do what it does, but

 (a) remove lots of unnecessary 'inline's. And we should likely do this
     regardless of any "-fno-inline-functions-called-once" issues.
 (b) add lots of 'noinline's to avoid all the cases where gcc screws up so
     badly that it's either a debugging disaster or an actual correctness
     issue.

The problem with (b) is that it's a lot of hard thinking, and debugging disasters always happen in code that you didn't realize would be a problem (because if you had, it simply wouldn't be the debugging issue it is).

		Linus
Andi Kleen
2009-Jan-11 20:14 UTC
gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, Jan 11, 2009 at 11:25:32AM -0800, Linus Torvalds wrote:
>
> On Sun, 11 Jan 2009, Andi Kleen wrote:
> >
> > The proposal was to use -fno-inline-functions-called-once (but
> > the resulting numbers were not promising)
>
> Well, the _optimal_ situation would be to not need it, because gcc does a
> good job without it. That implies trying to find a better balance between
> "worth it" and "causes problems".
>
> Right now, it does sound like gcc simply doesn't try to balance AT ALL, or
> balances only when we add some very version-specific random options (ie
> the stack-usage one).

The gcc 4.3 inliner takes stack growth into account by default (without any special options). I experimented a bit with it when that was introduced and found the default thresholds are too large for the kernel and don't change the checkstack.pl picture much. I asked back then and was told --param large-stack-frame is expected to be a reasonably stable --param (as much as these can be), and I did a patch to lower it, but I couldn't get myself to actually submit it [if you really want it I can send it]. But of course that only helps for gcc 4.3+; older gccs would need a different workaround.

On the other hand (my personal opinion, not shared by everyone), the ioctl switch stack issue is mostly only a problem with 4K stacks, and in the rare cases when I still run 32bit kernels I never set that option, because I consider it russian roulette (there are undoubtedly dangerous dynamic stack growth cases that checkstack.pl doesn't flag).

> And even those options don't actually make much
> sense - yes, they "balance" things, but they don't do it in a sensible
> manner.
>
> For example: stack usage is undeniably a problem (we've hit it over and
> over again), but it's not about "stack must not be larger than X bytes".
> If the call is done unconditionally, then inlining _one_ function will
> grow the static stack usage of the function we inline into, but it will
> _not_ grow the dynamic stack usage one whit - so deciding to not inline
> because of stack usage is pointless.

I don't think the current inliner takes that into account, from a quick look at the sources, although it probably could. Maybe Honza can comment. But even if it did, it could only do that for a single file; if the function is in a different file, gcc doesn't have the information (unless you run with David's --combine hack). This means the kernel developers have to do it anyway.

On the other hand I'm not sure it's that big a problem. Just someone should run make checkstack occasionally and add noinlines to large offenders.

-Andi

[keeping the quote for Honza's benefit]

> See? So "stop inlining when you hit a stack limit" IS THE WRONG THING TO
> DO TOO! Because it just means that the compiler continues to do bad
> inlining decisions until it hits some magical limit - but since the
> problem isn't the static stack size of any _single_ function, but the
> combined stack size of a dynamic chain of them, that's totally idiotic.
> You still grew the dynamic stack, and you have no way of knowing by how
> much - the limit on the static one simply has zero bearing what-so-ever on
> the dynamic one.
>
> So no, "limit static stack usage" is not a good option, because it stops
> inlining when it doesn't matter (single unconditional call), and doesn't
> stop inlining when it might (lots of sequential calls, in a deep chain).
>
> The other alternative is to let gcc do what it does, but
>
> (a) remove lots of unnecessary 'inline's. And we should likely do this
>     regardless of any "-fno-inline-functions-called-once" issues.
>
> (b) add lots of 'noinline's to avoid all the cases where gcc screws up so
>     badly that it's either a debugging disaster or an actual correctness
>     issue.
> The problem with (b) is that it's a lot of hard thinking, and debugging
> disasters always happen in code that you didn't realize would be a problem
> (because if you had, it simply wouldn't be the debugging issue it is).
>
> 		Linus

--
ak@linux.intel.com
David Woodhouse
2009-Jan-11 20:15 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, 2009-01-11 at 21:14 +0100, Andi Kleen wrote:
>
> On the other hand (my personal opinion, not shared by everyone), the
> ioctl switch stack issue is mostly only a problem with 4K stacks, and
> in the rare cases when I still run 32bit kernels I never set that
> option, because I consider it russian roulette (there are undoubtedly
> dangerous dynamic stack growth cases that checkstack.pl doesn't flag)

Isn't the ioctl switch stack issue a separate GCC bug? It was/is assigning separate space for local variables which are mutually exclusive. So instead of the stack footprint of the function with the switch() being equal to the largest individual stack size of all the subfunctions, it's equal to the _sum_ of the stack sizes of the subfunctions - even though it'll never use them all at the same time.

Without that bug, it would have been harmless to inline them all.

--
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
Andi Kleen
2009-Jan-11 20:34 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
> Isn't the ioctl switch stack issue a separate GCC bug?
>
> It was/is assigning separate space for local variables which

Was - I think that got fixed in gcc. But again only in newer versions.

> are mutually exclusive. So instead of the stack footprint of the
> function with the switch() being equal to the largest individual stack
> size of all the subfunctions, it's equal to the _sum_ of the stack sizes
> of the subfunctions. Even though it'll never use them all at the same
> time.
>
> Without that bug, it would have been harmless to inline them all.

True.

-Andi
--
ak@linux.intel.com
Linus Torvalds
2009-Jan-11 20:51 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, 11 Jan 2009, Andi Kleen wrote:
>
> Was - I think that got fixed in gcc. But again only in newer versions.

I doubt it. People have said that about a million times, it has never gotten fixed, and I've never seen any actual proof.

I think that what got fixed was that gcc now at least re-uses stack slots for temporary spills. But only for things that fit in registers - not if you actually had variables that are big enough to be of type MEM. And the latter is what tends to eat stack space (ie structures etc on stack).

But hey, maybe it really did get fixed. But the last big stack user wasn't that long ago, and I saw it and I have a pretty recent gcc (gcc-4.3.2 right now, it could obviously have been slightly older back a few months ago).

		Linus
Valdis.Kletnieks@vt.edu
2009-Jan-11 22:45 UTC
Re: [patch] measurements, numbers about CONFIG_OPTIMIZE_INLINING=y impact
On Fri, 09 Jan 2009 08:34:57 PST, "H. Peter Anvin" said:

> A lot of noise is being made about the naming of the levels (and I
> personally believe we should have a different annotation for "inline
> unconditionally for correctness" and "inline unconditionally for
> performance", as a documentation issue), but those are the four we get.

I know we use __builtin_return_address() in several places, and in several other places we introspect the stack and need to find the right frame entry. Are there any other places that need to be inlined for correctness?
Linus Torvalds
2009-Jan-11 23:05 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, 11 Jan 2009, Linus Torvalds wrote:
> On Sun, 11 Jan 2009, Andi Kleen wrote:
> >
> > Was - I think that got fixed in gcc. But again only in newer versions.
>
> I doubt it. People have said that about a million times, it has never
> gotten fixed, and I've never seen any actual proof.

In fact, I just double-checked.

Try this:

	struct a {
		unsigned long array[200];
		int a;
	};

	struct b {
		int b;
		unsigned long array[200];
	};

	extern int fn3(int, void *);
	extern int fn4(int, void *);

	static inline __attribute__((always_inline)) int fn1(int flag)
	{
		struct a a;
		return fn3(flag, &a);
	}

	static inline __attribute__((always_inline)) int fn2(int flag)
	{
		struct b b;
		return fn4(flag, &b);
	}

	int fn(int flag)
	{
		if (flag & 1)
			return fn1(flag);
		return fn2(flag);
	}

(yeah, I made sure it would inline with "always_inline" just so that the issue wouldn't be hidden by any "avoid stack frames" flags).

Gcc creates a big stack frame that contains _both_ 'a' and 'b', and does not merge the allocations together even though they clearly have no overlap in usage. Both 'a' and 'b' get 201 long-words (1608 bytes) of stack, causing the inlined version to have 3kB+ of stack, even though the non-inlined one would never use more than half of it.

So please stop claiming this is fixed. It's not fixed, never has been, and quite frankly, probably never will be, because the lifetime analysis is hard enough (ie once you inline and there is any complex usage, CSE etc will quite possibly mix up the lifetimes - the above is clearly not any _realistic_ example). So even if the above trivial case could be fixed, I suspect a more complex real-life case would still keep the allocations separate. Because merging the allocations and re-using the same stack for both really is pretty non-trivial, and the best solution is to simply not inline.

(And yeah, the above is such an extreme case that gcc seems to realize that it makes no sense to inline because the stack frame is _so_ big.
I don't know what the default stack frame limit is, but it's apparently smaller than 1.5kB ;)

		Linus
http://bugzilla.kernel.org/show_bug.cgi?id=12435

Congratulations ;)
Andi Kleen
2009-Jan-12 00:12 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, Jan 11, 2009 at 03:05:53PM -0800, Linus Torvalds wrote:
>
> On Sun, 11 Jan 2009, Linus Torvalds wrote:
> > On Sun, 11 Jan 2009, Andi Kleen wrote:
> > >
> > > Was - I think that got fixed in gcc. But again only in newer versions.
> >
> > I doubt it. People have said that about a million times, it has never
> > gotten fixed, and I've never seen any actual proof.
>
> In fact, I just double-checked.
> Try this:

Hmm, I actually had tested it some time ago with my own program (supposed to emulate an ioctl):

	extern void f5(char *);

	static void f3(void)
	{
		char y[100];
		f5(y);
	}

	static void f2(void)
	{
		char x[100];
		f5(x);
	}

	int f(int cmd)
	{
		switch (cmd) {
		case 1:
			f3();
			break;
		case 2:
			f2();
			break;
		}
		return 1;
	}

and with gcc 4.3.1 I get:

	.globl f
		.type	f, @function
	f:
	.LFB4:
		subq	$120, %rsp	<---- not 200 bytes, stack gets reused; dunno where the 20 comes from
	.LCFI0:
		cmpl	$1, %edi
		je	.L4
		cmpl	$2, %edi
		je	.L4
		movl	$1, %eax
		addq	$120, %rsp
		ret
		.p2align 4,,10
		.p2align 3
	.L4:
		movq	%rsp, %rdi
		call	f5
		movl	$1, %eax
		addq	$120, %rsp
		ret
	.LFE4:

so at least for this case it works. Your case also doesn't work for me, so it looks like gcc didn't like something you did in your test program. Could be a pointer aliasing problem of some sort.

But yes, it doesn't work as well as we hoped.

-Andi
Linus Torvalds
2009-Jan-12 00:21 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Andi Kleen wrote:
>
> so at least for this case it works. Your case also doesn't work
> for me, so it looks like gcc didn't like something you did in your test
> program.

I very intentionally used _different_ types.

If you use the same type, gcc will apparently happily say "hey, I can combine two variables of the same type with different liveness into the same variable".

But that's not the interesting case.

		Linus
Andi Kleen
2009-Jan-12 00:52 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, Jan 11, 2009 at 04:21:03PM -0800, Linus Torvalds wrote:
>
> On Mon, 12 Jan 2009, Andi Kleen wrote:
> >
> > so at least for this case it works. Your case also doesn't work
> > for me, so it looks like gcc didn't like something you did in your test
> > program.
>
> I very intentionally used _different_ types.
>
> If you use the same type, gcc will apparently happily say "hey, I can
> combine two variables of the same type with different liveness into the
> same variable".

Confirmed.

> But that's not the interesting case.

Weird. I wonder where this strange restriction comes from. It indeed makes this much less useful than it could be :/

-Andi
--
ak@linux.intel.com
H. Peter Anvin
2009-Jan-12 01:20 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Andi Kleen wrote:
>
> Weird. I wonder where this strange restriction comes from. It indeed
> makes this much less useful than it could be :/

Most likely they're collapsing at too early of a stage.

	-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.
Jamie Lokier
2009-Jan-12 01:56 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote:

> If it's used once in a single .c file it should be inlined even if
> it's large.

As Linus has pointed out, because of GCC not sharing stack among different inlined functions, the above is surprisingly not true.

In the kernel it's a problem due to raw stack usage. In userspace apps (where the stack is larger), inlining single-call functions could, paradoxically, run slower due to increased stack dcache pressure for some larger functions.

-- Jamie
Chris Samuel
2009-Jan-12 07:59 UTC
Hard to debug kernel issues (was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning)
On Sun, 11 Jan 2009 11:26:41 pm David Woodhouse wrote:

> Sometimes you weren't going to get a backtrace if something goes wrong
> _anyway_.

Case in point - we've been struggling with some of our SuperMicro based systems with AMD Barcelona B3 k10h CPUs *turning themselves off* when running various HPC applications. Nothing in the kernel logs, nothing in the IPMI controller logs; it's just like someone has wandered in and held the power button down (and no, it's not that). It's been driving us up the wall.

We'd assumed it was a hardware issue, as it was happening with all sorts of kernels, but today we tried 2.6.29-rc1 "just in case" and I've not been able to reproduce the crash (yet) on a node I could previously crash in about 30 seconds; rebooting back into 2.6.28 makes it crash again.

If the test boxes are still alive tomorrow we might attempt some form of reverse bisect to track down what commit fixed it (git doesn't seem to support that directly, so we're going to have to invert the good/bad commands).

cheers,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP
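The "reverse bisect" Chris describes is just an ordinary bisect with the labels swapped: mark the fixed kernel "bad" and the still-crashing one "good", and git converges on the commit that introduced the fix. A self-contained toy sketch (a throwaway repo stands in for the kernel tree, and a grep stands in for the 30-second crash reproducer):

```shell
# "Reverse bisect" sketch: hunting the commit that FIXED a bug, so the
# labels are inverted - still-crashing = "good", fixed = "bad".
# A toy repo and a grep stand in for the kernel tree and the reproducer.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email tester@example.com
git config user.name tester
echo crash > state; git add state; git commit -qm "v1 (crashes)"
git commit -qm "v2 (crashes)" --allow-empty
echo ok > state; git add state; git commit -qm "v3 (the fix)"
git commit -qm "v4 (works)" --allow-empty
# Inverted labels: HEAD (working) is "bad", the root (crashing) is "good".
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" >/dev/null
# Exit 0 ("good") = still crashes; nonzero ("bad") = fixed.
git bisect run sh -c 'grep -q crash state' >/dev/null
# refs/bisect/bad now names the first "bad" commit - i.e. the fix.
git show -s --format=%s refs/bisect/bad
```

(Modern git later grew `git bisect terms` for exactly this, but inverting good/bad works on any version.)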
Ingo Molnar
2009-Jan-12 08:40 UTC
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Jamie Lokier <jamie@shareable.org> wrote:

> Ingo Molnar wrote:
> > If it's used once in a single .c file it should be inlined even if
> > it's large.
>
> As Linus has pointed out, because of GCC not sharing stack among
> different inlined functions, the above is surprisingly not true.

Yes, but note that this has no relevance to the specific case of CONFIG_OPTIMIZE_INLINING: GCC can at most decide to inline _less_, not more. I.e. under CONFIG_OPTIMIZE_INLINING we can only end up having less stack sharing trouble.

	Ingo
On Sun, 2009-01-11 at 15:34 -0800, Andrew Morton wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12435
>
> Congratulations ;)

Rejected documented - that's the best bugzilla tag ever ;)

Thanks,
Chris
Hi,

On Mon, Jan 12, 2009 at 1:58 PM, Chris Mason <chris.mason@oracle.com> wrote:

> On Sun, 2009-01-11 at 15:34 -0800, Andrew Morton wrote:
>> http://bugzilla.kernel.org/show_bug.cgi?id=12435

This is by far the biggest issue btrfs has for simple/domestic users. It's probably the most tested misfeature of btrfs, something almost all new testers encounter and report.

Is fixing this getting maximum priority? The sooner this is fixed, the less energy is spent by everyone (enthusiasts, testers, bug reporters, bug triagers, mailing list campers, etc.), and we might save some whales, dolphins and penguins by reducing our carbon footprint. :)

Kind regards.

>> Congratulations ;)
>
> Rejected documented - that's the best bugzilla tag ever ;)
>
> Thanks,
> Chris

--
Miguel Sousa Filipe
On Mon, 2009-01-12 at 15:14 +0000, Miguel Figueiredo Mascarenhas Sousa Filipe wrote:

> On Mon, Jan 12, 2009 at 1:58 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > On Sun, 2009-01-11 at 15:34 -0800, Andrew Morton wrote:
> >> http://bugzilla.kernel.org/show_bug.cgi?id=12435
>
> This is by far the biggest issue btrfs has for simple/domestic users.
> It's probably the most tested misfeature of btrfs, something almost
> all new testers encounter and report.
>
> Is fixing this getting maximum priority? The sooner this is fixed,
> the less energy is spent by everyone (enthusiasts, testers, bug
> reporters, bug triagers, mailing list campers, etc.), and we might
> save some whales, dolphins and penguins by reducing our carbon
> footprint. :)

Yes, this is definitely at the top of my list, along with better documentation of the project as a whole.

-chris
Bernd Schmidt
2009-Jan-12 18:06 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Andi Kleen wrote:
> On Sun, Jan 11, 2009 at 04:21:03PM -0800, Linus Torvalds wrote:
>> On Mon, 12 Jan 2009, Andi Kleen wrote:
>>> so at least for this case it works. Your case also doesn't work
>>> for me, so it looks like gcc didn't like something you did in your test
>>> program.
>> I very intentionally used _different_ types.
>>
>> If you use the same type, gcc will apparently happily say "hey, I can
>> combine two variables of the same type with different liveness into the
>> same variable".
>
> Confirmed.
>
>> But that's not the interesting case.
>
> Weird. I wonder where this strange restriction comes from.

Something at the back of my mind said "aliasing".

	$ gcc linus.c -O2 -S ; grep subl linus.s
		subl	$1624, %esp
	$ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
		subl	$824, %esp

That's with 4.3.2.

Bernd
--
This footer brought to you by insane German lawmakers.
Analog Devices GmbH      Wilhelm-Wagenfeld-Str. 6      80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
Linus Torvalds
2009-Jan-12 19:02 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Bernd Schmidt wrote:
>
> Something at the back of my mind said "aliasing".
>
> $ gcc linus.c -O2 -S ; grep subl linus.s
> 	subl	$1624, %esp
> $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
> 	subl	$824, %esp
>
> That's with 4.3.2.

Interesting. Nonsensical, but interesting.

Since they have no overlap in lifetime, confusing this with aliasing is really really broken (if the functions _hadn't_ been inlined, you'd have gotten the same address for the two variables anyway! So anybody who thinks that they need different addresses because they are different types is really really fundamentally confused!).

But your numbers are unambiguous, and I can see the effect of that compiler flag myself. The good news is that the kernel obviously already uses -fno-strict-aliasing for other reasons, so we should see this effect already, _despite_ it making no sense. And the stack usage still causes problems.

Oh, and I see why. This test-case shows it clearly. Note how the max stack usage _should_ be "struct b" + "struct c". Note how it isn't (it's "struct a" + "struct b/c"). So what seems to be going on is that gcc is able to do some per-slot sharing, but if you have one function with a single large entity, and another with a couple of different ones, gcc can't do any smart allocation.

Put another way: gcc doesn't create a "union of the set of different stack usages" (which would be optimal given a single frame, and generate the stack layout of just the maximum possible size), it creates a "set of unions of different stack usages" (which can be optimal in the trivial cases, but not nearly optimal in practical cases).

That explains the ioctl behavior - the structure use is usually pretty complicated (ie it's almost never about just _one_ large stack slot, but the ioctl cases tend to do random stuff with multiple slots). So it doesn't add up to some horrible maximum of all sizes, but it also doesn't end up coalescing stack usage very well.
		Linus

---
struct a { int a; unsigned long array[200]; };
struct b { int b; unsigned long array[100]; };
struct c { int c; unsigned long array[100]; };

extern int fn3(int, void *);
extern int fn4(int, void *);

static inline __attribute__ ((always_inline)) int fn1(int flag)
{
	struct a a;
	return fn3(flag, &a);
}

static inline __attribute__ ((always_inline)) int fn2(int flag)
{
	struct b b;
	struct c c;
	return fn4(flag, &b) + fn4(flag, &c);
}

int fn(int flag)
{
	fn1(flag);
	if (flag & 1)
		return 0;
	return fn2(flag);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
H. Peter Anvin
2009-Jan-12 19:22 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Andi Kleen wrote:
> On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote:
>>> Something at the back of my mind said "aliasing".
>>>
>>> $ gcc linus.c -O2 -S ; grep subl linus.s
>>>         subl    $1624, %esp
>>> $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
>>>         subl    $824, %esp
>>>
>>> That's with 4.3.2.
>>
>> Interesting.
>>
>> Nonsensical, but interesting.
>
> What I find nonsensical is that -fno-strict-aliasing generates
> better code here. Normally one would expect the compiler seeing
> more aliases with that option and then be more conservative
> regarding any sharing. But it seems to be the other way round
> here.

For this to be convolved with aliasing *AT ALL* indicates this is done incorrectly. This is about storage allocation, not aliases. Storage allocation only depends on lifetime.

	-hpa
Andi Kleen
2009-Jan-12 19:32 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote:
>> Something at the back of my mind said "aliasing".
>>
>> $ gcc linus.c -O2 -S ; grep subl linus.s
>>         subl    $1624, %esp
>> $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
>>         subl    $824, %esp
>>
>> That's with 4.3.2.
>
> Interesting.
>
> Nonsensical, but interesting.

What I find nonsensical is that -fno-strict-aliasing generates better code here. Normally one would expect the compiler seeing more aliases with that option and then be more conservative regarding any sharing. But it seems to be the other way round here.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
Linus Torvalds
2009-Jan-12 19:42 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Andi Kleen wrote:
> What I find nonsensical is that -fno-strict-aliasing generates
> better code here. Normally one would expect the compiler seeing
> more aliases with that option and then be more conservative
> regarding any sharing. But it seems to be the other way round
> here.

No, that's not the surprising part. And in fact, now that you mention it, I can even tell you why gcc does what it does.

But you'll need some background to it:

Type-based aliasing is _stupid_. It's so incredibly stupid that it's not even funny. It's broken. And gcc took the broken notion, and made it more so by making it a "by-the-letter-of-the-law" thing that makes no sense.

What happens (well, maybe it's fixed, but this was _literally_ what gcc used to do) is that the type-based aliasing overrode everything else, so if two accesses were to different types (and not in a union, and none of the types were "char"), then gcc "knew" that they clearly could not alias, and could thus wildly re-order accesses.

That's INSANE. It's so incredibly insane that people who do that should just be put out of their misery before they can reproduce. But real gcc developers really thought that it makes sense, because the standard allows it, and it gives the compiler the maximal freedom - because it can now do things that are CLEARLY NONSENSICAL.

And to compiler people, being able to do things that are clearly nonsensical seems to often be seen as a really good thing, because it means that they no longer have to worry about whether the end result works or not - they just got permission to do stupid things in the name of optimization.

So gcc did. I know for a _fact_ that gcc would re-order write accesses that were clearly to (statically) the same address.
Gcc would suddenly think that

	unsigned long a;

	a = 5;
	*(unsigned short *)&a = 4;

could be re-ordered to set it to 4 first (because clearly they don't alias - by reading the standard), and then because now the assignment of 'a=5' was later, the assignment of 4 could be elided entirely! And if somebody complains that the compiler is insane, the compiler people would say "nyaah, nyaah, the standards people said we can do this", with absolutely no introspection to ask whether it made any SENSE.

Anyway, once you start doing stupid things like that, and once you start thinking that the standard makes more sense than a human being using his brain for 5 seconds, suddenly you end up in a situation where you can move stores around wildly, and it's all 'correct'.

Now, take my stupid example, and make "fn1()" do "a.a = 1" and make "fn2()" do "b.b = 2", and think about what a compiler that thinks it can re-order the two writes willy-nilly will do?

Right. It will say "ok, a.a and b.b can not alias EVEN IF THEY HAVE STATICALLY THE SAME ADDRESS ON THE STACK", because they are in two different structures. So we can then re-order the accesses, and move the stores around. Guess what happens if you have that kind of insane mentality, and you then try to make sure that they really don't alias, so you allocate extra stack space.

The fact is, Linux uses -fno-strict-aliasing for a damn good reason: because the gcc notion of "strict aliasing" is one huge stinking pile of sh*t. Linux doesn't use that flag because Linux is playing fast and loose, it uses that flag because _not_ using that flag is insane.

Type-based aliasing is unacceptably stupid to begin with, and gcc took that stupidity to totally new heights by making it actually more important than even statically visible aliasing.

		Linus
Linus Torvalds
2009-Jan-12 19:45 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, H. Peter Anvin wrote:
> This is about storage allocation, not aliases. Storage allocation only
> depends on lifetime.

Well, the thing is, code motion does extend life-times, and if you think you can move stores across each other (even when you can see that they alias statically) due to type-based alias decisions, that does essentially end up making what _used_ to be disjoint lifetimes now be potentially overlapping.

		Linus
Bernd Schmidt
2009-Jan-12 19:55 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
> On Mon, 12 Jan 2009, Bernd Schmidt wrote:
>> Something at the back of my mind said "aliasing".
>>
>> $ gcc linus.c -O2 -S ; grep subl linus.s
>>         subl    $1624, %esp
>> $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s
>>         subl    $824, %esp
>>
>> That's with 4.3.2.
>
> Interesting.
>
> Nonsensical, but interesting.
>
> Since they have no overlap in lifetime, confusing this with aliasing is
> really really broken (if the functions _hadn't_ been inlined, you'd have
> gotten the same address for the two variables anyway! So anybody who
> thinks that they need different addresses because they are different types
> is really really fundamentally confused!).

I've never really looked at the stack slot sharing code. But I think it's not hard to see what's going on: "no overlap in lifetime" may be a temporary state. Let's say you have

	{
		variable_of_some_type a;
		writes to a;
		other stuff;
		reads from a;
	}
	{
		variable_of_some_other_type b;
		writes to b;
		other stuff;
		reads from b;
	}

At the point where the compiler generates RTL, stack space has to be allocated for variables A and B. At this point the lifetimes are non-overlapping. However, if the compiler chooses to put them into the same stack location, the RTL-based alias analysis will happily conclude (based on the differing types) that the reads from A and the writes to B can't possibly conflict, and some passes may end up reordering them. End result: overlapping lifetimes and overlapping stack slots. Oops.

Bernd

--
This footer brought to you by insane German lawmakers.
Analog Devices GmbH, Wilhelm-Wagenfeld-Str. 6, 80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
Linus Torvalds
2009-Jan-12 20:08 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Linus Torvalds wrote:
> Type-based aliasing is unacceptably stupid to begin with, and gcc took
> that stupidity to totally new heights by making it actually more important
> than even statically visible aliasing.

Btw, there are good forms of type-based aliasing.

The 'restrict' keyword actually makes sense as a way to say "this pointer points to data that you cannot reach any other way". Of course, almost nobody uses it, and quite frankly, inlining can destroy that one too (a pointer that is restricted in the callEE is not necessarily restricted at all in the callER, and an inliner that doesn't track that subtle distinction will be very unhappy).

So compiler people usually don't much like 'restrict' - because it is very limited (you might even say restricted) in its meaning, and doesn't allow for nearly the same kind of wild optimizations as the insane standard C type-aliasing allows.

The best option, of course, is for a compiler to handle just _static_ alias information that it can prove (whether by use of 'restrict' or by actually doing some fancy real analysis of its own allocations), and letting the hardware do run-time dynamic alias analysis.

I suspect gcc people were a bit stressed out by Itanium support - it's an insane architecture that basically requires an insane compiler for reasonable performance, and I think the Itanium people ended up brain-washing a lot of people who might otherwise have been sane.

So maybe I should blame Intel. Or HP. Because they almost certainly were at least a _part_ reason for bad compiler decisions.

		Linus
Linus Torvalds
2009-Jan-12 20:11 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Bernd Schmidt wrote:
> However, if the compiler chooses to put them into the same stack
> location, the RTL-based alias analysis will happily conclude (based on
> the differing types) that the reads from A and the writes to B can't
> possibly conflict, and some passes may end up reordering them. End
> result: overlapping lifetimes and overlapping stack slots. Oops.

Yes, I came to the same conclusion.

Of course, I knew a-priori that the real bug was using type-based alias analysis to make (statically visible) aliasing decisions, but I realize that there are people who never understood things like that. Sadly, some of them worked on gcc.

		Linus
Bernd Schmidt
2009-Jan-12 22:03 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
> But you'll need some background to it:

You paint a somewhat one-sided picture bordering on FUD.

> Type-based aliasing is _stupid_.

Type-based aliasing is simply an application of the language definition, and depending on the compiled application and/or target architecture, it can be essential for performance. It's _hard_ to tell whether two memory accesses can possibly conflict, and the ability to decide based on type makes a vast difference. This is not, as you suggest in another post, simply a mild inconvenience for the compiler that restricts scheduling a bit and forces the hardware to sort it out at run-time.

Too lazy to construct one myself, I googled for examples, and here's a trivial one that shows how it affects the ability of the compiler to eliminate memory references:

	typedef struct {
		short a, b, c;
	} Sample;

	void test(int *values, Sample *uniform, int count)
	{
		int i;
		for (i = 0; i < count; i++)
			values[i] += uniform->b;
	}

Type-based aliasing is what allows you to eliminate a load from the loop. Most users probably expect this kind of optimization from their compiler, and it'll make a difference not just on Itanium.

I'll grant you that if you're writing a kernel or maybe a malloc library, you have reason to be unhappy about it. But that's what compiler switches are for: -fno-strict-aliasing allows you to write code in a superset of C.

> So gcc did. I know for a _fact_ that gcc would re-order write accesses
> that were clearly to (statically) the same address. Gcc would suddenly
> think that
>
>	unsigned long a;
>
>	a = 5;
>	*(unsigned short *)&a = 4;
>
> could be re-ordered to set it to 4 first (because clearly they don't alias
> - by reading the standard),

To be precise, what the standard says is that your example is not C, and therefore has no meaning. While this kind of thing does occur in the wild, it is infrequent, and the programs that used this kind of code have been fixed over the years.
gcc even warns about code such as the above with -Wall, which makes this even more of a non-issue.

	linus2.c: In function 'foo':
	linus2.c:6: warning: dereferencing type-punned pointer will break strict-aliasing rules

> and then because now the assignment of 'a=5'
> was later, the assignment of 4 could be elided entirely! And if somebody
> complains that the compiler is insane, the compiler people would say
> "nyaah, nyaah, the standards people said we can do this", with absolutely
> no introspection to ask whether it made any SENSE.

The thing is, yours is a trivial example, but try to think further: in the general case the compiler can't tell whether two accesses can go to the same address at runtime. If it could, we wouldn't be having this discussion; I'm pretty sure this question reduces to the halting problem. That's why the compiler must have a set of conservative rules that allow it to decide that two accesses definitely _can't_ conflict. For all standards conforming programs, type based aliasing is such a rule. You could add code to weaken it by also checking against the address, but since that cannot be a reliable test that catches all problematic cases, what would be the point?

So, in effect, if you're arguing that the compiler should detect the above case and override the type-based aliasing based on the known address, you're arguing that only subtle bugs in the application should be exposed, not the obvious ones. If you're arguing we should do away with type-based aliasing altogether, you're ignoring the fact that there are (a majority of) other users of gcc than the Linux kernel, they write standards-conforming C, and they tend to worry about performance of compiled code.

> The fact is, Linux uses -fno-strict-aliasing for a damn good reason:
> because the gcc notion of "strict aliasing" is one huge stinking pile of
> sh*t.
> Linux doesn't use that flag because Linux is playing fast and loose,
> it uses that flag because _not_ using that flag is insane.

Not using this flag works for pretty much all user space applications these days.

> Type-based aliasing is unacceptably stupid to begin with, and gcc took
> that stupidity to totally new heights by making it actually more important
> than even statically visible aliasing.

gcc makes use of statically visible aliasing if it can use it to prove that two accesses can't conflict even if they have the same type, but it's vastly less powerful than type based analysis. Since it's impossible in general to decide that two accesses must conflict, trying to avoid transformations based on such an attempt is completely senseless. Trying to do so would have no effect for conforming C programs, and avoid only a subset of the problematic cases for other programs, so it's a waste of time.

So, to summarize: strict aliasing works for nearly every application these days, there's a compiler switch for the rest to turn it off, it can be a serious performance improvement, and the compiler warns about dangerous constructs. That makes the issue a little less black and white than "type based aliasing is stupid".

Bernd
Jamie Lokier
2009-Jan-12 23:01 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
>> This is about storage allocation, not aliases. Storage allocation only
>> depends on lifetime.
>
> Well, the thing is, code motion does extend life-times, and if you think
> you can move stores across each other (even when you can see that they
> alias statically) due to type-based alias decisions, that does essentially
> end up making what _used_ to be disjoint lifetimes now be potentially
> overlapping.

Sometimes code motion makes code faster and/or smaller but uses more stack space. If you want to keep the stack use down, it blocks some other optimisations.

Register allocation is similar: code motion optimisations may use more registers due to overlapping lifetimes, which causes more register spills and changes the code. The two interact; it's not trivial to optimise fully.

-- Jamie
Linus Torvalds
2009-Jan-12 23:19 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Jamie Lokier wrote:
> Sometimes code motion makes code faster and/or smaller but uses more
> stack space. If you want to keep the stack use down, it blocks some
> other optimisations.

Uhh. Yes. Compiling is an exercise in trade-offs. That doesn't mean that you should try to find the STUPID trade-offs, though.

The thing is, there is no excuse for gcc's stupid alias analysis. Other compilers actually take advantage of things like the C standard's type-alias ambiguity by

 (a) realizing that it's insane as a general thing

and

 (b) limiting it to the real special cases, like assuming that pointers to floats and pointers to integers do not alias.

That, btw, is where the whole concept comes from. It should be passed off as an "unsafe FP optimization", where it actually makes sense, exactly like a lot of other unsafe FP optimizations.

		Linus
Linus Torvalds
2009-Jan-13 00:21 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Bernd Schmidt wrote:
> Too lazy to construct one myself, I googled for examples, and here's a
> trivial one that shows how it affects the ability of the compiler to
> eliminate memory references:

Do you really think this is realistic or even relevant?

The fact is

 (a) most people use similar types, so your example of "short" vs "int" is actually not very common. Type-based alias analysis is wonderful for finding specific examples of something you can optimize, but it's not actually all that wonderful in general. It _particularly_ isn't wonderful once you start looking at the downsides. When you're adding arrays of integers, you're usually adding integers. Not "short"s. The shorts may be a great example of a special case, but it's a special case!

 (b) instructions with memory accesses aren't the problem - instructions that take cache misses are. Your example is an excellent example of that - eliding the simple load out of the loop makes just about absolutely _zero_ difference in any somewhat more realistic scenario, because that one isn't the one that is going to make any real difference anyway.

The thing is, the way to optimize for modern CPU's isn't to worry over-much about instruction scheduling. Yes, it matters for the broken ones, but it matters in the embedded world where you still find in-order CPU's, and there the size of code etc matters even more.

> I'll grant you that if you're writing a kernel or maybe a malloc
> library, you have reason to be unhappy about it. But that's what
> compiler switches are for: -fno-strict-aliasing allows you to write code
> in a superset of C.

Oh, I'd use that flag regardless yes. But what you didn't seem to react to was that gcc - for no valid reason what-so-ever - actually trusts (or at least trusted: I haven't looked at that code for years) provably true static alias information _less_ than the idiotic weaker type-based one.
You make all this noise about how type-based alias analysis improves code, but then you can't seem to just look at the example I gave you. Type-based alias analysis didn't improve code. It just made things worse, for no actual gain. Moving those accesses to the stack around just causes worse behavior, and a bigger stack frame, which causes more cache misses.

[ Again, I do admit that kernel code is "different": we tend to have a cold stack, in ways that many other code sequences do not have. System code tends to get a lot more I$ and D$ misses. Deep call-chains _will_ take cache misses on the stack, simply because the user will do things between system calls or page faults that almost guarantees that things are not in L1, and often not in L2 either.

  Also, sadly, microbenchmarks often hide this, since they are often exactly the unrealistic kinds of back-to-back system calls that almost no real program ever has, since real programs actually _do_ something with the data. ]

My point is, you're making all these arguments and avoiding looking at the downsides of what you are arguing for.

So we use -Os - because it generally generates better (and simpler) code. We use -fno-strict-aliasing for the same reason.

		Linus
Steven Rostedt
2009-Jan-13 01:30 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
<comic relief>

On Mon, 12 Jan 2009, Linus Torvalds wrote:
> code tends to get a lot more I$ and D$ misses. Deep call-chains _will_

I feel like an idiot that I never realized that "I$" meant "instruction cache" until now :-p

</comic relief>

-- Steve
Ingo Molnar
2009-Jan-19 00:13 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Bernd Schmidt <bernds_cb1@t-online.de> wrote:
> Linus Torvalds wrote:
>> But you'll need some background to it:
>
> You paint a somewhat one-sided picture bordering on FUD.
>
>> Type-based aliasing is _stupid_.
>
> Type-based aliasing is simply an application of the language definition,
> and depending on the compiled application and/or target architecture, it
> can be essential for performance.
> [...]
> So, to summarize: strict aliasing works for nearly every application
> these days, there's a compiler switch for the rest to turn it off, it
> can be a serious performance improvement, and the compiler warns about
> dangerous constructs. That makes the issue a little less black and
> white than "type based aliasing is stupid".

i did some size measurements. Latest kernel, gcc 4.3.2, x86 defconfig, vmlinux built with strict-aliasing optimizations and without. The image sizes are:

	   text	   data	    bss	    dec	    hex	filename
	6997984	1635900	1322376	9956260	 97eba4	vmlinux.gcc.-fno-strict-aliasing
	6985962	1635884	1322376	9944222	 97bc9e	vmlinux.gcc.strict-aliasing

that's a 0.17% size win only. The cost? More than _300_ new "type-punned" warnings during the kernel build (3000 altogether, including duplicates that trigger in multiple object files) - more than _1000_ new warnings (more than 10,000 total) in an allyesconfig build.

That is a _ton_ of effort for a ~0.1% category win. Often in historic code that has been working well and now got broken by gcc. I think this feature has been over-sold while it under-performs. You also significantly weakened the utility of the C language for lowlevel hardware-conscious programming as a result (which is the strongest side of C by far).

Type based aliasing should at most have been an opt-in for code that cares, not a turned-on-by-default feature for everyone.
You already dismissed concerns in this thread by suggesting that Linux is just one of many projects, but that way you dismiss 1) the largest OSS project by linecount, 2) one of the most performance-optimized pieces of OSS software in existence.

I.e. you dismiss what should have been GCC's strongest supporter, ally, test-field and early prototype tester. A lot of folks in the kernel look at the assembly level on a daily basis, they run the latest hardware and they see where GCC messes up.

By dismissing Linux you cut yourself off from a lot of development power, you cut yourself off from a lot of enthusiasm and you miss out on a lot of dynamism that would naturally help GCC too. I.e. almost by definition you cannot possibly be interested in writing the best compiler on the planet.

	Ingo
Nick Piggin
2009-Jan-19 06:22 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, Jan 19, 2009 at 01:13:45AM +0100, Ingo Molnar wrote:
> * Bernd Schmidt <bernds_cb1@t-online.de> wrote:
>> Linus Torvalds wrote:
>>> But you'll need some background to it:
>>
>> You paint a somewhat one-sided picture bordering on FUD.
>
> Type based aliasing should at most have been an opt-in for code that
> cares, not a turned-on-by-default feature for everyone.

I want to know what is the problem with the restrict keyword? I'm sure I've read Linus ranting about how bad it is in the past... it seems like a nice opt-in thing that can be used where the aliases are verified and the code is particularly performance critical...
Linus Torvalds
2009-Jan-19 21:01 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 19 Jan 2009, Nick Piggin wrote:
> I want to know what is the problem with the restrict keyword?
> I'm sure I've read Linus ranting about how bad it is in the
> past...

No, I don't think I've ranted about 'restrict'. I think it's a reasonable solution for performance-critical code, and unlike the whole type-aliasing model, it actually works for the _sane_ cases (ie doing some operation over two arrays of the same type, and letting the compiler know that it can access the arrays without fearing that writing to one would affect reading from the other).

The problem with 'restrict' is that almost nobody uses it, and it does obviously require programmer input rather than the compiler doing it automatically. But it should work well as a way to get Fortran-like performance from HPC workloads written in C - which is where most of the people are who really want the alias analysis.

> it seems like a nice opt-in thing that can be used where the aliases are
> verified and the code is particularly performance critical...

Yes. I think we could use it in the kernel, although I'm not sure how many cases we would ever find where we really care.

		Linus
Nick Piggin
2009-Jan-20 00:51 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Tue, Jan 20, 2009 at 08:01:52AM +1100, Linus Torvalds wrote:
> On Mon, 19 Jan 2009, Nick Piggin wrote:
>> I want to know what is the problem with the restrict keyword?
>> I'm sure I've read Linus ranting about how bad it is in the
>> past...
>
> No, I don't think I've ranted about 'restrict'. I think it's a reasonable
> solution for performance-critical code, and unlike the whole type-aliasing
> model, it actually works for the _sane_ cases (ie doing some operation
> over two arrays of the same type, and letting the compiler know that it
> can access the arrays without fearing that writing to one would affect
> reading from the other).
>
> The problem with 'restrict' is that almost nobody uses it, and it does
> obviously require programmer input rather than the compiler doing it
> automatically. But it should work well as a way to get Fortran-like
> performance from HPC workloads written in C - which is where most of the
> people are who really want the alias analysis.

OK, that makes sense. I just had a vague feeling that you disliked it.

>> it seems like a nice opt-in thing that can be used where the aliases are
>> verified and the code is particularly performance critical...
>
> Yes. I think we could use it in the kernel, although I'm not sure how many
> cases we would ever find where we really care.

Yeah, we don't tend to do a lot of intensive data processing, so it is normally the cache misses that hurt most as you noted earlier. Some places it might be appropriate, though. It might be nice if it can bring code size down too...
Andi Kleen
2009-Jan-20 04:20 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
> The problem with 'restrict' is that almost nobody uses it, and it does

Also gcc traditionally didn't do a very good job using it (this might be better in the very latest versions). At least some of the 3.x versions often discarded this information.

> automatically. But it should work well as a way to get Fortran-like
> performance from HPC workloads written in C - which is where most of the
> people are who really want the alias analysis.

It's more than just HPC -- a lot of code has critical loops.

>> it seems like a nice opt-in thing that can be used where the aliases are
>> verified and the code is particularly performance critical...
>
> Yes. I think we could use it in the kernel, although I'm not sure how many
> cases we would ever find where we really care.

Very little I suspect. Also the optimizations that gcc does with this often increase the code size. While that can be a win, with people judging gcc's output apparently *ONLY* on the code size as seen in this thread[1] it would obviously not compete well.

-Andi

[1] although there are compilers around that generate smaller code than gcc at its best.

--
ak@linux.intel.com -- Speaking for myself only.
Ingo Molnar
2009-Jan-20 12:38 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Nick Piggin <npiggin@suse.de> wrote:

> > > it seems like a nice opt-in thing that can be used where the aliases
> > > are verified and the code is particularly performance critical...
> > 
> > Yes. I think we could use it in the kernel, although I'm not sure how
> > many cases we would ever find where we really care.
> 
> Yeah, we don't tend to do a lot of intensive data processing, so it is
> normally the cache misses that hurt most as you noted earlier.
> 
> Some places it might be appropriate, though. It might be nice if it can
> bring code size down too...

I checked, its size effects were minuscule [0.17%] on the x86 defconfig
kernel, and it seems to be a clear loss in total cost: there would be an
ongoing maintenance cost of this weird new variant of C that language
lawyers legislated out of thin air and which departs so significantly
from time-tested C coding concepts and practices. We'd have to work
around aliasing warnings of the compiler again and again, with no
upside, and in fact i'd argue that the resulting code is _less_ clean.

The lack of data-processing complexity in the kernel is not a surprise:
the kernel is really just a conduit/abstractor between hw and apps, and
rarely generates genuinely new information. (In fact it can generally be
considered a broken system-call concept if such data processing _has_ to
be conducted somewhere in the kernel.)

( Notable exceptions would be the crypto code and the RAID5 [XOR
  checksum] and RAID6 [polynomial checksums] code - but those tend to be
  seriously hand-optimized already, with the most critical bits written
  in assembly. )

	Ingo
Miguel F Mascarenhas Sousa Filipe
2009-Jan-20 17:43 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, Jan 19, 2009 at 9:01 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Jan 2009, Nick Piggin wrote:
>>
>> I want to know what is the problem with the restrict keyword?
>> I'm sure I've read Linus ranting about how bad it is in the
>> past...
>
> No, I don't think I've ranted about 'restrict'. I think it's a reasonable
> solution for performance-critical code, and unlike the whole type-aliasing
> model, it actually works for the _sane_ cases (ie doing some operation
> over two arrays of the same type, and letting the compiler know that it
> can access the arrays without fearing that writing to one would affect
> reading from the other).
>
> The problem with 'restrict' is that almost nobody uses it, and it does
> obviously require programmer input rather than the compiler doing it
> automatically. But it should work well as a way to get Fortran-like
> performance from HPC workloads written in C - which is where most of the
> people are who really want the alias analysis.

While working on my college thesis, a Fortran HPC workload (10k lines
of Fortran) converted to C resulted in performance speedups. This was
with gcc 3.4: a simple f2c conversion plus adaptations resulted in a
considerable speedup (20% IIRC). The conversion was not done for
performance reasons - the extra performance was simply an unexpected,
but quite nice, outcome.

Just to let you guys know that even with gcc 3.4, the C code compiled
with gcc ran faster than the equivalent Fortran code compiled with
gfortran. Pushing the optimization engine further (-march tuning and
-O3) resulted in bad data, but I can't swear by the correctness of some
of the computations with REALs made in the original Fortran code.

>
>> it seems like a nice opt-in thing that can be used where the aliases are
>> verified and the code is particularly performance critical...
>
> Yes. I think we could use it in the kernel, although I'm not sure how many
> cases we would ever find where we really care.
>
> 		Linus

-- 
Miguel Sousa Filipe
David Woodhouse
2009-Jan-20 19:49 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Tue, 2009-01-20 at 13:38 +0100, Ingo Molnar wrote:
> 
> * Nick Piggin <npiggin@suse.de> wrote:
> 
> > > > it seems like a nice opt-in thing that can be used where the aliases
> > > > are verified and the code is particularly performance critical...
> > > 
> > > Yes. I think we could use it in the kernel, although I'm not sure how
> > > many cases we would ever find where we really care.
> > 
> > Yeah, we don't tend to do a lot of intensive data processing, so it is
> > normally the cache misses that hurt most as you noted earlier.
> > 
> > Some places it might be appropriate, though. It might be nice if it can
> > bring code size down too...
> 
> I checked, its size effects were minuscule [0.17%] on the x86 defconfig
> kernel and it seems to be a clear loss in total cost as there would be an
> ongoing maintenance cost

They were talking about 'restrict', not strict-aliasing. Where it can be
used, it's going to give you optimisations that strict-aliasing can't.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
Ingo Molnar
2009-Jan-20 21:05 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2009-01-20 at 13:38 +0100, Ingo Molnar wrote:
> > 
> > * Nick Piggin <npiggin@suse.de> wrote:
> > 
> > > > > it seems like a nice opt-in thing that can be used where the aliases
> > > > > are verified and the code is particularly performance critical...
> > > > 
> > > > Yes. I think we could use it in the kernel, although I'm not sure how
> > > > many cases we would ever find where we really care.
> > > 
> > > Yeah, we don't tend to do a lot of intensive data processing, so it is
> > > normally the cache misses that hurt most as you noted earlier.
> > > 
> > > Some places it might be appropriate, though. It might be nice if it can
> > > bring code size down too...
> > 
> > I checked, its size effects were minuscule [0.17%] on the x86 defconfig
> > kernel and it seems to be a clear loss in total cost as there would be an
> > ongoing maintenance cost
> 
> They were talking about 'restrict', not strict-aliasing. Where it can be
> used, it's going to give you optimisations that strict-aliasing can't.

the two are obviously related (just that the 'restrict' keyword can be
used for same-type pointers so it gives even broader leeway) so i used
the 0.17% figure i already had to give a ballpark figure about what such
type of optimizations can bring us in general.

(Different-type pointer uses are a common pattern: we have a lot of
places where we have pointers to structures with different types so
strict-aliasing optimization opportunities apply quite broadly already.)

	Ingo
Linus Torvalds
2009-Jan-20 21:23 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Tue, 20 Jan 2009, Ingo Molnar wrote:
> 
> (Different-type pointer uses are a common pattern: we have a lot of places
> where we have pointers to structures with different types so
> strict-aliasing optimization opportunities apply quite broadly already.)

Yes and no.

It's true that the kernel in general uses mostly pointers through
structures that can help the type-based thing.

However, the most common and important cases are actually the very same
structures. In particular, things like <linux/list.h>. Same "struct
list", often embedded into another case of the same struct.

And that's where "restrict" can actually help. It might be interesting
to see, for example, if it makes any difference to add a "restrict"
qualifier to the "new" pointer in __list_add(). That might give the
compiler the ability to schedule the stores to next->prev and prev->next
differently, and maybe it could matter?

It probably is not noticeable. The big reason for wanting to do alias
analysis tends to not be that kind of code at all, but the cases where
you can do much bigger simplifications, or on in-order machines where
you really want to hoist things like FP loads early and FP stores late,
and alias analysis (and here type-based is more reasonable) shows that
the FP accesses cannot alias with the integer accesses around it.

On x86, I doubt _any_ amount of alias analysis makes a huge difference
(as long as the compiler at least doesn't think that local variable
spills can alias with anything else). Not enough registers, and
generally pretty aggressively OoO (with alias analysis in hardware)
makes for a much less sensitive platform.

		Linus
Ingo Molnar
2009-Jan-20 22:05 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 20 Jan 2009, Ingo Molnar wrote:
> > 
> > (Different-type pointer uses are a common pattern: we have a lot of
> > places where we have pointers to structures with different types so
> > strict-aliasing optimization opportunities apply quite broadly
> > already.)
> 
> Yes and no.
> 
> It's true that the kernel in general uses mostly pointers through
> structures that can help the type-based thing.
> 
> However, the most common and important cases are actually the very same
> structures. In particular, things like <linux/list.h>. Same "struct
> list", often embedded into another case of the same struct.
> 
> And that's where "restrict" can actually help. It might be interesting
> to see, for example, if it makes any difference to add a "restrict"
> qualifier to the "new" pointer in __list_add(). That might give the
> compiler the ability to schedule the stores to next->prev and prev->next
> differently, and maybe it could matter?
> 
> It probably is not noticeable. The big reason for wanting to do alias
> analysis tends to not be that kind of code at all, but the cases where
> you can do much bigger simplifications, or on in-order machines where
> you really want to hoist things like FP loads early and FP stores late,
> and alias analysis (and here type-based is more reasonable) shows that
> the FP accesses cannot alias with the integer accesses around it.

Hm, GCC uses __restrict__, right?

The patch below makes no difference at all on an x86 defconfig:

 vmlinux:
    text	   data	    bss	    dec	    hex	filename
 7253544	1641812	1324296	10219652	 9bf084	vmlinux.before
 7253544	1641812	1324296	10219652	 9bf084	vmlinux.after

not a single instruction was changed:

 --- vmlinux.before.asm
 +++ vmlinux.after.asm
 @@ -1,5 +1,5 @@
 -vmlinux.before:     file format elf64-x86-64
 +vmlinux.after:     file format elf64-x86-64

I'm wondering whether there's any internal tie-up between alias analysis
and the __restrict__ keyword - so if we turn off aliasing optimizations
the __restrict__ keyword's optimizations are turned off as well.

Nope, with aliasing optimizations turned back on there's still no change
on the x86 defconfig:

    text	   data	    bss	    dec	    hex	filename
 7240893	1641796	1324296	10206985	 9bbf09	vmlinux.before
 7240893	1641796	1324296	10206985	 9bbf09	vmlinux.after

GCC 4.3.2. Maybe i missed something obvious?

	Ingo

---
 include/linux/list.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/include/linux/list.h
===================================================================
--- linux.orig/include/linux/list.h
+++ linux/include/linux/list.h
@@ -38,7 +38,7 @@ static inline void INIT_LIST_HEAD(struct
  * the prev/next entries already!
  */
 #ifndef CONFIG_DEBUG_LIST
-static inline void __list_add(struct list_head *new,
+static inline void __list_add(struct list_head * __restrict__ new,
 			      struct list_head *prev,
 			      struct list_head *next)
 {
@@ -48,7 +48,7 @@ static inline void __list_add(struct lis
 	prev->next = new;
 }
 #else
-extern void __list_add(struct list_head *new,
+extern void __list_add(struct list_head * __restrict__ new,
 		       struct list_head *prev,
 		       struct list_head *next);
 #endif
H. Peter Anvin
2009-Jan-21 01:22 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote:
> 
> Hm, GCC uses __restrict__, right?
> 
> I'm wondering whether there's any internal tie-up between alias analysis
> and the __restrict__ keyword - so if we turn off aliasing optimizations
> the __restrict__ keyword's optimizations are turned off as well.
> 

Actually I suspect that "restrict" makes little difference for inlines
or even statics, since gcc generally can do alias analysis fine there.

However, in the presence of an intermodule function call, all alias
analysis is off. This is presumably why type-based analysis is used at
all ... to at least be able to do a modicum of, say, loop invariant
removal in the presence of a library call. This is also where
"restrict" comes into play.

	-hpa
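[hpa's intermodule-call point can be sketched like this - a made-up minimal example, where emit() stands in for the opaque library call (its body below exists only so the sketch runs; in the real scenario the compiler would see just a declaration):]

```c
#include <stddef.h>

/* Stand-in for an opaque external routine; here it just records what
 * was emitted so the sketch is testable. */
static unsigned char sink[64];
static size_t sunk;
static void emit(unsigned char c) { sink[sunk++] = c; }

/* Without 'restrict', the compiler must assume each emit() call might
 * modify *len through some alias, so the loop bound is reloaded every
 * iteration.  With 'restrict', the caller promises *len is reached only
 * through this pointer, so the load of *len can be hoisted out of the
 * loop - the "modicum of loop invariant removal" hpa mentions. */
void emit_n(const unsigned char *restrict buf, const size_t *restrict len)
{
    for (size_t i = 0; i < *len; i++)
        emit(buf[i]);
}
```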
Nick Piggin
2009-Jan-21 08:52 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Wed, Jan 21, 2009 at 09:54:02AM +0100, Andi Kleen wrote:
> > GCC 4.3.2. Maybe i missed something obvious?
> 
> The typical use case of restrict is to tell it that multiple given
> arrays are independent and then give the loop optimizer
> more freedom to handle expressions in the loop that
> accesses these arrays.
> 
> Since there are no loops in the list functions nothing changed.
> 
> Ok presumably there are some other optimizations which
> rely on that alias information too, but again the list_*
> stuff is probably too simple to trigger any of them.

Any function that does several interleaved loads and stores
through different pointers could have much more freedom to
move loads early and stores late. Big OOOE CPUs won't care
so much, but embedded and things (including in-order x86)
are very important users of the kernel.
Andi Kleen
2009-Jan-21 08:54 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
> GCC 4.3.2. Maybe i missed something obvious?

The typical use case of restrict is to tell it that multiple given
arrays are independent and then give the loop optimizer more freedom
to handle expressions in the loop that accesses these arrays.

Since there are no loops in the list functions nothing changed.

Ok presumably there are some other optimizations which rely on that
alias information too, but again the list_* stuff is probably too
simple to trigger any of them.

-Andi
Andi Kleen
2009-Jan-21 09:20 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Wed, Jan 21, 2009 at 09:52:08AM +0100, Nick Piggin wrote:
> On Wed, Jan 21, 2009 at 09:54:02AM +0100, Andi Kleen wrote:
> > > GCC 4.3.2. Maybe i missed something obvious?
> > 
> > The typical use case of restrict is to tell it that multiple given
> > arrays are independent and then give the loop optimizer
> > more freedom to handle expressions in the loop that
> > accesses these arrays.
> > 
> > Since there are no loops in the list functions nothing changed.
> > 
> > Ok presumably there are some other optimizations which
> > rely on that alias information too, but again the list_*
> > stuff is probably too simple to trigger any of them.
> 
> Any function that does several interleaved loads and stores
> through different pointers could have much more freedom to
> move loads early and stores late.

For one, that would require more live registers. It's not a clear and
obvious win, especially not if you have only very few registers, like
on 32bit x86. Then it would typically increase code size.

Then x86s tend to have very very fast L1 caches, and if something is
not in L1 on reads then the cost of fetching it dwarfs the few cycles
you can typically get out of this.

And lastly, even on an in-order system stores can typically be queued
without stalling, so it doesn't hurt to do them early.

Also, at least on x86 gcc normally doesn't do scheduling beyond basic
blocks, so any if () shuts it up.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
Nick Piggin
2009-Jan-21 09:25 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Wed, Jan 21, 2009 at 10:20:49AM +0100, Andi Kleen wrote:
> On Wed, Jan 21, 2009 at 09:52:08AM +0100, Nick Piggin wrote:
> > On Wed, Jan 21, 2009 at 09:54:02AM +0100, Andi Kleen wrote:
> > > > GCC 4.3.2. Maybe i missed something obvious?
> > > 
> > > The typical use case of restrict is to tell it that multiple given
> > > arrays are independent and then give the loop optimizer
> > > more freedom to handle expressions in the loop that
> > > accesses these arrays.
> > > 
> > > Since there are no loops in the list functions nothing changed.
> > > 
> > > Ok presumably there are some other optimizations which
> > > rely on that alias information too, but again the list_*
> > > stuff is probably too simple to trigger any of them.
> > 
> > Any function that does several interleaved loads and stores
> > through different pointers could have much more freedom to
> > move loads early and stores late.
> 
> For one, that would require more live registers. It's not
> a clear and obvious win, especially not if you have
> only very few registers, like on 32bit x86.
> 
> Then it would typically increase code size.

The point is that the compiler is then free to do it. If things
slow down after the compiler gets *more* information, then that
is a problem with the compiler heuristics rather than the
information we give it.

> Then x86s tend to have very very fast L1 caches, and if something is
> not in L1 on reads then the cost of fetching it dwarfs the few cycles
> you can typically get out of this.

Well most architectures have L1 caches of several cycles. And an L1
miss typically means going to L2, which in some cases the compiler is
expected to attempt to cover as much as possible (eg in-order
architectures). If the caches are missed completely, then especially
with an in-order architecture, you want to issue as many parallel
loads as possible during the stall.

If the compiler can't resolve aliases, then it simply won't be able
to bring some of those loads forward.

> And lastly, even on an in-order system stores can typically be queued
> without stalling, so it doesn't hurt to do them early.

Store queues are, what? On the order of tens of entries for big
power-hungry x86? I'd guess much smaller for low-power in-order x86
and ARM etc. These can definitely fill up and stall, so you still
want to get loads out early if possible.

Even a lot of OOOE CPUs I think won't have the best alias
analysis, so all else being equal, it wouldn't hurt them to
move loads earlier.

> Also, at least on x86 gcc normally doesn't do scheduling beyond basic
> blocks, so any if () shuts it up.

I don't think any of this is a reason not to use restrict, though.
But... there are so many places we could add it to the kernel, and
probably so few where it makes much difference. Maybe it should be
able to help some critical core code, though.
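[The interleaved load/store case being argued over can be reduced to a toy function - a hypothetical sketch, not code from the thread:]

```c
/* With 'restrict', a and b are promised to be distinct objects, so the
 * compiler may issue both loads first (overlapping any miss latency)
 * and sink both stores late.  Without it, *b might alias *a, and the
 * second load cannot legally be moved above the first store. */
void swap_pair(int *restrict a, int *restrict b)
{
    int ta = *a;    /* both loads can start early, in parallel */
    int tb = *b;
    *a = tb;        /* both stores can be scheduled late */
    *b = ta;
}
```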
Andi Kleen
2009-Jan-21 09:54 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
> The point is that the compiler is then free to do it. If things
> slow down after the compiler gets *more* information, then that
> is a problem with the compiler heuristics rather than the
> information we give it.

The point was that -Os typically disables it then (not always;
compiler heuristics are far from perfect).

> > Then x86s tend to have very very fast L1 caches, and if something is
> > not in L1 on reads then the cost of fetching it dwarfs the few cycles
> > you can typically get out of this.
> 
> Well most architectures have L1 caches of several cycles. And an L1
> miss typically means going to L2, which in some cases the compiler is
> expected to attempt to cover as much as possible (eg in-order
> architectures).

L2 cache is so much slower that scheduling a few instructions more
doesn't help much.

> stall, so you still want to get loads out early if possible.
> 
> Even a lot of OOOE CPUs I think won't have the best alias
> analysis, so all else being equal, it wouldn't hurt them to
> move loads earlier.

Hmm, but if the load is nearby it won't matter if a store is in the
middle, because the CPU will just execute over it. The real big win is
if you do some computation inbetween, but at least for typical list
manipulation there isn't really any.

> > Also, at least on x86 gcc normally doesn't do scheduling beyond basic
> > blocks, so any if () shuts it up.
> 
> I don't think any of this is a reason not to use restrict, though.
> But... there are so many places we could add it to the kernel, and
> probably so few where it makes much difference. Maybe it should be
> able to help some critical core code, though.

Frankly I think it would be another unlikely().

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
Nick Piggin
2009-Jan-21 10:14 UTC
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Wed, Jan 21, 2009 at 10:54:18AM +0100, Andi Kleen wrote:
> > The point is that the compiler is then free to do it. If things
> > slow down after the compiler gets *more* information, then that
> > is a problem with the compiler heuristics rather than the
> > information we give it.
> 
> The point was that -Os typically disables it then (not always;
> compiler heuristics are far from perfect).

That'd be just another gcc failing. If it can make the code faster
without a size increase, then it should (of course if it has to start
spilling registers etc. then that's a different matter, but we're not
talking about only 32-bit x86 here).

> > > Then x86s tend to have very very fast L1 caches, and if something is
> > > not in L1 on reads then the cost of fetching it dwarfs the few cycles
> > > you can typically get out of this.
> > 
> > Well most architectures have L1 caches of several cycles. And an L1
> > miss typically means going to L2, which in some cases the compiler is
> > expected to attempt to cover as much as possible (eg in-order
> > architectures).
> 
> L2 cache is so much slower that scheduling a few instructions more
> doesn't help much.

I think on a lot of CPUs that is actually not the case, including on
Nehalem and Montecito CPUs where it is what, like under 15 cycles?

Even in cases where you have a high-latency LLC or go to memory, you
want to be able to start as many loads as possible before stalling.
Especially for in-order architectures, but even OOOE can stall if it
can't resolve store addresses early enough or speculate them.

> > stall, so you still want to get loads out early if possible.
> > 
> > Even a lot of OOOE CPUs I think won't have the best alias
> > analysis, so all else being equal, it wouldn't hurt them to
> > move loads earlier.
> 
> Hmm, but if the load is nearby it won't matter if a
> store is in the middle, because the CPU will just execute
> over it.

If the address is not known, or the store buffer fills up, etc., then
it may not be able to. It could even be hundreds of instructions here,
too much for an OOOE processor window. We have a lot of huge functions
(although granted they'll often contain barriers for other reasons
like locks or function calls anyway).

> The real big win is if you do some computation inbetween,
> but at least for typical list manipulation there isn't
> really any.

Well, I have a feeling that the MLP side of it could be more
significant than ILP. But no data.