thr3ads.net - Btrfs devel - No one seems to be using AOP_WRITEPAGE

Theodore Ts'o

2010-Apr-25 02:40 UTC

No one seems to be using AOP_WRITEPAGE_ACTIVATE?

I happened to be going through the source code for write_cache_pages(),
and I came across a reference to AOP_WRITEPAGE_ACTIVATE.  I was curious
what the heck that was, so I did search for it, and found this in
Documentation/filesystems/vfs.txt:

      If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
      try too hard if there are problems, and may choose to write out
      other pages from the mapping if that is easier (e.g. due to
      internal dependencies).  If it chooses not to start writeout, it
      should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
      calling ->writepage on that page.

      See the file "Locking" for more details.

No filesystem currently returns AOP_WRITEPAGE_ACTIVATE when it chooses
not to write out a page; they all call redirty_page_for_writepage()
instead.

Is this a change we should make, for example when btrfs refuses a
writepage() when PF_MEMALLOC is set, or when ext4 refuses a writepage()
if the page involved hasn't been allocated an on-disk block yet (i.e.,
delayed allocation)?  The change seems to be that we should call
redirty_page_for_writepage() as before, but then _not_ unlock the page,
and return AOP_WRITEPAGE_ACTIVATE.  Is this a good and useful thing for
us to do?
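
A minimal userspace sketch of the convention described above, not kernel
code.  The value of AOP_WRITEPAGE_ACTIVATE matches the kernel's enum
positive_aop_returns; struct model_page, model_writepage_refuse() and
model_reclaim_one() are hypothetical names introduced only for
illustration, and the three booleans merely stand in for the real page
flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Same numeric value as in the kernel; everything else here is a model. */
#define AOP_WRITEPAGE_ACTIVATE 0x80000

/* Hypothetical stand-in for struct page. */
struct model_page {
	bool locked;   /* stands in for PG_locked */
	bool dirty;    /* stands in for PG_dirty */
	bool active;   /* page sits on the active LRU list */
};

/* A ->writepage() that refuses: redirty, keep the lock, return ACTIVATE. */
static int model_writepage_refuse(struct model_page *page)
{
	assert(page->locked);          /* ->writepage() is entered locked */
	page->dirty = true;            /* effect of redirty_page_for_writepage() */
	return AOP_WRITEPAGE_ACTIVATE; /* the page is NOT unlocked here */
}

/* The caller (the VM) reacts to the return code. */
static void model_reclaim_one(struct model_page *page)
{
	page->locked = true;
	page->dirty = false;           /* dirty bit is cleared before ->writepage() */
	if (model_writepage_refuse(page) == AOP_WRITEPAGE_ACTIVATE) {
		page->active = true;   /* stop retrying: activate the page */
		page->locked = false;  /* on ACTIVATE the caller unlocks */
	}
}
```

Starting from an unlocked dirty page, model_reclaim_one() leaves the page
dirty, active and unlocked, which is the outcome vfs.txt describes: the
VM stops calling ->writepage on that page.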

Right now, the only writepage() function which is returning
AOP_WRITEPAGE_ACTIVATE is shmem_writepage(), and very curiously it's not
using redirty_page_for_writepage().  Should it, out of consistency's
sake if not to keep various zone accounting straight?

There are some longer-term issues, including the fact that ext4 and
btrfs are violating some of the rules laid out in
Documentation/filesystems/Locking regarding what writepage() is supposed
to do under direct reclaim -- something which isn't going to be practical
for us to change on the file-system side, at least not without doing some
pretty nasty and serious rework, for both ext4 and I suspect btrfs.  But
if returning AOP_WRITEPAGE_ACTIVATE will help the VM deal more
gracefully with the fact that ext4 and btrfs will be refusing
writepage() calls under certain conditions, maybe we should make this
change?

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

KOSAKI Motohiro

2010-Apr-26 10:18 UTC

Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

Hi Ted

> I happened to be going through the source code for write_cache_pages(),
> and I came across a reference to AOP_WRITEPAGE_ACTIVATE.  I was curious
> what the heck that was, so I did search for it, and found this in
> Documentation/filesystems/vfs.txt:
> 
>       If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>       try too hard if there are problems, and may choose to write out
>       other pages from the mapping if that is easier (e.g. due to
>       internal dependencies).  If it chooses not to start writeout, it
>       should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
>       calling ->writepage on that page.
> 
>       See the file "Locking" for more details.
> 
> No filesystem currently returns AOP_WRITEPAGE_ACTIVATE when it chooses
> not to write out a page; they all call redirty_page_for_writepage()
> instead.
> 
> Is this a change we should make, for example when btrfs refuses a
> writepage() when PF_MEMALLOC is set, or when ext4 refuses a writepage()
> if the page involved hasn't been allocated an on-disk block yet (i.e.,
> delayed allocation)?  The change seems to be that we should call
> redirty_page_for_writepage() as before, but then _not_ unlock the page,
> and return AOP_WRITEPAGE_ACTIVATE.  Is this a good and useful thing for
> us to do?
Sorry, no.

AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs
(and rd later chose another approach).
So it assumes that writepage refusals don't happen on the majority of pages.
IOW, the VM assumes that many other pages can be written out even though
this page can't.
So the VM only activates the page if AOP_WRITEPAGE_ACTIVATE is returned.
But now ext4 and btrfs refuse all writepage() calls. (right?)

IOW, I don't think that documentation anticipated the delayed allocation
issue ;)

The point is, our dirty page accounting only tracks the system-wide
dirty ratio and per-task dirty pages. It doesn't track a per-NUMA-node
or per-zone dirty ratio. So refusing writepage combined with fake NUMA
abuse can easily confuse our VM. If _all_ pages in a VM LRU list (it's
per-zone) are refused, page activation doesn't help. It also leads to OOM.
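
A small userspace sketch of why system-wide accounting can miss a
per-zone problem.  struct model_zone and both helper functions are
hypothetical names for illustration only; the kernel's real accounting
is more involved than an integer percentage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-zone counters, in pages. */
struct model_zone {
	long total_pages;
	long dirty_pages;
};

/* Integer percentage of dirty pages summed over all zones (global view). */
static long global_dirty_percent(const struct model_zone *z, size_t n)
{
	long total = 0, dirty = 0;

	assert(z != NULL);
	for (size_t i = 0; i < n; i++) {
		total += z[i].total_pages;
		dirty += z[i].dirty_pages;
	}
	return total ? dirty * 100 / total : 0;
}

/* Integer percentage of dirty pages in a single zone (per-zone view). */
static long zone_dirty_percent(const struct model_zone *z)
{
	assert(z != NULL);
	return z->total_pages ? z->dirty_pages * 100 / z->total_pages : 0;
}
```

With eight equally sized zones, one of them completely dirty and the
rest clean, the global figure is only 12% while that one zone is at
100%, so a check against the global ratio never triggers even though
reclaim in that zone finds nothing but dirty pages.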

And I'm sorry, but I have to say that, as far as all VM developers are
concerned, fake NUMA is not production-level quality yet. AFAIK, nobody
has seriously tested our VM code in such an environment.
(linux/arch/x86/Kconfig says "This is only useful for debugging".)

	--------------------------------------------------------------
	config NUMA_EMU
	        bool "NUMA emulation"
	        depends on X86_64 && NUMA
	        ---help---
	          Enable NUMA emulation. A flat machine will be split
	          into virtual nodes when booted with "numa=fake=N", where N
	          is the number of nodes. This is only useful for debugging.

> 
> Right now, the only writepage() function which is returning
> AOP_WRITEPAGE_ACTIVATE is shmem_writepage(), and very curiously it's not
> using redirty_page_for_writepage().  Should it, out of consistency's
> sake if not to keep various zone accounting straight?
Umm. I don't know the reason. Instead, I've cc'ed Hugh :)

> There are some longer-term issues, including the fact that ext4 and
> btrfs are violating some of the rules laid out in
> Documentation/filesystems/Locking regarding what writepage() is supposed
> to do under direct reclaim -- something which isn't going to be practical
> for us to change on the file-system side, at least not without doing some
> pretty nasty and serious rework, for both ext4 and I suspect btrfs.  But
> if returning AOP_WRITEPAGE_ACTIVATE will help the VM deal more
> gracefully with the fact that ext4 and btrfs will be refusing
> writepage() calls under certain conditions, maybe we should make this
> change?
I'm sorry again. I'm pretty sure our VM also needs to change if we need
to solve your company's fake NUMA use case. I think our VM is still
unfriendly to delayed allocation; we hadn't noticed the ext4 delayed
allocation issue ;-)

So, I have two questions:
 - I really hope to understand the ext4 delayed allocation issue. Can you
   please tell me which URL explains the ext4 high-level design and the
   behavior of delayed allocation?
 - If my understanding is correct, creating a very large number of fake
   NUMA nodes and running a simple dd can reproduce your issue. Right?

Now I'm guessing a small enough VM patch can solve this issue (that's only
a guess, maybe yes, maybe no). But a correct understanding and a correct
way of testing are really necessary. Please help.





Theodore Tso

2010-Apr-26 14:50 UTC

Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

On Apr 26, 2010, at 6:18 AM, KOSAKI Motohiro wrote:

> AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs
> (and rd later chose another approach).
> So it assumes that writepage refusals don't happen on the majority of pages.
> IOW, the VM assumes that many other pages can be written out even though
> this page can't.
> So the VM only activates the page if AOP_WRITEPAGE_ACTIVATE is returned.
> But now ext4 and btrfs refuse all writepage() calls. (right?)
No, not exactly.  Btrfs refuses the writepage() in the direct reclaim cases
(i.e., if PF_MEMALLOC is set), but will do writepage() in the case of zone
scanning.  I don't want to speak for Chris, but I assume it's due to stack
depth concerns --- if it was just due to worrying about fs recursion issues,
I assume all of the btrfs allocations could be done GFP_NOFS.

Ext4 is slightly different; it refuses writepage() if the inode blocks for
the page haven't yet been allocated.  (Regardless of whether it's happening
for direct reclaim or zone scanning.)  However, if the on-disk block has been
assigned (i.e., this isn't a delalloc case), ext4 will honor the writepage()
(i.e., if this is an mmap of an already existing file, or if the space has
been pre-allocated using fallocate()).  The reason for ext4's concern is lock
ordering, although I'm investigating whether I can fix this.  If we call
set_page_writeback() to set PG_writeback (plus set the various bits of magic
fs accounting), and then drop the page lock, does that protect us from random
changes happening to the page (i.e., from vmtruncate, etc.)?
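
To make the two refusal policies concrete, here is a minimal userspace
sketch.  struct wp_ctx and both predicates are hypothetical names; they
only restate the conditions described above and are not the actual
btrfs or ext4 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the state a ->writepage() call sees. */
struct wp_ctx {
	bool pf_memalloc;      /* caller has PF_MEMALLOC set (direct reclaim) */
	bool block_allocated;  /* page already has an on-disk block assigned */
};

/* btrfs as described above: refuse only when PF_MEMALLOC is set. */
static bool btrfs_like_refuses(const struct wp_ctx *c)
{
	assert(c != NULL);
	return c->pf_memalloc;
}

/* ext4 as described above: refuse any delalloc page, whoever the caller is. */
static bool ext4_like_refuses(const struct wp_ctx *c)
{
	assert(c != NULL);
	return !c->block_allocated;
}
```

The two policies differ on both axes: a delalloc page written outside
direct reclaim is refused only by the ext4-like policy, while an
already-allocated page written under PF_MEMALLOC is refused only by the
btrfs-like one.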
> 
> IOW, I don't think that documentation anticipated the delayed allocation
> issue ;)
> 
> The point is, our dirty page accounting only tracks the system-wide
> dirty ratio and per-task dirty pages. It doesn't track a per-NUMA-node
> or per-zone dirty ratio. So refusing writepage combined with fake NUMA
> abuse can easily confuse our VM. If _all_ pages in a VM LRU list (it's
> per-zone) are refused, page activation doesn't help. It also leads to OOM.
> 
> And I'm sorry, but I have to say that, as far as all VM developers are
> concerned, fake NUMA is not production-level quality yet. AFAIK, nobody
> has seriously tested our VM code in such an environment.
> (linux/arch/x86/Kconfig says "This is only useful for debugging".)

So I'm sorry I mentioned the fake numa bit, since I think this is a bit of a
red herring.  That code is in production here, and we've made all sorts of
changes so it can be used for more than just debugging.  So please ignore it,
it's our local hack, and if it breaks that's our problem.  More importantly,
just two weeks ago I talked to someone in the financial sector, who was
testing out ext4 on an upstream kernel, and not using our hacks that force
128MB zones, and he ran into the ext4/OOM problem while using an upstream
kernel.  It involved Oracle pinning down 3G worth of pages, and him trying to
do a huge streaming backup (which of course wasn't using fallocate or direct
I/O) under ext4, and he had the same issue --- an OOM, that I'm pretty sure
was caused by the fact that ext4_writepage() was refusing the writepage() and
most of the pages that weren't nailed down by Oracle were delalloc.  The same
test scenario using ext3 worked just fine, of course.

Under normal cases it's not a problem since statistically there should be
enough other pages in the system compared to the number of pages that are
subject to delalloc, such that pages can usually get pushed out until the
writeback code can get around to writing out the pages.  But in cases where
the zones have been made artificially small, or you have a big program like
Oracle pinning down a large number of pages, then of course we have problems.
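
A deliberately crude userspace model of that argument: whatever is
neither pinned nor dirty-delalloc is what reclaim can still push out.
model_reclaimable() is a hypothetical helper, and treating a zone as
just three page counts is a simplifying assumption.

```c
#include <assert.h>

/*
 * Pages reclaim can still free from a zone in this simplified model:
 * everything that is neither pinned nor a delalloc page whose
 * writepage() would be refused.  All arguments are page counts.
 */
static long model_reclaimable(long zone_pages, long pinned, long delalloc_dirty)
{
	long rest;

	assert(zone_pages >= 0 && pinned >= 0 && delalloc_dirty >= 0);
	rest = zone_pages - pinned - delalloc_dirty;
	return rest > 0 ? rest : 0;
}
```

For a 128MB zone of 4k pages (32768 pages), 1000 delalloc pages leave
31768 pages to work with; if 20000 pages are pinned and the remaining
12768 are all delalloc, nothing is left to reclaim and the OOM killer is
the only way out.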

I'm trying to fix things from the file system side, which means trying to
understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in
Documentation/filesystems/Locking as something which MUST be used if
writepage() is going to refuse a page.  And then I discovered no one is
actually using it.  So that's why I was asking whether the Locking
documentation file was out of date, or whether all of the file systems are
doing it wrong.

On a related example of how file system code isn't necessarily following what
is required/recommended by the Locking documentation, ext2 and ext3 are both
NOT using set_page_writeback()/end_page_writeback(), but are rather keeping
the page locked until after they call block_write_full_page(), because of
concerns of truncate coming in and screwing things up.  But now looking at
Locking, it appears that set_page_writeback() is as good as lock_page() for
preventing the truncate code from coming in and screwing everything up?  It's
not clear to me exactly what locking guarantees are provided against truncate
by set_page_writeback().  And suppose we are writing out a whole cluster of
pages, say 4MB worth of pages; do we need to call set_page_writeback() on
every single page in the cluster before we do the I/O to make sure things
don't change out from under us?  (I'm pretty sure at least some of the other
filesystems that are submitting huge numbers of pages using bio instead of 4k
at a time like ext2/3/4 aren't calling set_page_writeback() on all of the
pages first.)

Part of the problem is that the writeback Locking semantics aren't well
documented, and where they are documented, it's not clear they are up to date
--- and all of the file systems that are doing delayed allocation writeback
are doing things slightly differently, or in some cases very differently.
(And even without delalloc, as I've pointed out ext2/3 don't use
set_page_writeback() --- if this is a MUST USE as implied by the Locking
file, why didn't whoever added this requirement go in and modify common
filesystems like ext2 and ext3 to use the
set_page_writeback/end_page_writeback calls?)

I'm happy to change things in ext4; in fact I'm pretty sure ext4 probably
isn't completely right here.  But it's not clear what "right" actually is,
and when I look to see what protects writepage() racing with vmtruncate(),
it's enough to give me a headache.  :-(

Hence my question about whether it wouldn't be simpler if we simply added
more high-level locking to prevent truncate from racing against
writepage/writeback.

-- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

Chris Mason

2010-Apr-26 17:24 UTC

Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

On Mon, Apr 26, 2010 at 10:50:45AM -0400, Theodore Tso wrote:
> 
> On Apr 26, 2010, at 6:18 AM, KOSAKI Motohiro wrote:
> > AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs
> > (and rd later chose another approach).
> > So it assumes that writepage refusals don't happen on the majority of
> > pages.
> > IOW, the VM assumes that many other pages can be written out even though
> > this page can't.
> > So the VM only activates the page if AOP_WRITEPAGE_ACTIVATE is returned.
> > But now ext4 and btrfs refuse all writepage() calls. (right?)
> 
> No, not exactly.  Btrfs refuses the writepage() in the direct reclaim
> cases (i.e., if PF_MEMALLOC is set), but will do writepage() in the
> case of zone scanning.  I don't want to speak for Chris, but I assume
> it's due to stack depth concerns --- if it was just due to worrying
> about fs recursion issues, I assume all of the btrfs allocations could
> be done GFP_NOFS.
> 
Btrfs refuses all PF_MEMALLOC writepage.  It will go ahead and process a
regular writepage but in practice that never happens...everyone
except a few internal btrfs callers uses writepages.

I wish I had thought of stack depth back then, but really this was to
keep kswapd out of the heavy work done by delalloc.  From a locking
point of view we're properly GFP_NOFS, so it's safe, but it just isn't a
great way to use precious PF_MEMALLOC cycles.
> Ext4 is slightly different; it refuses writepage() if the inode
> blocks for the page haven't yet been allocated.  (Regardless of
> whether it's happening for direct reclaim or zone scanning.)  However,
> if the on-disk block has been assigned (i.e., this isn't a delalloc
> case), ext4 will honor the writepage() (i.e., if this is an mmap of
> an already existing file, or if the space has been pre-allocated using
> fallocate()).  The reason for ext4's concern is lock ordering,
> although I'm investigating whether I can fix this.  If we call
> set_page_writeback() to set PG_writeback (plus set the various bits of
> magic fs accounting), and then drop the page lock, does that protect
> us from random changes happening to the page (i.e., from vmtruncate,
> etc.)?
PG_writeback will protect you from vmtruncate, but you may also want to
have page_mkwrite wait for pages in flight.
> 
> > 
> > IOW, I don't think that documentation anticipated the delayed
> > allocation issue ;)
> > 
> > The point is, our dirty page accounting only tracks the system-wide
> > dirty ratio and per-task dirty pages. It doesn't track a per-NUMA-node
> > or per-zone dirty ratio. So refusing writepage combined with fake NUMA
> > abuse can easily confuse our VM. If _all_ pages in a VM LRU list (it's
> > per-zone) are refused, page activation doesn't help. It also leads to
> > OOM.
> > 
> > And I'm sorry, but I have to say that, as far as all VM developers are
> > concerned, fake NUMA is not production-level quality yet. AFAIK, nobody
> > has seriously tested our VM code in such an environment.
> > (linux/arch/x86/Kconfig says "This is only useful for debugging".)
> 
> So I'm sorry I mentioned the fake numa bit, since I think this is a
> bit of a red herring.  That code is in production here, and we've
> made all sorts of changes so it can be used for more than just
> debugging.  So please ignore it, it's our local hack, and if it breaks
> that's our problem.  More importantly, just two weeks ago I talked
> to someone in the financial sector, who was testing out ext4 on an
> upstream kernel, and not using our hacks that force 128MB zones, and
> he ran into the ext4/OOM problem while using an upstream kernel.  It
> involved Oracle pinning down 3G worth of pages, and him trying to do a
> huge streaming backup (which of course wasn't using fallocate or
> direct I/O) under ext4, and he had the same issue --- an OOM, that I'm
> pretty sure was caused by the fact that ext4_writepage() was refusing
> the writepage() and most of the pages that weren't nailed down by
> Oracle were delalloc.  The same test scenario using ext3 worked just
> fine, of course.
> 
> Under normal cases it's not a problem since statistically there should
> be enough other pages in the system compared to the number of pages
> that are subject to delalloc, such that pages can usually get pushed
> out until the writeback code can get around to writing out the pages.
> But in cases where the zones have been made artificially small, or you
> have a big program like Oracle pinning down a large number of pages,
> then of course we have problems.
> 
> I'm trying to fix things from the file system side, which means trying
> to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is
> described in Documentation/filesystems/Locking as something which MUST
> be used if writepage() is going to refuse a page.  And then I discovered
> no one is actually using it.  So that's why I was asking whether the
> Locking documentation file was out of date, or whether all of the file
> systems are doing it wrong.
> 
> On a related example of how file system code isn't necessarily
> following what is required/recommended by the Locking documentation,
> ext2 and ext3 are both NOT using
> set_page_writeback()/end_page_writeback(), but are rather keeping the
> page locked until after they call block_write_full_page(), because of
> concerns of truncate coming in and screwing things up.
block_write_full_page takes a locked page and if all goes well produces
a writeback page without the page locked.  Basically it needs the page
locked until after it has the writeback bit set to protect against
truncate and make sure the page buffers don't go away while it is
looping over them.

So, I don't think ext2/3 are breaking the rules here.
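
The hand-off from page lock to writeback bit described above can be
sketched as a tiny userspace state machine.  struct model_page and the
three functions are hypothetical names; the comments name the kernel
calls whose effect each step stands in for.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two page flags that matter here. */
struct model_page {
	bool locked;     /* stands in for PG_locked */
	bool writeback;  /* stands in for PG_writeback */
};

/* Locked page in, writeback page out, with no unprotected window. */
static void model_block_write_full_page(struct model_page *page)
{
	assert(page->locked);    /* must be entered with the page locked */
	page->writeback = true;  /* set_page_writeback() while still locked */
	page->locked = false;    /* unlock_page(): writeback bit now guards it */
}

/* I/O completion, e.g. from end_buffer_async_write(). */
static void model_end_io(struct model_page *page)
{
	assert(page->writeback);
	page->writeback = false; /* end_page_writeback() */
}

/* Truncate has to wait while either bit is set. */
static bool model_truncate_may_proceed(const struct model_page *page)
{
	return !page->locked && !page->writeback;
}
```

At every step between entry and I/O completion at least one of the two
bits is set, so in this model truncate never sees the page unprotected.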
> But now looking at Locking, it appears that set_page_writeback() is as
> good as lock_page() for preventing the truncate code from coming in and
> screwing everything up?  It's not clear to me exactly what locking
> guarantees are provided against truncate by set_page_writeback().  And
> suppose we are writing out a whole cluster of pages, say 4MB worth of
> pages; do we need to call set_page_writeback() on every single page in
> the cluster before we do the I/O to make sure things don't change out
> from under us?  (I'm pretty sure at least some of the other filesystems
> that are submitting huge numbers of pages using bio instead of 4k at a
> time like ext2/3/4 aren't calling set_page_writeback() on all of the
> pages first.)
> 
> Part of the problem is that the writeback Locking semantics aren't
> well documented, and where they are documented, it's not clear they
> are up to date --- and all of the file systems that are doing delayed
> allocation writeback are doing things slightly differently, or in some
> cases very differently.  (And even without delalloc, as I've pointed
> out ext2/3 don't use set_page_writeback() --- if this is a MUST USE as
> implied by the Locking file, why didn't whoever added this requirement
> go in and modify common filesystems like ext2 and ext3 to use
> the set_page_writeback/end_page_writeback calls?)
> 
> I'm happy to change things in ext4; in fact I'm pretty sure ext4
> probably isn't completely right here.  But it's not clear what
> "right" actually is, and when I look to see what protects writepage()
> racing with vmtruncate(), it's enough to give me a headache.  :-(
> 
> Hence my question about whether it wouldn't be simpler if we simply
> added more high-level locking to prevent truncate from racing against
> writepage/writeback.
My understanding of the current scheme is that truncate will wait on
both locked and writeback pages.  The page lock is used while setting
up the page for writeback, which is true both for writepages and
writepage.

I don't think we need a new lock on top of the page lock and the
writeback bit, but maybe I don't see exactly which problem you're
solving.  A given range of pages is either:

1) Allocated but not under IO.  ext4 must write these pages to disk
before truncate can finish for data=ordered reasons, unless it manages
to log the orphan item.  Figuring out dependency between the orphan
item, which i_size is on disk right now, and holes is pretty tricky, so
I'd go with the less complex: just wait for all the allocated delalloc
pages to hit the disk.

2) Allocated and under IO.  These pages go to disk.

3) Delalloc and not under IO.  Truncate (or notify_change if you lean
toward the xfs crowd) should be able to clean these up
without waiting for the IO.

Of the three, #3 is probably the most common, with #1 a close second.
Is this a case that we really need to optimize for?
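
The three cases above can be written down as a small classification
function.  This is a userspace sketch only: enum truncate_action and
classify() are hypothetical names, and the assertion that a delalloc
page is never under IO is an assumption of the sketch (the list has no
fourth case).

```c
#include <assert.h>
#include <stdbool.h>

/* What truncate has to do for a page range, per the three cases above. */
enum truncate_action {
	WRITE_THEN_WAIT,  /* case 1: allocated, not under IO */
	WAIT_FOR_IO,      /* case 2: allocated, under IO */
	DROP_WITHOUT_IO,  /* case 3: delalloc, not under IO */
};

static enum truncate_action classify(bool allocated, bool under_io)
{
	if (!allocated) {
		/* Assumption: no block yet means no IO can be in flight. */
		assert(!under_io);
		return DROP_WITHOUT_IO;
	}
	return under_io ? WAIT_FOR_IO : WRITE_THEN_WAIT;
}
```

Only the first case forces truncate to start IO itself; the other two
either wait for IO that is already running or need no IO at all.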

-chris


KOSAKI Motohiro

2010-Apr-27 13:03 UTC

Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?

> 
> On Apr 26, 2010, at 6:18 AM, KOSAKI Motohiro wrote:
> > AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs
> > (and rd later chose another approach).
> > So it assumes that writepage refusals don't happen on the majority of
> > pages.
> > IOW, the VM assumes that many other pages can be written out even though
> > this page can't.
> > So the VM only activates the page if AOP_WRITEPAGE_ACTIVATE is returned.
> > But now ext4 and btrfs refuse all writepage() calls. (right?)
> 
> No, not exactly.  Btrfs refuses the writepage() in the direct reclaim
> cases (i.e., if PF_MEMALLOC is set), but will do writepage() in the case
> of zone scanning.  I don't want to speak for Chris, but I assume it's due
> to stack depth concerns --- if it was just due to worrying about fs
> recursion issues, I assume all of the btrfs allocations could be done
> GFP_NOFS.
> 
> Ext4 is slightly different; it refuses writepage() if the inode blocks
> for the page haven't yet been allocated.  (Regardless of whether it's
> happening for direct reclaim or zone scanning.)  However, if the on-disk
> block has been assigned (i.e., this isn't a delalloc case), ext4 will
> honor the writepage() (i.e., if this is an mmap of an already existing
> file, or if the space has been pre-allocated using fallocate()).  The
> reason for ext4's concern is lock ordering, although I'm investigating
> whether I can fix this.  If we call set_page_writeback() to set
> PG_writeback (plus set the various bits of magic fs accounting), and then
> drop the page lock, does that protect us from random changes happening to
> the page (i.e., from vmtruncate, etc.)?
> 
> > 
> > IOW, I don't think that documentation anticipated the delayed
> > allocation issue ;)
> > 
> > The point is, our dirty page accounting only tracks the system-wide
> > dirty ratio and per-task dirty pages. It doesn't track a per-NUMA-node
> > or per-zone dirty ratio. So refusing writepage combined with fake NUMA
> > abuse can easily confuse our VM. If _all_ pages in a VM LRU list (it's
> > per-zone) are refused, page activation doesn't help. It also leads to
> > OOM.
> > 
> > And I'm sorry, but I have to say that, as far as all VM developers are
> > concerned, fake NUMA is not production-level quality yet. AFAIK, nobody
> > has seriously tested our VM code in such an environment.
> > (linux/arch/x86/Kconfig says "This is only useful for debugging".)
> 
> So I'm sorry I mentioned the fake numa bit, since I think this is a bit
> of a red herring.  That code is in production here, and we've made all
> sorts of changes so it can be used for more than just debugging.  So
> please ignore it, it's our local hack, and if it breaks that's our
> problem.  More importantly, just two weeks ago I talked to someone in the
> financial sector, who was testing out ext4 on an upstream kernel, and not
> using our hacks that force 128MB zones, and he ran into the ext4/OOM
> problem while using an upstream kernel.  It involved Oracle pinning down
> 3G worth of pages, and him trying to do a huge streaming backup (which of
> course wasn't using fallocate or direct I/O) under ext4, and he had the
> same issue --- an OOM, that I'm pretty sure was caused by the fact that
> ext4_writepage() was refusing the writepage() and most of the pages that
> weren't nailed down by Oracle were delalloc.  The same test scenario
> using ext3 worked just fine, of course.
> 
> Under normal cases it's not a problem since statistically there should be
> enough other pages in the system compared to the number of pages that are
> subject to delalloc, such that pages can usually get pushed out until the
> writeback code can get around to writing out the pages.  But in cases
> where the zones have been made artificially small, or you have a big
> program like Oracle pinning down a large number of pages, then of course
> we have problems.
> 
> I'm trying to fix things from the file system side, which means trying to
> understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in
> Documentation/filesystems/Locking as something which MUST be used if
> writepage() is going to refuse a page.  And then I discovered no one is
> actually using it.  So that's why I was asking whether the Locking
> documentation file was out of date, or whether all of the file systems
> are doing it wrong.
> 
> On a related example of how file system code isn't necessarily following
> what is required/recommended by the Locking documentation, ext2 and ext3
> are both NOT using set_page_writeback()/end_page_writeback(), but are
> rather keeping the page locked until after they call
> block_write_full_page(), because of concerns of truncate coming in and
> screwing things up.  But now looking at Locking, it appears that
> set_page_writeback() is as good as lock_page() for preventing the
> truncate code from coming in and screwing everything up?  It's not clear
> to me exactly what locking guarantees are provided against truncate by
> set_page_writeback().  And suppose we are writing out a whole cluster of
> pages, say 4MB worth of pages; do we need to call set_page_writeback() on
> every single page in the cluster before we do the I/O to make sure things
> don't change out from under us?  (I'm pretty sure at least some of the
> other filesystems that are submitting huge numbers of pages using bio
> instead of 4k at a time like ext2/3/4 aren't calling set_page_writeback()
> on all of the pages first.)
> 
> Part of the problem is that the writeback Locking semantics aren't well
> documented, and where they are documented, it's not clear they are up to
> date --- and all of the file systems that are doing delayed allocation
> writeback are doing things slightly differently, or in some cases very
> differently.  (And even without delalloc, as I've pointed out ext2/3
> don't use set_page_writeback() --- if this is a MUST USE as implied by
> the Locking file, why didn't whoever added this requirement go in and
> modify common filesystems like ext2 and ext3 to use the
> set_page_writeback/end_page_writeback calls?)
> 
> I'm happy to change things in ext4; in fact I'm pretty sure ext4 probably
> isn't completely right here.  But it's not clear what "right" actually
> is, and when I look to see what protects writepage() racing with
> vmtruncate(), it's enough to give me a headache.  :-(
Umm.. sorry, I'm not the right person to answer your question.
Nick probably has the best knowledge in this area.

AFAICS, the vmtruncate call graph is as follows.

vmtruncate
 -> truncate_pagecache
    -> truncate_inode_pages
       -> truncate_inode_pages_range
            lock_page(page);
            wait_on_page_writeback(page);
            truncate_inode_page(mapping, page);
             -> truncate_complete_page
                -> remove_from_page_cache
             ....
            unlock_page(page);

So PG_locked and/or PG_writeback protect against remove_from_page_cache().
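
The ordering in that call graph can be sketched as a non-blocking
userspace model.  struct model_page and model_truncate_try() are
hypothetical names, and returning false where the kernel would sleep
(and dropping the lock before doing so) is a simplification of this
sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page state truncate looks at. */
struct model_page {
	bool locked;         /* stands in for PG_locked */
	bool writeback;      /* stands in for PG_writeback */
	bool in_page_cache;  /* still attached to the mapping */
};

/*
 * Returns true if the page was removed, false where the real
 * truncate_inode_pages_range() would have had to sleep.
 */
static bool model_truncate_try(struct model_page *page)
{
	if (page->locked)            /* lock_page() would sleep */
		return false;
	page->locked = true;
	if (page->writeback) {       /* wait_on_page_writeback() would sleep */
		page->locked = false;
		return false;
	}
	assert(page->in_page_cache);
	page->in_page_cache = false; /* remove_from_page_cache() */
	page->locked = false;        /* unlock_page() */
	return true;
}
```

A page under writeback stays in the page cache until the writeback bit
is cleared, after which the same call removes it.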

But..
Now I'm afraid this can't solve the ext4 delalloc issue. I'm pretty sure
you have already done the easy grepping above. I guess you are suffering
from a more difficult issue. I'd like to ask: why can't the ext4 logic
that counts the number of delalloc pages take the page lock?

And here is today's grep result:

ext2_writepage
  block_write_full_page
    block_write_full_page_endio
      __block_write_full_page
        set_page_writeback

end_buffer_async_write
  end_page_writeback


ext3 seems to have logic similar to ext2. Am I missing something?



> 
> Hence my question about whether it wouldn't be simpler if we simply added
> more high-level locking to prevent truncate from racing against
> writepage/writeback.
> 
> -- Ted
> 


