Dan Magenheimer
2010-May-28 17:35 UTC
[PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
Changes since V1:
- Rebased to 2.6.34 (no functional changes)
- Converted to sane types (Al Viro)
- Defined some raw constants (Konrad Wilk)
- Added ack from Andreas Dilger

In previous patch postings, cleancache was part of the Transcendent Memory ("tmem") patchset. This patchset refocuses not on the underlying technology (tmem) but on the useful functionality it provides for Linux, and provides a clean API so that cleancache can deliver this functionality either via a Xen tmem driver OR completely independently of tmem. For example: Nitin Gupta (of compcache and ramzswap fame) is implementing an in-kernel compression "backend" for cleancache; some believe cleancache will be a very nice interface for building RAM-like functionality for pseudo-RAM devices such as SSD or phase-change memory; and a Pune University team is looking at a backend for virtio (see OLS'2010).

A more complete description of cleancache can be found in the introductory comment in mm/cleancache.c (in PATCH 2/7), which is included below for convenience.

Note that an earlier version of this patch is now shipping in OpenSuSE 11.2 and will soon ship in a release of Oracle Enterprise Linux. The underlying tmem technology is now shipping in Oracle VM 2.2 and was just released in Xen 4.0 on April 15, 2010. (Search news.google.com for Transcendent Memory.)

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>

 fs/btrfs/extent_io.c       |    9 +
 fs/btrfs/super.c           |    2
 fs/buffer.c                |    5 +
 fs/ext3/super.c            |    2
 fs/ext4/super.c            |    2
 fs/mpage.c                 |    7 +
 fs/ocfs2/super.c           |    3
 fs/super.c                 |    8 +
 include/linux/cleancache.h |   90 ++++++++++++++++
 include/linux/fs.h         |    5 +
 mm/Kconfig                 |   22 ++++
 mm/Makefile                |    1
 mm/cleancache.c            |  203 +++++++++++++++++++++++++++++++++++
 mm/filemap.c               |   11 ++
 mm/truncate.c              |   10 ++
 15 files changed, 380 insertions(+)

Cleancache can be thought of as a page-granularity victim cache for clean pages that the kernel's pageframe replacement algorithm (PFRA) would like to keep around, but can't since there isn't enough memory. So when the PFRA "evicts" a page, it first attempts to put it into a synchronous, concurrency-safe, page-oriented pseudo-RAM device (such as Xen's Transcendent Memory, aka "tmem", or in-kernel compressed memory, aka "zmem", or other RAM-like devices) which is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size. And when a cleancache-enabled filesystem wishes to access a page in a file on disk, it first checks cleancache to see if it already contains it; if it does, the page is copied into the kernel and a disk access is avoided.

This pseudo-RAM device links itself to cleancache by setting the cleancache_ops pointer appropriately, and the functions it provides must conform to certain semantics, as follows:

Most important, cleancache is "ephemeral". Pages which are copied into cleancache have an indefinite lifetime which is completely unknowable by the kernel and so may or may not still be in cleancache at any later time. Thus, as its name implies, cleancache is not suitable for dirty pages. The pseudo-RAM has complete discretion over what pages to preserve and what pages to discard, and when.

A filesystem calls "init_fs" to obtain a pool id which, if positive, must be saved in the filesystem's superblock; a negative return value indicates failure.
A "put_page" will copy a (presumably about-to-be-evicted) page into pseudo-RAM and associate it with the pool id, the file inode, and a page index into the file. (The combination of a pool id, an inode, and an index is called a "handle".) A "get_page" will copy the page, if found, from pseudo-RAM into kernel memory. A "flush_page" will ensure the page no longer is present in pseudo-RAM; a "flush_inode" will flush all pages associated with the specified inode; and a "flush_fs" will flush all pages in all inodes specified by the given pool id. A "init_shared_fs", like init, obtains a pool id but tells the pseudo-RAM to treat the pool as shared using a 128-bit UUID as a key. On systems that may run multiple kernels (such as hard partitioned or virtualized systems) that may share a clustered filesystem, and where the pseudo-RAM may be shared among those kernels, calls to init_shared_fs that specify the same UUID will receive the same pool id, thus allowing the pages to be shared. Note that any security requirements must be imposed outside of the kernel (e.g. by "tools" that control the pseudo-RAM). Or a pseudo-RAM implementation can simply disable shared_init by always returning a negative value. If a get_page is successful on a non-shared pool, the page is flushed (thus making cleancache an "exclusive" cache). On a shared pool, the page is NOT flushed on a successful get_page so that it remains accessible to other sharers. The kernel is responsible for ensuring coherency between cleancache (shared or not), the page cache, and the filesystem, using cleancache flush operations as required. Note that the pseudo-RAM must enforce put-put-get coherency and get-get coherency. For the former, if two puts are made to the same handle but with different data, say AAA by the first put and BBB by the second, a subsequent get can never return the stale data (AAA). For get-get coherency, if a get for a given handle fails, subsequent gets for that handle will never succeed unless preceded by a successful put with that handle. Last, pseudo-RAM provides no SMP serialization guarantees; if two different Linux threads are putting an flushing a page with the same handle, the results are indeterminate. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Dan Magenheimer
2010-Jun-02 15:28 UTC
[Ocfs2-devel] [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
Hi Minchan --

> I think cleancache approach is cool. :)
> I have some suggestions and questions.

Thanks for your interest!

> > If a get_page is successful on a non-shared pool, the page is flushed
> > (thus making cleancache an "exclusive" cache). On a shared pool, the page
>
> Do you have any reason about forcing "exclusive" on a non-shared pool?
> To free memory on pseudo-RAM?
> I want to make it "inclusive" for some reason but unfortunately I can't
> say why I want it now.

The main reason is to free up memory in pseudo-RAM and to avoid unnecessary cleancache_flush calls. If you want inclusive, the page can be put immediately following the get. If put-after-get for inclusive becomes common, the interface could easily be extended to add a "get_no_flush" call.

> While you mentioned it's "exclusive", cleancache_get_page doesn't
> flush the page at below code.
> Is it a role of the user who implements cleancache_ops->get_page?

Yes, the flush is done by the cleancache implementation.

> If the backing device is ram, could we _move_ the pages from page cache
> to cleancache?
> I mean I don't want to copy the page on get/put operations. We can just
> move the page in case the backing device is "ram". Is it possible?

By "move", do you mean changing the virtual mappings? Yes, this could be done as long as the source and destination are both directly addressable (that is, true physical RAM), but it requires TLB manipulation and has some complicated corner cases. The copy semantics simplifies the implementation on both the "frontend" and the "backend" and also allows the backend to do fancy things on-the-fly like page compression and page deduplication.

> You sent the patches which are the core of cleancache but I don't see any
> use case.
> Could you send use case patches with this series?
> It could help understand cleancache's benefit.

Do you mean the Xen Transcendent Memory ("tmem") implementation? If so, this is four files in the Xen source tree (common/tmem.c, common/tmem_xen.c, include/xen/tmem.h, include/xen/tmem_xen.h). There is also an html document in the Xen source tree, which can be viewed here:
http://oss.oracle.com/projects/tmem/dist/documentation/internals/xen4-internals-v01.html

Or did you mean a cleancache_ops "backend"? For tmem, there is one file, linux/drivers/xen/tmem.c, and it interfaces between the cleancache_ops calls and Xen hypercalls. It should be in a Xenlinux pv_ops tree soon, or I can email it sooner.

I am also eagerly awaiting Nitin Gupta's cleancache backend and implementation to do in-kernel page cache compression.

Thanks,
Dan
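[A hedged sketch of the put-after-get idea mentioned above: approximating "inclusive" behavior on a non-shared pool. The frontend calls come from include/linux/cleancache.h in this series; the helper name and the "0 on success" return convention are assumptions for illustration, not taken from the patch.]

/*
 * Hedged sketch (not from the patch): approximate "inclusive" caching
 * on a non-shared pool by re-putting a page right after a successful get.
 * The "0 means success" return convention is an assumption.
 */
#include <linux/cleancache.h>
#include <linux/mm.h>

static int example_inclusive_get(struct page *page)
{
	int found = cleancache_get_page(page);

	if (found == 0) {
		/*
		 * The page was found and copied into kernel memory.  On a
		 * non-shared pool the backend has now flushed its own copy
		 * (exclusive semantics), so put it straight back if we want
		 * it to remain resident in pseudo-RAM as well.
		 */
		cleancache_put_page(page);
	}
	return found;
}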
Dan Magenheimer
2010-Jun-02 15:36 UTC
[Ocfs2-devel] [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
> From: Jamie Lokier [mailto:jamie at shareable.org]
> Subject: Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
>
> Dan Magenheimer wrote:
> > Most important, cleancache is "ephemeral". Pages which are copied into
> > cleancache have an indefinite lifetime which is completely unknowable
> > by the kernel and so may or may not still be in cleancache at any later
> > time. Thus, as its name implies, cleancache is not suitable for dirty
> > pages. The pseudo-RAM has complete discretion over what pages to
> > preserve and what pages to discard and when.
>
> Fwiw, the feature sounds useful to userspace too, for those things
> with memory hungry caches like web browsers. Any plans to make it
> available to userspace?

No plans yet, though we agree it sounds useful, at least for apps that bypass the page cache (e.g. O_DIRECT). If you have time and interest to investigate this further, I'd be happy to help. Send email offlist.

Thanks,
Dan
Dan Magenheimer
2010-Jun-02 16:07 UTC
[Ocfs2-devel] [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
> From: Christoph Hellwig [mailto:hch at infradead.org]
> Subject: Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

Hi Christoph --

Thanks for your feedback!

> > fs/btrfs/super.c           |    2
> > fs/buffer.c                |    5 +
> > fs/ext3/super.c            |    2
> > fs/ext4/super.c            |    2
> > fs/mpage.c                 |    7 +
> > fs/ocfs2/super.c           |    3
> > fs/super.c                 |    8 +
>
> This is missing out a whole lot of filesystems. Even more so why the
> hell do you need hooks into the filesystem?

Let me rephrase/regroup your question. Let me know if I missed anything...

1) Why is the VFS layer involved at all?

VFS hooks are necessary to avoid a disk read when a page is already in cleancache and to maintain coherency (via cleancache_flush operations) between cleancache, the page cache, and disk. This very small, very clean set of hooks (placed by Chris Mason) all compile into nothingness if cleancache is config'ed off, and turn into "if (*p == NULL)" if config'ed on but no "backend" claims cleancache_ops or if an fs doesn't opt in (see below; a sketch of this hook pattern follows this message).

2) Why do the individual filesystems need to be modified?

Some filesystems are built entirely on top of VFS and the hooks in VFS are sufficient, so they don't require an fs-specific "cleancache_init" hook; the initial implementation of cleancache didn't provide this hook. But for some filesystems (such as btrfs) the VFS hooks are incomplete and one or more hooks in the fs-specific code are required. For some other filesystems (such as tmpfs), cleancache may even be counterproductive. So it seemed prudent to require a filesystem to "opt in" to use cleancache, which requires at least one hook in any fs.

3) Why are filesystems missing?

Only because they haven't been tested. The existence proof of four filesystems (ext3/ext4/ocfs2/btrfs) should be sufficient to validate the concept, the opt-in approach means that untested filesystems are not affected, and the hooks in those four should serve as examples to show that it should be very easy to add more filesystems in the future.

> Please give your patches some semi-reasonable subject line.

Not sure what you mean... are the subject lines too short? Or should I leave off the back-reference to Transcendent Memory? Or please suggest something you think is more reasonable?

Thanks,
Dan
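[A hedged illustration of the "compiles into nothingness" hook pattern described in point 1 above. It is not copied from the patch: the sb->cleancache_poolid field name, whether cleancache_ops is a struct or a pointer, and the exact checks are assumptions; the real hooks are in include/linux/cleancache.h and mm/filemap.c in this series.]

/*
 * Hedged illustration only -- field names and checks are assumptions.
 */
#include <linux/cleancache.h>
#include <linux/fs.h>
#include <linux/mm.h>

#ifdef CONFIG_CLEANCACHE
static inline int example_cleancache_get_page(struct page *page)
{
	struct inode *inode = page->mapping->host;
	int pool_id = inode->i_sb->cleancache_poolid;	/* assumed field */
	int ret = -1;

	/* cheap bail-out: no backend registered, or this fs never opted in */
	if (cleancache_ops.get_page != NULL && pool_id >= 0)
		ret = cleancache_ops.get_page(pool_id, inode->i_ino,
					      page->index, page);
	return ret;
}
#else
static inline int example_cleancache_get_page(struct page *page)
{
	return -1;	/* with CONFIG_CLEANCACHE off, the hook vanishes */
}
#endif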
Dan Magenheimer
2010-Jun-02 23:04 UTC
[Ocfs2-devel] [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
> From: Minchan Kim [mailto:minchan.kim at gmail.com]
>
> > I am also eagerly awaiting Nitin Gupta's cleancache backend
> > and implementation to do in-kernel page cache compression.
>
> Did Nitin say he will make a backend of cleancache for
> page cache compression?
>
> It would be a good feature.
> I have an interest, too. :)

That was Nitin's plan for his GSOC project when we last discussed this. Nitin is on the cc list and can comment if this has changed.

> > By "move", do you mean changing the virtual mappings? Yes,
> > this could be done as long as the source and destination are
> > both directly addressable (that is, true physical RAM), but
> > requires TLB manipulation and has some complicated corner
> > cases. The copy semantics simplifies the implementation on
> > both the "frontend" and the "backend" and also allows the
> > backend to do fancy things on-the-fly like page compression
> > and page deduplication.
>
> Agree. But I don't mean it.
> If I use brd as backend, I want to do it as follows.
>
> <snip>
>
> Of course, I know it's impossible without new metadata and
> modification of page cache handling and it makes front and
> backend's good layered design.
>
> What I want is to remove the copy overhead when the backend is ram
> and it's also part of main memory (ie, we have a page descriptor).
>
> Do you have an idea?

Copy overhead on modern processors is very low now due to very wide memory buses. The additional metadata and code to handle coherency and concurrency, plus existing overhead for batching and asynchronous access to brd, is likely much higher than the cost avoided by not copying. But if you did implement this without copying, I think you might need a different set of hooks in various places. I don't know.

> > Or did you mean a cleancache_ops "backend"? For tmem, there
> > is one file linux/drivers/xen/tmem.c and it interfaces between
> > the cleancache_ops calls and Xen hypercalls. It should be in
> > a Xenlinux pv_ops tree soon, or I can email it sooner.
>
> I mean "backend". :)

I dropped the code used for a RHEL6beta Xen tmem driver here:
http://oss.oracle.com/projects/tmem/dist/files/RHEL6beta/tmem-backend.patch

Thanks,
Dan
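[For readers wondering what a cleancache_ops "backend" amounts to, here is a hedged skeleton of its general shape: a set of functions wired into cleancache_ops at init time. This is NOT the actual drivers/xen/tmem.c code; the placeholder bodies, the example_ names, and the registration detail are assumptions, and the signatures follow the sketch given earlier in this thread.]

/*
 * Hedged skeleton of a cleancache backend -- not the real tmem driver.
 * A real backend (Xen hypercalls, an in-kernel compressed pool, ...)
 * would store and fetch page contents where these placeholders do nothing.
 */
#include <linux/cleancache.h>	/* struct cleancache_ops, from PATCH 2/7 */
#include <linux/module.h>

static int example_init_fs(size_t pagesize)
{
	return 1;		/* placeholder pool id; negative would mean failure */
}

static int example_init_shared_fs(char *uuid, size_t pagesize)
{
	return -1;		/* decline shared pools, as the overview allows */
}

static int example_get_page(int pool, ino_t ino, pgoff_t idx, struct page *page)
{
	return -1;		/* placeholder: nothing is ever found */
}

static void example_put_page(int pool, ino_t ino, pgoff_t idx, struct page *page)
{
	/* placeholder: a real backend would copy the page contents out here */
}

static void example_flush_page(int pool, ino_t ino, pgoff_t idx) { }
static void example_flush_inode(int pool, ino_t ino) { }
static void example_flush_fs(int pool) { }

static int __init example_backend_init(void)
{
	struct cleancache_ops ops = {
		.init_fs	= example_init_fs,
		.init_shared_fs	= example_init_shared_fs,
		.get_page	= example_get_page,
		.put_page	= example_put_page,
		.flush_page	= example_flush_page,
		.flush_inode	= example_flush_inode,
		.flush_fs	= example_flush_fs,
	};

	cleancache_ops = ops;	/* registration detail assumed; may differ */
	return 0;
}
module_init(example_backend_init);
MODULE_LICENSE("GPL");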
Dan Magenheimer
2010-Jun-03 15:43 UTC
[Ocfs2-devel] [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
> On 06/03/2010 10:23 AM, Andreas Dilger wrote:
> > On 2010-06-02, at 20:46, Nitin Gupta wrote:
> >
> > I was thinking it would be quite clever to do compression in, say,
> > 64kB or 128kB chunks in a mapping (to get decent compression) and
> > then write these compressed chunks directly from the page cache
> > to disk in btrfs and/or a revived compressed ext4.
>
> Batching of pages to get good compression ratio seems doable.

Is there evidence that batching a set of random individual 4K pages will have a significantly better compression ratio than compressing the pages separately? I certainly understand that if the pages are from the same file, compression is likely to be better, but pages evicted from the page cache (which is the source for all cleancache_puts) are likely to be quite a bit more random than that, aren't they?