thr3ads.net - Btrfs devel - [patch 0/2]btrfs: add two ioctls to do metadata readahead [Jul 2010]

If this information is useful, please help other people find it:
Share via:

Shaohua Li

2010-Jul-14 08:02 UTC

[patch 0/2]btrfs: add two ioctls to do metadata readahead

Hi,
  We have file readahead to do asyn file read, but has no metadata
readahead. For a list of files, their metadata is stored in fragmented
disk space and metadata read is a sync operation, which impacts the
efficiency of readahead much. The patches try to add meatadata readahead
for btrfs.
  In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
the inode to a fd so we could use existing syscalls (readahead, mincore
or upcoming fincore) to do readahead, but the inode is hidden, there is
no easy way for this from my understanding. So we add two ioctls for
this. One is like readahead syscall, the other is like micore/fincore
syscall.
  Under a harddisk based netbook with Meego, the metadata readahead
reduced about 3.5s boot time from total 16s.

Issues:
1. it appears readahead metadata pages skipped checksum checking. I''m
still working on this.
2. in latest kernel, I got a lockdep warning. It looks not related to
the patches but I only observed it with the patches. The warning looks
like a false warning, as in my debug the spin_lock isn''t hold. from my
understanding, all extent_buffer share a lockdep class and in the btree
lookup we might lock several extent_buffer. But I don''t know how to fix
it yet.

Thanks,
Shaohua

Shaohua Li

2010-Jul-14 08:26 UTC

head link

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

On Wed, Jul 14, 2010 at 04:02:16PM +0800, Shaohua Li
wrote:> Hi,
>   We have file readahead to do asyn file read, but has no metadata
> readahead. For a list of files, their metadata is stored in fragmented
> disk space and metadata read is a sync operation, which impacts the
> efficiency of readahead much. The patches try to add meatadata readahead
> for btrfs.
>   In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> the inode to a fd so we could use existing syscalls (readahead, mincore
> or upcoming fincore) to do readahead, but the inode is hidden, there is
> no easy way for this from my understanding. So we add two ioctls for
> this. One is like readahead syscall, the other is like micore/fincore
> syscall.
>   Under a harddisk based netbook with Meego, the metadata readahead
> reduced about 3.5s boot time from total 16s.
> 
> Issues:
> 1. it appears readahead metadata pages skipped checksum checking.
I''m
> still working on this.
> 2. in latest kernel, I got a lockdep warning. It looks not related to
> the patches but I only observed it with the patches. The warning looks
> like a false warning, as in my debug the spin_lock isn''t hold.
from my
> understanding, all extent_buffer share a lockdep class and in the btree
> lookup we might lock several extent_buffer. But I don''t know how
to fix
> it yet.Ha, sorry, actually the two issues are one issue. We set lockdep level class
and doing checksum in one place. I have a debug patch for this and will
send out later. But before this, I''d like know your comments about the
idea.

Thanks,
Shaohua 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Shaohua Li

2010-Jul-19 05:43 UTC

head link

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

On Wed, Jul 14, 2010 at 04:26:45PM +0800, Shaohua Li
wrote:> On Wed, Jul 14, 2010 at 04:02:16PM +0800, Shaohua Li wrote:
> > Hi,
> >   We have file readahead to do asyn file read, but has no metadata
> > readahead. For a list of files, their metadata is stored in fragmented
> > disk space and metadata read is a sync operation, which impacts the
> > efficiency of readahead much. The patches try to add meatadata
readahead
> > for btrfs.
> >   In btrfs, metadata is stored in btree_inode. Ideally, if we could
hook
> > the inode to a fd so we could use existing syscalls (readahead,
mincore
> > or upcoming fincore) to do readahead, but the inode is hidden, there
is
> > no easy way for this from my understanding. So we add two ioctls for
> > this. One is like readahead syscall, the other is like micore/fincore
> > syscall.
> >   Under a harddisk based netbook with Meego, the metadata readahead
> > reduced about 3.5s boot time from total 16s.
> > 
> > Issues:
> > 1. it appears readahead metadata pages skipped checksum checking.
I''m
> > still working on this.
> > 2. in latest kernel, I got a lockdep warning. It looks not related to
> > the patches but I only observed it with the patches. The warning looks
> > like a false warning, as in my debug the spin_lock isn''t
hold. from my
> > understanding, all extent_buffer share a lockdep class and in the
btree
> > lookup we might lock several extent_buffer. But I don''t know
how to fix
> > it yet.
> Ha, sorry, actually the two issues are one issue. We set lockdep level
class
> and doing checksum in one place. I have a debug patch for this and will
> send out later. But before this, I''d like know your comments about
the idea.Hi Chris,
Below is the patch to fix checksum check skip issue. It''s applied after
the
two ioctl patches. I did stress test here and can''t find any issue. can
you
please take a look at the three patches.

Thanks,
Shaohua



Subject: [patch]btrfs: do validation for extent_buffer if it''s skipped
before

With metadata readahead, we slightly change the behavior. Before it, we allocate
an extent_buffer (so set page->private), do metadata read and
btree_readpage_end_io_hook() will do validation. After it, we directly do
metadata readahead, and since in this case page hasn''t ->private,
btree_readpage_end_io_hook() will not do validation. This patch fixes this.
It addes a new flag to indicate if a buffer is validated. If not and even
the buffer is uptodated, we will do a validation.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>

---
 fs/btrfs/disk-io.c     |   67 ++++++++++++++++++++++++++++++-------------------
 fs/btrfs/disk-io.h     |    2 +
 fs/btrfs/extent-tree.c |    1 
 fs/btrfs/extent_io.c   |   14 ++++++++--
 fs/btrfs/extent_io.h   |    1 
 5 files changed, 58 insertions(+), 27 deletions(-)

Index: linux/fs/btrfs/disk-io.c
==================================================================---
linux.orig/fs/btrfs/disk-io.c	2010-07-19 10:58:10.000000000 +0800
+++ linux/fs/btrfs/disk-io.c	2010-07-19 10:58:13.000000000 +0800
@@ -406,30 +406,18 @@ void btrfs_set_buffer_lockdep_class(stru
 }
 #endif
 
-static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
-			       struct extent_state *state)
+int btree_validate_extent_buffer(struct btrfs_root *root,
+	struct extent_buffer *eb)
 {
-	struct extent_io_tree *tree;
+	int ret = 0;
 	u64 found_start;
 	int found_level;
-	unsigned long len;
-	struct extent_buffer *eb;
-	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-	int ret = 0;
-
-	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	if (page->private == EXTENT_PAGE_PRIVATE)
-		goto out;
-	if (!page->private)
-		goto out;
-
-	len = page->private >> 2;
-	WARN_ON(len == 0);
 
-	eb = alloc_extent_buffer(tree, start, len, page, GFP_NOFS);
+	if (!test_bit(EXTENT_BUFFER_UNCHECKED, &eb->bflags))
+		return 0;
 
 	found_start = btrfs_header_bytenr(eb);
-	if (found_start != start) {
+	if (found_start != eb->start) {
 		if (printk_ratelimit()) {
 			printk(KERN_INFO "btrfs bad tree block start "
 			       "%llu %llu\n",
@@ -439,13 +427,7 @@ static int btree_readpage_end_io_hook(st
 		ret = -EIO;
 		goto err;
 	}
-	if (eb->first_page != page) {
-		printk(KERN_INFO "btrfs bad first page %lu %lu\n",
-		       eb->first_page->index, page->index);
-		WARN_ON(1);
-		ret = -EIO;
-		goto err;
-	}
+
 	if (check_tree_block_fsid(root, eb)) {
 		if (printk_ratelimit()) {
 			printk(KERN_INFO "btrfs bad fsid on block %llu\n",
@@ -461,6 +443,41 @@ static int btree_readpage_end_io_hook(st
 	ret = csum_tree_block(root, eb, 1);
 	if (ret)
 		ret = -EIO;
+err:
+	if (ret == 0)
+		clear_bit(EXTENT_BUFFER_UNCHECKED, &eb->bflags);
+	return ret;
+}
+
+static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
+			       struct extent_state *state)
+{
+	struct extent_io_tree *tree;
+	unsigned long len;
+	struct extent_buffer *eb;
+	struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+	int ret = 0;
+
+	tree = &BTRFS_I(page->mapping->host)->io_tree;
+	if (page->private == EXTENT_PAGE_PRIVATE)
+		goto out;
+	if (!page->private)
+		goto out;
+
+	len = page->private >> 2;
+	WARN_ON(len == 0);
+
+	eb = alloc_extent_buffer(tree, start, len, page, GFP_NOFS);
+
+	if (eb->first_page != page) {
+		printk(KERN_INFO "btrfs bad first page %lu %lu\n",
+		       eb->first_page->index, page->index);
+		WARN_ON(1);
+		ret = -EIO;
+		goto err;
+	}
+
+	ret = btree_validate_extent_buffer(root, eb);
 
 	end = min_t(u64, eb->len, PAGE_CACHE_SIZE);
 	end = eb->start + end - 1;
Index: linux/fs/btrfs/disk-io.h
==================================================================---
linux.orig/fs/btrfs/disk-io.h	2010-07-19 10:56:33.000000000 +0800
+++ linux/fs/btrfs/disk-io.h	2010-07-19 10:58:13.000000000 +0800
@@ -36,6 +36,8 @@ static inline u64 btrfs_sb_offset(int mi
 struct btrfs_device;
 struct btrfs_fs_devices;
 
+int btree_validate_extent_buffer(struct btrfs_root *root,
+	struct extent_buffer *eb);
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
 				      u32 blocksize, u64 parent_transid);
 int readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
Index: linux/fs/btrfs/extent_io.c
==================================================================---
linux.orig/fs/btrfs/extent_io.c	2010-07-19 10:56:33.000000000 +0800
+++ linux/fs/btrfs/extent_io.c	2010-07-19 10:58:13.000000000 +0800
@@ -14,6 +14,7 @@
 #include "extent_map.h"
 #include "compat.h"
 #include "ctree.h"
+#include "disk-io.h"
 #include "btrfs_inode.h"
 
 static struct kmem_cache *extent_state_cache;
@@ -3162,6 +3163,7 @@ struct extent_buffer *alloc_extent_buffe
 			uptodate = 0;
 		unlock_page(p);
 	}
+	set_bit(EXTENT_BUFFER_UNCHECKED, &eb->bflags);
 	if (uptodate)
 		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
 
@@ -3387,9 +3389,10 @@ int read_extent_buffer_pages(struct exte
 	unsigned long num_pages;
 	struct bio *bio = NULL;
 	unsigned long bio_flags = 0;
+	struct btrfs_root *root = BTRFS_I(tree->mapping->host)->root;
 
 	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
-		return 0;
+		goto out;
 
 	if (test_range_bit(tree, eb->start, eb->start + eb->len - 1,
 			   EXTENT_UPTODATE, 1, NULL)) {
@@ -3454,8 +3457,10 @@ int read_extent_buffer_pages(struct exte
 			ret = -EIO;
 	}
 
-	if (!ret)
+	if (!ret) {
 		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
+		goto out;
+	}
 	return ret;
 
 unlock_exit:
@@ -3466,6 +3471,11 @@ unlock_exit:
 		unlock_page(page);
 		locked_pages--;
 	}
+out:
+	if (!ret && test_bit(EXTENT_BUFFER_UNCHECKED, &eb->bflags)
&&
+		test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+		return btree_validate_extent_buffer(root, eb);
+
 	return ret;
 }
 
Index: linux/fs/btrfs/extent_io.h
==================================================================---
linux.orig/fs/btrfs/extent_io.h	2010-07-19 10:56:33.000000000 +0800
+++ linux/fs/btrfs/extent_io.h	2010-07-19 10:58:13.000000000 +0800
@@ -27,6 +27,7 @@
 #define EXTENT_BUFFER_UPTODATE 0
 #define EXTENT_BUFFER_BLOCKING 1
 #define EXTENT_BUFFER_DIRTY 2
+#define EXTENT_BUFFER_UNCHECKED 3
 
 /* these are flags for extent_clear_unlock_delalloc */
 #define EXTENT_CLEAR_UNLOCK_PAGE 0x1
Index: linux/fs/btrfs/extent-tree.c
==================================================================---
linux.orig/fs/btrfs/extent-tree.c	2010-07-19 10:56:33.000000000 +0800
+++ linux/fs/btrfs/extent-tree.c	2010-07-19 10:58:13.000000000 +0800
@@ -5277,6 +5277,7 @@ struct extent_buffer *btrfs_init_new_buf
 	clean_tree_block(trans, root, buf);
 
 	btrfs_set_lock_blocking(buf);
+	clear_bit(EXTENT_BUFFER_UNCHECKED, &buf->bflags);
 	btrfs_set_buffer_uptodate(buf);
 
 	if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2010-Jul-19 08:03 UTC

head link

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

This seems like iteration 66 of the ill fated mobling readahead crap.

Please go to linux-fsdevel to define a proper interface instead of
your per-filesystem hacks.  I had hoped Arjan got it after the last big
flameware, but it seems like you''re only moving from one target to the
next.

NACK in this current form.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Chris Mason

2010-Jul-19 11:25 UTC

head link

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

On Mon, Jul 19, 2010 at 04:03:54AM -0400, Christoph Hellwig
wrote:> This seems like iteration 66 of the ill fated mobling readahead crap.
> 
> Please go to linux-fsdevel to define a proper interface instead of
> your per-filesystem hacks.  I had hoped Arjan got it after the last big
> flameware, but it seems like you''re only moving from one target to
the
> next.
> 
> NACK in this current form.
Honestly I think its good to have some of these features in specific
filesystems at first.  Sure, it is easy to tell if they help in the lab
and during development, but why go through the pain of implementing them
in generic form until they have been through a broader test?

So the idea for the feature is fine with me, especially for something
like readahead which we can support forever by turning it into a noop if
there turns out to be a better idea.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Arjan van de Ven

2010-Jul-19 14:00 UTC

head link

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

On 7/19/2010 1:03 AM, Christoph Hellwig wrote:> This seems like iteration 66 of the ill fated mobling readahead crap.
>    
I think you need to actually look at what this patch does before 
rendering this judgment; something you apparently have not done.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Hellwig

2010-Jul-20 03:58 UTC

head link

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

On Mon, Jul 19, 2010 at 07:00:33AM -0700, Arjan van de Ven
wrote:> On 7/19/2010 1:03 AM, Christoph Hellwig wrote:
> >This seems like iteration 66 of the ill fated mobling readahead crap.
> 
> I think you need to actually look at what this patch does before
> rendering this judgment; something you apparently have not done.
Thanks a lot for the attack, but of course I have.  It''s the same
stupid
idea of throwing almost generic code into filesystems that you previous
attempt did.  It''s completely generic code for checking what pages are
in-memory (plus constaints) and then force read-ahead on them again.
The only filesystems specific bit is which address_space to select.  The
whole apporach is also useful to extN / xfs / etc if you allow it on
any address_space.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Btrfs devel - Jul 2010 - [patch 0/2]btrfs: add two ioctls to do metadata readahead

[patch 0/2]btrfs: add two ioctls to do metadata readahead

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead

Re: [patch 0/2]btrfs: add two ioctls to do metadata readahead