Hello everyone,

Now that v0.16 is out the door, I'd like to get a thread going on topics people are interested in tackling next. The top of my list looks like this:

Improved allocator threading
Better in-memory free space indexing (Josef)
Better fsync performance (Chris)
Improved offline fsck
NFS Support
O_DIRECT support

But anything on the development timeline is fair game. I'm shooting for a smaller number of changes this time around so a new release can be cut before the kernel summit and plumber's conference in mid-September.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2008-08-06 at 10:21 -0400, Chris Mason wrote:
> NFS Support

This is basically ready. All you need in btrfs is the two patches from Balaji Rao, which I've updated to apply to the 0.16 and put in git.infradead.org/users/dwmw2/btrfs-kernel-unstable.git (along with a build fix for 2.6.27-rc2, which is also below).

The rest of it is a generic problem with NFSD, for which the (current) fix is at git.infradead.org/users/dwmw2/nfsexport-2.6.git

You could perhaps copy the readdir hack into btrfs code for use with obsolete kernels -- but to be honest I'd be inclined to leave that for the masochists^Wenterprise folks.

From 6c5f1012ccb1bb8a55dc9e564db3ca15d893763b Mon Sep 17 00:00:00 2001
From: David Woodhouse <David.Woodhouse@intel.com>
Date: Wed, 6 Aug 2008 15:54:51 +0100
Subject: [PATCH] Change TestSetPageLocked() to trylock_page()

Add backwards compatibility in compat.h

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
---
 compat.h    |    3 +++
 extent_io.c |    3 ++-
 2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/compat.h b/compat.h
index d39a768..b3349a6 100644
--- a/compat.h
+++ b/compat.h
@@ -1,6 +1,9 @@
 #ifndef _COMPAT_H_
 #define _COMPAT_H_
+#if LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,26)
+#define trylock_page(page) (!TestSetPageLocked(page))
+#endif
 /*
  * Even if AppArmor isn't enabled, it still has different prototypes.
diff --git a/extent_io.c b/extent_io.c
index 1cf4bab..f46f886 100644
--- a/extent_io.c
+++ b/extent_io.c
@@ -14,6 +14,7 @@
 #include <linux/pagevec.h>
 #include "extent_io.h"
 #include "extent_map.h"
+#include "compat.h"

 /* temporary define until extent_map moves out of btrfs */
 struct kmem_cache *btrfs_cache_create(const char *name, size_t size,
@@ -3055,7 +3056,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 	for (i = start_i; i < num_pages; i++) {
 		page = extent_buffer_page(eb, i);
 		if (!wait) {
-			if (TestSetPageLocked(page))
+			if (!trylock_page(page))
 				goto unlock_exit;
 		} else {
 			lock_page(page);
--
1.5.5.1

--
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
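The `!` in the compat macro matters because the two helpers report opposite things: TestSetPageLocked() returns the previous state of the lock bit (nonzero means the page was already locked), while trylock_page() returns nonzero when the caller got the lock. A minimal userspace sketch of that inversion follows. `struct fake_page`, `test_set_page_locked()` and `try_lock_page()` are hypothetical stand-ins, not kernel API, and unlike the kernel helpers they are not atomic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the lock bit matters here. */
struct fake_page {
	bool locked;
};

/*
 * Old-style helper: sets the lock bit and returns its PREVIOUS value,
 * so a nonzero result means the page was already locked (we failed).
 */
int test_set_page_locked(struct fake_page *page)
{
	int was_locked = page->locked;

	page->locked = true;
	return was_locked;
}

/* New-style helper: returns nonzero when WE took the lock (success). */
int try_lock_page(struct fake_page *page)
{
	return !test_set_page_locked(page);
}

/* Exercises both outcomes; call it from main() to run the check. */
void trylock_self_check(void)
{
	struct fake_page page = { .locked = false };

	assert(try_lock_page(&page));   /* first attempt takes the lock */
	assert(!try_lock_page(&page));  /* second attempt fails: already held */
}
```

This is why the call site in read_extent_buffer_pages() flips from `if (TestSetPageLocked(page))` to `if (!trylock_page(page))`: both forms take the `goto unlock_exit` branch when the lock could not be taken.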
On Wed, 2008-08-06 at 15:58 +0100, David Woodhouse wrote:
> On Wed, 2008-08-06 at 10:21 -0400, Chris Mason wrote:
> > NFS Support
>
> This is basically ready. All you need in btrfs is the two patches from
> Balaji Rao, which I've updated to apply to the 0.16 and put in
> git.infradead.org/users/dwmw2/btrfs-kernel-unstable.git (along with a
> build fix for 2.6.27-rc2, which is also below).
>
> The rest of it is a generic problem with NFSD, for which the (current)
> fix is at git.infradead.org/users/dwmw2/nfsexport-2.6.git
>
> You could perhaps copy the readdir hack into btrfs code for use with
> obsolete kernels -- but to be honest I'd be inclined to leave that for
> the masochists^Wenterprise folks.
>

We do need the readdir hack, being able to test on older kernels (say 2.6.26) is a big part of attracting and keeping btrfs testers.

Thanks for the trylock_page, I'll toss it in.

-chris
>> You could perhaps copy the readdir hack into btrfs code for use with
>> obsolete kernels -- but to be honest I'd be inclined to leave that for
>> the masochists^Wenterprise folks.
>>
>
> We do need the readdir hack, being able to test on older kernels (say
> 2.6.26) is a big part of attracting and keeping btrfs testers.
>

Guess you're talking about the sadistic people like me?

Rei
> Improved allocator threading

I wanted to work on the allocator with a larger scope where threading is only a minor part of trying to address these items from the Project_ideas that I think could change disk format in some way (to fix it before v1.0):

 - Different sector sizes
 - Multiple chunk trees and extent allocation trees
 - Limiting btree failure domains

and maybe impacting this from Development_timeline

 - Reserved space for online fsck and the ability to add storage so that a background extent allocation check can proceed

Maybe this is too ambitious or I am seeing intersections that are not there, but I am prepared to try doing the allocator.

jim

P.S. Are there other V1.0 format issues to lock down that should be worked before the missing features like O_DIRECT?
On Wed, 2008-08-06 at 11:42 -0400, jim owens wrote:
> > Improved allocator threading
>
> I wanted to work on the allocator with a larger scope
> where threading is only a minor part of trying to address

Josef's allocator fix is on the list because we currently fall over in some workloads at 100% cpu time when the FS is 60% full. The space indexing is complex and strange, it just needs to be redone.

> these items from the Project_ideas that I think could change
> disk format in some way (to fix it before v1.0):
> - Different sector sizes

Sector alignment and sector sizes definitely need to happen before 1.0

> - Multiple chunk trees and extent allocation trees

For these I was planning on only adding the disk format bits needed and leaving the code alone.

> - Limiting btree failure domains
> and maybe impacting this from Development_timeline
> - Reserved space for online fsck and the ability to add
> storage so that a background extent allocation check can proceed

The reserved space is important as well.

>
> Maybe this is too ambitious or I am seeing intersections that
> are not there, but I am prepared to try doing the allocator.
>

I'd love to have help on all of the above, and you're welcome to dive in and give it a shot. I'd say to pick one though, starting with smaller patches is going to be a good idea.

> jim
>
> P.S. Are there other V1.0 format issues to lock down that
> should be worked before the missing features like O_DIRECT?

Yes, I'm trying to walk the line between having enough performance for people to do baseline tests (the results of which may force disk format changes) and pushing out the disk format changes. So, things that are very well understood like multiple copies of the super block or compat flags, I'm pushing off.

-chris
On Wed, 2008-08-06 at 11:13 -0400, Chris Mason wrote:
> We do need the readdir hack, being able to test on older kernels (say
> 2.6.26) is a big part of attracting and keeping btrfs testers.

Well, those testers don't seem to have been put off so far by the fact that you can't export it by NFS. But it's easy enough to copy it over.

Added to git.infradead.org/users/dwmw2/btrfs-kernel-unstable.git

From: David Woodhouse <David.Woodhouse@intel.com>
Date: Wed, 6 Aug 2008 19:42:33 +0100
Subject: [PATCH] Implement our own copy of the nfsd readdir hack, for older kernels

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
---
 ctree.h  |    4 ++
 export.c |   94 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 inode.c  |    8 ++++-
 3 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/ctree.h b/ctree.h
index 3694f03..7200178 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1694,6 +1694,7 @@ void btrfs_destroy_inode(struct inode *inode);
 int btrfs_init_cachep(void);
 void btrfs_destroy_cachep(void);
 long btrfs_ioctl_trans_end(struct file *file);
+int btrfs_real_readdir(struct file *filp, void *dirent, filldir_t filldir);
 struct inode *btrfs_iget_locked(struct super_block *s, u64 objectid,
 			    struct btrfs_root *root);
 struct inode *btrfs_ilookup(struct super_block *s, u64 objectid,
@@ -1709,6 +1710,9 @@ int btrfs_update_inode(struct btrfs_trans_handle *trans,
 			      struct btrfs_root *root,
 			      struct inode *inode);

+/* export.c */
+int btrfs_nfshack_readdir(struct file *filp, void *dirent, filldir_t filldir);
+
 /* ioctl.c */
 long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);

diff --git a/export.c b/export.c
index 9070674..d152fbc 100644
--- a/export.c
+++ b/export.c
@@ -181,3 +181,97 @@ const struct export_operations btrfs_export_ops = {
 	.fh_to_parent = btrfs_fh_to_parent,
 	.get_parent = btrfs_get_parent,
 };
+
+/* Kernels without FS_LOOKUP_IN_READDIR still have the NFS deadlock where
+   nfsd will call the file system's ->lookup() method from within its
+   filldir callback, which in turn was called from the file system's
+   ->readdir() method. And will deadlock for many file systems. */
+#ifndef FS_LOOKUP_IN_READDIR
+
+struct nfshack_dirent {
+	u64 ino;
+	loff_t offset;
+	int namlen;
+	unsigned int d_type;
+	char name[];
+};
+
+struct nfshack_readdir {
+	char *dirent;
+	size_t used;
+};
+
+
+
+static int btrfs_nfshack_filldir(void *__buf, const char *name, int namlen,
+				 loff_t offset, u64 ino, unsigned int d_type)
+{
+	struct nfshack_readdir *buf = __buf;
+	struct nfshack_dirent *de = (void *)(buf->dirent + buf->used);
+	unsigned int reclen;
+
+	reclen = ALIGN(sizeof(struct nfshack_dirent) + namlen, sizeof(u64));
+	if (buf->used + reclen > PAGE_SIZE)
+		return -EINVAL;
+
+	de->namlen = namlen;
+	de->offset = offset;
+	de->ino = ino;
+	de->d_type = d_type;
+	memcpy(de->name, name, namlen);
+	buf->used += reclen;
+
+	return 0;
+}
+
+int btrfs_nfshack_readdir(struct file *file, void *dirent, filldir_t filldir)
+{
+	struct nfshack_readdir buf;
+	struct nfshack_dirent *de;
+	int err;
+	int size;
+	loff_t offset;
+
+	buf.dirent = (void *)__get_free_page(GFP_KERNEL);
+	if (!buf.dirent)
+		return -ENOMEM;
+
+	offset = file->f_pos;
+
+	while (1) {
+		unsigned int reclen;
+
+		buf.used = 0;
+
+		err = btrfs_real_readdir(file, &buf, btrfs_nfshack_filldir);
+		if (err)
+			break;
+
+		size = buf.used;
+
+		if (!size)
+			break;
+
+		de = (struct nfshack_dirent *)buf.dirent;
+		while (size > 0) {
+			offset = de->offset;
+
+			if (filldir(dirent, de->name, de->namlen, de->offset,
+				    de->ino, de->d_type))
+				goto done;
+			offset = file->f_pos;
+
+			reclen = ALIGN(sizeof(*de) + de->namlen,
+				       sizeof(u64));
+			size -= reclen;
+			de = (struct nfshack_dirent *)((char *)de + reclen);
+		}
+	}
+
+ done:
+	free_page((unsigned long)buf.dirent);
+	file->f_pos = offset;
+
+	return err;
+}
+#endif
diff --git a/inode.c b/inode.c
index 393b7aa..f8b3fde 100644
--- a/inode.c
+++ b/inode.c
@@ -1956,7 +1956,7 @@ static unsigned char btrfs_filetype_table[] = {
 	DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
 };

-static int btrfs_readdir(struct file *filp, void *dirent, filldir_t filldir)
+int btrfs_real_readdir(struct file *filp, void *dirent, filldir_t filldir)
 {
 	struct inode *inode = filp->f_dentry->d_inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -3661,7 +3661,11 @@ static struct inode_operations btrfs_dir_ro_inode_operations = {
 static struct file_operations btrfs_dir_file_operations = {
 	.llseek = generic_file_llseek,
 	.read = generic_read_dir,
-	.readdir = btrfs_readdir,
+#ifdef FS_LOOKUP_IN_READDIR /* NFSd readdir/lookup deadlock is fixed */
+	.readdir = btrfs_real_readdir,
+#else /* otherwise, we need to work around it ourselves */
+	.readdir = btrfs_nfshack_readdir,
+#endif
 	.unlocked_ioctl = btrfs_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl = btrfs_ioctl,
--
1.5.5.1

--
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation
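The workaround in export.c is a two-pass pattern. The real readdir first fills a private page-sized buffer through a callback that only copies entries. The buffered entries are then replayed to the caller's filldir after the directory walk has returned, so a callback that re-enters the file system no longer runs inside ->readdir(). The following is a userspace sketch of that pattern under stated assumptions, not the patch itself: `toy_dir`, `toy_real_readdir()`, `toy_buffered_readdir()`, `struct consumer` and `lookup_fill()` are all hypothetical names, and the `walk_locked` flag merely models whatever a real file system holds during its walk.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 4096 /* stands in for one page */
typedef int (*fill_fn)(void *ctx, const char *name, int namlen, long offset);

/* Hypothetical directory; walk_locked models a lock held during the walk. */
struct toy_dir {
	const char **names;
	int count;
	long pos;        /* resume position, like file->f_pos */
	int walk_locked; /* nonzero while toy_real_readdir() iterates */
};

/* The "real" readdir: invokes fill() for each entry with the lock held. */
int toy_real_readdir(struct toy_dir *dir, void *ctx, fill_fn fill)
{
	dir->walk_locked = 1;
	while (dir->pos < dir->count) {
		const char *name = dir->names[dir->pos];
		if (fill(ctx, name, (int)strlen(name), dir->pos))
			break; /* callback refused; resume from this entry */
		dir->pos++;
	}
	dir->walk_locked = 0;
	return 0;
}

/* One buffered record; records are packed back to back in the buffer. */
struct rec {
	long offset;
	int namlen;
	char name[];
};

struct collect { char *buf; size_t used; };

/* Record length, rounded up to 8 bytes so the next record stays aligned. */
static size_t rec_len(int namlen)
{
	return (sizeof(struct rec) + (size_t)namlen + 7) & ~(size_t)7;
}

/* Pass 1 callback: copy the entry into the buffer and nothing else. */
static int collect_fill(void *ctx, const char *name, int namlen, long offset)
{
	struct collect *c = ctx;
	size_t len = rec_len(namlen);
	struct rec *r = (struct rec *)(c->buf + c->used);
	if (c->used + len > BUF_SIZE)
		return -1; /* buffer full: end this pass */
	r->offset = offset;
	r->namlen = namlen;
	memcpy(r->name, name, (size_t)namlen);
	c->used += len;
	return 0;
}

/* Pass 2: replay buffered entries to the caller once the lock is dropped. */
int toy_buffered_readdir(struct toy_dir *dir, void *ctx, fill_fn fill)
{
	struct collect c = { malloc(BUF_SIZE), 0 };
	if (!c.buf)
		return -1;
	for (;;) {
		size_t done = 0;
		c.used = 0;
		toy_real_readdir(dir, &c, collect_fill);
		if (!c.used)
			break; /* directory exhausted */
		assert(!dir->walk_locked);
		while (done < c.used) {
			struct rec *r = (struct rec *)(c.buf + done);
			if (fill(ctx, r->name, r->namlen, r->offset)) {
				dir->pos = r->offset; /* retry this entry later */
				goto out;
			}
			done += rec_len(r->namlen);
		}
	}
out:
	free(c.buf);
	return 0;
}

/* Stand-in for nfsd's callback, which re-enters the file system. */
struct consumer { struct toy_dir *dir; int seen, reentered; };

int lookup_fill(void *ctx, const char *name, int namlen, long offset)
{
	struct consumer *c = ctx;
	if (c->dir->walk_locked)
		c->reentered++; /* a real ->lookup() could block here */
	c->seen++;
	return 0;
}
```

Passing `lookup_fill` straight to `toy_real_readdir()` counts every call as re-entrant, while going through `toy_buffered_readdir()` counts none. The cost is an extra copy of each entry and one scratch page per call, which is presumably why the patch compiles the hack in only when FS_LOOKUP_IN_READDIR is not defined.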
Chris Mason wrote:
> Josef's allocator fix is on the list because we currently fall over in
> some workloads at 100% cpu time when the FS is 60% full. The space
> indexing is complex and strange, it just needs to be redone.

I don't understand. Do you mean josef wants someone to fix multithreading or do you mean he is doing that as part of an allocator fix he is working on, or did you mean that his work is an exception and you really were not looking at doing any allocator changes?

I see this in your 0.16 list:

> Better in-memory free space indexing (Josef)

but did not tie it to the allocator. Is that it?

and I'll add an area I missed before in the Development_timeline:

 - Fallocate support (at least disk format level)

as part of the allocator bundle-of-too-much-work :)

jim
On Wed, 2008-08-06 at 16:36 -0400, jim owens wrote:
> Chris Mason wrote:
>
> > Josef's allocator fix is on the list because we currently fall over in
> > some workloads at 100% cpu time when the FS is 60% full. The space
> > indexing is complex and strange, it just needs to be redone.
>
> I don't understand. Do you mean josef wants someone to fix
> multithreading or do you mean he is doing that as part of
> an allocator fix he is working on, or did you mean that
> his work is an exception and you really were not looking at
> doing any allocator changes?
>

Josef is fixing the way the allocator indexes free space in ram. This is different from working on the threading, but I'm holding off on the threading until after he is done.

> I see this in your 0.16 list:
>
> > Better in-memory free space indexing (Josef)
>
> but did not tie it to the allocator. Is that it?
>
> and I'll add an area I missed before in the Development_timeline:
> - Fallocate support (at least disk format level)
> as part of the allocator bundle-of-too-much-work :)

;) fallocate would be important as well.

-chris
>> Chris Mason wrote:
>>
>>> Josef's allocator fix is on the list because we currently fall over in
>>> some workloads at 100% cpu time when the FS is 60% full.

Chris, does it oops, or just get very slow? Does 0.15 do the same?

-Joe
On Wed, 2008-08-06 at 14:49 -0600, Joe Peterson wrote:
> >> Chris Mason wrote:
> >>
> >>> Josef's allocator fix is on the list because we currently fall over in
> >>> some workloads at 100% cpu time when the FS is 60% full.
>
> Chris, does it oops, or just get very slow? Does 0.15 do the same?
>

Very very slow and v0.15 has the same feature. This doesn't happen every time you hit 60% full, it varies with the workload.

-chris