Mark Fasheh
2013-Aug-07 19:57 UTC
[PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
stat(2) on btrfs returns a custom device, but proc uses s_dev from the super block. This causes problems (abi breakage) because software (and users) are not expecting the kernel to return different devices from these calls. This patch fixes the problem by adding a new superblock flag, MS_STAT_FOR_DEV. When the proc code sees this flag, it will call the file systems ->getattr() method to extract a device as opposed to getting it directly from s_dev. This problem has been discussed several times before but still with no resolution: http://comments.gmane.org/gmane.comp.file-systems.btrfs/9682 http://thr3ads.net/btrfs-devel/2011/05/2346176-RFC-PATCH-0-2-btrfs-vfs-Return-same-device-in-stat-2-and-proc-pid-maps http://thread.gmane.org/gmane.comp.file-systems.btrfs/26786 Andrew Vagin''s e-mail (third one in the list) links a couple bugzilla entries about this problem too. Signed-off-by: Mark Fasheh <mfasheh@suse.de> --- fs/btrfs/super.c | 1 + fs/proc/generic.c | 15 +++++++++++++++ fs/proc/internal.h | 1 + fs/proc/nommu.c | 2 +- fs/proc/task_mmu.c | 2 +- fs/proc/task_nommu.c | 2 +- include/uapi/linux/fs.h | 1 + 7 files changed, 21 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index f0857e0..da7c5f5 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -822,6 +822,7 @@ static int btrfs_fill_super(struct super_block *sb, sb->s_flags |= MS_POSIXACL; #endif sb->s_flags |= MS_I_VERSION; + sb->s_flags |= MS_STAT_FOR_DEV; err = open_ctree(sb, fs_devices, (char *)data); if (err) { printk("btrfs: open_ctree failed\n"); diff --git a/fs/proc/generic.c b/fs/proc/generic.c index a2596af..ed9df4a 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -24,6 +24,8 @@ #include <linux/spinlock.h> #include <linux/completion.h> #include <asm/uaccess.h> +#include <linux/fs.h> +#include <linux/dcache.h> #include "internal.h" @@ -637,3 +639,16 @@ void *PDE_DATA(const struct inode *inode) return __PDE_DATA(inode); } EXPORT_SYMBOL(PDE_DATA); + +dev_t proc_get_map_dev(struct dentry *dentry) +{ + struct inode *inode = dentry->d_inode; + struct kstat kstat; + + if (inode->i_sb->s_flags & MS_STAT_FOR_DEV && + inode->i_op->getattr && + inode->i_op->getattr(NULL, dentry, &kstat) == 0) + return kstat.dev; + + return inode->i_sb->s_dev; +} diff --git a/fs/proc/internal.h b/fs/proc/internal.h index d600fb0..afc9424 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -192,6 +192,7 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde) return pde; } extern void pde_put(struct proc_dir_entry *); +dev_t proc_get_map_dev(struct dentry *dentry); /* * inode.c diff --git a/fs/proc/nommu.c b/fs/proc/nommu.c index ccfd99b..fe22ff6 100644 --- a/fs/proc/nommu.c +++ b/fs/proc/nommu.c @@ -46,7 +46,7 @@ static int nommu_region_show(struct seq_file *m, struct vm_region *region) if (file) { struct inode *inode = file_inode(region->vm_file); - dev = inode->i_sb->s_dev; + dev = proc_get_map_dev(vma->vm_file->f_path->dentry); ino = inode->i_ino; } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3e636d8..4d9d1d9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -272,7 +272,7 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid) if (file) { struct inode *inode = file_inode(vma->vm_file); - dev = inode->i_sb->s_dev; + dev = proc_get_map_dev(vma->vm_file->f_path.dentry); ino = inode->i_ino; pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT; } diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c index 56123a6..0995249 100644 --- a/fs/proc/task_nommu.c +++ b/fs/proc/task_nommu.c @@ -150,7 +150,7 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma, if (file) { struct inode *inode = file_inode(vma->vm_file); - dev = inode->i_sb->s_dev; + dev = proc_get_map_dev(vma->vm_file->f_path.dentry); ino = inode->i_ino; pgoff = (loff_t)vma->vm_pgoff << PAGE_SHIFT; } diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index a4ed56c..918054d 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -88,6 +88,7 @@ struct inodes_stat_t { #define MS_STRICTATIME (1<<24) /* Always perform atime updates */ /* These sb flags are internal to the kernel */ +#define MS_STAT_FOR_DEV (1<<27) #define MS_NOSEC (1<<28) #define MS_BORN (1<<29) #define MS_ACTIVE (1<<30) -- 1.8.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2013-Aug-07 20:18 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Wed, Aug 07, 2013 at 12:57:18PM -0700, Mark Fasheh wrote:> stat(2) on btrfs returns a custom device, but proc uses s_dev from the super > block. This causes problems (abi breakage) because software (and users) are > not expecting the kernel to return different devices from these calls.So fix stat on btrfs to return the proper device instead. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2013-Aug-07 20:51 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Wed, Aug 07, 2013 at 01:18:26PM -0700, Christoph Hellwig wrote:> On Wed, Aug 07, 2013 at 12:57:18PM -0700, Mark Fasheh wrote: > > stat(2) on btrfs returns a custom device, but proc uses s_dev from the super > > block. This causes problems (abi breakage) because software (and users) are > > not expecting the kernel to return different devices from these calls. > > So fix stat on btrfs to return the proper device instead. >Not possible, this will break other things as subvolumes have their own inode space, it will confuse applications that get multiples of an inode number for different devices with the same st_dev. Each subvolume has it''s own anonymous dev to segregate things. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2013-Aug-08 12:13 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Wed, Aug 07, 2013 at 04:51:46PM -0400, Josef Bacik wrote:> Not possible, this will break other things as subvolumes have their own inode > space, it will confuse applications that get multiples of an inode number for > different devices with the same st_dev. Each subvolume has it''s own anonymous > dev to segregate things. Thanks,Yes, it''s the same old issue of btrfs volumes misbehaving, and the solution is still the same as 5 years ago: make sure each subvolume has it''s own sb, vfsmount and gets automounted, similar to what nfs4 does for this case. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2013-Aug-08 13:02 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Thu, Aug 08, 2013 at 05:13:49AM -0700, Christoph Hellwig wrote:> On Wed, Aug 07, 2013 at 04:51:46PM -0400, Josef Bacik wrote: > > Not possible, this will break other things as subvolumes have their own inode > > space, it will confuse applications that get multiples of an inode number for > > different devices with the same st_dev. Each subvolume has it''s own anonymous > > dev to segregate things. Thanks, > > Yes, it''s the same old issue of btrfs volumes misbehaving, and the > solution is still the same as 5 years ago: make sure each subvolume > has it''s own sb, vfsmount and gets automounted, similar to what nfs4 > does for this case.This won''t work, try having 10000 subvolumes with dirty inodes and do sync then go skiing, you''ll have time :). Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2013-Aug-08 13:48 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote:> This won''t work, try having 10000 subvolumes with dirty inodes and do sync then > go skiing, you''ll have time :). Thanks,Why would the dirty inodes make any difference? If you share the bdi between the subvolumes the sync workflow should be exactly the same still. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2013-Aug-08 15:31 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote:> On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote: > > This won''t work, try having 10000 subvolumes with dirty inodes and do sync then > > go skiing, you''ll have time :). Thanks, > > Why would the dirty inodes make any difference? If you share the bdi > between the subvolumes the sync workflow should be exactly the same > still. >The inodes are in the per-sb list, so we may start all the writing but we don''t wait all at once, so in the case of btrfs we will write all the dirty inodes, and then wait on the ones in whatever sb we have, and then sync, which will commit the transaction. Then we go to the next sb and wait on those inodes which will dirty metadata which means we''ll have another transaction and we''ll commit the transaction and so on and so forth. This means we write the superblock 10000 times for one sync when we could have just done it once. Now we could probably get around this by having ->sync_fs wait itself for all of the inodes to complete and then commit the transaction once, but we''re still going to get called the 99999 times for the same damned file system that has already had everything done. And this is just one example, IIRC there were a few other issues that popped up because we assume sb == completely separate file system, freeze I think is one of those things. I''m sure there were other ones but the last time I tried to do this was 2010/2011 and many brain cells have died since then. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2013-Aug-08 15:44 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote:> On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote: > > This won''t work, try having 10000 subvolumes with dirty inodes and do sync then > > go skiing, you''ll have time :). Thanks, > > Why would the dirty inodes make any difference? If you share the bdi > between the subvolumes the sync workflow should be exactly the same > still. >If we could dis-entangle vfsmounts from sb''s and have it so you could have multiple vfsmounts with just one sb that would solve at least the in-kernel confusion, but I think we still have the userspace confusion. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Hellwig
2013-Aug-12 11:47 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Thu, Aug 08, 2013 at 11:44:54AM -0400, Josef Bacik wrote:> On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote: > > On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote: > > > This won''t work, try having 10000 subvolumes with dirty inodes and do sync then > > > go skiing, you''ll have time :). Thanks, > > > > Why would the dirty inodes make any difference? If you share the bdi > > between the subvolumes the sync workflow should be exactly the same > > still. > > > > If we could dis-entangle vfsmounts from sb''s and have it so you could have > multiple vfsmounts with just one sb that would solve at least the in-kernel > confusion, but I think we still have the userspace confusion. Thanks,I think it would mostly solve userspace confusion, as userspace only sees mounts and the device names. But please fix this up properly instead of propagating the effects of the nasty btrfs hack that should never have been merged in that form further up the stack. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Mark Fasheh
2013-Sep-10 15:36 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Mon, Aug 12, 2013 at 04:47:52AM -0700, Christoph Hellwig wrote:> On Thu, Aug 08, 2013 at 11:44:54AM -0400, Josef Bacik wrote: > > On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote: > > > On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote: > > > > This won''t work, try having 10000 subvolumes with dirty inodes and do sync then > > > > go skiing, you''ll have time :). Thanks, > > > > > > Why would the dirty inodes make any difference? If you share the bdi > > > between the subvolumes the sync workflow should be exactly the same > > > still. > > > > > > > If we could dis-entangle vfsmounts from sb''s and have it so you could have > > multiple vfsmounts with just one sb that would solve at least the in-kernel > > confusion, but I think we still have the userspace confusion. Thanks, > > I think it would mostly solve userspace confusion, as userspace only > sees mounts and the device names. > > But please fix this up properly instead of propagating the effects of > the nasty btrfs hack that should never have been merged in that form > further up the stack.Can one of you explain how this solves the problem that userspace is getting different devices for the same inode? Seriously, I''ve been looking into it and I''m a bit lost. I followed the converstaion until here but I don''t see how any of the proposed changes actually *fix* anything? Also, what is the relationship between vfsmounts and sb today? Wouldn''t a bind mount produce the situation of more than 1 vfsmount per sb that is described above? Sincerely, someone who would like to fix this ABI breakage that has been going on for years. --Mark -- Mark Fasheh -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2013-Sep-10 15:56 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On Tue, Sep 10, 2013 at 08:36:55AM -0700, Mark Fasheh wrote:> On Mon, Aug 12, 2013 at 04:47:52AM -0700, Christoph Hellwig wrote: > > On Thu, Aug 08, 2013 at 11:44:54AM -0400, Josef Bacik wrote: > > > On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote: > > > > On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote: > > > > > This won''t work, try having 10000 subvolumes with dirty inodes and do sync then > > > > > go skiing, you''ll have time :). Thanks, > > > > > > > > Why would the dirty inodes make any difference? If you share the bdi > > > > between the subvolumes the sync workflow should be exactly the same > > > > still. > > > > > > > > > > If we could dis-entangle vfsmounts from sb''s and have it so you could have > > > multiple vfsmounts with just one sb that would solve at least the in-kernel > > > confusion, but I think we still have the userspace confusion. Thanks, > > > > I think it would mostly solve userspace confusion, as userspace only > > sees mounts and the device names. > > > > But please fix this up properly instead of propagating the effects of > > the nasty btrfs hack that should never have been merged in that form > > further up the stack. > > Can one of you explain how this solves the problem that userspace is getting > different devices for the same inode? > > Seriously, I''ve been looking into it and I''m a bit lost. I followed the > converstaion until here but I don''t see how any of the proposed changes > actually *fix* anything? Also, what is the relationship between vfsmounts > and sb today? Wouldn''t a bind mount produce the situation of more than 1 > vfsmount per sb that is described above? > > Sincerely, someone who would like to fix this ABI breakage that has been > going on for years.And let me restate the problem so we''re all on the same page. Btrfs has subvolumes, completely separate trees within the file system. These trees get their own object numbering, which in turn is how we do our inode numbers. So if you have multiple subvolumes, they will likely have the same inode numbers within the same file system. This screws up things like rsync which say "hey look, these two inodes are the same, lets skip them." So we have an anonymous dev so we can make them look different. Now if we were to make each subvol its own vfsmount (essentially a bind mount) and remove the anonymous device that wouldn''t fix the problem _at all_. The file system would appear to be the same to rsync and it wouldn''t back stuff up. So we still need some way of telling userspace that this object is different. I''m not convinced vfsmounts is the way to do this, it doesn''t do anything other than add a whole lot of complexity to our mounting/subvolume mechanism that is already relatively complex. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jeff Mahoney
2013-Sep-10 21:21 UTC
Re: [PATCH][RESEND] vfs: allow /proc/PID/maps to get device from stat
On 9/10/13 11:56 AM, Josef Bacik wrote:> On Tue, Sep 10, 2013 at 08:36:55AM -0700, Mark Fasheh wrote: >> On Mon, Aug 12, 2013 at 04:47:52AM -0700, Christoph Hellwig wrote: >>> On Thu, Aug 08, 2013 at 11:44:54AM -0400, Josef Bacik wrote: >>>> On Thu, Aug 08, 2013 at 06:48:05AM -0700, Christoph Hellwig wrote: >>>>> On Thu, Aug 08, 2013 at 09:02:07AM -0400, Josef Bacik wrote: >>>>>> This won''t work, try having 10000 subvolumes with dirty inodes and do sync then >>>>>> go skiing, you''ll have time :). Thanks, >>>>> >>>>> Why would the dirty inodes make any difference? If you share the bdi >>>>> between the subvolumes the sync workflow should be exactly the same >>>>> still. >>>>> >>>> >>>> If we could dis-entangle vfsmounts from sb''s and have it so you could have >>>> multiple vfsmounts with just one sb that would solve at least the in-kernel >>>> confusion, but I think we still have the userspace confusion. Thanks, >>> >>> I think it would mostly solve userspace confusion, as userspace only >>> sees mounts and the device names. >>> >>> But please fix this up properly instead of propagating the effects of >>> the nasty btrfs hack that should never have been merged in that form >>> further up the stack. >> >> Can one of you explain how this solves the problem that userspace is getting >> different devices for the same inode? >> >> Seriously, I''ve been looking into it and I''m a bit lost. I followed the >> converstaion until here but I don''t see how any of the proposed changes >> actually *fix* anything? Also, what is the relationship between vfsmounts >> and sb today? Wouldn''t a bind mount produce the situation of more than 1 >> vfsmount per sb that is described above? >> >> Sincerely, someone who would like to fix this ABI breakage that has been >> going on for years. > > And let me restate the problem so we''re all on the same page. > > Btrfs has subvolumes, completely separate trees within the file system. These > trees get their own object numbering, which in turn is how we do our inode > numbers. So if you have multiple subvolumes, they will likely have the same > inode numbers within the same file system. This screws up things like rsync > which say "hey look, these two inodes are the same, lets skip them." So we have > an anonymous dev so we can make them look different. > > Now if we were to make each subvol its own vfsmount (essentially a bind mount) > and remove the anonymous device that wouldn''t fix the problem _at all_. The > file system would appear to be the same to rsync and it wouldn''t back stuff up. > So we still need some way of telling userspace that this object is different. > > I''m not convinced vfsmounts is the way to do this, it doesn''t do anything other > than add a whole lot of complexity to our mounting/subvolume mechanism that is > already relatively complex. Thanks,Agreed. It''s hugely wasteful as well. We can have thousands of subvolumes even on modest systems like workstations when automated snapshots are involved. Using a vfsmount for each subvolume would make /proc/mounts pretty useless. Having a separate superblock for each one, at 1k a pop, would waste a ton of memory considering that they''ll be identical except for the dev_t. The only way vfsmounts would work is if we added a dev_t there, which would usually be set to ->mnt_sb->s_dev except for the btrfs case. That still doesn''t solve the polluted /proc/mounts, though. -Jeff -- Jeff Mahoney SUSE Labs