Hello everyone,

I've updated the btrfs git trees to 2.6.28-rc5 and tested against linux-next.

We've knocked a bunch of things off the todo list since I last posted, including compression (mount -o compress) and the ability to create subvols and snapshots anywhere in the FS.

There are a small number of disk format changes pending, which I put off in favor of making compression stable/fast. We'll hammer these out shortly.

The btrfs kernel code is here:
http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=summary

And the utilities are here:
http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git;a=summary

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2008-11-20 at 07:18 -0500, Chris Mason wrote:
> Hello everyone,
>
> I've updated the btrfs git trees to 2.6.28-rc5 and tested against
> linux-next.
>
> We've knocked a bunch of things off the todo list since I last posted,
> including compression (mount -o compress) and the ability to create
> subvols and snapshots anywhere in the FS.
>
> There are a small number of disk format changes pending, which I put off
> in favor of making compression stable/fast. We'll hammer these out
> shortly.

Just an update: while I still have a long todo list and plenty of things to fix in the code, these source trees have been updated with a disk format I hope to maintain compatibility with from here on. There are still format changes planned, but they should now go in through the compat mechanisms in the sources.

The btrfs trees are still at 2.6.28-rc5, but I just tested against linux-next without problems.

> The btrfs kernel code is here:
> http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-unstable.git;a=summary
>
> And the utilities are here:
> http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git;a=summary

-chris
Hi Andrew,

On Wed, 10 Dec 2008 21:34:56 -0500 Chris Mason <chris.mason@oracle.com> wrote:
>
> Just an update, while I still have a long todo list and plenty of things
> to fix in the code, these src trees have been updated with a disk format
> I hope to maintain compatibility with from here on. There are still
> format changes planned, but should go in through the compat mechanisms
> in the sources now.
>
> The btrfs trees are still at 2.6.28-rc5, but I just tested against
> linux-next without problems.

Do you think this is ready to be added to the end of linux-next yet? Or is this more -mm material?
--
Cheers,
Stephen Rothwell
sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
On Thu, 11 Dec 2008 14:14:36 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> Hi Andrew,
>
> On Wed, 10 Dec 2008 21:34:56 -0500 Chris Mason <chris.mason@oracle.com> wrote:
> >
> > Just an update, while I still have a long todo list and plenty of things
> > to fix in the code, these src trees have been updated with a disk format
> > I hope to maintain compatibility with from here on. There are still
> > format changes planned, but should go in through the compat mechanisms
> > in the sources now.
> >
> > The btrfs trees are still at 2.6.28-rc5, but I just tested against
> > linux-next without problems.
>
> Do you think this is ready to be added to the end of linux-next yet? Or
> is this more -mm material?

I'd prefer that it go into linux-next in the usual fashion. But the first step is review.
On Wed, 10 Dec 2008 20:06:04 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
>
> I'd prefer that it go into linux-next in the usual fashion. But the
> first step is review.

OK, I wasn't sure where it was up to (not being a file system person).
--
Cheers,
Stephen Rothwell
sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
On Wed, 2008-12-10 at 20:06 -0800, Andrew Morton wrote:
> On Thu, 11 Dec 2008 14:14:36 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> > Hi Andrew,
> >
> > On Wed, 10 Dec 2008 21:34:56 -0500 Chris Mason <chris.mason@oracle.com> wrote:
> > >
> > > Just an update, while I still have a long todo list and plenty of things
> > > to fix in the code, these src trees have been updated with a disk format
> > > I hope to maintain compatibility with from here on. There are still
> > > format changes planned, but should go in through the compat mechanisms
> > > in the sources now.
> > >
> > > The btrfs trees are still at 2.6.28-rc5, but I just tested against
> > > linux-next without problems.
> >
> > Do you think this is ready to be added to the end of linux-next yet? Or
> > is this more -mm material?
>
> I'd prefer that it go into linux-next in the usual fashion. But the
> first step is review.

I'm updating the various docs on the btrfs wiki. From a kernel impact point of view, btrfs only changes fs/Kconfig and fs/Makefile.

Some of the most visible problems in the code are:

No support for fs blocksizes other than the page size. This includes data blocks and btree leaves/nodes. In both cases, the infrastructure to do it is about 1/2 there, but some ugly problems remain.

btrfs_file_write should just be removed in favor of write_begin/end. Right now, the main thing btrfs_file_write does is the dance to set up delalloc, so this should be fairly easy.

The multi-device code uses a very simple brute force scan from userland to populate the list of devices that belong to a given FS. Kay Sievers has some ideas on hotplug magic to make this less dumb. (The scan isn't required for single device filesystems.)

extent_io.c should be split up into two files, one for the extent_buffer code and one for the state bit code. struct-funcs.c needs a big flashing neon sign about what it does and why.
The extent_buffer interface needs much clearer documentation around why it is there and how it works.

There are too many worker threads. At least some of them should be shared between filesystems instead of started for each FS. Each pool of worker threads represents some operation that would end up deadlocking if it shared threads with another pool.

There are too many memory allocations on the IO submission path. It needs mempools and other steps to limit the amount of ram used to write a single block.

The IO submission path is generally twisty, with helper threads and lookup functions. There is quite a bit going on here in terms of asynchronous checksumming and compression, and it needs better documentation.

ENOSPC == BUG()

The extent allocation tree is what records which extents are allocated on disk, and tree blocks for the extent allocation tree are themselves allocated from the extent allocation tree. This recursion is controlled by deferring some operations for later processing, and the resulting complexity needs better documentation.

-chris
On Dec 11, 2008 09:43 -0500, Chris Mason wrote:
> The multi-device code uses a very simple brute force scan from userland
> to populate the list of devices that belong to a given FS. Kay Sievers
> has some ideas on hotplug magic to make this less dumb. (The scan isn't
> required for single device filesystems.)

This should use libblkid to do the scanning of the devices, and it can cache the results for efficiency. Best would be to have the same LABEL+UUID for all devices in the same filesystem; then, once any of these devices is found, the mount.btrfs code can query the rest of the devices to find the remaining parts of the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Mon, Dec 15, 2008 at 22:03, Andreas Dilger <adilger@sun.com> wrote:
> On Dec 11, 2008 09:43 -0500, Chris Mason wrote:
> > The multi-device code uses a very simple brute force scan from userland
> > to populate the list of devices that belong to a given FS. Kay Sievers
> > has some ideas on hotplug magic to make this less dumb. (The scan isn't
> > required for single device filesystems.)
>
> This should use libblkid to do the scanning of the devices, and it can
> cache the results for efficiency. Best would be to have the same LABEL+UUID
> for all devices in the same filesystem, and then once any of these devices
> are found the mount.btrfs code can query the rest of the devices to find
> the remaining parts of the filesystem.

Which is another way to do something you should not do that way in the first place, just with a library instead of your own code.

Brute-force scanning /dev with a single thread will not work reliably in many setups we need to support. Sure, it's good to have it for a rescue system, and it will work fine on your workstation, but definitely not for boxes with many devices where you don't know how they behave.

Just do:

$ modprobe scsi_debug max_luns=8 num_parts=2
$ echo 1 > /sys/module/scsi_debug/parameters/every_nth
$ echo 4 > /sys/module/scsi_debug/parameters/opts
$ ls -l /sys/class/block/ | wc -l
45

and then call any binary doing /dev scanning, and wait (in this case) ~2 hours for it to return.

Also, the blkid cache file uses major/minor numbers or kernel device names, which will also not help in many setups we have to support today.

The original btrfs topic, leading to this, is here:
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg01048.html

Thanks, Kay
On Mon, 2008-12-15 at 23:55 +0100, Kay Sievers wrote:
> On Mon, Dec 15, 2008 at 22:03, Andreas Dilger <adilger@sun.com> wrote:
> > This should use libblkid to do the scanning of the devices, and it can
> > cache the results for efficiency. Best would be to have the same LABEL+UUID
> > for all devices in the same filesystem, and then once any of these devices
> > are found the mount.btrfs code can query the rest of the devices to find
> > the remaining parts of the filesystem.
>
> Which is another way to do something you should not do that way in the
> first place, just with a library instead of your own code.

Well, it's the same library everyone else is using to do things they shouldn't be doing ;)

I'm very interested in your new scheme for device discovery. It seems like the best and most reasonable way to go forward.

But tossing things into libblkid also makes sense. It isn't much work, and it allows btrfs to fit in with the established norms while we experiment with new and better ways to hook it all together.

-chris
On Tue, Dec 16, 2008 at 02:37, Chris Mason <chris.mason@oracle.com> wrote:
> On Mon, 2008-12-15 at 23:55 +0100, Kay Sievers wrote:
> > Which is another way to do something you should not do that way in the
> > first place, just with a library instead of your own code.
>
> Well, it's the same library everyone else is using to do things they
> shouldn't be doing ;)

Util-linux-ng can be configured to use libvolume_id and udev data, and it's now used in SUSE and Ubuntu for exactly the reason mentioned. :)

Kay
Christoph Hellwig
2008-Dec-17 13:23 UTC
Notes on support for multiple devices for a single filesystem
FYI: here's a little writeup I did this summer on support for filesystems spanning multiple block devices:

--

=== Notes on support for multiple devices for a single filesystem ===

== Intro ==

Btrfs (and an experimental XFS version) can support multiple underlying block devices for a single filesystem instance in a generalized and flexible way. Unlike the support for external log devices in ext3, jfs, reiserfs, and XFS, and the special real-time device in XFS, all data and metadata may be spread over a potentially large number of block devices, not just one (or two).

== Requirements ==

We want a scheme to support these complex filesystem topologies in a way that is

 a) easy to set up and non-fragile for the users
 b) scalable to a large number of disks in the system
 c) recoverable without requiring user space running first
 d) generic enough to work for multiple filesystems or other consumers

Requirement a) means that a multiple-device filesystem should be mountable by a simple fstab entry (UUID/LABEL or some other cookie) which continues to work when the filesystem topology changes.

Requirement b) implies we must not do a scan over all available block devices in large systems, but use an event-based callout on detection of new block devices.

Requirement c) means there must be some way to add devices to a filesystem by kernel command line, even if this is not the default way, and it might require additional knowledge from the user / system administrator.

Requirement d) means that we should not implement this mechanism inside a single filesystem.

== Prior art ==

* External log and realtime volume

The most common way to specify the external log device and the XFS real-time device is to have a mount option that contains the path to the block special device for it. This variant means a mount option is always required, and requires that the device name doesn't change, which is achievable with udev-generated unique device names (/dev/disk/by-{label,uuid}).
An alternative way, supported optionally by ext3 and reiserfs and exclusively by jfs, is to open the journal device by the device number (dev_t) of the block special device. While this doesn't require an additional mount option when the device number is stored in the filesystem superblock, it relies on the device number being stable, which is getting increasingly unlikely in complex storage topologies.

* RAID (MD) and LVM

Software RAID and volume managers, although not strictly filesystems, have a very similar problem finding their devices. The traditional solution used for early versions of the Linux MD driver and LVM version 1 was to hook into the partition scanning code and add devices with the right partition type to a kernel-internal list of potential RAID / LVM devices. This approach has the advantage of being simple to implement, fast, reliable, and not requiring additional user space programs in the boot process. The downside is that it only works with specific partition table formats that allow specifying a partition type, and doesn't work with unpartitioned disks at all.

Recent MD setups and LVM2 thus move the scanning to user space, typically using a command iterating over all block device nodes and performing the format-specific scanning. While this is more flexible than the in-kernel scanning, it scales very badly to a large number of block devices, and requires additional user space commands to run early in the boot process. A variant of this scheme runs a scanning callout from udev once disk devices are detected, which avoids the scanning overhead.

== High-level design considerations ==

Due to requirement b) we need a layer that finds devices for a single fstab entry.
We can either do this in user space, or in kernel space. As we've traditionally always done UUID/LABEL to device mapping in userspace, and we already have libvolume_id and libblkid dealing with the specialized case of UUID/LABEL to single device mapping, I would recommend keeping this in user space and trying to reuse libvolume_id / libblkid.

There are two options for performing the assembly of the device list for a filesystem:

1) Whenever libvolume_id / libblkid find a device detected as a multi-device capable filesystem, it gets added to a list of all devices of this particular filesystem. At mount time, mount(8) or a mount.fstype helper calls out to the libraries to get the list of devices belonging to this filesystem and translates them to device names, which can be passed to the kernel on the mount command line.

   Disadvantage: requires a mount.fstype helper or fs-specific knowledge in mount(8).
   Disadvantage: requires libvolume_id / libblkid to keep state.

2) Whenever libvolume_id / libblkid find a device detected as a multi-device capable filesystem, they call into the kernel through an ioctl / sysfs / etc. to add it to a list in kernel space. The kernel code then selects the right filesystem instance, and either adds the device to an already running instance, or prepares a list for a future mount.

   Advantage: keeps state in the kernel instead of in libraries.
   Disadvantage: requires an additional user interface to mount.

Requirement c) will be dealt with by an option that allows adding named block devices on the mount command line, which is already required for design 1) but would have to be implemented additionally for design 2).
Kay Sievers
2008-Dec-17 14:50 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, Dec 17, 2008 at 14:23, Christoph Hellwig <hch@infradead.org> wrote:
> === Notes on support for multiple devices for a single filesystem ===
>
> == Intro ==
>
> Btrfs (and an experimental XFS version) can support multiple underlying block
> devices for a single filesystem instance in a generalized and flexible way.
>
> Unlike the support for external log devices in ext3, jfs, reiserfs, XFS, and
> the special real-time device in XFS, all data and metadata may be spread over a
> potentially large number of block devices, and not just one (or two).
>
> == Requirements ==
>
> We want a scheme to support these complex filesystem topologies in a way
> that is
>
> a) easy to set up and non-fragile for the users
> b) scalable to a large number of disks in the system
> c) recoverable without requiring user space running first
> d) generic enough to work for multiple filesystems or other consumers
>
> Requirement a) means that a multiple-device filesystem should be mountable
> by a simple fstab entry (UUID/LABEL or some other cookie) which continues
> to work when the filesystem topology changes.
>
> Requirement b) implies we must not do a scan over all available block devices
> in large systems, but use an event-based callout on detection of new block
> devices.
>
> Requirement c) means there must be some way to add devices to a filesystem
> by kernel command line, even if this is not the default way, and might require
> additional knowledge from the user / system administrator.
>
> Requirement d) means that we should not implement this mechanism inside a
> single filesystem.
>
> == Prior art ==
>
> * External log and realtime volume
>
> The most common way to specify the external log device and the XFS real-time
> device is to have a mount option that contains the path to the block special
> device for it.
> This variant means a mount option is always required, and
> requires that the device name doesn't change, which is achievable with
> udev-generated unique device names (/dev/disk/by-{label,uuid}).
>
> An alternative way, supported optionally by ext3 and reiserfs and
> exclusively by jfs, is to open the journal device by the device
> number (dev_t) of the block special device. While this doesn't require
> an additional mount option when the device number is stored in the filesystem
> superblock, it relies on the device number being stable, which is getting
> increasingly unlikely in complex storage topologies.
>
> * RAID (MD) and LVM
>
> Software RAID and volume managers, although not strictly filesystems,
> have a very similar problem finding their devices. The traditional
> solution used for early versions of the Linux MD driver and LVM version 1
> was to hook into the partition scanning code and add devices with the
> right partition type to a kernel-internal list of potential RAID / LVM
> devices. This approach has the advantage of being simple to implement,
> fast, reliable and not requiring additional user space programs in the boot
> process. The downside is that it only works with specific partition table
> formats that allow specifying a partition type, and doesn't work with
> unpartitioned disks at all. Recent MD setups and LVM2 thus move the scanning
> to user space, typically using a command iterating over all block device
> nodes and performing the format-specific scanning. While this is more flexible
> than the in-kernel scanning, it scales very badly to a large number of
> block devices, and requires additional user space commands to run early
> in the boot process. A variant of this scheme runs a scanning callout
> from udev once disk devices are detected, which avoids the scanning overhead.
>
> == High-level design considerations ==
>
> Due to requirement b) we need a layer that finds devices for a single
> fstab entry.
> We can either do this in user space, or in kernel space. As we've
> traditionally always done UUID/LABEL to device mapping in userspace, and we
> already have libvolume_id and libblkid dealing with the specialized case
> of UUID/LABEL to single device mapping, I would recommend keeping
> this in user space and trying to reuse libvolume_id / libblkid.
>
> There are two options for performing the assembly of the device list for
> a filesystem:
>
> 1) whenever libvolume_id / libblkid find a device detected as a multi-device
>    capable filesystem it gets added to a list of all devices of this
>    particular filesystem.
>    At mount time, mount(8) or a mount.fstype helper calls out to the
>    libraries to get the list of devices belonging to this filesystem
>    and translates them to device names, which can be passed to
>    the kernel on the mount command line.
>
>    Disadvantage: requires a mount.fstype helper or fs-specific knowledge
>    in mount(8).
>    Disadvantage: requires libvolume_id / libblkid to keep state.
>
> 2) whenever libvolume_id / libblkid find a device detected as a multi-device
>    capable filesystem they call into the kernel through an ioctl / sysfs /
>    etc. to add it to a list in kernel space. The kernel code then selects
>    the right filesystem instance, and either adds the device to an already
>    running instance, or prepares a list for a future mount.
>
>    Advantage: keeps state in the kernel instead of in libraries.
>    Disadvantage: requires an additional user interface to mount.
>
> Requirement c) will be dealt with by an option that allows adding named
> block devices on the mount command line, which is already required for
> design 1) but would have to be implemented additionally for design 2).

Sounds all sensible. Btrfs already stores the (possibly incomplete) device tree state in the kernel, which should make things pretty easy for userspace, compared to other already existing subsystems.
We could have udev maintain a btrfs volume tree:

/dev/btrfs/
|-- 0cdedd75-2d03-41e6-a1eb-156c0920a021
|   |-- 897fac06-569c-4f45-a0b9-a1f91a9564d4 -> ../../sda10
|   `-- aac20975-b642-4650-b65b-b92ce22616f2 -> ../../sda9
`-- a1ec970a-2463-414e-864c-2eb8ac4e1cf2
    |-- 4d1f1fff-4c6b-4b87-8486-36f58abc0610 -> ../../sdb2
    `-- e7fe3065-c39f-4295-a099-a89e839ae350 -> ../../sdb1

At the same time, by-uuid/ is created:

/dev/disk/by-uuid/
|-- 0cdedd75-2d03-41e6-a1eb-156c0920a021 -> ../../sda10
|-- a1ec970a-2463-414e-864c-2eb8ac4e1cf2 -> ../../sdb2
...

And possibly by-label/:

/dev/disk/by-label/
|-- butter-butter -> ../../sda10
...

The by-uuid/ and by-label/ links always get overwritten by the last device claiming that name (unless udev rules specify a "link priority" for specific devices, which might not make sense for btrfs).

With the udev rules that maintain the /dev/btrfs/ tree, we can just plug in "btrfsctl -A" calls to make the device known to the kernel. If possible, "btrfsctl -A" could return the updated state of the in-kernel device tree. If it is complete, we can send out an event to possibly trigger the mounting. This model should work just fine in initramfs too.

For rescue and recovery cases, it will still be nice to be able to trigger "scan all devices" code in btrfsctl (own code or libblkid), but it should be avoided in any normal operation mode.

Thanks, Kay
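The rules Kay describes could look roughly like the sketch below. This is hypothetical: it assumes the probing callout (vol_id or blkid) exports ID_FS_TYPE and ID_FS_UUID, and a per-device ID_FS_UUID_SUB key that a btrfs-aware probe would have to provide; the btrfsctl path and option are taken from the usage above.

```
# Hypothetical udev rules for a /dev/btrfs/<fs-uuid>/<device-uuid> tree.
# Assumes the probe callout exports ID_FS_TYPE, ID_FS_UUID and a
# per-device ID_FS_UUID_SUB for btrfs members.
SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="btrfs", \
  SYMLINK+="btrfs/$env{ID_FS_UUID}/$env{ID_FS_UUID_SUB}"

# Register each member with the kernel as it appears, instead of
# scanning all of /dev.
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="btrfs", \
  RUN+="/sbin/btrfsctl -A /dev/%k"
```

With rules like these the device list assembles itself from hotplug events, and the brute-force scan is only needed for rescue setups.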
Chris Mason
2008-Dec-17 14:53 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 08:23 -0500, Christoph Hellwig wrote:
> FYI: here's a little writeup I did this summer on support for
> filesystems spanning multiple block devices:

Thanks Christoph,

I'll start with a description of what btrfs does today.

Every btrfs filesystem has a UUID, and a tree that stores all the device UUIDs that belong to the FS UUID. Every btrfs device has a device UUID and a super block that indicates which FS UUID it belongs to.

The btrfs kernel module holds a list of the FS UUIDs found and the devices that belong to each one. This list is populated by a block device scanning ioctl that opens a bdev and checks for btrfs supers. I tried to keep this code as simple as possible because I knew we'd end up replacing it. At mount time, btrfs makes sure the devices found by scanning match the devices the FS expected to find.

btrfsctl -a scans all of /dev, calling the scan ioctl on each device, and btrfsctl -A /dev/xxxx just calls the ioctl on a single device. No scanning is required for a single device filesystem.

After the scan is done, sending any device in a multi-device filesystem is enough for the kernel to mount the FS. IOW: mkfs.btrfs /dev/sdb ; mount /dev/sdb /mnt just works. mkfs.btrfs /dev/sdb /dev/sdc ; mount /dev/sdb /mnt also works (mkfs.btrfs calls the ioctl on multi-device filesystems).

UUIDs and labels are important in large systems, but if the admin knows a given device is part of an FS, they are going to expect to be able to send that one device to mount and have things work.

Even though btrfs currently maintains the device list in the kernel, I'm happy to move it into a userland api once we settle on one.
Kay has some code so that udev can discover the btrfs device <-> FS UUID mappings, resulting in a tree like this:

$ tree /dev/btrfs/
/dev/btrfs/
|-- 0cdedd75-2d03-41e6-a1eb-156c0920a021
|   |-- 897fac06-569c-4f45-a0b9-a1f91a9564d4 -> ../../sda10
|   `-- aac20975-b642-4650-b65b-b92ce22616f2 -> ../../sda9
`-- a1ec970a-2463-414e-864c-2eb8ac4e1cf2
    |-- 4d1f1fff-4c6b-4b87-8486-36f58abc0610 -> ../../sdb2
    `-- e7fe3065-c39f-4295-a099-a89e839ae350 -> ../../sdb1

It makes sense to me to use /dev/multi-device/ instead of /dev/btrfs/, but I'm fine with anything really.

-chris
Christoph Hellwig
2008-Dec-17 15:08 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, Dec 17, 2008 at 03:50:45PM +0100, Kay Sievers wrote:
> Sounds all sensible. Btrfs already stores the (possibly incomplete)
> device tree state in the kernel, which should make things pretty easy
> for userspace, compared to other already existing subsystems.
>
> We could have udev maintain a btrfs volume tree:
> /dev/btrfs/
> |-- 0cdedd75-2d03-41e6-a1eb-156c0920a021
> |   |-- 897fac06-569c-4f45-a0b9-a1f91a9564d4 -> ../../sda10
> |   `-- aac20975-b642-4650-b65b-b92ce22616f2 -> ../../sda9
> `-- a1ec970a-2463-414e-864c-2eb8ac4e1cf2
>     |-- 4d1f1fff-4c6b-4b87-8486-36f58abc0610 -> ../../sdb2
>     `-- e7fe3065-c39f-4295-a099-a89e839ae350 -> ../../sdb1
>
> At the same time, by-uuid/ is created:
> /dev/disk/by-uuid/
> |-- 0cdedd75-2d03-41e6-a1eb-156c0920a021 -> ../../sda10
> |-- a1ec970a-2463-414e-864c-2eb8ac4e1cf2 -> ../../sdb2
> ...

Well, it's not just btrfs, it's also md, lvm and xfs. I think the right way is to make the single node in /dev/disk/by-uuid/ just a legacy case for potential multiple devices. E.g. by having:

/dev/disk/by-uuid/
  0cdedd75-2d03-41e6-a1eb-156c0920a021 -> ../../sda10
  0cdedd75-2d03-41e6-a1eb-156c0920a021.d
    foo -> ../../sda10
    bar -> ../../sda9

where foo and bar could be UUIDs if the filesystem / volume manager supports it, otherwise just short names for it.

> For rescue and recovery cases, it will still be nice to be able to
> trigger "scan all devices" code in btrfsctl (own code or libblkid),
> but it should be avoided in any normal operation mode.

Again, that's something we should do generically for the whole /dev/disk/ tree.
For that we need to merge libvolume_id and libblkid so that it covers a few related but separate use cases:

- a low-level probe of what fs / volume manager / etc. this is, for the udev callout, mkfs, stripe size detection, etc.
- a way to rescan everything, either for the non-udev static /dev case or your above recovery scenario
- plus potentially some sort of caching for the non-recovery static /dev case

I've long planned to put you and Ted into a room and not let you out until we see white smoke :)
Kay Sievers
2008-Dec-17 15:33 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, Dec 17, 2008 at 16:08, Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Dec 17, 2008 at 03:50:45PM +0100, Kay Sievers wrote:
> > Sounds all sensible. Btrfs already stores the (possibly incomplete)
> > device tree state in the kernel, which should make things pretty easy
> > for userspace, compared to other already existing subsystems.
> >
> > We could have udev maintain a btrfs volume tree:
> > /dev/btrfs/
> > |-- 0cdedd75-2d03-41e6-a1eb-156c0920a021
> > |   |-- 897fac06-569c-4f45-a0b9-a1f91a9564d4 -> ../../sda10
> > |   `-- aac20975-b642-4650-b65b-b92ce22616f2 -> ../../sda9
> > `-- a1ec970a-2463-414e-864c-2eb8ac4e1cf2
> >     |-- 4d1f1fff-4c6b-4b87-8486-36f58abc0610 -> ../../sdb2
> >     `-- e7fe3065-c39f-4295-a099-a89e839ae350 -> ../../sdb1
> >
> > At the same time, by-uuid/ is created:
> > /dev/disk/by-uuid/
> > |-- 0cdedd75-2d03-41e6-a1eb-156c0920a021 -> ../../sda10
> > |-- a1ec970a-2463-414e-864c-2eb8ac4e1cf2 -> ../../sdb2
> > ...
>
> Well, it's not just btrfs, it's also md, lvm and xfs. I think the right
> way is to make the single node in /dev/disk/by-uuid/ just a legacy
> case for potential multiple devices. E.g. by having:
>
> /dev/disk/by-uuid/
>   0cdedd75-2d03-41e6-a1eb-156c0920a021 -> ../../sda10
>   0cdedd75-2d03-41e6-a1eb-156c0920a021.d
>     foo -> ../../sda10
>     bar -> ../../sda9
>
> where foo and bar could be UUIDs if the filesystem / volume manager
> supports it, otherwise just short names for it.

Sure, we can do something like that. /dev/btrfs/ was just something for me to start with, and see how the stuff works.

> > For rescue and recovery cases, it will still be nice to be able to
> > trigger "scan all devices" code in btrfsctl (own code or libblkid),
> > but it should be avoided in any normal operation mode.
>
> Again, that's something we should do generically for the whole
> /dev/disk/ tree.
For that we need to merge libvolume_id and libblkid > so that it has a few related but separate use cases: > > - a lowlevel probe what fs / volume manager / etc is this for > the udev callout, mkfs, strip size detection etcA low-level api will be offered by a future libblkid version in util-linux-ng.> - a way to rescan everything, either for non-udev static /dev case > or your above recovery scenarioThe scan code is part of libblkid, we just need some explicit controls to enable disable the scanning. It should never be the default, like it is today.> - plus potentially some sort of caching for the non-recovery static > /dev caseIt''s also in libblkid. Today it''s pretty useless to cache stuff indexed by major/minor, but it''s there.> I''ve long planned to put you and Ted into a room and not let you out > until we see white smoke :)A new libblkid already happened at: http://git.kernel.org/?p=utils/util-linux-ng/util-linux-ng.git;a=shortlog;h=topic/blkid Almost all of libvolume_id is already merged into this new version (only btrfs is missing :)). Udev will switch over to calling blkid when it''s available in a released version of util-linux-ng. I will just delete the current libvolume_id library after that. No white smoke, if all works out as planned. :) Thanks, Kay -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andrew Morton
2008-Dec-17 19:53 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, 17 Dec 2008 08:23:44 -0500 Christoph Hellwig <hch@infradead.org> wrote:
> FYI: here's a little writeup I did this summer on support for
> filesystems spanning multiple block devices:
>
> --
>
> === Notes on support for multiple devices for a single filesystem ===
>
> == Intro ==
>
> Btrfs (and an experimental XFS version) can support multiple underlying block
> devices for a single filesystem instance in a generalized and flexible way.
>
> Unlike the support for external log devices in ext3, jfs, reiserfs, XFS, and
> the special real-time device in XFS, all data and metadata may be spread over a
> potentially large number of block devices, and not just one (or two).
>
> == Requirements ==
>
> We want a scheme to support these complex filesystem topologies in a way
> that is
>
> a) easy to set up and non-fragile for the users
> b) scalable to a large number of disks in the system
> c) recoverable without requiring user space running first
> d) generic enough to work for multiple filesystems or other consumers
>
> Requirement a) means that a multiple-device filesystem should be mountable
> by a simple fstab entry (UUID/LABEL or some other cookie) which continues
> to work when the filesystem topology changes.

"device topology"?

> Requirement b) implies we must not do a scan over all available block devices
> in large systems, but use an event-based callout on detection of new block
> devices.
>
> Requirement c) means there must be some way to add devices to a filesystem
> by kernel command lines, even if this is not the default way, and might require
> additional knowledge from the user / system administrator.
>
> Requirement d) means that we should not implement this mechanism inside a
> single filesystem.

One thing I've never seen comprehensively addressed is: why do this in the filesystem at all? Why not let MD take care of all this and present a single block device to the fs layer?

Lots of filesystems are violating this, and I'm sure the reasons for this are good, but this document seems like a suitable place in which to briefly describe those reasons.
Chris Mason
2008-Dec-17 20:58 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 11:53 -0800, Andrew Morton wrote:
> On Wed, 17 Dec 2008 08:23:44 -0500 Christoph Hellwig <hch@infradead.org> wrote:
>
> > FYI: here's a little writeup I did this summer on support for
> > filesystems spanning multiple block devices:
[...]
> One thing I've never seen comprehensively addressed is: why do this in
> the filesystem at all? Why not let MD take care of all this and
> present a single block device to the fs layer?
>
> Lots of filesystems are violating this, and I'm sure the reasons for
> this are good, but this document seems like a suitable place in which to
> briefly describe those reasons.

I'd almost rather see this doc stick to the device topology interface in hopes of describing something that RAID and MD can use too. But just to toss some information into the pool:

* When moving data around (raid rebuild, restripe, pvmove etc), we want to make sure the data read off the disk is correct before writing it to the new location (checksum verification).

* When moving data around, we don't want to move data that isn't actually used by the filesystem. This could be solved via new APIs, but keeping it crash safe would be very tricky.

* When checksum verification fails on read, the FS should be able to ask the raid implementation for another copy. This could be solved via new APIs.

* Different parts of the filesystem might want different underlying raid parameters. The easiest example is metadata vs data, where a 4k stripesize for data might be a bad idea and a 64k stripesize for metadata would result in many more rmw cycles.

* Sharing the filesystem transaction layer. LVM and MD have to pretend they are a single consistent array of bytes all the time, for each and every write they return as complete to the FS.

By pushing the multiple device support up into the filesystem, I can share the filesystem's transaction layer. Work can be done in larger atomic units, and the filesystem will stay consistent because it is all coordinated.

There are other bits and pieces like high speed front end caching devices that would be difficult in MD/LVM, but since I don't have that coded yet I suppose they don't really count...
-chris
Kay Sievers
2008-Dec-17 21:20 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, Dec 17, 2008 at 21:58, Chris Mason <chris.mason@oracle.com> wrote:
> On Wed, 2008-12-17 at 11:53 -0800, Andrew Morton wrote:
>> One thing I've never seen comprehensively addressed is: why do this in
>> the filesystem at all? Why not let MD take care of all this and
>> present a single block device to the fs layer?
>>
>> Lots of filesystems are violating this, and I'm sure the reasons for
>> this are good, but this document seems like a suitable place in which to
>> briefly describe those reasons.
>
> I'd almost rather see this doc stick to the device topology interface in
> hopes of describing something that RAID and MD can use too. But just to
> toss some information into the pool:
[...]
> There are other bits and pieces like high speed front end caching
> devices that would be difficult in MD/LVM, but since I don't have that
> coded yet I suppose they don't really count...

Features like the very nice and useful directory-based snapshots would also not be possible with simple block-based multi-devices, right?

Kay
Andreas Dilger
2008-Dec-17 21:24 UTC
Re: Notes on support for multiple devices for a single filesystem
On Dec 17, 2008 15:58 -0500, Chris Mason wrote:
> On Wed, 2008-12-17 at 11:53 -0800, Andrew Morton wrote:
> > One thing I've never seen comprehensively addressed is: why do this in
> > the filesystem at all? Why not let MD take care of all this and
> > present a single block device to the fs layer?
> >
> > Lots of filesystems are violating this, and I'm sure the reasons for
> > this are good, but this document seems like a suitable place in which to
> > briefly describe those reasons.
>
> I'd almost rather see this doc stick to the device topology interface in
> hopes of describing something that RAID and MD can use too. But just to
> toss some information into the pool:

Add in here (most important reason, IMHO) that the filesystem wants to make sure that different copies of redundant metadata are stored on different physical devices. It seems pointless to have 4 copies of important data if a single disk failure makes them all inaccessible.

At the same time, not all data/metadata is of the same importance, so it makes sense to store e.g. 4 full copies of important metadata like the allocation bitmaps and the tree root block, but only RAID-5 for file data. Even if MD was used to implement the RAID-1 and RAID-5 layer in this case there would need to be multiple MD devices involved.

> * When moving data around (raid rebuild, restripe, pvmove etc), we want
>   to make sure the data read off the disk is correct before writing it to
>   the new location (checksum verification).
>
> * When moving data around, we don't want to move data that isn't
>   actually used by the filesystem. This could be solved via new APIs, but
>   keeping it crash safe would be very tricky.
>
> * When checksum verification fails on read, the FS should be able to ask
>   the raid implementation for another copy. This could be solved via new
>   APIs.
>
> * Different parts of the filesystem might want different underlying raid
>   parameters. The easiest example is metadata vs data, where a 4k
>   stripesize for data might be a bad idea and a 64k stripesize for
>   metadata would result in many more rmw cycles.

Not just different underlying RAID parameters, but completely separate physical storage characteristics. Having e.g. metadata stored on RAID-1 SSD flash (excellent for small random IO) while the data for large files is stored on SATA RAID-5 would maximize performance while minimizing cost. If there is a single virtual block device, the filesystem can't make such allocation decisions unless the virtual block device exposes grotty details like "first 1MB of 128MB is really SSD" or "first 64GB is SSD, rest is SATA" to the filesystem somehow, at which point you are just shoehorning multiple devices into a bad interface (linear array of block numbers) that has to be worked around.

> * Sharing the filesystem transaction layer. LVM and MD have to pretend
>   they are a single consistent array of bytes all the time, for each and
>   every write they return as complete to the FS.
>
> By pushing the multiple device support up into the filesystem, I can
> share the filesystem's transaction layer. Work can be done in larger
> atomic units, and the filesystem will stay consistent because it is all
> coordinated.

This is even true with filesystems other than btrfs. As it stands today the MD RAID-1 code implements its own transaction mechanism for the recovery bitmaps, and it would have been more efficient to hook this into the JBD transaction code to avoid 2 layers of flush-then-wait_for_completion.

I can't speak for btrfs, but I don't think multiple device access from the filesystem is a "layering violation" as some people comment. It is just a different type of layering. With ZFS there is a distinct layer that is handling the allocation, redundancy, and transactions (SPA, DMU) that is exporting an object interface, and the filesystem (ZPL, or future versions of Lustre) is built on top of that object interface.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Chris Mason
2008-Dec-17 21:26 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 22:20 +0100, Kay Sievers wrote:
> On Wed, Dec 17, 2008 at 21:58, Chris Mason <chris.mason@oracle.com> wrote:
> > There are other bits and pieces like high speed front end caching
> > devices that would be difficult in MD/LVM, but since I don't have that
> > coded yet I suppose they don't really count...
>
> Features like the very nice and useful directory-based snapshots would
> also not be possible with simple block-based multi-devices, right?

At least for btrfs, the snapshotting is independent from the multi-device code, and you still get snapshotting on single device filesystems.

-chris
Jeff Garzik
2008-Dec-17 21:27 UTC
Re: Notes on support for multiple devices for a single filesystem
Kay Sievers wrote:
> Features like the very nice and useful directory-based snapshots would
> also not be possible with simple block-based multi-devices, right?

Snapshotting via block device has always been an incredibly dumb hack, existing primarily because filesystem-based snapshots did not exist for the filesystem in question.

Snapshots are better at the filesystem level because the filesystem is the only entity that knows when the filesystem is quiescent and snapshot-able. ISTR we had to add ->write_super_lockfs() to hack in support for LVM in this manner, rather than doing it the right way.

Jeff
Jeff Garzik
2008-Dec-17 21:30 UTC
Re: Notes on support for multiple devices for a single filesystem
Andreas Dilger wrote:
> I can't speak for btrfs, but I don't think multiple device access from
> the filesystem is a "layering violation" as some people comment. It is
> just a different type of layering. With ZFS there is a distinct layer
> that is handling the allocation, redundancy, and transactions (SPA, DMU)
> that is exporting an object interface, and the filesystem (ZPL, or future
> versions of Lustre) is built on top of that object interface.

Furthermore... think about object-based storage filesystems. They will need to directly issue SCSI commands to storage devices. Call it a layering violation if you will, but you simply cannot even pretend that an OSD is a linear block device for the purposes of our existing block layer.

Jeff
Chris Mason
2008-Dec-17 21:41 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 14:24 -0700, Andreas Dilger wrote:
> I can't speak for btrfs, but I don't think multiple device access from
> the filesystem is a "layering violation" as some people comment. It is
> just a different type of layering. With ZFS there is a distinct layer
> that is handling the allocation, redundancy, and transactions (SPA, DMU)
> that is exporting an object interface, and the filesystem (ZPL, or future
> versions of Lustre) is built on top of that object interface.

Clean interfaces aren't really my best talent, but btrfs also layers this out. logical->physical mappings happen in a centralized function, and all of the on disk structures use logical block numbers. The only exception to that rule is the superblock offsets on the device.

-chris
Andreas Dilger
2008-Dec-17 22:04 UTC
Re: Notes on support for multiple devices for a single filesystem
On Dec 17, 2008 08:23 -0500, Christoph Hellwig wrote:
> == Prior art ==
>
> * External log and realtime volume
>
> The most common way to specify the external log device and the XFS real time
> device is to have a mount option that contains the path to the block special
> device for it. This variant means a mount option is always required, and
> requires the device name doesn't change, which is enough with udev-generated
> unique device names (/dev/disk/by-{label,uuid}).
>
> An alternative way, optionally supported by ext3 and reiserfs and
> exclusively supported by jfs, is to open the journal device by the device
> number (dev_t) of the block special device. While this doesn't require
> an additional mount option when the device number is stored in the filesystem
> superblock, it relies on the device number being stable, which is getting
> increasingly unlikely in complex storage topologies.

Just as an FYI here - the dev_t stored in the ext3/4 superblock for the journal device is only a "cached" device. The journal is properly identified by its UUID, and should the device mapping change there is a "journal_dev=" option that can be used to specify the new device. The one shortcoming is that there is no mount.ext3 helper which does this journal UUID->dev mapping and automatically passes "journal_dev=" if needed.

> * RAID (MD) and LVM
>
> Recent MD setups and LVM2 thus move the scanning to user space, typically
> using a command iterating over all block device nodes and performing the
> format-specific scanning. While this is more flexible
> than the in-kernel scanning, it scales very badly to a large number of
> block devices, and requires additional user space commands to run early
> in the boot process.
> A variant of this scheme runs a scanning callout
> from udev once disk devices are detected, which avoids the scanning overhead.

My (admittedly somewhat vague) impression is that with large numbers of devices the udev callout can itself be a huge overhead, because this involves a userspace fork/exec for each new device being added. For the same number of devices, a single scan from userspace only requires a single process, and an equal number of device probes. Added to this, the blkid cache can be used to eliminate the need to do any scanning if the devices have not changed from the previous boot, which makes it unclear which mechanism is more efficient. The drawback is that the initrd device cache is never going to be up-to-date, so it wouldn't be useful until the root partition is mounted.

We've used blkid for our testing of Lustre-on-DMU with up to 48 (local) disks w/o any kind of performance issues. We'll eventually be able to test on systems with around 400 disks in a JBOD configuration, but until then we only run on systems with hundreds of disks behind a RAID controller.

> == High-level design considerations ==
>
> Due to requirement b) we need a layer that finds devices for a single
> fstab entry. We can either do this in user space, or in kernel space.
> As we've traditionally always done UUID/LABEL to device mapping in
> userspace, and we already have libvolume_id and libblkid dealing with
> the specialized case of UUID/LABEL to single device mapping, I would
> recommend to keep doing this in user space and reuse libvolume_id/libblkid.
>
> There are two options to perform the assembly of the device list for
> a filesystem:
>
> 1) whenever libvolume_id / libblkid find a device detected as a multi-device
>    capable filesystem it gets added to a list of all devices of this
>    particular filesystem type.
>    At mount time, mount(8) or a mount.fstype helper calls out to the
>    libraries to get a list of devices belonging to this filesystem
>    type and translates them to device names, which can be passed to
>    the kernel on the mount command line.

I would actually suggest that instead of keeping devices in groups by the filesystem type, rather keep a list of devices with the same UUID and/or LABEL, and if the mount is looking for this UUID/LABEL it gets the whole list of matching devices back.

This could also be done in the kernel by having the filesystems register a "probe" function that examines the device/partitions as they are added, similar to the way that MD used to do it. There would likely be very few probe functions needed, only ext3/4 (for journal devices), btrfs, and maybe MD, LVM2 and a handful more. If we wanted to avoid code duplication, this could share code between libblkid and the kernel (just the enhanced probe-only functions in the util-linux-ng implementation) since these functions are little more than "take a pointer, cast it to struct X, check some magic fields and return match + {LABEL, UUID}, or no-match".

That MD used to check only the partition type doesn't mean that we can't have simple functions that read the superblock (or equivalent) to make an internal list of suitable devices attached to a filesystem-type global structure (possibly split into per-fsUUID sublists if it wants). When filesystem X is mounted (using any one device from the filesystem passed to mount) it can quickly scan this list to find all devices for that filesystem's UUID to assemble all of the required devices. Since the filesystem would internally have to verify the list of devices that would (somehow) be passed to the kernel during mount, this doesn't really gain much over keeping the list in userspace.

>    Advantage: Requires a mount.fstype helper or fs-specific knowledge
>    in mount(8).
>    Disadvantages: Requires libvolume_id / libblkid to keep state.
> 2) whenever libvolume_id / libblkid find a device detected as a multi-device
>    capable filesystem they call into the kernel through an ioctl / sysfs /
>    etc to add it to a list in kernel space. The kernel code then manages
>    to select the right filesystem instances, and either adds it to an already
>    running instance, or prepares a list for a future mount.

While not performance critical, it seems that calling into userspace just to call back down into the kernel is more complex than it needs to be (i.e. prone to failure). I'm not stuck on doing this in the kernel, just thinking that the more components involved the more likely it is that something goes wrong. I have enough grief with just initrd not having the right kernel module, I'd hate to depend on tools to do the device assembly for the root fs.

>    Advantages: keeps state in the kernel instead of in libraries
>    Disadvantages: requires an additional user interface to mount
>
> c) will be dealt with by an option that allows adding named block devices
> on the mount command line, which is already required for design 1) but
> would have to be implemented additionally for design 2).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Dave Kleikamp
2008-Dec-17 22:19 UTC
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 15:04 -0700, Andreas Dilger wrote:
> On Dec 17, 2008 08:23 -0500, Christoph Hellwig wrote:
> > An alternative way, optionally supported by ext3 and reiserfs and
> > exclusively supported by jfs, is to open the journal device by the device
> > number (dev_t) of the block special device. While this doesn't require
> > an additional mount option when the device number is stored in the filesystem
> > superblock, it relies on the device number being stable, which is getting
> > increasingly unlikely in complex storage topologies.
>
> Just as an FYI here - the dev_t stored in the ext3/4 superblock for the
> journal device is only a "cached" device. The journal is properly
> identified by its UUID, and should the device mapping change there is a
> "journal_dev=" option that can be used to specify the new device. The
> one shortcoming is that there is no mount.ext3 helper which does this
> journal UUID->dev mapping and automatically passes "journal_dev=" if
> needed.

An additional FYI. JFS also treats the dev_t in its superblock the same way. Since jfs relies on jfs_fsck running at boot time to ensure that the journal is replayed, jfs_fsck makes sure that the dev_t is accurate. If not, then it scans all of the block devices until it finds the uuid of the journal device, updating the superblock so that the kernel will find the journal.

Shaggy
--
David Kleikamp
IBM Linux Technology Center
Bryan Henderson
2008-Dec-18 21:22 UTC
Re: Notes on support for multiple devices for a single filesystem
>> Features like the very nice and useful directory-based snapshots would
>> also not be possible with simple block-based multi-devices, right?
>
> Snapshotting via block device has always been an incredibly dumb hack,
> existing primarily because filesystem-based snapshots did not exist for
> the filesystem in question.

I can see that if the filesystem driver in question could already do snapshots, nobody would have added snapshot function to the block device driver under it, but this doesn't explain why someone at some time created block device snapshot instead of creating it for the filesystem in question.

> Snapshots are better at the filesystem level because the filesystem is
> the only entity that knows when the filesystem is quiescent and
> snapshot-able.

You can use the same logic to say that snapshots are better at the application level because only the application knows when its database is quiescent and snapshot-able. In fact, carrying it to the extreme, you could say snapshots are better done manually by the human end user with none of the computer knowing anything about it.

It probably minimizes engineering effort to have snapshot capability at every level, with the implementation at each level exploiting the function at the level below. E.g. when someone tells a filesystem driver to snapshot a filesystem that resides on two block devices, the filesystem driver quiesces the filesystem, then snapshots each device (implemented in the block device driver), then resumes. The new snapshot filesystem lives on the two new snapshot block devices.

Of course, if you want to do a form of snapshot that makes sense only in the context of a filesystem, like the directory snapshot mentioned above, then you can't get as much help from snapshot functions in the storage devices.
--
Bryan Henderson
IBM Almaden Research Center
San Jose CA
Storage Systems
Liu Hui
2008-Dec-22 01:59 UTC
Re: Notes on support for multiple devices for a single filesystem
A very interesting article written by Jeff Bonwick for Andrew -- "Rampant Layering Violation?"
http://blogs.sun.com/bonwick/entry/rampant_layering_violation

2008/12/18 Andrew Morton <akpm@linux-foundation.org>:
> On Wed, 17 Dec 2008 08:23:44 -0500
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> FYI: here's a little writeup I did this summer on support for
>> filesystems spanning multiple block devices:
[...]
> One thing I've never seen comprehensively addressed is: why do this in
> the filesystem at all? Why not let MD take care of all this and
> present a single block device to the fs layer?
>
> Lots of filesystems are violating this, and I'm sure the reasons for
> this are good, but this document seems like a suitable place in which to
> briefly describe those reasons.

--
Thanks & Best Regards
Liu Hui