deadhorseconsulting
2013-Nov-19 05:12 UTC
Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
In theory (going by the man page and available documentation, which are not 100% clear), does the following command actually work as advertised and specify that metadata should be placed and kept only on the "devices" specified after the "-m" flag?

Thus, given the following example:

mkfs.btrfs -L foo -m raid10 <ssd> <ssd> <ssd> <ssd> -d raid10 <rust> <rust> <rust> <rust>

Would btrfs stripe/mirror and keep metadata only on the 4 specified SSD devices? Likewise, would it stripe/mirror and keep data only on the 4 specified spinning-rust devices?

In trying to create this type of setup, it appears that data is also being stored on the devices specified as "metadata devices". This is observed via "btrfs filesystem show": after committing a large amount of data to the filesystem, the data devices have balanced data as expected, with plenty of free space, but the SSD devices are reported as nearly or completely full.

- DHC
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Hugo Mills
2013-Nov-19 09:06 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
On Mon, Nov 18, 2013 at 11:12:03PM -0600, deadhorseconsulting wrote:
> In theory (going by the man page and available documentation, not 100%
> clear) does the following command indeed actually work as advertised
> and specify how metadata should be placed and kept only on the
> "devices" specified after the "-m" flag?
>
> Thus given the following example:
> mkfs.btrfs -L foo -m raid10 <ssd> <ssd> <ssd> <ssd> -d raid10 <rust>
> <rust> <rust> <rust>
>
> Would btrfs stripe/mirror and only keep metadata on the 4 specified SSD devices?
> Likewise then stripe/mirror and only keep data on the specified 4 spinning rust?

No. The devices are general purpose. The -d and -m options only specify the type of redundancy, not the devices to use. There's a project[1] to look at this kind of more intelligent chunk allocator, but it's not been updated in a while.

[1] https://btrfs.wiki.kernel.org/index.php/Project_ideas#Device_IO_Priorities

> In trying and creating this type of setup it appears that data is also
> being stored on the devices specified as "metadata devices". This is
> observed via a "btrfs filesystem show": after committing a large amount
> of data to the filesystem, the data devices have balanced data as
> expected with plenty of free space, but the SSD devices are reported as
> either nearly used or completely used.

This will happen with RAID-10. The allocator will write stripes as wide as it can: in this case, the first stripes will run across all 8 devices, until the SSDs are full, and then will write across the remaining 4 devices.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- If it ain't broke, hit it again. ---
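Hugo's description of the wide-stripe allocator can be illustrated with a toy simulation. This is only a sketch of the behaviour he describes (stripe across as many devices as possible, an even number of at least 4 for RAID-10, preferring devices with the most free space), not btrfs code; the device names and unit chunk size are made up for the example:

```python
# Toy model of the RAID-10 chunk allocation behaviour described above.
# Not real btrfs code: just "stripe as wide as possible" over a device pool.

def allocate_raid10(devices, chunk=1):
    """Allocate one chunk; 'devices' maps device name -> free space.

    Uses the largest even number (>= 4) of devices with enough free space,
    preferring those with the most free space. Returns the devices used,
    or None when no further RAID-10 allocation is possible.
    """
    avail = [d for d, free in devices.items() if free >= chunk]
    avail.sort(key=lambda d: devices[d], reverse=True)  # most free first
    width = len(avail) - (len(avail) % 2)               # even stripe width
    if width < 4:
        return None
    used = avail[:width]
    for d in used:
        devices[d] -= chunk
    return used

# 4 small SSDs plus 4 large rust drives, as in the original question.
devs = {f"ssd{i}": 2 for i in range(4)}
devs.update({f"hdd{i}": 10 for i in range(4)})

widths = []
while (used := allocate_raid10(devs)) is not None:
    widths.append(len(used))

print(widths[0], widths[-1])  # first stripes 8 wide, later stripes 4 wide
```

In this toy run the first chunks stripe across all 8 devices; once the small "SSDs" are full, allocation continues 4 devices wide, which is exactly why the SSDs in the original report filled up first.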
deadhorseconsulting
2013-Nov-19 19:24 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
Interesting, this confirms what I was observing. Given the wording in the man pages for "-m" and "-d", which states "Specify how the metadata or data must be spanned across the devices specified," I took "devices specified" to literally mean the devices specified after the according switch.

- DHC

On Tue, Nov 19, 2013 at 3:06 AM, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Mon, Nov 18, 2013 at 11:12:03PM -0600, deadhorseconsulting wrote:
> > In theory (going by the man page and available documentation, not 100%
> > clear) does the following command indeed actually work as advertised
> > and specify how metadata should be placed and kept only on the
> > "devices" specified after the "-m" flag?
> >
> > Thus given the following example:
> > mkfs.btrfs -L foo -m raid10 <ssd> <ssd> <ssd> <ssd> -d raid10 <rust>
> > <rust> <rust> <rust>
> >
> > Would btrfs stripe/mirror and only keep metadata on the 4 specified SSD devices?
> > Likewise then stripe/mirror and only keep data on the specified 4 spinning rust?
>
> No. The devices are general purpose. The -d and -m options only
> specify the type of redundancy, not the devices to use. There's a
> project[1] to look at this kind of more intelligent chunk allocator,
> but it's not been updated in a while.
>
> [1] https://btrfs.wiki.kernel.org/index.php/Project_ideas#Device_IO_Priorities
>
> > In trying and creating this type of setup it appears that data is also
> > being stored on the devices specified as "metadata devices". This is
> > observed via a "btrfs filesystem show": after committing a large amount
> > of data to the filesystem, the data devices have balanced data as
> > expected with plenty of free space, but the SSD devices are reported as
> > either nearly used or completely used.
>
> This will happen with RAID-10. The allocator will write stripes as
> wide as it can: in this case, the first stripes will run across all 8
> devices, until the SSDs are full, and then will write across the
> remaining 4 devices.
>
> Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
> PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
> --- If it ain't broke, hit it again. ---
Duncan
2013-Nov-19 21:04 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
deadhorseconsulting posted on Tue, 19 Nov 2013 13:24:01 -0600 as excerpted:

> Interesting, this confirms what I was observing.
> Given the wording in man pages for "-m" and "-d" which states "Specify
> how the metadata or data must be spanned across the devices specified."
> I took "devices specified" to literally mean the devices specified after
> the according switch.

It's all in how you read the documentation. After years of doing so... While I can see how you might get that from reading the -m and -d option text descriptions, the synopsis indicates differently (excerpt quotes reformatted for posting):

SYNOPSIS
    mkfs.btrfs [ -A alloc-start ] [ -b byte-count ] [ -d data-profile ]
    [ -f ] [ -n nodesize ] [ -l leafsize ] [ -L label ]
    [ -m metadata-profile ] [ -M mixed-data+metadata ] [ -s sectorsize ]
    [ -r rootdir ] [ -K ] [ -O feature1,feature2,... ] [ -h ] [ -V ]
    device [ device ... ]

Here, you can see that the -d and -m options take only a single parameter, the profile, and that the device list goes at the end and is thus a general device list, not specifically linked to the -d and -m options.

Similarly, the option lines themselves:

    -d, --data type
    -m, --metadata profile

... not ...

    -d, --data type [ device [ device ... ]]
    -m, --metadata profile [ device [ device ... ]]

Those are from the manpage. Similarly, the usage line from the output of mkfs.btrfs --help (--help being an unrecognized option, it says, but it does what it needs to do...):

    usage: mkfs.btrfs [options] dev [ dev ... ]
    options:
    -d --data      data profile, raid0, raid1, raid5, raid6, raid10, dup or single
    -m --metadata  metadata profile, values like data profile

All options come first, with no indication of a per-option device list, then the general-purpose device list. So I'd argue that the documentation is reasonably clear as-is: no per-option device list, a general-purpose device list at the end, and thus no ability to specify data-specific and metadata-specific device lists.
(Of course it can happen that the code gets out of sync with the documentation, but that wasn't the argument here.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
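Given the synopsis Duncan quotes, the SSD/rust split the original poster wanted simply isn't expressible; the closest valid invocation would put the profile options first and all devices in one general pool. A hedged sketch (the device names are placeholders, and this command would destroy any data on the named devices, so it is illustrative only):

```sh
# What the synopsis implies: options first, then ONE general device list.
# -m and -d each take only a profile; they do not take their own devices.
mkfs.btrfs -L foo -m raid10 -d raid10 \
    /dev/ssd1 /dev/ssd2 /dev/ssd3 /dev/ssd4 \
    /dev/hdd1 /dev/hdd2 /dev/hdd3 /dev/hdd4
```

All eight devices here form a single pool; both data and metadata chunks may land on any of them.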
Duncan
2013-Nov-19 23:16 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
Hugo Mills posted on Tue, 19 Nov 2013 09:06:02 +0000 as excerpted:

> This will happen with RAID-10. The allocator will write stripes as wide
> as it can: in this case, the first stripes will run across all 8
> devices, until the SSDs are full, and then will write across the
> remaining 4 devices.

Hugo, it doesn't change the outcome for this case, but either your assertion above is incorrect, or the wiki discussion is incorrect (or, of course, possibly I'm the one misunderstanding something, in which case hopefully replies to this will correct my understanding).

Because I distinctly recall reading on the wiki that for raid, regardless of the raid level, btrfs always allocates in pairs (well, I guess it'd be pairs of pairs for raid10 mode, and I believe that statement pre-dated raid5/6 support so that isn't included). I was actually shocked by that, because while I knew that was the case for raid1, I had thought that other raid levels would stripe as widely as possible, which is what you assert above as well.

Now I just have to find where I read that on the wiki...

OK, here's one spot: FAQ, md-raid/device-mapper-raid/btrfs-raid differences, btrfs:

https://btrfs.wiki.kernel.org/index.php/FAQ#btrfs

>>>>
btrfs combines all the drives into a storage pool first, and then duplicates the chunks as file data is created. RAID-1 is defined currently as "2 copies of all the data on different disks". This differs from MD-RAID and dmraid, in that those make exactly n copies for n disks. In a btrfs RAID-1 on 3 1TB drives we get 1.5TB of usable data. Because each block is only copied to 2 drives, writing a given block only requires exactly 2 drives to spin up; reading requires only 1 drive to spin up.

RAID-0 is similarly defined, with the stripe split among exactly 2 disks. 3 1TB drives yield 3TB usable space, but to read a given stripe only requires 2 disks.

RAID-10 is built on top of these definitions. Every stripe is split across to exactly 2 RAID1 sets and those RAID1 sets are written to exactly 2 disks (hence 4 disk minimum). A btrfs raid-10 volume with 6 1TB drives will yield 3TB usable space with 2 copies of all data, but only 4
<<<<

[Yes, that ending sentence is incomplete in the wiki.]

So we have:

1) raid1 is exactly two copies of data, paired devices.

2) raid0 is a stripe exactly two devices wide (reinforced by "to read a stripe takes only two devices"), so again paired devices.

3) raid10 is a combination of the above raid0 and raid1 definitions: exactly two raid1 pairs, paired in raid0.

So btrfs raid10 is pairs of pairs, each raid0 stripe a pair of raid1 mirrors. If there are 8 devices, four smaller, four larger, the first allocated chunks should be one per device; when the smaller devices fill up, it'll chunk across the remaining four. But it'll be pairs of pairs of pairs: two pair(0)-of-pair(1) stripes wide instead of a single quad(0)-of-pair(1) stripe wide.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Martin
2013-Nov-20 06:35 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
On 19/11/13 23:16, Duncan wrote:
> So we have:
>
> 1) raid1 is exactly two copies of data, paired devices.
>
> 2) raid0 is a stripe exactly two devices wide (reinforced by "to read a
> stripe takes only two devices"), so again paired devices.

Which is fine for some occasions and a very good starting point.

However, I'm sure there is a strong wish to be able to specify n copies of data/metadata spread across m devices, or even to specify 'hot spares'. This would be a great way to overcome the problem of a set of drives becoming "read-only" when one btrfs drive fails or is removed. (Or should we always mount with the "degraded" option?)

Regards,
Martin
Martin
2013-Nov-20 06:41 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
On 19/11/13 19:24, deadhorseconsulting wrote:
> Interesting, this confirms what I was observing.
> Given the wording in man pages for "-m" and "-d" which states "Specify
> how the metadata or data must be spanned across the devices
> specified."
> I took "devices specified" to literally mean the devices specified
> after the according switch.

That sounds like a hangover from too many years' use of the mdadm command, and more recently such as the sgdisk command... ;-)

Myself, I like the btrfs way: you specify the list of parameters, and they all get applied as a whole.

The one bugbear at the moment is that, when using multiple disks, any actions seem to be applied to the list of devices in sequence, one by one. There's no apparent intelligence to consider the "present pool" -> "new pool" of devices as a whole.

More development!

Regards,
Martin
Hugo Mills
2013-Nov-20 08:09 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
On Tue, Nov 19, 2013 at 11:16:58PM +0000, Duncan wrote:
> Hugo Mills posted on Tue, 19 Nov 2013 09:06:02 +0000 as excerpted:
>
> > This will happen with RAID-10. The allocator will write stripes as wide
> > as it can: in this case, the first stripes will run across all 8
> > devices, until the SSDs are full, and then will write across the
> > remaining 4 devices.
>
> Hugo, it doesn't change the outcome for this case, but either your
> assertion above is incorrect, or the wiki discussion is incorrect (of
> course, or possibly I'm the one misunderstanding something, in which
> case hopefully replies to this will correct my understanding).
>
> Because I distinctly recall reading on the wiki that for raid,
> regardless of the raid level, btrfs always allocates in pairs (well, I
> guess it'd be pairs of pairs for raid10 mode, and I believe that
> statement pre-dated raid5/6 support so that isn't included). I was
> actually shocked by that because while I knew that was the case for
> raid1, I had thought that other raid levels would stripe as widely as
> possible, which is what you assert above as well.

That's incorrect. I used to think that, a few years ago, and it got into at least one piece of documentation as a result, but once I worked out the actual behaviour, I did try to correct it (I definitely remember fixing the sysadmin guide this way).

For striped levels (RAID-0, 10, 5, 6), the FS will use as many stripes as possible -- for RAID-10, this means an even number; for the others, this is all the devices with free space on, down to a RAID-level-dependent minimum:

RAID-0: min 2 devices
RAID-10: min 4 devices
RAID-5: min 2 devices (I think)
RAID-6: min 3 devices (I think)

> Now I just have to find where I read that on the wiki...
>
> OK, here's one spot: FAQ, md-raid/device-mapper-raid/btrfs-raid
> differences, btrfs:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#btrfs
>
> btrfs combines all the drives into a storage pool first, and then
> duplicates the chunks as file data is created. RAID-1 is defined
> currently as "2 copies of all the data on different disks". This
> differs from MD-RAID and dmraid, in that those make exactly n copies
> for n disks. In a btrfs RAID-1 on 3 1TB drives we get 1.5TB of usable
> data. Because each block is only copied to 2 drives, writing a given
> block only requires exactly 2 drives to spin up; reading requires only
> 1 drive to spin up.

This is correct.

> RAID-0 is similarly defined, with the stripe split among exactly 2
> disks. 3 1TB drives yield 3TB usable space, but to read a given stripe
> only requires 2 disks.

This is definitely wrong. RAID-0 will use all 3 drives for each stripe.

> RAID-10 is built on top of these definitions. Every stripe is split
> across to exactly 2 RAID1 sets and those RAID1 sets are written to
> exactly 2 disks (hence 4 disk minimum). A btrfs raid-10 volume with 6
> 1TB drives will yield 3TB usable space with 2 copies of all data, but
> only 4

This is also wrong. You will get 3 TB usable out of 6 × 1 TB drives, but the individual stripes will be 3 drives wide. You would have the same behaviour (2 copies of 3-wide stripes) on a 7-device array.

> [Yes, that ending sentence is incomplete in the wiki.]
>
> So we have:
>
> 1) raid1 is exactly two copies of data, paired devices.
>
> 2) raid0 is a stripe exactly two devices wide (reinforced by "to read a
> stripe takes only two devices"), so again paired devices.
>
> 3) raid10 is a combination of the above raid0 and raid1 definitions,
> exactly two raid1 pairs, paired in raid0.
>
> So btrfs raid10 is pairs of pairs, each raid0 stripe a pair of raid1
> mirrors. If there's 8 devices, four smaller, four larger, the first
> allocated chunks should be one per device; when the smaller devices
> fill up it'll chunk across the remaining four, but it'll be pairs of
> pairs of pairs, two pair(0)-of-pair(1) stripes wide instead of a single
> quad(0)-of-pair(1) stripe wide.

If the RAID code used pairs for its stripes, that'd be the case, but it doesn't...

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- emacs: Emacs Makes A Computer Slow. ---
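Hugo's "as many stripes as possible" rule can be checked numerically. The toy calculation below is a sketch under simplifying assumptions (equal-sized allocation units, greedy allocation from the devices with the most free space; not btrfs code), and it reproduces the 3 TB usable figure for six 1 TB drives with every stripe 6 devices wide rather than pairs:

```python
# Toy RAID-10 capacity model: each chunk is striped across the largest even
# number (>= 4) of devices with free space; half the stripe holds data,
# half holds the mirror copy, so usable data per chunk = width / 2 units.

def raid10_usable(free, chunk=1):
    """free: list of per-device free space. Returns (usable, stripe_widths)."""
    free = list(free)
    usable, widths = 0, []
    while True:
        avail = sorted((i for i, f in enumerate(free) if f >= chunk),
                       key=lambda i: free[i], reverse=True)
        width = len(avail) - (len(avail) % 2)   # even stripe width
        if width < 4:
            return usable, widths
        for i in avail[:width]:
            free[i] -= chunk
        usable += (width // 2) * chunk
        widths.append(width)

usable, widths = raid10_usable([1000] * 6)   # six 1 TB drives, in GB units
print(usable, set(widths))  # 3000 GB usable; every stripe is 6 devices wide
```

The usable capacity (half the raw total) is the same as the pairs-of-pairs model would give, which is why the wiki's error didn't change the space numbers, only the stripe widths.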
Chris Murphy
2013-Nov-20 10:16 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
On Nov 19, 2013, at 11:35 PM, Martin <m_btrfs@ml1.co.uk> wrote:
> On 19/11/13 23:16, Duncan wrote:
>
> > So we have:
> >
> > 1) raid1 is exactly two copies of data, paired devices.
> >
> > 2) raid0 is a stripe exactly two devices wide (reinforced by "to read a
> > stripe takes only two devices"), so again paired devices.
>
> Which is fine for some occasions and a very good start point.
>
> However, I'm sure there is a strong wish to be able to specify n-copies
> of data/metadata spread across m devices. Or even to specify 'hot spares'.

Hot spares are worse than useless, especially for raid10. The drive takes up space doing nothing but sucking power, rather than adding space or performance. Somehow this idea comes from cheap companies who seem to think their data is so valuable they need hot spares, yet they don't have 24/7 staff on hand to do a hot swap. (As if the only problem that can occur is a dead drive.) So I think those companies can develop this otherwise unneeded feature.

n-copies raid1 is a good idea, and I think it's being worked on.

Chris Murphy
Russell Coker
2013-Nov-20 10:22 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
On Wed, 20 Nov 2013, Chris Murphy <lists@colorremedies.com> wrote:
> Hot spares are worse than useless. Especially for raid10. The drive
> takes up space doing nothing but suck power, rather than adding space
> or performance. Somehow this idea comes from cheap companies who seem
> to think their data is so valuable they need hot spares, yet they don't
> have 24/7 staff on hand to do a hot swap. (As if the only problem that
> can occur is a dead drive.) So I think those companies can develop this
> otherwise unneeded feature.
>
> n-copies raid1 is a good idea and I think it's being worked on.

N-copies RAID-1 is definitely more useful than RAID-1 with a hot spare. But for RAID-5/RAID-6 a hot spare can provide real value: not having to pay someone to make a special rushed visit to replace a disk is a definite benefit.

Also, when a disk isn't being used it doesn't draw much power. Last time I tested such things, I found an IDE disk to use about 7W while spinning, and no measurable difference to overall system power use when spun down.

--
My Main Blog        http://etbe.coker.com.au/
My Documents Blog   http://doc.coker.com.au/
Duncan
2013-Nov-20 16:43 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
Hugo Mills posted on Wed, 20 Nov 2013 08:09:58 +0000 as excerpted:

> RAID-0: min 2 devices
> RAID-10: min 4 devices
> RAID-5: min 2 devices (I think)
> RAID-6: min 3 devices (I think)

RAID-5 should be a 3-device minimum (each stripe consisting of two data segments and one parity segment, each on a different device). And RAID-6 similarly four devices (two data and two parity).

Perhaps it's time I get that wiki account and edit some of this stuff myself...

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Hugo Mills
2013-Nov-20 16:52 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
On Wed, Nov 20, 2013 at 04:43:57PM +0000, Duncan wrote:
> Hugo Mills posted on Wed, 20 Nov 2013 08:09:58 +0000 as excerpted:
>
> > RAID-0: min 2 devices
> > RAID-10: min 4 devices
> > RAID-5: min 2 devices (I think)
> > RAID-6: min 3 devices (I think)
>
> RAID-5 should be 3-device minimum (each stripe consisting of two data
> segments and one parity segment, each on a different device).

You can successfully run RAID-5 on two devices: one data device(*), plus its parity. The parity check of a single piece of data is that data itself, so it's equivalent to RAID-1 in that configuration. IIRC, the MD-RAID code allows this; I can't remember if the btrfs RAID code does or not, but it probably should do if it doesn't.

> And RAID-6 similarly four devices (two data and two parity).

Similarly for RAID-6 on three devices: it's a single data device(*), plus an XOR-based parity (effectively a mirror), plus a more complex parity calculation.

> Perhaps it's time I get that wiki account and edit some of this stuff
> myself...

Do check the assumptions first. :)

Hugo.

(*) Yeah, OK, rotate the data/parity position as you move through the stripes, because it's not RAID-4.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- In my day, we didn't have fancy high numbers. We had "nowt", ---
"one", "twain" and "multitudes".
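Hugo's point that single-data-device parity degenerates into a mirror is easy to verify with XOR parity, the calculation RAID-5 uses. A minimal illustration (not btrfs or MD-RAID code; the helper name is made up):

```python
from functools import reduce

def xor_parity(blocks):
    """RAID-5 style parity: byte-wise XOR across the data blocks of a stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = bytes([0x12, 0x34, 0x56, 0x78])

# Two-device RAID-5: one data block per stripe, so parity == data (a mirror).
print(xor_parity([data]) == data)   # True

# Three-device RAID-5: parity lets us reconstruct a lost block.
d2 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
p = xor_parity([data, d2])
print(xor_parity([p, d2]) == data)  # True: 'data' recovered from p and d2
```

With a single data block the XOR "sum" is just the block itself, which is exactly the RAID-1 equivalence Hugo describes.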
Duncan
2013-Nov-20 21:13 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
Hugo Mills posted on Wed, 20 Nov 2013 16:52:47 +0000 as excerpted:

> > Perhaps it's time I get that wiki account and edit some of this stuff
> > myself...
>
> Do check the assumptions first. :)

Of course. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Jeff Mahoney
2013-Nov-21 17:14 UTC
Re: Actual effect of mkfs.btrfs -m raid10 </dev/sdX> ... -d raid10 </dev/sdX> ...
On 11/19/13, 12:12 AM, deadhorseconsulting wrote:
> In theory (going by the man page and available documentation, not 100%
> clear) does the following command indeed actually work as advertised
> and specify how metadata should be placed and kept only on the
> "devices" specified after the "-m" flag?
>
> Thus given the following example:
> mkfs.btrfs -L foo -m raid10 <ssd> <ssd> <ssd> <ssd> -d raid10 <rust>
> <rust> <rust> <rust>
>
> Would btrfs stripe/mirror and only keep metadata on the 4 specified SSD devices?
> Likewise then stripe/mirror and only keep data on the specified 4 spinning rust?
>
> In trying and creating this type of setup it appears that data is also
> being stored on the devices specified as "metadata devices". This is
> observed via a "btrfs filesystem show": after committing a large amount
> of data to the filesystem, the data devices have balanced data as
> expected with plenty of free space, but the SSD devices are reported as
> either nearly used or completely used.

Others have noted that's not how it works, but I wanted to add a comment. I had a feature request from a customer recently that was pretty much exactly this. I think it'd be pretty easy to implement by allocating all (except for overhead) of the devices to chunks immediately at mkfs time, bypassing the kernel's dynamic chunk allocation. Since you don't *want* to mix allocation profiles, the usual reason for doing it dynamically doesn't apply.

Extending an existing file system created in such a manner so that the added devices are set up with the right kinds of chunks would require other extensions, though.

I have a few things on my plate right now, but I'll probably dig into this in the next month or so.

-Jeff

--
Jeff Mahoney
SUSE Labs