Duncan
2012-Jan-26 15:41 UTC
btrfs-raid questions I couldn''t find an answer to on the wiki
I'm currently researching an upgrade to (raid1-ed) btrfs from mostly reiserfs on multiple md/raid-1s. I've found reiserfs quite reliable, even thru a period of bad ram and resulting system crashes, ever since data=ordered went in with 2.6.16 or whatever it was. (Thanks, Chris! =:^) I have some questions that don't appear to be addressed well on the wiki yet, or where the wiki info might be dated.

Device hardware is 4 now-aging 300-gig disks with identical gpt-partitioning on all four disks, using multiple 4-way md/raid-1s for most of the system. I'm running gentoo/~amd64 with the linus mainline kernel from git, kernel generally updated 1-2X/wk except during the merge window, so I stay reasonably current. I have btrfs-progs-9999, aka the live-git build, kernel.org mason tree, installed.

The current layout has a total of 16 physical disk partitions on each of the four drives, most of which are 4-disk md/raid1, but with a couple md/raid1s for local cache of redownloadables, etc, thrown in. Some of the mds are further partitioned (mdp), some not. A couple are only 2-disk md/raid1 instead of the usual 4-disk. Most mds have a working and backup copy of exactly the same partitioned size, thus explaining the multitude of partitions, since most of them come in pairs. No lvm, as I'm not running an initrd, which meant it couldn't handle root, and I wasn't confident in my ability to recover the system in an emergency with lvm either, so I was best off without it.

Note that my current plan is to keep the backup sets as reiserfs on md/raid1 for the time being, probably until btrfs comes out of experimental/testing or at least until it further stabilizes, so I'm not too worried about btrfs as long as it's not going to go scribbling outside the partitions established for it. For the worst case I have a boot-tested external-drive backup.

Three questions:

1) My /boot partition and its backup (which I do want to keep separate from root) are only 128 MB each. The wiki recommends 1 gig sizes minimum, but there's some indication that's dated info due to mixed data/metadata mode in recent kernels.

Is a 128 MB btrfs reasonable? What's the mixed-mode minimum recommended, and what is overhead going to look like?

2) The wiki indicates that btrfs-raid1 and raid-10 only mirror data 2-way, regardless of the number of devices. On my now-aging disks, I really do NOT like the idea of only 2-copy redundancy. I'm far happier with the 4-way redundancy, twice for the important stuff since it's in both working and backup mds, altho they're on the same 4-disk set (tho I do have an external drive backup as well, but it's not kept as current).

If true, that's a real disappointment, as I was looking forward to btrfs-raid1 with checksummed integrity management.

Is there really NO way to do more than 2-way btrfs-raid1?

If not, presumably layering it on md/raid1 is possible, but is two-way-btrfs-raid1-on-2-way-md-raid1 or btrfs-on-single-4-way-md-raid1 (presumably still-duped btrfs metadata) recommended? Or perhaps the recommendations for performance and reliability differ in that scenario?

3) How does btrfs space overhead (and ENOSPC issues) compare to reiserfs with its (default) journal and tail-packing? My existing filesystems are 128 MB and 4 GB at the low end, and 90 GB and 16 GB at the high end. At the same size, can I expect to fit more or less data on them? Do the compression options change that by much "IRL"?
Given that I'm using same-sized partitions for my raid-1s, I guess at least /that/ angle of it's covered.

Thanks. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Martin Steigerwald
2012-Jan-28 12:08 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Am Donnerstag, 26. Januar 2012 schrieb Duncan:

> I'm currently researching an upgrade to (raid1-ed) btrfs from mostly reiserfs on multiple md/raid-1s. I've found reiserfs quite reliable, even thru a period of bad ram and resulting system crashes, ever since data=ordered went in with 2.6.16 or whatever it was. (Thanks, Chris! =:^) I have some questions that don't appear to be addressed well on the wiki yet, or where the wiki info might be dated.
>
> Device hardware is 4 now-aging 300-gig disks with identical gpt-partitioning on all four disks, using multiple 4-way md/raid-1s for most of the system. I'm running gentoo/~amd64 with the linus mainline kernel from git, kernel generally updated 1-2X/wk except during the merge window, so I stay reasonably current. I have btrfs-progs-9999, aka the live-git build, kernel.org mason tree, installed.
>
> The current layout has a total of 16 physical disk partitions on each of the four drives, most of which are 4-disk md/raid1, but with a couple md/raid1s for local cache of redownloadables, etc, thrown in. Some of the mds are further partitioned (mdp), some not. A couple are only 2-disk md/raid1 instead of the usual 4-disk. Most mds have a working and backup copy of exactly the same partitioned size, thus explaining the multitude of partitions, since most of them come in pairs. No lvm, as I'm not running an initrd, which meant it couldn't handle root, and I wasn't confident in my ability to recover the system in an emergency with lvm either, so I was best off without it.

Sounds like a quite complex setup.

> Three questions:
>
> 1) My /boot partition and its backup (which I do want to keep separate from root) are only 128 MB each. The wiki recommends 1 gig sizes minimum, but there's some indication that's dated info due to mixed data/metadata mode in recent kernels.
>
> Is a 128 MB btrfs reasonable? What's the mixed-mode minimum recommended, and what is overhead going to look like?

I don't know.

You could try with a loop device. Just create one and mkfs.btrfs on it, mount it and copy your stuff from /boot over, to see whether that works and how much space is left.

On BTRFS I recommend using btrfs filesystem df for more exact figures of space utilization than df would return.

Likewise for RAID 1, just create 2 or 4 BTRFS image files.

You may try with:

  -M, --mixed
      Mix data and metadata chunks together for more efficient space utilization. This feature incurs a performance penalty in larger filesystems. It is recommended for use with filesystems of 1 GiB or smaller.

for smaller partitions (see the manpage of mkfs.btrfs).

> 2) The wiki indicates that btrfs-raid1 and raid-10 only mirror data 2-way, regardless of the number of devices. On my now-aging disks, I really do NOT like the idea of only 2-copy redundancy. I'm far happier with the 4-way redundancy, twice for the important stuff since it's in both working and backup mds, altho they're on the same 4-disk set (tho I do have an external drive backup as well, but it's not kept as current).
>
> If true, that's a real disappointment, as I was looking forward to btrfs-raid1 with checksummed integrity management.

I didn't see anything like this. Would be nice to be able to adapt the redundancy degree where possible.

An idea might be splitting into a delayed synchronisation mirror: Have two BTRFS RAID-1 - original and backup - and have a cronjob with rsync mirroring files every hour or so.
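A rough sketch of what I mean - the mount points and the schedule are only placeholders, and I have not polished this:

    #!/bin/sh
    # e.g. /etc/cron.hourly/mirror-backup (sketch, untested)
    # Mirror the working filesystem onto the backup filesystem.
    #   -a        archive mode (permissions, owners, times, symlinks, ...)
    #   -H -A -X  preserve hard links, ACLs and xattrs
    #   --delete  drop files on the backup that are gone on the original
    rsync -aHAX --delete /mnt/work/ /mnt/backup/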
Later this might be replaced by btrfs send/receive - or by RAID-1 with higher redundancy.

> 3) How does btrfs space overhead (and ENOSPC issues) compare to reiserfs with its (default) journal and tail-packing? My existing filesystems are 128 MB and 4 GB at the low end, and 90 GB and 16 GB at the high end. At the same size, can I expect to fit more or less data on them? Do the compression options change that by much "IRL"? Given that I'm using same-sized partitions for my raid-1s, I guess at least /that/ angle of it's covered.

The efficiency of the compression options depends highly on the kind of data you want to store.

I tried lzo on an external disk with movies, music files, images and software archives. The effect has been minimal, about 3% or so. But for unpacked source trees, lots of clear-text files, likely also virtual machine image files or other nicely compressible data, the effect should be better.

Although BTRFS received a lot of fixes for ENOSPC issues, I would be a bit reluctant with very small filesystems. But that is just a gut feeling. And I do not know whether the option -M from above is tested widely. I doubt it. Maybe someone with more in-depth knowledge can shed some light on this.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Duncan
2012-Jan-29 05:40 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Martin Steigerwald posted on Sat, 28 Jan 2012 13:08:52 +0100 as excerpted:

> Am Donnerstag, 26. Januar 2012 schrieb Duncan:
>
>> The current layout has a total of 16 physical disk partitions on each of the four drives, most of which are 4-disk md/raid1, but with a couple md/raid1s for local cache of redownloadables, etc, thrown in. Some of the mds are further partitioned (mdp), some not. A couple are only 2-disk md/raid1 instead of the usual 4-disk. Most mds have a working and backup copy of exactly the same partitioned size, thus explaining the multitude of partitions, since most of them come in pairs. No lvm, as I'm not running an initrd, which meant it couldn't handle root, and I wasn't confident in my ability to recover the system in an emergency with lvm either, so I was best off without it.
>
> Sounds like a quite complex setup.

It is. I was actually writing a rather more detailed description, but decided few would care and it'd turn into a tl;dr. It was, I think, the 4th rewrite that finally got it down to something reasonable while still hopefully conveying any details that might be corner-cases someone knows something about.

>> Three questions:
>>
>> 1) My /boot partition and its backup (which I do want to keep separate from root) are only 128 MB each. The wiki recommends 1 gig sizes minimum, but there's some indication that's dated info due to mixed data/metadata mode in recent kernels.
>>
>> Is a 128 MB btrfs reasonable? What's the mixed-mode minimum recommended, and what is overhead going to look like?
>
> I don't know.
>
> You could try with a loop device. Just create one and mkfs.btrfs on it, mount it and copy your stuff from /boot over, to see whether that works and how much space is left.

The loop device is a really good idea that hadn't occurred to me. Thanks!

> On BTRFS I recommend using btrfs filesystem df for more exact figures of space utilization than df would return.

Yes. I've read about the various space reports on the wiki so have the general idea, but will of course need to review it again after I get something set up so I can actually type in the commands and see for myself. Still, thanks for the reinforcement. It certainly won't hurt, and of course it's quite possible that others will end up reading this too, so it could end up being a benefit to many people, not just me. =:^)

> You may try with:
>
>   -M, --mixed
>       Mix data and metadata chunks together for more efficient space utilization. This feature incurs a performance penalty in larger filesystems. It is recommended for use with filesystems of 1 GiB or smaller.
>
> for smaller partitions (see the manpage of mkfs.btrfs).

I had actually seen that too, but as it's newer there's significantly less mention of it out there, so the reinforcement is DEFINITELY valued!

I like to have a rather good general sysadmin's idea of what's going on and how everything fits together, as opposed to simply following instructions by rote, before I'm really comfortable with something as critical as filesystem maintenance (keeping in mind that when one really tends to need that knowledge is in an already stressful recovery situation, very possibly without all the usual documentation/net-resources available), and repetition of the basics helps getting comfortable with it, so I'm very happy for it even if it isn't "new" to me. =:^)
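For the record, the loop-device test I'll probably start with is something along these lines (sizes and paths are just placeholders for my /boot case, and I haven't actually run it yet):

    # create a sparse 128 MB image and attach it to a loop device
    truncate -s 128M /tmp/boot-test.img
    losetup /dev/loop0 /tmp/boot-test.img

    # small filesystem with mixed data/metadata chunks
    mkfs.btrfs -M /dev/loop0

    mount /dev/loop0 /mnt/test
    cp -a /boot/. /mnt/test/

    # compare the generic and the btrfs-specific space reports
    df -h /mnt/test
    btrfs filesystem df /mnt/test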
(As mentioned, that need to be comfortable with recovery was a big reason behind my ultimate rejection of LVM; I simply couldn't get comfortable enough with it to be confident of my ability to recover it in an emergency recovery situation.)

>> 2) The wiki indicates that btrfs-raid1 and raid-10 only mirror data 2-way, regardless of the number of devices. On my now-aging disks, I really do NOT like the idea of only 2-copy redundancy. I'm far happier with the 4-way redundancy, twice for the important stuff since it's in both working and backup mds, altho they're on the same 4-disk set (tho I do have an external drive backup as well, but it's not kept as current).
>>
>> If true, that's a real disappointment, as I was looking forward to btrfs-raid1 with checksummed integrity management.
>
> I didn't see anything like this.
>
> Would be nice to be able to adapt the redundancy degree where possible.

I posted the wiki reference in reply to someone else recently. Let's see if I can find it again... Here it is. This is from the bottom of the RAID and data replication section (immediately above "Balancing") on the SysadminGuide page:

>>>>>
With RAID-1 and RAID-10, only two copies of each byte of data are written, regardless of how many block devices are actually in use on the filesystem.
<<<<<

But that's one of the bits that I hoped was stale, and that it allowed setting the number of copies for both data and metadata now. However, I don't see any options along that line to feed to mkfs.btrfs or btrfs * either one, so it would seem it's not there yet, at least not in btrfs-tools as built just a couple days ago from the official/mason tree on kernel.org. I haven't tried the integration tree (aka Hugo Mills' aka darksatanic.net tree). So I guess that wiki quote is still correct. Oh, well... maybe later-this-year/in-a-few-kernel-cycles.

> An idea might be splitting into a delayed synchronisation mirror:
>
> Have two BTRFS RAID-1 - original and backup - and have a cronjob with rsync mirroring files every hour or so. Later this might be replaced by btrfs send/receive - or by RAID-1 with higher redundancy.

That's an interesting idea. However, as I run git kernels and don't accumulate a lot of uptime in any case, what I'd probably do is set up the rsync to be run after a successful boot or mount of the filesystem in question. That way, if it ever failed to boot/mount for whatever reason, I could be relatively confident that the backup version remained intact and usable.

That's actually /quite/ an interesting idea. While I have working and backup partitions for most stuff now, the process remains a manual one, done when I think the system is stable enough and enough time has passed since the last one, so the backup tends to be weeks or months old as opposed to days or hours. This idea, modified to do it once per boot or mount or whatever, would keep the backups far more current and be much less hassle than the manual method I'm using now. So even if I don't immediately switch to btrfs as I had thought I might, I can implement those scripts on the current system now, and then they'll be ready and tested, needing little modification when I switch to btrfs later.

Thanks for the ideas! =:^)

>> 3) How does btrfs space overhead (and ENOSPC issues) compare to reiserfs with its (default) journal and tail-packing? My existing filesystems are 128 MB and 4 GB at the low end, and 90 GB and 16 GB at the high end. At the same size, can I expect to fit more or less data on them?
>> Do the compression options change that by much "IRL"? Given that I'm using same-sized partitions for my raid-1s, I guess at least /that/ angle of it's covered.
>
> The efficiency of the compression options depends highly on the kind of data you want to store.
>
> I tried lzo on an external disk with movies, music files, images and software archives. The effect has been minimal, about 3% or so. But for unpacked source trees, lots of clear-text files, likely also virtual machine image files or other nicely compressible data, the effect should be better.

Back in the day, MS-DOS 6.2 on a 130 MB hard drive, I used to run MS Drivespace (which I guess they partnered with Stacker to get the tech for, then dropped the Stacker partnership like a hot potato after they'd sucked out all the tech they wanted, killing Stacker in the process...), so I'm familiar with the idea of filesystem- or lower-level integrated compression and realize that its effect is definitely variable. I was just wondering what the real-life usage scenarios had come up with, realizing even as I wrote it that the question wasn't one that could be answered in anything but general terms.

But I run Gentoo and thus deal with a lot of build scripts, etc, plus the usual *ix-style plain-text config files, etc, so I expect compression will be pretty good for that. Rather less so on the media and bzip-tarballed binpkgs partitions, certainly, with the home partition likely intermediate since it has a lot of plain text /and/ a lot of pre-compressed data.

Meanwhile, even without a specific answer, just the discussion is helping to clarify my understanding and expectations regarding compression, so thanks.

> Although BTRFS received a lot of fixes for ENOSPC issues, I would be a bit reluctant with very small filesystems. But that is just a gut feeling. And I do not know whether the option -M from above is tested widely. I doubt it.

The only real small filesystem/raid I have is /boot, the 128 MB mentioned. But in thinking it over a bit more since I wrote the initial post, I realized that given the 9-ish gigs of unallocated freespace at the end of the drives, and the fact that most of the partitions are at a quarter-gig offset due to the 128 MB /boot and the combined 128 MB BIOS and UEFI reserved partitions, I have room to expand both by several times, and making the total of all 3 (plus the initial few sectors of unpartitioned boot area) at the beginning of the drive an even 1 gig would give me even gig offsets for all the other partitions/raids as well.

So I'll almost certainly expand /boot from 1/8 gig to 1/4 gig, and maybe to half or even 3/4 gig, just so the offsets for everything else end up at even half- or full-gig boundaries, instead of the quarter-gig I have now. Between that and mixed-mode, I think the potential sizing issue of /boot pretty much disappears. One less problem to worry about. =:^)

So the big sticking point now is two-copy-only data on btrfs-raid1, regardless of the number of drives, and sticking that on top of md/raid is a workaround, tho obviously I'd much rather have a btrfs that could mirror both data and metadata an arbitrary number of ways instead of just two. (There are some hints that metadata at least gets mirrored to all drives in a btrfs-raid1, tho nothing clearly states it one way or another. But without data mirrored to all drives as well, I'm just not comfortable.)
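Concretely, the layered workaround I have in mind would look something like this (partition numbers are placeholders, and this is untested):

    # two 2-way md/raid1 pairs, one per pair of disks
    mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda5 /dev/sdb5
    mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdc5 /dev/sdd5

    # btrfs raid1 for both data and metadata across the two md devices,
    # so 2 btrfs copies x 2 md copies = 4 physical copies of everything
    mkfs.btrfs -m raid1 -d raid1 /dev/md10 /dev/md11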
But while not ideal, the data integrity checking of two-way btrfs-raid1 on two-way md/raid1 should at least be better than entirely unverified 4-way md/raid1, and I expect the rest will come over time, so I could simply upgrade anyway.

OTOH, in general as I've looked closer, I've found btrfs to be rather farther away from exiting experimental than the prominent adoption by various distros had led me to believe, and without N-way mirroring raid, one of the two big features that I was looking forward to (the other being the data integrity checking) just vaporized in front of my eyes, so I may well hold off on upgrading until, potentially, late this year instead of early this year, even if there are workarounds. I'm just not sure it's worth the cost of dealing with the still-experimental aspects.

Either way, however, this little foray into previously unexplored territory leaves me with a MUCH firmer grasp of btrfs. It's no longer simply a vague filesystem with some vague features out there. And now that I'm here, I'll probably stay on the list as well, as I've already answered a number of questions posted by others, based on the material in the wiki and manpages, so I think I have something to contribute, and keeping up with developments will be far easier if I stay involved.

Meanwhile, again and overall, thanks for the answer. I did have most of the bits of info I needed there floating around, but having someone to discuss my questions with has definitely helped solidify the concepts, and you've given me at least two very good suggestions that were entirely new to me and that would have certainly taken me quite some time to come up with on my own, if I'd been able to do so at all, so thanks, indeed! =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Martin Steigerwald
2012-Jan-29 07:55 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Am Sonntag, 29. Januar 2012 schrieb Duncan:

> Martin Steigerwald posted on Sat, 28 Jan 2012 13:08:52 +0100 as excerpted:
>
> > Am Donnerstag, 26. Januar 2012 schrieb Duncan:

[…]

> >> 2) The wiki indicates that btrfs-raid1 and raid-10 only mirror data 2-way, regardless of the number of devices. On my now-aging disks, I really do NOT like the idea of only 2-copy redundancy. I'm far happier with the 4-way redundancy, twice for the important stuff since it's in both working and backup mds, altho they're on the same 4-disk set (tho I do have an external drive backup as well, but it's not kept as current).
> >>
> >> If true, that's a real disappointment, as I was looking forward to btrfs-raid1 with checksummed integrity management.
> >
> > I didn't see anything like this.
> >
> > Would be nice to be able to adapt the redundancy degree where possible.
>
> I posted the wiki reference in reply to someone else recently. Let's see if I can find it again...
>
> Here it is. This is from the bottom of the RAID and data replication section (immediately above "Balancing") on the SysadminGuide page:
>
> With RAID-1 and RAID-10, only two copies of each byte of data are written, regardless of how many block devices are actually in use on the filesystem.

Yes, I have seen that too, some time ago. What I meant by "I didn't see anything like this" is that I didn't see an option to set the number of copies anywhere yet - just like you.

> > An idea might be splitting into a delayed synchronisation mirror:
> >
> > Have two BTRFS RAID-1 - original and backup - and have a cronjob with rsync mirroring files every hour or so. Later this might be replaced by btrfs send/receive - or by RAID-1 with higher redundancy.
>
> That's an interesting idea. However, as I run git kernels and don't accumulate a lot of uptime in any case, what I'd probably do is set up the rsync to be run after a successful boot or mount of the filesystem in question. That way, if it ever failed to boot/mount for whatever reason, I could be relatively confident that the backup version remained intact and usable.
>
> That's actually /quite/ an interesting idea. While I have working and backup partitions for most stuff now, the process remains a manual one, done when I think the system is stable enough and enough time has passed since the last one, so the backup tends to be weeks or months old as opposed to days or hours. This idea, modified to do it once per boot or mount or whatever, would keep the backups far more current and be much less hassle than the manual method I'm using now. So even if I don't immediately switch to btrfs as I had thought I might, I can implement those scripts on the current system now, and then they'll be ready and tested, needing little modification when I switch to btrfs later.
>
> Thanks for the ideas! =:^)

Well, you may even throw a snapshot in in-between. During boot before the backup - or just after mount, before applications / services are started - snapshot the source device. That should give you a fairly consistent backup source. Then do the rsync backup. Then snapshot the backup drive. This way you can access older backups in case the original has gone bad and has been backed up nonetheless. I suggest a cronjob deleting old snapshots after some time again, in order to save space.

I want to replace my backup by something like this.
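Roughly like this - the paths are only placeholders, it assumes the source and /mnt/backup/current are btrfs subvolumes, and I have not actually scripted it yet:

    #!/bin/sh
    # sketch of a snapshot-then-mirror backup run
    STAMP=$(date +%Y-%m-%d_%H%M)

    # 1. snapshot the source so rsync works from a frozen view
    btrfs subvolume snapshot /mnt/work /mnt/work/snap-$STAMP

    # 2. mirror that snapshot onto the backup filesystem
    rsync -aHAX --delete /mnt/work/snap-$STAMP/ /mnt/backup/current/

    # 3. keep a history on the backup side as snapshots
    btrfs subvolume snapshot /mnt/backup/current /mnt/backup/snap-$STAMP

    # a separate cronjob would delete old snapshots to save space, e.g.
    #   btrfs subvolume delete /mnt/backup/snap-<old-stamp>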
There is also rsnapshot for this case, but I find its error reporting suboptimal (no rsync error messages included unless you run it on the command line with option -v) and it uses hardlinks. Maybe it could be adapted to use snapshots?

> > Although BTRFS received a lot of fixes for ENOSPC issues, I would be a bit reluctant with very small filesystems. But that is just a gut feeling. And I do not know whether the option -M from above is tested widely. I doubt it.
>
> The only real small filesystem/raid I have is /boot, the 128 MB mentioned. But in thinking it over a bit more since I wrote the initial post, I realized that given the 9-ish gigs of unallocated freespace at the end of the drives, and the fact that most of the partitions are at a quarter-gig offset due to the 128 MB /boot and the combined 128 MB BIOS and UEFI reserved partitions, I have room to expand both by several times, and making the total of all 3 (plus the initial few sectors of unpartitioned boot area) at the beginning of the drive an even 1 gig would give me even gig offsets for all the other partitions/raids as well.
>
> So I'll almost certainly expand /boot from 1/8 gig to 1/4 gig, and maybe to half or even 3/4 gig, just so the offsets for everything else end up at even half- or full-gig boundaries, instead of the quarter-gig I have now. Between that and mixed-mode, I think the potential sizing issue of /boot pretty much disappears. One less problem to worry about. =:^)

About /boot: I do not see any specific need to convert /boot to BTRFS as well. Since kernels have version numbers attached to them and can be installed side by side, snapshotting /boot does not appear that important to me. So you can just use Ext3 - or, with GRUB 2 or a patched GRUB 1 (some distros do it), Ext4 - for /boot in case BTRFS would not work out.

> So the big sticking point now is two-copy-only data on btrfs-raid1, regardless of the number of drives, and sticking that on top of md/raid is a workaround, tho obviously I'd much rather have a btrfs that could mirror both data and metadata an arbitrary number of ways instead of just two. (There are some hints that metadata at least gets mirrored to all drives in a btrfs-raid1, tho nothing clearly states it one way or another. But without data mirrored to all drives as well, I'm just not comfortable.)

I am with you there. Would be a nice feature. The distributed filesystem Ceph, which likes to be based on BTRFS volumes, has something like that, but Ceph might be overdoing it for your case ;).

> OTOH, in general as I've looked closer, I've found btrfs to be rather farther away from exiting experimental than the prominent adoption by various distros had led me to believe, and without N-way mirroring raid, one of the two big features that I was looking forward to (the other being the data integrity checking) just vaporized in front of my eyes, so I may well hold off on upgrading until, potentially, late this year instead of early this year, even if there are workarounds. I'm just not sure it's worth the cost of dealing with the still-experimental aspects.

I decided for a partial approach. My Amarok machine - an old ThinkPad T23 - is fully upgraded. On my main laptop - a ThinkPad T520 with Intel SSD 320 - I have BTRFS as / and /home still sits on Ext4.

I like this approach, cause I can gain experience with BTRFS while not putting too important data at risk. I can afford to lose /, since I have a backup.
But even with a backup of /home, I'd rather not lose it, since I only back it up every 2-3 weeks, cause it's a manual thing for me at the moment.

At work I have a scratch data partition on BTRFS for Debian package development, compiling stuff and other stuff I do not want to do within the NFS export - that one I back up to an Ext4 partition.

> And now that I'm here, I'll probably stay on the list as well, as I've already answered a number of questions posted by others, based on the material in the wiki and manpages, so I think I have something to contribute, and keeping up with developments will be far easier if I stay involved.

I encourage you to start by putting something you can afford to lose on BTRFS, to gather practical experience.

> Meanwhile, again and overall, thanks for the answer. I did have most

You are welcome. I do not know a definitive answer to the number-of-copies question, but I believe that it's not possible yet to set it.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Goffredo Baroncelli
2012-Jan-29 11:23 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
On Thursday, 26 January, 2012 16:41:32 Duncan wrote:

> 1) My /boot partition and its backup (which I do want to keep separate from root) are only 128 MB each. The wiki recommends 1 gig sizes minimum, but there's some indication that's dated info due to mixed data/metadata mode in recent kernels.
>
> Is a 128 MB btrfs reasonable? What's the mixed-mode minimum recommended, and what is overhead going to look like?

IIRC, the minimum size should be 256MB. Anyway, if you want/allow a separate partition for /boot, I suggest using a classic filesystem like ext3.

BR
G.Baroncelli
Li Zefan
2012-Jan-30 05:49 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Goffredo Baroncelli wrote:

> On Thursday, 26 January, 2012 16:41:32 Duncan wrote:
>> 1) My /boot partition and its backup (which I do want to keep separate from root) are only 128 MB each. The wiki recommends 1 gig sizes minimum, but there's some indication that's dated info due to mixed data/metadata mode in recent kernels.
>>
>> Is a 128 MB btrfs reasonable? What's the mixed-mode minimum recommended, and what is overhead going to look like?
>
> IIRC, the minimum size should be 256MB. Anyway, if you want/allow a separate partition for /boot, I suggest using a classic filesystem like ext3.

The 256MB limitation has been removed.
Kyle Gates
2012-Jan-30 14:58 UTC
RE: btrfs-raid questions I couldn't find an answer to on the wiki
I've been having good luck with my /boot on a separate 1GB RAID1 btrfs filesystem using grub2 (2 disks only! I wouldn't try it with 3). I should note, however, that I'm NOT using compression on this volume because, if I remember correctly, it may not play well with grub (maybe that was just lzo though), and I'm also not using subvolumes, for the same reason.

Kyle
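PS: For reference, a two-device /boot like mine is created with something along these lines (device names are placeholders; a btrfs device scan may be needed before mounting so the kernel sees both devices):

    # btrfs raid1 for both data and metadata across two partitions
    mkfs.btrfs -L boot -m raid1 -d raid1 /dev/sda2 /dev/sdb2

    btrfs device scan
    mount LABEL=boot /boot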
Duncan
2012-Jan-31 05:55 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Kyle Gates posted on Mon, 30 Jan 2012 08:58:41 -0600 as excerpted:

> I've been having good luck with my /boot on a separate 1GB RAID1 btrfs filesystem using grub2 (2 disks only! I wouldn't try it with 3). I should note, however, that I'm NOT using compression on this volume because, if I remember correctly, it may not play well with grub (maybe that was just lzo though), and I'm also not using subvolumes, for the same reason.

Thanks! I'm on grub2 as well. It's still masked on gentoo, but I recently unmasked and upgraded to it, taking advantage of the fact that I have two two-spindle md/raid-1s for /boot and its backup to test and upgrade one of them first, then the other only when I was satisfied with the results on the first set. I'll be using a similar strategy for the btrfs upgrades, only most of my md/raid-1s are 4-spindle, with two sets, working and backup, and I'll upgrade one set first.

I'm going to keep /boot a pair of two-spindle raid-1s, but intend to make them btrfs-raid1s instead of md/raid-1s, and will upgrade one two-spindle set at a time.

More on the status of grub2 btrfs-compression support, based on my research: there is support for btrfs/gzip-compression in at least grub trunk. AFAIK it's gzip-compression in grub-1.99-release and lzo-compression in trunk only, but I may be misremembering, and it's gzip in trunk only and only uncompressed in grub-1.99-release.

In any event, since I'm running 128 MB /boot md/raid-1s without compression now, and intend to increase the size to at least a quarter gig to better align the following partitions, /boot is the one set of btrfs partitions I do NOT intend to enable compression on, so that won't be an issue for me here. And since for /boot I'm running a pair of two-spindle raid1s instead of my usual quad-spindle raid1s, you've confirmed that works as well. =:^)

As a side note, since I only recently did the grub2 upgrade, I've been enjoying its ability to load and read md/raid and my current reiserfs directly, thus giving me the ability to look up info in at least text-based main system config and notes files directly from grub2, without booting into Linux, if for some reason the above-grub boot is hosed or inconvenient at that moment. I just realized that if I want to maintain that direct-from-grub access, I'll need to ensure that the grub2 I'm running groks the btrfs compression scheme I'm using on any filesystem I want grub2 to be able to read.

Hmm... that brings up another question: You mention a 1-gig btrfs-raid1 /boot, but do NOT mention whether you installed it before or after mixed-chunk (data/metadata) support made it into btrfs and became the default for <= 1 gig filesystems.

Can you confirm one way or the other whether you're running mixed-chunk on that 1-gig? I'm not sure whether grub2's btrfs module groks mixed-chunk or not, or whether that even matters to it.

Also, could you confirm mbr-bios vs gpt-bios vs uefi-gpt partitions? I'm using gpt-bios partitioning here, with the special gpt-bios-reserved partition, so grub2-install can build the modules necessary for /boot access directly into its core-image and install that in the gpt-bios-reserved partition. It occurs to me that either uefi-gpt or gpt-bios with the appropriate reserved partition won't have quite the same issues with grub2 reading a btrfs /boot that either mbr-bios or gpt-bios without a reserved bios partition would.
If you're running gpt-bios with a reserved bios partition, that confirms yet another aspect of your setup, compared to mine. If you're running uefi-gpt, not so much, as at least in theory that's best-case. If you're running either mbr-bios or gpt-bios without a reserved bios partition, that's a worst-case, so if it works, then the others should definitely work.

Meanwhile, you're right about subvolumes. I'd not try them on a btrfs /boot, either. (I don't really see the use case for it, for a separate /boot, tho there's certainly a case for a /boot subvolume on a btrfs root, for people doing that.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Kyle Gates
2012-Feb-01 00:22 UTC
RE: btrfs-raid questions I couldn't find an answer to on the wiki
>> I've been having good luck with my /boot on a separate 1GB RAID1 btrfs filesystem using grub2 (2 disks only! I wouldn't try it with 3). I should note, however, that I'm NOT using compression on this volume because, if I remember correctly, it may not play well with grub (maybe that was just lzo though), and I'm also not using subvolumes, for the same reason.
>
> Thanks! I'm on grub2 as well. It's still masked on gentoo, but I recently unmasked and upgraded to it, taking advantage of the fact that I have two two-spindle md/raid-1s for /boot and its backup to test and upgrade one of them first, then the other only when I was satisfied with the results on the first set. I'll be using a similar strategy for the btrfs upgrades, only most of my md/raid-1s are 4-spindle, with two sets, working and backup, and I'll upgrade one set first.
>
> I'm going to keep /boot a pair of two-spindle raid-1s, but intend to make them btrfs-raid1s instead of md/raid-1s, and will upgrade one two-spindle set at a time.
>
> More on the status of grub2 btrfs-compression support, based on my research: there is support for btrfs/gzip-compression in at least grub trunk. AFAIK it's gzip-compression in grub-1.99-release and lzo-compression in trunk only, but I may be misremembering, and it's gzip in trunk only and only uncompressed in grub-1.99-release.

I believe you are correct that btrfs zlib support is included in grub2 version 1.99 and lzo is in trunk. I'll try compressing the files on /boot for one installed kernel with the defrag -czlib option and see how it goes.

Result: Seemed to work just fine.

> In any event, since I'm running 128 MB /boot md/raid-1s without compression now, and intend to increase the size to at least a quarter gig to better align the following partitions, /boot is the one set of btrfs partitions I do NOT intend to enable compression on, so that won't be an issue for me here. And since for /boot I'm running a pair of two-spindle raid1s instead of my usual quad-spindle raid1s, you've confirmed that works as well. =:^)
>
> As a side note, since I only recently did the grub2 upgrade, I've been enjoying its ability to load and read md/raid and my current reiserfs directly, thus giving me the ability to look up info in at least text-based main system config and notes files directly from grub2, without booting into Linux, if for some reason the above-grub boot is hosed or inconvenient at that moment. I just realized that if I want to maintain that direct-from-grub access, I'll need to ensure that the grub2 I'm running groks the btrfs compression scheme I'm using on any filesystem I want grub2 to be able to read.
>
> Hmm... that brings up another question: You mention a 1-gig btrfs-raid1 /boot, but do NOT mention whether you installed it before or after mixed-chunk (data/metadata) support made it into btrfs and became the default for <= 1 gig filesystems.

I don't think I specifically enabled mixed chunk support when I created this filesystem. It was done on a 2.6 kernel sometime in the middle of 2011, iirc.

> Can you confirm one way or the other whether you're running mixed-chunk on that 1-gig? I'm not sure whether grub2's btrfs module groks mixed-chunk or not, or whether that even matters to it.
>
> Also, could you confirm mbr-bios vs gpt-bios vs uefi-gpt partitions?
> I'm using gpt-bios partitioning here, with the special gpt-bios-reserved partition, so grub2-install can build the modules necessary for /boot access directly into its core-image and install that in the gpt-bios-reserved partition. It occurs to me that either uefi-gpt or gpt-bios with the appropriate reserved partition won't have quite the same issues with grub2 reading a btrfs /boot that either mbr-bios or gpt-bios without a reserved bios partition would. If you're running gpt-bios with a reserved bios partition, that confirms yet another aspect of your setup, compared to mine. If you're running uefi-gpt, not so much, as at least in theory that's best-case. If you're running either mbr-bios or gpt-bios without a reserved bios partition, that's a worst-case, so if it works, then the others should definitely work.

Same here: gpt-bios, 1MB partition with the bios_grub flag set (gdisk code EF02) for grub to reside on.

> Meanwhile, you're right about subvolumes. I'd not try them on a btrfs /boot, either. (I don't really see the use case for it, for a separate /boot, tho there's certainly a case for a /boot subvolume on a btrfs root, for people doing that.)
Duncan
2012-Feb-01 06:59 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Kyle Gates posted on Tue, 31 Jan 2012 18:22:51 -0600 as excerpted:

> I don't think I specifically enabled mixed chunk support when I created this filesystem. It was done on a 2.6 kernel sometime in the middle of 2011, iirc.

Yeah, I'd guess that was before mixed-chunk, or at least before it became the default for <=1GiB filesystems, so even if it was supported it wouldn't have been the default. Meaning there's still an open question as to whether grub-1.99 supports mixed-chunk.

It looks like I might get more time to play with it this coming week than I had this past week. I might try some of my own experiments... and whether grub groks mixed-chunk will certainly be among them if I do.

As for those recommending something other than btrfs for /boot, yes, that's a possibility, but I strongly prefer to standardize on a single filesystem type. Right now, that's reiserfs for everything except flash-based USB and legacy floppies (both of which I use ext4 without journaling for, except for the floppies I used to update my BIOS, before my 2003-era mainboard got EOLed; those were freedos images), and ultimately, I hope it'll be btrfs for everything including flash-based (tho perhaps not for legacy floppies, but it has been awhile since I used one of them for anything, after that last BIOS update...).

Of course I'm going to keep reiserfs on my backups for the time being, even if I use btrfs for my working system, since btrfs is still in heavy development, but ultimately I want to go all btrfs just as I'm all reiserfs now, and that would include both /boot 2-spindle raid-1s. Tho if btrfs doesn't work well for that ATM, I can keep /boot as reiserfs for now, since I'm already keeping reiserfs for the backups anyway.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Phillip Susi
2012-Feb-10 19:45 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
On 1/31/2012 12:55 AM, Duncan wrote:

> Thanks! I'm on grub2 as well. It's still masked on gentoo, but I recently unmasked and upgraded to it, taking advantage of the fact that I have two two-spindle md/raid-1s for /boot and its backup to test and upgrade one of them first, then the other only when I was satisfied with the results on the first set. I'll be using a similar strategy for the btrfs upgrades, only most of my md/raid-1s are 4-spindle, with two sets, working and backup, and I'll upgrade one set first.

Why do you want to have a separate /boot partition? Unless you can't boot without it, having one just makes things more complex/problematic. If you do have one, I agree that it is best to keep it ext4, not btrfs.

> Meanwhile, you're right about subvolumes. I'd not try them on a btrfs /boot, either. (I don't really see the use case for it, for a separate /boot, tho there's certainly a case for a /boot subvolume on a btrfs root, for people doing that.)

The Ubuntu installer creates two subvolumes by default when you install on btrfs: one named @, mounted on /, and one named @home, mounted on /home. Grub2 handles this well since the subvols have names in the default root, so grub just refers to /@/boot instead of /boot, and so on. The apt-btrfs-snapshot package makes apt automatically snapshot the root subvol so you can revert after an upgrade. This seamlessly causes grub to go back to the old boot menu without the new kernels too, since it goes back to reading the old grub.cfg in the reverted root subvol.

I have a radically different suggestion you might consider rebuilding your system with. Partition each disk into only two partitions: one for bios_grub, and one for everything else (or just use MBR and skip the bios_grub partition). Give the second partitions to mdadm to make a raid10 array out of. If you use a 2x far and 2x offset instead of the default near layout, you will have an array that can still handle any 2 of the 4 drives failing, will have twice the capacity of a 4-way mirror, almost the same sequential read throughput of a 4-way raid0, and about twice the write throughput of a 4-way mirror. Partition that array up and put your filesystems on it.
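Something along these lines, for example (device names are placeholders; check the mdadm man page and pick the exact layout you want before running anything):

    # one big raid10 across the second partition of each disk,
    # using a non-default layout (f2 = "far", 2 copies)
    mdadm --create /dev/md0 --level=10 --layout=f2 \
          --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2

    # then partition /dev/md0 and create the filesystems on it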
Duncan
2012-Feb-11 05:48 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Phillip Susi posted on Fri, 10 Feb 2012 14:45:43 -0500 as excerpted:

> On 1/31/2012 12:55 AM, Duncan wrote:
>> Thanks! I'm on grub2 as well. It's still masked on gentoo, but I recently unmasked and upgraded to it, taking advantage of the fact that I have two two-spindle md/raid-1s for /boot and its backup to test and upgrade one of them first, then the other only when I was satisfied with the results on the first set. I'll be using a similar strategy for the btrfs upgrades, only most of my md/raid-1s are 4-spindle, with two sets, working and backup, and I'll upgrade one set first.
>
> Why do you want to have a separate /boot partition? Unless you can't boot without it, having one just makes things more complex/problematic. If you do have one, I agree that it is best to keep it ext4, not btrfs.

For a proper picture of the situation, understand that I don't have an initr*, I build everything I need into the kernel and have module loading disabled, and I keep /boot unmounted except when I'm actually installing an upgrade or reconfiguring.

Having a separate /boot means that I can keep it unmounted, and thus free from possible random corruption or accidental partial /boot tree overwrite or deletion, most of the time. It also means that I can emerge (build from sources using the gentoo ebuild script provided for the purpose, and install to the live system) a new grub without fear of corrupting what I actually boot from -- the grub system installation and boot installation remain separate.

A separate /boot is also more robust in terms of filesystem corruption -- if something goes wrong with my rootfs, I can simply boot its backup, from a separate /boot that will not have been corrupted. Similarly, if something goes wrong with /boot (or the bios partition), I can switch drives in the BIOS and boot from the backup /boot, then load my usual rootfs.

Since I'm working with four drives, and both the working /boot and backup /boot are two-spindle md/raid1, one on one pair, one on the other, I have both hardware redundancy via the second spindle of the raid1, and admin-fatfinger redundancy via the backup. However, the rootfs and its backup are both on quad-spindle md/raid1s, thus giving me four separate physical copies each of rootfs and its backup. Because the disk points at a single bootloader, if /boot were on rootfs, all four would point to either the working rootfs or the backup rootfs and would update together, so I'd lose the ability to fall back to the backup /boot.

(Note that I developed the backup /boot policy and solution back on legacy-grub. Grub2 is rather more flexible, particularly with a reasonably roomy GPT BIOS partition, and since each BIOS partition is installed individually, in theory, if a grub2 update failed, I could point the BIOS at a disk I hadn't installed the BIOS-partition update to yet, boot to the limited grub rescue-mode shell, and point it at the /boot in the backup rootfs to load the normal-mode shell, menu, and additional grub2 modules as necessary. However, being able to access a full normal-mode-shell grub2 on the backup /boot, instead of having to resort to the grub2 rescue-mode shell to reach the backup rootfs, does have its benefits.)

One of the nice things about grub2 normal-mode is that it allows (directory and plain-text file) browsing of pretty much anything it has a module for, anywhere on the system.
That's a nice thing to be able to do, but it too is much more robust if /boot isn't part of rootfs, and thus isn't likely to be damaged if the rootfs is. The ability to boot to grub2 and retrieve vital information (even if limited to plain-text file storage) from a system without a working rootfs is a very nice ability to have!

So you see, a separate /boot really does have its uses. =:^)

>> Meanwhile, you're right about subvolumes. I'd not try them on a btrfs /boot, either. (I don't really see the use case for it, for a separate /boot, tho there's certainly a case for a /boot subvolume on a btrfs root, for people doing that.)
>
> The Ubuntu installer creates two subvolumes by default when you install on btrfs: one named @, mounted on /, and one named @home, mounted on /home. Grub2 handles this well since the subvols have names in the default root, so grub just refers to /@/boot instead of /boot, and so on. The apt-btrfs-snapshot package makes apt automatically snapshot the root subvol so you can revert after an upgrade. This seamlessly causes grub to go back to the old boot menu without the new kernels too, since it goes back to reading the old grub.cfg in the reverted root subvol.

Thanks for that "real world" example. Subvolumes and particularly snapshots can indeed be quite useful, but I'd be rather leery of having all that on the same master filesystem. Lose it and you've lost everything, snapshots or no snapshots, if there aren't bootable backups somewhere.

Two experiences inform my partitioning and layout judgment here.

The first one was back before the turn of the century, when I still did MS. In fact, at the time I was running an MSIE public beta for either MSIE 4 or 5, both of which I ran but IDR which it was that this happened with. MS made a change to the MSIE cache indexing, keeping the index file's disk location in memory and direct-writing to it for performance reasons, rather than going the usual filesystem access route. The only problem was, whoever made that change didn't think about MSIE and MS (filesystem) Explorer being effectively merged, and that it ran all the time as it was the shell.

So then it comes time for the regularly scheduled system defrag, and defrag moves the index files out from under MSIE. Then MSIE updates the index, writing to the old location, in the process overwriting whatever's there, causing all sorts of crosslinked files and other destruction.

A number of folks running that beta had un-backed-up data destroyed by that bug (which MS fixed in the release by simply marking the MSIE index files with the system attribute, so defrag wouldn't move them), but all it did to me was screw up a few files on my separate TMP partition, because I HAD a separate TMP partition, and because that's where I had put the IE cache, reasoning that it was temporary data and thus belonged on the TMP partition. That decision saved my bacon!

Both before and after that, I had a number of similar but rather more minor incidents where a strict partitioning policy saved me trouble as well. But that one was all it took to keep me using a strict separate partitioning system to this day.

The second experience was when the AC failed here, in the hot Phoenix summer (routinely 45-48C highs). I had left the system on and gone somewhere. When the AC failed, the outside-in-the-shade temperature was 45C+, inside room temperature was EASILY 60C+, and the drive temperature was very likely 90C+!
The drive of course failed, due to a physical head crash on the still-spinning platters (I could see the grooves when I took it apart, later). When I came home of course the system was frozen, and I turned it off. The CPUs survived, and surprisingly, so did much of the disk. It was only where the physical head-crash grooves were that the data was gone.

I didn't have off-disk backups at that time (for sure I do now!), but I had duplicate backup partitions for anything valuable. Since they weren't mounted, I was able to recover and even continue using the backup rootfs, /usr, etc, for a couple months, until I could buy a new disk and transfer everything over.

Again, what saved me was the fact that I had everything partitioned off. The partitions that weren't actually mounted were pretty much undamaged, save for a few single scratches due to head seeking from one mounted partition to another before the system itself crashed, and unlike the grooves worn in the mounted partitions, the disk's own error correction caught most of that. An fsck fixed things up pretty well, tho I lost a few files.

I hate to think about what would have happened if, instead of separate partitions, each with its own intact metadata, etc, those "unmounted" partitions had been simply subvolumes on a single master filesystem!

True, btrfs has double metadata and both data and metadata checksumming, and I'm *DEFINITELY* looking forward to the additional protection from that (tho only two-way even on a 4-spindle so-called raid1 btrfs was a big disappointment, tho an article I read somewhere says multi-redundancy is scheduled for kernel 3.4 or 3.5), but the plan at least here is for that to be ADDITIONAL protection, NOT AN EXCUSE TO BE SLOPPY!

It's for that reason that I intend to keep proper partitions and probably won't make a lot of use of the subvolume functionality, except as it's used by the snapshot functionality, which I expect I WILL use, for exactly the type of rollback functionality you describe above.

> I have a radically different suggestion you might consider rebuilding your system with. Partition each disk into only two partitions: one for bios_grub, and one for everything else (or just use MBR and skip the bios_grub partition). Give the second partitions to mdadm to make a raid10 array out of. If you use a 2x far and 2x offset instead of the default near layout, you will have an array that can still handle any 2 of the 4 drives failing, will have twice the capacity of a 4-way mirror, almost the same sequential read throughput of a 4-way raid0, and about twice the write throughput of a 4-way mirror. Partition that array up and put your filesystems on it.

I like the raid-10 idea and will have to research it some more. While I understand the idea behind "near" and "far" on raid10, having never used raid-10 I don't "grok" it -- I didn't understand it well enough to have appreciated the lose-any-two possibility before you suggested it. And I'm only running 300-gig disks, and given that I'm running a working and a backup copy of most of those raids/partitions, it's more like 180 or 200 gig of actual storage, with the free space fragmented due to the multiple partitions/raids, so I /am/ running a bit low on free space and could definitely use the doubled space at this point!

But I believe I'll keep multiple raids for much the same reason I keep multiple partitions; it's a FAR more robust solution than having all one's eggs in one RAID basket.
Besides, I actually did try a single partitioned RAID (well, two, one for all the working copies, one for the backups) when I first set up md/raid, and came to the conclusion that the recovery time on that big a raid is rather longer than I like dealing with. Multiple raids, with the ones I'm not using ATM offline, mean I don't have to worry about recovering the entire thing, only the raids that were online and actually dirty at the time of the crash or whatever. And of course write-intent bitmaps mean even shorter recovery time in most cases, so between multiple raids and write-intent bitmaps, a recovery that would take 2-3 hours with my original all-in-one raid setup now often takes < 5 minutes! =:^)

Even with write-intent bitmaps, I'd hate to go back to big all-in-one raids, for recovery reasons alone, and between that and the additional robustness of multiple raids, I just don't see myself doing that any time soon.

But the 2x far, 2x offset raid10 idea, to let me lose any two of the four, is something I will very possibly use, especially now that I've seen that btrfs isn't as close to ready with multi-redundancy as I had hoped, so it'll probably be mid-year at the earliest before I can reasonably play with that.

Thanks again, as that's a very practical suggestion indeed! =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Phillip Susi
2012-Feb-12 00:04 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
On 02/11/2012 12:48 AM, Duncan wrote:
> So you see, a separate /boot really does have its uses. =:^)

True, but booting from removable media is easy too, and a full livecd gives many more recovery options than the grub shell. It is the corrupted root fs that is of much more concern than /boot.

> I like the raid-10 idea and will have to research it some more. I
> understand the idea behind "near" and "far" on raid10 in the abstract,
> but having never used raid-10, I didn't "grok" it well enough to have
> appreciated the lose-any-two possibility before you suggested it.

To grok the other layouts, it helps to think of the simple two disk case. A far layout is like having a raid0 across the first half of the disk, then mirroring the whole first half of the disk onto the second half of the other disk. Offset has the mirror on the next stripe, so each stripe is interleaved with a mirror stripe, rather than having all the originals first and all the mirrors after.

It looks like mdadm won't let you use both at once, so you'd have to go with a 3-way far or offset. Also, I was wrong about the additional space: since you still have 3 copies of all data, you get 4/3 of a single disk's capacity, only about a third more space than the 4-way mirror, but you will get much better throughput since it is striped across all 4 disks. Far gives better sequential read since it reads just like a raid0, but writes have to seek all the way across the disk to write the backup copy. Offset requires seeks between each stripe on read, but the writes don't have to seek to write the backup copy.

You also could do a raid6 and get the double failure tolerance, and two disks' worth of capacity, but not as much read throughput as raid10.

> But I believe I'll keep multiple raids for much the same reason I keep
> multiple partitions: it's a FAR more robust solution than having all
> one's eggs in one RAID basket.

True.

> Besides, I actually did try a single partitioned RAID (well, two, one
> for all the working copies, one for the backups) when I first set up
> md/raid, and came to the conclusion that the recovery time on that big
> a raid is rather longer than I like dealing with. Multiple raids, with
> the ones I'm not using ATM offline, mean I don't have to worry about
> recovering the entire thing, only the raids that were online and
> actually dirty at the time of the crash or whatever. And of course
> write-intent bitmaps mean even shorter recovery time in most cases, so
> between multiple raids and write-intent bitmaps, a recovery that would
> take 2-3 hours with my original all-in-one raid setup now often takes
> < 5 minutes! =:^) Even with write-intent bitmaps, I'd hate to go back
> to big all-in-one raids, for recovery reasons alone, and between that
> and the additional robustness of multiple raids, I just don't see
> myself doing that any time soon.

Depends on what you mean by recovery. Re-adding a drive that you removed will be faster with multiple raids (though write-intent bitmaps also take care of that), but if you actually have a failed disk and have to replace it with a new one, you still have to do a rebuild on all of the raids, so it ends up taking the same total time.
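For concreteness, creating such a 3-copy layout over your 4 disks would look roughly like the following (device and array names are just placeholders, and you'd want to double-check the layout argument against the mdadm man page):

  # 3 copies of every block spread across 4 devices: survives any two
  # device failures and yields about 4/3 of one disk's capacity
  mdadm --create /dev/md0 --level=10 --raid-devices=4 --layout=f3 \
        /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2

Using o3 instead of f3 would give the offset variant of the same 3-copy arrangement.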
Duncan
2012-Feb-12 22:31 UTC
Re: btrfs-raid questions I couldn't find an answer to on the wiki
Phillip Susi posted on Sat, 11 Feb 2012 19:04:41 -0500 as excerpted:

> On 02/11/2012 12:48 AM, Duncan wrote:
>> So you see, a separate /boot really does have its uses. =:^)
>
> True, but booting from removable media is easy too, and a full livecd
> gives many more recovery options than the grub shell.

And a rootfs backup that's simply a copy of rootfs at the time it was taken is even MORE flexible, especially when rootfs is arranged to contain all the packages installed by the package manager. That's what I use.

If misfortune comes my way right in the middle of a critical project and rootfs dies, I simply set root= on the kernel command line at the grub prompt to point at the backup root, and assuming that critical project is on another filesystem (such as home), I can normally continue right where I left off. Full X and desktop, browser, movie players, document editors and viewers, presentation software, all the software I had on the system at the time I made the backup, directly bootable without futzing around with data restores, etc. =:^)

> It is the corrupted root fs that is of much more concern than /boot.

Yes, but to the extent that /boot is the gateway to both the rootfs and its backup... and digging out the removable media is at least a /bit/ more hassle than simply altering the root= (and mdX=) on the kernel command line...

(Incidentally, I've thought for quite some time that I really should have two such backups, so that if misfortune strikes right when I'm making the backup, taking out both the working rootfs and the backup that's mounted and being actively written at the time, I could still boot to the second backup. But I hadn't considered that when I did the current layout. Given that rootfs with the full installed system is only 4.75 gigs (with a quarter-gig /usr/local on the same 5-gig partitioned md/raid), it shouldn't be /too/ difficult to fit that in at my next rearrange, especially if I do the 4/3 raid10s as you suggested (for another ~100 gig, since I'm running 300-gig disks).)

>> I don't "grok" [raid10]
>
> To grok the other layouts, it helps to think of the simple two disk
> case. A far layout is like having a raid0 across the first half of the
> disk, then mirroring the whole first half of the disk onto the second
> half of the other disk. Offset has the mirror on the next stripe, so
> each stripe is interleaved with a mirror stripe, rather than having all
> the originals first and all the mirrors after.
>
> It looks like mdadm won't let you use both at once, so you'd have to go
> with a 3-way far or offset. Also, I was wrong about the additional
> space: since you still have 3 copies of all data, you get 4/3 of a
> single disk's capacity, only about a third more space than the 4-way
> mirror, but you will get much better throughput since it is striped
> across all 4 disks. Far gives better sequential read since it reads
> just like a raid0, but writes have to seek all the way across the disk
> to write the backup copy. Offset requires seeks between each stripe on
> read, but the writes don't have to seek to write the backup copy.

Thanks. That's reasonably clear. Beyond that, I just have to DO IT, to get comfortable enough with it to be confident in my restoration abilities under the stress of an emergency recovery.
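(When I do get around to experimenting, I assume checking what layout an array actually ended up with is just a matter of querying it, along these lines, with the array name illustrative:)

  # for a raid10 array this reports the layout in use, e.g. "near=2"
  # or "far=3", along with per-device state
  mdadm --detail /dev/md0

  # quick overview of all arrays and any resync/recovery in progress
  cat /proc/mdstat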
(That's the reason I ditched the lvm2 layer I had tried: the additional complexity of that one more layer was simply too much for me to be confident in my ability to manage it without fat-fingering under the stress of an emergency recovery situation.)

> You also could do a raid6 and get the double failure tolerance, and two
> disks' worth of capacity, but not as much read throughput as raid10.

Ugh! That's what I tried as my first raid layout, when I was young and foolish, raid-wise! Raid5/6's read-modify-write cycle in order to get the parity data written was simply too much! Combine that with the parallel-job read boost of raid1, and raid1 was a FAR better choice for me than raid6!

Actually, since much of my reading /is/ parallel jobs, and the kernel I/O scheduler and md do such a good job of taking advantage of raid1's parallel-read characteristics, it has seemed I do better with that than with raid0! I do still have one raid0, for gentoo's package tree, the kernel tree, etc, since redundancy doesn't matter for it and the 4X space it gives me there is nice, but for the bigger storage, I'd have it all raid1 (or now raid10) and not have to worry about other levels.

Counterintuitively, even write seems more responsive with raid1 than raid0, in actual use. The only explanation I've come up with for that is that in practice, any large-scale writes tend to be reads from elsewhere as well, and the md scheduler is evidently smart enough to read from one spindle and write to the others, then switch off to catch up writing on the formerly-read spindle, so that there's rather less head seeking between read and write than there'd be otherwise. Since raid0 only has the single copy, the data MUST be read from whatever spindle it resides on, eliminating the kernel/md's ability to smart-schedule, favoring one spindle at a time for reads to eliminate seeks.

For that reason, I've always thought that if I went to raid10, I'd try to do it with at least triple spindles at the raid1 level, hoping to get both the additional redundancy and the parallel scheduling of raid1, while also getting the thruput and size of the stripes. Now you've pointed out that I can do essentially that with a triple mirror on quad-spindle raid10, and I'm seeing new possibilities open up...

>> Multiple raids, with the ones I'm not using ATM offline, mean I don't
>> have to worry about recovering the entire thing, only the raids that
>> were online and actually dirty at the time of the crash or whatever.
>
> Depends on what you mean by recovery. Re-adding a drive that you
> removed will be faster with multiple raids (though write-intent bitmaps
> also take care of that), but if you actually have a failed disk and
> have to replace it with a new one, you still have to do a rebuild on
> all of the raids, so it ends up taking the same total time.

Very good point. I was talking about re-adding. For various reasons, including hardware power-on stability latency (these particular disks apparently take a bit to stabilize after power-on, and suspend-to-disk often kicks a disk on resume due to ID-match failure, so it then appears as say sde instead of sdb; I've solved that problem by simply leaving the system on or shutting it down instead of using suspend-to-disk), faulty memory at one point causing kernel panics, and the fact that I run live-git kernels, I've had rather more experience with re-add than I would have liked.
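(For the record, the re-adds in question are nothing fancier than something like the following, with array and device names purely illustrative; the internal write-intent bitmap that keeps them fast is a one-time --grow operation per array:)

  # re-add a device that dropped out of an array; with a write-intent
  # bitmap only the dirty regions get resynced
  mdadm /dev/md3 --re-add /dev/sdb7

  # enable an internal write-intent bitmap on an existing array
  mdadm --grow /dev/md3 --bitmap=internal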
But that has made me QUITE confident in my ability to recover from either that or a dead drive, since I've had rather more practice than I anticipated. All my experience has been with re-add, though, so that's what I was thinking about when I said recovery. Thanks for pointing out the distinction I'd omitted, as I was really quite oblivious to it. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman