Hi,

Chris Mason has posted a bunch of interesting updates to the Project_ideas wiki page. If you're interested in working on any of these, feel free to speak up and ask for more information if you need it. Here are the new sections, for the curious:

== Block group reclaim ==

The split between data and metadata block groups means that we sometimes have mostly empty block groups dedicated to only data or metadata. As files are deleted, we should be able to reclaim these and put the space back into the free space pool.

We also need rebalancing ioctls that focus only on specific raid levels.

== RBtree lock contention ==

Btrfs uses a number of rbtrees to index in-memory data structures. Some of these are dominated by reads, and the lock contention from searching them is showing up in profiles. We need to look into an RCU and sequence counter combination to allow lockless reads.

== Forced readonly mounts on errors ==

The sources have a number of BUG() statements that could easily be replaced with code to force the filesystem readonly. This is the first step in being more fault tolerant of disk corruptions. The initial work is to add a framework for generating errors that should result in filesystems going readonly; the conversion from BUG() to that framework can then happen incrementally.

== Dedicated metadata drives ==

We're able to split data and metadata IO very easily. Metadata tends to be dominated by seeks and for many applications it makes sense to put the metadata onto faster SSDs.

== Readonly snapshots ==

Btrfs snapshots are read/write by default. A small number of checks would allow us to make readonly snapshots instead.

== Per file / directory controls for COW and compression ==

Data compression and data cow are controlled across the entire FS by mount options right now. ioctls are needed to set this on a per file or per directory basis. This has been proposed previously, but VFS developers wanted us to use generic ioctls rather than btrfs-specific ones. Can we use some of the same ioctls that ext4 uses? This task is mostly organizational rather than technical.

== Chunk tree backups ==

The chunk tree is critical to mapping logical block numbers to physical locations on the drive. We need to make the mappings discoverable via a block device scan so that we can recover from corrupted chunk trees.

== Rsync integration ==

Now that we have code to efficiently find newly updated files, we need to tie it into tools such as rsync and dirvish. (For bonus points, we can even tell rsync _which blocks_ inside a file have changed. Would need to work with the rsync developers on that one.)

== Atomic write API ==

The Btrfs implementation of data=ordered only updates metadata to point to new data blocks when the data IO is finished. This makes it easy for us to implement atomic writes of an arbitrary size. Some hardware is coming out that can support this down in the block layer as well.

== Backref walking utilities ==

Given a block number on a disk, the Btrfs metadata can find all the files and directories that use or care about that block. Some utilities to walk these back refs and print the results would help debug corruptions.

Given an inode, the Btrfs metadata can find all the directories that point to the inode. We should have utils to walk these back refs as well.

== Scrubbing ==

We need a periodic daemon that can walk the filesystem and verify that the contents of all copies of all allocated blocks are correct.
This is mostly equivalent to "find | xargs cat >/dev/null", but with the constraint that we don't want to thrash the page cache, so direct I/O should be used instead.

If we find a bad copy during this process, and we're using RAID, we should queue up an overwrite of the bad copy with a good one. The overwrite can happen in-place.

== Drive swapping ==

Right now when we replace a drive, we do so with a full FS balance. If we are inserting a new drive to remove an old one, we can do a much less expensive operation where we just put valid copies of all the blocks onto the new drive.

== IO error tracking ==

As we get bad csums or IO errors from drives, we should track the failures and kick out the drive if it is clearly going bad.

== Random write performance ==

Random writes introduce small extents and fragmentation. We need new file layout code to improve this and defrag the files as they are being changed.

== Free inode number cache ==

As the filesystem fills up, finding a free inode number will become expensive. This should be cached the same way we do free blocks.

== Snapshot aware defrag ==

As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well.

== Btree lock contention ==

The btree locks, especially on the root block, can be very hot. We need to improve this, especially in read-mostly workloads.

== Changing RAID levels ==

We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.

== DISCARD utilities ==

For SSDs with discard support, we could use a scrubber that goes through the fs and performs discard on anything that is unused. You could first use the balance operation to compact data to the front of the drive, then discard the rest.

-- 
Chris Ball <cjb@laptop.org>
One Laptop Per Child
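To make the direct I/O constraint concrete, here is a minimal userspace sketch of the "cat with O_DIRECT" idea -- not the proposed in-kernel scrubber. The scrub_read() helper, the 4 KiB alignment and the 1 MiB read size are all illustrative assumptions:

/* Read files through the block layer without populating the page cache.
 * Plain POSIX userspace code; a real scrubber would also verify csums
 * and, on failure, rewrite the bad copy from a good mirror. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int scrub_read(const char *path)
{
        void *buf;
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0) {
                perror(path);
                return -1;
        }
        /* O_DIRECT requires the buffer to be aligned to the logical
         * block size; 4096 bytes is assumed here. */
        if (posix_memalign(&buf, 4096, 1 << 20)) {
                close(fd);
                return -1;
        }
        /* Read and discard the data in 1 MiB chunks. */
        while ((n = read(fd, buf, 1 << 20)) > 0)
                ;
        if (n < 0)
                perror(path);
        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
}

int main(int argc, char **argv)
{
        for (int i = 1; i < argc; i++)
                scrub_read(argv[i]);
        return 0;
}

Fed a file list (e.g. from find), this reads every allocated extent without thrashing the page cache, which is the property the wiki entry asks for.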
On Tue, Nov 16, 2010 at 10:19:45PM -0500, Chris Ball wrote:
> Hi,
>
> Chris Mason has posted a bunch of interesting updates to the Project_ideas wiki page. If you're interested in working on any of these, feel free to speak up and ask for more information if you need it. Here are the new sections, for the curious:
>
> == Block group reclaim ==
>
> The split between data and metadata block groups means that we sometimes have mostly empty block groups dedicated to only data or metadata. As files are deleted, we should be able to reclaim these and put the space back into the free space pool.
>
> We also need rebalancing ioctls that focus only on specific raid levels.

> == Changing RAID levels ==
>
> We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.

I would be interested in the rebalancing ioctls, and in RAID level management. I'm still very much trying to learn the basics, though, so I may go very slowly at first...

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- We demand rigidly defined areas of doubt and uncertainty! ---
On Wed, Nov 17, 2010 at 15:31, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:
> On Tue, Nov 16, 2010 at 10:19:45PM -0500, Chris Ball wrote:
>> == Changing RAID levels ==
>>
>> We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.
>
> I would be interested in the rebalancing ioctls, and in RAID level management. I'm still very much trying to learn the basics, though, so I may go very slowly at first...
>
> Hugo.

Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.

Mostly this would involve a discussion about what terms would make most sense, though some changes in the behaviour of btrfs redundancy modes may be warranted if they make things more intuitive.

I could help you make these changes in your patches, or write my own patches against yours, though I'm also completely new to kernel development.

Best regards,

Bart
On 17 November 2010 at 16:12, Bart Noordervliet wrote:
> Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.

I would agree with that.

-- 
Xavier Nicollet
On Wed, Nov 17, 2010 at 7:12 AM, Bart Noordervliet <bart@noordervliet.net> wrote:
> On Wed, Nov 17, 2010 at 15:31, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:
>> On Tue, Nov 16, 2010 at 10:19:45PM -0500, Chris Ball wrote:
>>> == Changing RAID levels ==
>>>
>>> We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.
>>
>> I would be interested in the rebalancing ioctls, and in RAID level management. I'm still very much trying to learn the basics, though, so I may go very slowly at first...
>>
>> Hugo.
>
> Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.
>
> Mostly this would involve a discussion about what terms would make most sense, though some changes in the behaviour of btrfs redundancy modes may be warranted if they make things more intuitive.
>
> I could help you make these changes in your patches, or write my own patches against yours, though I'm also completely new to kernel development.

That would inherently solve the need to convert between dup and raid1 as well. Why those are separate and why dup does not become raid1 when there are N > 1 drives is beyond me.
On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
> Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.

Well, our current RAID modes are:

 * 1 Copy ("SINGLE")
 * 2 Copies ("DUP")
 * 2 Copies, different spindles ("RAID1")
 * 1 Copy, 2 Stripes ("RAID0")
 * 2 Copies, 2 Stripes [each] ("RAID10")

The forthcoming RAID5/6 code will expand on that, with

 * 1 Copy, n Stripes + 1 Parity ("RAID5")
 * 1 Copy, n Stripes + 2 Parity ("RAID6")

(I'm not certain how "n" will be selected -- it could be a config option, or simply selected on the basis of the number of spindles/devices currently in the FS).

We could further postulate a RAID50/RAID60 mode, which would be

 * 2 Copies, n Stripes + 1 Parity
 * 2 Copies, n Stripes + 2 Parity

For brevity, we could collapse these names down to: 1C, 2C, 2CR, 1C2S, 2C2S, 1CnS1P, 1CnS2P, 2CnS1P, 2CnS2P. However, that's probably a bit too condensed for useful readability. I'd support some set of terms based on this taxonomy, though, as it's fairly extensible, and tells you the details of the duplication strategy in question.

> Mostly this would involve a discussion about what terms would make most sense, though some changes in the behaviour of btrfs redundancy modes may be warranted if they make things more intuitive.

Consider the above a first suggestion. :)

> I could help you make these changes in your patches, or write my own patches against yours, though I'm also completely new to kernel development.

Probably best to keep the kernel internals unchanged for this particular issue, as they don't make much difference to the naming, but patches to the userspace side of things (mkfs.btrfs and btrfs fi df specifically) should be fairly straightforward.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- <gdb> The enemy have elected for Death by Powerpoint. That's ---
                          what they shall get.
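One attraction of the copies/stripes/parity taxonomy above is that the short labels can be derived mechanically. The sketch below is only a thought experiment -- struct repl_profile, its field names and profile_name() are invented here, not btrfs code -- showing how the 1C/2CR/1CnS2P-style names fall out of a small tuple:

/* Encode a replication strategy as (copies, stripes, parity, spread)
 * and derive the short label from it. */
#include <stdio.h>

struct repl_profile {
        int copies;   /* complete copies of the data                     */
        int stripes;  /* data stripes per copy; 0 stands for "n"         */
        int parity;   /* parity stripes per copy                         */
        int spread;   /* nonzero: copies must sit on different spindles  */
};

static void profile_name(const struct repl_profile *p, char *buf, size_t len)
{
        int off = snprintf(buf, len, "%dC%s", p->copies, p->spread ? "R" : "");

        if (p->stripes == 0)
                off += snprintf(buf + off, len - off, "nS");
        else if (p->stripes > 1)
                off += snprintf(buf + off, len - off, "%dS", p->stripes);
        if (p->parity)
                snprintf(buf + off, len - off, "%dP", p->parity);
}

int main(void)
{
        const struct repl_profile raid1 = { .copies = 2, .stripes = 1, .spread = 1 };
        const struct repl_profile raid6 = { .copies = 1, .stripes = 0, .parity = 2 };
        char name[16];

        profile_name(&raid1, name, sizeof(name));
        printf("RAID1 -> %s\n", name);   /* prints "2CR"    */
        profile_name(&raid6, name, sizeof(name));
        printf("RAID6 -> %s\n", name);   /* prints "1CnS2P" */
        return 0;
}

Whatever user-visible names are eventually chosen, keeping them derivable from a small tuple like this is what makes the scheme extensible to layouts that have no traditional RAID name.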
On 11/17/2010 05:56 PM, Hugo Mills wrote:
> On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
>> Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.
>
> Well, our current RAID modes are:
>
> * 1 Copy ("SINGLE")
> * 2 Copies ("DUP")
> * 2 Copies, different spindles ("RAID1")
> * 1 Copy, 2 Stripes ("RAID0")
> * 2 Copies, 2 Stripes [each] ("RAID10")
>
> The forthcoming RAID5/6 code will expand on that, with
>
> * 1 Copy, n Stripes + 1 Parity ("RAID5")
> * 1 Copy, n Stripes + 2 Parity ("RAID6")
>
> (I'm not certain how "n" will be selected -- it could be a config option, or simply selected on the basis of the number of spindles/devices currently in the FS).
>
> We could further postulate a RAID50/RAID60 mode, which would be
>
> * 2 Copies, n Stripes + 1 Parity
> * 2 Copies, n Stripes + 2 Parity

Since BTRFS is already doing some relatively radical things, I would like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable for arrays bigger than about 5TB with disks that have a specified error rate of 10^-14. RAID6 pushes that problem a little further away, but in the longer term, I would argue that RAID (n+m) would work best. We specify that of (n+m) disks in the array, we want n data disks and m redundancy disks. If this is implemented in a generic way, then there won't be a need to implement additional RAID modes later.

Gordan
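For reference, a back-of-envelope sketch of what a generic n+m layout buys you, assuming an MDS erasure code (e.g. Reed-Solomon) so that any m device losses are survivable. The nm_summary() helper and the 2 TB example device size are made up for illustration:

/* n data devices plus m redundancy devices: usable fraction is
 * n/(n+m), and up to m device failures are tolerated. */
#include <stdio.h>

static void nm_summary(int n, int m, double dev_tb)
{
        double raw = (n + m) * dev_tb;
        double usable = n * dev_tb;

        printf("%d+%d over %.0f TB devices: %.1f TB raw, %.1f TB usable "
               "(%.0f%%), survives %d failures\n",
               n, m, dev_tb, raw, usable, 100.0 * n / (n + m), m);
}

int main(void)
{
        nm_summary(4, 1, 2.0);   /* classic RAID5-like layout   */
        nm_summary(4, 2, 2.0);   /* classic RAID6-like layout   */
        nm_summary(8, 3, 2.0);   /* no traditional name needed  */
        return 0;
}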
On 17.11.2010 18:56, Hugo Mills wrote:
> On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
>> Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.
>
> Well, our current RAID modes are:
>
> * 1 Copy ("SINGLE")
> * 2 Copies ("DUP")
> * 2 Copies, different spindles ("RAID1")
> * 1 Copy, 2 Stripes ("RAID0")
> * 2 Copies, 2 Stripes [each] ("RAID10")
>
> The forthcoming RAID5/6 code will expand on that, with
>
> * 1 Copy, n Stripes + 1 Parity ("RAID5")
> * 1 Copy, n Stripes + 2 Parity ("RAID6")
>
> (I'm not certain how "n" will be selected -- it could be a config option, or simply selected on the basis of the number of spindles/devices currently in the FS).

Just one question on "small n": If one has N = 3*k >= 6 spindles, then RAID5 with n = N/2-1 results in something like RAID50? So having an option for "small n" might realize RAID50 given the right choice for n.

> We could further postulate a RAID50/RAID60 mode, which would be
>
> * 2 Copies, n Stripes + 1 Parity
> * 2 Copies, n Stripes + 2 Parity

Isn't this RAID51/RAID61 (or 15/16, unsure on how to put it) and would RAID50/RAID60 correspond to

 * 2 Stripes, n Stripes + 1 Parity
 * 2 Stripes, n Stripes + 2 Parity

> For brevity, we could collapse these names down to: 1C, 2C, 2CR, 1C2S, 2C2S, 1CnS1P, 1CnS2P, 2CnS1P, 2CnS2P. However, that's probably a bit too condensed for useful readability. I'd support some set of terms based on this taxonomy, though, as it's fairly extensible, and tells you the details of the duplication strategy in question.
>
>> Mostly this would involve a discussion about what terms would make most sense, though some changes in the behaviour of btrfs redundancy modes may be warranted if they make things more intuitive.
>
> Consider the above a first suggestion. :)
>
>> I could help you make these changes in your patches, or write my own patches against yours, though I'm also completely new to kernel development.
>
> Probably best to keep the kernel internals unchanged for this particular issue, as they don't make much difference to the naming, but patches to the userspace side of things (mkfs.btrfs and btrfs fi df specifically) should be fairly straightforward.
>
> Hugo.

Andreas
On Wed, Nov 17, 2010 at 07:14:47PM +0100, Andreas Philipp wrote:
> On 17.11.2010 18:56, Hugo Mills wrote:
> > On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
> >> Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.
> >
> > Well, our current RAID modes are:
> >
> > * 1 Copy ("SINGLE")
> > * 2 Copies ("DUP")
> > * 2 Copies, different spindles ("RAID1")
> > * 1 Copy, 2 Stripes ("RAID0")
> > * 2 Copies, 2 Stripes [each] ("RAID10")
> >
> > The forthcoming RAID5/6 code will expand on that, with
> >
> > * 1 Copy, n Stripes + 1 Parity ("RAID5")
> > * 1 Copy, n Stripes + 2 Parity ("RAID6")
> >
> > (I'm not certain how "n" will be selected -- it could be a config option, or simply selected on the basis of the number of spindles/devices currently in the FS).
>
> Just one question on "small n": If one has N = 3*k >= 6 spindles, then RAID5 with n = N/2-1 results in something like RAID50? So having an option for "small n" might realize RAID50 given the right choice for n.

I see what you're getting at, but actually, that would just be RAID-5 with small n. It merely happens to spread chunks out over more spindles than the minimum n+1 required to give you what you asked for. (See the explanation below for why).

> > We could further postulate a RAID50/RAID60 mode, which would be
> >
> > * 2 Copies, n Stripes + 1 Parity
> > * 2 Copies, n Stripes + 2 Parity
>
> Isn't this RAID51/RAID61 (or 15/16, unsure on how to put it) and would RAID50/RAID60 correspond to

Errr... yes, you're right. My mistake. Although... again, see the conclusion below. :)

> * 2 Stripes, n Stripes + 1 Parity
> * 2 Stripes, n Stripes + 2 Parity

I'm not sure talking about RAID50-like things (as you state above) makes much sense, given the internal data structures that btrfs uses:

As far as I know(*), data is firstly allocated in chunks of about 1GiB per device. Chunks are grouped together to give you replication. So, for a RAID-0 or RAID-1 arrangement, chunks are allocated in pairs, picked from different devices. For RAID-10, they're allocated in quartets, again on different devices. For RAID-5, they'd be allocated in groups of n+1. For RAID-61, we'd use 2n+4 chunks in an allocation.

For replication strategies where it matters (anything other than DUP, SINGLE, RAID-1 so far), the chunks are then subdivided into stripes of a fixed width. Data written to the disk is spread across the stripes in an appropriate manner.

From this point of view, RAID50 and RAID51 look much the same, unless the stripe size for the "5" is different to the stripe size for the "0" or "1". I'm not sure that's the case. If the stripe sizes are the same, you'll basically get the same layout of data across the 2n+2 chunks -- it's just that (possibly) the internal labels of the chunks which indicate which bit of data they're holding in the pattern will be different.

Hugo.

(*) I could be wrong, hopefully someone will correct me if so.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
      --- A cross? Oy vey, have you picked the wrong vampire! ---
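A tiny illustration of the chunk-group arithmetic described above -- based only on that description, not on the actual btrfs allocator, and chunks_per_allocation() is an invented helper: each allocation grabs copies * (data stripes + parity stripes) chunks, one per device, which reproduces the pairs, quartets, n+1 and 2n+4 figures:

/* Chunks needed per allocation group for a given replication strategy. */
#include <stdio.h>

static int chunks_per_allocation(int copies, int data_stripes, int parity)
{
        return copies * (data_stripes + parity);
}

int main(void)
{
        int n = 4;  /* example stripe count for the parity modes */

        printf("RAID-1:  %d chunks\n", chunks_per_allocation(2, 1, 0)); /* pairs    */
        printf("RAID-10: %d chunks\n", chunks_per_allocation(2, 2, 0)); /* quartets */
        printf("RAID-5:  %d chunks\n", chunks_per_allocation(1, n, 1)); /* n + 1    */
        printf("RAID-61: %d chunks\n", chunks_per_allocation(2, n, 2)); /* 2n + 4   */
        return 0;
}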
On 11/17/2010 10:07 AM, Gordan Bobic wrote:
> On 11/17/2010 05:56 PM, Hugo Mills wrote:
>> On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
>>> Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.
>>
>> Well, our current RAID modes are:
>>
>> * 1 Copy ("SINGLE")
>> * 2 Copies ("DUP")
>> * 2 Copies, different spindles ("RAID1")
>> * 1 Copy, 2 Stripes ("RAID0")
>> * 2 Copies, 2 Stripes [each] ("RAID10")
>>
>> The forthcoming RAID5/6 code will expand on that, with
>>
>> * 1 Copy, n Stripes + 1 Parity ("RAID5")
>> * 1 Copy, n Stripes + 2 Parity ("RAID6")
>>
>> (I'm not certain how "n" will be selected -- it could be a config option, or simply selected on the basis of the number of spindles/devices currently in the FS).
>>
>> We could further postulate a RAID50/RAID60 mode, which would be
>>
>> * 2 Copies, n Stripes + 1 Parity
>> * 2 Copies, n Stripes + 2 Parity
>
> Since BTRFS is already doing some relatively radical things, I would like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable for arrays bigger than about 5TB with disks that have a specified error rate of 10^-14. RAID6 pushes that problem a little further away, but in the longer term, I would argue that RAID (n+m) would work best. We specify that of (n+m) disks in the array, we want n data disks and m redundancy disks. If this is implemented in a generic way, then there won't be a need to implement additional RAID modes later.

Not to throw a wrench in the works, but has anyone given any thought as to how to best deal with SSD-based RAIDs? Normal RAID algorithms will maximize synchronized failures of those devices. Perhaps there's a chance here to fix that issue?

I like the RAID n+m mode of thinking though. It'd also be nice to have spares which are spun-down until needed.

Lastly, perhaps there's also a chance here to employ SSD-based caching when doing RAID, as is done in the most recent RAID controllers? Exposure to media failures in the SSD does make me nervous about that though. Does anyone know if those controllers write some sort of extra data to the SSD for redundancy/error recovery purposes?

--Bart
Bart Kus wrote:
> On 11/17/2010 10:07 AM, Gordan Bobic wrote:
>> On 11/17/2010 05:56 PM, Hugo Mills wrote:
>>> On Wed, Nov 17, 2010 at 04:12:29PM +0100, Bart Noordervliet wrote:
>>>> Can I suggest we combine this new RAID level management with a modernisation of the terminology for storage redundancy, as has been discussed previously in the "Raid1 with 3 drives" thread of March this year? I.e. abandon the burdened raid* terminology in favour of something that makes more sense for a filesystem.
>>>
>>> Well, our current RAID modes are:
>>>
>>> * 1 Copy ("SINGLE")
>>> * 2 Copies ("DUP")
>>> * 2 Copies, different spindles ("RAID1")
>>> * 1 Copy, 2 Stripes ("RAID0")
>>> * 2 Copies, 2 Stripes [each] ("RAID10")
>>>
>>> The forthcoming RAID5/6 code will expand on that, with
>>>
>>> * 1 Copy, n Stripes + 1 Parity ("RAID5")
>>> * 1 Copy, n Stripes + 2 Parity ("RAID6")
>>>
>>> (I'm not certain how "n" will be selected -- it could be a config option, or simply selected on the basis of the number of spindles/devices currently in the FS).
>>>
>>> We could further postulate a RAID50/RAID60 mode, which would be
>>>
>>> * 2 Copies, n Stripes + 1 Parity
>>> * 2 Copies, n Stripes + 2 Parity
>>
>> Since BTRFS is already doing some relatively radical things, I would like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable for arrays bigger than about 5TB with disks that have a specified error rate of 10^-14. RAID6 pushes that problem a little further away, but in the longer term, I would argue that RAID (n+m) would work best. We specify that of (n+m) disks in the array, we want n data disks and m redundancy disks. If this is implemented in a generic way, then there won't be a need to implement additional RAID modes later.
>
> Not to throw a wrench in the works, but has anyone given any thought as to how to best deal with SSD-based RAIDs? Normal RAID algorithms will maximize synchronized failures of those devices. Perhaps there's a chance here to fix that issue?

The wear-out failure of SSDs (the exact failure you are talking about) is very predictable. The current generation of SSDs provide a reading via SMART of how much life (in %) there is left in the SSD. When this gets down to single figures, the disks should be replaced. Provided that the disks are correctly monitored, it shouldn't be an issue.

On a related issue, I am not convinced that wear-out based SSD failure is an issue provided that:

1) There is at least a rudimentary amount of wear leveling done in the firmware. This is the case even for cheap CF/SD card media, and is not hard to implement. And considering I recently got a number of cheap-ish 32GB CF cards with lifetime warranty, it's safe to assume they will have wear leveling built in, or Kingston will rue the day they sold them with lifetime warranty. ;)

2) Reasonable effort is made to not put write-heavy things onto SSDs (think /tmp, /var/tmp, /var/lock, /var/run, swap, etc.). These can safely be put on tmpfs instead, and for swap you can use ramzswap (compcache). You'll get both better performance and prolong the life of the SSD significantly. Switching off atime on the FS helps a lot, too. And switching off journaling can make a difference of over 50% on metadata-heavy operations.

And assuming that you write 40GB of data per day to your 40GB SSD (unlikely for most applications), you'll still get a 10,000 day life expectancy on that disk. That's 30 years. Does anyone still use any disks from 30 years ago?
What about 20 years ago? 10? The rate of growth of RAM and storage in computers has increased by about 10x in the last 10 years. It seems unlikely that our current generation of SSDs will still be useful in 10 years' time, let alone 30.

> I like the RAID n+m mode of thinking though. It'd also be nice to have spares which are spun-down until needed.
>
> Lastly, perhaps there's also a chance here to employ SSD-based caching when doing RAID, as is done in the most recent RAID controllers?

Tiered storage capability would be nice. What would it take to keep statistics on how frequently various file blocks are accessed, and put the most frequently accessed file blocks on SSD? It would be nice if this could be done by the accesses/day with some reasonable limit on the number of days over which accesses are considered.

> Exposure to media failures in the SSD does make me nervous about that though.

You'd need a pretty substantial churn rate for that to happen quickly. With the caching strategy I described above, churn should be much lower than the naive LRU while providing a much better overall hit rate.

> Does anyone know if those controllers write some sort of extra data to the SSD for redundancy/error recovery purposes?

SSDs handle that internally. The predictability of failures due to wear-out on SSDs makes this relatively easy to handle.

Another thing that would be nice to have - defrag with the ability to specify where particular files should be kept. One thing I've been pondering writing for ext2 when I have a month of spare time is a defrag utility that can be passed an ordered list of files to put at the very front of the disk. Such a list could easily be generated using inotify. This would log all file accesses during the boot/login process. Defragging the disk in such a way that all files read-accessed from the disk are laid out sequentially with no gaps at the front of the disk would ensure that boot times are actually faster than on an SSD*.

*Access time on a decent SSD is about 100us. With pre-fetch on a rotating disk, most, if not all, of the data that is going to be accessed is going to get pre-cached by the time we even ask for it, so it might even be faster. This might actually provide higher performance.

Gordan
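For what it's worth, the arithmetic behind the 10,000-day figure a few paragraphs up, assuming roughly 10,000 program/erase cycles per cell and ideal wear leveling (both are assumptions; real endurance varies considerably by flash type):

  lifetime (days) ~= (capacity x P/E cycles) / (data written per day)
                   = (40 GB x 10,000) / (40 GB/day)
                   = 10,000 days, i.e. roughly 27 years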
On Wed, Nov 17, 2010 at 19:07, Gordan Bobic <gordan@bobich.net> wrote:
> Since BTRFS is already doing some relatively radical things, I would like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable for arrays bigger than about 5TB with disks that have a specified error rate of 10^-14. RAID6 pushes that problem a little further away, but in the longer term, I would argue that RAID (n+m) would work best. We specify that of (n+m) disks in the array, we want n data disks and m redundancy disks. If this is implemented in a generic way, then there won't be a need to implement additional RAID modes later.

I presume you're talking about the uncaught read errors that make many people avoid RAID5. Btrfs actually enables us to use it with confidence again, since using checksums it's able to detect these errors and prevent corruption of the array. So to the contrary, I see a lot of potential for parity-based redundancy in combination with btrfs.

Regards,

Bart
On 18/11/10 15:31, Bart Noordervliet wrote:
> On Wed, Nov 17, 2010 at 19:07, Gordan Bobic <gordan@bobich.net> wrote:
>> Since BTRFS is already doing some relatively radical things, I would like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable for arrays bigger than about 5TB with disks that have a specified error rate of 10^-14. RAID6 pushes that problem a little further away, but in the longer term, I would argue that RAID (n+m) would work best. We specify that of (n+m) disks in the array, we want n data disks and m redundancy disks. If this is implemented in a generic way, then there won't be a need to implement additional RAID modes later.
>
> I presume you're talking about the uncaught read errors that make many people avoid RAID5. Btrfs actually enables us to use it with confidence again, since using checksums it's able to detect these errors and prevent corruption of the array. So to the contrary, I see a lot of potential for parity-based redundancy in combination with btrfs.

No, he's talking about the high chance of triggering another error during the long time it takes to perform the recovery (and before your data is redundant again). This is often also attributed to multiple disks being from the same batch and having the same flaws and lifetime expectancy.

But since btrfs would do this on a per-object basis instead of the whole array, only the objects whose blocks have gone are at risk (not necessarily the whole filesystem). Furthermore, additional read errors often only impact a subset of the files that were at risk, and if recovery is half-way done when another error is triggered, the already-recovered part will still be available.

So the real strength is that corruptions are more likely only to impact a small subset of the filesystem, and that different objects can have different amounts of redundancy. So 'raid1' for metadata and other very important files, no raid for unimportant data, and raid5/6 for large objects or for objects which only need a basic level of protection.

Regards,
   justin....
Bart Noordervliet wrote:
> On Wed, Nov 17, 2010 at 19:07, Gordan Bobic <gordan@bobich.net> wrote:
>> Since BTRFS is already doing some relatively radical things, I would like to suggest that RAID5 and RAID6 be deemed obsolete. RAID5 isn't safely usable for arrays bigger than about 5TB with disks that have a specified error rate of 10^-14. RAID6 pushes that problem a little further away, but in the longer term, I would argue that RAID (n+m) would work best. We specify that of (n+m) disks in the array, we want n data disks and m redundancy disks. If this is implemented in a generic way, then there won't be a need to implement additional RAID modes later.
>
> I presume you're talking about the uncaught read errors that make many people avoid RAID5. Btrfs actually enables us to use it with confidence again, since using checksums it's able to detect these errors and prevent corruption of the array. So to the contrary, I see a lot of potential for parity-based redundancy in combination with btrfs.

No. What I'm talking about is the probability of finding an error during the process of rebuilding a degraded array. With a 6TB (usable) array and disks with a 10^-14 error rate, the probability of getting an unrecoverable read error exceeds 50%. n+1 RAID isn't fit for use with the current generation of drives where n > 1-5TB, depending on how important your data and downtime are and how good your backups are.

And I don't put much stock in the manufacturer figures, either, so assume that 10^-14 is optimistic where that is the reported figure. On high capacity drives (especially 1TB Seagates, both 3 and 4 platter variants) I am certainly seeing a higher error rate than that on a significant fraction of the disks.

Gordan
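For anyone who wants the back-of-envelope behind figures like this: with an unrecoverable read error rate of p per bit and N bits read during the rebuild,

  P(at least one URE) = 1 - (1 - p)^N ~= 1 - e^(-N*p)

Reading 6 TB (about 4.8 x 10^13 bits) at p = 10^-14 gives N*p ~= 0.48, i.e. roughly a 40% chance of hitting a URE during the rebuild, and the probability passes 50% once the rebuild has to read on the order of 9 TB -- or sooner if, as noted above, the real-world error rate is worse than the datasheet's 10^-14.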
On Wed, Nov 17, 2010 at 2:31 PM, Hugo Mills <hugo-lkml@carfax.org.uk> wrote:
>> == Changing RAID levels ==
>>
>> We need ioctls to change between different raid levels. Some of these are quite easy -- e.g. for RAID0 to RAID1, we just halve the available bytes on the fs, then queue a rebalance.

Can we please do it properly? That is, change raid levels on a per-file, per-tree basis?

Thanks.

-- 
This message represents the official view of the voices in my head