Hello all,

reading the list for a while, it looks like all kinds of implementation topics are covered, but no basic user requests or discussions are going on. Since I have found no other list on vger covering these issues I chose this one; forgive my ignorance if it is the wrong place.

Like many people on the planet we handle quite large amounts of data (TBs) and try to do so with several Linux-based fileservers. Years of (mostly bad) experience led us to the following minimum requirements for a new fs on our servers:

1. filesystem check
1.1 it should not
    - delay the boot process (we have to wait for hours currently)
    - prevent mount in case of errors
    - be a part of the mount process at all
    - always check the whole fs
1.2 it should be able
    - to always be started interactively by the user
    - to check parts/subtrees of the fs
    - to run purely informationally (reporting, non-modifying)
    - to run on a mounted fs

2. general requirements
    - fs errors without file/dir names are useless
    - errors in parts of the fs are no reason for the fs to go offline as a whole
    - mounting must not delay system startup significantly
    - resizing during runtime (up and down)
    - parallel mounts (very important!)
      (two or more hosts mount the same fs concurrently for reading and writing)
    - journaling
    - versioning (file and dir)
    - undelete (file and dir)
    - snapshots
    - run into hd errors more than once for the same file (as an option)
    - map out dead blocks
      (and of course display of the currently mapped-out list)
    - no size limitations (more or less)
    - performant handling of large numbers of files inside single dirs
      (to check that, use > 100,000 files in a dir; understand that it is no good
      idea to spread inode blocks over the whole hd because of seek times)
    - power loss at any time must not corrupt the fs (atomic fs modification)
      (loss of new data is acceptable)

Remember, this is not meant to be a request for features; it is a list that built up over 10 years of handling data and the failures we experienced. To our knowledge no fs meets this list, but hey, is that a reason for not talking about it? Our goal is pretty simple: maximize fs uptime.

How does btrfs match?

--
Regards,
Stephan
Stephan von Krawczynski <skraw@ithnet.com> writes:

> reading the list for a while, it looks like all kinds of implementation
> topics are covered, but no basic user requests or discussions are going on.
> Since I have found no other list on vger covering these issues I chose this
> one; forgive my ignorance if it is the wrong place.
> Like many people on the planet we handle quite large amounts of data (TBs)
> and try to do so with several Linux-based fileservers.
> Years of (mostly bad) experience led us to the following minimum
> requirements for a new fs on our servers:

If those are the minimum requirements, what are the maximum ones?

Also, do you realize that some of the requirements (like parallel read/write, aka a full cluster file system) are extremely hard?

Perhaps it would make more sense if you extracted the top 10 items, ranked them by importance, and posted again.

-Andi

--
ak@linux.intel.com
btrfs has many of the same goals... but they are goals, not code, so when you might see them is indeterminate.

I believe these should not be in btrfs:

Stephan von Krawczynski wrote:
> - parallel mounts (very important!)

As Andi said, you want a cluster or distributed fs. There are layered designs (CRFS or network filesystems) that can do the job, and trying to do it in btrfs causes too many problems.

> - journaling

I assume you do *not* mean metadata journaling; you mean sending all file updates to a single output stream (as in one disk, tape, or network link). I've done that, but would not recommend it in btrfs because it limits the total fs bandwidth to what the single stream can support. This is normally done today by applications like databases, not in the filesystem.

> - map out dead blocks

Useless... a waste of time, code, and metadata structures. With current device technology, any device reporting bad blocks that it cannot map out itself is about to die and needs to be replaced!

jim
On Tue, 2008-10-21 at 13:23 +0200, Stephan von Krawczynski wrote:
> Hello all,
>
> reading the list for a while, it looks like all kinds of implementation
> topics are covered, but no basic user requests or discussions are going on.
> Since I have found no other list on vger covering these issues I chose this
> one; forgive my ignorance if it is the wrong place.
> Like many people on the planet we handle quite large amounts of data (TBs)
> and try to do so with several Linux-based fileservers.
> Years of (mostly bad) experience led us to the following minimum
> requirements for a new fs on our servers:

Thanks for this input and for taking the time to post it.

> 1. filesystem check
> 1.1 it should not
>     - delay the boot process (we have to wait for hours currently)
>     - prevent mount in case of errors
>     - be a part of the mount process at all
>     - always check the whole fs

For this, you have to define filesystem check very carefully. In reality, corruptions can prevent mounting. We can try very very hard to limit the class of corruptions that prevent mounting, and use duplication and replication to create configurations that address the remaining cases.

In general, we'll be able to make things much better than they are today.

> 1.2 it should be able
>     - to always be started interactively by the user
>     - to check parts/subtrees of the fs
>     - to run purely informationally (reporting, non-modifying)
>     - to run on a mounted fs

Started interactively? I'm not entirely sure what that means, but in general, when you ask users a question about if/how to fix a corruption, they will have no idea what the correct answer is.

> 2. general requirements
>     - fs errors without file/dir names are useless
>     - errors in parts of the fs are no reason for the fs to go offline as a whole

These two are in progress. Btrfs won't always be able to give a file and directory name, but it will be able to give something that can be turned into a file or directory name. You don't want important diagnostic messages delayed by name lookup.

>     - mounting must not delay system startup significantly

Mounts are fast.

>     - resizing during runtime (up and down)

Resize is done.

>     - parallel mounts (very important!)
>       (two or more hosts mount the same fs concurrently for reading and writing)

As Jim and Andi have said, parallel mounts are not in the feature list for Btrfs. Network filesystems will provide these features.

>     - journaling

Btrfs doesn't journal. The tree logging code is close; it provides optimized fsync and O_SYNC operations. The same basic structures could be used for remote replication.

>     - versioning (file and dir)

From a data structure point of view, version control is fairly easy. From a user interface and policy point of view, it gets difficult very quickly. Aside from snapshotting, version control is outside the scope of btrfs.

There are lots of good version control systems available; I'd suggest you use them instead.

>     - undelete (file and dir)

Undelete is easy, but I think it is best done at a layer above the FS.

>     - snapshots

Done.

>     - run into hd errors more than once for the same file (as an option)

Sorry, I'm not sure what you mean here.

>     - map out dead blocks
>       (and of course display of the currently mapped-out list)

I agree with Jim on this one. Drives remap dead sectors, and when they stop remapping them, the drive should be replaced.

>     - no size limitations (more or less)
>     - performant handling of large numbers of files inside single dirs
>       (to check that, use > 100,000 files in a dir; understand that it is no good
>       idea to spread inode blocks over the whole hd because of seek times)

Everyone has different ideas about "large" numbers of files inside a single dir. The directory indexing done by btrfs can easily handle 100,000 files.

>     - power loss at any time must not corrupt the fs (atomic fs modification)
>       (loss of new data is acceptable)

Done. Btrfs already uses barriers as required for SATA drives.

> Remember, this is not meant to be a request for features; it is a list that
> built up over 10 years of handling data and the failures we experienced. To
> our knowledge no fs meets this list, but hey, is that a reason for not
> talking about it? Our goal is pretty simple: maximize fs uptime.
> How does btrfs match?

-chris
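[For reference, the "> 100,000 files in a dir, fstat every file" test from the requirements list is easy to reproduce. Below is a minimal userspace sketch in C; the file count, directory name and "f%06d" naming scheme are made-up parameters for the example, not anything shipped with btrfs.]

    /*
     * Sketch: create NFILES empty files in one directory, then readdir()
     * and stat() every entry -- the seek-heavy part of the test.
     */
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define NFILES 100000

    int main(int argc, char **argv)
    {
            const char *dir = argc > 1 ? argv[1] : "testdir";
            char path[4096];
            struct stat st;
            struct dirent *de;
            DIR *d;
            long n = 0;
            int i, fd;

            mkdir(dir, 0755);

            for (i = 0; i < NFILES; i++) {          /* populate the directory */
                    snprintf(path, sizeof(path), "%s/f%06d", dir, i);
                    fd = open(path, O_CREAT | O_WRONLY, 0644);
                    if (fd < 0) {
                            perror(path);
                            return 1;
                    }
                    close(fd);
            }

            d = opendir(dir);                       /* scan it, stat()ing every entry */
            if (!d) {
                    perror(dir);
                    return 1;
            }
            while ((de = readdir(d)) != NULL) {
                    snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
                    if (stat(path, &st) == 0)
                            n++;
            }
            closedir(d);
            printf("stat()ed %ld entries\n", n);
            return 0;
    }

[Timing the scan phase with time(1) once with a warm cache and once after a reboot or cache drop makes the seek cost Stephan describes visible.]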
On Tue, 21 Oct 2008 14:13:33 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> Stephan von Krawczynski <skraw@ithnet.com> writes:
>
> > reading the list for a while, it looks like all kinds of implementation
> > topics are covered, but no basic user requests or discussions are going on.
> > Since I have found no other list on vger covering these issues I chose this
> > one; forgive my ignorance if it is the wrong place.
> > Like many people on the planet we handle quite large amounts of data (TBs)
> > and try to do so with several Linux-based fileservers.
> > Years of (mostly bad) experience led us to the following minimum
> > requirements for a new fs on our servers:
>
> If those are the minimum requirements, what are the maximum ones?
>
> Also, do you realize that some of the requirements (like parallel
> read/write, aka a full cluster file system) are extremely hard?
>
> Perhaps it would make more sense if you extracted the top 10 items, ranked
> them by importance, and posted again.

Hello Andi,

thanks for your feedback. Understand "minimum requirement" as "minimum requirement to drop the current installation and migrate the data to a new fs platform".

Of course you are right: dealing with multiple/parallel mounts can be quite a nasty job if the fs was not originally planned with this feature in mind. On the other hand, I cannot really imagine how to deal with TBs of data in the future without such a feature.

If you look at the big picture, the things I mentioned allow you to have redundant front-ends for the file service running the same or completely different applications. You can use one mount (host) for tape backup purposes only, without heavy loss in standard file service. You can even mount for filesystem check purposes: a box that does nothing else but check the structure and keep you informed about what is really going on with your data, while your data is still in production in the meantime. Whatever happens, you have a real chance of keeping your file service up, even if parts of your fs go nuts because some underlying hd got partially damaged. Keeping it up and running is the most important part; performance is only second on the list.

If you take a close look there are not really 10 different items on my list, depending on the level of abstraction you prefer, but nevertheless:

1) parallel mounts
2) mounting must not delay the system startup significantly
3) errors in parts of the fs are no reason for the fs to go offline as a whole
4) power loss at any time must not corrupt the fs
5) fsck on a mounted fs, interactively, not part of the mount (all fsck features)
6) journaling
7) undelete (file and dir)
8) resizing during runtime (up and down)
9) snapshots
10) performant handling of large numbers of files inside single dirs

--
Regards,
Stephan
Hearing what users think they want is always good, but...

Stephan von Krawczynski wrote:
>
> thanks for your feedback. Understand "minimum requirement" as "minimum
> requirement to drop the current installation and migrate the data to a
> new fs platform".

I would sure like to know what existing platform and filesystem you have that you think has all 10 of your features.

> Of course you are right: dealing with multiple/parallel mounts can be quite
> a nasty job if the fs was not originally planned with this feature in mind.
> On the other hand, I cannot really imagine how to deal with TBs of data in
> the future without such a feature.
> If you look at the big picture, the things I mentioned allow you to have
> redundant front-ends for the file service running the same or completely
> different applications. You can use one mount (host) for tape backup
> purposes only, without heavy loss in standard file service. You can even
> mount for filesystem check purposes: a box that does nothing else but check
> the structure and keep you informed about what is really going on with your
> data, while your data is still in production in the meantime.
> Whatever happens, you have a real chance of keeping your file service up,
> even if parts of your fs go nuts because some underlying hd got partially
> damaged. Keeping it up and running is the most important part; performance
> is only second on the list.
> If you take a close look there are not really 10 different items on my
> list, depending on the level of abstraction you prefer, but nevertheless:
>
> 1) parallel mounts

What I see from that explanation is that you have a "system design" idea using parallel machines to fix problems you have had in the past. To implement your design, you need a filesystem to fit it. I think it is better to just design a filesystem without the problems and configure the hardware to handle the necessary load.

> 2) mounting must not delay the system startup significantly
> 3) errors in parts of the fs are no reason for the fs to go offline as a whole
> 4) power loss at any time must not corrupt the fs
> 5) fsck on a mounted fs, interactively, not part of the mount (all fsck
>    features)

I think all of these are part of the "reliability" goal for btrfs, and when you say "fsck" it is probably misleading if I understand your real requirement to be the same as my customers':

 - *NO* fsck
 - filesystem design "prevents problems we have had before"
 - filesystem autodetects, isolates, and (possibly) repairs errors
 - online "scan, check, repair filesystem" tool initiated by the admin
 - reliability so high that they never run that check-and-fix tool

Note that I personally have never seen a first release meet the "no problems, no need to fix" criteria that would obviate any need for a check/fix tool.

jim
Chris Mason <chris.mason@oracle.com> writes:
>
> Started interactively? I'm not entirely sure what that means, but in
> general, when you ask users a question about if/how to fix a
> corruption, they will have no idea what the correct answer is.

While that's true today, I'm not sure it has to be true always. I always thought traditional fsck user interfaces were a UI disaster and could be done much better with some simple tweaks.

For example, the fsck could present the user with a list of files that ended up in lost+found and let them examine those, instead of asking a lot of useless questions. Or it could give a high-level summary of how many files in which parts of the directory tree were corrupted. Etc. etc. Or it could default to a high-level mode that only gives such high-level information to the user.

So I don't think all corruptions could be handled in a perfectly user-friendly way, but at least the basic user friendliness in many situations could be much improved.

-Andi

--
ak@linux.intel.com
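[To illustrate the kind of high-level summary Andi describes, here is a toy C filter. It is not part of any fsck; the input format of one corrupted path per line on stdin is an assumption made for this sketch.]

    /*
     * Toy report aggregator: read absolute paths of corrupted files on
     * stdin, one per line, and print how many fall under each top-level
     * directory.  Purely illustrative -- not fsck code.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAXDIRS 1024

    struct bucket {
            char dir[256];
            long count;
    };

    int main(void)
    {
            static struct bucket b[MAXDIRS];
            char line[4096], top[256];
            int nb = 0, i;

            while (fgets(line, sizeof(line), stdin)) {
                    /* first path component: "/home/foo/bar" -> "home" */
                    if (sscanf(line, "/%255[^/\n]", top) != 1)
                            continue;
                    for (i = 0; i < nb; i++)
                            if (strcmp(b[i].dir, top) == 0)
                                    break;
                    if (i == nb) {
                            if (nb == MAXDIRS)
                                    continue;   /* table full, skip entry */
                            strncpy(b[nb].dir, top, sizeof(b[nb].dir) - 1);
                            nb++;
                    }
                    b[i].count++;
            }
            for (i = 0; i < nb; i++)
                    printf("%6ld corrupted file(s) under /%s\n",
                           b[i].count, b[i].dir);
            return 0;
    }

[An fsck front end could print a table like this by default and keep the per-inode detail behind a verbose switch.]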
Hello Chris,

let me clarify some things a bit, see below...

On Tue, 21 Oct 2008 09:59:40 -0400 Chris Mason <chris.mason@oracle.com> wrote:

> Thanks for this input and for taking the time to post it.
>
> > 1. filesystem check
> > 1.1 it should not
> >     - delay the boot process (we have to wait for hours currently)
> >     - prevent mount in case of errors
> >     - be a part of the mount process at all
> >     - always check the whole fs
>
> For this, you have to define filesystem check very carefully. In
> reality, corruptions can prevent mounting. We can try very very hard to
> limit the class of corruptions that prevent mounting, and use
> duplication and replication to create configurations that address the
> remaining cases.

What we would like to have is a way to check an already mounted and active fs for corruption; that's the reporting part. If some corruption is found, we should be able to correct the data/metadata/whatever on the _still active_ fs, let's say by starting fsck in modify mode. It is often preferred not to do a run over the complete fs but only over certain (already known-to-be-corrupted) parts/subtrees. It is obvious that the fs should not go offline even if something very ugly happens.

You can imagine: run fsck via cron every night. Then look at the logs in the morning, and if bad news arrived, try to correct the broken subtree or exclude it from further usage.

> In general, we'll be able to make things much better than they are
> today.

I am pretty sure about that ;-)

> > 1.2 it should be able
> >     - to always be started interactively by the user
> >     - to check parts/subtrees of the fs
> >     - to run purely informationally (reporting, non-modifying)
> >     - to run on a mounted fs
>
> Started interactively? I'm not entirely sure what that means, but in
> general, when you ask users a question about if/how to fix a
> corruption, they will have no idea what the correct answer is.

See the explanation above. We don't expect the classical y/n questions during fsck. Honestly, there are only three types of modification modes in fsck:
- try correction in place
- exclude (i.e. delete) the whole problem subtree
- duplicate to another subtree whatever can be rescued from the original place (and leave the problem subtree as-is)

> > 2. general requirements
> >     - fs errors without file/dir names are useless
> >     - errors in parts of the fs are no reason for the fs to go offline as a whole
>
> These two are in progress. Btrfs won't always be able to give a file
> and directory name, but it will be able to give something that can be
> turned into a file or directory name. You don't want important
> diagnostic messages delayed by name lookup.

That's a point I really never understood. Why is it non-trivial for a fs to know what file or dir (name) it is currently working on? It really sounds strange to me that a layer that is managing files on some device does not know at any time during runtime what file or dir it is actually handling. If _it_ does not know, how should the _user_, probably hours later reading the logs, know based on inode numbers or whatever cryptic logs are thrown out? I mean, filenames are nothing more than a human-readable describing data structure, mostly of type char. Their only reason for existing is readability; why not in logs?

> >     - mounting must not delay system startup significantly
>
> Mounts are fast.
>
> >     - resizing during runtime (up and down)
>
> Resize is done.
>
> >     - parallel mounts (very important!)
> >       (two or more hosts mount the same fs concurrently for reading and writing)
>
> As Jim and Andi have said, parallel mounts are not in the feature list
> for Btrfs. Network filesystems will provide these features.

Can you explain what "network filesystems" stands for in this statement? Please name two or three examples.

> >     - journaling
>
> Btrfs doesn't journal. The tree logging code is close; it provides
> optimized fsync and O_SYNC operations. The same basic structures could
> be used for remote replication.
>
> >     - versioning (file and dir)
>
> From a data structure point of view, version control is fairly easy.
> From a user interface and policy point of view, it gets difficult very
> quickly. Aside from snapshotting, version control is outside the scope
> of btrfs.
>
> There are lots of good version control systems available; I'd suggest
> you use them instead.

To me, versioning sounds like a not-so-easy-to-implement feature. Nevertheless, I trust your experience. If a basic implementation is possible and not too complex, why deny a feature?

> >     - undelete (file and dir)
>
> Undelete is easy,

Yes, we hear and say that all the time; name one Linux fs that does it, please.

> but I think it is best done at a layer above the FS.

Before we got into the Linux community we used n.vell netware. Undelete has been there since about the first day. More than ten years later (nowadays) it is still missing in Linux. I really do suggest providing _some_ solution and _then_ let's talk about the _better_ solution.

> >     - snapshots
>
> Done.
>
> >     - run into hd errors more than once for the same file (as an option)
>
> Sorry, I'm not sure what you mean here.

If your hd is going dead, you often find out that touching broken files takes ages. If the fs finds out a file is corrupt because the device has errors, it could just flag the file as broken and not re-read the same error a thousand times more. Obviously you want that as an option, because there can be good reasons for re-reading dead files...

> >     - map out dead blocks
> >       (and of course display of the currently mapped-out list)
>
> I agree with Jim on this one. Drives remap dead sectors, and when they
> stop remapping them, the drive should be replaced.

If your life depended on it, would you use one rope or two to secure yourself?

> >     - no size limitations (more or less)
> >     - performant handling of large numbers of files inside single dirs
> >       (to check that, use > 100,000 files in a dir; understand that it is no good
> >       idea to spread inode blocks over the whole hd because of seek times)
>
> Everyone has different ideas about "large" numbers of files inside a single
> dir. The directory indexing done by btrfs can easily handle 100,000 files.

The story is not really about whether it can, but how fast it can. You know that most time is spent in seeks these days. If you have 100,000 blocks to read right across the whole disk when scanning through a dir (fstat every file), you will see quite a difference compared to a situation where the relevant data can be read within few (or zero) seeks. It's a question of fs layout on the disk.

> >     - power loss at any time must not corrupt the fs (atomic fs modification)
> >       (loss of new data is acceptable)
>
> Done. Btrfs already uses barriers as required for SATA drives.
> [...]
> -chris

--
Regards,
Stephan
Stephan von Krawczynski <skraw@ithnet.com> writes:
>
> Yes, we hear and say that all the time; name one Linux fs that does it, please.

ext[234] support it to some extent. It has some limitations (especially when the files are large, and you shouldn't do too much follow-on IO, to prevent the data from being overwritten), and the user front ends are not very nice, but it's there.

-Andi

--
ak@linux.intel.com
On Tue, 21 Oct 2008 09:20:16 -0400 jim owens <jowens@hp.com> wrote:

> btrfs has many of the same goals... but they are goals, not code,
> so when you might see them is indeterminate.

No big issue, my pension is 20 years away, I've got time ;-)

> I believe these should not be in btrfs:
>
> Stephan von Krawczynski wrote:
>
> > - parallel mounts (very important!)
>
> As Andi said, you want a cluster or distributed fs. There
> are layered designs (CRFS or network filesystems) that can do
> the job, and trying to do it in btrfs causes too many problems.

The question is: if you had such an implementation, would there be drawbacks to expect for the single-mount case? If not, I'd vote for it, because there are not really many alternatives "on the market".

> > - journaling
>
> I assume you do *not* mean metadata journaling; you mean
> sending all file updates to a single output stream (as in one
> disk, tape, or network link). I've done that, but would not
> recommend it in btrfs because it limits the total fs bandwidth
> to what the single stream can support. This is normally done
> today by applications like databases, not in the filesystem.

As far as I know, metadata journaling is in, right? If what you mean is something capable of creating live or offline images of the fs, you got me right.

> > - map out dead blocks
>
> Useless... a waste of time, code, and metadata structures.
> With current device technology, any device reporting bad blocks
> that it cannot map out itself is about to die and needs to be replaced!

Sure, but what you say only reflects the ideal world. On a file service you never have that. In fact, you do not even have good control over what is going on. Let's say you have a setup that creates, reads and deletes files 24h a day from numerous clients. At two o'clock in the morning some hd decides to partially die. Files get created on it, fill with data up to errors, get deleted, and another bunch of data arrives, and yet again the fs tries to allocate the same dead areas. You lose a lot more data only because the fs did not map out the already known dead blocks. Of course you would replace the dead drive later on, but in the meantime you have a lot of fun.

In other words: give me a tool to freeze the world right at the time the errors show up, or map out dead blocks (only because it is a lot easier).

--
Regards,
Stephan
On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
> Sure, but what you say only reflects the ideal world. On a file service you
> never have that. In fact, you do not even have good control over what is
> going on. Let's say you have a setup that creates, reads and deletes files
> 24h a day from numerous clients. At two o'clock in the morning some hd
> decides to partially die. Files get created on it, fill with data up to
> errors, get deleted, and another bunch of data arrives, and yet again the fs
> tries to allocate the same dead areas. You lose a lot more data only because
> the fs did not map out the already known dead blocks. Of course you would
> replace the dead drive later on, but in the meantime you have a lot of fun.
> In other words: give me a tool to freeze the world right at the time the
> errors show up, or map out dead blocks (only because it is a lot easier).

When modern disks can't solve the problems with their internal remapping anymore, you had better replace them ASAP, as that is a very strong indication of imminent disk failure. Last year's FAST had some very interesting statistics showing this in the field.
Christoph Hellwig wrote:
> On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
>
>> Sure, but what you say only reflects the ideal world. On a file service you
>> never have that. In fact, you do not even have good control over what is
>> going on. Let's say you have a setup that creates, reads and deletes files
>> 24h a day from numerous clients. At two o'clock in the morning some hd
>> decides to partially die. Files get created on it, fill with data up to
>> errors, get deleted, and another bunch of data arrives, and yet again the
>> fs tries to allocate the same dead areas. You lose a lot more data only
>> because the fs did not map out the already known dead blocks. Of course
>> you would replace the dead drive later on, but in the meantime you have a
>> lot of fun.
>> In other words: give me a tool to freeze the world right at the time the
>> errors show up, or map out dead blocks (only because it is a lot easier).
>
> When modern disks can't solve the problems with their internal remapping
> anymore, you had better replace them ASAP, as that is a very strong
> indication of imminent disk failure. Last year's FAST had some very
> interesting statistics showing this in the field.

Doing proactive drive pulls is kind of a black art, but looking for *lots* of remapped sectors is always a pretty reliable clue.

Note that modern S-ATA disks might have room to remap 2-3 thousand sectors, so you should not worry too much about a handful (say 20 or so). Sometimes the remapping happens because of transient things (junk on the platter, vibrations, out-of-spec temperature range, etc.), so your drive might be perfectly healthy.

If you have remapped a big chunk of the sectors (say more than 10%), you should grab the data off the disk ASAP and replace it. Worry less about errors during reads; write errors indicate more serious problems.

The file system should not have to worry about remapping sectors internally. By the time writes fail and you have consumed all remapped sectors, you should definitely be in read-only mode and well on the way to replacing the disk :-)

ric
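[The remapped-sector count Ric refers to is visible from userspace via SMART. A quick-and-dirty C sketch follows; it assumes smartmontools' smartctl is installed, that the drive reports the ATA attribute "Reallocated_Sector_Ct", and the device name is only an example.]

    /*
     * Print the Reallocated_Sector_Ct line from "smartctl -A <dev>".
     * The raw value is the last column of that line.
     */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
            const char *dev = argc > 1 ? argv[1] : "/dev/sda";
            char cmd[256], line[512];
            FILE *p;

            snprintf(cmd, sizeof(cmd), "smartctl -A %s", dev);
            p = popen(cmd, "r");
            if (!p) {
                    perror("popen");
                    return 1;
            }
            while (fgets(line, sizeof(line), p)) {
                    if (strstr(line, "Reallocated_Sector_Ct"))
                            fputs(line, stdout);
            }
            pclose(p);
            return 0;
    }

[A cron job around something like this, with the "handful vs. more than 10%" thresholds Ric mentions, is the usual way admins act on this advice without any filesystem involvement.]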
> The question is: if you had such an implementation, would there be drawbacks
> to expect for the single-mount case? If not, I'd vote for it, because there
> are not really many alternatives "on the market".

As I understand it, the largest issue is locking and boundaries.

Two different systems could mount a filesystem and try to use some sort of on-disk markers to keep from writing to the same area at the same time... but there is often some bit of time between when a system sends data to the disk and when it would become available to read from the disk, and little or no guarantee about the order in which the data is written.

All the work that goes into making transactions atomic depends on there being only a single path to the disk: through the code that handles transactions. If data can arrive on the disk without being managed by that code, all bets are off.
On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote:

> > > 2. general requirements
> > >     - fs errors without file/dir names are useless
> > >     - errors in parts of the fs are no reason for the fs to go offline as a whole
> >
> > These two are in progress. Btrfs won't always be able to give a file
> > and directory name, but it will be able to give something that can be
> > turned into a file or directory name. You don't want important
> > diagnostic messages delayed by name lookup.
>
> That's a point I really never understood. Why is it non-trivial for a fs to
> know what file or dir (name) it is currently working on?

The name lives in block A, but you might find a corruption while processing block B. Block A might not be in RAM anymore, or it might be in RAM but locked by another process.

On top of all of that, when we print errors it's because things haven't gone well. The errors are printed deep inside various parts of the filesystem, and we might not be able to take the required locks or read from the disk in order to find the name of the thing we're operating on.

> > >     - mounting must not delay system startup significantly
> >
> > Mounts are fast.
> >
> > >     - resizing during runtime (up and down)
> >
> > Resize is done.
> >
> > >     - parallel mounts (very important!)
> > >       (two or more hosts mount the same fs concurrently for reading and writing)
> >
> > As Jim and Andi have said, parallel mounts are not in the feature list
> > for Btrfs. Network filesystems will provide these features.
>
> Can you explain what "network filesystems" stands for in this statement?
> Please name two or three examples.

NFS (done), CRFS (under development), and maybe Ceph as well, which is also under development.

> > >     - journaling
> >
> > Btrfs doesn't journal. The tree logging code is close; it provides
> > optimized fsync and O_SYNC operations. The same basic structures could
> > be used for remote replication.
> >
> > >     - versioning (file and dir)
> >
> > From a data structure point of view, version control is fairly easy.
> > From a user interface and policy point of view, it gets difficult very
> > quickly. Aside from snapshotting, version control is outside the scope
> > of btrfs.
> >
> > There are lots of good version control systems available; I'd suggest
> > you use them instead.
>
> To me, versioning sounds like a not-so-easy-to-implement feature.
> Nevertheless, I trust your experience. If a basic implementation is
> possible and not too complex, why deny a feature?

In general, I think snapshotting solves enough of the problem for most of the people most of the time. I'd love for Btrfs to be the perfect FS, but I'm afraid everyone has a different definition of perfect.

Storing multiple versions of something is pretty easy. Making a usable interface around those versions is the hard part, especially because you need groups of files to be versioned together in atomic groups (something that looks a lot like a snapshot).

Versioning is solved in userspace. We would never be able to implement everything that git or mercurial can do inside the filesystem.

> > >     - undelete (file and dir)
> >
> > Undelete is easy,
>
> Yes, we hear and say that all the time; name one Linux fs that does it, please.

The fact that nobody is doing it is not a good argument for why it should be done ;) Undelete is a policy decision about what to do with files as they are removed. I'd much rather see it implemented above the filesystems instead of individually in each filesystem.

This doesn't mean I'll never code it; it just means it won't get implemented directly inside of Btrfs. In comparison with all of the other features pending, undelete is pretty far down on the list.

> > but I think it is best done at a layer above the FS.
>
> Before we got into the Linux community we used n.vell netware. Undelete has
> been there since about the first day. More than ten years later (nowadays)
> it is still missing in Linux. I really do suggest providing _some_ solution
> and _then_ let's talk about the _better_ solution.
>
> > >     - snapshots
> >
> > Done.
> >
> > >     - run into hd errors more than once for the same file (as an option)
> >
> > Sorry, I'm not sure what you mean here.
>
> If your hd is going dead, you often find out that touching broken files
> takes ages. If the fs finds out a file is corrupt because the device has
> errors, it could just flag the file as broken and not re-read the same
> error a thousand times more. Obviously you want that as an option, because
> there can be good reasons for re-reading dead files...

I really agree that we want to avoid beating on a dead drive.

Btrfs will record some error information about the drive so it can decide what to do with failures. But remembering that sector #12345768 is bad doesn't help much. When the drive returned the IO error it remapped the sector, and the next write will probably succeed.

> > >     - map out dead blocks
> > >       (and of course display of the currently mapped-out list)
> >
> > I agree with Jim on this one. Drives remap dead sectors, and when they
> > stop remapping them, the drive should be replaced.
>
> If your life depended on it, would you use one rope or two to secure yourself?

Btrfs will keep the dead drive around as a fallback for sectors that fail on the other mirrors while data is being rebuilt. Beyond that, we'll expect you to toss the bad drive once the rebuild has finished.

There's an interesting paper about how NetApp puts the drive into rehab and is able to avoid service calls by rewriting the bad sectors and checking them over. That's a little ways off for Btrfs.

> > >     - no size limitations (more or less)
> > >     - performant handling of large numbers of files inside single dirs
> > >       (to check that, use > 100,000 files in a dir; understand that it is no good
> > >       idea to spread inode blocks over the whole hd because of seek times)
> >
> > Everyone has different ideas about "large" numbers of files inside a single
> > dir. The directory indexing done by btrfs can easily handle 100,000 files.
>
> The story is not really about whether it can, but how fast it can. You know
> that most time is spent in seeks these days. If you have 100,000 blocks to
> read right across the whole disk when scanning through a dir (fstat every
> file), you will see quite a difference compared to a situation where the
> relevant data can be read within few (or zero) seeks. It's a question of fs
> layout on the disk.

Yes, btrfs already performs well in this workload.

-chris
calin wrote:
>> The question is: if you had such an implementation, would there be drawbacks
>> to expect for the single-mount case? If not, I'd vote for it, because there
>> are not really many alternatives "on the market".
>
> As I understand it, the largest issue is locking and boundaries.

Correct, that is the first big issue. As soon as 2 machines can access the same device, you must design for distributed locking. And that means a lot more code, lower performance, and a lot of things a local-only filesystem could do that must be disallowed.

The second issue is the purpose of more than 1 host accessing the data directly from the device. There are cases where this is a good thing, because the application is designed with data partitioning and multi-instance coordination. It is a bad thing for random uncoordinated use like backups or fsck.

Remember that device bandwidth is the limiter, so even when each host has a dedicated path to the device (as in dual-port SAS or FC), that 2nd host cuts the throughput by more than half with uncoordinated seeks and transfers. And if the host device drivers are not designed for multiple-host sharing, this can cause timeouts, resets, and false device-failed states.

And yes... even read-only access from a 2nd host is trouble in many parts of the design and does not come for free.

jim
On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
> > - power loss at any time must not corrupt the fs (atomic fs modification)
> >   (loss of new data is acceptable)
>
> Done. Btrfs already uses barriers as required for SATA drives.

Aren't there situations in which write barriers don't do what they're supposed to do?

Cheers,
Eric
Eric Anopolsky wrote:
> On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
>
>>> - power loss at any time must not corrupt the fs (atomic fs modification)
>>>   (loss of new data is acceptable)
>>
>> Done. Btrfs already uses barriers as required for SATA drives.
>
> Aren't there situations in which write barriers don't do what they're
> supposed to do?
>
> Cheers,
> Eric

If the drive effectively "lies" to you about flushing the write cache, you might have an issue. I have not seen that first hand with recent disk drives (and I have seen a lot :-)).

Ric
On Tue, 2008-10-21 at 18:18 -0400, Ric Wheeler wrote:
> Eric Anopolsky wrote:
> > On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
> >
> >>> - power loss at any time must not corrupt the fs (atomic fs modification)
> >>>   (loss of new data is acceptable)
> >>
> >> Done. Btrfs already uses barriers as required for SATA drives.
> >
> > Aren't there situations in which write barriers don't do what they're
> > supposed to do?
> >
> > Cheers,
> > Eric
>
> If the drive effectively "lies" to you about flushing the write cache,
> you might have an issue. I have not seen that first hand with recent
> disk drives (and I have seen a lot :-)).

That does not match the understanding I get from reading the notes/caveats section of Documentation/block/barrier.txt:

  "Note that block drivers must not requeue preceding requests while
  completing latter requests in an ordered sequence. Currently, no
  error checking is done against this."

and perhaps more importantly:

  "[a technical scenario involving disk writes]
  The problem here is that the barrier request is *supposed* to indicate
  that filesystem update requests [2] and [3] made it safely to the
  physical medium and, if the machine crashes after the barrier is
  written, filesystem recovery code can depend on that. Sadly, that
  isn't true in this case anymore. IOW, the success of a I/O barrier
  should also be dependent on success of some of the preceding requests,
  where only upper layer (filesystem) knows what 'some' is.

  This can be solved by implementing a way to tell the block layer which
  requests affect the success of the following barrier request and
  making lower lever drivers to resume operation on error only after
  block layer tells it to do so.

  As the probability of this happening is very low and the drive should
  be faulty, implementing the fix is probably an overkill. But, still,
  it's there."

Cheers,
Eric
jim owens wrote:
>
> Remember that device bandwidth is the limiter, so even
> when each host has a dedicated path to the device (as in
> dual-port SAS or FC), that 2nd host cuts the throughput by
> more than half with uncoordinated seeks and transfers.

That's only a problem if there is a single shared device. Since btrfs supports multiple devices, each host could own a device set, and access from other hosts would go through the owner. You would need RDMA to get reasonable performance and some kind of dual-porting to get high availability. Each host could control the allocation tree for its devices.

Of course, this doesn't solve the other problems with parallel mounts.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Eric Anopolsky wrote:
> On Tue, 2008-10-21 at 18:18 -0400, Ric Wheeler wrote:
>
>> Eric Anopolsky wrote:
>>
>>> On Tue, 2008-10-21 at 09:59 -0400, Chris Mason wrote:
>>>
>>>>> - power loss at any time must not corrupt the fs (atomic fs modification)
>>>>>   (loss of new data is acceptable)
>>>>
>>>> Done. Btrfs already uses barriers as required for SATA drives.
>>>
>>> Aren't there situations in which write barriers don't do what they're
>>> supposed to do?
>>>
>>> Cheers,
>>> Eric
>>
>> If the drive effectively "lies" to you about flushing the write cache,
>> you might have an issue. I have not seen that first hand with recent
>> disk drives (and I have seen a lot :-)).
>
> That does not match the understanding I get from reading the
> notes/caveats section of Documentation/block/barrier.txt:
>
>   "Note that block drivers must not requeue preceding requests while
>   completing latter requests in an ordered sequence. Currently, no
>   error checking is done against this."
>
> and perhaps more importantly:
>
>   "[a technical scenario involving disk writes]
>   The problem here is that the barrier request is *supposed* to indicate
>   that filesystem update requests [2] and [3] made it safely to the
>   physical medium and, if the machine crashes after the barrier is
>   written, filesystem recovery code can depend on that. Sadly, that
>   isn't true in this case anymore. IOW, the success of a I/O barrier
>   should also be dependent on success of some of the preceding requests,
>   where only upper layer (filesystem) knows what 'some' is.
>
>   This can be solved by implementing a way to tell the block layer which
>   requests affect the success of the following barrier request and
>   making lower lever drivers to resume operation on error only after
>   block layer tells it to do so.
>
>   As the probability of this happening is very low and the drive should
>   be faulty, implementing the fix is probably an overkill. But, still,
>   it's there."
>
> Cheers,
> Eric

The cache flush command for ATA devices will block and wait until all of the device's write cache has been written back.

What I assume Tejun was referring to here is that some IO might have been written out to the device and an error happened when the device tried to write the cache back (say, due to normal drive microcode cache destaging). The problem with this is that there is no outstanding IO context between the host and the storage to report the error to (i.e., the drive has already ACKed the write).

If this is what is being described, there is a non-zero chance that it might happen, but it is extremely infrequent. The checksumming that we have in btrfs will catch these bad writes when you replay the journal after a crash (or even when you read data blocks), so I would contend that this is about as good as we can do.

Tejun, Chris, does this match your understanding?

Thanks!

Ric
Ric Wheeler wrote:
> The cache flush command for ATA devices will block and wait until all of
> the device's write cache has been written back.
>
> What I assume Tejun was referring to here is that some IO might have
> been written out to the device and an error happened when the device
> tried to write the cache back (say, due to normal drive microcode cache
> destaging). The problem with this is that there is no outstanding IO
> context between the host and the storage to report the error to (i.e.,
> the drive has already ACKed the write).
>
> If this is what is being described, there is a non-zero chance that it
> might happen, but it is extremely infrequent. The checksumming that we
> have in btrfs will catch these bad writes when you replay the journal
> after a crash (or even when you read data blocks), so I would contend
> that this is about as good as we can do.

Please consider the following scenario.

1. FS issues lots of writes which are queued in the block elevator.
2. FS issues barrier.
3. Elevator pushes out all the writes.
4. One of the writes fails for some reason. Media failure or what not.
   Failure is propagated to the upper layer.
5. Whether there was a preceding failure or not, block queue processing
   continues and writes out all the pending requests.
6. Elevator issues FLUSH and it gets executed by the device.
7. Elevator issues the barrier write and it gets executed by the device.
8. *POWER LOSS*

The thing is that currently there is no defined way for the FS to take action once #4 happens, unless it waits for all outstanding writes to complete before issuing the barrier.

One way to solve this would be to make the failure status sticky, such that any barrier following any number of uncleared errors will fail too, so that the filesystem can think about what it should do with the write failure.

Thanks.

--
tejun
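[The dependency Tejun describes has a simple userspace analogue: a commit record only means something if every data write before it actually reached stable storage, so each step has to check for errors and flush before the next one starts. A small illustrative C sketch follows; the file names and record contents are invented for the example, and this is not btrfs or block-layer code.]

    /*
     * Illustration only: the "data must be durable before the commit
     * record" ordering, expressed with write()/fsync().
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            const char data[]   = "file payload\n";
            const char commit[] = "commit: payload v1\n";
            int dfd, cfd;

            dfd = open("data.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);
            cfd = open("commit.log", O_CREAT | O_WRONLY | O_APPEND, 0644);
            if (dfd < 0 || cfd < 0) {
                    perror("open");
                    return 1;
            }

            /* 1. write the data */
            if (write(dfd, data, sizeof(data) - 1) != (ssize_t)(sizeof(data) - 1)) {
                    perror("write data");
                    return 1;   /* a failed data write must abort the commit */
            }

            /* 2. force it to the medium (the role the flush/barrier plays) */
            if (fsync(dfd) != 0) {
                    perror("fsync data");
                    return 1;   /* if this reported success despite step 1
                                   failing, the commit below would lie */
            }

            /* 3. only now write and flush the commit record */
            if (write(cfd, commit, sizeof(commit) - 1) < 0 || fsync(cfd) != 0) {
                    perror("commit");
                    return 1;
            }

            puts("committed");
            return 0;
    }

[At the block layer the flush/barrier plays the role of the fsync() calls here, which is why a barrier that silently follows a failed write defeats the whole scheme.]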
On Tue, 21 Oct 2008 11:34:20 -0400 jim owens <jowens@hp.com> wrote:

> Hearing what users think they want is always good, but...
>
> Stephan von Krawczynski wrote:
> >
> > thanks for your feedback. Understand "minimum requirement" as "minimum
> > requirement to drop the current installation and migrate the data to a
> > new fs platform".
>
> I would sure like to know what existing platform and filesystem
> you have that you think has all 10 of your features.

Obviously none, else I would not speak up and try to find one. :-)

> > [...]
> > 1) parallel mounts
>
> What I see from that explanation is that you have a "system design" idea
> using parallel machines to fix problems you have had in the past.
> To implement your design, you need a filesystem to fit it.

Well, I can hardly deny that. Let's just name the (simple) problem; these are different names for the very same thing: uptime, availability, redundancy.

> I think
> it is better to just design a filesystem without the problems and
> configure the hardware to handle the necessary load.

Ok, now you see me astonished. You really think that there is one piece of software around that is "without problems"? My idea of the world is really very different from that: the world is far from perfect. That is why I try to deploy solutions that have redundancy for all kinds of problems I can think of, and hopefully for a few that I haven't thought of.

> > 2) mounting must not delay the system startup significantly
> > 3) errors in parts of the fs are no reason for the fs to go offline as a whole
> > 4) power loss at any time must not corrupt the fs
> > 5) fsck on a mounted fs, interactively, not part of the mount (all fsck
> >    features)
>
> I think all of these are part of the "reliability" goal for btrfs,
> and when you say "fsck" it is probably misleading if I understand
> your real requirement to be the same as my customers':
>
>  - *NO* fsck
>  - filesystem design "prevents problems we have had before"
>  - filesystem autodetects, isolates, and (possibly) repairs errors
>  - online "scan, check, repair filesystem" tool initiated by the admin
>  - reliability so high that they never run that check-and-fix tool

That is _wrong_ (to a certain extent). You _want to run_ diagnostic tools to make sure that there is no problem. And you don't want some software (not even HAL) to repair errors without prior admin knowledge/permission.

> Note that I personally have never seen a first release meet
> the "no problems, no need to fix" criteria that would obviate
> any need for a check/fix tool.

That really does not depend on the release number of _your_ special software. Your software always depends on other components (hw or sw) that (can) have bugs and weird behaviour. And this is the fact: no perfect world, so don't count on your own perfection or anybody else's. If you do, you will fail.

--
Regards,
Stephan
On Tue, 21 Oct 2008 13:15:13 -0400 Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Oct 21, 2008 at 07:01:36PM +0200, Stephan von Krawczynski wrote:
> > Sure, but what you say only reflects the ideal world. On a file service
> > you never have that. In fact, you do not even have good control over what
> > is going on. Let's say you have a setup that creates, reads and deletes
> > files 24h a day from numerous clients. At two o'clock in the morning some
> > hd decides to partially die. Files get created on it, fill with data up
> > to errors, get deleted, and another bunch of data arrives, and yet again
> > the fs tries to allocate the same dead areas. You lose a lot more data
> > only because the fs did not map out the already known dead blocks. Of
> > course you would replace the dead drive later on, but in the meantime you
> > have a lot of fun.
> > In other words: give me a tool to freeze the world right at the time the
> > errors show up, or map out dead blocks (only because it is a lot easier).
>
> When modern disks can't solve the problems with their internal remapping
> anymore, you had better replace them ASAP, as that is a very strong
> indication of imminent disk failure. Last year's FAST had some very
> interesting statistics showing this in the field.

And of course a "disk" is always a "disk", right?

--
Regards,
Stephan
On Tue, 21 Oct 2008 18:09:40 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> While that's true today, I'm not sure it has to be true always.
> I always thought traditional fsck user interfaces were a UI disaster
> and could be done much better with some simple tweaks.
> [...]

You are completely right.

--
Regards,
Stephan
On Tue, 21 Oct 2008 18:59:26 +0200 Andi Kleen <andi@firstfloor.org> wrote:

> Stephan von Krawczynski <skraw@ithnet.com> writes:
> >
> > Yes, we hear and say that all the time; name one Linux fs that does it, please.
>
> ext[234] support it to some extent. It has some limitations
> (especially when the files are large, and you shouldn't do too much
> follow-on IO, to prevent the data from being overwritten), and the user
> front ends are not very nice, but it's there.

Well, they must be pretty ugly; I really never heard of that. But really, it is not very important, because extX is completely useless with TB-size disks unless you feel good about waiting hours for fsck (I did, and will never do so again). _All_ customers we deployed ext3 for urged us to go back to reiserfs3...

--
Regards,
Stephan
Stephan von Krawczynski wrote:
>
>> - filesystem autodetects, isolates, and (possibly) repairs errors
>> - online "scan, check, repair filesystem" tool initiated by the admin
>> - reliability so high that they never run that check-and-fix tool
>
> That is _wrong_ (to a certain extent). You _want to run_ diagnostic tools to
> make sure that there is no problem. And you don't want some software (not
> even HAL) to repair errors without prior admin knowledge/permission.

I think there's a place for a scrubber that continuously verifies filesystem data and metadata, at low io priority, and corrects any correctable errors. The admin can read the error-correction report at their leisure, and then take any action that's outside the filesystem's capabilities (like ordering and installing new disks).

--
error compiling committee.c: too many arguments to function
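[A rough userspace sketch of the "low io priority scrubber" idea follows. It is an assumption-laden illustration, not existing btrfs code: the ioprio constants below mirror the values in linux/ioprio.h, and a real scrubber would verify checksums instead of merely reading the data.]

    /*
     * Read the files given on the command line end to end in the idle
     * I/O class, the same mechanism "ionice -c3" uses, so regular
     * traffic is barely disturbed.
     */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* values as in linux/ioprio.h */
    #define IOPRIO_CLASS_IDLE        3
    #define IOPRIO_CLASS_SHIFT       13
    #define IOPRIO_WHO_PROCESS       1
    #define IOPRIO_PRIO_VALUE(cl, d) (((cl) << IOPRIO_CLASS_SHIFT) | (d))

    int main(int argc, char **argv)
    {
            char buf[64 * 1024];
            FILE *f;
            int i;

            /* "who" of 0 means the calling process */
            if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                        IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) != 0)
                    perror("ioprio_set");   /* not fatal, scrub just competes harder */

            for (i = 1; i < argc; i++) {
                    f = fopen(argv[i], "r");
                    if (!f) {
                            perror(argv[i]);
                            continue;
                    }
                    while (fread(buf, 1, sizeof(buf), f) > 0)
                            ;               /* checksum verification would go here */
                    fclose(f);
            }
            return 0;
    }

[Whether a scrub like this runs inside the kernel or from cron, the point of the exchange above stands: it reports, and the admin decides what to repair.]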
On Tue, 21 Oct 2008 13:49:43 -0400 Chris Mason <chris.mason@oracle.com> wrote:

> On Tue, 2008-10-21 at 18:27 +0200, Stephan von Krawczynski wrote:
> > > > 2. general requirements
> > > > - fs errors without file/dir names are useless
> > > > - errors in parts of the fs are no reason for a fs to go offline as a whole
> > >
> > > These two are in progress. Btrfs won't always be able to give a file and directory name, but it will be able to give something that can be turned into a file or directory name. You don't want important diagnostic messages delayed by name lookup.
> >
> > That's a point I really never understood. Why is it non-trivial for a fs to know what file or dir (name) it is currently working on?
>
> The name lives in block A, but you might find a corruption while processing block B. Block A might not be in ram anymore, or it might be in ram but locked by another process.
>
> On top of all of that, when we print errors it's because things haven't gone well. They are deep inside of various parts of the filesystem, and we might not be able to take the required locks or read from the disk in order to find the name of the thing we're operating on.

Ok, this is interesting. In another thread I was told that parallel mounts are really complex and that you cannot do things in such an environment that you can do with a single mount. Well then, why don't we do it? All boxes I know of have tons of RAM, but the fs finds no place in RAM to put large parts (if not all) of the structural fs data, including filenames? Given the simple facts that RAM is always faster than any known disk, rotating or not, and that the RAM is just sitting there, what is the reason for not doing it?

> > > > - parallel mounts (very important!)
> > > >   (two or more hosts mount the same fs concurrently for reading and writing)
> > >
> > > As Jim and Andi have said, parallel mounts are not in the feature list for Btrfs. Network filesystems will provide these features.
> >
> > Can you explain what "network filesystems" stands for in this statement, please name two or three examples.
>
> NFS (done), CRFS (under development), maybe ceph as well, which is also under development.

NFS is a good example of a fs that never got redesigned for the modern world. I hope it will be, but currently it's like a Model T on a highway. You have an NFS server with clients. Your NFS server dies, and your backup server cannot take over the clients without them resetting their NFS link (which means a reboot for many applications) - no way. Besides that, you still need another fs below NFS to bring your data onto some medium, which means you still have the problem of how to create redundancy in your server architecture.

> > > > - versioning (file and dir)
> > >
> > > From a data structure point of view, version control is fairly easy. From a user interface and policy point of view, it gets difficult very quickly. Aside from snapshotting, version control is outside the scope of btrfs.
> > >
> > > There are lots of good version control systems available, I'd suggest you use them instead.
> >
> > To me versioning sounds like a not-so-easy-to-implement feature. Nevertheless I trust your experience. If a basic implementation is possible and not too complex, why deny a feature?
>
> In general I think snapshotting solves enough of the problem for most of the people most of the time. I'd love for Btrfs to be the perfect FS, but I'm afraid everyone has a different definition of perfect.
>
> Storing multiple versions of something is pretty easy. Making a usable interface around those versions is the hard part, especially because you need groups of files to be versioned together in atomic groups (something that looks a lot like a snapshot).
>
> Versioning is solved in userspace. We would never be able to implement everything that git or mercurial can do inside the filesystem.

Well, quite often the question is not about whole trees of data to be versioned. Even single (or a few) files or dirs can be of interest. And you want people to set up a complete user-space monster to version three openoffice documents (only a rather flawed example, of course)? Lots of people need a basic solution, not the groundbreaking answer to all questions.

> > > > - undelete (file and dir)
> > >
> > > Undelete is easy
> >
> > Yes, we hear and say that all the time; name one linux fs doing it, please.
>
> The fact that nobody is doing it is not a good argument for why it should be done ;)

Believe me, if NTFS had a simple undelete tool coming with it, we (in linux fs) would have it, too. Why do we always want to be _second best_?

> Undelete is a policy decision about what to do with files as they are removed. I'd much rather see it implemented above the filesystems instead of individually in each filesystem.
>
> This doesn't mean I'll never code it, it just means it won't get implemented directly inside of Btrfs. In comparison with all of the other features pending, undelete is pretty far down on the list.

Nobody talks about a solution for a problem he does not have; it is of minor priority. Up to the day he needs it, of course. Suddenly the priority jumps up :-) Come on, it is simple, it is useful, and it is a question that will never come up again once it is solved.

> > If your hd is going dead you often find out that touching broken files takes ages. If the fs finds out a file is corrupt because the device has errors it could just flag the file as broken and not re-read the same error a thousand times more. Obviously you want that as an option, because there can be good reasons for re-reading dead files...
>
> I really agree that we want to avoid beating on a dead drive.
>
> Btrfs will record some error information about the drive so it can decide what to do with failures. But, remembering that sector #12345768 is bad doesn't help much. When the drive returned the IO error it remapped the sector and the next write will probably succeed.

The problem with probability is that software is pretty bad at judging it. That's why my proposal was: let's do it and make it configurable for an admin who has a better idea of the current probability.

> > > > - map out dead blocks
> > > >   (and of course display of the currently mapped out list)
> > >
> > > I agree with Jim on this one. Drives remap dead sectors, and when they stop remapping them, the drive should be replaced.
> >
> > If your life depends on it, would you use one rope or two to secure yourself?
>
> Btrfs will keep the dead drive around as a fallback for sectors that fail on the other mirrors when data is being rebuilt. Beyond that, we'll expect you to toss the bad drive once the rebuild has finished.
>
> There's an interesting paper about how netapp puts the drive into rehab and is able to avoid service calls by rewriting the bad sectors and checking them over. That's a little ways off for Btrfs.

It will become more interesting what remapping means in a world full of flash disks. Does it mean a disk must be replaced when some or even lots of sectors are dead? How about being faster at understanding that we don't know all future parameters than at buying?

> [...]
> -chris

--
Regards,
Stephan
On Tue, 21 Oct 2008 13:31:37 -0400 Ric Wheeler <ricwheeler@gmail.com> wrote:

> [...]
> If you have remapped a big chunk of the sectors (say more than 10%), you should grab the data off the disk asap and replace it. Worry less about errors during read, writes indicate more serious errors.

Ok, now for the bad news: money has been invented. If you replace a disk before it really fails, you won't get a replacement from the manufacturer. That may sound irrelevant to someone handling 5 disks, but it is significant if you are handling 500 or more. The replacement rate is indeed much higher than people would think from their home PCs.

> [...]
> ric

--
Regards,
Stephan
On Wed, Oct 22, 2008 at 5:19 AM, Stephan von Krawczynski <skraw@ithnet.com> wrote:

> [...]
> Ok, this is interesting. In another thread I was told parallel mounts are really complex [...] All boxes I know have tons of RAM, but fs finds no place in RAM to put large parts (if not all) of the structural fs data including filenames? [...]

Google "Daniel Phillips Ramback faster than a speeding bullet". He is on this list and may have some insight.

> NFS is a good example for a fs that never got redesigned for modern world. [...] Your NFS server dies, your backup server cannot take over the clients without them resetting their NFS-link (which means reboot to many applications) - no way. [...]

You are somewhat misinformed on this. Perhaps the Linux NFS server can't cope, but I doubt it. NFS was designed to be stateless. I've got a fair amount of experience with a dual-head netapp architecture. When one head dies, the other transparently fails over. During the brief downtime, the clients go into I/O wait, if anything, instead of being disconnected. You might be able to do something similar using nfsd and keepalived if both servers were connected to the same storage. Setting that up would be trivial. You just need the clients mounting the vip and a reliable mechanism to provide the data from that vip. You could use heartbeat, but it is overly complex. Also look at clustered NFS or pNFS, both of which are NFS redesigns of the kind you speak of.

--
Jeff Schroeder
Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com
Tejun Heo wrote:
> Ric Wheeler wrote:
>> The cache flush command for ATA devices will block and wait until all of the device's write cache has been written back.
>>
>> What I assume Tejun was referring to here is that some IO might have been written out to the device and an error happened when the device tried to write the cache back (say due to normal drive microcode cache destaging). The problem with this is that there is no outstanding IO context between the host and the storage to report the error to (i.e., the drive has already acked the write).
>>
>> If this is what is being described, there is a non-zero chance that this might happen, but it is extremely infrequent. The checksumming that we have in btrfs will catch these bad writes when you replay the journal after a crash (or even when you read data blocks), so I would contend that this is about as good as we can do.
>
> Please consider the following scenario.
>
> 1. FS issues lots of writes which are queued in the block elevator.
> 2. FS issues barrier.
> 3. Elevator pushes out all the writes.
> 4. One of the writes fails for some reason. Media failure or what not. Failure is propagated to upper layer.
> 5. Whether there was a preceding failure or not, block queue processing continues and writes out all the pending requests.
> 6. Elevator issues FLUSH and it gets executed by the device.
> 7. Elevator issues barrier write and it gets executed by the device.
> 8. *POWER LOSS*
>
> The thing is that currently there is no defined way for the FS to take action once #4 happens, unless it waits for all outstanding writes to complete before issuing the barrier. One way to solve this would be to make the failure status sticky, such that any barrier following any number of uncleared errors will fail too, so that the filesystem can think about what it should do with the write failure.
>
> Thanks.

I think that we do handle a failure in the case that you outline above, since the FS will be able to notice the error before it sends a commit down (and that commit is wrapped in the barrier flush calls). This is the easy case, since we still have the context for the IO.

It is more challenging (and kind of related) if the IO done in (4) has been acked by the drive, the drive later destages its write cache (not as part of the flush), and then an error happens. In this case, there is nothing waiting on the initiator side to receive the IO error. We have effectively lost the context for that IO. The only way to detect this is on replay (if the journal has checksums enabled, or if the error is flagged as a media error).

Ric
Avi Kivity wrote:
> Stephan von Krawczynski wrote:
>> [...]
>
> I think there's a place for a scrubber to continuously verify filesystem data and metadata, at low io priority, and correct any correctable errors. The admin can read the error correction report at their leisure, and then take any action that's outside the filesystem's capabilities (like ordering and installing new disks).

Scrubbing is key for many scenarios, since errors can "grow" even in places where previous IO completed without flagging an error.

Some neat tricks are:

(1) Use block-level scrubbing to detect any media errors. If you can map that sector-level error onto a file system object (metadata, file data or unallocated space), tools can recover (fsck, get another copy of the file, or just ignore it!). There is a special command called READ_VERIFY that can be used to validate sectors without actually moving data from the target to the host, so you can scrub without consuming page cache, etc.

(2) Sign and validate the object at the file level, say by validating a digital signature. This can catch high-level errors (say the app messed up).

Note that this scrubbing needs to be carefully tuned not to interfere with the foreground workload; something like ionice or the other IO controllers being kicked about might help :-)

ric
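As an illustration of the kind of low-priority block-level scrub being discussed here, a minimal user-space sketch follows. This is not btrfs code: the ioprio constants are copied by hand from the kernel's ioprio definitions (glibc shipped no wrapper at the time), the device path comes from the command line, and the 1 MiB chunk size is an arbitrary assumption. It simply walks the device sequentially at idle IO priority, bypassing the page cache, and reports unreadable regions instead of retrying them.

/* scrub.c - walk a block device sequentially at idle IO priority and
 * report unreadable regions.  Sketch only; error handling is minimal. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_CLASS_IDLE  3          /* values as in linux/ioprio.h */
#define IOPRIO_WHO_PROCESS 1
#define IOPRIO_PRIO(class, data) (((class) << 13) | (data))

#define CHUNK (1 << 20)               /* read 1 MiB at a time */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
        return 1;
    }

    /* run at idle priority so the foreground workload is not disturbed */
    syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
            IOPRIO_PRIO(IOPRIO_CLASS_IDLE, 0));

    int fd = open(argv[1], O_RDONLY | O_DIRECT);   /* O_DIRECT: skip page cache */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, CHUNK)) {       /* O_DIRECT needs alignment */
        perror("posix_memalign");
        return 1;
    }

    off_t off = 0;
    for (;;) {
        ssize_t n = pread(fd, buf, CHUNK, off);
        if (n == 0)
            break;                                 /* end of device */
        if (n < 0) {
            /* report the bad region and move on instead of hammering it */
            printf("read error at byte offset %lld: %s\n",
                   (long long)off, strerror(errno));
            off += CHUNK;
            continue;
        }
        off += n;
    }
    free(buf);
    close(fd);
    return 0;
}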
On Wed, 2008-10-22 at 09:03 -0400, Ric Wheeler wrote:
> Avi Kivity wrote:
> > I think there's a place for a scrubber to continuously verify filesystem data and metadata, at low io priority, and correct any correctable errors. [...]
>
> Scrubbing is key for many scenarios since errors can "grow" even in places where previous IO has been completed without flagging an error.

We'll definitely have background scrubbing. It is a key part of the health of the FS, I think.

-chris
Ric Wheeler wrote:
> I think that we do handle a failure in the case that you outline above since the FS will be able to notice the error before it sends a commit down (and that commit is wrapped in the barrier flush calls). This is the easy case since we still have the context for the IO.

I'm no FS guy, but for that to be true the FS should be waiting for all the outstanding IOs to finish before issuing a barrier, and then it actually doesn't need barriers at all - it can do the same with a cache flush.

> It is more challenging (and kind of related) if the IO done in (4) has been acked by the drive, the drive later destages (not as part of the flush) its write cache and then an error happens. In this case, there is nothing waiting on the initiator side to receive the IO error. We have effectively lost the context for that IO.

IIUC, that should be detectable from FLUSH whether the destaging occurred as part of the flush or not, no?

> The only way to detect this is on replay (if the journal has checksums enabled or the error will be flagged as a media error).

If it's not reported on FLUSH, it basically amounts to silent data corruption and only checksums can help.

Thanks.

--
tejun
On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote:
> [...]
> Ok, now for the bad news: money is invented. If you replace a disk before real failure you won't get replacement from the manufacturer. [...]

Hardware vendors already do replace disks based on policies defined by their own array hardware. These are already predictive.

-chris
Ric Wheeler wrote:
> Scrubbing is key for many scenarios since errors can "grow" even in places where previous IO has been completed without flagging an error.
>
> Some neat tricks are:
>
> (1) use block level scrubbing to detect any media errors. [...] There is a special command called "READ_VERIFY" that can be used to validate the sectors without actually moving data from the target to the host, so you can scrub without consuming page cache, etc.

This has the disadvantage of not catching errors that were introduced while writing; the very errors that btrfs checksums can catch.

> (2) sign and validate the object at the file level, say by validating a digital signature. This can catch high level errors (say the app messed up).

Btrfs extent-level checksums can be used for this. This is just below the application level, but good enough IMO.

> Note that this scrubbing needs to be carefully tuned to not interfere with the foreground workload, using something like IO nice or the other IO controllers being kicked about might help :-)

Right. Further, reading the disk in logical block order will help reduce seeks. Btrfs's back references, if cached properly, will help with this as well.

--
error compiling committee.c: too many arguments to function
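For reference on the extent-level checksums mentioned above: btrfs uses CRC-32C (the Castagnoli polynomial) as its checksum algorithm. A bitwise sketch of that checksum is shown below, purely for illustration; real implementations use a lookup table or the SSE4.2 instruction, and btrfs's on-disk seed and finalization conventions are not reproduced here.

/* crc32c.c - bitwise CRC-32C (Castagnoli), the checksum family btrfs
 * uses for data and metadata.  Illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = buf;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)          /* reflected polynomial 0x82F63B78 */
            crc = (crc >> 1) ^ (0x82F63B78U & -(crc & 1));
    }
    return ~crc;
}

int main(void)
{
    const char extent[] = "pretend this is a 4k extent";   /* made-up payload */
    printf("crc32c = 0x%08x\n", crc32c(0, extent, strlen(extent)));
    return 0;
}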
On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote:
> Ric Wheeler wrote:
> > I think that we do handle a failure in the case that you outline above since the FS will be able to notice the error before it sends a commit down (and that commit is wrapped in the barrier flush calls). This is the easy case since we still have the context for the IO.
>
> I'm no FS guy but for that to be true FS should be waiting for all the outstanding IOs to finish before issuing a barrier and actually doesn't need barriers at all - it can do the same with flush_cache.

We wait and then barrier. If the barrier returned status showing that a previously acked IO had actually failed, we could do something to make sure the FS was consistent.

-chris
Tejun Heo wrote:
> Ric Wheeler wrote:
>> I think that we do handle a failure in the case that you outline above since the FS will be able to notice the error before it sends a commit down (and that commit is wrapped in the barrier flush calls). This is the easy case since we still have the context for the IO.
>
> I'm no FS guy but for that to be true FS should be waiting for all the outstanding IOs to finish before issuing a barrier and actually doesn't need barriers at all - it can do the same with flush_cache.

Waiting for the target to ack an IO is not sufficient, since the target's ack does not (with the write cache enabled) mean that the data is on persistent storage. The key is to make sure the transaction's commit block is not written out of sequence, i.e. without first flushing the dependent IO from the transaction. If we disable the write cache, then file systems effectively do exactly the right thing today, as you describe :-)

>> It is more challenging (and kind of related) if the IO done in (4) has been acked by the drive, the drive later destages (not as part of the flush) its write cache and then an error happens. In this case, there is nothing waiting on the initiator side to receive the IO error. We have effectively lost the context for that IO.
>
> IIUC, that should be detectable from FLUSH whether the destaging occurred as part of flush or not, no?

I am not sure what happens to a write that fails to get destaged from the cache. It probably depends on the target firmware, but I imagine that the target cannot hold onto it forever (or all subsequent flushes would always fail).

>> The only way to detect this is on replay (if the journal has checksums enabled or the error will be flagged as a media error).
>
> If it's not reported on FLUSH, it basically amounts to silent data corruption and only checksums can help.

Agreed - checksums (or proper handling of media errors) are the only way to detect this.

Ric
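The same "flush before commit" ordering can be sketched from user space, which may make the discussion above easier to follow. The file name below is hypothetical and this is only an illustration of the pattern, not how the btrfs commit path is implemented: the data is written and pushed to stable storage (fdatasync is where the block layer issues the cache flush when barriers/flushes are enabled), and only then is the commit record written and flushed in turn. An error returned by the first fdatasync is the last clean chance to notice a failed write before committing.

/* commit.c - illustrate the flush-before-commit ordering from user
 * space.  Hypothetical file name; sketch only. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("journal.dat", O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    const char data[]   = "payload block\n";
    const char commit[] = "commit record\n";

    /* step 1: write the dependent data */
    if (write(fd, data, sizeof(data) - 1) < 0) {
        perror("write data");
        return 1;
    }

    /* step 2: wait for it to reach stable storage; an error here is the
     * last chance to react before the commit record goes out */
    if (fdatasync(fd) < 0) {
        perror("fdatasync data");
        return 1;
    }

    /* step 3: only now write the commit record, and flush again */
    if (write(fd, commit, sizeof(commit) - 1) < 0) {
        perror("write commit");
        return 1;
    }
    if (fdatasync(fd) < 0) {
        perror("fdatasync commit");
        return 1;
    }

    close(fd);
    return 0;
}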
Chris Mason wrote:
> On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote:
>> [...]
>> Ok, now for the bad news: money is invented. If you replace a disk before real failure you won't get replacement from the manufacturer. [...]
>
> Hardware vendors already do replace disks based on policies defined by their own array hardware. These are already predictive.

One key is not to replace the drives too early - you can often recover significant amounts of data from a drive that is on its last legs. This can be useful even in RAID rebuilds, since with today's enormous drive capacities you might hit a latent error during the rebuild on one of the presumed healthy drives.

Of course, if you don't have a spare drive in your configuration, this is not practical...

ric
Chris Mason wrote:
> On Wed, 2008-10-22 at 22:15 +0900, Tejun Heo wrote:
>> [...]
>
> We wait and then barrier. If the barrier returned status that a previously acked IO had actually failed, we could do something to make sure the FS was consistent.

As I mentioned in a reply to Tejun, I am not sure that we can count on the barrier op giving us status for IOs that failed to destage cleanly.

Waiting and then doing the FLUSH seems to give us the best coverage for normal failures (and your own testing shows that it is hugely effective in reducing some types of corruption, at least :-)).

If you look at the types of common drive failures, I would break them into two big groups.

The first group would be transient errors - i.e., this IO fails (usually a read), but a subsequent IO will succeed, with or without a sector remapping happening. Causes might be:

(1) just a bad read due to dirt on the surface of the drive - the read will always fail, a write might clean the surface and restore it to useful life.
(2) vibrations (dropping your laptop, rolling a big machine down the data center, passing trains :-)).
(3) adjacent sector writes - hot spotting on drives can degrade the data on adjacent tracks. This causes IO errors on reads for data that was successfully written before, but the track itself is still perfectly fine.

All of these first types of errors need robust error handling on IO errors (i.e., fail quickly, check for IO errors and isolate the impact of the error as best we can), but they do not indicate a bad drive.

The second group would be persistent failures - no matter what you do to the drive, it is going to kick the bucket! Common causes might be:

(1) a few bad sectors (1-5% of the drive's remapped sector table, for example).
(2) a bad disk head - this is a very common failure, and you will see a large number of bad sectors.
(3) bad components (say bad memory chips in the write cache), which can produce consistent errors.
(4) failure to spin up (total drive failure).

The challenging part is to figure out, as best as we can, how to differentiate the causes of IO failures or checksum failures and to respond correctly. Array vendors spend a lot of time pulling their hair out trying to do predictive drive failure, but it is really, really hard to get right...

ric
On Wed, 2008-10-22 at 14:19 +0200, Stephan von Krawczynski wrote:
> On Tue, 21 Oct 2008 13:49:43 -0400 Chris Mason <chris.mason@oracle.com> wrote:
> > [...]
>
> Ok, this is interesting. In another thread I was told parallel mounts are really complex [...] All boxes I know have tons of RAM, but fs finds no place in RAM to put large parts (if not all) of the structural fs data including filenames?

I'm afraid it just isn't practical to keep all of the metadata in ram all of the time.

> Besides the simple fact that RAM is always faster than any known disk be it rotating or not, and that RAM is just there, whats the word for not doing it?

People expect the OS to use the expensive ram for the data they use most often.

> > NFS (done) CRFS (under development), maybe ceph as well which is also under development.
>
> NFS is a good example for a fs that never got redesigned for modern world. [...] Your NFS server dies, your backup server cannot take over the clients without them resetting their NFS-link (which means reboot to many applications) - no way. [...]

As someone else replied, NFS is stateless, and they have made a large number of design tradeoffs to stay that way. So your example above isn't quite fair; it is one of the things the NFS protocol can handle well.

With that said, CRFS is a network filesystem designed explicitly for btrfs, and I have high hopes for it.

> > Versioning is solved in userspace. We would never be able to implement everything that git or mercurial can do inside the filesystem.
>
> Well, quite often the question is not about whole trees of data to be versioned. Even single (few) files or dirs can be of interest. And you want people to set up a complete user space monster to version three openoffice documents (only a rather flawed example of course)? Lots of people need a basic solution, not the groundbreaking answer to all questions.

One of the things that makes FS design so difficult is that people try to solve lots of problems with filesystems. Every feature we include is a mixture of disk format, policy and userland interface that must be tested in combination with all of the other features, and maintained pretty much forever.

A big part of my job is to find the features that are sufficient to justify the expense of starting from scratch, and to get things finished within a reasonable amount of time.

Btrfs already has an ioctl to create a COW copy of a file (see the bcp command in btrfs-progs). This is enough for applications to do their own single-file versioning. I understand this isn't the automatic system you would like for the use case above, but I have to draw the line somewhere in terms of providing the tools needed to implement features vs including all the features in the FS.

A big part of why Btrfs is gaining ground today is that we're focusing on finishing the features we have instead of adding the kitchen sink. It is very hard to say no to interested users, but it's a reality of actually bringing the software to market.

> > Btrfs will record some error information about the drive so it can decide what to do with failures. But, remembering that sector #12345768 is bad doesn't help much. When the drive returned the IO error it remapped the sector and the next write will probably succeed.
>
> Problem with probability is that software is pretty bad in judging. That's why my proposal was, lets do it and make it configurable for an admin that has a better idea of the current probability.

Let me reword my answer ;). The next write will always succeed unless the drive is out of remapping sectors. If the drive is out, it is only good for reads and for holding down paper on your desk. This means we'll want to do a raid rebuild, which won't use that drive unless something horrible has gone wrong.

> > There's an interesting paper about how netapp puts the drive into rehab and is able to avoid service calls by rewriting the bad sectors and checking them over. That's a little ways off for Btrfs.
>
> It will become more interesting what remapping means in a world full of flash-disks. Does it mean a disk must be replaced when some or even lots of sectors are dead?

Yes, the disk must be replaced. Our job here is not to provide people with hope that they can get to some of the data some of the time. Our job is to tell them a given component is bad and to have it replaced.

-chris
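To make the COW-copy ioctl mentioned above concrete, here is a rough sketch of how an application might call it. The ioctl number is hand-defined from the btrfs-progs headers of this era (BTRFS_IOC_CLONE, magic 0x94, nr 9) purely for illustration; treat that as an assumption and include the real header in practice. The call shares the source file's extents with the destination rather than copying bytes, and both files must live on the same btrfs filesystem.

/* cowcopy.c - create a COW copy of src as dst via the btrfs clone ioctl.
 * Sketch only; the ioctl number is hand-defined, see the note above. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define BTRFS_IOC_CLONE _IOW(0x94, 9, int)   /* assumed value */

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }

    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) {
        perror("open");
        return 1;
    }

    /* share src's extents with dst instead of copying the bytes */
    if (ioctl(dst, BTRFS_IOC_CLONE, src) < 0) {
        perror("ioctl(BTRFS_IOC_CLONE)");
        return 1;
    }

    close(src);
    close(dst);
    return 0;
}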
On Wed, 22 Oct 2008 09:15:45 -0400 Chris Mason <chris.mason@oracle.com> wrote:

> [...]
> Hardware vendors already do replace disks based on policies defined by their own array hardware. These are already predictive.
> -chris

Let's agree that the market for drives, arrays and related stuff is big and contains just about any example one needs for arguing :-) Nevertheless, we can probably agree that if John Doe meets big-player and tries to warranty-replace a non-dead drive, he will have trouble.

--
Regards,
Stephan
On Wed, 2008-10-22 at 09:38 -0400, Ric Wheeler wrote:
> [...]
> The first group would be transient errors - i.e., this IO fails (usually a read), but a subsequent IO will succeed with or without a sector remapping happening. [...]

4) Transient conditions such as heat or other problems made the drive give errors.

Combine your matrix with the single-drive install vs the mirrored configuration and we get a lot of variables. What I'd love to have is a rehab tool for drives that works them over and decides if they should stay or go.

It is somewhat difficult to run the rehab on a mounted single-disk install, but we can start with the multi-device config and work our way out from there.

For barrier flush, IO errors reported back by the barrier flush would allow us to know when corrective action was required.

-chris
On Wed, 22 Oct 2008 05:48:30 -0700 "Jeff Schroeder" <jeffschroed@gmail.com> wrote:

> You are somewhat misinformed on this. Perhaps the Linux nfs server can't cope, but I doubt it. NFS was designed to be stateless. [...] You might be able to do something similar using nfsd and keepalived if both servers were connected to the same storage. Setting that up would be trivial. [...]

We tried that with pure linux NFS, and it does not work. The clients do not recover. After trying ourselves and failing, we found several documents on the net that described exactly the same problem and its reasons. Very likely netapp found that out too and did something about it.

Ah yes, and by the way, your description contains another of the discussed problems: "both servers were connected to the same storage". If you mean that both servers really access the same storage at the same time, your software options are pretty few in number.

--
Regards,
Stephan
On 10/22/2008 3:50 PM, Chris Mason wrote:
> Let me reword my answer ;). The next write will always succeed unless the drive is out of remapping sectors. If the drive is out, it is only good for reads and holding down paper on your desk.

I have a fairly new SATA disk with about 3000 hours of 24/7 duty (very light load), 0 remapped sectors and 8 consecutive sectors with read/write errors. Still, it did not perform remapping in the face of heavy writes to the bad sectors. Now what? For whatever reason, remapping does not always work (or mine was produced with a total of zero remapping sectors…).

- Matthias
Avi Kivity wrote:
> jim owens wrote:
>> Remember that the device bandwidth is the limiter, so even when each host has a dedicated path to the device (as in dual port SAS or FC), that 2nd host cuts the throughput by more than 1/2 with uncoordinated seeks and transfers.
>
> That's only a problem if there is a single shared device. Since btrfs supports multiple devices, each host could own a device set and access from other hosts would be through the owner. You would need RDMA to get reasonable performance and some kind of dual-porting to get high availability. Each host could control the allocation tree for its devices.

No. Every device, including a monster $$$ array, has the problem.

As I said before, unless the application is partitioned there is always data host2 needs from host1's disk, and that slows down host1.

If host2 seldom needs any host1 data, then you are describing a configuration that can be done easily by each host having a separate filesystem for the device it owns by default. Each host nfs-mounts the other host's data, and if host1 fails, host2 can direct-mount host1-fs from the shared array.

Even with multiple disks under the same filesystem as separately allocated storage, there is still the problem of shared namespace metadata that slows down both hosts. If you don't need shared namespaces then you absolutely don't want a cluster fs.

A cluster fs is useful, but the cost can be high, so using it for a single-host fs is not a good idea.

jim
Chris Mason wrote:
> On Wed, 2008-10-22 at 09:38 -0400, Ric Wheeler wrote:
>> [...]
>
> 4) Transient conditions such as heat or other problems made the drive give errors.

Yes, heat is an issue (as well as severe cold), since drives have parts that expand and contract :-)

> Combine your matrix with the single drive install vs the mirrored configuration and we get a lot of variables. What I'd love to have is a rehab tool for drives that works it over and decides if it should stay or go.

That would be a really nice thing to have, and not really that difficult to sketch out. MD has some of that built in, but this is also something that we could do pretty easily up in user space.

> It is somewhat difficult to run the rehab on a mounted single disk install, but we can start with the multi-device config and work our way out from there.

Scanning a mounted drive with read-verify or object-level signature checking can be done on mounted file systems...

> For barrier flush, io errors reported back by the barrier flush would allow us to know when corrective action was required.

As I mentioned before, this would be great, but I am not sure that it would work that way (certainly not consistently across devices).

ric
jim owens wrote:
> Avi Kivity wrote:
>> That's only a problem if there is a single shared device. Since btrfs supports multiple devices, each host could own a device set and access from other hosts would be through the owner. [...]
>
> No. Every device including a monster $$$ array has the problem.
>
> As I said before, unless the application is partitioned there is always data host2 needs from host1's disk and that slows down host1.

The CPU load should not be significant if you have RDMA. Or are you talking about the seek load? Since host1's load should be distributed over all devices in the system, overall seek capacity increases as you add more nodes.

> If host2 seldom needs any host1 data, then you are describing a configuration that can be done easily by each host having a separate filesystem for the device it owns by default. Each host nfs mounts the other host's data and if host1 fails, host2 can direct mount host1-fs from the shared array.

Separate namespaces are uninteresting to me. That's just pushing the problem back to the user.

> Even with multiple disks under the same filesystem as separate allocated storage there is still the problem of shared namespace metadata that slows down both hosts. If you don't need shared namespaces then you absolutely don't want a cluster fs.

If you separate the allocation metadata to the node that owns the storage, and the file metadata to the node actively using the file, the slowdown should be low in most cases. Problems begin when all nodes access the same file, but that's relatively rare. Even then, when the file size does not change and the data is preallocated, it's possible to achieve acceptable overhead.

> A cluster fs is useful, but the cost can be high so using it for a single-host fs is not a good idea.

Development costs, yes. But I don't see why the runtime overhead can't disappear when running on a single host. Sort of like running an smp kernel on a uniprocessor (I agree the fs problem is much bigger).

--
error compiling committee.c: too many arguments to function
Ric Wheeler wrote:
> One key is not to replace the drives too early - you often can recover significant amounts of data from a drive that is on its last legs. This can be useful even in RAID rebuilds since with today's enormous drive capacities, you might hit a latent error during the rebuild on one of the presumed healthy drives.
>
> Of course, if you don't have a spare drive in your configuration, this is not practical...

Why would you have a spare drive? That's a wasted spindle.

You want to have spare capacity, enough for one or two (or fifteen) drives' worth of data. When a drive goes bad, you rebuild into the spare capacity you have. When you replace the drive, the filesystem moves data onto the new drive to take advantage of the new spindle.

--
error compiling committee.c: too many arguments to function
Matthias Wächter wrote:> On 10/22/2008 3:50 PM, Chris Mason wrote: > > >> Let me reword my answer ;). The next write will always succeed unless >> the drive is out of remapping sectors. If the drive is out, it is only >> good for reads and holding down paper on your desk. >> > > I have a fairly new SATA disk with about 3000 hours of 24/7 duty > (very light load), 0 remapped sectors and 8 consecutive sectors with > read/write errors. Still, it did not perform remapping facing heavy > writes on the bad sectors. Now what? For whatever reason, remapping > not always works (or mine was produced with a total of zero > remapping sectors…). > > - Matthias >It sounds like this drive is actually fine, you might have seen some transient issues. Are you positive that the writes went directly to the sectors in question - that should either clear the error or cause it to remap the sectors internally. (Reads will continue to fail). Mark Lord has added some options to hdparm that you might be able to use to expressly clear the sectors in question in a more direct way. Regards, Ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Concerning this discussion, I'd like to put up some "requests" which strongly oppose those brought up initially:

- If you run into an error in the fs structure, or any IO error that prevents you from bringing the fs into a consistent state, please simply oops. If a user feels that availability is a main issue, he has to use a failover solution. In that case a fast and clean cut is desirable, not a "pray-and-hope mode" or a "90% mode". If availability is not the issue, it is in any case most important that the data on the fs is safe. If you don't oops, you risk inflicting further damage on the filesystem and ending up with a completely destroyed fs.

- If you get any IO error, please **don't** put up a number of retries or anything like that. If the device reports an error, simply believe it. It is bad enough that many block drivers or controllers try to be smart and put up hundreds of retries. By adding further retries you only end up wasting hours on useless retries. If availability is an issue, the user again has to put up a failover solution. Again, a clean cut is what is needed. The user has to make sure he uses an appropriate configuration according to the importance of his data (mirroring in the fs and/or RAID, failover, ...).

- If during mount something unexpected comes up and you can't be sure that the fs will work properly, please deny the mount and request a fsck. This can easily be handled by a start or mount script. During mount, take the time you need to ensure that the fs looks proper and is safe to use. I'd rather know during boot that something is wrong than run with a foul fs and end up with data loss or some other mixup later on.

- btrfs is no cluster fs, so there is no point in even thinking about it. If somebody feels he needs multiple writeable mounts of the same fs, please use a cluster fs. Of course, you have to live with the tradeoffs. Dreaming of a fs that uses something like witchcraft to do things like locking, quorums and cache synchronisation without penalty and, of course, without any configuration, is pointless.

In my opinion, the whole thing comes from the idea of using cheap hardware and out-of-the-box configurations to keep promises of reliability and availability which are not realistic. There is a reason why there are more expensive HDDs, RAIDs, and SANs with volume mirroring, multipathing and so on. Simply ignoring the fact that you have to use the proper tools to address specific problems, and praying to the tooth fairy to put a solve-all-my-problems fs under your pillow, is no solution. I'd rather have a solid fs with deterministic behavior and some state-of-the-art features.

Just my 2c.

(Gerald)
On Wed, 2008-10-22 at 16:32 +0200, Avi Kivity wrote:> Ric Wheeler wrote: > > One key is not to replace the drives too early - you often can recover > > significant amounts of data from a drive that is on its last legs. > > This can be useful even in RAID rebuilds since with today''s enormous > > drive capacities, you might hit a latent error during the rebuild on > > one of the presumed healthy drives. > > > > Of course, if you don''t have a spare drive in your configuration, this > > is not practical... > > Why would you have a spare drive? That''s a wasted spindle. > > You want to have spare capacity, enough for one or two (or fifteen) > drives'' worth of data. When a drive goes bad, you rebuild into the > spare capacity you have. >You want spare capacity that does not degrade your raid levels if you move the data onto it. In some configs, this will be a hot spare, in others it''ll just be free space. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Chris Mason wrote:>> You want to have spare capacity, enough for one or two (or fifteen) >> drives'' worth of data. When a drive goes bad, you rebuild into the >> spare capacity you have. >> >> > > You want spare capacity that does not degrade your raid levels if you > move the data onto it. In some configs, this will be a hot spare, in > others it''ll just be free space. >What kind of configuration would prefer a spare disk to spare capacity? RAID6 with a small number of disks? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ric Wheeler wrote:> Matthias Wächter wrote: >> On 10/22/2008 3:50 PM, Chris Mason wrote: >> >> >>> Let me reword my answer ;). The next write will always succeed unless >>> the drive is out of remapping sectors. If the drive is out, it is only >>> good for reads and holding down paper on your desk. >>> >> >> I have a fairly new SATA disk with about 3000 hours of 24/7 duty >> (very light load), 0 remapped sectors and 8 consecutive sectors with >> read/write errors. Still, it did not perform remapping facing heavy >> writes on the bad sectors. Now what? For whatever reason, remapping >> not always works (or mine was produced with a total of zero >> remapping sectors…). >> >> - Matthias >> > > It sounds like this drive is actually fine, you might have seen some > transient issues. > > Are you positive that the writes went directly to the sectors in > question - that should either clear the error or cause it to remap the > sectors internally. (Reads will continue to fail). > > Mark Lord has added some options to hdparm that you might be able to use > to expressly clear the sectors in question in a more direct way.let me add 2 other thoughts from my experience with other drive types: - check for firmware updates. - some drives have a remapping mode where it fails the write, reports to the host, then the host will send a remap-this-sector command. this mode might be selectable on the drive. if the host driver does not do the remap that sector will continue to fail. jim -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Avi Kivity wrote:
> Ric Wheeler wrote:
>> One key is not to replace the drives too early - you often can
>> recover significant amounts of data from a drive that is on its last
>> legs. This can be useful even in RAID rebuilds since with today's
>> enormous drive capacities, you might hit a latent error during the
>> rebuild on one of the presumed healthy drives.
>>
>> Of course, if you don't have a spare drive in your configuration,
>> this is not practical...
>
> Why would you have a spare drive? That's a wasted spindle.

You have a spare drive because you care about data integrity and have too many years of experience in disk arrays to go without :-)

> You want to have spare capacity, enough for one or two (or fifteen)
> drives' worth of data. When a drive goes bad, you rebuild into the
> spare capacity you have.

That is a different model (and one that makes sense; we used it in Centera for object-level protection schemes). It is a nice model as well, but not how most storage works today.

> When you replace the drive, the filesystem moves data into the new
> drive to take advantage of the new spindle.

When you buy a storage solution (hardware or software), the key here is "utilized capacity." If you have a 2U enclosure that can host, say, 12-15 drives, people normally leave one drive as a spare. RAID6 is another way to do this. You can do a 4+2 and 4+2 with 66% utilized capacity in RAID6, or possibly a RAID5 scheme using, say, 5+1 and 4+1 with one global spare (75% utilized capacity).

That gives you the chance to rebuild your RAID group without having to physically visit the data center. You can also do fancy stuff with the spare (like migrating as many blocks as possible to that spare before the RAID rebuild), which reduces your exposure to a 2nd drive failure and speeds up your rebuild time.

In the end, whether you use a block-based RAID solution or an object-based solution, you just need to figure out how to balance your utilized capacity against performance and data integrity needs.

ric
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
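(Spelling out the arithmetic behind those utilized-capacity figures, for a 12-slot enclosure: two RAID6 groups of 4+2 give 8 data drives out of 12 slots, i.e. 8/12 ≈ 66%; RAID5 groups of 5+1 and 4+1 plus one global spare give 9 data drives out of 12 slots, i.e. 9/12 = 75%. Utilized capacity here is simply data drives divided by total slots.)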
Ric Wheeler wrote:>> You want to have spare capacity, enough for one or two (or fifteen) >> drives'' worth of data. When a drive goes bad, you rebuild into the >> spare capacity you have. > > That is a different model (and one that makes sense, we used that in > Centera for object level protection schemes). It is a nice model as > well, but not how most storage works today.Well, btrfs is not about duplicating how most storage works today. Spare capacity has significant advantages over spare disks, such as being able to mix disk sizes, RAID levels, and better performance.>> >> When you replace the drive, the filesystem moves data into the new >> drive to take advantage of the new spindle. >> > > When you buy a storage solution (hardware or software), the key here > is "utilized capacity." If you have an enclosure that can host say > 12-15 drives in a 2U enclosure, people normally leave one drive as > spare. RAID6 is another way to do this. You can do a 4+2 and 4+2 with > 66% utilized capacity in RAID 6 or possibly a RAID5 scheme using like > 5+1 and 4+1 with one global spare (75% utilized capacity). > > That gives you the chance to do rebuild your RAID group without > having to physically visit the data center. You can also do fancy > stuff with the spare (like migrate as many blocks as possible before > the RAID rebuild to that spare) which reduces your exposure to the 2nd > drive failure and speeds up your rebuild time. > > In the end, whether you use a block based RAID solution or an object > based solution, you just need to figure out how to balance your > utilized capacity against performance and data integrity needs.In both models (spare disk and spare capacity) the storage utilization is the same, or nearly so. But with spare capacity you get better performance since you have more spindles seeking for your data, and since less of the disk surface is occupied by data, making your seeks shorter. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Avi Kivity wrote:> Ric Wheeler wrote: >>> You want to have spare capacity, enough for one or two (or fifteen) >>> drives'' worth of data. When a drive goes bad, you rebuild into the >>> spare capacity you have. >> >> That is a different model (and one that makes sense, we used that in >> Centera for object level protection schemes). It is a nice model as >> well, but not how most storage works today. > > Well, btrfs is not about duplicating how most storage works today. > Spare capacity has significant advantages over spare disks, such as > being able to mix disk sizes, RAID levels, and better performance.Sure, there are advantages that go in favour of one or the other approaches. But btrfs is also about being able to use common hardware configurations without having to reinvent where we can avoid it (if we have a working RAID or enough drives to do RAID5 with spares or RAID6, we want to be able to delegate that off to something else if we can).> >>> >>> When you replace the drive, the filesystem moves data into the new >>> drive to take advantage of the new spindle. >>> >> >> When you buy a storage solution (hardware or software), the key here >> is "utilized capacity." If you have an enclosure that can host say >> 12-15 drives in a 2U enclosure, people normally leave one drive as >> spare. RAID6 is another way to do this. You can do a 4+2 and 4+2 >> with 66% utilized capacity in RAID 6 or possibly a RAID5 scheme using >> like 5+1 and 4+1 with one global spare (75% utilized capacity). >> >> That gives you the chance to do rebuild your RAID group without >> having to physically visit the data center. You can also do fancy >> stuff with the spare (like migrate as many blocks as possible before >> the RAID rebuild to that spare) which reduces your exposure to the >> 2nd drive failure and speeds up your rebuild time. >> >> In the end, whether you use a block based RAID solution or an object >> based solution, you just need to figure out how to balance your >> utilized capacity against performance and data integrity needs. > > In both models (spare disk and spare capacity) the storage utilization > is the same, or nearly so. But with spare capacity you get better > performance since you have more spindles seeking for your data, and > since less of the disk surface is occupied by data, making your seeks > shorter. >True, you can get more performance if you use all of the hardware you have all of the time. The major difficulty with the spare capacity model is that your recovery is not as simple and well understood as RAID rebuilds. If you assume that whole drives fail under btrfs mirroring, you are not really doing anything more than simple RAID, or do I misunderstand your suggestion? I don''t see the point about head seeking. In RAID, you also have the same layout so you minimize head movement (just move more heads per IO in parallel). ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ric Wheeler wrote:>> >> Well, btrfs is not about duplicating how most storage works today. >> Spare capacity has significant advantages over spare disks, such as >> being able to mix disk sizes, RAID levels, and better performance. > > Sure, there are advantages that go in favour of one or the other > approaches. But btrfs is also about being able to use common hardware > configurations without having to reinvent where we can avoid it (if we > have a working RAID or enough drives to do RAID5 with spares or RAID6, > we want to be able to delegate that off to something else if we can).Well, if you have an existing RAID (or have lots of $$$ to buy a new one), you needn''t tell Btrfs about it. Just be sure not to enable Btrfs data redundancy, or you''ll have redundant redundancy, which is expensive. What Btrfs enables with its multiple device capabilities is to assemble a JBOD into a filesystem-level data redundancy system, which is cheaper, more flexible (per-file data redundancy levels), and faster (no need for RMW, since you''re always COWing).> The major difficulty with the spare capacity model is that your > recovery is not as simple and well understood as RAID rebuilds.That''s Chris''s problem. :-)> If you assume that whole drives fail under btrfs mirroring, you are > not really doing anything more than simple RAID, or do I misunderstand > your suggestion?I do assume that whole drives fail, but RAIDing and rebuilding is file level. So one extent on a failed disk might be part of a mirrored file, while another extent can be part of a 14-member RAID6 extent. A rebuild would iterate over all disk extents (making use of the backref tree), determine which file contains that extent, and rebuild that extent using spare storage on other disks.> I don''t see the point about head seeking. In RAID, you also have the > same layout so you minimize head movement (just move more heads per IO > in parallel).Suppose you have 5 disks with 1 spare. Suppose you are reading from a full fs. On a disk-level RAID, all disks are full. So you have 5 spindles seeking over 100% of the disk surface. With spare capacity, you have 6 disks which are 5/6 full (retaining the same utilization as old-school RAID). So you have 6 spindles, each with a seek range that is 5/6 of a whole disk, so more seek heads _and_ faster individual seeks. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Avi Kivity wrote:> Ric Wheeler wrote: >>> >>> Well, btrfs is not about duplicating how most storage works today. >>> Spare capacity has significant advantages over spare disks, such as >>> being able to mix disk sizes, RAID levels, and better performance. >> >> Sure, there are advantages that go in favour of one or the other >> approaches. But btrfs is also about being able to use common hardware >> configurations without having to reinvent where we can avoid it (if >> we have a working RAID or enough drives to do RAID5 with spares or >> RAID6, we want to be able to delegate that off to something else if >> we can). > > Well, if you have an existing RAID (or have lots of $$$ to buy a new > one), you needn''t tell Btrfs about it. Just be sure not to enable > Btrfs data redundancy, or you''ll have redundant redundancy, which is > expensive. > > What Btrfs enables with its multiple device capabilities is to > assemble a JBOD into a filesystem-level data redundancy system, which > is cheaper, more flexible (per-file data redundancy levels), and > faster (no need for RMW, since you''re always COWing).I think that the btrfs plan is still to push more complicated RAID schemes off to MD (RAID6, etc) so this is an issue even with a JBOD. It will be interesting to map out the possible ways to use built in mirroring, etc vs the external RAID and actually measure the utilized capacity and performance (online & during rebuilds).> >> The major difficulty with the spare capacity model is that your >> recovery is not as simple and well understood as RAID rebuilds. > > That''s Chris''s problem. :-)Unless he can pawn it off on some other lucky developer :-)> >> If you assume that whole drives fail under btrfs mirroring, you are >> not really doing anything more than simple RAID, or do I >> misunderstand your suggestion? > > I do assume that whole drives fail, but RAIDing and rebuilding is file > level. So one extent on a failed disk might be part of a mirrored > file, while another extent can be part of a 14-member RAID6 extent. > > A rebuild would iterate over all disk extents (making use of the > backref tree), determine which file contains that extent, and rebuild > that extent using spare storage on other disks. > >> I don''t see the point about head seeking. In RAID, you also have the >> same layout so you minimize head movement (just move more heads per >> IO in parallel). > > Suppose you have 5 disks with 1 spare. Suppose you are reading from a > full fs. On a disk-level RAID, all disks are full. So you have 5 > spindles seeking over 100% of the disk surface. With spare capacity, > you have 6 disks which are 5/6 full (retaining the same utilization as > old-school RAID). So you have 6 spindles, each with a seek range that > is 5/6 of a whole disk, so more seek heads _and_ faster individual seeks. >I think that this is somewhat correct, but most likely offset by the performance levels of streaming IO vs IO with any seeks (at least for full file systems). Certainly, the spare capacity model is increasingly better when you have really light utilized file systems... Don''t think that I am arguing against the model, just saying that it is not always as clear cut as you might think.... ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2008-10-22 at 11:25 -0400, Ric Wheeler wrote:> Avi Kivity wrote: > > Ric Wheeler wrote: > >>> > >>> Well, btrfs is not about duplicating how most storage works today. > >>> Spare capacity has significant advantages over spare disks, such as > >>> being able to mix disk sizes, RAID levels, and better performance. > >> > >> Sure, there are advantages that go in favour of one or the other > >> approaches. But btrfs is also about being able to use common hardware > >> configurations without having to reinvent where we can avoid it (if > >> we have a working RAID or enough drives to do RAID5 with spares or > >> RAID6, we want to be able to delegate that off to something else if > >> we can). > > > > Well, if you have an existing RAID (or have lots of $$$ to buy a new > > one), you needn''t tell Btrfs about it. Just be sure not to enable > > Btrfs data redundancy, or you''ll have redundant redundancy, which is > > expensive. > > > > What Btrfs enables with its multiple device capabilities is to > > assemble a JBOD into a filesystem-level data redundancy system, which > > is cheaper, more flexible (per-file data redundancy levels), and > > faster (no need for RMW, since you''re always COWing). > > I think that the btrfs plan is still to push more complicated RAID > schemes off to MD (RAID6, etc) so this is an issue even with a JBOD.At least v1.0 won''t have raid6. Over the longer term I hope to include it because managing the storage once in btrfs and once in md is going to be a bit clumsy. It also limits the mixed mode functionality like different stripe sizes for data vs metadata or metadata mirroring and data raid6 that will allow us to perform well. The goal will be to make a library of raid routines based on md that other storage will be able to use. I know Christoph has been interested in this as well. But in general, the btrfs raid code can do either spare disks or spare capacity modes safely. It enforces the correct number of devices in each raid mode (as long as the admin doesn''t lie to us and feed partitions off the same device). I''ll leave the rest up to the admin. One problem with the spare capacity model is the general trend where drives from the same batch that get hammered on in the same way tend to die at the same time. Some shops will sleep better knowing there''s a hot spare and that''s fine by me. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ric Wheeler wrote:> I think that the btrfs plan is still to push more complicated RAID > schemes off to MD (RAID6, etc) so this is an issue even with a JBOD. > It will be interesting to map out the possible ways to use built in > mirroring, etc vs the external RAID and actually measure the utilized > capacity and performance (online & during rebuilds).That''s leaving a lot of performance and features on the table, IMO. We definitely want to have metadata and small files using mirroring (perhaps even three copies for some metadata). Use RAID[56] for large files. Perhaps even start files at RAID1, and have the scrubber convert them to RAID[56] when it notices they are only ever read. Keep temporary or unimportant files at RAID0. Play games with asymmetric setups (small fast disks + large slow disks). etc etc etc. Delegating things to MD throws out a lot of metadata so these things become impossible. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Chris Mason wrote:> One problem with the spare > capacity model is the general trend where drives from the same batch > that get hammered on in the same way tend to die at the same time. Some > shops will sleep better knowing there''s a hot spare and that''s fine by > me. >How does hot sparing help? All your disks die except the spare. Of course, I''ve no objection to disk sparing as an additional option; I just feel that capacity sparing is superior. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Avi Kivity wrote:> Chris Mason wrote: >> One problem with the spare >> capacity model is the general trend where drives from the same batch >> that get hammered on in the same way tend to die at the same time. Some >> shops will sleep better knowing there''s a hot spare and that''s fine by >> me. >> > > How does hot sparing help? All your disks die except the spare. > > Of course, I''ve no objection to disk sparing as an additional option; > I just feel that capacity sparing is superior. >For any given set of disks, you "just" need to do the math to compute the utilized capacity, the expected rate of drive failure, the rebuild time and then see whether you can recover from your first failure before a 2nd disk dies. In practice, this is not an academic question since drives do occasionally fail in batches (and drives from the same batch get stuffed into the same system). I suspect that what will be used in mission critical deployments will be more conservative than what is used in less critical path systems, but this will be up to the end user to configure... ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 22, 2008 at 9:52 AM, Stephan von Krawczynski <skraw@ithnet.com> wrote:> On Wed, 22 Oct 2008 09:15:45 -0400 > Chris Mason <chris.mason@oracle.com> wrote: > >> On Wed, 2008-10-22 at 14:27 +0200, Stephan von Krawczynski wrote: >> > On Tue, 21 Oct 2008 13:31:37 -0400 >> > Ric Wheeler <ricwheeler@gmail.com> wrote: >> > >> > > [...] >> > > If you have remapped a big chunk of the sectors (say more than 10%), you >> > > should grab the data off the disk asap and replace it. Worry less about >> > > errors during read, writes indicate more serious errors. >> > >> > Ok, now for the bad news: money is invented. >> > If you replace a disk before real failure you won't get replacement from the >> > manufacturer. That may sound irrelevant to someone handling 5 disks, but is >> > significant if handling 500 or more. The replacement rate is indeed much >> > higher than people think from their home pcs. >> >> Hardware vendors already do replace disks based on policies defined by >> their own array hardware. These are already predictive. > > Lets agree that the market for drives, arrays and related stuff is big and > contains just about any example one needs for arguing :-) > Nevertheless we probably agree that if john doe meets big-player and tries to > warranty-replace a non-dead drive he will have troubles. >If John Doe is using redundant storage in the first place, he just needs an emergency disk that can be swapped-in for a failing disk, and then stress-test the failing disk to death, get it replaced by manufacturer, and the replacement becomes the next standby/emergency disk. Though it would be nice to have a tool that would provide enough information to make a warranty claim -- does btrfs keep enough information for such a tool to be written? Thanks, -- miʃel salim • http://hircus.jaiku.com/ IUCS • msalim@cs.indiana.edu Fedora • salimma@fedoraproject.org MacPorts • hircus@macports.org
Ric Wheeler wrote:> Waiting for the target to ack an IO is not sufficient, since the target > ack does not (with write cache enabled) mean that it is on persistent > storage.FS waiting for completion of all the dependent writes isn''t too good latency and throughput-wise tho. It would be best if FS can indicate dependencies between write commands and barrier so that barrier doesn''t have to empty the whole queue. Hmm... Can someone tell me how much such scheme would help?> The key is to make your transaction commit insure that the commit block > itself is not written out of sequence without flushing the dependent IO > from the transaction. > > If we disable the write cache, then file systems effectively do exactly > the right thing today as you describe :-)For most SATA drives, disabling write back cache seems to take high toll on write throughput. :-(>> IIUC, that should be detectable from FLUSH whether the destaging >> occurred as part of flush or not, no? >> > > I am not sure what happens to a write that fails to get destaged from > cache. It probably depends on the target firmware, but I imagine that > the target cannot hold onto it forever (or all subsequent flushes would > always fail).As long as the error status is sticky, it doesn''t have to hold on to the data, it''s not gonna be able to write it anyway. The drive has to hold onto the failure information only. Yeah, but fully agreed on that it''s most likely dependent on the specific firmware. There isn''t any requirement on how to handle write back failure in the ATA spec. It wouldn''t be too surprising if there are some drives which happily report the old data after silent write failure followed by flush and power loss at the right timing. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo wrote:
> Ric Wheeler wrote:
>> Waiting for the target to ack an IO is not sufficient, since the target
>> ack does not (with write cache enabled) mean that it is on persistent
>> storage.
>
> FS waiting for completion of all the dependent writes isn't too good
> latency and throughput-wise tho. It would be best if FS can indicate
> dependencies between write commands and barrier so that barrier
> doesn't have to empty the whole queue. Hmm... Can someone tell me how
> much such scheme would help?

I think that this is where SCSI ordered tags come in (or similar schemes). The idea would be to tag all IO. You bump the tag, for example, after you send down the journal data blocks, so that a new tag is used for the commit block sequence.

The ordering would require that all lower-ranked tags be destaged to persistent storage before a subsequent tag is written out.

T13 had a Microsoft proposal in this area:

http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command_Proposal.doc

>> The key is to make your transaction commit ensure that the commit block
>> itself is not written out of sequence without flushing the dependent IO
>> from the transaction.
>>
>> If we disable the write cache, then file systems effectively do exactly
>> the right thing today as you describe :-)
>
> For most SATA drives, disabling write back cache seems to take high
> toll on write throughput. :-(

I have seen a 50% reduction in my testing on S-ATA :-(

>>> IIUC, that should be detectable from FLUSH whether the destaging
>>> occurred as part of flush or not, no?
>>
>> I am not sure what happens to a write that fails to get destaged from
>> cache. It probably depends on the target firmware, but I imagine that
>> the target cannot hold onto it forever (or all subsequent flushes would
>> always fail).
>
> As long as the error status is sticky, it doesn't have to hold on to
> the data, it's not gonna be able to write it anyway. The drive has to
> hold onto the failure information only. Yeah, but fully agreed on
> that it's most likely dependent on the specific firmware. There isn't
> any requirement on how to handle write back failure in the ATA spec.
> It wouldn't be too surprising if there are some drives which happily
> report the old data after silent write failure followed by flush and
> power loss at the right timing.
>
> Thanks.

agreed....

ric
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
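To make the cost Tejun is worried about concrete: with no way to express "B depends only on A", a journalling commit has to order things by draining. The sketch below is only the userspace analogue (plain write()/fsync() on a scratch file standing in for the in-kernel bios and barriers; file name and sizes are made up), but the shape of the problem is the same - two full synchronization points per commit, even though only the commit record really depends on the journal blocks.

/* commit_order.c - the "order by draining" pattern a journal uses today */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char blocks[4 * 4096];  /* stand-in for the transaction's journal blocks */
        char commit[512];       /* stand-in for the commit record */
        int fd = open("journal.img", O_WRONLY | O_CREAT, 0600);

        if (fd < 0) { perror("open"); return 1; }
        memset(blocks, 0xaa, sizeof(blocks));
        memset(commit, 0xcc, sizeof(commit));

        /* 1. queue the journal blocks */
        if (write(fd, blocks, sizeof(blocks)) != (ssize_t)sizeof(blocks)) return 1;
        /* 2. drain: wait for *everything* issued so far, not just our blocks */
        if (fsync(fd)) return 1;
        /* 3. only now is it safe to issue the commit record */
        if (write(fd, commit, sizeof(commit)) != (ssize_t)sizeof(commit)) return 1;
        /* 4. drain again so the commit itself is known to be durable */
        if (fsync(fd)) return 1;

        close(fd);
        return 0;
}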
Michel Salim wrote:> > Though it would be nice to have a tool that would provide enough > information to make a warranty claim -- does btrfs keep enough > information for such a tool to be written?Failed device I/O (rather than bad checksums and other fs-specific error detections) should be logged at a lower layer in the standard system logs. Warranties are really about who you buy your drives from, if you go cheap don''t expect any replacements. If you buy quality stuff, the failures usually occur right after the warranty expires :) In the case of bad manufacturing batches, the good vendors figure that out real fast and don''t hassle you about replacing them as they fail. And even from a good vendor, don''t expect you can run a drive with a 1-year 20% duty-cycle warranty like it was a 100% duty-cycle drive and get the vendor to replace them if they fail in < 1 year. People often complain the vendor does not stand behind the warranty when they are really badly violating the usage terms. jim -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ric Wheeler wrote:
> For any given set of disks, you "just" need to do the math to compute
> the utilized capacity, the expected rate of drive failure, the rebuild
> time and then see whether you can recover from your first failure
> before a 2nd disk dies.

Spare disks have the advantage of a fully linear access pattern (ignoring the normal working load). Spare capacity has the advantage of utilizing all devices (if you have a hundred-disk fs, all surviving disks participate in the rebuild, whereas with spare disks you are limited to the surviving raidset members). Spare capacity also has the advantage that you don't need to rebuild free space.

> In practice, this is not an academic question since drives do
> occasionally fail in batches (and drives from the same batch get
> stuffed into the same system).

This seems to be orthogonal to the sparing method used; and in both cases the answer is to tolerate dual failures. File-based redundancy has the advantage here of allowing triple mirroring for metadata and frequently written files, and double-parity RAID for large files.

> I suspect that what will be used in mission critical deployments will
> be more conservative than what is used in less critical path systems

That's true, unfortunately. But with time people will trust the newer, more efficient methods.

--
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Tejun Heo wrote:> For most SATA drives, disabling write back cache seems to take high > toll on write throughput. :-( > >I measured this yesterday. This is true for pure write workloads; for mixed read/write workloads the throughput decrease is negligible.> As long as the error status is sticky, it doesn''t have to hold on to > the data, it''s not gonna be able to write it anyway. The drive has to > hold onto the failure information only. Yeah, but fully agreed on > that it''s most likely dependent on the specific firmware. There isn''t > any requirement on how to handle write back failure in the ATA spec. > It wouldn''t be too surprising if there are some drives which happily > report the old data after silent write failure followed by flush and > power loss at the right timing.I got flamed for this on another list, but let''s disable the write cache and live with the performance drop. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Avi Kivity wrote:> Tejun Heo wrote: >> For most SATA drives, disabling write back cache seems to take high >> toll on write throughput. :-( > > I measured this yesterday. This is true for pure write workloads; for > mixed read/write workloads the throughput decrease is negligible.Different tests on different hardware give different results at different times!>> As long as the error status is sticky, it doesn''t have to hold on to >> the data, it''s not gonna be able to write it anyway. The drive has to >> hold onto the failure information only. Yeah, but fully agreed on >> that it''s most likely dependent on the specific firmware. There isn''t >> any requirement on how to handle write back failure in the ATA spec. >> It wouldn''t be too surprising if there are some drives which happily >> report the old data after silent write failure followed by flush and >> power loss at the right timing. > > I got flamed for this on another list, but let''s disable the write cache > and live with the performance drop.We don''t get to decide this, customers do. As they say in the raid forum... fast, cheap, good - pick any 2 We just need to ensure we don''t turn good into bad with fs mistakes. jim P.S. no flames because we chose no-battery == disable-write-cache -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
jim owens wrote:>>> For most SATA drives, disabling write back cache seems to take high >>> toll on write throughput. :-( >> >> I measured this yesterday. This is true for pure write workloads; >> for mixed read/write workloads the throughput decrease is negligible. > > Different tests on different hardware > give different results at different times! >True. But data loss is forever!>> >> I got flamed for this on another list, but let''s disable the write >> cache and live with the performance drop. > > We don''t get to decide this, customers do.We get to pick the defaults.> As they say in the raid forum... fast, cheap, good - pick any 2We can upgrade slow to fast, but !good gets upgraded to another fs.> P.S. no flames because we chose no-battery == disable-write-cacheACK! -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Avi Kivity wrote:> Tejun Heo wrote: >> For most SATA drives, disabling write back cache seems to take high >> toll on write throughput. :-( >> >> > > I measured this yesterday. This is true for pure write workloads; for > mixed read/write workloads the throughput decrease is negligible. >Depends on your workload, I have measured (back at Centera) a significant win for mixed read/write as well (at least 20%) depending on file size.>> As long as the error status is sticky, it doesn''t have to hold on to >> the data, it''s not gonna be able to write it anyway. The drive has to >> hold onto the failure information only. Yeah, but fully agreed on >> that it''s most likely dependent on the specific firmware. There isn''t >> any requirement on how to handle write back failure in the ATA spec. >> It wouldn''t be too surprising if there are some drives which happily >> report the old data after silent write failure followed by flush and >> power loss at the right timing. > > I got flamed for this on another list, but let''s disable the write > cache and live with the performance drop. >Won''t ever happen, no one wants to lose 50% of their performance :-) ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2008-10-23 at 01:14 +0900, Tejun Heo wrote:> Ric Wheeler wrote: > > Waiting for the target to ack an IO is not sufficient, since the target > > ack does not (with write cache enabled) mean that it is on persistent > > storage. > > FS waiting for completion of all the dependent writes isn''t too good > latency and throughput-wise tho. It would be best if FS can indicate > dependencies between write commands and barrier so that barrier > doesn''t have to empty the whole queue. Hmm... Can someone tell me how > much such scheme would help?The extent of my coding for ZFS on FUSE was in this area. Solaris has a generic ioctl to flush the write cache on a block device but Linux does not. I wrote a few routines to detect the type of block device and flush the cache by talking to the hardware via an ioctl. Tests with bonnie++ on my laptop showed that throughput and metadata operations per second were not noticeably affected by completely flushing the write cache when necessary versus never flushing the write cache or using any kind of IO barrier. Caveats: *Not every HDD is a laptop HDD. *ZFS on FUSE got average to poor results for metadata operations per second since it hadn''t been optimized for that yet. Maybe fancier schemes aren''t necessary? Cheers, Eric
Eric Anopolsky wrote:> On Thu, 2008-10-23 at 01:14 +0900, Tejun Heo wrote: > >> Ric Wheeler wrote: >> >>> Waiting for the target to ack an IO is not sufficient, since the target >>> ack does not (with write cache enabled) mean that it is on persistent >>> storage. >>> >> FS waiting for completion of all the dependent writes isn''t too good >> latency and throughput-wise tho. It would be best if FS can indicate >> dependencies between write commands and barrier so that barrier >> doesn''t have to empty the whole queue. Hmm... Can someone tell me how >> much such scheme would help? >> > > The extent of my coding for ZFS on FUSE was in this area. Solaris has a > generic ioctl to flush the write cache on a block device but Linux does > not. I wrote a few routines to detect the type of block device and flush > the cache by talking to the hardware via an ioctl. > > Tests with bonnie++ on my laptop showed that throughput and metadata > operations per second were not noticeably affected by completely > flushing the write cache when necessary versus never flushing the write > cache or using any kind of IO barrier. > > Caveats: > *Not every HDD is a laptop HDD. > *ZFS on FUSE got average to poor results for metadata operations per > second since it hadn''t been optimized for that yet. > > Maybe fancier schemes aren''t necessary? > > Cheers, > Eric > >What I have seen so far with meta-data heavy workloads & the write barrier (working correctly!) is a pretty close match to the specs of the drive, at least for single threaded writing. For example, if you have an average seek time of 20ms, you should see no more than 50 files/sec (if only one barrier is issued per file write). In practice, we see closer to 30 files/sec. If nothing else, you can always detect a broken (or disabled) write barrier by exceeding that spec for single writers :-) ric -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ric Wheeler wrote:>> FS waiting for completion of all the dependent writes isn''t too good >> latency and throughput-wise tho. It would be best if FS can indicate >> dependencies between write commands and barrier so that barrier >> doesn''t have to empty the whole queue. Hmm... Can someone tell me how >> much such scheme would help? >> >> > I think that this is where SCSI ordered tags come in (or similar > schemes). The idea would be to have tag all IO. You bump the tag, for > example after you send down the journal data blocks to a new tag which > is used for the commit block data sequence. > > The ordering would require that lower ranked tags must all be destaged > to persistent storage before a subsequent tag is written out. > > The T13 had a microsoft proposal that is in this area: > > http://www.t13.org/Documents/UploadedDocuments/docs2007/e07174r0-Write_Barrier_Command_Proposal.docYeah, that''s one thing although it still has the undetected-write-errors in front of barrier problem (SCSI spec doesn''t have a way to detect that). There''s another queue, which can be considerably larger than the on-device buffer - the block elevator queue. Currently, as the elevator doesn''t know what''s dependent on what, it has to dump the whole content of elevator before doing barrier. I don''t know how much it would help to do it selectively tho. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 22 Oct 2008 11:56:58 -0400 "Michel Salim" <michel.sylvan@gmail.com> wrote:> > [...] > > Lets agree that the market for drives, arrays and related stuff is big and > > contains just about any example one needs for arguing :-) > > Nevertheless we probably agree that if john doe meets big-player and tries to > > warranty-replace a non-dead drive he will have troubles. > > > If John Doe is using redundant storage in the first place, he just > needs an emergency disk that can be swapped-in for a failing disk, and > then stress-test the failing disk to death, get it replaced by > manufacturer, and the replacement becomes the next standby/emergency > disk.Even more expensive than drives is working time. So you just swapped the problem the wrong way round. I would not have expected that it is hard to argue why it makes sense to replace dead disks when they are dead, because you then know that they are dead and everybody else looking at the brick knows it too - without spending time and money for testing and arguing about warranty issues. Does anybody remember the word "keep it simple" ? PS: of course we agree in your description of a minimal replacement strategy. -- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 22 Oct 2008 11:19:06 pm Stephan von Krawczynski wrote:> You have a NFS server with clients. Your NFS server dies, your backup > server cannot take over the clients without them resetting their NFS-link > (which means reboot to many applications) - no way.We''re getting way off btrfs here, but did you set the fsid''s for all your exports on the primary and backup NFS servers and make sure they were also set to the same values ? e.g. /home 10.0.0.0/255.0.0.0(async,no_subtree_check,rw,fsid=0111) cheers, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
On Thu, 23 Oct 2008 12:50:33 am Chris Mason wrote:> As someone else replied, NFS is statelessNFS up to and including v3 is, but NFSv4 is stateful. -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP
On Wed, 22 Oct 2008 16:35:55 +0200 "dbz" <hwallenstone@gmx.de> wrote:

> concerning this discussion, I'd like to put up some "requests" which
> strongly oppose those brought up initially:
>
> - if you run into an error in the fs structure or any IO error that prevents
> you from bringing the fs into a consistent state, please simply oops. If a
> user feels that availability is a main issue, he has to use a failover
> solution. In this case a fast and clean cut is desirable and no
> "pray-and-hope mode" or "90% mode". If availability is not the issue, it is
> in any case most important that the data on the fs is safe. If you don't oops,
> you risk causing further damage to the filesystem and ending up with a
> completely destroyed fs.

Hi Gerald,

this is a good proposal to explain why most failover setups do indeed not work. If you look at the numerous internet howtos about building failover, you will recognise that 95% of them talk about servers that synchronise their fs _offline_ by all kinds of tools, like drbd - or choose some network-dependent RAID, like nbd or enbd. All of these have in common that they are unreliable precisely because of the mount needed during failover. In your example: if box 1 oopses because of some error, chances are that box 2, trying to mount the very same data (which it should be, because of RAID or sync), will fail to mount, too. That leaves you with exactly nothing in hand.

> - if you get any IO error, please **don't** put up a number of retries or
> anything. If the device reports an error, simply believe it. It is bad enough
> that many block drivers or controllers try to be smart and put up hundreds
> of retries. By adding further retries you only end up wasting hours on
> useless retries. If availability is an issue, the user again has to put up a
> failover solution. Again, a clean cut is what is needed. The user has to
> make sure he uses an appropriate configuration according to the importance of
> his data (mirroring on the fs and/or RAID, failover ...)

Well, this leaves you with my proposal to optionally stop retrying, marking files or (better) blocks as dead.

> - if during mount something unexpected comes up and you can't be sure that
> the fs will work properly, please deny mounting and request a fsck. This can
> be easily handled by a start- or mount-script. During mount, take the time
> you need to ensure that the fs looks proper and safe to use. I'd rather know
> during boot that something is wrong than run with a foul fs and end up
> with data loss or any other mixup later on.

As explained above, it is exactly the lack of parallel mounts that leaves you without a lot of time during mount. A failover that takes 10 minutes just to re-mount is no failover, it is sh.t. ext?, by the way, hardly ever mounts TBs in under 10 minutes.

> - btrfs is no cluster fs, so there is no point in even thinking about it. If
> somebody feels he needs multiple writeable mounts of the same fs, please use
> a cluster fs. Of course, you have to live with the tradeoffs. Dreaming of a
> fs that uses something like witchcraft to do things like locking, quorums,
> cache synchronisation without penalty and, of course, without any
> configuration, is pointless.

This reads pretty much like "a processor is a processor and not multiple processors". We all know today that that time has passed. In 5 years you will pretty much say the same for "single fs" vs. "cluster fs".

> In my opinion, the whole thing comes from the idea of using cheap hardware
> and out-of-the-box configurations to keep promises of reliability and
> availability which are not realistic. There is a reason why there are more
> expensive HDDs, RAIDs, SANs with volume mirroring, multipathing and so on.
> Simply ignoring the fact that you have to use the proper tools to address
> specific problems and praying to the tooth fairy to put a
> solve-all-my-problems fs under your pillow is no solution. I'd rather have a
> solid fs with deterministic behavior and some state-of-the-art features.

Well, sorry to say, but I begin to sound a bit like Joseph Stiglitz trying to explain why neoliberalism does not work out. Please accept that this world is full of failures of all kinds. If you deny that, all your models and ideas will only be failures, too. All I am saying is that we should accept that dead sectors, braindead firmware programmers, production in jungle environments, transportation in rough areas, high temperatures, high humidity, hard disks that have no disks and so on are facts of life. And only a child's answer can be: "oops" (sorry, could not resist this one ;-)

> Just my 2c.
> (Gerald)

--
Regards,
Stephan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
>> In my opinion, the whole thing comes from the idea of using cheap hardware
>> and out-of-the-box configurations to keep promises of reliability and
>> availability which are not realistic. There is a reason why there are more
>> expensive HDDs, RAIDs, SANs with volume mirroring, multipathing and so on.
>> Simply ignoring the fact that you have to use the proper tools to address
>> specific problems and praying to the tooth fairy to put a
>> solve-all-my-problems fs under your pillow is no solution. I'd rather have a
>> solid fs with deterministic behavior and some state-of-the-art features.

SvK> Well, sorry to say, but I begin to sound a bit like Joseph Stiglitz
SvK> trying to explain why neoliberalism does not work out.
SvK> Please accept that this world is full of failures of all kinds. If you deny
SvK> that, all your models and ideas will only be failures, too.
SvK> All I am saying is that we should accept that dead sectors, braindead
SvK> firmware programmers, production in jungle environments, transportation in
SvK> rough areas, high temperatures, high humidity, hard disks that have no disks
SvK> and so on are facts of life. And only a child's answer can be: "oops"
SvK> (sorry, could not resist this one ;-)

+1000. Systems should survive and function, or they are dead - like in nature.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html