I'm looking to try BTRFS on a SSD, and I would like to know what SSD optimizations it applies. Is there a comprehensive list of what the ssd mount option does? How are the blocks and metadata arranged? Are there options available comparable to ext2/ext3 to help reduce wear and improve performance?

Specifically, on ext2 (journal means more writes, so I don't use ext3 on SSDs, since fsck typically only takes a few seconds when access time is < 100us), I usually apply the

    -b 4096 -E stripe-width=(erase_block/4096)

parameters to mkfs in order to reduce multiple erase cycles on the same underlying block.

Are there similar optimizations available in BTRFS?

Gordan
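For concreteness, assuming a hypothetical 128 KiB erase block (the real figure varies by drive and is rarely published), the formula above works out to 131072 / 4096 = 32:

    # stripe-width is given in filesystem blocks; 128 KiB / 4 KiB = 32
    mkfs.ext2 -b 4096 -E stripe-width=32 /dev/sdX

(/dev/sdX is a placeholder for the actual device, and the erase block size is an assumption you would need to confirm for your drive.)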
Hi there,

On Wed, Mar 10, 2010 at 8:49 PM, Gordan Bobic <gordan@bobich.net> wrote:
> [...]
> Are there similar optimizations available in BTRFS?

There is an SSD mount option available[1].

Cheers,
Marcus

[1] http://btrfs.wiki.kernel.org/index.php/Getting_started#Mount_Options
Erm... You know, sorry for the noise.

Cheers,
Marcus
On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
> I'm looking to try BTRFS on a SSD, and I would like to know what SSD
> optimizations it applies. Is there a comprehensive list of what the ssd mount
> option does? How are the blocks and metadata arranged? Are there options
> available comparable to ext2/ext3 to help reduce wear and improve
> performance?
>
> Specifically, on ext2 (journal means more writes, so I don't use ext3 on
> SSDs, since fsck typically only takes a few seconds when access time is <
> 100us), I usually apply the
>     -b 4096 -E stripe-width=(erase_block/4096)
> parameters to mkfs in order to reduce the multiple erase cycles on the same
> underlying block.
>
> Are there similar optimizations available in BTRFS?

I think you'll get more out of btrfs, but another thing you can look into is ext4 without the journal. Support was added for that recently (thanks to Google).
Marcus Fritzsch wrote:
> Hi there,
>
> On Wed, Mar 10, 2010 at 8:49 PM, Gordan Bobic <gordan@bobich.net> wrote:
>> [...]
>> Are there similar optimizations available in BTRFS?
>
> There is an SSD mount option available[1].
>
> [1] http://btrfs.wiki.kernel.org/index.php/Getting_started#Mount_Options

But what _exactly_ does it do? Is there a way to leverage any knowledge of erase block size at file system creation time? Are there any special parameters that might affect locations of superblocks and metadata? Is there a way to ensure they don't span erase block boundaries?

What about ATA TRIM command support? Is this available? Is it included in the version in Fedora 13?

Gordan
Mike Fedyk wrote:
> On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
>> I'm looking to try BTRFS on a SSD, and I would like to know what SSD
>> optimizations it applies. Is there a comprehensive list of what the ssd mount
>> option does? How are the blocks and metadata arranged? Are there options
>> available comparable to ext2/ext3 to help reduce wear and improve
>> performance?
>>
>> Specifically, on ext2 (journal means more writes, so I don't use ext3 on
>> SSDs, since fsck typically only takes a few seconds when access time is <
>> 100us), I usually apply the
>>     -b 4096 -E stripe-width=(erase_block/4096)
>> parameters to mkfs in order to reduce the multiple erase cycles on the same
>> underlying block.
>>
>> Are there similar optimizations available in BTRFS?
>
> I think you'll get more out of btrfs, but another thing you can look
> into is ext4 without the journal. Support was added for that recently
> (thanks to Google).

How is this different to using mkfs.ext2 from e4fsprogs?

And while I appreciate hopeful remarks along the lines of "I think you'll get more out of btrfs", I am really after specifics of what the ssd mount option does, and what features comparable to the optimizations that can be done with ext2/3/4 (e.g. the mentioned stripe-width option) are available to get the best possible alignment of data and metadata, to increase both performance and life expectancy of a SSD.

Also, for drives that don't support TRIM, is there a way to make the FS apply aggressive re-use of erased space (in order to help the drive's internal wear-leveling)?

I have looked through the documentation and the wiki, but it provides very little of actual substance.

Gordan
Hello Gordan,

Gordan Bobic wrote (ao):
> Mike Fedyk wrote:
>> On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
>>> Are there options available comparable to ext2/ext3 to help reduce
>>> wear and improve performance?

With SSDs you don't have to worry about wear.

> And while I appreciate hopeful remarks along the lines of "I think
> you'll get more out of btrfs", I am really after specifics of what
> the ssd mount option does, and what features comparable to the
> optimizations that can be done with ext2/3/4 (e.g. the mentioned
> stripe-width option) are available to get the best possible
> alignment of data and metadata to increase both performance and life
> expectancy of a SSD.

Alignment is about the partition, not the fs, and thus taken care of with fdisk and the like. If you don't create a partition, the fs is aligned with the SSD.

> Also, for drives that don't support TRIM, is there a way to make the
> FS apply aggressive re-use of erased space (in order to help the
> drive's internal wear-leveling)?

TRIM has nothing to do with wear-leveling (although it helps reduce wear). TRIM lets the OS tell the disk which blocks are not in use anymore, and thus don't have to be copied during a rewrite of the blocks. Wear-leveling is the SSD making sure all blocks are more or less equally written, to avoid continuous load on the same blocks.

	Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net
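As a sketch of the partition side of this: a first partition starting on a 1 MiB boundary is a multiple of any common erase block size. The erase block size itself is an assumption you would have to confirm for your drive, and /dev/sdX is a placeholder:

    # start the first (and only) partition at 1 MiB
    parted /dev/sdX mklabel msdos
    parted /dev/sdX mkpart primary 1MiB 100%

Older fdisk defaults started the first partition at sector 63, which is not aligned to any of these sizes.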
On Wed, Mar 10, 2010 at 11:13 PM, Gordan Bobic <gordan@bobich.net> wrote:
> Marcus Fritzsch wrote:
>> Hi there,
>>
>> On Wed, Mar 10, 2010 at 8:49 PM, Gordan Bobic <gordan@bobich.net> wrote:
>>> [...]
>>> Are there similar optimizations available in BTRFS?
>>
>> There is an SSD mount option available[1].
>>
>> [1] http://btrfs.wiki.kernel.org/index.php/Getting_started#Mount_Options
>
> But what _exactly_ does it do?

Chris explains the change to favour spatial locality in allocator behaviour with '-o ssd'. '-o ssd_spread' does the opposite, for devices where RMW cycles are a higher penalty. Elsewhere, IIRC, Chris also said BTRFS attempts to submit 128KB BIOs where possible (or is that wishful thinking?):

http://markmail.org/message/4sq4uco2lghgxzzz
-- 
Daniel J Blueman
On Thursday 11 March 2010 08:38:53 Sander wrote:
> Hello Gordan,
>
> Gordan Bobic wrote (ao):
>> Mike Fedyk wrote:
>>> On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
>>>> Are there options available comparable to ext2/ext3 to help reduce
>>>> wear and improve performance?
>
> With SSDs you don't have to worry about wear.

Sorry, but you do have to worry about wear. I was able to destroy a relatively new SD card (2007 or early 2008) just by writing to the first 10MiB over and over again for two or three days. The end of the card still works without problems, but about 10 sectors at the beginning give write errors.

And with journaled file systems that write over and over again to the same spot, you do have to worry about wear leveling. It depends on the underlying block allocation algorithm, but I'm sure that most of the cheap SSDs do wear leveling only inside big blocks, not across the whole drive, making it much easier to hit the 10,000-100,000 erase cycle boundary.

Still, I think that if you can prolong the life of hardware without noticeable performance degradation, you should do it. Just because it may help a drive with some defects last those 3-5 years between upgrades without any problems.

>> And while I appreciate hopeful remarks along the lines of "I think
>> you'll get more out of btrfs", I am really after specifics of what
>> the ssd mount option does, and what features comparable to the
>> optimizations that can be done with ext2/3/4 (e.g. the mentioned
>> stripe-width option) are available to get the best possible
>> alignment of data and metadata to increase both performance and life
>> expectancy of a SSD.
>
> Alignment is about the partition, not the fs, and thus taken care of
> with fdisk and the like.
>
> If you don't create a partition, the fs is aligned with the SSD.

But it does not align internal FS structures to the SSD erase block size, and that's what Gordan asked for. And sorry Gordan, I don't know. But there's a 'ssd_spread' option that tries to allocate blocks as far as possible (within reason) from each other. That should, in most cases, make the fs structures reside on an erase block by themselves. I'm afraid that you'll need to dive into the code to learn about block alignment, or one of the developers will need to provide us with the info.

>> Also, for drives that don't support TRIM, is there a way to make the
>> FS apply aggressive re-use of erased space (in order to help the
>> drive's internal wear-leveling)?
>
> TRIM has nothing to do with wear-leveling (although it helps reduce
> wear).
> TRIM lets the OS tell the disk which blocks are not in use anymore, and
> thus don't have to be copied during a rewrite of the blocks.
> Wear-leveling is the SSD making sure all blocks are more or less equally
> written to avoid continuous load on the same blocks.

Isn't this all about wear leveling? TRIM has no meaning for magnetic media. It's used to tell the drive which parts of the medium contain only junk data and can be used in block rotation, making wear-leveling easier and more effective.

-- 
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50
On Thu, 11 Mar 2010 11:59:57 +0100 Hubert Kario <hka@qbs.com.pl> wrote:
> On Thursday 11 March 2010 08:38:53 Sander wrote:
>>>>> Are there options available comparable to ext2/ext3 to help reduce
>>>>> wear and improve performance?
>>
>> With SSDs you don't have to worry about wear.
>
> Sorry, but you do have to worry about wear. I was able to destroy a relatively
> new SD card (2007 or early 2008) just by writing to the first 10MiB over and
> over again for two or three days. The end of the card still works without
> problems but about 10 sectors at the beginning give write errors.

Sorry, the topic was SSD, not SD. SSDs have controllers that contain heavy closed magic to circumvent all kinds of troubles you get when using classical flash and SD cards.

Honestly, I would just drop the idea of an SSD option, simply because the vendors implement all kinds of neat strategies in their devices. So in the end you cannot really tell if the option does something constructive and not destructive in combination with an SSD controller.

Of course you may well discuss an option for passive flash devices like ide-CF/SD or the like. There is no controller involved, so your fs implementation may well work out.

-- 
Regards,
Stephan
On Thu, 11 Mar 2010 08:38:53 +0100, Sander <sander@humilis.net> wrote:
>>>> Are there options available comparable to ext2/ext3 to help reduce
>>>> wear and improve performance?
>
> With SSDs you don't have to worry about wear.

And if you believe that, you clearly swallowed the marketing spiel hook, line and sinker, without enough real-world experience to show you otherwise. But I'm not going to go off on a tangent now enumerating various victories of marketing over mathematics and empirical evidence relating to currently popular technologies.

In short - I have several dead SSDs of various denominations that demonstrate otherwise - all within their warranty period, and not having been used in pathologically write-heavy environments. You do have to worry about wear. Operations that increase wear also reduce speed (erasing a block is slow, and if the disk is fully tainted you cannot write without erasing), so you doubly have to worry about it.

Also remember that hardware sectors are 512 bytes, and FS blocks tend to be 4096 bytes. It is thus entirely plausible that if you aren't careful you'll end up with FS blocks straddling erase block boundaries. If that happens, you'll make wear twice as bad, because you are facing a situation where you may need to erase and write two blocks rather than one. Half the performance, twice the wear.

>> And while I appreciate hopeful remarks along the lines of "I think
>> you'll get more out of btrfs", I am really after specifics of what
>> the ssd mount option does, and what features comparable to the
>> optimizations that can be done with ext2/3/4 (e.g. the mentioned
>> stripe-width option) are available to get the best possible
>> alignment of data and metadata to increase both performance and life
>> expectancy of a SSD.
>
> Alignment is about the partition, not the fs, and thus taken care of
> with fdisk and the like.
>
> If you don't create a partition, the fs is aligned with the SSD.

I'm talking about internal FS data structures, not the partition alignment.

>> Also, for drives that don't support TRIM, is there a way to make the
>> FS apply aggressive re-use of erased space (in order to help the
>> drive's internal wear-leveling)?
>
> TRIM has nothing to do with wear-leveling (although it helps reduce
> wear).

That's self-contradictory. If it helps reduce wear, it has something to do with wear leveling.

> TRIM lets the OS tell the disk which blocks are not in use anymore, and
> thus don't have to be copied during a rewrite of the blocks.
> Wear-leveling is the SSD making sure all blocks are more or less equally
> written to avoid continuous load on the same blocks.

And thus it is impossible to do wear leveling, once all blocks have been written to, without TRIM. So I'd say that in the long term, without TRIM there is no wear leveling. That makes them pretty related.

So considering that there are various nuggets of opinion floating around (correct or otherwise) saying that ext4 has support for TRIM, I'd like to know whether there is similar support in BTRFS at the moment?

Gordan
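To make the straddling arithmetic concrete, assuming a hypothetical 128 KiB (131072-byte) erase block:

    63 * 512   = 32256       32256 / 4096      = 7.875   (legacy partition start at LBA 63, misaligned)
    2048 * 512 = 1048576     1048576 / 131072  = 8       (1 MiB start, aligned)

With the misaligned start, the 4096-byte FS block that sits across each erase block boundary spans two erase blocks; with the aligned start, no FS block ever does.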
On Thu, 11 Mar 2010 10:35:45 +0000, Daniel J Blueman <daniel.blueman@gmail.com> wrote:
>>>> [...]
>>>> Are there similar optimizations available in BTRFS?
>>>
>>> There is an SSD mount option available[1].
>>>
>>> [1] http://btrfs.wiki.kernel.org/index.php/Getting_started#Mount_Options
>>
>> But what _exactly_ does it do?
>
> Chris explains the change to favour spatial locality in allocator
> behaviour with '-o ssd'. '-o ssd_spread' does the opposite, for devices
> where RMW cycles are a higher penalty. Elsewhere IIRC, Chris also said
> BTRFS attempts to submit 128KB BIOs where possible (or is that wishful
> thinking?):
>
> http://markmail.org/message/4sq4uco2lghgxzzz

Thanks, that's useful info. What about FS block and metadata alignment, though? Is there a way to leverage the knowledge of erase block size in order to reduce wear and increase performance?

Gordan
On Thu, 11 Mar 2010 11:59:57 +0100, Hubert Kario <hka@qbs.com.pl> wrote:
> Still, I think that if you can prolong the life of hardware without
> noticeable performance degradation, you should do it. Just because it may
> help a drive with some defects last those 3-5 years between upgrades
> without any problems.

I couldn't agree more. Not only that, but working with wear leveling in mind (especially on devices without TRIM support) can increase performance, too, by avoiding having to wait for the slow erase operation on writes.

>>> Also, for drives that don't support TRIM, is there a way to make the
>>> FS apply aggressive re-use of erased space (in order to help the
>>> drive's internal wear-leveling)?
>>
>> TRIM has nothing to do with wear-leveling (although it helps reduce
>> wear).
>> TRIM lets the OS tell the disk which blocks are not in use anymore, and
>> thus don't have to be copied during a rewrite of the blocks.
>> Wear-leveling is the SSD making sure all blocks are more or less equally
>> written to avoid continuous load on the same blocks.
>
> Isn't this all about wear leveling? TRIM has no meaning for magnetic
> media.

I fully agree that it's important for wear leveling on flash media, but from the security point of view, I think TRIM would be a useful feature on all storage media. If the erased blocks were trimmed, it would provide a potentially useful feature of securely erasing the sectors that are no longer used. It would be useful and much more transparent than the secure erase features that only operate on the entire disk. Just MHO.

Gordan
On Thu, 11 Mar 2010 12:31:03 +0100, Stephan von Krawczynski <skraw@ithnet.com> wrote:
>>>>>> On Wed, Mar 10, 2010 at 11:49 AM, Gordan Bobic <gordan@bobich.net> wrote:
>>>>>>> Are there options available comparable to ext2/ext3 to help reduce
>>>>>>> wear and improve performance?
>>>>
>>>> With SSDs you don't have to worry about wear.
>>>
>>> Sorry, but you do have to worry about wear. I was able to destroy a
>>> relatively new SD card (2007 or early 2008) just by writing to the first
>>> 10MiB over and over again for two or three days. The end of the card
>>> still works without problems but about 10 sectors at the beginning give
>>> write errors.
>
> Sorry, the topic was SSD, not SD.

SD == SSD with an SD interface.

> SSDs have controllers that contain heavy
> closed magic to circumvent all kinds of troubles you get when using
> classical flash and SD cards.

There is absolutely no basis for thinking that SD cards don't contain wear leveling logic. The SD standard, and thus SD cards, support a lot of fancy copy protection capabilities, which means there is a lot of firmware involvement on SD cards. It is unlikely that any reputable SD card manufacturer wouldn't also build wear leveling logic into it.

> Honestly I would just drop the idea of an SSD option simply because the
> vendors implement all kinds of neat strategies in their devices. So in the
> end you cannot really tell if the option does something constructive and
> not destructive in combination with an SSD controller.

You can make an educated guess. For starters, given that visible sector sizes are not equal to FS block sizes, it means that FS blocks can straddle erase block boundaries without the flash controller, no matter how fancy, being able to determine this. Thus, at the very least, aligning FS structures so that they do not straddle erase block boundaries is useful in ALL cases. Thinking otherwise is just sticking your head in the sand because you cannot be bothered to think.

> Of course you may well discuss an option for passive flash devices
> like ide-CF/SD or the like. There is no controller involved so your fs
> implementation may well work out.

I suggest you educate yourself on the nature of IDE and CF (which is just IDE with a different connector). There most certainly are controllers involved. The days when disks (mechanical or solid state) didn't integrate controllers ended with MFM/RLL and ESDI disks some 20+ years ago.

Gordan
On Thu, 11 Mar 2010 12:17:30 +0000 Gordan Bobic <gordan@bobich.net> wrote:
> On Thu, 11 Mar 2010 12:31:03 +0100, Stephan von Krawczynski
> <skraw@ithnet.com> wrote:
>>>> With SSDs you don't have to worry about wear.
>>>
>>> Sorry, but you do have to worry about wear. I was able to destroy a
>>> relatively new SD card (2007 or early 2008) just by writing to the first
>>> 10MiB over and over again for two or three days. The end of the card
>>> still works without problems but about 10 sectors at the beginning give
>>> write errors.
>>
>> Sorry, the topic was SSD, not SD.
>
> SD == SSD with an SD interface.

That really is quite a statement. You really talk of a few-bucks SD card (like the one in my Android phone) as an SSD comparable with an Intel XE, only with a different interface? Come on, stay serious. The product is not only made of SLCs and some raw logic.

>> SSDs have controllers that contain heavy
>> closed magic to circumvent all kinds of troubles you get when using
>> classical flash and SD cards.
>
> There is absolutely no basis for thinking that SD cards don't contain wear
> leveling logic. The SD standard, and thus SD cards, support a lot of fancy
> copy protection capabilities, which means there is a lot of firmware
> involvement on SD cards. It is unlikely that any reputable SD card
> manufacturer wouldn't also build wear leveling logic into it.

I really don't guess about what is built into an SD or even CF card. But we hopefully agree that there is a significant difference compared to a product that calls itself a _disk_.

>> Honestly I would just drop the idea of an SSD option simply because the
>> vendors implement all kinds of neat strategies in their devices. So in the
>> end you cannot really tell if the option does something constructive and
>> not destructive in combination with an SSD controller.
>
> You can make an educated guess. For starters, given that visible sector
> sizes are not equal to FS block sizes, it means that FS blocks can
> straddle erase block boundaries without the flash controller, no matter how
> fancy, being able to determine this. Thus, at the very least, aligning FS
> structures so that they do not straddle erase block boundaries is useful in
> ALL cases. Thinking otherwise is just sticking your head in the sand
> because you cannot be bothered to think.

And your guess is that Intel engineers had no clue when designing the XE, including its controller? You think they did not know what you and me know, and therefore pray every day that some smart fs designer falls from heaven and saves their product from dying in between? Really?

>> Of course you may well discuss an option for passive flash devices
>> like ide-CF/SD or the like. There is no controller involved so your fs
>> implementation may well work out.
>
> I suggest you educate yourself on the nature of IDE and CF (which is just
> IDE with a different connector). There most certainly are controllers
> involved. The days when disks (mechanical or solid state) didn't integrate
> controllers ended with MFM/RLL and ESDI disks some 20+ years ago.

I suggest you don't talk to someone administering some hundred boxes based on CF and SSD media for _years_ about the pros and cons of the respective implementation and its long-term usage. Sorry, the world is not built out of paper; sometimes you meet the hard facts. And one of them is that the ssd option in the fs is very likely already overrun by the SSD controller designers and mostly _superfluous_. The market has already decided to make SSDs compatible with standard fs layouts.

-- 
Regards,
Stephan
On Thu, 11 Mar 2010 13:59:09 +0100, Stephan von Krawczynski <skraw@ithnet.com> wrote:
>>>>> With SSDs you don't have to worry about wear.
>>>>
>>>> Sorry, but you do have to worry about wear. I was able to destroy a
>>>> relatively new SD card (2007 or early 2008) just by writing to the first
>>>> 10MiB over and over again for two or three days. The end of the card
>>>> still works without problems but about 10 sectors at the beginning give
>>>> write errors.
>>>
>>> Sorry, the topic was SSD, not SD.
>>
>> SD == SSD with an SD interface.
>
> That really is quite a statement. You really talk of a few-bucks SD card
> (like the one in my Android phone) as an SSD comparable with an Intel XE,
> only with a different interface? Come on, stay serious. The product is not
> only made of SLCs and some raw logic.

I am saying that there is no reason for the firmware in an SD card not to be as advanced. If the manufacturer has some advanced logic in their SATA SSD, I cannot see any valid engineering reason not to apply the same logic in an SD product.

>>> SSDs have controllers that contain heavy
>>> closed magic to circumvent all kinds of troubles you get when using
>>> classical flash and SD cards.
>>
>> There is absolutely no basis for thinking that SD cards don't contain wear
>> leveling logic. The SD standard, and thus SD cards, support a lot of fancy
>> copy protection capabilities, which means there is a lot of firmware
>> involvement on SD cards. It is unlikely that any reputable SD card
>> manufacturer wouldn't also build wear leveling logic into it.
>
> I really don't guess about what is built into an SD or even CF card. But we
> hopefully agree that there is a significant difference compared to a
> product that calls itself a _disk_.

We don't agree on that. Not at all. I don't see any reason why a CF card and an IDE SSD made by the same manufacturer should have any difference between them other than capacity and the physical package.

>>> Honestly I would just drop the idea of an SSD option simply because the
>>> vendors implement all kinds of neat strategies in their devices. So in the
>>> end you cannot really tell if the option does something constructive
>>> and not destructive in combination with an SSD controller.
>>
>> You can make an educated guess. For starters, given that visible sector
>> sizes are not equal to FS block sizes, it means that FS blocks can
>> straddle erase block boundaries without the flash controller, no matter
>> how fancy, being able to determine this. Thus, at the very least, aligning
>> FS structures so that they do not straddle erase block boundaries is useful
>> in ALL cases. Thinking otherwise is just sticking your head in the sand
>> because you cannot be bothered to think.
>
> And your guess is that Intel engineers had no clue when designing the XE,
> including its controller? You think they did not know what you and me know,
> and therefore pray every day that some smart fs designer falls from heaven
> and saves their product from dying in between? Really?

I am saying that there are problems that CANNOT be solved on the disk firmware level. Some problems HAVE to be addressed higher up the stack.

>>> Of course you may well discuss an option for passive flash devices
>>> like ide-CF/SD or the like. There is no controller involved so your fs
>>> implementation may well work out.
>>
>> I suggest you educate yourself on the nature of IDE and CF (which is just
>> IDE with a different connector). There most certainly are controllers
>> involved. The days when disks (mechanical or solid state) didn't
>> integrate controllers ended with MFM/RLL and ESDI disks some 20+ years ago.
>
> I suggest you don't talk to someone administering some hundred boxes based
> on CF and SSD media for _years_ about the pros and cons of the respective
> implementation and its long-term usage.
> Sorry, the world is not built out of paper; sometimes you meet the hard
> facts. And one of them is that the ssd option in the fs is very likely
> already overrun by the SSD controller designers and mostly _superfluous_.
> The market has already decided to make SSDs compatible with standard fs
> layouts.

Seems to me that you haven't done any analysis of comparative long-term failure rates between SSDs used with default layouts (Default? Really? You mean you don't apply any special partitioning on your hundreds of servers?) and those with carefully aligned FS-es. Just because the defaults may be good enough for your use case doesn't mean that somebody with a use case that's harder on the flash will observe the same reliability, or deem the unoptimized performance figures good enough.

Gordan
On Thursday 11 March 2010 14:20:23 Gordan Bobic wrote:
> On Thu, 11 Mar 2010 13:59:09 +0100, Stephan von Krawczynski
> <skraw@ithnet.com> wrote:
>>>>>> With SSDs you don't have to worry about wear.
>>>>>
>>>>> Sorry, but you do have to worry about wear. I was able to destroy a
>>>>> relatively new SD card (2007 or early 2008) just by writing to the first
>>>>> 10MiB over and over again for two or three days. The end of the card
>>>>> still works without problems but about 10 sectors at the beginning give
>>>>> write errors.
>>>>
>>>> Sorry, the topic was SSD, not SD.
>>>
>>> SD == SSD with an SD interface.
>>
>> That really is quite a statement. You really talk of a few-bucks SD card
>> (like the one in my Android phone) as an SSD comparable with an Intel XE,
>> only with a different interface? Come on, stay serious. The product is not
>> only made of SLCs and some raw logic.
>
> I am saying that there is no reason for the firmware in an SD card not to
> be as advanced. If the manufacturer has some advanced logic in their SATA
> SSD, I cannot see any valid engineering reason not to apply the same logic
> in an SD product.

The _SD_standard_ states that the media has to implement wear-leveling, so any card with an SD logo implements it.

As I stated previously, the algorithms used in SD cards may not be as advanced as those in top-of-the-line Intel SSDs, but I bet they don't differ by much from the ones used in the cheapest SSD drives.

Besides, why shouldn't we help the drive firmware by
- writing the data only in erase-block sizes
- trying to write blocks that are smaller than the erase-block in a way that won't cross the erase-block boundary
- using TRIM on deallocated parts of the drive

This will not only increase the life of the SSD but also increase its performance.

>>>> Honestly I would just drop the idea of an SSD option simply because the
>>>> vendors implement all kinds of neat strategies in their devices. So in the
>>>> end you cannot really tell if the option does something constructive
>>>> and not destructive in combination with an SSD controller.
>>>
>>> You can make an educated guess. For starters, given that visible sector
>>> sizes are not equal to FS block sizes, it means that FS blocks can
>>> straddle erase block boundaries without the flash controller, no matter
>>> how fancy, being able to determine this. Thus, at the very least, aligning
>>> FS structures so that they do not straddle erase block boundaries is useful
>>> in ALL cases. Thinking otherwise is just sticking your head in the sand
>>> because you cannot be bothered to think.
>>
>> And your guess is that Intel engineers had no clue when designing the XE,
>> including its controller? You think they did not know what you and me know,
>> and therefore pray every day that some smart fs designer falls from heaven
>> and saves their product from dying in between? Really?
>
> I am saying that there are problems that CANNOT be solved on the disk
> firmware level. Some problems HAVE to be addressed higher up the stack.

Exactly. You can't assume that the SSD's firmware understands any and all file system layouts, especially if they are on fragmented LVM or other logical volume manager partitions.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Quality Management System compliant with ISO 9001:2000
On Wed, Mar 10, 2010 at 07:49:34PM +0000, Gordan Bobic wrote:
> I'm looking to try BTRFS on a SSD, and I would like to know what SSD
> optimizations it applies. Is there a comprehensive list of what the ssd
> mount option does? How are the blocks and metadata arranged? Are
> there options available comparable to ext2/ext3 to help reduce wear
> and improve performance?
>
> Specifically, on ext2 (journal means more writes, so I don't use
> ext3 on SSDs, since fsck typically only takes a few seconds when
> access time is < 100us), I usually apply the
>     -b 4096 -E stripe-width=(erase_block/4096)
> parameters to mkfs in order to reduce the multiple erase cycles on
> the same underlying block.
>
> Are there similar optimizations available in BTRFS?

All devices (raid, ssd, single spindle) tend to benefit from big chunks of writes going down close together on disk. This is true for different reasons on each one, but it is still the easiest way to optimize writes. COW filesystems like btrfs are very well suited to sending down lots of big writes because we're always reallocating things.

For traditional storage, we also need to keep blocks from one file (or files in a directory) close together to reduce seeks during reads. SSDs have no such restrictions, and so the mount -o ssd related options in btrfs focus on tossing out tradeoffs that slow down writes in hopes of reading faster later.

Someone already mentioned the mount -o ssd and ssd_spread options. Mount -o ssd is targeted at faster SSDs that are good at wear leveling and generally just benefit from having a bunch of data sent down close together. In mount -o ssd, you might find a write pattern like this:

    block N, N+2, N+3, N+4, N+6, N+7, N+16, N+17, N+18, N+19, N+20 ...

It's a largely contiguous chunk of writes, but there may be gaps. Good ssds don't really care about the gaps, and they benefit more from the fact that we're preferring to reuse blocks that had once been written than to go off and find completely contiguous areas of the disk to write (which are more likely to have never been written at all).

mount -o ssd_spread is much more strict. You'll get N, N+1, N+2, N+3, N+4, N+5 etc, because crummy ssds really do care about the gaps.

Now, btrfs could go off and probe for the erasure size and work very hard to align things to it. As someone said, alignment of the partition table is very important here as well. But for modern ssds this generally matters much less than just doing big ios and letting the little log structured squirrel inside the device figure things out.

For trim, we do have mount -o discard. It does introduce a run time performance hit (this varies wildly from device to device) and we're tuning things as discard capable devices become more common. If anyone is looking for a project, it would be nice to have an ioctl that triggers free space discards in bulk.

-chris
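For reference, the options discussed above would typically be used like this (device and mount point are placeholders):

    # reasonable SSD with competent wear leveling
    mount -o ssd /dev/sdb1 /mnt/btrfs

    # cheap flash that is sensitive to gaps in the write pattern
    mount -o ssd_spread /dev/sdb1 /mnt/btrfs

    # additionally issue TRIM for freed space, at some runtime cost
    mount -o ssd,discard /dev/sdb1 /mnt/btrfs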
Stephan von Krawczynski wrote (ao):
> Honestly I would just drop the idea of an SSD option simply because the
> vendors implement all kinds of neat strategies in their devices. So in the
> end you cannot really tell if the option does something constructive and
> not destructive in combination with an SSD controller.

My understanding of the ssd mount option is also that the fs doesn't try to do all kinds of smart (and potentially expensive) things which make sense for rotating media, to reduce seeks and the like.

	Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net
On Thu, 11 Mar 2010 15:01:55 +0100 Hubert Kario <hka@qbs.com.pl> wrote:
> [...]
> The _SD_standard_ states that the media has to implement wear-leveling,
> so any card with an SD logo implements it.
>
> As I stated previously, the algorithms used in SD cards may not be as
> advanced as those in top-of-the-line Intel SSDs, but I bet they don't
> differ by much from the ones used in the cheapest SSD drives.

Well, we are all pretty sure about that. And that is exactly the reason why these are not surviving the market pressure. Why should one care about bad products that will possibly already be extinct, because of their bad performance, by the time the fs is production ready some day?

> Besides, why shouldn't we help the drive firmware by
> - writing the data only in erase-block sizes
> - trying to write blocks that are smaller than the erase-block in a way
>   that won't cross the erase-block boundary

Because if the designing engineer of a good SSD controller wasn't able to cope with that, he will have no chance to design a second one.

> - using TRIM on deallocated parts of the drive

Another story. That is a designed part of a software interface between fs and drive bios, on which both agreed in its usage pattern. Whereas the above points are pure guesswork based on dumb and old hardware and its behaviour.

> This will not only increase the life of the SSD but also increase its
> performance.

TRIM: maybe yes. Rest: pure handwaving.

> [...]
>>> And your guess is that Intel engineers had no clue when designing the XE,
>>> including its controller? You think they did not know what you and me
>>> know, and therefore pray every day that some smart fs designer falls from
>>> heaven and saves their product from dying in between? Really?
>>
>> I am saying that there are problems that CANNOT be solved on the disk
>> firmware level. Some problems HAVE to be addressed higher up the stack.
>
> Exactly. You can't assume that the SSD's firmware understands any and all
> file system layouts, especially if they are on fragmented LVM or other
> logical volume manager partitions.

Hopefully the firmware understands exactly no fs layout at all. That would be braindead. Instead it should understand how to arrange incoming and outgoing data in a way that its own technical requirements are met as perfectly as possible. This is no spinning disk; it is completely irrelevant what the data layout looks like as long as the controller finds its way through and copes best with read/write/erase cycles. It may well use additional RAM for caching and data reordering.

Do you really believe ascending block numbers are placed in ascending addresses inside the disk (as an example)? Why should they be? What does that mean for fs block ordering? If you don't know anyway what a controller does to your data ordering, how do you want to help it with its job?

Please accept that we are _not_ talking about trivial flash mem here, or pseudo-SSDs consisting of SD cards. The market has already evolved better products. The dinosaurs are extinct even if some are still looking alive.

-- 
Regards,
Stephan
Gordan Bobic wrote:
>> TRIM lets the OS tell the disk which blocks are not in use anymore, and
>> thus don't have to be copied during a rewrite of the blocks.
>> Wear-leveling is the SSD making sure all blocks are more or less equally
>> written to avoid continuous load on the same blocks.
>
> And thus it is impossible to do wear leveling, once all blocks have been
> written to, without TRIM. So I'd say that in the long term, without
> TRIM there is no wear leveling. That makes them pretty related.

I'm no expert on SSDs, however:

1- I think the SSD would rewrite once-written blocks to other locations, so as to reuse the same physical blocks for wear levelling. The written-once blocks are very good candidates because their write count is "1".

2- I think SSDs show you a smaller usable size than what they physically have. In this way they always have some more blocks to move data to, so as to free blocks which have a low write count.

3- If you think #2 is not enough, you can leave part of the SSD disk unused, by leaving unused space after the last partition.

Actually, after considering #1 and #2, I don't think TRIM is really needed for SSDs - are you sure it is really needed? I think it's more a kind of optimization, but it needs to be very fast for it to be useful as an optimization: faster than an internal block rewrite by the SSD wear levelling, and so fast as a SATA/SAS command that the computer is not significantly slowed down by using it. Instead, IIRC, I read something about it being slow, and maybe it was even requiring FUA or a barrier or flush? I don't remember exactly.

There is one place where TRIM would be very useful though, and it's not for SSDs, but in virtualization: if the Virtual Machine frees space, the VM file system should use TRIM to signal to the host that some space is unused. The host should have a way to tell its filesystem that the VM disk file has a new hole in that position, so that disk space can be freed on the host for use by another VM. This would allow much greater overcommit of disk space to virtual machines.

There's probably no need for "TRIM" support itself on the host filesystem, but another mechanism is needed that allows sparsifying an existing file by creating a hole in it (which I think is not possible with the filesystem syscalls we have now, correct me if I'm wrong). There *is* need for TRIM support in the guest filesystem though.
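For what it's worth, the sparsifying mechanism asked for here would look roughly like the sketch below on the host side; whether a given kernel and filesystem actually support punching holes is exactly the open question, and the file name is made up:

    # punch a 1 MiB hole at a 16 MiB offset in a VM disk image,
    # returning the underlying space to the host filesystem
    fallocate --punch-hole --offset $((16*1024*1024)) --length $((1024*1024)) vm-disk.img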
On Thu, 11 Mar 2010 16:35:33 +0100, Stephan von Krawczynski <skraw@ithnet.com> wrote:
>> Besides, why shouldn't we help the drive firmware by
>> - writing the data only in erase-block sizes
>> - trying to write blocks that are smaller than the erase-block in a way
>>   that won't cross the erase-block boundary
>
> Because if the designing engineer of a good SSD controller wasn't able to
> cope with that, he will have no chance to design a second one.

You seem to be confusing quality of implementation with theoretical possibility.

>> This will not only increase the life of the SSD but also increase its
>> performance.
>
> TRIM: maybe yes. Rest: pure handwaving.
>
>>>> And your guess is that Intel engineers had no clue when designing the XE,
>>>> including its controller? You think they did not know what you and me
>>>> know, and therefore pray every day that some smart fs designer falls from
>>>> heaven and saves their product from dying in between? Really?
>>>
>>> I am saying that there are problems that CANNOT be solved on the disk
>>> firmware level. Some problems HAVE to be addressed higher up the stack.
>>
>> Exactly. You can't assume that the SSD's firmware understands any and all
>> file system layouts, especially if they are on fragmented LVM or other
>> logical volume manager partitions.
>
> Hopefully the firmware understands exactly no fs layout at all. That would
> be braindead. Instead it should understand how to arrange incoming and
> outgoing data in a way that its own technical requirements are met as
> perfectly as possible. This is no spinning disk; it is completely irrelevant
> what the data layout looks like as long as the controller finds its way
> through and copes best with read/write/erase cycles. It may well use
> additional RAM for caching and data reordering.
> Do you really believe ascending block numbers are placed in ascending
> addresses inside the disk (as an example)? Why should they be? What does that
> mean for fs block ordering? If you don't know anyway what a controller does
> to your data ordering, how do you want to help it with its job?
> Please accept that we are _not_ talking about trivial flash mem here or
> pseudo-SSDs consisting of SD cards. The market has already evolved better
> products. The dinosaurs are extinct even if some are still looking alive.

I am assuming that you are being deliberately facetious here (the alternative is less kind). The simple fact is that you cannot come up with some magical data (re)ordering method that nullifies the problems of common use-cases that are quite nasty for flash based media.

For example - you have a disk that has had all its addressable blocks tainted. A new write comes in - what do you do with it? Worse, a write comes in spanning two erase blocks as a consequence of the data re-alignment in the firmware. You have no choice but to wipe them both and re-write the data. You'd be better off not doing the magic and assuming that the FS is sensibly aligned. Having a large chunk of spare non-addressable space for this doesn't necessarily help you, either, unless it is about the same size as the addressable space (worst case scenario; if you accept that the vast majority of FS-es use 4KB block sizes, you can cut a corner there by a factor of 8). All of that adds to cost - flash is still expensive.

The bottom line is that you _cannot_ solve wear-leveling completely just in firmware. There is no doubt you can get some of the way there, but it is mathematically impossible to solve completely without intervention from further up the stack. Since some black-box firmware optimizations may quite conceivably make the wear problem worse, it makes perfect sense to just hopefully assume that the FS is trying to help - it's unlikely to make things worse and may well make things a lot better.

Gordan
On Thu, 11 Mar 2010 14:42:40 +0100, Asdo <asdo@shiftmail.org> wrote:
> 1- I think the SSD would rewrite once-written blocks to other locations,
> so as to reuse the same physical blocks for wear levelling. The
> written-once blocks are very good candidates because their write count
> is "1".

There are likely to be millions of blocks with the same write count. How do you pick the optimal ones?

> 2- I think SSDs show you a smaller usable size than what they physically
> have. In this way they always have some more blocks to move data to,
> so as to free blocks which have a low write count.

I'm pretty sure that is the case, too. However, to be able to deal with the worst case scenario you would have to effectively double the amount of flash (only expose half of it as addressable). With some corner cutting and assumptions about file systems' block sizes (usually 4KB these days), you can cut a corner, but that's dodgy until we get bigger hardware sectors as standard.

> 3- If you think #2 is not enough, you can leave part of the SSD disk
> unused, by leaving unused space after the last partition.

That is true, but I'd rather apply some higher-level logic to this rather than expect the firmware to make a guess about things it has absolutely no way of knowing for sure.

> Actually, after considering #1 and #2, I don't think TRIM is really
> needed for SSDs - are you sure it is really needed?

I don't think there's any doubt that TRIM helps flash longevity.

> I think it's more a
> kind of optimization, but it needs to be very fast for it to be useful
> as an optimization: faster than an internal block rewrite by the SSD
> wear levelling, and so fast as a SATA/SAS command that the computer is not
> significantly slowed down by using it. Instead, IIRC, I read something
> about it being slow, and maybe it was even requiring FUA or a barrier or
> flush? I don't remember exactly.

There is no obligation on the part of the disk to do anything in response to the TRIM command, IIRC. It is advisory. It doesn't have to clear the blocks online. In fact, TRIM is sector based, and it is unlikely an SSD would act until it can sensibly free an entire erase block.

> There is one place where TRIM would be very useful though, and it's not
> for SSDs, but in virtualization: if the Virtual Machine frees
> space, the VM file system should use TRIM to signal to the host that
> some space is unused. The host should have a way to tell its filesystem
> that the VM disk file has a new hole in that position, so that disk
> space can be freed on the host for use by another VM. This would allow
> much greater overcommit of disk space to virtual machines.

Indeed, I brought this very point up on the KVM mailing list a while back.

Gordan
On Thu, 11 Mar 2010 09:21:30 -0500, Chris Mason <chris.mason@oracle.com> wrote:
> On Wed, Mar 10, 2010 at 07:49:34PM +0000, Gordan Bobic wrote:
>> I'm looking to try BTRFS on a SSD, and I would like to know what SSD
>> optimizations it applies. Is there a comprehensive list of what the ssd
>> mount option does? How are the blocks and metadata arranged? Are
>> there options available comparable to ext2/ext3 to help reduce wear
>> and improve performance?
>>
>> Specifically, on ext2 (journal means more writes, so I don't use
>> ext3 on SSDs, since fsck typically only takes a few seconds when
>> access time is < 100us), I usually apply the
>>     -b 4096 -E stripe-width=(erase_block/4096)
>> parameters to mkfs in order to reduce the multiple erase cycles on
>> the same underlying block.
>>
>> Are there similar optimizations available in BTRFS?
>
> All devices (raid, ssd, single spindle) tend to benefit from big chunks
> of writes going down close together on disk. This is true for different
> reasons on each one, but it is still the easiest way to optimize writes.
> COW filesystems like btrfs are very well suited to sending down lots of big
> writes because we're always reallocating things.

Doesn't this mean _more_ writes? If that's the case, then that would make btrfs a _bad_ choice for flash based media with limited write cycles.

> For traditional storage, we also need to keep blocks from one file (or
> files in a directory) close together to reduce seeks during reads. SSDs
> have no such restrictions, and so the mount -o ssd related options in
> btrfs focus on tossing out tradeoffs that slow down writes in hopes of
> reading faster later.
>
> Someone already mentioned the mount -o ssd and ssd_spread options.
> Mount -o ssd is targeted at faster SSDs that are good at wear leveling and
> generally just benefit from having a bunch of data sent down close
> together. In mount -o ssd, you might find a write pattern like this:
>
>     block N, N+2, N+3, N+4, N+6, N+7, N+16, N+17, N+18, N+19, N+20 ...
>
> It's a largely contiguous chunk of writes, but there may be gaps. Good
> ssds don't really care about the gaps, and they benefit more from the
> fact that we're preferring to reuse blocks that had once been written
> than to go off and find completely contiguous areas of the disk to
> write (which are more likely to have never been written at all).
>
> mount -o ssd_spread is much more strict. You'll get N, N+1, N+2, N+3, N+4, N+5
> etc, because crummy ssds really do care about the gaps.
>
> Now, btrfs could go off and probe for the erasure size and work very
> hard to align things to it. As someone said, alignment of the partition
> table is very important here as well. But for modern ssds this generally
> matters much less than just doing big ios and letting the little log
> structured squirrel inside the device figure things out.

Thanks, that's quite helpful. Can you provide any insight into the alignment of FS structures in such a way that they do not straddle erase block boundaries?

> For trim, we do have mount -o discard. It does introduce a run time
> performance hit (this varies wildly from device to device) and we're
> tuning things as discard capable devices become more common. If anyone
> is looking for a project, it would be nice to have an ioctl that triggers
> free space discards in bulk.

Are you saying that -o discard implements TRIM support?

Gordan
On Thu, Mar 11, 2010 at 04:03:59PM +0000, Gordan Bobic wrote:> On Thu, 11 Mar 2010 16:35:33 +0100, Stephan von Krawczynski > <skraw@ithnet.com> wrote: > > >> Besides, why shouldn''t we help the drive firmware by > >> - writing the data only in erase-block sizes > >> - trying to write blocks that are smaller than the erase-block in a way > >> that won''t cross the erase-block boundary > > > > Because if the designing engineer of a good SSD controller wasn''t able > to > > cope with that he will have no chance to design a second one. > > You seem to be confusing quality of implementation with theoretical > possibility. > > >> This will not only increase the life of the SSD but also increase its > >> performance. > > > > TRIM: maybe yes. Rest: pure handwaving. > > > >> [...] > >> > > And your guess is that intel engineers had no glue when designing > >> > > the XE > >> > > including its controller? You think they did not know what you and > me > >> > > know and > >> > > therefore pray every day that some smart fs designer falls from > >> > > heaven > >> > > and saves their product from dying in between? Really? > >> > > >> > I am saying that there are problems that CANNOT be solved on the disk > >> > firmware level. Some problems HAVE to be addressed higher up the > stack. > >> > >> Exactly, you can''t assume that the SSDs firmware understands any and > all > >> file > >> system layouts, especially if they are on fragmented LVM or other > >> logical > >> volume manager partitions. > > > > Hopefully the firmware understands exactly no fs layout at all. That > would > > be > > braindead. Instead it should understand how to arrange incoming and > > outgoing > > data in a way that its own technical requirements are met as perfect as > > possible. This is no spinning disk, it is completely irrelevant what the > > data > > layout looks like as long as the controller finds its way through and > copes > > best with read/write/erase cycles. It may well use additional RAM for > > caching and data reordering. > > Do you really believe ascending block numbers are placed in ascending > > addresses inside the disk (as an example)? Why should they? What does > that > > mean for fs block ordering? If you don''t know anyway what a controller > > does to > > your data ordering, how do you want to help it with its job? > > Please accept that we are _not_ talking about trivial flash mem here or > > pseudo-SSDs consisting of sd cards. The market has already evolved > better > > products. The dinosaurs are extincted even if some are still looking > alive. > > I am assuming that you are being deliberately facetious here (the > alternative is less kind). The simple fact is that you cannot come up with > some magical data (re)ordering method that nullifies problems of common > use-cases that are quite nasty for flash based media. > > For example - you have a disk that has had all it''s addressable blocks > tainted. A new write comes in - what do you do with it? Worse, a write > comes in spanning two erase blocks as a consequence of the data > re-alignment in the firmware. You have no choice but to wipe them both and > re-write the data. You''d be better off not doing the magic and assuming > that the FS is sensibly aligned.Ok, how exactly would the FS help here? We have a device with a 256kb erasure size, and userland does a 4k write followed by an fsync. If the FS were to be smart and know about the 256kb requirement, it would do a read/modify/write cycle somewhere and then write the 4KB. 
The underlying implementation is the same in the device. It picks a destination, reads it, then writes it back. You could argue (and many people do) that this operation is risky and has a good chance of destroying old data. Perhaps we're best off if the FS does the rmw cycle instead into an entirely safe location. It's a great place for research and people are definitely looking at it.

But with all of that said, it has nothing to do with alignment or trim. Modern SSDs are RAID devices with a large stripe size, and someone somewhere is going to do a read/modify/write to service any small write. You can force this up to the FS or the application; it'll happen somewhere.

The filesystem metadata writes are a very small percentage of the problem overall. Sure, we can do better and try to force larger metadata blocks. This was the whole point behind btrfs' support for large tree blocks, which I'll be enabling again shortly.

-chris
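As a concrete illustration of the larger-metadata-blocks idea Chris mentions, the tree block size is something set at filesystem creation time. A minimal sketch, assuming a mkfs.btrfs build recent enough to expose the node/leaf size knobs, a kernel that can mount the result, and a hypothetical /dev/sdb1 (check mkfs.btrfs --help before relying on the exact option names):

    # create a btrfs filesystem with 16 KiB tree blocks instead of the 4 KiB default
    # (-n sets the node size, -l the leaf size; the 16384 value is only an example)
    mkfs.btrfs -n 16384 -l 16384 /dev/sdb1

Whether this helps on a given SSD depends entirely on how its controller groups writes, so treat it as something to benchmark rather than a guaranteed win.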
>>>>> "Gordan" == Gordan Bobic <gordan@bobich.net> writes:

Gordan> I fully agree that it's important for wear leveling on flash media, but from the security point of view, I think TRIM would be a useful feature on all storage media. If the erased blocks were trimmed it would provide a potentially useful feature of securely erasing the sectors that are no longer used. It would be useful and much more transparent than the secure erase features that only operate on the entire disk. Just MHO.

Except there are no guarantees that TRIM does anything, even if the drive claims to support it. There are a couple of IDENTIFY DEVICE knobs that indicate whether the drive deterministically returns data after a TRIM, and whether the resulting data is zeroes. We query these values and report them to the filesystem. However, testing revealed several devices that reported the right thing but which did in fact return the old data afterwards.

--
Martin K. Petersen
Oracle Linux Engineering
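For anyone wanting to see what their own drive advertises, the fields Martin describes show up in the ATA identify data that hdparm can dump. A minimal sketch, assuming a hypothetical /dev/sda and a reasonably recent hdparm; the exact output wording varies between drives and hdparm versions:

    # dump the identify data and look for the TRIM-related capability lines,
    # e.g. "Data Set Management TRIM supported" and, on drives that claim it,
    # "Deterministic read ZEROs after TRIM"
    hdparm -I /dev/sda | grep -i trim

As Martin notes, these are only claims made by the firmware; some drives have been observed returning stale data even when they advertise deterministic behaviour.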
On Thu, Mar 11, 2010 at 04:18:48PM +0000, Gordan Bobic wrote:> On Thu, 11 Mar 2010 09:21:30 -0500, Chris Mason <chris.mason@oracle.com> > wrote: > > On Wed, Mar 10, 2010 at 07:49:34PM +0000, Gordan Bobic wrote: > >> I''m looking to try BTRFS on a SSD, and I would like to know what SSD > >> optimizations it applies. Is there a comprehensive list of what ssd > >> mount option does? How are the blocks and metadata arranged? Are > >> there options available comparable to ext2/ext3 to help reduce wear > >> and improve performance? > >> > >> Specifically, on ext2 (journal means more writes, so I don''t use > >> ext3 on SSDs, since fsck typically only takes a few seconds when > >> access time is < 100us), I usually apply the > >> -b 4096 -E stripe-width = (erase_block/4096) > >> parameters to mkfs in order to reduce the multiple erase cycles on > >> the same underlying block. > >> > >> Are there similar optimizations available in BTRFS? > > > > All devices (raid, ssd, single spindle) tend to benefit from big chunks > > of writes going down close together on disk. This is true for different > > reasons on each one, but it is still the easiest way to optimize writes. > > COW filesystems like btrfs are very well suited to send down lots of big > > writes because we''re always reallocating things. > > Doesn''t this mean _more_ writes? If that''s the case, then that would make > btrfs a _bad_ choice for flash based media with limite write cycles.It just means that when we do write, we don''t overwrite the existing data in the file. We allocate a new block instead and write there (freeing the old one) . This gives us a lot of control over grouping writes together, instead of being restricted to the layout from when the file was first created. It also fragments the files much more, but this isn''t an issue on ssd.> > > For traditional storage, we also need to keep blocks from one file (or > > files in a directory) close together to reduce seeks during reads. SSDs > > have no such restrictions, and so the mount -o ssd related options in > > btrfs focus on tossing out tradeoffs that slow down writes in hopes of > > reading faster later. > > > > Someone already mentioned the mount -o ssd and ssd_spread options. > > Mount -o ssd is targeted at faster SSD that is good at wear leveling and > > generally just benefits from having a bunch of data sent down close > > together. In mount -o ssd, you might find a write pattern like this: > > > > block N, N+2, N+3, N+4, N+6, N+7, N+16, N+17, N+18, N+19, N+20 ... > > > > It''s a largely contiguous chunk of writes, but there may be gaps. Good > > ssds don''t really care about the gaps, and they benefit more from the > > fact that we''re preferring to reuse blocks that had once been written > > than to go off and find completely contiguous areas of the disk to > > write (which are more likely to have never been written at all). > > > > mount -o ssd_spread is much more strict. You''ll get N,N+2,N+3,N+4,N+5 > > etc because crummy ssds really do care about the gaps. > > > > Now, btrfs could go off and probe for the erasure size and work very > > hard to align things to it. As someone said, alignment of the partition > > table is very important here as well. But for modern ssd this generally > > matters much less than just doing big ios and letting the little log > > structured squirrel inside the device figure things out. > > Thanks, that''s quite helpful. 
> Can you provide any insight into alignment of FS structures in such a way that they do not straddle erase block boundaries?

We align on 4k (but partition alignment can defeat this). We don't attempt to understand or guess at erasure blocks. Unless the filesystem completely takes over the FTL duties, I don't think it makes sense to do more than send large writes whenever we can. The raid 5/6 patches will add more knobs for strict alignment, but I'd be very surprised if they made a big difference on modern SSDs.

> > For trim, we do have mount -o discard. It does introduce a run time performance hit (this varies wildly from device to device) and we're tuning things as discard capable devices become more common. If anyone is looking for a project it would be nice to have an ioctl that triggers free space discards in bulk.
>
> Are you saying that -o discard implements trim support?

Yes, it sends trim/discards down to devices that support it.

-chris
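Putting the options from this exchange together, a minimal sketch of how they are used, assuming a hypothetical /dev/sdb1 mounted on /mnt (which option is appropriate depends on the drive, as Chris describes above):

    # relaxed SSD allocation for drives with competent wear leveling
    mount -o ssd /dev/sdb1 /mnt
    # stricter, more contiguous allocation for weaker drives
    mount -o ssd_spread /dev/sdb1 /mnt
    # additionally send TRIM/discard for freed blocks on drives that support it
    # (the run-time discard Chris mentions; it can cost performance on some devices)
    mount -o ssd,discard /dev/sdb1 /mnt

Since partition alignment can defeat the filesystem's own 4k alignment, it is also worth checking where the partition actually starts, for example with fdisk in sector units:

    # print the partition table with start offsets in 512-byte sectors
    fdisk -l -u /dev/sdb

The option names come from the discussion above; the device paths and the combinations shown are only examples to benchmark, not a recommendation for any particular drive.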
>>>>> "Gordan" == Gordan Bobic <gordan@bobich.net> writes:Gordan> SD == SSD with an SD interface. No, not really. It is true that conceivably you could fit a sophisticated controller in an SD card form factor. But fact is that takes up space which could otherwise be used for flash. There may also be power consumption/heat dissipation concerns. Most SD card controllers have very, very simple wear leveling that in most cases rely on the filesystem being FAT. These cards are aimed at cameras, MP3 players, etc. after all. And consequently it''s trivial to wear out an SD card by writing a block over and over. The same is kind of true for Compact Flash. There are two types of cards, I prefer to think of them as camera grade and industrial. Camera grade CF is really no different from SD cards or any other consumer flash form factor. Industrial CF cards have controllers with sophisticated wear leveling. Usually this is not quite as clever as a "big" SSD, but it is close enough that you can treat the device as a real disk drive. I.e. it has multiple channels working in parallel unlike the consumer devices. As a result of the smarter controller logic and the bigger bank of spare flash, industrial cards are much smaller in capacity. Typically in the 1-4 GB range. But they are in many cases indistinguishable from a real SSD in terms of performance and reliability. Gordan> You can make an educated guess. For starters given that visible Gordan> sector sizes are not equal to FS block sizes, it means that FS Gordan> block sizes can straddle erase block boundaries without the Gordan> flash controller, no matter how fancy, being able to determine Gordan> this. Thus, at the very least, aligning FS structures so that Gordan> they do not straddle erase block boundaries is useful in ALL Gordan> cases. Thinking otherwise is just sticking your head in the sand Gordan> because you cannot be bothered to think. There are no means of telling what the erase block size is. None. We have no idea. The vendors won''t talk. It''s part of their IP. Also, there is no point in being hung up on the whole erase block thing. Only crappy SSDs use block mapping where that matters. These devices will die a horrible death soon enough. Good SSDs use a technique akin to logging filesystems in which the erase block size and all other other physical characteristics don''t matter (from a host perspective). -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 11 Mar 2010 15:39:05 +0100 Sander <sander@humilis.net> wrote:> Stephan von Krawczynski wrote (ao): > > Honestly I would just drop the idea of an SSD option simply because the > > vendors implement all kinds of neat strategies in their devices. So in the end > > you cannot really tell if the option does something constructive and not > > destructive in combination with a SSD controller. > > My understanding of the ssd mount option is also that the fs doens''t try > to do all kinds of smart (and potential expensive) things which make > sense for rotating media to reduce seeks and the like. > > SanderSuch an optimization sounds valid on first sight. But re-think closely: how does the fs really know about seeks needed during some operation? If your disk is a single plate one your seeks are completely different from multi plate. So even a simple case is more or less unpredictable. If you consider a RAID or SAN as device base it should be clear that trying to optimize for certain device types is just a fake. What does that tell you? The optimization was a pure loss of work hours in the first place. In fact if you look at this list a lot of talks going on are highly academic and have no real usage scenario. Sometimes trying to be super-smart is indeed not useful (for a fs) ... -- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote:> On Thu, 11 Mar 2010 15:39:05 +0100 > Sander <sander@humilis.net> wrote: > > > Stephan von Krawczynski wrote (ao): > > > Honestly I would just drop the idea of an SSD option simply because the > > > vendors implement all kinds of neat strategies in their devices. So in the end > > > you cannot really tell if the option does something constructive and not > > > destructive in combination with a SSD controller. > > > > My understanding of the ssd mount option is also that the fs doens''t try > > to do all kinds of smart (and potential expensive) things which make > > sense for rotating media to reduce seeks and the like. > > > > Sander > > Such an optimization sounds valid on first sight. But re-think closely: how > does the fs really know about seeks needed during some operation?Well the FS makes a few assumptions (in the nonssd case). First it assumes the storage is not a memory device. If things would fit in memory we wouldn''t need filesytems in the first place. Then it assumes that adjacent blocks are cheap to read and blocks that are far away are expensive to read. Given expensive raid controllers, cache, and everything else, you''re correct that sometimes this assumption is wrong. But, on average seeking hurts. Really a lot. We try to organize files such that files that are likely to be read together are found together on disk. Btrfs is fairly good at this during file creation and not as good as ext*/xfs as files over overwritten and modified again and again (due to cow). If you turn mount -o ssd on for your drive and do a test, you might not notice much difference right away. ssds tend to be pretty good right out of the box. Over time it tends to help, but it is a very hard thing to benchmark in general. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday 11 March 2010 17:19:32 Chris Mason wrote:> On Thu, Mar 11, 2010 at 04:03:59PM +0000, Gordan Bobic wrote: > > On Thu, 11 Mar 2010 16:35:33 +0100, Stephan von Krawczynski > > > > <skraw@ithnet.com> wrote: > > >> Besides, why shouldn''t we help the drive firmware by > > >> - writing the data only in erase-block sizes > > >> - trying to write blocks that are smaller than the erase-block in a > > >> way that won''t cross the erase-block boundary > > > > > > Because if the designing engineer of a good SSD controller wasn''t able > > > > to > > > > > cope with that he will have no chance to design a second one. > > > > You seem to be confusing quality of implementation with theoretical > > possibility. > > > > >> This will not only increase the life of the SSD but also increase its > > >> performance. > > > > > > TRIM: maybe yes. Rest: pure handwaving. > > > > > >> [...] > > >> > > >> > > And your guess is that intel engineers had no glue when designing > > >> > > the XE > > >> > > including its controller? You think they did not know what you and > > > > me > > > > >> > > know and > > >> > > therefore pray every day that some smart fs designer falls from > > >> > > heaven > > >> > > and saves their product from dying in between? Really? > > >> > > > >> > I am saying that there are problems that CANNOT be solved on the > > >> > disk firmware level. Some problems HAVE to be addressed higher up > > >> > the > > > > stack. > > > > >> Exactly, you can''t assume that the SSDs firmware understands any and > > > > all > > > > >> file > > >> system layouts, especially if they are on fragmented LVM or other > > >> logical > > >> volume manager partitions. > > > > > > Hopefully the firmware understands exactly no fs layout at all. That > > > > would > > > > > be > > > braindead. Instead it should understand how to arrange incoming and > > > outgoing > > > data in a way that its own technical requirements are met as perfect as > > > possible. This is no spinning disk, it is completely irrelevant what > > > the data > > > layout looks like as long as the controller finds its way through and > > > > copes > > > > > best with read/write/erase cycles. It may well use additional RAM for > > > caching and data reordering. > > > Do you really believe ascending block numbers are placed in ascending > > > addresses inside the disk (as an example)? Why should they? What does > > > > that > > > > > mean for fs block ordering? If you don''t know anyway what a controller > > > does to > > > your data ordering, how do you want to help it with its job? > > > Please accept that we are _not_ talking about trivial flash mem here or > > > pseudo-SSDs consisting of sd cards. The market has already evolved > > > > better > > > > > products. The dinosaurs are extincted even if some are still looking > > > > alive.You seem to be forgetting that CEOs like to save 10 cents per drive to show "millions of dollars saved" by their work, I highly doubt that we won''t see SSDs with half assed wear leveling implementations 10 years from now. And no, I don''t think that the linear storage that we see at the ATA level is any linear on the drive itself. But erase blocks are still erase blocks. I highly doubt that the abstraction layer works over sector sizes (512B) and not over whole erase block sizes -- just because it would make it much more complicated, thus slower. 
This way, even if the writes to the flash cells are made in fashion similar to a LogFS, one will still get r/m/w cycle if the write is 512B in size on a block that has also other data.> > > > I am assuming that you are being deliberately facetious here (the > > alternative is less kind). The simple fact is that you cannot come up > > with some magical data (re)ordering method that nullifies problems of > > common use-cases that are quite nasty for flash based media. > > > > For example - you have a disk that has had all it''s addressable blocks > > tainted. A new write comes in - what do you do with it? Worse, a write > > comes in spanning two erase blocks as a consequence of the data > > re-alignment in the firmware. You have no choice but to wipe them both > > and re-write the data. You''d be better off not doing the magic and > > assuming that the FS is sensibly aligned. > > Ok, how exactly would the FS help here? We have a device with a 256kb > erasure size, and userland does a 4k write followed by an fsync.I assume here that the FS knows about erasure size and does implement TRIM.> If the FS were to be smart and know about the 256kb requirement, it > would do a read/modify/write cycle somewhere and then write the 4KB.If all the free blocks have been TRIMmed, FS should pick a completely free erasure size block and write those 4KiB of data. Correct implementation of wear leveling in the drive should notice that the write is entirely inside a free block and make just a write cycle adding zeros to the end of supplied data.> The underlying implementation is the same in the device. It picks a > destination, reads it then writes it back. You could argue (and many > people do) that this operation is risky and has a good chance of > destroying old data. Perhaps we''re best off if the FS does the rmw > cycle instead into an entirely safe location.And IMO that''s the idea behind TRIM -- not to force the device do do rmw cycles, only write cycle or erase cycle, provided there''s free space and the free space doesn''t have considerably more write cycles than the already allocated data.> > It''s a great place for research and people are definitely looking at it. > > But with all of that said, it has nothing to do with alignment or trim. > Modern ssds are a raid device with a large stripe size, and someone > somewhere is going to do a read/modify/write to service any small write. > You can force this up to the FS or the application, it''ll happen > somewhere.Yes, and if the parition is full rmw will happen in the drive. But if the partition is far from full, free space is TRIMmed then than the r/m/w cycle will happen inside btrfs and the SSD won''t have to do its magic -- making the process faster. The effect will be a FS that behaves consistently over a broad range of SSDs, provided there''s free space left.> The filesystem metadata writes are a very small percentage of the > problem overall. Sure we can do better and try to force larger metadata > blocks. This was the whole point behind btrfs'' support for large tree > blocks, which I''ll be enabling again shortly.-- Hubert Kario QBS - Quality Business Software ul. Ksawerów 30/85 02-656 Warszawa POLAND tel. +48 (22) 646-61-51, 646-74-24 fax +48 (22) 646-61-50 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Mar 12, 2010 at 02:07:40AM +0100, Hubert Kario wrote:> > > For example - you have a disk that has had all it''s addressable blocks > > > tainted. A new write comes in - what do you do with it? Worse, a write > > > comes in spanning two erase blocks as a consequence of the data > > > re-alignment in the firmware. You have no choice but to wipe them both > > > and re-write the data. You''d be better off not doing the magic and > > > assuming that the FS is sensibly aligned. > > > > Ok, how exactly would the FS help here? We have a device with a 256kb > > erasure size, and userland does a 4k write followed by an fsync. > > I assume here that the FS knows about erasure size and does implement TRIM. > > > If the FS were to be smart and know about the 256kb requirement, it > > would do a read/modify/write cycle somewhere and then write the 4KB. > > If all the free blocks have been TRIMmed, FS should pick a completely free > erasure size block and write those 4KiB of data. > > Correct implementation of wear leveling in the drive should notice that the > write is entirely inside a free block and make just a write cycle adding zeros > to the end of supplied data. > > > The underlying implementation is the same in the device. It picks a > > destination, reads it then writes it back. You could argue (and many > > people do) that this operation is risky and has a good chance of > > destroying old data. Perhaps we''re best off if the FS does the rmw > > cycle instead into an entirely safe location. > > And IMO that''s the idea behind TRIM -- not to force the device do do rmw > cycles, only write cycle or erase cycle, provided there''s free space and the > free space doesn''t have considerably more write cycles than the already > allocated data. > > > > > It''s a great place for research and people are definitely looking at it. > > > > But with all of that said, it has nothing to do with alignment or trim. > > Modern ssds are a raid device with a large stripe size, and someone > > somewhere is going to do a read/modify/write to service any small write. > > You can force this up to the FS or the application, it''ll happen > > somewhere. > > Yes, and if the parition is full rmw will happen in the drive. But if the > partition is far from full, free space is TRIMmed then than the r/m/w cycle > will happen inside btrfs and the SSD won''t have to do its magic -- making the > process faster.The filesystem cannot do read/modify/write faster or better than the drive. The drive is pushing data around internally and the FS has to pull it in and out of the sata bus. The drive is much faster. The FS can be safer than the drive because it is able to do more consistency checks on the data as it reads. But this also has a cost because the crcs for the blocks might not be adjacent to the block. If the FS is the FTL, we don''t need trim because the FS already knows which blocks are in use. So, there isn''t as much complexity in finding the free erasure block. The FS FTL does win there. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 12 Mar 2010 02:07:40 +0100 Hubert Kario <hka@qbs.com.pl> wrote:> > [...] > > If the FS were to be smart and know about the 256kb requirement, it > > would do a read/modify/write cycle somewhere and then write the 4KB. > > If all the free blocks have been TRIMmed, FS should pick a completely free > erasure size block and write those 4KiB of data. > > Correct implementation of wear leveling in the drive should notice that the > write is entirely inside a free block and make just a write cycle adding zeros > to the end of supplied data.Your assumption here is that your _addressed_ block layout is completely identical to the SSDs "disk" layout. Else you cannot know where a "free erasure block" is located and how to address it from FS. I really wonder what this assumption is based on. You still think a SSD is a true disk with linear addressing. I doubt that very much. Even on true spinning disks your assumption is wrong for relocated sectors. Which basically means that every disk controller firmware fiddles around with the physical layout since decades. Please accept that you cannot do a disks'' job in FS. The more advanced technology gets the more disks become black boxes with a defined software interface. Use this interface and drop the idea of having inside knowledge of such a device. That''s other peoples'' work. If you want to design smart SSD controllers hire at a company that builds those. -- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday 12 March 2010 10:15:28 Stephan von Krawczynski wrote:
> On Fri, 12 Mar 2010 02:07:40 +0100 Hubert Kario <hka@qbs.com.pl> wrote:
> > > [...]
> > > If the FS were to be smart and know about the 256kb requirement, it would do a read/modify/write cycle somewhere and then write the 4KB.
> >
> > If all the free blocks have been TRIMmed, FS should pick a completely free erasure size block and write those 4KiB of data.
> >
> > Correct implementation of wear leveling in the drive should notice that the write is entirely inside a free block and make just a write cycle adding zeros to the end of supplied data.
>
> Your assumption here is that your _addressed_ block layout is completely identical to the SSDs "disk" layout. Else you cannot know where a "free erasure block" is located and how to address it from FS. I really wonder what this assumption is based on. You still think a SSD is a true disk with linear addressing. I doubt that very much.

I made no such assumptions. I'm sure that the linearity on the ATA LBA level isn't so linear on the device level, especially after wear-leveling takes its toll, but I assume that the smallest block of data that the translation layer can address is erase-block sized and that all the erase-blocks are equal in size. Otherwise the algorithm would be needlessly complicated, which would make it both slower and more error prone.

> Even on true spinning disks your assumption is wrong for relocated sectors.

Which we don't have to worry about, because if the drive has fewer than 5 of 'em the impact of hitting them is marginal, and if there are more, the user has a much more pressing problem than the performance of the drive or FS.

> Which basically means that every disk controller firmware fiddles around with the physical layout since decades. Please accept that you cannot do a disks' job in FS. The more advanced technology gets the more disks become black boxes with a defined software interface. Use this interface and drop the idea of having inside knowledge of such a device. That's other peoples' work. If you want to design smart SSD controllers hire at a company that builds those.

And I don't think that doing the disks' job in the FS is a good idea, but I think that we should be able to minimise the impact of the translation layer. The way to do this is to treat the device as a block device with sectors the size of erase-blocks. That's nothing too fancy, don't you think?

--
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50
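One thing that can be done today without any knowledge of the drive internals is to at least start partitions on a boundary that is a multiple of the likely erase-block sizes. A minimal sketch with parted, assuming a hypothetical /dev/sdb and using 1 MiB as a convenient multiple of the commonly cited 128-512 KiB erase blocks (note that this wipes the existing partition table):

    # new GPT label, first partition starting at 1 MiB and spanning the rest of the device
    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart primary 1MiB 100%

As Martin points out elsewhere in the thread, the real erase-block size is usually not published, so this only guarantees that the filesystem's own 4k alignment is not thrown off by an odd partition start; it cannot guarantee alignment to the actual erase blocks.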
On Thu, 11 Mar 2010 13:00:17 -0500 Chris Mason <chris.mason@oracle.com> wrote:> On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote: > > On Thu, 11 Mar 2010 15:39:05 +0100 > > Sander <sander@humilis.net> wrote: > > > > > Stephan von Krawczynski wrote (ao): > > > > Honestly I would just drop the idea of an SSD option simply because the > > > > vendors implement all kinds of neat strategies in their devices. So in the end > > > > you cannot really tell if the option does something constructive and not > > > > destructive in combination with a SSD controller. > > > > > > My understanding of the ssd mount option is also that the fs doens''t try > > > to do all kinds of smart (and potential expensive) things which make > > > sense for rotating media to reduce seeks and the like. > > > > > > Sander > > > > Such an optimization sounds valid on first sight. But re-think closely: how > > does the fs really know about seeks needed during some operation? > > Well the FS makes a few assumptions (in the nonssd case). First it > assumes the storage is not a memory device. If things would fit in > memory we wouldn''t need filesytems in the first place.Ok, here is the bad news. This assumption everything from right to completely wrong, and you cannot really tell the mainstream answer. Two examples from opposite parts of the technology world: - History: way back in the 80''s there was a 3rd party hardware for C=1541 (floppy drive for C=64) that read in the complete floppy and served all incoming requests from the ram buffer. So your assumption can already be wrong for a trivial floppy drive from ancient times. - Nowadays: being a linux installation today chances are that the matrix has you. Quite a lot of installations are virtualized. So your storage is a virtual one either, which means it is likely being a fs buffer from the host system, i.e. RAM. And sorry to say: "if things would fit in memory" you probably still need a fs simply because there is no actual way to organize data (be it executable or not) in RAM without a fs layer. You can''t save data without an abstract file data type. To have one accessible you need a fs. Btw the other way round is as interesting: there is currently no fs for linux that knows how to execute in place. Meaning if you really had only RAM and you have a fs to organize your data it would be just logical to have ways to _not_ load data (in other parts of the RAM), but to use it in its original storage (RAM-)space.> Then it assumes that adjacent blocks are cheap to read and blocks that > are far away are expensive to read. Given expensive raid controllers, > cache, and everything else, you''re correct that sometimes this > assumption is wrong.As already mentioned this assumption may be completely wrong even without a raid controller, being within a virtual environment. Even far away blocks can be one byte away in the next fs buffer of the underlying host fs (assuming your device is in fact a file on the host;-).> But, on average seeking hurts. Really a lot.Yes, seeking hurts. But there is no way to know if there is seeking at all. On the other hand, if your storage is a netblock device seeking on the server is probably your smallest problem, compared to the network latency in between.> We try to organize files such that files that are likely to be read > together are found together on disk. 
Btrfs is fairly good at this > during file creation and not as good as ext*/xfs as files over > overwritten and modified again and again (due to cow).You are basically saying that btrfs perfectly organizes write-once devices ;-)> If you turn mount -o ssd on for your drive and do a test, you might not > notice much difference right away. ssds tend to be pretty good right > out of the box. Over time it tends to help, but it is a very hard thing > to benchmark in general.Honestly, this sounds like "I give up" to me ;-) You just said that generally it is "very hard to benchmark". Which means "nobody can see or feel it in real world" in non-tech language. Please understand that I am the last one critizing your and others'' brillant work and the time you spend for btrfs. Only I do believe that if you spent one hour on some fs like glusterfs for every 10 hours you spend on btrfs you would be both king and queen for the linux HA community :-) (but probably unemployed, so I can''t really beat you for it)> -chris-- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 12 Mar 2010 17:00:08 +0100 Hubert Kario <hka@qbs.com.pl> wrote:> > Even on true > > spinning disks your assumption is wrong for relocated sectors. > > Which we don''t have to worry about because if the drive has less than 5 of > ''em, the impact of hitting them is marginal and if there are more, the user > has much more pressing problem than the performance of the drive or FS.Are you really sure that a drive firmware tells you about the true number of relocated sectors? I mean if it makes the product look better in comparison to another product, are you really sure that the firmware will not tell you what you expect to see only to make you content and happy with your drive?> > Which > > basically means that every disk controller firmware fiddles around with > > the physical layout since decades. Please accept that you cannot do a > > disks'' job in FS. The more advanced technology gets the more disks become > > black boxes with a defined software interface. Use this interface and drop > > the idea of having inside knowledge of such a device. That''s other > > peoples'' work. If you want to design smart SSD controllers hire at a > > company that builds those. > > And I don''t think that doing disks'' job in the FS is good idea, but I think > that we should be able to minimise the impact of the translation layer. > > The way to do this, is to threat the device as a block device with sectors the > size of erase-blocks. That''s nothing too fancy, don''t you think?I don''t believe anyone is able to tell the size of erase-blocks of some device - current and future - for sure. I do believe that making this guess only reduces the future design options for new devices - if its creators care at all about your guess. Why not let the fs designer take his creative options in fs layer and let the device designer use his brain on the device level and all meet at the predefined software interface in between - and nowhere _else_.> -- > Hubert Kario-- Regards, Stephan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Saturday 13 March 2010 18:02:10 Stephan von Krawczynski wrote:> On Fri, 12 Mar 2010 17:00:08 +0100 > Hubert Kario <hka@qbs.com.pl> wrote: > > > Even on true > > > spinning disks your assumption is wrong for relocated sectors. > > > > Which we don''t have to worry about because if the drive has less than 5 > > of ''em, the impact of hitting them is marginal and if there are more, > > the user has much more pressing problem than the performance of the > > drive or FS. > > Are you really sure that a drive firmware tells you about the true number > of relocated sectors? I mean if it makes the product look better in > comparison to another product, are you really sure that the firmware will > not tell you what you expect to see only to make you content and happy > with your drive?because Joe Sixpack reads SMART values, and even if he does, he will be much more angry when a drive that has no or few relocations fails, that when a drive that reports that''s failing fails. If the drive arrives with badsectors, it goes where it came from the same day if it meets an IT guy worth its salt, any IT guy knows that some HDDs develop badsectors no matter the make and model, but if they do, you replace them. And as the Google disk survey showed, the SMART has very high percentage of Type I errors, but very few Type II errors. But we''re off-topic here> > > Which > > > basically means that every disk controller firmware fiddles around with > > > the physical layout since decades. Please accept that you cannot do a > > > disks'' job in FS. The more advanced technology gets the more disks > > > become black boxes with a defined software interface. Use this > > > interface and drop the idea of having inside knowledge of such a > > > device. That''s other peoples'' work. If you want to design smart SSD > > > controllers hire at a company that builds those. > > > > And I don''t think that doing disks'' job in the FS is good idea, but I > > think that we should be able to minimise the impact of the translation > > layer. > > > > The way to do this, is to threat the device as a block device with > > sectors the size of erase-blocks. That''s nothing too fancy, don''t you > > think? > > I don''t believe anyone is able to tell the size of erase-blocks of some > device - current and future - for sure.Well, if the engeneer that designed it doesn''t know this, I don''t know how he got his degree. Just because it isn''t publicised now, doesn''t mean it won''t be in near future. Besides that, to detect how big the erase-blocks are in size is easy, if they have any impact on the performance, if they don''t have any impact (whatever the reason) tunning for their size is pointless anyway.> I do believe that making this > guess only reduces the future design options for new devices - if its > creators care at all about your guess.Did I, or any one else, say that we want to hardwire a specific erase-block size to the design of the FS?! That would be utter stupidity!> Why not let the fs designer take his creative options in fs layer and let > the device designer use his brain on the device level and all meet at the > predefined software interface in between - and nowhere _else_.We (well, at least Gordon and I) just want a "stripe_width" option added to the mkfs.btrfs, just like it is there for ext2/3/4, reiserfs, xfs and jfs to name a few. It would need very few additional tweaks to make it SSD friendly, hardly any considering how -o ssd or -o ssd_spread already work. 
You're forgetting there's an elephant in the room that won't talk to devices that don't have sectors 512B in size. If not for it, there wouldn't even _be_ SSDs with 512B sectors. It's not the way Flash memory works. The 512B abstraction is there to be compatible, to work with one current OS; it's not there because it better describes the way Flash memory works or is the best way to address the data on the device itself. There are already consumer HDDs with 4kiB sector size, so the situation is getting better. We can only hope that in a few years' time SSDs will have sectors the size of erase-blocks. But in the meantime, stripe_width would be enough.

Besides, the stripe_width option will be useful not only for SSDs but also in environments where btrfs is on a device that is a RAID5/6 array (reconfiguring a server with many virtual machines is far from easy and sometimes just can't be done because of heterogeneous virtualised OSs that need the data protection provided by lower layers).

--
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50
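For reference, this is roughly what the knob being asked for looks like on the filesystems Hubert lists as already having it. A minimal sketch, assuming a hypothetical /dev/sdb1 and a guessed 256 KiB erase block (drives do not advertise the real value, so the numbers are assumptions):

    # ext4: stride and stripe-width are given in filesystem blocks,
    # so 256 KiB / 4 KiB = 64 blocks
    mkfs.ext4 -b 4096 -E stride=64,stripe-width=64 /dev/sdb1

xfs exposes analogous stripe unit/width settings at mkfs time. Nothing comparable is exposed by mkfs.btrfs as of this thread; the example is only meant to show the kind of interface being requested.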
On Saturday 13 March 2010 17:43:59 Stephan von Krawczynski wrote:> On Thu, 11 Mar 2010 13:00:17 -0500 > Chris Mason <chris.mason@oracle.com> wrote: > > On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote: > > > On Thu, 11 Mar 2010 15:39:05 +0100 > > > Sander <sander@humilis.net> wrote: > > > > Stephan von Krawczynski wrote (ao): > > > > > Honestly I would just drop the idea of an SSD option simply because > > > > > the vendors implement all kinds of neat strategies in their > > > > > devices. So in the end you cannot really tell if the option does > > > > > something constructive and not destructive in combination with a > > > > > SSD controller. > > > > > > > > My understanding of the ssd mount option is also that the fs doens''t > > > > try to do all kinds of smart (and potential expensive) things which > > > > make sense for rotating media to reduce seeks and the like. > > > > > > > > Sander > > > > > > Such an optimization sounds valid on first sight. But re-think closely: > > > how does the fs really know about seeks needed during some operation? > > > > Well the FS makes a few assumptions (in the nonssd case). First it > > assumes the storage is not a memory device. If things would fit in > > memory we wouldn''t need filesytems in the first place. > > Ok, here is the bad news. This assumption everything from right to > completely wrong, and you cannot really tell the mainstream answer. > Two examples from opposite parts of the technology world: > - History: way back in the 80''s there was a 3rd party hardware for C=1541 > (floppy drive for C=64) that read in the complete floppy and served all > incoming requests from the ram buffer. So your assumption can already be > wrong for a trivial floppy drive from ancient times.such assumption doesn''t make it work slower on such device> - Nowadays: being a linux installation today chances are that the matrix > has you. Quite a lot of installations are virtualized. So your storage is > a virtual one either, which means it is likely being a fs buffer from the > host system, i.e. RAM.Buffers use read_ahead and are smaller than the underlaying device, still, such assumption doesn''t make the FS perform worse in this situation.> And sorry to say: "if things would fit in memory" you probably still need a > fs simply because there is no actual way to organize data (be it > executable or not) in RAM without a fs layer. You can''t save data without > an abstract file data type. To have one accessible you need a fs.yes, that''s why there is tmpfs, btrfs isn''t meant to be all and end all as far as FSs go> Btw the other way round is as interesting: there is currently no fs for > linux that knows how to execute in place. Meaning if you really had only > RAM and you have a fs to organize your data it would be just logical to > have ways to _not_ load data (in other parts of the RAM), but to use it in > its original storage (RAM-)space.at least ext2 does support XIP on platform that support it...> > > Then it assumes that adjacent blocks are cheap to read and blocks that > > are far away are expensive to read. Given expensive raid controllers, > > cache, and everything else, you''re correct that sometimes this > > assumption is wrong. > > As already mentioned this assumption may be completely wrong even without a > raid controller, being within a virtual environment. 
Even far away blocks > can be one byte away in the next fs buffer of the underlying host fs > (assuming your device is in fact a file on the host;-).and again, such assumption doesn''t reduce the performance> > > But, on average seeking hurts. Really a lot. > > Yes, seeking hurts. But there is no way to know if there is seeking at all. > On the other hand, if your storage is a netblock device seeking on the > server is probably your smallest problem, compared to the network latency > in between.and because of that, there''s read ahead and support for big packets on the TCP level, so the assumption does make the FS perform better with it than without it. It''s one of the assumptions that you _have_ to make, just like the assumption that the computer counts in binary, or there''s more disk space than RAM. But those assumptions _don''t_ make the performance (much) worse when they don''t hold true for known devices that can impersonate rotating magnetic media.> > We try to organize files such that files that are likely to be read > > together are found together on disk. Btrfs is fairly good at this > > during file creation and not as good as ext*/xfs as files over > > overwritten and modified again and again (due to cow). > > You are basically saying that btrfs perfectly organizes write-once devices > ;-) > > > If you turn mount -o ssd on for your drive and do a test, you might not > > notice much difference right away. ssds tend to be pretty good right > > out of the box. Over time it tends to help, but it is a very hard thing > > to benchmark in general. > > Honestly, this sounds like "I give up" to me ;-) > You just said that generally it is "very hard to benchmark". Which means > "nobody can see or feel it in real world" in non-tech language.No, it''s not this. When a SSD is fresh, the undeling write leveling has many blocks to choose from, so it''s blaizing fast. The same holds true when the test uses small amount of data (relative to SSD size). "very hard to benchmark" means just that -- the benchmark is much more complicated, must take into account much more variables and takes much more time compared to rotating magnetic media benchmark. To test SSD performance you need to benchmark both the speed of flash memory _and_ the speed and performance of the write leveling algorithm (because it shows its ugly head only after specific workloads or when all blocks are allocated), and that''s non trivial to say the least. Add FS on top of it and you have a nice dissertation right there. -- Hubert Kario QBS - Quality Business Software ul. Ksawerów 30/85 02-656 Warszawa POLAND tel. +48 (22) 646-61-51, 646-74-24 fax +48 (22) 646-61-50 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Mar 13, 2010 at 05:43:59PM +0100, Stephan von Krawczynski wrote:> On Thu, 11 Mar 2010 13:00:17 -0500 > Chris Mason <chris.mason@oracle.com> wrote: > > > On Thu, Mar 11, 2010 at 06:35:06PM +0100, Stephan von Krawczynski wrote: > > > On Thu, 11 Mar 2010 15:39:05 +0100 > > > Sander <sander@humilis.net> wrote: > > > > > > > Stephan von Krawczynski wrote (ao): > > > > > Honestly I would just drop the idea of an SSD option simply because the > > > > > vendors implement all kinds of neat strategies in their devices. So in the end > > > > > you cannot really tell if the option does something constructive and not > > > > > destructive in combination with a SSD controller. > > > > > > > > My understanding of the ssd mount option is also that the fs doens''t try > > > > to do all kinds of smart (and potential expensive) things which make > > > > sense for rotating media to reduce seeks and the like. > > > > > > > > Sander > > > > > > Such an optimization sounds valid on first sight. But re-think closely: how > > > does the fs really know about seeks needed during some operation? > > > > Well the FS makes a few assumptions (in the nonssd case). First it > > assumes the storage is not a memory device. If things would fit in > > memory we wouldn''t need filesytems in the first place. > > Ok, here is the bad news. This assumption everything from right to completely > wrong, and you cannot really tell the mainstream answer. > Two examples from opposite parts of the technology world: > - History: way back in the 80''s there was a 3rd party hardware for C=1541 > (floppy drive for C=64) that read in the complete floppy and served all > incoming requests from the ram buffer. So your assumption can already be wrong > for a trivial floppy drive from ancient times.Agreed, I''ll try my best not to tune btrfs for trivial floppies from ancient times ;)> > Then it assumes that adjacent blocks are cheap to read and blocks that > > are far away are expensive to read. Given expensive raid controllers, > > cache, and everything else, you''re correct that sometimes this > > assumption is wrong. > > As already mentioned this assumption may be completely wrong even without a > raid controller, being within a virtual environment. Even far away blocks can > be one byte away in the next fs buffer of the underlying host fs (assuming > your device is in fact a file on the host;-).Ok, there are roughly three environments at play here. 1) Seeking hurts, and you have no idea if adjacent block numbers are close together on the device. 2) Seeking doesn''t hurt and you have no idea if adjacent block numbers are close together on the device. (SSD). 3) Seeking hurts and you can assume adjacent block numbers are close together on the device (disks). Type one is impossible to tune, and so it isn''t interesting in this discussion. There are an infinite number of ways to actually store data you care about, and just because one of those ways can''t be tuned doesn''t mean we should stop trying to tune for the ones that most people actually use.> > > But, on average seeking hurts. Really a lot. > > Yes, seeking hurts. But there is no way to know if there is seeking at all. > On the other hand, if your storage is a netblock device seeking on the server > is probably your smallest problem, compared to the network latency in between. 
>Very true, and if I were using such a setup in performance critical applications, I would: 1) Tune the network so that seeks mattered again 2) Tune the seeks.> > We try to organize files such that files that are likely to be read > > together are found together on disk. Btrfs is fairly good at this > > during file creation and not as good as ext*/xfs as files over > > overwritten and modified again and again (due to cow). > > You are basically saying that btrfs perfectly organizes write-once devices ;-)Storage is all about trade offs, and optimizing read access for write once vs write many is a very different thing. It''s surprising how many of your files are written once and never read, let alone written and then never changed.> > > If you turn mount -o ssd on for your drive and do a test, you might not > > notice much difference right away. ssds tend to be pretty good right > > out of the box. Over time it tends to help, but it is a very hard thing > > to benchmark in general. > > Honestly, this sounds like "I give up" to me ;-) > You just said that generally it is "very hard to benchmark". Which means > "nobody can see or feel it in real world" in non-tech language.No, it just means it is hard to benchmark. SSDs, even really good ssds, are not deterministic. Sometimes they are faster than others and the history of how you''ve abused it in the past factors into how well it performs in the future. A simple graph that talks about the performance of one drive in one workload needs a lot of explanation.> > Please understand that I am the last one critizing your and others'' brillant > work and the time you spend for btrfs. Only I do believe that if you spent one > hour on some fs like glusterfs for every 10 hours you spend on btrfs you would > be both king and queen for the linux HA community :-) > (but probably unemployed, so I can''t really beat you for it)Grin, the list of things I wish I had time to work on is quite long. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/13/2010 08:43 AM, Stephan von Krawczynski wrote:
> - Nowadays: being a linux installation today chances are that the matrix has you. Quite a lot of installations are virtualized. So your storage is a virtual one either, which means it is likely being a fs buffer from the host system, i.e. RAM.

That would be a strictly amateur-hour implementation. It is very important for data integrity that at least all writes are synchronous, and ideally all IO should be uncached in the host. In that case the performance of the guest's virtual IO device will be broadly similar to a real hardware device.

J
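As one concrete way of getting the behaviour Jeremy describes, hypervisors generally expose a per-disk cache mode; with qemu/KVM, for instance, host page-cache bypass is selected on the -drive option. A sketch only, with the image path and the rest of the command line left out as assumptions:

    # bypass the host page cache so guest writes go to the backing storage,
    # not to host RAM (cache=writethrough keeps the host cache but still
    # reports writes only after they reach the device)
    qemu-kvm -drive file=/var/lib/vm/guest.img,cache=none ...

Which mode is appropriate depends on the storage stack underneath, but the point stands: benchmarking a filesystem against host RAM says little about the device.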