In a few weeks parts for my new computer will be arriving. The storage will be a 128GB SSD. A few weeks after that I will order three large disks for a RAID array. I understand that BTRFS RAID 5 support will be available shortly. What is the best possible way for me to get the highest performance out of this setup? I know of the option to optimize for SSDs, but won't that affect all the drives in the array, not to mention that having the SSD in the RAID array will make the usable size much smaller, as RAID 5 goes by the smallest disk? Is there a way to use it as a cache that works even on power down?

My current plan is to have the /tmp directory in RAM on tmpfs, and the /boot directory on a dedicated partition on the SSD along with a 12GB swap partition also on the SSD, with the rest of the space (on the SSD) available as a cache. The three mechanical hard drives will be on a RAID 5 array using BTRFS. Can anyone suggest any improvements to my plan and also how to implement the cache?
On 12/12/2010 17:24, Paddy Steed wrote:
> In a few weeks parts for my new computer will be arriving. The storage will be a 128GB SSD. A few weeks after that I will order three large disks for a RAID array. I understand that BTRFS RAID 5 support will be available shortly. What is the best possible way for me to get the highest performance out of this setup? I know of the option to optimize for SSDs

BTRFS is hardly the best option for SSDs. I typically use ext4 without a journal on SSDs, or ext2 if that is not available. Journalling causes more writes to hit the disk, which wears out flash faster. Plus, SSDs typically have much slower writes than reads, so avoiding writes is a good thing. AFAIK there is no way to disable journaling on BTRFS.

> but won't that affect all the drives in the array, not to mention having the SSD in the raid array will make the usable size much smaller as RAID 5 goes by the smallest disk.

If you are talking about BTRFS' parity RAID implementation, it is hard to comment in any way on it before it has actually been implemented. Especially if you are looking for something stable for production use, you should probably avoid features that immature.

> Is there a way to use it as a cache that works even on power down.

You want to use the SSD as a _write_ cache? That doesn't sound too sensible at all.

What you are looking for is hierarchical/tiered storage. I am not aware of the existence of such a thing for Linux. BTRFS has no feature for it. You might be able to cobble up a solution that uses aufs or mhddfs (both fuse based) with some cron jobs to shift most recently used files to your SSD, but the fuse overheads will probably limit the usefulness of this approach.

> My current plan is to have the /tmp directory in RAM on tmpfs

Ideally, quite a lot should really be on tmpfs, in addition to /tmp and /var/tmp.
Have a look at my patches here:
https://bugzilla.redhat.com/show_bug.cgi?id=223722

My motivation for this was mainly to improve performance on slow flash (when running off a USB stick or an SD card), but it also removes the most write-heavy things off the flash and into RAM. Less flash wear and more speed.

If you are putting a lot onto tmpfs, you may also want to look at the compcache project which provides a compressed swap RAM disk. Much faster than actual swap - to the point where it actually makes swapping feasible.

> the /boot directory on a dedicated partition on the SSD along with a 12GB swap partition also on the SSD with the rest of the space (on the SSD) available as a cache.

Swap on SSD is generally a bad idea. If your machine starts swapping it'll grind to a halt anyway, regardless of whether it's swapping to SSD, and heavy swapping to SSD will just kill the flash prematurely.

> The three mechanical hard drives will be on a RAID 5 array using BTRFS. Can anyone suggest any improvements to my plan and also how to implement the cache?

A very "soft" solution using aufs and cron jobs for moving things with the most recent atime to the SSD is probably as good as it's going to get at the moment, but bear in mind that fuse overheads will probably offset any performance benefit you gain from the SSD. You could get clever and instead of just using atime set up inotify logging and put the most frequently (as opposed to most recently) accessed files onto your SSD. This would, in theory, give you more benefit. You also have to bear in mind that the most frequently accessed files will be cached in RAM anyway, so your pre-caching onto SSD is only really going to be relevant when your working set size is considerably bigger than your RAM - at which point your performance is going to take a significant nosedive anyway (especially if you then hit a fuse file system).

In either case, you should not put the frequently written files onto flash (recent mtime).
Also note that RAID5 is potentially very slow on writes, especially small writes. It is also unsuitable for arrays over about 4TB (usable) in size for disk reliability reasons.

Gordan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
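
The message above does not spell out the "disk reliability reasons" behind the 4TB figure. One common argument, offered here only as a possible reading, is the chance of hitting an unrecoverable read error (URE) while rebuilding a degraded RAID 5 array, since a rebuild has to read every surviving disk in full (an amount of data equal to the usable capacity). The sketch below assumes a URE rate of one per 10^14 bits read, a figure often quoted in consumer disk datasheets; that rate and the function name are assumptions, not something stated in the thread.

```python
import math


def rebuild_ure_probability(bytes_to_read, ure_per_bit=1e-14):
    """Probability of at least one URE while reading `bytes_to_read` bytes.

    Models each bit as failing independently with probability `ure_per_bit`
    (an assumed datasheet-style figure), i.e. 1 - (1 - p)^n, evaluated in a
    numerically stable way.
    """
    bits = bytes_to_read * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# A RAID 5 rebuild reads all surviving disks, i.e. the usable capacity.
for usable_tb in (1, 2, 4, 8):
    p = rebuild_ure_probability(usable_tb * 1e12)
    print(f"{usable_tb} TB usable: {p:.1%} chance of a URE during rebuild")
```

Under these assumptions the risk grows quickly with array size, which is consistent with a rule of thumb in the low single-digit terabytes; real drives may behave better or worse than the datasheet figure.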
Gordan Bobic wrote (ao):
> On 12/12/2010 17:24, Paddy Steed wrote:
> > In a few weeks parts for my new computer will be arriving. The storage will be a 128GB SSD. A few weeks after that I will order three large disks for a RAID array. I understand that BTRFS RAID 5 support will be available shortly. What is the best possible way for me to get the highest performance out of this setup? I know of the option to optimize for SSDs
>
> BTRFS is hardly the best option for SSDs. I typically use ext4 without a journal on SSDs, or ext2 if that is not available. Journalling causes more writes to hit the disk, which wears out flash faster. Plus, SSDs typically have much slower writes than reads, so avoiding writes is a good thing.

Gordan, what you wrote is so wrong I don't even know where to begin.

You'd better google a bit on the subject (ssd, and btrfs on ssd) as much is written about it already.

Sander
--
Humilis IT Services and Solutions
http://www.humilis.net
On 12/13/2010 05:11 AM, Sander wrote:
> Gordan Bobic wrote (ao):
>> On 12/12/2010 17:24, Paddy Steed wrote:
>>> In a few weeks parts for my new computer will be arriving. The storage will be a 128GB SSD. A few weeks after that I will order three large disks for a RAID array. I understand that BTRFS RAID 5 support will be available shortly. What is the best possible way for me to get the highest performance out of this setup? I know of the option to optimize for SSDs
>>
>> BTRFS is hardly the best option for SSDs. I typically use ext4 without a journal on SSDs, or ext2 if that is not available. Journalling causes more writes to hit the disk, which wears out flash faster. Plus, SSDs typically have much slower writes than reads, so avoiding writes is a good thing.
>
> Gordan, what you wrote is so wrong I don't even know where to begin.
>
> You'd better google a bit on the subject (ssd, and btrfs on ssd) as much is written about it already.

I suggest you back your opinion up with some hard data before making such statements. Here's a quick test - make an ext2 fs and a btrfs on two similar disk partitions (any disk; for the sake of the experiment it doesn't have to be an ssd), then check vmstat -d to get a baseline. Then put the kernel sources on each of them, do a full build, then make clean and check vmstat -d again. See how many writes (sectors) hit the disk with ext2 and how many with btrfs. You'll find that there were many more writes with BTRFS. You can't go faster when doing more. Journaling is expensive.

Gordan
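
The before/after comparison described above can be scripted rather than read off by eye. What follows is a minimal sketch, assuming a Linux system where /proc/diskstats exposes the same per-device counters that vmstat -d reports; the function names are made up for illustration, and the counter includes writes from anything else touching the device while the workload runs.

```python
import os
import subprocess


def sectors_written(diskstats_text, device):
    """Return the cumulative 'sectors written' counter for `device`.

    Each /proc/diskstats line is: major, minor, device name, then counters.
    The seventh counter after the name (field index 9) is sectors written.
    """
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise KeyError(f"device {device!r} not found in diskstats")


def read_counter(device):
    """Read the current sectors-written counter from the running kernel."""
    with open("/proc/diskstats") as f:
        return sectors_written(f.read(), device)


def measure_writes(device, command):
    """Run `command` and return how many sectors were written meanwhile."""
    os.sync()  # start from a clean page cache
    before = read_counter(device)
    subprocess.run(command, check=True)
    os.sync()  # flush dirty pages so delayed writes are counted too
    return read_counter(device) - before
```

As a usage example with hypothetical device names and mount points, `measure_writes("sdb1", ["make", "-C", "/mnt/ext2/linux"])` and `measure_writes("sdc1", ["make", "-C", "/mnt/btrfs/linux"])` would give the two figures to compare.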
On Mon, Dec 13, 2010 at 4:25 AM, Gordan Bobic wrote:
> I suggest you back your opinion up with some hard data before making such statements. Here's a quick test - make an ext2 fs and a btrfs on two similar disk partitions (any disk; for the sake of the experiment it doesn't have to be an ssd)

Okay, here's some hard data.

Acer Aspire One ZG5 with an SSDPAMM0008G1 (cheap/slow) SSD, Fedora 13.

Doing a standard yum update, measuring the yum cleanup phase while browsing with Firefox:

Default extN: machine becomes completely unusable for minutes.
btrfs with ssd_spread: machine functions normally, cleanup finishes in (often much) under 15 seconds.

Regardless of what vmstat says, btrfs is clearly faster on this hardware.

Peter Harris
On 13/12/2010 14:33, Peter Harris wrote:
> On Mon, Dec 13, 2010 at 4:25 AM, Gordan Bobic wrote:
>> I suggest you back your opinion up with some hard data before making such statements. Here's a quick test - make an ext2 fs and a btrfs on two similar disk partitions (any disk; for the sake of the experiment it doesn't have to be an ssd)
>
> Okay, here's some hard data.
>
> Acer Aspire One ZG5 with an SSDPAMM0008G1 (cheap/slow) SSD, Fedora 13.
>
> Doing a standard yum update, measuring the yum cleanup phase while browsing with Firefox:
>
> Default extN: machine becomes completely unusable for minutes.
> btrfs with ssd_spread: machine functions normally, cleanup finishes in (often much) under 15 seconds.
>
> Regardless of what vmstat says, btrfs is clearly faster on this hardware.

extN is too broad. ext2, ext3, or ext4? If ext4, with journal or without? I am talking specifically about extN _without_ a journal. I use ext2 and ext4-without-a-journal on all my cheap flash (mostly SD/CF cards and USB sticks) with a deadline scheduler and I have not observed any massive slowdown like you describe.

Either way, there is also the longevity of the flash to be considered, and vmstat's write reading is very indicative of that.

Gordan
On Mon, Dec 13, 2010 at 3:25 AM, Gordan Bobic <gordan@bobich.net> wrote:
> On 12/13/2010 05:11 AM, Sander wrote:
>> Gordan Bobic wrote (ao):
>>> On 12/12/2010 17:24, Paddy Steed wrote:
>>>> In a few weeks parts for my new computer will be arriving. The storage will be a 128GB SSD. A few weeks after that I will order three large disks for a RAID array. I understand that BTRFS RAID 5 support will be available shortly. What is the best possible way for me to get the highest performance out of this setup? I know of the option to optimize for SSDs
>>>
>>> BTRFS is hardly the best option for SSDs. I typically use ext4 without a journal on SSDs, or ext2 if that is not available. Journalling causes more writes to hit the disk, which wears out flash faster. Plus, SSDs typically have much slower writes than reads, so avoiding writes is a good thing.
>>
>> Gordan, what you wrote is so wrong I don't even know where to begin.
>>
>> You'd better google a bit on the subject (ssd, and btrfs on ssd) as much is written about it already.
>
> I suggest you back your opinion up with some hard data before making such statements. Here's a quick test - make an ext2 fs and a btrfs on two similar disk partitions (any disk; for the sake of the experiment it doesn't have to be an ssd), then check vmstat -d to get a baseline. Then put the kernel sources on each of them, do a full build, then make clean and check vmstat -d again. See how many writes (sectors) hit the disk with ext2 and how many with btrfs. You'll find that there were many more writes with BTRFS. You can't go faster when doing more. Journaling is expensive.

Of course. But that applies to rotating media as well (where the seeks involved hurt much more), and has little if anything to do with why you would use btrfs instead of ext2.
Good ssd drives (by which I mean anything but consumer flash as it exists on sd cards and usb sticks) have very good wear leveling, good enough that you could overwrite the same logical sector billions of times before you'd experience any failure due to wear. The issues with cheaper ssd drives (which I distinguish from things like sd cards) are uniformly performance degradation due to crappy garbage collection and lack of trim support to compensate. A journal is _not_ a problem here.

On crappy flash, yes, you want to avoid a journal, mainly because the wear leveling for a given sector only occurs over a fixed small number of erase blocks, resulting in a filesystem that you can burn out quite easily - I have a small pile of sd cards on my desk that I sent to such a fate. Even here there is reason to use btrfs. The journaling performed is much less strenuous than ext3/4: it's basically just a version stamp, as opposed to actually journaling the metadata involved. The actual metadata writes, being copy-on-write, provide pretty much the best case for crappy flash, as cow inherently wear-levels over the entire device (ssd_spread). To say nothing of checksums and duplicated metadata, allowing you to actually determine if you're running into corrupted metadata, and often recover from it transparently. Ext2's behavior in this respect is less than ideal.
On 13/12/2010 15:17, cwillu wrote:
>>>>> In a few weeks parts for my new computer will be arriving. The storage will be a 128GB SSD. A few weeks after that I will order three large disks for a RAID array. I understand that BTRFS RAID 5 support will be available shortly. What is the best possible way for me to get the highest performance out of this setup? I know of the option to optimize for SSDs
>>>>
>>>> BTRFS is hardly the best option for SSDs. I typically use ext4 without a journal on SSDs, or ext2 if that is not available. Journalling causes more writes to hit the disk, which wears out flash faster. Plus, SSDs typically have much slower writes than reads, so avoiding writes is a good thing.
>>>
>>> Gordan, what you wrote is so wrong I don't even know where to begin.
>>>
>>> You'd better google a bit on the subject (ssd, and btrfs on ssd) as much is written about it already.
>>
>> I suggest you back your opinion up with some hard data before making such statements. Here's a quick test - make an ext2 fs and a btrfs on two similar disk partitions (any disk; for the sake of the experiment it doesn't have to be an ssd), then check vmstat -d to get a baseline. Then put the kernel sources on each of them, do a full build, then make clean and check vmstat -d again. See how many writes (sectors) hit the disk with ext2 and how many with btrfs. You'll find that there were many more writes with BTRFS. You can't go faster when doing more. Journaling is expensive.
>
> Of course. But that applies to rotating media as well (where the seeks involved hurt much more), and has little if anything to do with why you would use btrfs instead of ext2.

Indeed - btrfs is about features, most specifically the checksumming that allows smart recovery from disk media failure.
But on flash, write volumes are something that shouldn't be ignored.

> Good ssd drives (by which I mean anything but consumer flash as it exists on sd cards and usb sticks) have very good wear leveling, good enough that you could overwrite the same logical sector billions of times before you'd experience any failure due to wear.

It comes down to volumes even in the best case scenario. A _very_ good SSD (e.g. Intel) might get write amplification down to about 1.2:1, but more typical figures are in the region of 10-20:1. Every write that can be avoided, should be avoided.

> The issues with cheaper ssd drives (which I distinguish from things like sd cards) are uniformly performance degradation due to crappy garbage collection and lack of trim support to compensate. A journal is _not_ a problem here.

The journal doesn't help. It can cause more than a 50% overhead on metadata-heavy operations.

> On crappy flash, yes, you want to avoid a journal, mainly because the wear leveling for a given sector only occurs over a fixed small number of erase blocks, resulting in a filesystem that you can burn out quite easily - I have a small pile of sd cards on my desk that I sent to such a fate. Even here there is reason to use btrfs. The journaling performed is much less strenuous than ext3/4: it's basically just a version stamp, as opposed to actually journaling the metadata involved. The actual metadata writes, being copy-on-write, provide pretty much the best case for crappy flash, as cow inherently wear-levels over the entire device (ssd_spread). To say nothing of checksums and duplicated metadata, allowing you to actually determine if you're running into corrupted metadata, and often recover from it transparently. Ext2's behavior in this respect is less than ideal.

I'm not disputing that, but the OP was talking about using the SSD as a cache for a slower disk subsystem.
That is likely to waste the SSD pretty quickly purely by volume of writes, regardless of how good the wear leveling is. That may be fine on a setup where the SSD is treated as disposable throw-away cache item that doesn't lose you data when it goes wrong, but what was being discussed isn't an expensive enterprise grade setup that behaves that way.

Gordan
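
To put rough numbers on the write-amplification argument in this message, here is a back-of-the-envelope sketch. Only the 128GB capacity and the 1.2:1 and 10-20:1 amplification figures come from the thread; the program/erase cycle rating and the daily write volume are illustrative assumptions, and the model assumes perfectly even wear leveling.

```python
def flash_lifetime_days(capacity_gb, pe_cycles, host_gb_per_day, write_amplification):
    """Estimate days until the rated erase cycles are used up.

    Total flash writes available = capacity * rated P/E cycles; each host
    write costs `write_amplification` times its size in flash writes.
    Assumes ideal wear leveling across the whole device.
    """
    flash_gb_per_day = host_gb_per_day * write_amplification
    return capacity_gb * pe_cycles / flash_gb_per_day


# Assumed values (not from the thread): 10,000 P/E cycles, 20 GB written per day.
for wa in (1.2, 10, 20):
    days = flash_lifetime_days(128, 10_000, 20, wa)
    print(f"write amplification {wa}:1 -> about {days / 365:.0f} years")
```

With these assumed inputs the estimate spans more than an order of magnitude between the best and worst amplification figures, which is why both sides of the argument can point to plausible numbers; a cache workload with much higher daily writes would shorten every figure proportionally.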
Thank you for all your replies.

On Mon, 2010-12-13 at 00:04 +0000, Gordan Bobic wrote:
> On 12/12/2010 17:24, Paddy Steed wrote:
> > In a few weeks parts for my new computer will be arriving. The storage will be a 128GB SSD. A few weeks after that I will order three large disks for a RAID array. I understand that BTRFS RAID 5 support will be available shortly. What is the best possible way for me to get the highest performance out of this setup? I know of the option to optimize for SSDs
>
> BTRFS is hardly the best option for SSDs. I typically use ext4 without a journal on SSDs, or ext2 if that is not available. Journalling causes more writes to hit the disk, which wears out flash faster. Plus, SSDs typically have much slower writes than reads, so avoiding writes is a good thing. AFAIK there is no way to disable journaling on BTRFS.

My write speed is similar to the read speed (OCZ Vertex 128GB) and it also comes with a warranty that runs out after the drive will be obsolete. Using up flash cycles is not an issue for me.

> > but won't that affect all the drives in the array, not to mention having the SSD in the raid array will make the usable size much smaller as RAID 5 goes by the smallest disk.
>
> If you are talking about BTRFS' parity RAID implementation, it is hard to comment in any way on it before it has actually been implemented. Especially if you are looking for something stable for production use, you should probably avoid features that immature.

I would take images every day until I felt it was stable. I spoke to `cmasion' who has now finished fsck and is working on RAID 5 fully.

> > Is there a way to use it as a cache that works even on power down.
>
> You want to use the SSD as a _write_ cache? That doesn't sound too sensible at all.

As previously stated, wear is not an issue.

> What you are looking for is hierarchical/tiered storage. I am not aware of the existence of such a thing for Linux.
> BTRFS has no feature for it. You might be able to cobble up a solution that uses aufs or mhddfs (both fuse based) with some cron jobs to shift most recently used files to your SSD, but the fuse overheads will probably limit the usefulness of this approach.
>
> > My current plan is to have the /tmp directory in RAM on tmpfs
>
> Ideally, quite a lot should really be on tmpfs, in addition to /tmp and /var/tmp.
> Have a look at my patches here:
> https://bugzilla.redhat.com/show_bug.cgi?id=223722
>
> My motivation for this was mainly to improve performance on slow flash (when running off a USB stick or an SD card), but it also removes the most write-heavy things off the flash and into RAM. Less flash wear and more speed.
>
> If you are putting a lot onto tmpfs, you may also want to look at the compcache project which provides a compressed swap RAM disk. Much faster than actual swap - to the point where it actually makes swapping feasible.
>
> > the /boot directory on a dedicated partition on the SSD along with a 12GB swap partition also on the SSD with the rest of the space (on the SSD) available as a cache.
>
> Swap on SSD is generally a bad idea. If your machine starts swapping it'll grind to a halt anyway, regardless of whether it's swapping to SSD, and heavy swapping to SSD will just kill the flash prematurely.
>
> > The three mechanical hard drives will be on a RAID 5 array using BTRFS. Can anyone suggest any improvements to my plan and also how to implement the cache?
>
> A very "soft" solution using aufs and cron jobs for moving things with the most recent atime to the SSD is probably as good as it's going to get at the moment, but bear in mind that fuse overheads will probably offset any performance benefit you gain from the SSD. You could get clever and instead of just using atime set up inotify logging and put the most frequently (as opposed to most recently) accessed files onto your SSD.
> This would, in theory, give you more benefit. You also have to bear in mind that the most frequently accessed files will be cached in RAM anyway, so your pre-caching onto SSD is only really going to be relevant when your working set size is considerably bigger than your RAM - at which point your performance is going to take a significant nosedive anyway (especially if you then hit a fuse file system).
>
> In either case, you should not put the frequently written files onto flash (recent mtime).
>
> Also note that RAID5 is potentially very slow on writes, especially small writes. It is also unsuitable for arrays over about 4TB (usable) in size for disk reliability reasons.
>
> Gordan

So, no-one has any ideas on how to implement the cache? Would making it all swap work? Does the OS cache files in swap?
On 13/12/2010 17:17, Paddy Steed wrote:
> So, no-one has any ideas on how to implement the cache? Would making it all swap work? Does the OS cache files in swap?

No, it doesn't. I don't believe there are any plans to implement hierarchical storage in BTRFS, but perhaps one of the developers can confirm or deny that.

As for how to do it - I don't have any ideas other than what I mentioned earlier (aufs to overlay file systems and cron jobs to rotate things in and out of cache).

Gordan
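
The selection step of the cron-job approach described above might look something like the following. This is only a sketch under stated assumptions: the function name, the one-week write-age threshold, and the byte budget are all hypothetical, it covers only choosing files (not copying them or setting up the overlay), and it relies on atime actually being updated, which is not the case on filesystems mounted with noatime.

```python
import time
from pathlib import Path


def pick_hot_files(root, budget_bytes, min_write_age_s=7 * 24 * 3600, now=None):
    """Choose files under `root` worth copying to the SSD tier.

    Files are ranked by most recent read (largest atime first). Files
    modified within `min_write_age_s` seconds are skipped, following the
    advice to keep frequently written files off flash. Any file that does
    not fit in the remaining budget is skipped.
    """
    now = time.time() if now is None else now
    candidates = []
    for path in Path(root).rglob("*"):
        if path.is_symlink() or not path.is_file():
            continue
        st = path.stat()
        if now - st.st_mtime < min_write_age_s:
            continue  # recently written: leave it on the hard disks
        candidates.append((st.st_atime, st.st_size, path))

    candidates.sort(key=lambda c: c[0], reverse=True)
    chosen, used = [], 0
    for _atime, size, path in candidates:
        if used + size <= budget_bytes:
            chosen.append(path)
            used += size
    return chosen
```

A cron job could call this with the RAID mount point and the free space on the SSD, then copy the returned paths across; ranking by access frequency instead, as suggested earlier in the thread, would need a separate access log.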
On Mon, Dec 13, 2010 at 05:17:51PM +0000, Paddy Steed wrote:
> So, no-one has any ideas on how to implement the cache? Would making it all swap work? Does the OS cache files in swap?

Quite the opposite. Too many people have ideas for SSD-as-cache in Linux, in no particular order:
- bcache
- cleancache
- btrfs temperature tracking
- dm-hstore
- dm-cache / flashcache

Patches are in various states of implementation, some with explicit btrfs support. There's no clear winner at this time, but some of the above solutions are shipped in distro kernels.

--
Tomasz Torcz                "God, root, what's the difference?"
xmpp: zdzichubg@chrome.pl   "God is more forgiving."
On 12/13/2010 01:20 PM, Tomasz Torcz wrote:
> On Mon, Dec 13, 2010 at 05:17:51PM +0000, Paddy Steed wrote:
>> So, no-one has any ideas on how to implement the cache? Would making it all swap work? Does the OS cache files in swap?
> Quite the opposite. Too many people have ideas for SSD-as-cache in Linux, in no particular order:
> - bcache
> - cleancache
> - btrfs temperature tracking
> - dm-hstore
> - dm-cache / flashcache
>
> Patches are in various states of implementation, some with explicit btrfs support. There's no clear winner at this time, but some of the above solutions are shipped in distro kernels.

People are working on quite a few ways that btrfs can leverage SSD devices. One technique would be to use the SSD as a block level cache, another would be to steer all metadata to the SSD (leaving your bulk data on normal drives).

ric