I've just started to work with btrfs, so I started with a benchmark. On four
identical servers (2 dual-core CPUs, single local disk), I built filesystems:
ext3, ext4, nilfs2, and btrfs. I checked out a sizable code tree and timed a
build. The build is parallelized to use 4 threads when possible.

I'm seeing similar build times on ext[34] and nilfs2, but I'm seeing almost
double the times for btrfs using default options. And I'm having trouble
reconciling this performance cost with the benchmarks I'm seeing around the
net.

Is this a common result? Is there a trick to getting ext4-competitive
performance out of btrfs? Is my application a poor choice for btrfs? Am I
missing something obvious here?

--rich
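For reference, a benchmark of this shape can be reproduced with something
along these lines; the device name, mount point, and source tree below are
only placeholders, since the original post does not name them:

  mkfs.btrfs /dev/sdb                           # default options, as in the test
  mount /dev/sdb /mnt/test
  cd /mnt/test
  git clone /path/to/sizable/tree src           # placeholder for the code tree
  cd src
  sync && echo 3 > /proc/sys/vm/drop_caches     # start each run from cold caches
  time make -j4                                 # 4-way parallel build, timed

Repeating the same sequence on ext3, ext4, and nilfs2 partitions on identical
hardware gives the comparison being discussed.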
On Mon, May 24, 2010 at 2:08 PM, K. Richard Pixley <rich@noir.com> wrote:
> I've just started to work with btrfs so I started with a benchmark. On four
> identical servers, (2 dual core cpus, single local disk), I built
> filesystems - ext3, ext4, nilfs2, and btrfs. I checked out a sizable code
> tree and timed a build. The build is parallelized to use 4 threads when
> possible.
>
> I'm seeing similar build times on ext[34] and nilfs2 but I'm seeing almost
> double the times for btrfs using default options. And I'm having trouble
> reconciling this performance cost with the benchmarks I'm seeing around the
> net.
>
> Is this a common result? Is there a trick to getting ext4 competitive
> performance out of btrfs? Is my application a poor choice for btrfs? Am I
> missing something obvious here?

Please make sure you're testing with the latest btrfs from git or Linus'
latest kernel.
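As a quick sketch of what "latest" means in practice (the URL below is the
mainline tree of the time; the btrfs development tree of the day may carry
newer code still):

  uname -r      # confirm which kernel the test box is actually running
  git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

btrfs was changing quickly at this point, so results from an older
distribution kernel may not reflect current behaviour.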
Just as a followup, my problem appears to be hardware related. It's not clear
yet whether it's a strange failure mode or a configuration snafu, disk or
controller, but elsewhere I'm seeing a btrfs single-disk performance penalty
more like 2% over ext[34], which seems completely reasonable.

Sorry for the panic.

--rich
Once again I'm stumped by some performance numbers and hoping for some
insight.

Using an 8-core server, building in parallel, I'm building some code. Using
ext2 over a 5-way (5 disk) lvm partition, I can build that code in 35
minutes. Tests with dd on the raw disk and lvm partitions show me that I'm
getting near-linear improvement from the raw stripe, even with dd runs
exceeding 10G, so I think that convinces me that my disks and controller
subsystem are capable of operating in parallel and in concert. hdparm -t
numbers seem to support what I'm seeing from dd.

Running the same build, same parallelism, over a btrfs (defaults) partition
on a single drive, I'm seeing very consistent build times around an hour,
which is reasonable. I get a little under an hour on ext4 single disk,
again, very consistently.

However, if I build a btrfs file system across the 5 disks, my build times
climb to around 1.5 - 2 hrs, although there's about a 30 min variation
between different runs.

If I build a btrfs file system across the 5-way lvm stripe, I get even worse
performance at around 2.5 hrs per build, with about a 45 min variation
between runs.

I can't explain these last two results. Any theories?

--rich
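For reference, raw-throughput checks of the kind described above usually look
something like this; the disk and volume-group names are placeholders:

  hdparm -t /dev/sda                                                  # buffered read timing, one raw disk
  dd if=/dev/sda of=/dev/null bs=1M count=10240 iflag=direct          # ~10G sequential read, raw disk
  dd if=/dev/vg0/stripe of=/dev/null bs=1M count=10240 iflag=direct   # same read over the lvm stripe

Near-linear scaling from one disk to the 5-way stripe on these sequential
reads is what the post describes; a parallel compile generates a much more
random I/O pattern, which is where the later discussion picks up.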
K. Richard Pixley wrote:
> Once again I'm stumped by some performance numbers and hoping for some
> insight.
>
> Using an 8-core server, building in parallel, I'm building some code.
> Using ext2 over a 5-way, (5 disk), lvm partition, I can build that code
> in 35 minutes. Tests with dd on the raw disk and lvm partitions show me
> that I'm getting near linear improvement from the raw stripe, even with
> dd runs exceeding 10G, so I think that convinces me that my disks and
> controller subsystem are capable of operating in parallel and in
> concert. hdparm -t numbers seem to support what I'm seeing from dd.
>
> Running the same build, same parallelism, over a btrfs (defaults)
> partition on a single drive, I'm seeing very consistent build times
> around an hour, which is reasonable. I get a little under an hour on
> ext4 single disk, again, very consistently.
>
> However, if I build a btrfs file system across the 5 disks, my build
> times climb to around 1.5 - 2hrs, although there's about a 30min
> variation between different runs.
>
> If I build a btrfs file system across the 5-way lvm stripe, I get even
> worse performance at around 2.5hrs per build, with about a 45min
> variation between runs.
>
> I can't explain these last two results. Any theories?

If you just want theory, I can try. :-)

Theory of striping follows (numbers invented). If you have a stripe size of
8 sectors, 40 successive sectors are divided into 5 groups of 8 sectors,
with each group on a different disk.

Suppose you want to read 40 sectors; with one disk and no striping you need:

  time_to_place_disk_head_and_rotational_latency (10ms) + time to read 40 sectors (around 0ms)

Suppose you want to read 40 sectors with your 5-disk striped volume; you need:

  time_to_place_disk_head_and_rotational_latency (10ms) + time to read 8 sectors (around 0ms)
  time_to_place_disk_head_and_rotational_latency (10ms) + time to read 8 sectors (around 0ms)
  time_to_place_disk_head_and_rotational_latency (10ms) + time to read 8 sectors (around 0ms)
  time_to_place_disk_head_and_rotational_latency (10ms) + time to read 8 sectors (around 0ms)
  time_to_place_disk_head_and_rotational_latency (10ms) + time to read 8 sectors (around 0ms)

so you are 5 times slower.

Now, it could be that you submit the 5 requests together; in that case you
do not pay a 5-times penalty, but a 2-times penalty. Why? Because of
rotational latency: if a disk takes 10ms to do one rotation, your data will
be ready after a random time uniformly distributed between 0 and 10ms
(average 5ms). If you submit 5 commands to 5 disks, each of them will have
an (independent!) flat distribution between 0 and 10ms; since you need all
5 pieces, you have to wait for the unluckiest of the disks, so your average
will be near 10ms.

So in general striping costs you a 2x to 5x speed penalty.

If your build is really parallel, when one process is waiting for data,
another one will make requests. But remember that all the disks are busy
because of the first process, so it is not unreasonable for the additional
processes to gain no speed at all. In reality, the first 5 requests and the
second 5 could be evaluated at the same time to give precedence to whichever
of the two is easier for the drive (maybe the second one is lucky from a
rotational point of view, so it is better to do it before the first).
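A one-line check of the "near 10ms" claim, under the same idealised
assumptions (n independent latencies, each uniform on [0, T]): the expected
maximum is E[max] = n/(n+1) * T, so for n = 5 and T = 10ms that is about
8.3ms, versus 5ms expected for a single request on a single disk - roughly
the 2x penalty described above.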
In this case the disks are better utilized, but the net effect on the overall
build is not so easy to establish, because when you give precedence to the
second request you are delaying the first, so the entire first 40-sector read
could have worse timing than 0-10ms_almost_surely_10, and can get
0-20ms_maybe_15.

There is a lot of maths you can study (queueing theory and scheduling
algorithms) and a lot of factors can be important (disk queue size, NCQ,
caching) at various levels (O.S., controller, disk).

In my opinion, the basic rule in these cases should be:

  === use a stripe size bigger than the sizes of your random reads ===

In case of seeky load I would personally use a stripe size of 64MiB, for
example. One read should only involve one disk. Stripe size is often
configured with very small values (such as 4KiB), because it produces very
big numbers when you read sequentially (as you are really using all the
disks together). But latency sucks.

In your case, the build is probably very seeky, and the "seekiness" could be
exacerbated by having many writes (and things become even worse when the
filesystems involve a journal...).

(sorry for the long mail. you asked for a theory :-) )

-- 
Roberto Ragusa
mail at robertoragusa.it
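As a hedged illustration of that suggestion (volume-group, volume, and size
below are placeholders, and the maximum stripe size the tools accept depends
on the LVM version and metadata format in use):

  # 5-way LVM stripe with a 1MiB stripe size rather than the usual small default;
  # larger values, where the tools allow them, are requested the same way via -I (in KiB)
  lvcreate -n buildvol -L 200G -i 5 -I 1024 vg0

The trade-off is exactly the one described: a large stripe keeps a single
random read on one disk (better latency), while a small stripe spreads even
small reads across all spindles (better sequential throughput numbers).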
<snip a lot of fancy math that missed the point>

That's all well and good, but you missed the part where he said ext2 on a
5-way LVM stripeset is many times faster than btrfs on a 5-way btrfs
stripeset.

IOW, same 5-way stripeset, different filesystems and volume managers, and
very different performance.

And he's wondering why the btrfs method used for striping is so much slower
than the lvm method used for striping.

-- 
Freddie Cash
fjwcash@gmail.com
Freddie Cash wrote:
> <snip a lot of fancy math that missed the point>
>
> That's all well and good, but you missed the part where he said ext2
> on a 5-way LVM stripeset is many times faster than btrfs on a 5-way
> btrfs stripeset.
>
> IOW, same 5-way stripeset, different filesystems and volume managers,
> and very different performance.
>
> And he's wondering why the btrfs method used for striping is so much
> slower than the lvm method used for striping.

Sorry, I missed the first line where ext2 on 5-disk lvm is said to be fast.
I was commenting as if the last two results were ext2 on 5 disks and btrfs
on 5 disks.

I'd say the great variation between successive runs is important and seems
to point to some bigger problem.

-- 
Roberto Ragusa
mail at robertoragusa.it
On Wed, Jun 16, 2010 at 7:08 PM, K. Richard Pixley <rich@noir.com> wrote:
> Once again I'm stumped by some performance numbers and hoping for some
> insight.
>
> Using an 8-core server, building in parallel, I'm building some code. Using
> ext2 over a 5-way, (5 disk), lvm partition, I can build that code in 35
> minutes. Tests with dd on the raw disk and lvm partitions show me that I'm
> getting near linear improvement from the raw stripe, even with dd runs
> exceeding 10G, so I think that convinces me that my disks and controller
> subsystem are capable of operating in parallel and in concert. hdparm -t
> numbers seem to support what I'm seeing from dd.
>
> Running the same build, same parallelism, over a btrfs (defaults) partition
> on a single drive, I'm seeing very consistent build times around an hour,
> which is reasonable. I get a little under an hour on ext4 single disk,
> again, very consistently.
>
> However, if I build a btrfs file system across the 5 disks, my build times
> climb to around 1.5 - 2hrs, although there's about a 30min variation
> between different runs.
>
> If I build a btrfs file system across the 5-way lvm stripe, I get even worse
> performance at around 2.5hrs per build, with about a 45min variation between
> runs.
>
> I can't explain these last two results. Any theories?

Try mounting the BTRFS filesystem with 'nobarrier', since this may be an
obvious difference. Also, for metadata-write-intensive workloads, when
creating the filesystem try 'mkfs.btrfs -m single'.

Of course, all this doesn't explain the variance. I'd say it's worth
employing 'blktrace' to see what's happening at a lower level, and even e.g.
varying between the deadline and CFQ I/O schedulers.

Daniel
-- 
Daniel J Blueman
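A minimal sketch of those suggestions, assuming a throwaway test disk
/dev/sdb (the device name and mount point are placeholders):

  mkfs.btrfs -m single /dev/sdb                     # store metadata singly instead of duplicated
  mount -o nobarrier /dev/sdb /mnt/test             # disable write barriers for the test
  cat /sys/block/sdb/queue/scheduler                # see which I/O scheduler is active
  echo deadline > /sys/block/sdb/queue/scheduler    # try deadline instead of cfq

nobarrier trades crash safety for speed, so it only makes sense as a
diagnostic here, not as a production setting.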
On 16/06/2010 21:35, Freddie Cash wrote:
> <snip a lot of fancy math that missed the point>
>
> That's all well and good, but you missed the part where he said ext2
> on a 5-way LVM stripeset is many times faster than btrfs on a 5-way
> btrfs stripeset.
>
> IOW, same 5-way stripeset, different filesystems and volume managers,
> and very different performance.
>
> And he's wondering why the btrfs method used for striping is so much
> slower than the lvm method used for striping.

This could easily be explained by Roberto's theory and maths - if the lvm
stripe set used large stripe sizes so that the random reads were mostly read
from a single disk, it would be fast. If the btrfs stripes were small, then
it would be slow due to all the extra seeks.

Do we know anything about the stripe sizes used?
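For what it's worth, the LVM side of that question can be answered with
something like the following (volume names are placeholders; reporting field
names vary a little between LVM versions):

  lvs --segments -o lv_name,stripes,stripesize vg0
  lvdisplay -m /dev/vg0/buildvol        # the -m/--maps output lists stripes and stripe size

The btrfs side has no user-set stripe size in the same sense: the allocator
spreads data across the devices itself, so the comparison is not simply one
tunable against another.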
On Wed, Jun 16, 2010 at 11:08:48AM -0700, K. Richard Pixley wrote:
> Once again I'm stumped by some performance numbers and hoping for
> some insight.
>
> Using an 8-core server, building in parallel, I'm building some
> code. Using ext2 over a 5-way, (5 disk), lvm partition, I can build
> that code in 35 minutes. Tests with dd on the raw disk and lvm
> partitions show me that I'm getting near linear improvement from the
> raw stripe, even with dd runs exceeding 10G, so I think that
> convinces me that my disks and controller subsystem are capable of
> operating in parallel and in concert. hdparm -t numbers seem to
> support what I'm seeing from dd.
>
> Running the same build, same parallelism, over a btrfs (defaults)
> partition on a single drive, I'm seeing very consistent build times
> around an hour, which is reasonable. I get a little under an hour
> on ext4 single disk, again, very consistently.
>
> However, if I build a btrfs file system across the 5 disks, my build
> times climb to around 1.5 - 2hrs, although there's about a 30min
> variation between different runs.
>
> If I build a btrfs file system across the 5-way lvm stripe, I get
> even worse performance at around 2.5hrs per build, with about a
> 45min variation between runs.
>
> I can't explain these last two results. Any theories?

I suspect they come down to different raid levels done by btrfs, and maybe
barriers. By default btrfs will duplicate metadata, so ext2 is doing much
less metadata IO than btrfs does.

Try:

  mkfs.btrfs -m raid0 -d raid0 /dev/xxx /dev/xxy ...

Then try:

  mount -o nobarrier /dev/xxx /mnt

Someone else mentioned blktrace, it would help explain things if you're
interested in tracking this down.

-chris
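Since blktrace comes up twice in the thread, a minimal example of running it
during one of the slow builds (the device name is a placeholder; the trace
can be watched live or captured to files for later analysis):

  blktrace -d /dev/sdc -o - | blkparse -i -     # live trace piped straight into blkparse
  # or capture for ten minutes and parse afterwards:
  blktrace -d /dev/sdc -w 600 -o buildtrace
  blkparse -i buildtrace

Comparing traces from the single-disk and 5-disk btrfs runs should show
whether the extra time is going into barrier flushes, duplicated metadata
writes, or plain seek traffic.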