Hello everyone,

Btrfs v0.16 is available for download, please see
http://btrfs.wiki.kernel.org/ for download links and project
information.

v0.16 has a shiny new disk format, and is not compatible with
filesystems created by older Btrfs releases.  But, it should be the
fastest Btrfs yet, with a wide variety of scalability fixes and new
features.

There were quite a few contributors this time around, but big thanks to
Josef Bacik and Yan Zheng for their help on this release.  Toei Rei
also helped track down an important corruption problem.

Scalability and performance:

* Fine grained btree locking.  The large fs_mutex is finally gone.
  There is still some work to do on the locking during extent
  allocation, but the code is much more scalable than it was.

* Helper threads for checksumming and other background tasks.  Most CPU
  intensive operations have been pushed off to helper threads to take
  advantage of SMP machines.  Streaming read and write throughput now
  scale to disk speed even with checksumming on.

* Improved data=ordered mode.  Metadata is now updated only after all
  the blocks in a data extent are on disk.  This allows btrfs to
  provide data=ordered semantics without waiting for all the dirty data
  in the FS to flush at commit time.  fsync and O_SYNC writes do not
  force down all the dirty data in the FS.

* Faster cleanup of old transactions (Yan Zheng).  A new cache now
  dramatically reduces the amount of IO required to clean up and delete
  old snapshots.

Major features (all from Josef Bacik):

* ACL support.  ACLs are enabled by default, no special mount options
  required.

* Orphan inode prevention, no more lost files after a crash.

* New directory index format, fixing some suboptimal corner cases in
  the original.

There are still more disk format changes planned, but we're making
every effort to get them out of the way as quickly as we can.  You can
see the major features we have planned on the development timeline:

http://btrfs.wiki.kernel.org/index.php/Development_timeline

A few interesting statistics:

Between v0.14 and v0.15: 42 files changed, 6995 insertions(+), 3011
deletions(-)

The btrfs kernel module now weighs in at 30,000 LOC, which means we're
getting very close to the size of ext[34].

-chris
On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:

> * Fine grained btree locking.  The large fs_mutex is finally gone.
> There is still some work to do on the locking during extent
> allocation, but the code is much more scalable than it was.

Cool - will try to find a cycle to stare at the code ;-)

> * Helper threads for checksumming and other background tasks.  Most
> CPU intensive operations have been pushed off to helper threads to
> take advantage of SMP machines.  Streaming read and write throughput
> now scale to disk speed even with checksumming on.

Can this lead to the same Priority Inversion issues as seen with
kjournald?
On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:

> There are still more disk format changes planned, but we're making
> every effort to get them out of the way as quickly as we can.  You
> can see the major features we have planned on the development
> timeline:
>
> http://btrfs.wiki.kernel.org/index.php/Development_timeline

Just took a peek, seems to be slightly out of date as it still lists
the single mutex thingy.

Also, how true is the IO-error and disk-full claim?
On Thu, 2008-08-07 at 11:08 +0200, Peter Zijlstra wrote:
> On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:
>
> > * Fine grained btree locking.  The large fs_mutex is finally gone.
> > There is still some work to do on the locking during extent
> > allocation, but the code is much more scalable than it was.
>
> Cool - will try to find a cycle to stare at the code ;-)
>

I was able to get it mostly lockdep compliant by using
mutex_lock_nested based on the level of the btree I was locking.  My
allocation mutex is a bit of a problem for lockdep though.

> > * Helper threads for checksumming and other background tasks.  Most
> > CPU intensive operations have been pushed off to helper threads to
> > take advantage of SMP machines.  Streaming read and write
> > throughput now scale to disk speed even with checksumming on.
>
> Can this lead to the same Priority Inversion issues as seen with
> kjournald?
>

Yes, although in general only the helper threads end up actually doing
the IO for writes.  Unfortunately, they are almost but not quite an
elevator.  It is tempting to try sorting the bios on the helper queues
etc.  But I haven't done that because it gets into starvation and other
fun.

I haven't done any real single cpu testing; it may make sense in those
workloads to checksum and submit directly in the calling context.  But
real single cpu boxes are harder to come by these days.

-chris
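[A minimal sketch of the lockdep annotation described above, using the
kernel's mutex_lock_nested() with the btree level as the subclass.  The
struct and function names are hypothetical stand-ins, not the actual
btrfs code.]

#include <linux/mutex.h>

/* Hypothetical node type for illustration; not the real btrfs structures. */
struct demo_btree_node {
        struct mutex lock;
        int level;              /* 0 = leaf, increasing toward the root */
};

/*
 * Lock a root-to-leaf path.  All node locks share one lock class, so
 * plain mutex_lock() calls would look like recursive locking to
 * lockdep; passing the tree level as the subclass tells lockdep the
 * nesting is intentional and always taken in the same order.
 */
static void demo_lock_path(struct demo_btree_node **path, int depth)
{
        int i;

        for (i = 0; i < depth; i++)
                mutex_lock_nested(&path[i]->lock, path[i]->level);
}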
On Thu, 2008-08-07 at 11:14 +0200, Peter Zijlstra wrote:
> On Tue, 2008-08-05 at 15:01 -0400, Chris Mason wrote:
>
> > There are still more disk format changes planned, but we're making
> > every effort to get them out of the way as quickly as we can.  You
> > can see the major features we have planned on the development
> > timeline:
> >
> > http://btrfs.wiki.kernel.org/index.php/Development_timeline
>
> Just took a peek, seems to be slightly out of date as it still lists
> the single mutex thingy.

Thanks, I thought I had removed all the references to it on that page,
but there was one left.

> Also, how true is the IO-error and disk-full claim?
>

We still don't handle disk full.  The IO errors are handled most of the
time.  If a checksum doesn't match or the lower layers report an IO
error, btrfs will use an alternate mirror of the block.  If there is no
alternate mirror, the caller gets EIO and in the case of a failed csum,
the page is zero filled (actually filled with ones so I can find bogus
pages in an oops).

Metadata is duplicated by default even on single spindle drives, so
this means that metadata IO errors are handled as long as the other
mirror is happy.  If mirroring is off or both mirrors are bad, we
currently get into trouble.

Data pages work better; those errors bubble up to userland just like in
other filesystems.

-chris
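[A rough sketch of the error handling described above: try each mirror,
then fall back to EIO and fill the page with ones.  The demo_* helpers
below are hypothetical stand-ins for the chunk mapping, bio submission
and checksum verification paths; this is not the actual btrfs read
path.]

#include <linux/errno.h>
#include <linux/string.h>
#include <linux/types.h>

struct demo_fs;

/* Hypothetical helpers; they do not exist in btrfs. */
int demo_num_mirrors(struct demo_fs *fs, u64 logical);
int demo_read_mirror(struct demo_fs *fs, u64 logical, int mirror,
                     void *buf, size_t len);
bool demo_csum_ok(struct demo_fs *fs, u64 logical, void *buf, size_t len);

static int demo_read_block(struct demo_fs *fs, u64 logical, void *buf,
                           size_t len)
{
        int mirror;
        int nr_mirrors = demo_num_mirrors(fs, logical);

        for (mirror = 0; mirror < nr_mirrors; mirror++) {
                if (demo_read_mirror(fs, logical, mirror, buf, len))
                        continue;       /* lower layers reported an IO error */
                if (demo_csum_ok(fs, logical, buf, len))
                        return 0;       /* this copy is good */
        }

        /* No good copy: fill with ones so bogus pages stand out in an oops. */
        memset(buf, 0xff, len);
        return -EIO;
}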
On Thu, 2008-08-07 at 17:03 +0300, Ahmed Kamal wrote:
> With csum errors, do we get warnings in logs?

Yes

> Do too many faults cause a device to be flagged as faulty?

Not yet

> Is there any user-space application to monitor/scrub/re-silver btrfs
> volumes?

Not yet, but there definitely will be.

-chris
Chris Mason wrote:
> I haven't done any real single cpu testing; it may make sense in those
> workloads to checksum and submit directly in the calling context.  But
> real single cpu boxes are harder to come by these days.

They're still pretty common in the embedded/low power space.  I could
see something like a settop box wanting to use btrfs with massive
disks.

Chris
Chris Mason wrote on 07/08/2008 11:34:02:

> > > * Helper threads for checksumming and other background tasks.
> > > Most CPU intensive operations have been pushed off to helper
> > > threads to take advantage of SMP machines.  Streaming read and
> > > write throughput now scale to disk speed even with checksumming
> > > on.
> >
> > Can this lead to the same Priority Inversion issues as seen with
> > kjournald?
>
> Yes, although in general only the helper threads end up actually doing
> the IO for writes.  Unfortunately, they are almost but not quite an
> elevator.  It is tempting to try sorting the bios on the helper queues
> etc.  But I haven't done that because it gets into starvation and
> other fun.
>
> I haven't done any real single cpu testing; it may make sense in those
> workloads to checksum and submit directly in the calling context.  But
> real single cpu boxes are harder to come by these days.

[just jumping in as a casual bystander with one remark]

For this purpose it seems booting up limited to one CPU should be
sufficient.

Tvrtko
Chris Mason <chris.mason@oracle.com> writes:
>
> Metadata is duplicated by default even on single spindle drives,

Can you please say a bit about how much that impacts performance?  That
sounds costly.

-Andi
> If there is no alternate mirror, the caller gets EIO and in the case
> of a failed csum, the page is zero filled (actually filled with ones
> so I can find bogus pages in an oops).

You mention there will be a utility to scrub the disks to repair stuff
like this, but does it make sense to retry the read in the event it
came back from the wrong sector or a bit got flipped somewhere or some
other transient error?
On Thu, 2008-08-07 at 20:02 +0200, Andi Kleen wrote:
> Chris Mason <chris.mason@oracle.com> writes:
> >
> > Metadata is duplicated by default even on single spindle drives,
>
> Can you please say a bit about how much that impacts performance?
> That sounds costly.

Most metadata is allocated in groups of 128k or 256k, and so most of
the writes are nicely sized.  The mirroring code has areas of the disk
dedicated to mirror other areas.  So we end up with something like
this:

metadata chunk A (~1GB in size)
[ ......................... ]

mirror of chunk A (~1GB in size)
[ ......................... ]

So, the mirroring turns a single large write into two large writes.
Definitely not free, but always a fixed cost.

I started to make some numbers on this yesterday on single spindles and
discovered that my worker threads are not doing as good a job as they
should be of maintaining IO ordering.  I've been using an array with a
writeback cache for benchmarking lately and hadn't noticed.  I need to
fix that, but here are some numbers on a single sata drive.  The drive
can do about 100MB/s streaming reads/writes.

Btrfs checksumming and inline data (tail packing) are both turned on.
Single process creating 30 kernel trees (2.6.27-rc2):

Btrfs defaults   36MB/s
Btrfs no mirror  50MB/s
Ext4 defaults    59.2MB/s (much better than ext3 here)

With /sys/block/sdb/queue/nr_requests at 8192 to hide my IO ordering
submission problems:

Btrfs defaults:   57MB/s
Btrfs no mirror:  61.51MB/s

-chris
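[A toy model of the layout sketched above: with duplication on, a
logical metadata write at a given offset into a chunk maps to two
physical writes, one into the chunk and one into its mirror chunk,
which is why every large metadata write becomes two large writes at a
fixed cost.  The names below are hypothetical; the real btrfs chunk
mapping is more involved.]

#include <linux/types.h>

/* Hypothetical chunk map for illustration only. */
struct demo_chunk_map {
        u64 logical_start;      /* start of the logical chunk */
        u64 primary_start;      /* physical start of copy 1 */
        u64 mirror_start;       /* physical start of copy 2 */
};

/*
 * Map a logical metadata write to its two physical targets.  The cost
 * is fixed: one large write in, two large writes out.
 */
static void demo_map_dup_write(const struct demo_chunk_map *map,
                               u64 logical, u64 *phys1, u64 *phys2)
{
        u64 offset = logical - map->logical_start;

        *phys1 = map->primary_start + offset;
        *phys2 = map->mirror_start + offset;
}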
> So, the mirroring turns a single large write into two large writes.
> Definitely not free, but always a fixed cost.

Thanks for the explanation and the numbers.  I see that's the advantage
of copy-on-write: you can always cluster the metadata together and get
batched IO this way, and then afford to do more of it.

Still wondering what that will do to read seekiness.

-Andi
On Fri, Aug 08, 2008 at 11:56:25PM +0200, Andi Kleen wrote:
> > So, the mirroring turns a single large write into two large writes.
> > Definitely not free, but always a fixed cost.
>
> Thanks for the explanation and the numbers.  I see that's the
> advantage of copy-on-write: you can always cluster the metadata
> together and get batched IO this way, and then afford to do more of
> it.
>
> Still wondering what that will do to read seekiness.

In theory, if the elevator was smart enough, it could actually help
read seekiness; there are two copies of the metadata, and it shouldn't
matter which one is fetched.  So I could imagine a (hypothetical) read
request which says, "please give me the contents of block 4500 or
75000000 --- I don't care which; if the disk head is closer to one end
of the disk or another, use whichever one is most convenient".

Our elevator algorithms are currently totally unable to deal with this
sort of request, and if SSD's are going to be coming on line as quickly
as some people are claiming, maybe it's not worth it to try to
implement that kind of thing, but at least in theory it's something
that could be done....

- Ted
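[As a toy illustration of that hypothetical request, an elevator that
knew about both copies could simply pick whichever block number is
numerically closer to the last position it dispatched.  No such
interface exists today; the function below is purely illustrative.]

#include <linux/types.h>

/* Pick the copy whose block number is closest to the last dispatched
 * position.  Purely illustrative; not an existing elevator interface. */
static u64 demo_pick_copy(u64 last_pos, u64 copy_a, u64 copy_b)
{
        u64 dist_a = copy_a > last_pos ? copy_a - last_pos : last_pos - copy_a;
        u64 dist_b = copy_b > last_pos ? copy_b - last_pos : last_pos - copy_b;

        return dist_a <= dist_b ? copy_a : copy_b;
}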
> In theory, if the elevator was smart enough, it could actually help
> read seekiness; there are two copies of the metadata, and it shouldn't

That assumes the elevator actually knows what is nearby?  I thought
that wasn't that easy with modern disks with multiple spindles and
invisible remapping, not even talking about RAID arrays looking like
disks.

-Andi
On Sat, Aug 09, 2008 at 03:23:22AM +0200, Andi Kleen wrote:
> > In theory, if the elevator was smart enough, it could actually help
> > read seekiness; there are two copies of the metadata, and it
> > shouldn't
>
> That assumes the elevator actually knows what is nearby?  I thought
> that wasn't that easy with modern disks with multiple spindles and
> invisible remapping, not even talking about RAID arrays looking like
> disks.

RAID is the big problem, yeah.  In general, though, we are already
making an assumption in the elevator code and in filesystem code that
block numbers which are numerically closer together are "close" from
the perspective of disks.  There has been talk about trying to make
filesystems smarter about allocating blocks by giving them visibility
into the RAID parameters; in theory the elevator algorithm could also
be made smarter using the same information.  I'm really not sure if the
complexity is worth it, though....

- Ted
On Fri, 2008-08-08 at 14:48 -0400, Chris Mason wrote:
> On Thu, 2008-08-07 at 20:02 +0200, Andi Kleen wrote:
> > Chris Mason <chris.mason@oracle.com> writes:
> > >
> > > Metadata is duplicated by default even on single spindle drives,
> >
> > Can you please say a bit about how much that impacts performance?
> > That sounds costly.
>
> Most metadata is allocated in groups of 128k or 256k, and so most of
> the writes are nicely sized.  The mirroring code has areas of the
> disk dedicated to mirror other areas.

[ ... ]

> So, the mirroring turns a single large write into two large writes.
> Definitely not free, but always a fixed cost.

> With /sys/block/sdb/queue/nr_requests at 8192 to hide my IO ordering
> submission problems:
>
> Btrfs defaults:   57MB/s
> Btrfs no mirror:  61.51MB/s

I spent a bunch of time hammering on different ways to fix this without
increasing nr_requests, and it was a mixture of needing better tuning
in btrfs and needing to init mapping->writeback_index on inode
allocation.

So, today's numbers for creating 30 kernel trees in sequence:

Btrfs defaults                   57.41 MB/s
Btrfs dup no csum                74.59 MB/s
Btrfs no duplication             76.83 MB/s
Btrfs no dup no csum no inline   76.85 MB/s
Ext4 data=writeback, delalloc    60.50 MB/s

I may be able to get the duplication numbers higher by tuning metadata
writeback.  My current code doesn't push metadata throughput as high in
order to give some spindle time to data writes.

This graph may give you an idea of how the duplication goes to disk:

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-default.png

Compared with the result of mkfs.btrfs -m single (no duplication):

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-single.png

Both on one graph is a little hard to read:

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-dup-compare.png

Here is btrfs with duplication on, but without checksumming.  Even with
inline extents on, the checksums seem to cause most of the metadata
related syncing (they are stored in the btree):

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/btrfs-dup-nosum.png

It is worth noting that with checksumming on, I go through async
kthreads to do the checksumming and they may be reordering the IO a bit
as they submit things.  So, I'm not 100% sure the extra seeks aren't
coming from my async code.

And Ext4:

http://oss.oracle.com/~mason/seekwatcher/btrfs-dup/ext4-writeback.png

This benchmark has questionable real world value, but since it includes
a number of smallish files it is a good place to look at the cost of
metadata and metadata dup.

I'll push the btrfs related changes for this out tonight after some
stress testing.

-chris
On Thu, Aug 14, 2008 at 05:00:56PM -0400, Chris Mason wrote:
> Btrfs defaults                   57.41 MB/s
> Btrfs dup no csum                74.59 MB/s

With duplication, checksums seem to be quite costly (CPU bound?)

> Btrfs no duplication             76.83 MB/s
> Btrfs no dup no csum no inline   76.85 MB/s

But without duplication they are basically free, here at least in IO
rate.  Seems odd?

Does it compute them twice in the duplication case perhaps?

-Andi
> I spent a bunch of time hammering on different ways to fix this
> without increasing nr_requests, and it was a mixture of needing
> better tuning in btrfs and needing to init mapping->writeback_index
> on inode allocation.
>
> So, today's numbers for creating 30 kernel trees in sequence:
>
> Btrfs defaults                   57.41 MB/s
> Btrfs dup no csum                74.59 MB/s
> Btrfs no duplication             76.83 MB/s
> Btrfs no dup no csum no inline   76.85 MB/s

What sort of script are you using?  Basically something like this?

for i in `seq 1 30`; do
	mkdir $i; cd $i
	tar xjf /usr/src/linux-2.6.28.tar.bz2
	cd ..
done

- Ted
On Thu, 2008-08-14 at 19:44 -0400, Theodore Tso wrote:
> > I spent a bunch of time hammering on different ways to fix this
> > without increasing nr_requests, and it was a mixture of needing
> > better tuning in btrfs and needing to init mapping->writeback_index
> > on inode allocation.
> >
> > So, today's numbers for creating 30 kernel trees in sequence:
> >
> > Btrfs defaults                   57.41 MB/s
> > Btrfs dup no csum                74.59 MB/s
> > Btrfs no duplication             76.83 MB/s
> > Btrfs no dup no csum no inline   76.85 MB/s
>
> What sort of script are you using?  Basically something like this?
>
> for i in `seq 1 30`; do
> 	mkdir $i; cd $i
> 	tar xjf /usr/src/linux-2.6.28.tar.bz2
> 	cd ..
> done

Similar.  I used compilebench -i 30 -r 0, which means create 30 initial
kernel trees and then do nothing.  compilebench simulates compiles by
writing files to the FS with the same sizes you would get by creating
kernel trees or compiling them.

The idea is to get all of the IO without needing to keep 2.6.28.tar.bz2
in cache or the compiler using up CPU.

http://www.oracle.com/~mason/compilebench

-chris
On Thu, 2008-08-14 at 23:17 +0200, Andi Kleen wrote:
> On Thu, Aug 14, 2008 at 05:00:56PM -0400, Chris Mason wrote:
> > Btrfs defaults                   57.41 MB/s

Looks like I can get the btrfs defaults up to 64MB/s with some
writeback tweaks.

> > Btrfs dup no csum                74.59 MB/s
>
> With duplication, checksums seem to be quite costly (CPU bound?)
>

The async worker threads should be spreading the load across CPUs
pretty well, and even a single CPU could keep up with 100MB/s
checksumming.  But, the async worker threads do randomize the IO
somewhat because the IO goes from pdflush -> one worker thread per CPU
-> submit_bio.  So, maybe that 3rd thread is more than the drive can
handle?

btrfsck tells me the total size of the btree is only 20MB larger with
checksumming on.

> > Btrfs no duplication             76.83 MB/s
> > Btrfs no dup no csum no inline   76.85 MB/s
>
> But without duplication they are basically free, here at least in IO
> rate.  Seems odd?
>
> Does it compute them twice in the duplication case perhaps?
>

The duplication happens lower down in the stack; they only get done
once.

-chris
> The async worker threads should be spreading the load across CPUs
> pretty well, and even a single CPU could keep up with 100MB/s
> checksumming.  But, the async worker threads do randomize the IO
> somewhat because the IO goes from pdflush -> one worker thread per
> CPU -> submit_bio.  So, maybe that 3rd thread is more than the drive
> can handle?

You have more threads with duplication?

>
> btrfsck tells me the total size of the btree is only 20MB larger with
> checksumming on.
>
> > > Btrfs no duplication             76.83 MB/s
> > > Btrfs no dup no csum no inline   76.85 MB/s
> >
> > But without duplication they are basically free, here at least in
> > IO rate.  Seems odd?
> >
> > Does it compute them twice in the duplication case perhaps?
> >
>
> The duplication happens lower down in the stack; they only get done
> once.

Ok, that was just speculation.  The big difference still seems odd.

-Andi
On Thu, 2008-08-14 at 21:10 -0400, Chris Mason wrote:
> On Thu, 2008-08-14 at 19:44 -0400, Theodore Tso wrote:
> > > I spent a bunch of time hammering on different ways to fix this
> > > without increasing nr_requests, and it was a mixture of needing
> > > better tuning in btrfs and needing to init
> > > mapping->writeback_index on inode allocation.
> > >
> > > So, today's numbers for creating 30 kernel trees in sequence:
> > >
> > > Btrfs defaults                   57.41 MB/s
> > > Btrfs dup no csum                74.59 MB/s
> > > Btrfs no duplication             76.83 MB/s
> > > Btrfs no dup no csum no inline   76.85 MB/s
> >
> > What sort of script are you using?  Basically something like this?
> >
> > for i in `seq 1 30`; do
> > 	mkdir $i; cd $i
> > 	tar xjf /usr/src/linux-2.6.28.tar.bz2
> > 	cd ..
> > done
>
> Similar.  I used compilebench -i 30 -r 0, which means create 30
> initial kernel trees and then do nothing.  compilebench simulates
> compiles by writing files to the FS with the same sizes you would get
> by creating kernel trees or compiling them.
>
> The idea is to get all of the IO without needing to keep
> 2.6.28.tar.bz2 in cache or the compiler using up CPU.
>
> http://www.oracle.com/~mason/compilebench

Whoops, the link above is wrong, try:

http://oss.oracle.com/~mason/compilebench

It is worth noting that the end throughput doesn't matter quite as much
as the writeback pattern.  Ext4 is pretty solid on this test, with very
consistent results.

-chris
On Fri, 2008-08-15 at 03:39 +0200, Andi Kleen wrote:
> > The async worker threads should be spreading the load across CPUs
> > pretty well, and even a single CPU could keep up with 100MB/s
> > checksumming.  But, the async worker threads do randomize the IO
> > somewhat because the IO goes from pdflush -> one worker thread per
> > CPU -> submit_bio.  So, maybe that 3rd thread is more than the
> > drive can handle?
>
> You have more threads with duplication?
>

It was a very confusing use of the word thread.  I have the same number
of kernel threads running, but the single spindle on the drive has to
deal with 3 different streams of writes.  The seeks/sec portion of the
graph shows a big enough increase in seeks on the duplication run to
explain the performance.

> > btrfsck tells me the total size of the btree is only 20MB larger
> > with checksumming on.
> >
> > > > Btrfs no duplication             76.83 MB/s
> > > > Btrfs no dup no csum no inline   76.85 MB/s
> > >
> > > But without duplication they are basically free, here at least in
> > > IO rate.  Seems odd?
> > >
> > > Does it compute them twice in the duplication case perhaps?
> > >
> >
> > The duplication happens lower down in the stack; they only get done
> > once.
>
> Ok, that was just speculation.  The big difference still seems odd.

It does, I'll give the test a shot on other hardware too.

To be honest I'm pretty happy at matching ext4 with duplication on.
The graph shows even writeback and the times from each iteration are
fairly consistent.  Ext3 and XFS score somewhere between 10-15MB/s on
the same test...

-chris
On Fri, Aug 15, 2008 at 08:46:01AM -0400, Chris Mason wrote:
> Whoops, the link above is wrong, try:
>
> http://oss.oracle.com/~mason/compilebench

Thanks, I figured it out.

> It is worth noting that the end throughput doesn't matter quite as
> much as the writeback pattern.  Ext4 is pretty solid on this test,
> with very consistent results.

There were two reasons why I wanted to play with compilebench.  The
first is that we have a fragmentation problem with delayed allocation
and small files getting forced out due to memory pressure, which we've
been working on for the past week.  My intuition (which has proven to
be correct) is that compilebench is a great tool to show it off.  It
may not matter so much for write throughput results, since usually the
separation distance between the first block and the rest of the file is
small, and the write elevator takes care of it, but in the long run
this kind of allocation pattern is no good:

Inode 221280: (0):887097, (1):882497
Inode 221282: (0):887098, (1-2):882498-882499
Inode 221284: (0):887099, (1):882500

The other reason why I was interested in playing with the compilebench
tool is that I wanted to try tweaking the commit timers to see if this
would make a difference to the result.  Not for this benchmark, it
appears, given a quick test that I did last night.

- Ted
On Fri, 2008-08-15 at 09:45 -0400, Theodore Tso wrote:
> On Fri, Aug 15, 2008 at 08:46:01AM -0400, Chris Mason wrote:
> > Whoops, the link above is wrong, try:
> >
> > http://oss.oracle.com/~mason/compilebench
>
> Thanks, I figured it out.
>
> > It is worth noting that the end throughput doesn't matter quite as
> > much as the writeback pattern.  Ext4 is pretty solid on this test,
> > with very consistent results.
>
> There were two reasons why I wanted to play with compilebench.  The
> first is that we have a fragmentation problem with delayed allocation
> and small files getting forced out due to memory pressure, which
> we've been working on for the past week.

Have you tried this one:

http://article.gmane.org/gmane.linux.file-systems/25560

This bug should cause fragmentation on small files getting forced out
due to memory pressure in ext4.  But, I wasn't able to really
demonstrate it with ext4 on my machine.

-chris
On Fri, Aug 15, 2008 at 01:52:52PM -0400, Chris Mason wrote:
> Have you tried this one:
>
> http://article.gmane.org/gmane.linux.file-systems/25560
>
> This bug should cause fragmentation on small files getting forced out
> due to memory pressure in ext4.  But, I wasn't able to really
> demonstrate it with ext4 on my machine.

I've been able to use compilebench to see the fragmentation problem
very easily.

Aneesh has been working on it, and has some fixes that he queued up.
I'll have to point him at your proposed fix, thanks.  This is what he
came up with in the common code.  What do you think?

- Ted

(From Aneesh, on the linux-ext4 list.)

As I explained in my previous patch, the problem is due to pdflush
background_writeout.  Now when pdflush does the writeout we may have
only a few pages for the file and we would attempt to write them to
disk.  So my attempt in the last patch was to do the below:

a) When allocating blocks, try to be close to the goal block specified.
b) When we call ext4_da_writepages, make sure we have a minimal
   nr_to_write that ensures we allocate all dirty buffer_heads in a
   single go.  nr_to_write is set to 1024 in pdflush
   background_writeout and that would mean we may end up calling some
   inodes' writepages() with really small values even though we have
   more dirty buffer_heads.

What it doesn't handle is:

1) File A has 4 dirty buffer_heads.
2) pdflush tries to write them.  We get 4 contig blocks.
3) File A now has 5 new dirty buffer_heads.
4) File B now has 6 dirty buffer_heads.
5) pdflush tries to write the 6 dirty buffer_heads of file B and
   allocates them next to the earlier file A blocks.
6) pdflush tries to write the 5 dirty buffer_heads of file A and
   allocates them after the file B blocks, resulting in discontinuity.

I am right now testing the below patch which makes sure new dirty
inodes are added to the tail of the dirty inode list:

commit 6ad9d25595aea8efa0d45c0a2dd28b4a415e34e6
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Fri Aug 15 23:19:15 2008 +0530

    move the dirty inodes to the end of the list

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 25adfc3..91f3c54 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -163,7 +163,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	 */
 	if (!was_dirty) {
 		inode->dirtied_when = jiffies;
-		list_move(&inode->i_list, &sb->s_dirty);
+		list_move_tail(&inode->i_list, &sb->s_dirty);
 	}
 }
 out:
On Fri, 2008-08-15 at 15:59 -0400, Theodore Tso wrote:
> On Fri, Aug 15, 2008 at 01:52:52PM -0400, Chris Mason wrote:
> > Have you tried this one:
> >
> > http://article.gmane.org/gmane.linux.file-systems/25560
> >
> > This bug should cause fragmentation on small files getting forced
> > out due to memory pressure in ext4.  But, I wasn't able to really
> > demonstrate it with ext4 on my machine.
>
> I've been able to use compilebench to see the fragmentation problem
> very easily.
>
> Aneesh has been working on it, and has some fixes that he queued up.
> I'll have to point him at your proposed fix, thanks.  This is what he
> came up with in the common code.  What do you think?
>

It sounds like ext4 would show the writeback_index bug with
fragmentation on disk and btrfs would show it with seeks during the
benchmark.  I was only watching the throughput numbers and not looking
at filefrag results.

> - Ted
>
> (From Aneesh, on the linux-ext4 list.)
>
> As I explained in my previous patch, the problem is due to pdflush
> background_writeout.  Now when pdflush does the writeout we may have
> only a few pages for the file and we would attempt to write them to
> disk.  So my attempt in the last patch was to do the below:
>

pdflush and delalloc and raid stripe alignment and lots of other things
don't play well together.  In general, I think we need one or more
pdflush threads per mounted FS so that write_cache_pages doesn't have
to bail out every time it hits congestion.  The current
write_cache_pages code even misses easy chances to create bigger bios
just because a block device is congested when called by
background_writeout().

But I would hope we can deal with a single threaded small file workload
like compilebench without resorting to big rewrites.

> a) When allocating blocks, try to be close to the goal block
>    specified.
> b) When we call ext4_da_writepages, make sure we have a minimal
>    nr_to_write that ensures we allocate all dirty buffer_heads in a
>    single go.  nr_to_write is set to 1024 in pdflush
>    background_writeout and that would mean we may end up calling some
>    inodes' writepages() with really small values even though we have
>    more dirty buffer_heads.
>
> What it doesn't handle is:
>
> 1) File A has 4 dirty buffer_heads.
> 2) pdflush tries to write them.  We get 4 contig blocks.
> 3) File A now has 5 new dirty buffer_heads.
> 4) File B now has 6 dirty buffer_heads.
> 5) pdflush tries to write the 6 dirty buffer_heads of file B and
>    allocates them next to the earlier file A blocks.
> 6) pdflush tries to write the 5 dirty buffer_heads of file A and
>    allocates them after the file B blocks, resulting in discontinuity.
>
> I am right now testing the below patch which makes sure new dirty
> inodes are added to the tail of the dirty inode list:
>
> commit 6ad9d25595aea8efa0d45c0a2dd28b4a415e34e6
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date:   Fri Aug 15 23:19:15 2008 +0530
>
>     move the dirty inodes to the end of the list
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 25adfc3..91f3c54 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -163,7 +163,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>  	 */
>  	if (!was_dirty) {
>  		inode->dirtied_when = jiffies;
> -		list_move(&inode->i_list, &sb->s_dirty);
> +		list_move_tail(&inode->i_list, &sb->s_dirty);
>  	}
>  }
>  out:

Looks like everyone who walks sb->s_io or s_dirty walks it backwards.
This should make the newly dirtied inode the first one to be processed,
which probably isn't what we want.

I could be reading it backwards of course ;)

-chris
On Fri, 2008-08-15 at 16:37 -0400, Chris Mason wrote:
> On Fri, 2008-08-15 at 15:59 -0400, Theodore Tso wrote:
> > On Fri, Aug 15, 2008 at 01:52:52PM -0400, Chris Mason wrote:
> > > Have you tried this one:
> > >
> > > http://article.gmane.org/gmane.linux.file-systems/25560
> > >
> > > This bug should cause fragmentation on small files getting forced
> > > out due to memory pressure in ext4.  But, I wasn't able to really
> > > demonstrate it with ext4 on my machine.
> >
> > I've been able to use compilebench to see the fragmentation problem
> > very easily.
> >
> > Aneesh has been working on it, and has some fixes that he queued
> > up.  I'll have to point him at your proposed fix, thanks.  This is
> > what he came up with in the common code.  What do you think?
> >
>
> It sounds like ext4 would show the writeback_index bug with
> fragmentation on disk and btrfs would show it with seeks during the
> benchmark.  I was only watching the throughput numbers and not
> looking at filefrag results.
>

I tried just the writeback_index patch and got only 4 fragmented files
on ext4 after a compilebench run.  Then I tried again and got 1200.
Seems there is something timing dependent in here ;)

By default compilebench uses 256k buffers for writing (see compilebench
-b) and btrfs_file_write will lock down up to 512 pages at a time
during a single write.  This means that for most small files,
compilebench will send the whole file down in one write() and
btrfs_file_write will lock down pages for the entire write() call while
working on it.

So, even if pdflush tries to jump in and do the wrong thing, the pages
will be locked by btrfs_file_write and pdflush will end up skipping
them.

With the generic file write routines, pages are locked one at a time,
giving pdflush more windows to trigger delalloc while a write is still
ongoing.

-chris
On Fri, 15 Aug 2008, Chris Mason wrote:

> Ext3 and XFS score somewhere between 10-15MB/s on the same test...

Interesting (and cool animations).

We tried compilebench (-i 30 -r 0) just for fun using kernel 2.6.26,
freshly formatted partition, with defaults.  Results:

            MB/s   Runtime (s)
           -----   -----------
ext3       13.24       877
btrfs      12.33       793
ntfs-3g     8.55       865
reiserfs    8.38       966
xfs         1.88      3901

Regards,
           Szaka

-- 
NTFS-3G: http://ntfs-3g.org
On Sat, Aug 16, 2008 at 02:10:10PM -0400, Chris Mason wrote:
>
> I tried just the writeback_index patch and got only 4 fragmented
> files on ext4 after a compilebench run.  Then I tried again and got
> 1200.  Seems there is something timing dependent in here ;)
>

Yeah, the patch Aneesh sent to change where we added the inode to the
dirty list was a false lead.  The right fix is in the ext4 patch queue
now.  I think we have the problem licked; a quick test showed it
increased the compilebench MB/s by a very tiny amount (enough so that I
wasn't sure whether or not it was measurement error), but it does avoid
the needless fragmentation.

- Ted
On Sat, 2008-08-16 at 22:26 +0300, Szabolcs Szakacsits wrote:
> On Fri, 15 Aug 2008, Chris Mason wrote:
>
> > Ext3 and XFS score somewhere between 10-15MB/s on the same test...
>
> Interesting (and cool animations).
>
> We tried compilebench (-i 30 -r 0) just for fun using kernel 2.6.26,
> freshly formatted partition, with defaults.  Results:
>
>             MB/s   Runtime (s)
>            -----   -----------
> ext3       13.24       877
> btrfs      12.33       793

Thanks for running things.

The code in the btrfs-unstable tree has all my performance fixes.
You'll need it to get good results.

Also, the MB/s number doesn't include the time to run sync at the end,
which is probably why the runtime for btrfs is shorter but MB/s is
lower.

-chris
On Mon, 18 Aug 2008, Chris Mason wrote:

> On Sat, 2008-08-16 at 22:26 +0300, Szabolcs Szakacsits wrote:
> >
> > We tried compilebench (-i 30 -r 0) just for fun using kernel
> > 2.6.26, freshly formatted partition, with defaults.  Results:
> >
> >             MB/s   Runtime (s)
> >            -----   -----------
> > ext3       13.24       877
> > btrfs      12.33       793
>
> Thanks for running things.
>
> The code in the btrfs-unstable tree has all my performance fixes.
> You'll need it to get good results.

The numbers are indeed much better:

                 MB/s   Runtime (s)
                -----   -----------
btrfs-unstable  17.09       572

The disk is capable of 40+ MB/s, however the test partition was one of
the last ones and, as I figured out now, it can do only 26 MB/s.  Btrfs
bulk write easily sustains it.  The write speed was 21 MB/s during the
benchmark, so btrfs is the closest to the possible best write speed in
the test environment.

Szaka

-- 
NTFS-3G: http://ntfs-3g.org