Hello everyone,

I've pushed out another set of changes to the unstable trees, and these
include:

* RAID 1+0 support

  mkfs.btrfs -d raid10 -m raid10 /dev/sd...

  4 or more drives required.

* async work queues for checksumming writes

* Better back references in the multi-device data structs

The async work queues include code to checksum data pages without the FS
mutex held, greatly increasing streaming write throughput.  On my 4 drive
system, I was getting around 120MB/s writes with checksumming on.  Now I
get 180MB/s, which is disk speed.

The rest of the week will be spent doing hot add/remove of devices.

Happy testing ;)

-chris
Chris Mason <chris.mason@oracle.com> writes:
>
> The async work queues include code to checksum data pages without the FS
> mutex

Are they able to distribute work to other cores?

-Andi
On Wednesday 16 April 2008, Andi Kleen wrote:
> Chris Mason <chris.mason@oracle.com> writes:
> > The async work queues include code to checksum data pages without the FS
> > mutex
>
> Are they able to distribute work to other cores?

Yes, it just uses a workqueue.  The current implementation is pretty simple;
it surely could be more effective at spreading the work around.

I'm testing a variant that only tosses over to the async queue for pdflush,
so inline reclaim stays inline.

-chris
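For readers who want a concrete picture, a minimal sketch of what handing the
checksum step off to a workqueue can look like follows; all names here
(csum_wq, async_csum, csum_worker, csum_async) are illustrative, not the
actual btrfs code.

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/bio.h>
#include <linux/slab.h>

static struct workqueue_struct *csum_wq;	/* illustrative name */

struct async_csum {
	struct work_struct work;
	struct bio *bio;		/* data pages waiting for checksums */
};

static void csum_worker(struct work_struct *work)
{
	struct async_csum *ac = container_of(work, struct async_csum, work);

	/* checksum the pages in ac->bio here, then submit the bio for IO */
	kfree(ac);
}

/* queue a bio for async checksumming instead of doing it under the FS mutex */
static int csum_async(struct bio *bio)
{
	struct async_csum *ac = kmalloc(sizeof(*ac), GFP_NOFS);

	if (!ac)
		return -ENOMEM;
	ac->bio = bio;
	INIT_WORK(&ac->work, csum_worker);
	queue_work(csum_wq, &ac->work);
	return 0;
}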
Chris Mason <chris.mason@oracle.com> writes:
> On Wednesday 16 April 2008, Andi Kleen wrote:
> > Chris Mason <chris.mason@oracle.com> writes:
> > > The async work queues include code to checksum data pages without the FS
> > > mutex
> >
> > Are they able to distribute work to other cores?
>
> Yes, it just uses a workqueue.

Unfortunately work queues don't do that by default currently.  They
tend to process on the current CPU only.

> The current implementation is pretty simple; it surely could be more
> effective at spreading the work around.
>
> I'm testing a variant that only tosses over to the async queue for pdflush,
> so inline reclaim stays inline.

Longer term I would hope that write checksumming will be basically free,
by doing a combined csum-copy at write() time.  The only problem is where
to store the checksum between the write and the final IO: there's no space
in struct page.

The same could also be done for read(), but that might be a little more
tricky because it would require delayed error reporting, and it might be
difficult to do for partial blocks.

-Andi
On Wednesday 16 April 2008, Andi Kleen wrote:
> Chris Mason <chris.mason@oracle.com> writes:
> > On Wednesday 16 April 2008, Andi Kleen wrote:
> > > Chris Mason <chris.mason@oracle.com> writes:
> > > > The async work queues include code to checksum data pages without the
> > > > FS mutex
> > >
> > > Are they able to distribute work to other cores?
> >
> > Yes, it just uses a workqueue.
>
> Unfortunately work queues don't do that by default currently.  They
> tend to process on the current CPU only.

Well, I see multiple work queue threads using CPU time, but I haven't spent
much time optimizing it.  There's definitely room for improvement.

> > The current implementation is pretty simple; it surely could be more
> > effective at spreading the work around.
> >
> > I'm testing a variant that only tosses over to the async queue for
> > pdflush, so inline reclaim stays inline.
>
> Longer term I would hope that write checksumming will be basically free,
> by doing a combined csum-copy at write() time.  The only problem is where
> to store the checksum between the write and the final IO: there's no space
> in struct page.

Checksumming at write() time is easier (except for mmap) because I can toss
the csum directly into the btree inside btrfs_file_write.  The current code
avoids that complexity and does it all at writeout.

One advantage to the current code is that I'm able to optimize tree searches
away by checksumming a bunch of pages at a time.  Multiple pages worth of
checksums get stored in a single btree item, so at least for the btree
operations the current code is fairly optimal.

> The same could also be done for read(), but that might be a little more
> tricky because it would require delayed error reporting, and it might be
> difficult to do for partial blocks.

Yeah, it doesn't quite fit with how the kernel does reads.  For now it is
much easier if the retry-other-mirror operation happens long before
copy_to_user.

-chris
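A rough illustration of the "many checksums per btree item" layout described
above; this is only a sketch of the idea, not the real btrfs on-disk format,
and the names are made up.

#include <linux/types.h>

#define CSUM_BLOCKSIZE	4096	/* one checksum per 4K block, for illustration */

/*
 * Illustrative layout only: the item records where a run of contiguous
 * blocks starts, and the payload is a packed array of checksums, so one
 * tree search/insert covers a whole run of pages instead of one page.
 */
struct csum_run_item {
	__le64 start;		/* logical byte offset of the first block */
	__le32 nr;		/* number of blocks covered by this item */
	__le32 sums[];		/* nr checksums, one per block */
};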
Chris Mason wrote:
> On Wednesday 16 April 2008, Andi Kleen wrote:
> > Chris Mason <chris.mason@oracle.com> writes:
> > > On Wednesday 16 April 2008, Andi Kleen wrote:
> > > > Chris Mason <chris.mason@oracle.com> writes:
> > > > > The async work queues include code to checksum data pages without the
> > > > > FS mutex
> > > > Are they able to distribute work to other cores?
> > > Yes, it just uses a workqueue.
> > Unfortunately work queues don't do that by default currently.  They
> > tend to process on the current CPU only.
>
> Well, I see multiple work queue threads using CPU time, but I haven't spent
> much time optimizing it.  There's definitely room for improvement.

That's likely because you submit from multiple CPUs.  But with a single
submitter running on a single CPU there shouldn't be any load balancing
currently.

-Andi
On Wed, Apr 16 2008, Andi Kleen wrote:
> Chris Mason wrote:
> > On Wednesday 16 April 2008, Andi Kleen wrote:
> > > Chris Mason <chris.mason@oracle.com> writes:
> > > > On Wednesday 16 April 2008, Andi Kleen wrote:
> > > > > Chris Mason <chris.mason@oracle.com> writes:
> > > > > > The async work queues include code to checksum data pages without
> > > > > > the FS mutex
> > > > > Are they able to distribute work to other cores?
> > > > Yes, it just uses a workqueue.
> > > Unfortunately work queues don't do that by default currently.  They
> > > tend to process on the current CPU only.
> >
> > Well, I see multiple work queue threads using CPU time, but I haven't spent
> > much time optimizing it.  There's definitely room for improvement.
>
> That's likely because you submit from multiple CPUs.  But with a single
> submitter running on a single CPU there shouldn't be any load balancing
> currently.

There have been various implementations of queue_work_on() posted
through the years; I've had one version that I've used off and on for a
long time:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=c68c42fd6df96f5b3fb5b8b47c571f233d054c71

Then you need some balancing decider on top of that, of course.

-- 
Jens Axboe
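Assuming a queue_work_on(cpu, wq, work) call along the lines of that patch,
a caller spreading jobs by hand could look roughly like the sketch below;
the "balancing decider" is exactly the open part, and the names here are
illustrative rather than from any real submission.

#include <linux/workqueue.h>
#include <linux/cpumask.h>

/*
 * Hand-rolled spreading on top of a queue_work_on()-style interface:
 * round-robin the jobs across online CPUs.  Illustration only; there is
 * no locking around the cursor, and whether the caller should be making
 * this placement decision at all is questioned below.
 */
static int csum_next_cpu = -1;

static void queue_csum_spread(struct workqueue_struct *wq,
			      struct work_struct *work)
{
	csum_next_cpu = cpumask_next(csum_next_cpu, cpu_online_mask);
	if (csum_next_cpu >= nr_cpu_ids)
		csum_next_cpu = cpumask_first(cpu_online_mask);
	queue_work_on(csum_next_cpu, wq, work);
}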
> There have been various implementations of queue_work_on() posted
> through the years; I've had one version that I've used off and on for a
> long time:

queue_work_on() is the wrong interface, I think.  You rather want a pool
of non-pinned threads that are then load balanced by the scheduler (which
knows best what CPUs have cycles available).

-Andi
On Wed, Apr 16 2008, Andi Kleen wrote:
> > There have been various implementations of queue_work_on() posted
> > through the years; I've had one version that I've used off and on for a
> > long time:
>
> queue_work_on() is the wrong interface, I think.  You rather want a pool
> of non-pinned threads that are then load balanced by the scheduler (which
> knows best what CPUs have cycles available).

Yeah, that actually sounds like the best interface.  What I described
typically ends up trying to be too clever; you really want to leave any
scheduling decisions to the scheduler.

-- 
Jens Axboe
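A bare-bones sketch of that alternative: a handful of kthreads left unbound
so the scheduler places them wherever cycles are available, pulling jobs off
one shared list.  Names and structure are made up for illustration; this is
not code from the thread.

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct pool_job {
	struct list_head list;
	void (*fn)(struct pool_job *job);	/* e.g. checksum one bio */
};

static LIST_HEAD(pool_jobs);
static DEFINE_SPINLOCK(pool_lock);
static DECLARE_WAIT_QUEUE_HEAD(pool_wait);

/* add a job to the shared list and kick any idle pool thread */
static void pool_queue(struct pool_job *job)
{
	spin_lock(&pool_lock);
	list_add_tail(&job->list, &pool_jobs);
	spin_unlock(&pool_lock);
	wake_up(&pool_wait);
}

static int pool_thread(void *unused)
{
	while (!kthread_should_stop()) {
		struct pool_job *job = NULL;

		spin_lock(&pool_lock);
		if (!list_empty(&pool_jobs)) {
			job = list_first_entry(&pool_jobs, struct pool_job, list);
			list_del(&job->list);
		}
		spin_unlock(&pool_lock);

		if (job)
			job->fn(job);
		else
			wait_event_interruptible(pool_wait,
					!list_empty(&pool_jobs) ||
					kthread_should_stop());
	}
	return 0;
}

/* start nr unbound threads; the scheduler decides where they run */
static void pool_start(int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		kthread_run(pool_thread, NULL, "csum-pool/%d", i);
}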
On Wednesday 16 April 2008, Andi Kleen wrote:
> > There have been various implementations of queue_work_on() posted
> > through the years; I've had one version that I've used off and on for a
> > long time:
>
> queue_work_on() is the wrong interface, I think.  You rather want a pool
> of non-pinned threads that are then load balanced by the scheduler (which
> knows best what CPUs have cycles available).

Fair enough, I'll tune things a bit.

-chris