Hello everyone,

It took me much longer to chase down races in my new data=ordered code, but I think I've finally got it, and have pushed it out to the unstable trees.

There are no disk format changes included. I need to make minor mods to the resizing and balancing code, but I wanted to get this stuff out the door.

In general, I'll call data=ordered any system that prevents seeing stale data on the disk after a crash. This includes null bytes from areas not yet written when we crashed and the contents of old blocks the filesystem had freed in the past.

The old data=ordered code worked something like this:

file_write:
 * modify pages in the page cache
 * set delayed allocation bits
 * update the in-memory and on-disk i_size

writepage:
 * collect a large delalloc region
 * allocate a new extent
 * drop existing extents from the metadata
 * insert the new extent
 * start the page IO

transaction commit:
 * write and wait on any dirty file data to finish
 * commit the new btree pointers

The end result was very large latencies during transaction commit because it had to wait on all the file data. An fsync of a single file was forced to write out all the dirty metadata and dirty data on the FS. This is how ext3 works today; xfs does something smarter, and ext4 is moving to something similar to xfs.

With the new code, metadata is not modified in the btree until the new extents are fully on disk. It now looks something like this:

file_write (start, len):
 * wait on pending ordered extents for the start, len range
 * modify pages in the page cache
 * set delayed allocation bits
 * update the in-memory i_size only

writepage:
 * collect a large delalloc extent
 * reserve an extent on disk in the allocation tree
 * create an ordered extent record
 * start the page IO

At IO completion (done in a kthread):
 * find the corresponding ordered extent record
 * if fully written, remove the old extents from the tree,
   add the new extents to the tree, and update the on-disk i_size

At commit time:
 * only metadata IO is needed

The end result of all of this is lower commit latencies and a smoother system.

-chris
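To sketch the new flow in code: the struct and helper names below are invented for illustration and are not the real btrfs symbols, but they show how an ordered extent record ties writepage and IO completion together so the btree is only touched after the data is safely on disk.

/*
 * Rough illustration only: the struct and the extern helpers below are
 * made-up stand-ins for the real allocation/btree code.
 */
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/list.h>

struct ordered_extent {
	u64 file_offset;	/* logical offset in the file */
	u64 disk_start;		/* location reserved in the allocation tree */
	u64 len;
	u64 bytes_left;		/* bytes of data not yet on disk */
	struct list_head list;	/* linked off the inode */
};

/* hypothetical helpers standing in for the real allocation/btree code */
extern u64 reserve_disk_extent(struct inode *inode, u64 len);
extern void add_ordered_extent(struct inode *inode, struct ordered_extent *oe);
extern struct ordered_extent *lookup_ordered_extent(struct inode *inode, u64 start);
extern void remove_ordered_extent(struct inode *inode, struct ordered_extent *oe);
extern void drop_old_extents(struct inode *inode, u64 start, u64 len);
extern void insert_new_extent(struct inode *inode, u64 start, u64 disk_start, u64 len);
extern void update_on_disk_i_size(struct inode *inode);

/*
 * writepage path: reserve space and record the pending ordered extent,
 * but do not touch the extent btree yet.  Page IO is submitted after this.
 */
static int start_ordered_io(struct inode *inode, u64 start, u64 len)
{
	struct ordered_extent *oe = kmalloc(sizeof(*oe), GFP_NOFS);

	if (!oe)
		return -ENOMEM;
	oe->file_offset = start;
	oe->disk_start = reserve_disk_extent(inode, len);
	oe->len = len;
	oe->bytes_left = len;
	add_ordered_extent(inode, oe);
	return 0;
}

/*
 * IO completion path, run from a kthread: only when every byte of the
 * ordered extent is safely on disk do the btree updates happen.
 */
static void finish_ordered_io(struct inode *inode, u64 start, u64 len)
{
	struct ordered_extent *oe = lookup_ordered_extent(inode, start);

	oe->bytes_left -= len;
	if (oe->bytes_left == 0) {
		drop_old_extents(inode, oe->file_offset, oe->len);
		insert_new_extent(inode, oe->file_offset, oe->disk_start, oe->len);
		update_on_disk_i_size(inode);
		remove_ordered_extent(inode, oe);
		kfree(oe);
	}
}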
Chris Mason wrote:
> It took me much longer to chase down races in my new data=ordered code,
> but I think I've finally got it, and have pushed it out to the unstable
> trees.

Just to kick the tires, I tried the same test that I ran last week on ext4. Everything was going great; I decided to kill it after 6 million files or so and restart.

The unmount has taken a very, very long time; it seems like we are cleaning up the pending transactions at a very slow rate:

Jul 18 16:06:04 localhost kernel: cleaner awake
Jul 18 16:06:04 localhost kernel: cleaner done
Jul 18 16:06:34 localhost kernel: trans 188 in commit
Jul 18 16:06:35 localhost kernel: trans 188 done in commit
Jul 18 16:06:35 localhost kernel: cleaner awake
Jul 18 16:06:35 localhost kernel: cleaner done
Jul 18 16:07:05 localhost kernel: trans 189 in commit
Jul 18 16:07:06 localhost kernel: trans 189 done in commit
Jul 18 16:07:06 localhost kernel: cleaner awake
Jul 18 16:07:06 localhost kernel: cleaner done
Jul 18 16:07:36 localhost kernel: trans 190 in commit
Jul 18 16:07:37 localhost kernel: trans 190 done in commit
Jul 18 16:07:37 localhost kernel: cleaner awake
Jul 18 16:07:37 localhost kernel: cleaner done
Jul 18 16:08:07 localhost kernel: trans 191 in commit
Jul 18 16:08:09 localhost kernel: trans 191 done in commit
Jul 18 16:08:09 localhost kernel: cleaner awake
Jul 18 16:08:09 localhost kernel: cleaner done
Jul 18 16:08:39 localhost kernel: trans 192 in commit
Jul 18 16:08:39 localhost kernel: trans 192 done in commit
Jul 18 16:08:39 localhost kernel: cleaner awake
Jul 18 16:08:39 localhost kernel: cleaner done

The command I ran was:

fs_mark -d /mnt/test -D 256 -n 100000 -t 4 -s 20480 -F -S 0 -l btrfs_new.txt

(No fsyncs involved here)

ric
On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:
> Just to kick the tires, I tried the same test that I ran last week on
> ext4. Everything was going great; I decided to kill it after 6 million
> files or so and restart.
>
> The unmount has taken a very, very long time; it seems like we are
> cleaning up the pending transactions at a very slow rate:

This is a known problem; Yan will take care of it next week. You've got the right idea: cleaning old snapshots does more IO than it should.

The good news is that if you hit reset and mount again, it'll pick up where it left off. The bad news is it'll be just as slow as last time around ;)

> Jul 18 16:06:04 localhost kernel: cleaner awake
> Jul 18 16:06:04 localhost kernel: cleaner done
> Jul 18 16:06:34 localhost kernel: trans 188 in commit
> Jul 18 16:06:35 localhost kernel: trans 188 done in commit

And these messages I meant to get rid of.

-chris
Chris Mason wrote:
> This is a known problem; Yan will take care of it next week. You've got
> the right idea: cleaning old snapshots does more IO than it should.
>
> The good news is that if you hit reset and mount again, it'll pick up
> where it left off. The bad news is it'll be just as slow as last
> time around ;)

I will just restart it to get the timings. It was looking quite good in the initial few data points.

ric
On Fri, 2008-07-18 at 18:35 -0400, Ric Wheeler wrote:
> Chris Mason wrote:
>> On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:
>>> Just to kick the tires, I tried the same test that I ran last week on
>>> ext4. Everything was going great; I decided to kill it after 6 million
>>> files or so and restart.

Well, it looks like I neglected to push all the changesets, especially the last one that made it less racy. So, I've just done another push, sorry. For the fs_mark workload, it shouldn't change anything.

This code still hasn't really survived an overnight run; hopefully this commit will.

-chris
Chris Mason wrote:
> Well, it looks like I neglected to push all the changesets, especially
> the last one that made it less racy. So, I've just done another push,
> sorry. For the fs_mark workload, it shouldn't change anything.
>
> This code still hasn't really survived an overnight run; hopefully this
> commit will.

The test is still running, but slowly, with a (slow) stream of messages about:

Jul 19 10:55:38 localhost kernel: INFO: task btrfs:448 blocked for more than 120 seconds.
Jul 19 10:55:38 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 19 10:55:38 localhost kernel: btrfs D ffffffff8129c5b0 0 448 2
Jul 19 10:55:38 localhost kernel: ffff8100283dfc50 0000000000000046 0000000000000000 ffffffffa0514254
Jul 19 10:55:38 localhost kernel: ffff810012c061c0 ffffffff814b2280 ffffffff814b2280 ffff81000b5ce0a0
Jul 19 10:55:38 localhost kernel: 0000000000000000 ffff81003f182cc0 ffff81003fac0000 ffff81003f183010
Jul 19 10:55:38 localhost kernel: Call Trace:
Jul 19 10:55:38 localhost kernel: [<ffffffffa0514254>] ? :btrfs:free_extent_state+0x69/0x6e
Jul 19 10:55:38 localhost kernel: [<ffffffff8127ff47>] __mutex_lock_slowpath+0x6b/0xa2
Jul 19 10:55:38 localhost kernel: [<ffffffff8127fdd2>] mutex_lock+0x2f/0x33
Jul 19 10:55:38 localhost kernel: [<ffffffffa04f6e96>] :btrfs:maybe_lock_mutex+0x29/0x2b
Jul 19 10:55:38 localhost kernel: [<ffffffffa04fae0c>] :btrfs:btrfs_alloc_reserved_extent+0x2c/0x67
Jul 19 10:55:38 localhost kernel: [<ffffffffa0512106>] ? :btrfs:btrfs_lookup_ordered_extent+0x139/0x148
Jul 19 10:55:38 localhost kernel: [<ffffffffa050635f>] :btrfs:btrfs_finish_ordered_io+0x102/0x2a8
Jul 19 10:55:38 localhost kernel: [<ffffffffa0506666>] :btrfs:btrfs_writepage_end_io_hook+0x10/0x12
Jul 19 10:55:38 localhost kernel: [<ffffffffa0516b62>] :btrfs:end_bio_extent_writepage+0xbe/0x28d
Jul 19 10:55:38 localhost kernel: [<ffffffff810c6f5d>] bio_endio+0x2b/0x2d
Jul 19 10:55:38 localhost kernel: [<ffffffffa0500d19>] :btrfs:end_workqueue_fn+0x103/0x110
Jul 19 10:55:38 localhost kernel: [<ffffffffa051b43c>] :btrfs:worker_loop+0x63/0x13e
Jul 19 10:55:38 localhost kernel: [<ffffffffa051b3d9>] ? :btrfs:worker_loop+0x0/0x13e
Jul 19 10:55:38 localhost kernel: [<ffffffff810460e3>] kthread+0x49/0x76
Jul 19 10:55:38 localhost kernel: [<ffffffff8100cda8>] child_rip+0xa/0x12
Jul 19 10:55:38 localhost kernel: [<ffffffff8104609a>] ? kthread+0x0/0x76
Jul 19 10:55:38 localhost kernel: [<ffffffff8100cd9e>] ? child_rip+0x0/0x12

ric
On Sun, 2008-07-20 at 08:19 -0400, Ric Wheeler wrote:
> The test is still running, but slowly, with a (slow) stream of messages
> about:

Could you please grab the sysrq-w if this is still running?

-chris
Chris Mason wrote:
> Could you please grab the sysrq-w if this is still running?

Attached....

ric
On Sun, 2008-07-20 at 09:46 -0400, Ric Wheeler wrote:
>> The test is still running, but slowly, with a (slow) stream of messages
>> about:

[ lock timeouts and stalls ]

Ok, I've made a few changes that should lower overall contention on the allocation mutex. I'm getting better performance on a 3 million file run, please give it a shot.

-chris
On Mon, 2008-07-21 at 14:29 -0400, Ric Wheeler wrote:
> Chris Mason wrote:
>> Ok, I've made a few changes that should lower overall contention on the
>> allocation mutex. I'm getting better performance on a 3 million file
>> run, please give it a shot.
>
> Hi Chris,
>
> After an update, clean rebuild & reboot, the test is running along and
> has hit about 10 million files. I still see some messages like:
>
> INFO: task pdflush:4051 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> pdflush D ffffffff8129c5b0 0 4051 2
> ffff81002ae77870 0000000000000046 0000000000000000 ffff81002ae77834
> 0000000000000001 ffffffff814b2280 ffffffff814b2280 0000000100000001
> 0000000000000000 ffff81003f188000 ffff81003fac5980 ffff81003f188350
>
> but not as many as before.
>
> I will attach the messages file,

I'll try running with soft-lockup detection here, see if I can hunt down the cause of these stalls. Good to know I've made progress though ;)

-chris
Chris Mason wrote:
> I'll try running with soft-lockup detection here, see if I can hunt down
> the cause of these stalls. Good to know I've made progress though ;)

This is an 8 core box, so it might be more prone to hitting these things ;-)

ric
On Mon, 2008-07-21 at 15:23 -0400, Ric Wheeler wrote:
>> After an update, clean rebuild & reboot, the test is running along and
>> has hit about 10 million files. I still see some messages like:
>>
>> INFO: task pdflush:4051 blocked for more than 120 seconds.

The latest code in btrfs-unstable has everything I can safely do right now :)

Basically, the stalls come from someone doing IO with the allocation mutex held. It is surprising that we should be stalling for such a long time; it is probably a mixture of elevator starvation and btrfs fun.

But btrfs-unstable also has code to replace the page lock with a per-tree-block mutex, which will allow me to get rid of the big allocation mutex over the long term. I was able to break up most of the long operations and have them drop and reacquire the allocation mutex to prevent this starvation most of the time.

-chris
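For illustration, here is the general drop/reacquire pattern being described. The names below are made up (this is not the actual btrfs code), but it shows how a long operation can periodically release a contended mutex so waiters are not starved.

/*
 * Illustrative only: alloc_mutex and do_one_unit() are stand-ins, not
 * the real btrfs symbols.
 */
#include <linux/mutex.h>
#include <linux/sched.h>

static DEFINE_MUTEX(alloc_mutex);	/* stand-in for the fs-wide allocation mutex */

/* hypothetical unit of work that must be done under the allocation mutex */
extern void do_one_unit(int i);

static void long_operation(int nr_units)
{
	int i;

	mutex_lock(&alloc_mutex);
	for (i = 0; i < nr_units; i++) {
		do_one_unit(i);

		/*
		 * Instead of holding the mutex across the whole operation,
		 * periodically drop it so waiters (and the scheduler) get a turn.
		 */
		if ((i % 64) == 63 || need_resched()) {
			mutex_unlock(&alloc_mutex);
			cond_resched();
			mutex_lock(&alloc_mutex);
		}
	}
	mutex_unlock(&alloc_mutex);
}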
On Fri, 2008-07-18 at 16:09 -0400, Ric Wheeler wrote:
> Just to kick the tires, I tried the same test that I ran last week on
> ext4. Everything was going great; I decided to kill it after 6 million
> files or so and restart.
>
> The unmount has taken a very, very long time; it seems like we are
> cleaning up the pending transactions at a very slow rate:

For many workloads, the long unmount time is now fixed in btrfs-unstable as well (thanks to a cache from Yan Zheng). For fs_mark, I created 10 million files (20k each) and was able to unmount in 2s.

-chris