[adding linux-btrfs to cc]

Josef, Chris, any ideas on the below issues?

On Mon, 24 Oct 2011, Christian Brunner wrote:
> Thanks for explaining this. I don't have any objections against btrfs
> as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> scare me, since I can use the ceph replication to recover a lost
> btrfs-filesystem. The only problem I have is that btrfs is not stable
> on our side and I wonder what you are doing to make it work. (Maybe
> it's related to the load pattern of using ceph as a backend store for
> qemu).
>
> Here is a list of the btrfs problems I'm having:
>
> - When I run ceph with the default configuration (btrfs snaps enabled)
> I can see a rapid increase in Disk-I/O after a few hours of uptime.
> Btrfs-cleaner is using more and more time in
> btrfs_clean_old_snapshots().

In theory, there shouldn't be any significant difference between taking a
snapshot and removing it a few commits later, and the prior root refs that
btrfs holds on to internally until the new commit is complete. That's
clearly not quite the case, though.

In any case, we're going to try to reproduce this issue in our
environment.

> - When I run ceph with btrfs snaps disabled, the situation is getting
> slightly better. I can run an OSD for about 3 days without problems,
> but then again the load increases. This time, I can see that the
> ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> than usual.

FYI in this scenario you're exposed to the same journal replay issues that
ext4 and XFS are. The btrfs workload that ceph is generating will also
not be all that special, though, so this problem shouldn't be unique to
ceph.

> Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> from time to time. Maybe it's related to the performance issues, but I
> haven't been able to verify this.

I haven't seen this yet with the latest stuff from Josef, but others have.
Josef, is there any information we can provide to help track it down?

> It's really sad to see that ceph performance and stability are
> suffering that much from the underlying filesystems and that this
> hasn't changed over the last months.

We don't have anyone internally working on btrfs at the moment, and are
still struggling to hire experienced kernel/fs people. Josef has been
very helpful with tracking these issues down, but he has responsibilities
beyond just the Ceph-related issues. Progress is slow, but we are
working on it!

sage

> Kind regards,
> Christian
>
> 2011/10/24 Sage Weil <sage@newdream.net>:
> > Although running on ext4, xfs, or whatever other non-btrfs you want mostly
> > works, there are a few important remaining issues:
> >
> > 1- ext4 limits total xattrs to 4KB. This can cause problems in some
> > cases, as Ceph uses xattrs extensively. Most of the time we don't hit
> > this. We do hit the limit with radosgw pretty easily, though, and may
> > also hit it in exceptional cases where the OSD cluster is very unhealthy.
> >
> > There is a large xattr patch for ext4 from the Lustre folks that has been
> > floating around for (I think) years. Maybe as interest grows in running
> > Ceph on ext4 this can move upstream.
> >
> > Previously we were being forgiving about large setxattr failures on ext3,
> > but we found that was leading to corruption in certain cases (because we
> > couldn't set our internal metadata), so the next release will assert/crash
> > in that case (fail-stop instead of fail-maybe-eventually-corrupt).
> >
> > XFS does not have an xattr size limit and thus does not have this problem.
> >
> > 2- The other problem is with OSD journal replay of non-idempotent
> > transactions. On non-btrfs backends, the Ceph OSDs use a write-ahead
> > journal. After restart, the OSD does not know exactly which transactions
> > in the journal may have already been committed to disk, and may reapply a
> > transaction again during replay. For most operations (write, delete,
> > truncate) this is fine.
> >
> > Some operations, though, are non-idempotent. The simplest example is
> > CLONE, which copies (efficiently, on btrfs) data from one object to
> > another. If the source object is modified, the osd restarts, and then
> > the clone is replayed, the target will get incorrect (newer) data. For
> > example,
> >
> > 1- clone A -> B
> > 2- modify A
> > <osd crash, replay from 1>
> >
> > B will get new instead of old contents.
> >
> > (This doesn't happen on btrfs because the snapshots allow us to replay
> > from a known consistent point in time.)
> >
> > For things like clone, skipping the operation if the target exists almost
> > works, except for cases like
> >
> > 1- clone A -> B
> > 2- modify A
> > ...
> > 3- delete B
> > <osd crash, replay from 1>
> >
> > (Although in that example who cares if B had bad data; it was removed
> > anyway.) The larger problem, though, is that this doesn't always work;
> > CLONERANGE copies a range of a file from A to B, where B may already
> > exist.
> >
> > In practice, the higher level interfaces don't make full use of the
> > low-level interface, so it's possible some solution exists that carefully
> > avoids the problem with a partial solution in the lower layer. This makes
> > me nervous, though, as it is easy to break.
> >
> > Another possibility:
> >
> > - on non-btrfs, we set an xattr on every modified object with the
> > op_seq, the unique sequence number for the transaction.
> > - for any (potentially) non-idempotent operation, we fsync() before
> > continuing to the next transaction, to ensure that xattr hits disk.
> > - on replay, we skip a transaction if the xattr indicates we already
> > performed this transaction.
> >
> > Because every 'transaction' only modifies a single object (file),
> > this ought to work. It'll make things like clone slow, but let's face it:
> > they're already slow on non-btrfs file systems because they actually copy
> > the data (instead of duplicating the extent refs in btrfs). And it should
> > make the full ObjectStore interface safe, without upper layers having to
> > worry about the kinds and orders of transactions they perform.
> >
> > Other ideas?
> >
> > This issue is tracked at http://tracker.newdream.net/issues/213.
> >
> > sage
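A minimal sketch of that last idea, for concreteness. This is not Ceph's actual FileStore code; the xattr name and helper functions below are made up for illustration. Each object carries an xattr recording the op_seq of the last transaction applied to it, the xattr is written and fsync()ed after any non-idempotent operation, and replay consults it before re-applying a transaction.

/* Sketch of the op_seq replay guard described above (not Ceph's actual
 * FileStore code; the xattr name and helpers are hypothetical). */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/xattr.h>

#define OP_SEQ_XATTR "user.ceph.op_seq"   /* hypothetical name */

/* Returns the last op_seq recorded on the object, or 0 if none. */
static uint64_t get_applied_seq(const char *path)
{
	char buf[32] = { 0 };
	ssize_t n = getxattr(path, OP_SEQ_XATTR, buf, sizeof(buf) - 1);

	if (n <= 0)
		return 0;
	return strtoull(buf, NULL, 10);
}

/* After a non-idempotent op (clone, clonerange), record the op_seq on the
 * target and fsync so the marker is durable before the next transaction. */
static int mark_applied(const char *path, int fd, uint64_t op_seq)
{
	char buf[32];

	snprintf(buf, sizeof(buf), "%llu", (unsigned long long)op_seq);
	if (setxattr(path, OP_SEQ_XATTR, buf, strlen(buf), 0) < 0)
		return -1;
	return fsync(fd);
}

/* On journal replay: skip the transaction if it was already applied. */
static int should_replay(const char *path, uint64_t op_seq)
{
	return op_seq > get_applied_seq(path);
}

Only the non-idempotent operations would pay the extra fsync(), which matches the expectation above that clone gets slower while ordinary writes are unaffected.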
Josef Bacik
2011-Oct-24 19:51 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> [adding linux-btrfs to cc]
>
> Josef, Chris, any ideas on the below issues?
>
> On Mon, 24 Oct 2011, Christian Brunner wrote:
> > Thanks for explaining this. I don't have any objections against btrfs
> > as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > scare me, since I can use the ceph replication to recover a lost
> > btrfs-filesystem. The only problem I have is that btrfs is not stable
> > on our side and I wonder what you are doing to make it work. (Maybe
> > it's related to the load pattern of using ceph as a backend store for
> > qemu).
> >
> > Here is a list of the btrfs problems I'm having:
> >
> > - When I run ceph with the default configuration (btrfs snaps enabled)
> > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > Btrfs-cleaner is using more and more time in
> > btrfs_clean_old_snapshots().
>
> In theory, there shouldn't be any significant difference between taking a
> snapshot and removing it a few commits later, and the prior root refs that
> btrfs holds on to internally until the new commit is complete. That's
> clearly not quite the case, though.
>
> In any case, we're going to try to reproduce this issue in our
> environment.
>

I've noticed this problem too, clean_old_snapshots is taking quite a while in
cases where it really shouldn't. I will see if I can come up with a reproducer
that doesn't require setting up ceph ;).

> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
>
> FYI in this scenario you're exposed to the same journal replay issues that
> ext4 and XFS are. The btrfs workload that ceph is generating will also
> not be all that special, though, so this problem shouldn't be unique to
> ceph.
>

Can you get sysrq+w when this happens? I'd like to see what btrfs-endio-write
is up to.

> > Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> > from time to time. Maybe it's related to the performance issues, but I
> > haven't been able to verify this.
>
> I haven't seen this yet with the latest stuff from Josef, but others have.
> Josef, is there any information we can provide to help track it down?
>

Actually this would show up in 2 cases; I fixed the one most people hit with
my earlier stuff and then fixed the other one more recently, so hopefully it
will be fixed in 3.2. A full backtrace would be nice so I can figure out which
one it is you are hitting.

> > It's really sad to see that ceph performance and stability are
> > suffering that much from the underlying filesystems and that this
> > hasn't changed over the last months.
>
> We don't have anyone internally working on btrfs at the moment, and are
> still struggling to hire experienced kernel/fs people. Josef has been
> very helpful with tracking these issues down, but he has responsibilities
> beyond just the Ceph-related issues. Progress is slow, but we are
> working on it!

I'm open to offers ;). These things are being hit by people all over the
place, but it's hard for me to reproduce, especially since most of the reports
are "run X server for Y days and wait for it to start sucking."

I will try and get a box set up that I can let stress.sh run on for a few days
to see if I can make some of this stuff come out to play with me, but
unfortunately I end up having to debug these kinds of things over email, which
means they get a whole lot of nowhere.

Thanks,

Josef
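For anyone trying to capture the dump Josef is asking for: SysRq-W writes the stack traces of blocked (uninterruptible) tasks to the kernel log, so the state of btrfs-endio-write should appear in dmesg afterwards. A small sketch follows; it assumes sysrq is enabled via /proc/sys/kernel/sysrq and root privileges, and is simply the programmatic equivalent of "echo w > /proc/sysrq-trigger".

/* Sketch: trigger a SysRq-W dump of blocked tasks, then check dmesg.
 * Equivalent to running `echo w > /proc/sysrq-trigger` as root. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sysrq-trigger", "w");

	if (!f) {
		perror("open /proc/sysrq-trigger");
		return 1;
	}
	fputc('w', f);	/* 'w' = dump tasks in uninterruptible (D) sleep */
	fclose(f);
	return 0;
}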
Chris Mason
2011-Oct-24 20:35 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > [adding linux-btrfs to cc]
> >
> > Josef, Chris, any ideas on the below issues?
> >
> > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > > Thanks for explaining this. I don't have any objections against btrfs
> > > as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > > scare me, since I can use the ceph replication to recover a lost
> > > btrfs-filesystem. The only problem I have is that btrfs is not stable
> > > on our side and I wonder what you are doing to make it work. (Maybe
> > > it's related to the load pattern of using ceph as a backend store for
> > > qemu).
> > >
> > > Here is a list of the btrfs problems I'm having:
> > >
> > > - When I run ceph with the default configuration (btrfs snaps enabled)
> > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > > Btrfs-cleaner is using more and more time in
> > > btrfs_clean_old_snapshots().
> >
> > In theory, there shouldn't be any significant difference between taking a
> > snapshot and removing it a few commits later, and the prior root refs that
> > btrfs holds on to internally until the new commit is complete. That's
> > clearly not quite the case, though.
> >
> > In any case, we're going to try to reproduce this issue in our
> > environment.
> >
>
> I've noticed this problem too, clean_old_snapshots is taking quite a while in
> cases where it really shouldn't. I will see if I can come up with a reproducer
> that doesn't require setting up ceph ;).

This sounds familiar though, I thought we had fixed a similar
regression. Either way, Arne's readahead code should really help.

Which kernel version were you running?

[ ack on the rest of Josef's comments ]

-chris
Christian Brunner
2011-Oct-24 21:34 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/24 Chris Mason <chris.mason@oracle.com>:
> On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > > > [...]
> > > > - When I run ceph with the default configuration (btrfs snaps enabled)
> > > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > > > Btrfs-cleaner is using more and more time in
> > > > btrfs_clean_old_snapshots().
> > >
> > > [...]
> >
> > I've noticed this problem too, clean_old_snapshots is taking quite a while in
> > cases where it really shouldn't. I will see if I can come up with a reproducer
> > that doesn't require setting up ceph ;).
>
> This sounds familiar though, I thought we had fixed a similar
> regression. Either way, Arne's readahead code should really help.
>
> Which kernel version were you running?
>
> [ ack on the rest of Josef's comments ]

This was with a 3.0 kernel, including all btrfs patches from Josef's
git repo plus the "use the global reserve when truncating the free
space cache inode" patch.

I'll try the readahead code.

Thanks,
Christian
Arne Jansen
2011-Oct-24 21:37 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On 24.10.2011 23:34, Christian Brunner wrote:
> 2011/10/24 Chris Mason <chris.mason@oracle.com>:
> > [...]
> >
> > This sounds familiar though, I thought we had fixed a similar
> > regression. Either way, Arne's readahead code should really help.
> >
> > Which kernel version were you running?
> >
> > [ ack on the rest of Josef's comments ]
>
> This was with a 3.0 kernel, including all btrfs patches from Josef's
> git repo plus the "use the global reserve when truncating the free
> space cache inode" patch.
>
> I'll try the readahead code.

The current readahead code is only used for scrub. I plan to extend it
to snapshot deletion in a next step, but currently I'm afraid it can't
help.

-Arne
Christoph Hellwig
2011-Oct-25 10:23 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
>
> FYI in this scenario you're exposed to the same journal replay issues that
> ext4 and XFS are. The btrfs workload that ceph is generating will also
> not be all that special, though, so this problem shouldn't be unique to
> ceph.

What journal replay issues would ext4 and XFS be exposed to?
Christian Brunner
2011-Oct-25 11:56 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/24 Josef Bacik <josef@redhat.com>:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > [adding linux-btrfs to cc]
> >
> > Josef, Chris, any ideas on the below issues?
> >
> > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > >
> > > - When I run ceph with btrfs snaps disabled, the situation is getting
> > > slightly better. I can run an OSD for about 3 days without problems,
> > > but then again the load increases. This time, I can see that the
> > > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > > than usual.
> >
> > FYI in this scenario you're exposed to the same journal replay issues that
> > ext4 and XFS are. The btrfs workload that ceph is generating will also
> > not be all that special, though, so this problem shouldn't be unique to
> > ceph.
> >
>
> Can you get sysrq+w when this happens? I'd like to see what btrfs-endio-write
> is up to.

Capturing this turns out not to be easy. I have a few traces (see
attachment), but with sysrq+w I do not get a stacktrace of
btrfs-endio-write. What I have is a "latencytop -c" output, which is
interesting:

In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
tries to balance the load over all OSDs, so all filesystems should get
a nearly equal load. At the moment one filesystem seems to have a
problem. When running iostat I see the following:

Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdd        0.00    0.00  0.00    4.33    0.00    53.33     12.31      0.08   19.38  12.23   5.30
sdc        0.00    1.00  0.00  228.33    0.00  1957.33      8.57     74.33  380.76   2.74  62.57
sdb        0.00    0.00  0.00    1.33    0.00    16.00     12.00      0.03   25.00  19.75   2.63
sda        0.00    0.00  0.00    0.67    0.00     8.00     12.00      0.01   19.50  12.50   0.83

The PID of the ceph-osd that is running on sdc is 2053, and when I look
with top I see this process and a btrfs-endio-writer (PID 5447):

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2053 root  20   0  537m 146m 2364 S 33.2  0.6 43:31.24  ceph-osd
 5447 root  20   0     0    0    0 S 22.6  0.0 19:32.18  btrfs-endio-wri

In the latencytop output you can see that those processes have a much
higher latency than the other ceph-osd and btrfs-endio-writers.

Regards,
Christian
Josef Bacik
2011-Oct-25 12:23 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik <josef@redhat.com>:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > > >
> > > > - When I run ceph with btrfs snaps disabled, the situation is getting
> > > > slightly better. I can run an OSD for about 3 days without problems,
> > > > but then again the load increases. This time, I can see that the
> > > > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > > > than usual.
> > >
> > > FYI in this scenario you're exposed to the same journal replay issues that
> > > ext4 and XFS are. The btrfs workload that ceph is generating will also
> > > not be all that special, though, so this problem shouldn't be unique to
> > > ceph.
> > >
> >
> > Can you get sysrq+w when this happens? I'd like to see what btrfs-endio-write
> > is up to.
>
> Capturing this turns out not to be easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output, which is
> interesting:
>
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running iostat I see the following:
>
> Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sdd        0.00    0.00  0.00    4.33    0.00    53.33     12.31      0.08   19.38  12.23   5.30
> sdc        0.00    1.00  0.00  228.33    0.00  1957.33      8.57     74.33  380.76   2.74  62.57
> sdb        0.00    0.00  0.00    1.33    0.00    16.00     12.00      0.03   25.00  19.75   2.63
> sda        0.00    0.00  0.00    0.67    0.00     8.00     12.00      0.01   19.50  12.50   0.83
>
> The PID of the ceph-osd that is running on sdc is 2053, and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 root  20   0  537m 146m 2364 S 33.2  0.6 43:31.24  ceph-osd
>  5447 root  20   0     0    0    0 S 22.6  0.0 19:32.18  btrfs-endio-wri
>
> In the latencytop output you can see that those processes have a much
> higher latency than the other ceph-osd and btrfs-endio-writers.
>

I'm seeing a lot of this

[schedule]          1654.6 msec         96.4 %
    schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
    generic_write_sync blkdev_aio_write do_sync_readv_writev
    do_readv_writev vfs_writev sys_writev system_call_fastpath

where ceph-osd's latency is mostly coming from this fsync of a block device
directly, and not so much from being tied up by btrfs directly. With 22% CPU
being taken up by btrfs-endio-wri we must be doing something wrong. Can you
run perf record -ag when this is going on and then perf report so we can see
what btrfs-endio-wri is doing with the cpu. You can drill down in perf report
to get only what btrfs-endio-wri is doing, so that would be best. As far as
the rest of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing
anything horribly wrong or introducing a lot of latency. Most of it seems to
be when running the delayed refs and having to read in blocks. I've been
suspecting for a while that the delayed ref stuff ends up doing way more work
than it needs to per task, and it's possible that btrfs-endio-wri is simply
getting screwed by other people doing work.

At this point it seems like the biggest problem with latency in ceph-osd is
not related to btrfs; the latency seems to all be from the fact that ceph-osd
is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
like it's blowing a lot of CPU time, so perf record -ag is probably going to
be your best bet when it's using lots of cpu so we can figure out what it's
spinning on.

Thanks,

Josef
Christian Brunner
2011-Oct-25 14:25 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> > [...]
> > In the latencytop output you can see that those processes have a much
> > higher latency than the other ceph-osd and btrfs-endio-writers.
> >
>
> I'm seeing a lot of this
>
> [schedule]          1654.6 msec         96.4 %
>     schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>     generic_write_sync blkdev_aio_write do_sync_readv_writev
>     do_readv_writev vfs_writev sys_writev system_call_fastpath
>
> where ceph-osd's latency is mostly coming from this fsync of a block device
> directly, and not so much from being tied up by btrfs directly. With 22% CPU
> being taken up by btrfs-endio-wri we must be doing something wrong. Can you
> run perf record -ag when this is going on and then perf report so we can see
> what btrfs-endio-wri is doing with the cpu. You can drill down in perf report
> to get only what btrfs-endio-wri is doing, so that would be best. As far as
> the rest of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing
> anything horribly wrong or introducing a lot of latency. Most of it seems to
> be when running the delayed refs and having to read in blocks.
>
> At this point it seems like the biggest problem with latency in ceph-osd is
> not related to btrfs; the latency seems to all be from the fact that ceph-osd
> is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
> like it's blowing a lot of CPU time, so perf record -ag is probably going to
> be your best bet when it's using lots of cpu so we can figure out what it's
> spinning on.

Attached is a perf report. I have included the whole report, so that
you can see the difference between the good and the bad
btrfs-endio-wri.

Thanks,
Christian
Josef Bacik
2011-Oct-25 15:00 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > [...]
> > At this point it seems like the biggest problem with latency in ceph-osd is
> > not related to btrfs; the latency seems to all be from the fact that ceph-osd
> > is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
> > like it's blowing a lot of CPU time, so perf record -ag is probably going to
> > be your best bet when it's using lots of cpu so we can figure out what it's
> > spinning on.
>
> Attached is a perf report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

Oh shit, we're inserting xattrs in endio, that's not good. I'll look more into
this when I get back home, but this is definitely a problem; we're doing a lot
more work in endio than we should.

Thanks,

Josef
Josef Bacik
2011-Oct-25 15:05 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > [...]
> > At this point it seems like the biggest problem with latency in ceph-osd is
> > not related to btrfs; the latency seems to all be from the fact that ceph-osd
> > is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
> > like it's blowing a lot of CPU time, so perf record -ag is probably going to
> > be your best bet when it's using lots of cpu so we can figure out what it's
> > spinning on.
>
> Attached is a perf report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

We also shouldn't be running run_ordered_operations; man, this is screwed up.
Thanks so much for this, I should be able to nail this down pretty easily.

Thanks,

Josef
Christian Brunner
2011-Oct-25 15:13 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > 2011/10/25 Josef Bacik <josef@redhat.com>:
> > > [...]
> > > At this point it seems like the biggest problem with latency in ceph-osd is
> > > not related to btrfs; the latency seems to all be from the fact that ceph-osd
> > > is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
> > > like it's blowing a lot of CPU time, so perf record -ag is probably going to
> > > be your best bet when it's using lots of cpu so we can figure out what it's
> > > spinning on.
> >
> > Attached is a perf report. I have included the whole report, so that
> > you can see the difference between the good and the bad
> > btrfs-endio-wri.
> >
>
> We also shouldn't be running run_ordered_operations; man, this is screwed up.
> Thanks so much for this, I should be able to nail this down pretty easily.

Please note that this is with "btrfs snaps disabled" in the ceph conf.

When I enable snaps our problems get worse (the btrfs-cleaner thing), but
I would be glad if this one thing gets solved. I can run debugging with
snaps enabled if you want, but I would suggest that we do this afterwards.

Thanks,
Christian
Sage Weil
2011-Oct-25 16:23 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Christoph Hellwig wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > > - When I run ceph with btrfs snaps disabled, the situation is getting
> > > slightly better. I can run an OSD for about 3 days without problems,
> > > but then again the load increases. This time, I can see that the
> > > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > > than usual.
> >
> > FYI in this scenario you're exposed to the same journal replay issues that
> > ext4 and XFS are. The btrfs workload that ceph is generating will also
> > not be all that special, though, so this problem shouldn't be unique to
> > ceph.
>
> What journal replay issues would ext4 and XFS be exposed to?

It's the ceph-osd journal replay, not the ext4/XFS journal... the #2 item
in http://marc.info/?l=ceph-devel&m=131942130322957&w=2

sage
Sage Weil
2011-Oct-25 16:36 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Josef Bacik wrote:
> At this point it seems like the biggest problem with latency in ceph-osd
> is not related to btrfs; the latency seems to all be from the fact that
> ceph-osd is fsyncing a block dev for whatever reason.

There is one place where we sync_file_range() on the journal block device,
but that should only happen if directio is disabled (it's on by default).

Christian, have you tweaked those settings in your ceph.conf? It would be
something like 'journal dio = false'. If not, can you verify that
directio shows true when the journal is initialized from your osd log?
E.g.,

2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1

If directio = 1 for you, something else funky is causing those
blkdev_fsync's...

sage
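To illustrate the distinction Sage is drawing, here is a rough sketch; it is not the actual Ceph FileJournal code, and the function names plus the use of fdatasync() in the fallback path are assumptions. With direct I/O the journal writes bypass the page cache, so no separate flush of the block device is needed per entry; without it, the writer has to issue an explicit sync, which is the kind of call that surfaces as blkdev_fsync/blkdev_issue_flush in the traces above.

/* Sketch only -- not Ceph's FileJournal code. Illustrates why
 * "journal dio = true" should avoid the blkdev_fsync path. */
#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

int open_journal(const char *dev, int use_dio)
{
	int flags = O_WRONLY;

	if (use_dio)
		flags |= O_DIRECT | O_DSYNC;	/* writes go straight to the device */
	return open(dev, flags);
}

/* buf must be block-size aligned (e.g. via posix_memalign) when O_DIRECT
 * is in use. */
int write_journal_entry(int fd, int use_dio, const void *buf, size_t len, off_t off)
{
	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return -1;
	if (!use_dio) {
		/* Buffered fallback: force the dirty pages and the device
		 * write cache out explicitly -- this is where a
		 * blkdev_fsync/blkdev_issue_flush would come from. */
		return fdatasync(fd);
	}
	return 0;
}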
Christian Brunner
2011-Oct-25 19:09 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/25 Sage Weil <sage@newdream.net>:
> On Tue, 25 Oct 2011, Josef Bacik wrote:
> > At this point it seems like the biggest problem with latency in ceph-osd
> > is not related to btrfs; the latency seems to all be from the fact that
> > ceph-osd is fsyncing a block dev for whatever reason.
>
> There is one place where we sync_file_range() on the journal block device,
> but that should only happen if directio is disabled (it's on by default).
>
> Christian, have you tweaked those settings in your ceph.conf? It would be
> something like 'journal dio = false'. If not, can you verify that
> directio shows true when the journal is initialized from your osd log?
> E.g.,
>
> 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>
> If directio = 1 for you, something else funky is causing those
> blkdev_fsync's...

I've looked it up in the logs - directio is 1:

Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
/dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
bytes, directio = 1

Regards,
Christian
Chris Mason
2011-Oct-25 20:15 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> >
> > Attached is a perf report. I have included the whole report, so that
> > you can see the difference between the good and the bad
> > btrfs-endio-wri.
> >
>
> We also shouldn't be running run_ordered_operations; man, this is screwed up.
> Thanks so much for this, I should be able to nail this down pretty easily.

Looks like we're getting there from reserve_metadata_bytes when we join
the transaction?

-chris
Josef Bacik
2011-Oct-25 20:22 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > >
> > > Attached is a perf report. I have included the whole report, so that
> > > you can see the difference between the good and the bad
> > > btrfs-endio-wri.
> > >
> >
> > We also shouldn't be running run_ordered_operations; man, this is screwed up.
> > Thanks so much for this, I should be able to nail this down pretty easily.
>
> Looks like we're getting there from reserve_metadata_bytes when we join
> the transaction?
>

We don't do reservations in the endio stuff; we assume you've reserved all the
space you need in delalloc, plus we would have seen reserve_metadata_bytes in
the trace. Though it does look like perf is lying to us in at least one case,
since btrfs_alloc_logged_file_extent is only called from log replay and not
during normal runtime, so it definitely shouldn't be showing up.

Thanks,

Josef
Sage Weil
2011-Oct-25 22:27 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Christian Brunner wrote:
> 2011/10/25 Sage Weil <sage@newdream.net>:
> > On Tue, 25 Oct 2011, Josef Bacik wrote:
> > > At this point it seems like the biggest problem with latency in ceph-osd
> > > is not related to btrfs; the latency seems to all be from the fact that
> > > ceph-osd is fsyncing a block dev for whatever reason.
> >
> > There is one place where we sync_file_range() on the journal block device,
> > but that should only happen if directio is disabled (it's on by default).
> >
> > Christian, have you tweaked those settings in your ceph.conf? It would be
> > something like 'journal dio = false'. If not, can you verify that
> > directio shows true when the journal is initialized from your osd log?
> > E.g.,
> >
> > 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
> >
> > If directio = 1 for you, something else funky is causing those
> > blkdev_fsync's...
>
> I've looked it up in the logs - directio is 1:
>
> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
> bytes, directio = 1

Do you mind capturing an strace? I'd like to see where that blkdev_fsync
is coming from.

thanks!
sage
Christian Brunner
2011-Oct-26 00:16 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > >
> > > > Attached is a perf report. I have included the whole report, so that
> > > > you can see the difference between the good and the bad
> > > > btrfs-endio-wri.
> > > >
> > >
> > > We also shouldn't be running run_ordered_operations; man, this is screwed up.
> > > Thanks so much for this, I should be able to nail this down pretty easily.
> >
> > Looks like we're getting there from reserve_metadata_bytes when we join
> > the transaction?
> >
>
> We don't do reservations in the endio stuff; we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace. Though it does look like perf is lying to us in at least one case,
> since btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.

Strange! I'll check if symbols got messed up in the report tomorrow.

Christian
Christian Brunner
2011-Oct-26 08:21 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/26 Christian Brunner <chb@muc.de>:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > [...]
> > We don't do reservations in the endio stuff; we assume you've reserved all the
> > space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> > the trace. Though it does look like perf is lying to us in at least one case,
> > since btrfs_alloc_logged_file_extent is only called from log replay and not
> > during normal runtime, so it definitely shouldn't be showing up.
>
> Strange! I'll check if symbols got messed up in the report tomorrow.

I've checked this now: except for the missing symbols for the iomemory_vsl
module, everything is looking normal. I've also run the report on another
OSD again, but the results look quite similar.

Regards,
Christian

PS: This is what perf report -v is saying...

build id event received for [kernel.kallsyms]: 805ca93f4057cc0c8f53b061a849b3f847f2de40
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/fs/btrfs/btrfs.ko: 64a723e05af3908fb9593f4a3401d6563cb1a01b
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/lib/libcrc32c.ko: b1391be8d33b54b6de20e07b7f2ee8d777fc09d2
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/bonding/bonding.ko: 663392df0f407211ab8f9527c482d54fce890c5e
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/scsi/hpsa.ko: 676eecffd476aef1b0f2f8c1bf8c8e6120d369c9
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko: db7c200894b27e71ae6fe5cf7adaebf787c90da9
build id event received for [iomemory_vsl]: 4ed417c9a815e6bbe77a1656bceda95d9f06cb13
build id event received for /lib64/libc-2.12.so: 2ab28d41242ede641418966ef08f9aacffd9e8c7
build id event received for /lib64/libpthread-2.12.so: c177389a6f119b3883ea0b3c33cb04df3f8e5cc7
build id event received for /sbin/rsyslogd: 1372ef1e2ec550967fe20d0bdddbc0aab0bb36dc
build id event received for /lib64/libglib-2.0.so.0.2200.5: d880be15bf992b5fbcc629e6bbf1c747a928ddd5
build id event received for /usr/sbin/irqbalance: 842de64f46ca9fde55efa29a793c08b197d58354
build id event received for /lib64/libm-2.12.so: 46ac89195918407d2937bd1450c0ec99c8d41a2a
build id event received for /usr/bin/ceph-osd: 9fcb36e020c49fc49171b4c88bd784b38eb0675b
build id event received for /usr/lib64/libstdc++.so.6.0.13: d1b2ca4e1ec8f81ba820e5f1375d960107ac7e50
build id event received for /usr/lib64/libtcmalloc.so.0.2.0: 02766551b2eb5a453f003daee0c5fc9cd176e831
Looking at the vmlinux_path (6 entries long)
dso__load_sym: cannot get elf header.
Using /proc/kallsyms for symbols
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id 4ed417c9a815e6bbe77a1656bceda95d9f06cb13 was found
[iomemory_vsl] with build id 4ed417c9a815e6bbe77a1656bceda95d9f06cb13 not found, continuing without symbols
Chris Mason
2011-Oct-26 13:23 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > >
> > > > Attached is a perf report. I have included the whole report, so that
> > > > you can see the difference between the good and the bad
> > > > btrfs-endio-wri.
> > > >
> > >
> > > We also shouldn't be running run_ordered_operations; man, this is screwed up.
> > > Thanks so much for this, I should be able to nail this down pretty easily.
> >
> > Looks like we're getting there from reserve_metadata_bytes when we join
> > the transaction?
> >
>
> We don't do reservations in the endio stuff; we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace. Though it does look like perf is lying to us in at least one case,
> since btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.

Whoops, I should have read that num_items > 0 check harder.

btrfs_end_transaction is doing it by setting ->blocked = 1

	if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
	    should_end_transaction(trans, root)) {
		trans->transaction->blocked = 1;
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
		smp_wmb();
	}

	if (lock && cur_trans->blocked && !cur_trans->in_commit) {
	            ^^^^^^^^^^^^^^^^^^^
		if (throttle) {
			/*
			 * We may race with somebody else here so end up having
			 * to call end_transaction on ourselves again, so inc
			 * our use_count.
			 */
			trans->use_count++;
			return btrfs_commit_transaction(trans, root);
		} else {
			wake_up_process(info->transaction_kthread);
		}
	}

perf is definitely lying a little bit about the trace ;)

-chris
Josef Bacik
2011-Oct-27 15:07 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote:> On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote: > > On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote: > > > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote: > > > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote: > > > > > > > > > > Attached is a perf-report. I have included the whole report, so that > > > > > you can see the difference between the good and the bad > > > > > btrfs-endio-wri. > > > > > > > > > > > > > We also shouldn''t be running run_ordered_operations, man this is screwed up, > > > > thanks so much for this, I should be able to nail this down pretty easily. > > > > Thanks, > > > > > > Looks like we''re getting there from reserve_metadata_bytes when we join > > > the transaction? > > > > > > > We don''t do reservations in the endio stuff, we assume you''ve reserved all the > > space you need in delalloc, plus we would have seen reserve_metadata_bytes in > > the trace. Though it does look like perf is lying to us in at least one case > > sicne btrfs_alloc_logged_file_extent is only called from log replay and not > > during normal runtime, so it definitely shouldn''t be showing up. Thanks, > > Whoops, I should have read that num_items > 0 check harder. > > btrfs_end_transaction is doing it by setting ->blocked = 1 > > if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) && > should_end_transaction(trans, root)) { > trans->transaction->blocked = 1; > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > smp_wmb(); > } > > if (lock && cur_trans->blocked && !cur_trans->in_commit) { > ^^^^^^^^^^^^^^^^^^^ > if (throttle) { > /* > * We may race with somebody else here so end up having > * to call end_transaction on ourselves again, so inc > * our use_count. > */ > trans->use_count++; > return btrfs_commit_transaction(trans, root); > } else { > wake_up_process(info->transaction_kthread); > } > } >Not sure what you are getting at here? Even if we set blocked, we''re not throttling so it will just wake up the transaction kthread, so we won''t do the commit in the endio case. Thanks Josef -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
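To make the throttle distinction concrete, here is a small standalone sketch. This is not btrfs code: the struct and function are simplified stand-ins invented for the example, and it only mirrors the branch quoted above. With ->blocked set but throttle clear, the end-transaction path merely wakes the transaction kthread; it never commits (and therefore never runs the ordered operations) from the endio context.

	/*
	 * Standalone sketch, not btrfs code: toy_trans/toy_end_transaction are
	 * simplified stand-ins for btrfs_trans_handle/__btrfs_end_transaction,
	 * kept only to show what the throttle flag changes.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct toy_trans {
		bool blocked;
		bool in_commit;
	};

	static const char *toy_end_transaction(struct toy_trans *t, bool throttle)
	{
		if (t->blocked && !t->in_commit) {
			if (throttle)
				return "commit transaction (runs ordered operations)";
			return "wake transaction kthread only";
		}
		return "plain end of transaction";
	}

	int main(void)
	{
		struct toy_trans t = { .blocked = true, .in_commit = false };

		/* endio-style caller: no throttling */
		printf("endio path    : %s\n", toy_end_transaction(&t, false));
		/* throttled caller: may commit itself */
		printf("throttled path: %s\n", toy_end_transaction(&t, true));
		return 0;
	}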
Josef Bacik
2011-Oct-27 18:14 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Thu, Oct 27, 2011 at 11:07:38AM -0400, Josef Bacik wrote:> On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote: > > On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote: > > > On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote: > > > > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote: > > > > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote: > > > > > > > > > > > > Attached is a perf-report. I have included the whole report, so that > > > > > > you can see the difference between the good and the bad > > > > > > btrfs-endio-wri. > > > > > > > > > > > > > > > > We also shouldn''t be running run_ordered_operations, man this is screwed up, > > > > > thanks so much for this, I should be able to nail this down pretty easily. > > > > > Thanks, > > > > > > > > Looks like we''re getting there from reserve_metadata_bytes when we join > > > > the transaction? > > > > > > > > > > We don''t do reservations in the endio stuff, we assume you''ve reserved all the > > > space you need in delalloc, plus we would have seen reserve_metadata_bytes in > > > the trace. Though it does look like perf is lying to us in at least one case > > > sicne btrfs_alloc_logged_file_extent is only called from log replay and not > > > during normal runtime, so it definitely shouldn''t be showing up. Thanks, > > > > Whoops, I should have read that num_items > 0 check harder. > > > > btrfs_end_transaction is doing it by setting ->blocked = 1 > > > > if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) && > > should_end_transaction(trans, root)) { > > trans->transaction->blocked = 1; > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > smp_wmb(); > > } > > > > if (lock && cur_trans->blocked && !cur_trans->in_commit) { > > ^^^^^^^^^^^^^^^^^^^ > > if (throttle) { > > /* > > * We may race with somebody else here so end up having > > * to call end_transaction on ourselves again, so inc > > * our use_count. > > */ > > trans->use_count++; > > return btrfs_commit_transaction(trans, root); > > } else { > > wake_up_process(info->transaction_kthread); > > } > > } > > > > Not sure what you are getting at here? Even if we set blocked, we''re not > throttling so it will just wake up the transaction kthread, so we won''t do the > commit in the endio case. Thanks >Oh I see what you were trying to say, that we''d set blocking and then commit the transaction from the endio process which would run ordered operations, but since throttle isn''t set that won''t happen. I think that the perf symbols are just lying to us. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Oct-27 19:52 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik <josef@redhat.com>:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are. The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens? I'd like to see what btrfs-endio-write
> > is up to.
>
> Capturing this seems not to be easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stack trace of
> btrfs-endio-write. What I do have is a "latencytop -c" output, which is
> interesting:
>
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following:
>
> Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sdd        0.00    0.00  0.00    4.33    0.00    53.33    12.31     0.08   19.38  12.23   5.30
> sdc        0.00    1.00  0.00  228.33    0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> sdb        0.00    0.00  0.00    1.33    0.00    16.00    12.00     0.03   25.00  19.75   2.63
> sda        0.00    0.00  0.00    0.67    0.00     8.00    12.00     0.01   19.50  12.50   0.83
>
> The PID of the ceph-osd that is running on sdc is 2053, and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
> 5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
>
> In the latencytop output you can see that those processes have a much
> higher latency than the other ceph-osd and btrfs-endio-writers.
>
> Regards,
> Christian

Ok, just a shot in the dark, but could you give this a whirl and see if it
helps you?
Thanks,

Josef

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 125cf76..fbc196e 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -210,9 +210,9 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-			   struct list_head *cluster, u64 start)
+			   struct list_head *cluster, u64 start, unsigned long max_count)
 {
-	int count = 0;
+	unsigned long count = 0;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct rb_node *node;
 	struct btrfs_delayed_ref_node *ref;
@@ -242,7 +242,7 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
 		node = rb_first(&delayed_refs->root);
 	}
 again:
-	while (node && count < 32) {
+	while (node && count < max_count) {
 		ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
 		if (btrfs_delayed_ref_is_head(ref)) {
 			head = btrfs_delayed_node_to_head(ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index e287e3b..b15a6ad 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -169,7 +169,8 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr);
 int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 			   struct btrfs_delayed_ref_head *head);
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-			   struct list_head *cluster, u64 search_start);
+			   struct list_head *cluster, u64 search_start,
+			   unsigned long max_count);
 /*
  * a node might live in a head or a regular ref, this lets you
  * test for the proper type to use.
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 31d84e7..c190282 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -81,6 +81,7 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
 	u32 data_size;
 
 	BUG_ON(name_len + data_len > BTRFS_MAX_XATTR_SIZE(root));
+	WARN_ON(trans->endio);
 
 	key.objectid = objectid;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4eb7d2b..0977a10 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2295,7 +2295,7 @@ again:
 		 * lock
 		 */
 		ret = btrfs_find_ref_cluster(trans, &cluster,
-					     delayed_refs->run_delayed_start);
+					     delayed_refs->run_delayed_start, count);
 		if (ret)
 			break;
 
@@ -2338,7 +2338,8 @@ again:
 			node = rb_next(node);
 		}
 		spin_unlock(&delayed_refs->lock);
-		schedule_timeout(1);
+		if (need_resched())
+			schedule_timeout(1);
 		goto again;
 	}
 out:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f12747c..73a5e66 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1752,6 +1752,7 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end)
 	else
 		trans = btrfs_join_transaction(root);
 	BUG_ON(IS_ERR(trans));
+	trans->endio = 1;
 	trans->block_rsv = &root->fs_info->delalloc_block_rsv;
 
 	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
@@ -2057,8 +2058,11 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
 	LIST_HEAD(list);
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct delayed_iput *delayed;
+	struct btrfs_trans_handle *trans;
 	int empty;
 
+	trans = current->journal_info;
+	WARN_ON(trans && trans->endio);
 	spin_lock(&fs_info->delayed_iput_lock);
 	empty = list_empty(&fs_info->delayed_iputs);
 	spin_unlock(&fs_info->delayed_iput_lock);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a1c9404..ab68cfa 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -527,12 +527,15 @@ int btrfs_wait_ordered_extents(struct btrfs_root *root,
  */
 int btrfs_run_ordered_operations(struct btrfs_root *root, int wait)
 {
+	struct btrfs_trans_handle *trans;
 	struct btrfs_inode *btrfs_inode;
 	struct inode *inode;
 	struct list_head splice;
 
+	trans = (struct btrfs_trans_handle *)current->journal_info;
 	INIT_LIST_HEAD(&splice);
+	WARN_ON(trans && trans->endio);
 
 	mutex_lock(&root->fs_info->ordered_operations_mutex);
 	spin_lock(&root->fs_info->ordered_extent_lock);
 again:
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 29bef63..009d2db 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -310,6 +310,7 @@ again:
 	h->use_count = 1;
 	h->block_rsv = NULL;
 	h->orig_rsv = NULL;
+	h->endio = 0;
 	smp_mb();
 
 	if (cur_trans->blocked && may_wait_transaction(root, type)) {
@@ -467,20 +468,17 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	while (count < 4) {
 		unsigned long cur = trans->delayed_ref_updates;
 		trans->delayed_ref_updates = 0;
-		if (cur &&
-		    trans->transaction->delayed_refs.num_heads_ready > 64) {
-			trans->delayed_ref_updates = 0;
-
-			/*
-			 * do a full flush if the transaction is trying
-			 * to close
-			 */
-			if (trans->transaction->delayed_refs.flushing)
-				cur = 0;
-			btrfs_run_delayed_refs(trans, root, cur);
-		} else {
+		if (!cur ||
+		    trans->transaction->delayed_refs.num_heads_ready <= 64)
 			break;
-		}
+
+		/*
+		 * do a full flush if the transaction is trying
+		 * to close
+		 */
+		if (trans->transaction->delayed_refs.flushing && throttle)
+			cur = 0;
+		btrfs_run_delayed_refs(trans, root, cur);
 		count++;
 	}
 
@@ -498,6 +496,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 		 * our use_count.
 		 */
 		trans->use_count++;
+		WARN_ON(trans->endio);
 		return btrfs_commit_transaction(trans, root);
 	} else {
 		wake_up_process(info->transaction_kthread);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 02564e6..7eae404 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -55,6 +55,7 @@ struct btrfs_trans_handle {
 	struct btrfs_transaction *transaction;
 	struct btrfs_block_rsv *block_rsv;
 	struct btrfs_block_rsv *orig_rsv;
+	unsigned endio;
 };
 
 struct btrfs_pending_snapshot {
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
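Purely as an illustration of the batching change in the delayed-ref hunks above (this is not btrfs code; the function and numbers below are invented for the example): the caller now supplies its own cap instead of the fixed cluster size of 32, so a short-lived caller such as the endio path can bound how much delayed-ref work it pulls in per pass.

	#include <stdio.h>

	/*
	 * Toy model of the max_count change: process at most `max_count`
	 * pending items per pass instead of a hard-coded batch of 32.
	 */
	static unsigned long process_refs(unsigned long pending, unsigned long max_count)
	{
		unsigned long done = 0;

		while (pending && done < max_count) {	/* was: done < 32 */
			pending--;
			done++;
		}
		return done;
	}

	int main(void)
	{
		unsigned long pending = 1000;

		printf("capped caller (64): processed %lu\n", process_refs(pending, 64));
		printf("uncapped caller   : processed %lu\n", process_refs(pending, (unsigned long)-1));
		return 0;
	}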
Christian Brunner
2011-Oct-27 20:39 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/27 Josef Bacik <josef@redhat.com>:> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote: >> 2011/10/24 Josef Bacik <josef@redhat.com>: >> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote: >> >> [adding linux-btrfs to cc] >> >> >> >> Josef, Chris, any ideas on the below issues? >> >> >> >> On Mon, 24 Oct 2011, Christian Brunner wrote: >> >> > >> >> > - When I run ceph with btrfs snaps disabled, the situation is getting >> >> > slightly better. I can run an OSD for about 3 days without problems, >> >> > but then again the load increases. This time, I can see that the >> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work >> >> > than usual. >> >> >> >> FYI in this scenario you''re exposed to the same journal replay issues that >> >> ext4 and XFS are. The btrfs workload that ceph is generating will also >> >> not be all that special, though, so this problem shouldn''t be unique to >> >> ceph. >> >> >> > >> > Can you get sysrq+w when this happens? I''d like to see what btrfs-endio-write >> > is up to. >> >> Capturing this seems to be not easy. I have a few traces (see >> attachment), but with sysrq+w I do not get a stacktrace of >> btrfs-endio-write. What I have is a "latencytop -c" output which is >> interesting: >> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph >> tries to balance the load over all OSDs, so all filesystems should get >> an nearly equal load. At the moment one filesystem seems to have a >> problem. When running with iostat I see the following >> >> Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s >> avgrq-sz avgqu-sz await svctm %util >> sdd 0.00 0.00 0.00 4.33 0.00 53.33 >> 12.31 0.08 19.38 12.23 5.30 >> sdc 0.00 1.00 0.00 228.33 0.00 1957.33 >> 8.57 74.33 380.76 2.74 62.57 >> sdb 0.00 0.00 0.00 1.33 0.00 16.00 >> 12.00 0.03 25.00 19.75 2.63 >> sda 0.00 0.00 0.00 0.67 0.00 8.00 >> 12.00 0.01 19.50 12.50 0.83 >> >> The PID of the ceph-osd taht is running on sdc is 2053 and when I look >> with top I see this process and a btrfs-endio-writer (PID 5447): >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 2053 root 20 0 537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd >> 5447 root 20 0 0 0 0 S 22.6 0.0 19:32.18 btrfs-endio-wri >> >> In the latencytop output you can see that those processes have a much >> higher latency, than the other ceph-osd and btrfs-endio-writers. >> >> Regards, >> Christian > > Ok just a shot in the dark, but could you give this a whirl and see if it helps > you? ThanksThanks for the patch! I''ll install it tomorrow and I think that I can report back on Monday. It always takes a few days until the load goes up. Regards, Christian -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christian Brunner
2011-Oct-31 10:25 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/31 Christian Brunner <chb@muc.de>:
> 2011/10/31 Christian Brunner <chb@muc.de>:
>>
>> The patch didn't hurt, but I have to tell you that I'm still seeing the
>> same old problems. Load is going up again:
>>
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 5502 root      20   0     0    0    0 S 52.5  0.0 106:29.97 btrfs-endio-wri
>> 1976 root      20   0  601m 211m 1464 S 28.3  0.9 115:10.62 ceph-osd
>>
>> And I have hit our warning again:
>>
>> [223560.970713] ------------[ cut here ]------------
>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>> [223560.985411] Hardware name: ProLiant DL180 G6
>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs [last unloaded: scsi_wait_scan]
>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P 3.0.6-1.fits.9.el6.x86_64 #1
>> [223561.023874] Call Trace:
>> [223561.026738] [<ffffffff8106344f>] warn_slowpath_common+0x7f/0xc0
>> [223561.033564] [<ffffffff810634aa>] warn_slowpath_null+0x1a/0x20
>> [223561.040272] [<ffffffffa0282120>] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
>> [223561.048278] [<ffffffffa027ce55>] commit_fs_roots+0xc5/0x1b0 [btrfs]
>> [223561.055534] [<ffffffff8154c231>] ? mutex_lock+0x31/0x60
>> [223561.061666] [<ffffffffa027ddbe>] btrfs_commit_transaction+0x3ce/0x820 [btrfs]
>> [223561.069876] [<ffffffffa027d1b8>] ? wait_current_trans+0x28/0x110 [btrfs]
>> [223561.077582] [<ffffffffa027e325>] ? join_transaction+0x25/0x250 [btrfs]
>> [223561.085065] [<ffffffff81086410>] ? wake_up_bit+0x40/0x40
>> [223561.091251] [<ffffffffa025a329>] btrfs_sync_fs+0x59/0xd0 [btrfs]
>> [223561.098187] [<ffffffffa02abc65>] btrfs_ioctl+0x495/0xd50 [btrfs]
>> [223561.105120] [<ffffffff8125ed20>] ? inode_has_perm+0x30/0x40
>> [223561.111575] [<ffffffff81261a2c>] ? file_has_perm+0xdc/0xf0
>> [223561.117924] [<ffffffff8117086a>] do_vfs_ioctl+0x9a/0x5a0
>> [223561.124072] [<ffffffff81170e11>] sys_ioctl+0xa1/0xb0
>> [223561.129842] [<ffffffff81555702>] system_call_fastpath+0x16/0x1b
>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
>
> [ Not sending this to the lists, as the attachment is large ].
>
> I've spent a little time to do some tracing with ftrace. Its output
> seems to be right (at least as far as I can tell). I hope that its
> output can give you an insight into what's going on.
>
> The interesting PIDs in the trace are:
>
> 5502 root      20   0     0    0    0 S 33.6  0.0 118:28.37 btrfs-endio-wri
> 5518 root      20   0     0    0    0 S 29.3  0.0  41:23.58 btrfs-endio-wri
> 8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
> 7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
>

[ adding linux-btrfs again ]

I've been digging into this a bit further:

Attached is another ftrace report that I've filtered for "btrfs_*"
calls and limited to CPU0 (this is where PID 5502 was running).

From what I can see, a lot of time is consumed in
btrfs_reserve_extent(). Is this normal?

Thanks,
Christian
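For anyone wanting to reproduce this kind of capture, the following is only a rough sketch: it assumes root privileges, debugfs mounted at /sys/kernel/debug, and a kernel built with CONFIG_FUNCTION_GRAPH_TRACER; the traced function name is the only input. It records per-call latencies of btrfs_reserve_extent() for a ten-second window via the function_graph tracer.

	#include <stdio.h>
	#include <unistd.h>

	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return -1;
		}
		fputs(val, f);
		fclose(f);
		return 0;
	}

	int main(void)
	{
		const char *tdir = "/sys/kernel/debug/tracing";
		char path[256];

		/* trace only this function (and its children) */
		snprintf(path, sizeof(path), "%s/set_graph_function", tdir);
		if (write_str(path, "btrfs_reserve_extent"))
			return 1;

		snprintf(path, sizeof(path), "%s/current_tracer", tdir);
		if (write_str(path, "function_graph"))
			return 1;

		snprintf(path, sizeof(path), "%s/tracing_on", tdir);
		write_str(path, "1");
		sleep(10);		/* sample window */
		write_str(path, "0");

		printf("per-call durations are now in %s/trace\n", tdir);
		return 0;
	}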
Christian Brunner
2011-Oct-31 13:29 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/31 Christian Brunner <chb@muc.de>:
> 2011/10/31 Christian Brunner <chb@muc.de>:
>> 2011/10/31 Christian Brunner <chb@muc.de>:
>>>
>>> The patch didn't hurt, but I have to tell you that I'm still seeing the
>>> same old problems. Load is going up again:
>>>
>>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 5502 root      20   0     0    0    0 S 52.5  0.0 106:29.97 btrfs-endio-wri
>>> 1976 root      20   0  601m 211m 1464 S 28.3  0.9 115:10.62 ceph-osd
>>>
>>> And I have hit our warning again:
>>>
>>> [223560.970713] ------------[ cut here ]------------
>>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>>> [... same trace as quoted in full above ...]
>>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
>>
>> [ Not sending this to the lists, as the attachment is large ].
>>
>> I've spent a little time to do some tracing with ftrace. Its output
>> seems to be right (at least as far as I can tell). I hope that its
>> output can give you an insight into what's going on.
>>
>> The interesting PIDs in the trace are:
>>
>> 5502 root      20   0     0    0    0 S 33.6  0.0 118:28.37 btrfs-endio-wri
>> 5518 root      20   0     0    0    0 S 29.3  0.0  41:23.58 btrfs-endio-wri
>> 8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
>> 7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
>>
>
> [ adding linux-btrfs again ]
>
> I've been digging into this a bit further:
>
> Attached is another ftrace report that I've filtered for "btrfs_*"
> calls and limited to CPU0 (this is where PID 5502 was running).
>
> From what I can see, a lot of time is consumed in
> btrfs_reserve_extent(). Is this normal?

Sorry for spamming, but in the meantime I'm almost certain that the
problem is inside find_free_extent (called from btrfs_reserve_extent).

When I run ftrace for a sample period of 10s, my system wastes a total
of 4.2 seconds inside find_free_extent(). Each call to find_free_extent()
takes an average of 4 milliseconds to complete. On a recently rebooted
system this is only 1-2 us!

I'm not sure whether the problem appears suddenly or builds up slowly
over time. (At the moment I suspect that it appears suddenly, but I
still have to investigate this.)

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
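As a back-of-the-envelope check of those figures (the constants below are simply the numbers quoted in the mail above, so this is a sanity check rather than a new measurement):

	#include <stdio.h>

	int main(void)
	{
		const double sample_s      = 10.0;	/* ftrace sample window (s)           */
		const double busy_s        = 4.2;	/* total time in find_free_extent (s) */
		const double per_call_ms   = 4.0;	/* degraded average per call (ms)     */
		const double fresh_call_us = 2.0;	/* per call right after a reboot (us) */

		double calls    = busy_s / (per_call_ms / 1000.0);
		double cpu_pct  = 100.0 * busy_s / sample_s;
		double slowdown = (per_call_ms * 1000.0) / fresh_call_us;

		printf("~%.0f calls in the window, ~%.0f%% of one CPU, ~%.0fx slower than fresh\n",
		       calls, cpu_pct, slowdown);
		return 0;
	}

That works out to roughly 1050 calls in the 10 s window, about 42% of one CPU spent in find_free_extent() alone, and a roughly 2000x per-call slowdown compared with a freshly mounted filesystem, which is consistent with the high btrfs-endio-wri CPU usage shown in the top output above.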
Josef Bacik
2011-Oct-31 14:04 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 31, 2011 at 02:29:44PM +0100, Christian Brunner wrote:> 2011/10/31 Christian Brunner <chb@muc.de>: > > 2011/10/31 Christian Brunner <chb@muc.de>: > >> 2011/10/31 Christian Brunner <chb@muc.de>: > >>> > >>> The patch didn''t hurt, but I''ve to tell you that I''m still seeing the > >>> same old problems. Load is going up again: > >>> > >>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >>> 5502 root 20 0 0 0 0 S 52.5 0.0 106:29.97 btrfs-endio-wri > >>> 1976 root 20 0 601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd > >>> > >>> And I have hit our warning again: > >>> > >>> [223560.970713] ------------[ cut here ]------------ > >>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118 > >>> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]() > >>> [223560.985411] Hardware name: ProLiant DL180 G6 > >>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc > >>> bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support > >>> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs > >>> [last unloaded: scsi_wait_scan] > >>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P > >>> 3.0.6-1.fits.9.el6.x86_64 #1 > >>> [223561.023874] Call Trace: > >>> [223561.026738] [<ffffffff8106344f>] warn_slowpath_common+0x7f/0xc0 > >>> [223561.033564] [<ffffffff810634aa>] warn_slowpath_null+0x1a/0x20 > >>> [223561.040272] [<ffffffffa0282120>] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs] > >>> [223561.048278] [<ffffffffa027ce55>] commit_fs_roots+0xc5/0x1b0 [btrfs] > >>> [223561.055534] [<ffffffff8154c231>] ? mutex_lock+0x31/0x60 > >>> [223561.061666] [<ffffffffa027ddbe>] > >>> btrfs_commit_transaction+0x3ce/0x820 [btrfs] > >>> [223561.069876] [<ffffffffa027d1b8>] ? wait_current_trans+0x28/0x110 [btrfs] > >>> [223561.077582] [<ffffffffa027e325>] ? join_transaction+0x25/0x250 [btrfs] > >>> [223561.085065] [<ffffffff81086410>] ? wake_up_bit+0x40/0x40 > >>> [223561.091251] [<ffffffffa025a329>] btrfs_sync_fs+0x59/0xd0 [btrfs] > >>> [223561.098187] [<ffffffffa02abc65>] btrfs_ioctl+0x495/0xd50 [btrfs] > >>> [223561.105120] [<ffffffff8125ed20>] ? inode_has_perm+0x30/0x40 > >>> [223561.111575] [<ffffffff81261a2c>] ? file_has_perm+0xdc/0xf0 > >>> [223561.117924] [<ffffffff8117086a>] do_vfs_ioctl+0x9a/0x5a0 > >>> [223561.124072] [<ffffffff81170e11>] sys_ioctl+0xa1/0xb0 > >>> [223561.129842] [<ffffffff81555702>] system_call_fastpath+0x16/0x1b > >>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]--- > >> > >> [ Not sending this to the lists, as the attachment is large ]. > >> > >> I''ve spent a little time to do some tracing with ftrace. Its output > >> seems to be right (at least as far as I can tell). I hope that its > >> output can give you an insight on whats going on. > >> > >> The interesting PIDs in the trace are: > >> > >> 5502 root 20 0 0 0 0 S 33.6 0.0 118:28.37 btrfs-endio-wri > >> 5518 root 20 0 0 0 0 S 29.3 0.0 41:23.58 btrfs-endio-wri > >> 8059 root 20 0 400m 48m 2756 S 8.0 0.2 8:31.56 ceph-osd > >> 7993 root 20 0 401m 41m 2808 S 13.6 0.2 7:58.38 ceph-osd > >> > > > > [ adding linux-btrfs again ] > > > > I''ve been digging into this a bit further: > > > > Attached is another ftrace report that I''ve filtered for "btrfs_*" > > calls and limited to CPU0 (this is where PID 5502 was running). > > > > From what I can see there is a lot of time consumed in > > btrfs_reserve_extent(). I this normal? > > Sorry for spamming, but in the meantime I''m almost certain that the > problem is inside find_free_extent (called from btrfs_reserve_extent). 
> > When I''m running ftrace for a sample period of 10s my system is > wasting a total of 4,2 seconds inside find_free_extent(). Each call to > find_free_extent() is taking an average of 4 milliseconds to complete. > On a recently rebooted system this is only 1-2 us! > > I''m not sure if the problem is occurring suddenly or slowly over time. > (At the moment I suspect that its occurring suddenly, but I still have > to investigate this). >Ugh ok then this is lxo''s problem with our clustering stuff taking way too much time. I guess it''s time to actually take a hard look at that code. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html