[adding linux-btrfs to cc]

Josef, Chris, any ideas on the below issues?

On Mon, 24 Oct 2011, Christian Brunner wrote:
> Thanks for explaining this. I don't have any objections against btrfs
> as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> scare me, since I can use the ceph replication to recover a lost
> btrfs-filesystem. The only problem I have is that btrfs is not stable
> on our side and I wonder what you are doing to make it work. (Maybe
> it's related to the load pattern of using ceph as a backend store for
> qemu).
>
> Here is a list of the btrfs problems I'm having:
>
> - When I run ceph with the default configuration (btrfs snaps enabled)
> I can see a rapid increase in Disk-I/O after a few hours of uptime.
> Btrfs-cleaner is using more and more time in
> btrfs_clean_old_snapshots().

In theory, there shouldn't be any significant difference between taking a
snapshot and removing it a few commits later, and the prior root refs that
btrfs holds on to internally until the new commit is complete. That's
clearly not quite the case, though.

In any case, we're going to try to reproduce this issue in our
environment.

> - When I run ceph with btrfs snaps disabled, the situation is getting
> slightly better. I can run an OSD for about 3 days without problems,
> but then again the load increases. This time, I can see that the
> ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> than usual.

FYI in this scenario you're exposed to the same journal replay issues that
ext4 and XFS are. The btrfs workload that ceph is generating will also
not be all that special, though, so this problem shouldn't be unique to
ceph.

> Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> from time to time. Maybe it's related to the performance issues, but I
> haven't been able to verify this.

I haven't seen this yet with the latest stuff from Josef, but others have.
Josef, is there any information we can provide to help track it down?

> It's really sad to see that ceph performance and stability are
> suffering that much from the underlying filesystems and that this
> hasn't changed over the last months.

We don't have anyone internally working on btrfs at the moment, and are
still struggling to hire experienced kernel/fs people. Josef has been
very helpful with tracking these issues down, but he has responsibilities
beyond just the Ceph-related issues. Progress is slow, but we are
working on it!

sage

> Kind regards,
> Christian
>
> 2011/10/24 Sage Weil <sage@newdream.net>:
> > Although running on ext4, xfs, or whatever other non-btrfs you want mostly
> > works, there are a few important remaining issues:
> >
> > 1- ext4 limits total xattrs to 4KB. This can cause problems in some
> > cases, as Ceph uses xattrs extensively. Most of the time we don't hit
> > this. We do hit the limit with radosgw pretty easily, though, and may
> > also hit it in exceptional cases where the OSD cluster is very unhealthy.
> >
> > There is a large xattr patch for ext4 from the Lustre folks that has been
> > floating around for (I think) years. Maybe as interest grows in running
> > Ceph on ext4 this can move upstream.
> >
> > Previously we were being forgiving about large setxattr failures on ext3,
> > but we found that was leading to corruption in certain cases (because we
> > couldn't set our internal metadata), so the next release will assert/crash
> > in that case (fail-stop instead of fail-maybe-eventually-corrupt).
> >
> > XFS does not have an xattr size limit and thus does not have this problem.
> >
> > 2- The other problem is with OSD journal replay of non-idempotent
> > transactions. On non-btrfs backends, the Ceph OSDs use a write-ahead
> > journal. After restart, the OSD does not know exactly which transactions
> > in the journal may have already been committed to disk, and may reapply a
> > transaction again during replay. For most operations (write, delete,
> > truncate) this is fine.
> >
> > Some operations, though, are non-idempotent. The simplest example is
> > CLONE, which copies (efficiently, on btrfs) data from one object to
> > another. If the source object is modified, the osd restarts, and then
> > the clone is replayed, the target will get incorrect (newer) data. For
> > example,
> >
> > 1- clone A -> B
> > 2- modify A
> > <osd crash, replay from 1>
> >
> > B will get new instead of old contents.
> >
> > (This doesn't happen on btrfs because the snapshots allow us to replay
> > from a known consistent point in time.)
> >
> > For things like clone, skipping the operation if the target exists almost
> > works, except for cases like
> >
> > 1- clone A -> B
> > 2- modify A
> > ...
> > 3- delete B
> > <osd crash, replay from 1>
> >
> > (Although in that example who cares if B had bad data; it was removed
> > anyway.) The larger problem, though, is that this doesn't always work;
> > CLONERANGE copies a range of a file from A to B, where B may already
> > exist.
> >
> > In practice, the higher level interfaces don't make full use of the
> > low-level interface, so it's possible some solution exists that carefully
> > avoids the problem with a partial solution in the lower layer. This makes
> > me nervous, though, as it is easy to break.
> >
> > Another possibility:
> >
> > - on non-btrfs, we set an xattr on every modified object with the
> > op_seq, the unique sequence number for the transaction.
> > - for any (potentially) non-idempotent operation, we fsync() before
> > continuing to the next transaction, to ensure that xattr hits disk.
> > - on replay, we skip a transaction if the xattr indicates we already
> > performed this transaction.
> >
> > Because every 'transaction' only modifies a single object (file),
> > this ought to work. It'll make things like clone slow, but let's face it:
> > they're already slow on non-btrfs file systems because they actually copy
> > the data (instead of duplicating the extent refs in btrfs). And it should
> > make the full ObjectStore interface safe, without upper layers having to
> > worry about the kinds and orders of transactions they perform.
> >
> > Other ideas?
> >
> > This issue is tracked at http://tracker.newdream.net/issues/213.
> >
> > sage
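A minimal sketch of that last idea, for concreteness. This is not Ceph's actual FileStore code; the xattr name and helper functions below are made up for illustration. Each object carries an xattr recording the op_seq of the last transaction applied to it, the xattr is written and fsync()ed after any non-idempotent operation, and replay consults it before re-applying a transaction.

/* Sketch of the op_seq replay guard described above (not Ceph's actual
 * FileStore code; the xattr name and helpers are hypothetical). */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/xattr.h>

#define OP_SEQ_XATTR "user.ceph.op_seq"   /* hypothetical name */

/* Returns the last op_seq recorded on the object, or 0 if none. */
static uint64_t get_applied_seq(const char *path)
{
	char buf[32] = { 0 };
	ssize_t n = getxattr(path, OP_SEQ_XATTR, buf, sizeof(buf) - 1);

	if (n <= 0)
		return 0;
	return strtoull(buf, NULL, 10);
}

/* After a non-idempotent op (clone, clonerange), record the op_seq on the
 * target and fsync so the marker is durable before the next transaction. */
static int mark_applied(const char *path, int fd, uint64_t op_seq)
{
	char buf[32];

	snprintf(buf, sizeof(buf), "%llu", (unsigned long long)op_seq);
	if (setxattr(path, OP_SEQ_XATTR, buf, strlen(buf), 0) < 0)
		return -1;
	return fsync(fd);
}

/* On journal replay: skip the transaction if it was already applied. */
static int should_replay(const char *path, uint64_t op_seq)
{
	return op_seq > get_applied_seq(path);
}

Only the non-idempotent operations would pay the extra fsync(), which matches the expectation above that clone gets slower while ordinary writes are unaffected.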
Josef Bacik
2011-Oct-24 19:51 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> [adding linux-btrfs to cc]
>
> Josef, Chris, any ideas on the below issues?
>
> On Mon, 24 Oct 2011, Christian Brunner wrote:
> > Thanks for explaining this. I don't have any objections against btrfs
> > as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > scare me, since I can use the ceph replication to recover a lost
> > btrfs-filesystem. The only problem I have is that btrfs is not stable
> > on our side and I wonder what you are doing to make it work. (Maybe
> > it's related to the load pattern of using ceph as a backend store for
> > qemu).
> >
> > Here is a list of the btrfs problems I'm having:
> >
> > - When I run ceph with the default configuration (btrfs snaps enabled)
> > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > Btrfs-cleaner is using more and more time in
> > btrfs_clean_old_snapshots().
>
> In theory, there shouldn't be any significant difference between taking a
> snapshot and removing it a few commits later, and the prior root refs that
> btrfs holds on to internally until the new commit is complete. That's
> clearly not quite the case, though.
>
> In any case, we're going to try to reproduce this issue in our
> environment.
>

I've noticed this problem too, clean_old_snapshots is taking quite a while in
cases where it really shouldn't. I will see if I can come up with a reproducer
that doesn't require setting up ceph ;).

> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
>
> FYI in this scenario you're exposed to the same journal replay issues that
> ext4 and XFS are. The btrfs workload that ceph is generating will also
> not be all that special, though, so this problem shouldn't be unique to
> ceph.
>

Can you get sysrq+w when this happens? I'd like to see what btrfs-endio-write
is up to.

> > Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> > from time to time. Maybe it's related to the performance issues, but I
> > haven't been able to verify this.
>
> I haven't seen this yet with the latest stuff from Josef, but others have.
> Josef, is there any information we can provide to help track it down?
>

Actually this would show up in 2 cases; I fixed the one most people hit with
my earlier stuff and then fixed the other one more recently, so hopefully it
will be fixed in 3.2. A full backtrace would be nice so I can figure out which
one it is you are hitting.

> > It's really sad to see that ceph performance and stability are
> > suffering that much from the underlying filesystems and that this
> > hasn't changed over the last months.
>
> We don't have anyone internally working on btrfs at the moment, and are
> still struggling to hire experienced kernel/fs people. Josef has been
> very helpful with tracking these issues down, but he has responsibilities
> beyond just the Ceph-related issues. Progress is slow, but we are
> working on it!

I'm open to offers ;). These things are being hit by people all over the
place, but it's hard for me to reproduce, especially since most of the reports
are "run X server for Y days and wait for it to start sucking."

I will try and get a box set up that I can let stress.sh run on for a few days
to see if I can make some of this stuff come out to play with me, but
unfortunately I end up having to debug these kinds of things over email, which
means they get a whole lot of nowhere.

Thanks,

Josef
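For anyone trying to capture the dump Josef is asking for: SysRq-W writes the stack traces of blocked (uninterruptible) tasks to the kernel log, so the state of btrfs-endio-write should appear in dmesg afterwards. A small sketch follows; it assumes sysrq is enabled via /proc/sys/kernel/sysrq and root privileges, and is simply the programmatic equivalent of "echo w > /proc/sysrq-trigger".

/* Sketch: trigger a SysRq-W dump of blocked tasks, then check dmesg.
 * Equivalent to running `echo w > /proc/sysrq-trigger` as root. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sysrq-trigger", "w");

	if (!f) {
		perror("open /proc/sysrq-trigger");
		return 1;
	}
	fputc('w', f);	/* 'w' = dump tasks in uninterruptible (D) sleep */
	fclose(f);
	return 0;
}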
Chris Mason
2011-Oct-24 20:35 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > [adding linux-btrfs to cc]
> >
> > Josef, Chris, any ideas on the below issues?
> >
> > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > > Thanks for explaining this. I don't have any objections against btrfs
> > > as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > > scare me, since I can use the ceph replication to recover a lost
> > > btrfs-filesystem. The only problem I have is that btrfs is not stable
> > > on our side and I wonder what you are doing to make it work. (Maybe
> > > it's related to the load pattern of using ceph as a backend store for
> > > qemu).
> > >
> > > Here is a list of the btrfs problems I'm having:
> > >
> > > - When I run ceph with the default configuration (btrfs snaps enabled)
> > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > > Btrfs-cleaner is using more and more time in
> > > btrfs_clean_old_snapshots().
> >
> > In theory, there shouldn't be any significant difference between taking a
> > snapshot and removing it a few commits later, and the prior root refs that
> > btrfs holds on to internally until the new commit is complete. That's
> > clearly not quite the case, though.
> >
> > In any case, we're going to try to reproduce this issue in our
> > environment.
> >
>
> I've noticed this problem too, clean_old_snapshots is taking quite a while in
> cases where it really shouldn't. I will see if I can come up with a reproducer
> that doesn't require setting up ceph ;).

This sounds familiar though, I thought we had fixed a similar
regression. Either way, Arne's readahead code should really help.

Which kernel version were you running?

[ ack on the rest of Josef's comments ]

-chris
Christian Brunner
2011-Oct-24 21:34 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/24 Chris Mason <chris.mason@oracle.com>:
> On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > > > [...]
> > > > - When I run ceph with the default configuration (btrfs snaps enabled)
> > > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > > > Btrfs-cleaner is using more and more time in
> > > > btrfs_clean_old_snapshots().
> > >
> > > [...]
> >
> > I've noticed this problem too, clean_old_snapshots is taking quite a while in
> > cases where it really shouldn't. I will see if I can come up with a reproducer
> > that doesn't require setting up ceph ;).
>
> This sounds familiar though, I thought we had fixed a similar
> regression. Either way, Arne's readahead code should really help.
>
> Which kernel version were you running?
>
> [ ack on the rest of Josef's comments ]

This was with a 3.0 kernel, including all btrfs patches from Josef's
git repo plus the "use the global reserve when truncating the free
space cache inode" patch.

I'll try the readahead code.

Thanks,
Christian
Arne Jansen
2011-Oct-24 21:37 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On 24.10.2011 23:34, Christian Brunner wrote:
> 2011/10/24 Chris Mason <chris.mason@oracle.com>:
> > [...]
> >
> > This sounds familiar though, I thought we had fixed a similar
> > regression. Either way, Arne's readahead code should really help.
> >
> > Which kernel version were you running?
> >
> > [ ack on the rest of Josef's comments ]
>
> This was with a 3.0 kernel, including all btrfs patches from Josef's
> git repo plus the "use the global reserve when truncating the free
> space cache inode" patch.
>
> I'll try the readahead code.

The current readahead code is only used for scrub. I plan to extend it
to snapshot deletion in a next step, but currently I'm afraid it can't
help.

-Arne
Christoph Hellwig
2011-Oct-25 10:23 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
>
> FYI in this scenario you're exposed to the same journal replay issues that
> ext4 and XFS are. The btrfs workload that ceph is generating will also
> not be all that special, though, so this problem shouldn't be unique to
> ceph.

What journal replay issues would ext4 and XFS be exposed to?
Christian Brunner
2011-Oct-25 11:56 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/24 Josef Bacik <josef@redhat.com>:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > [adding linux-btrfs to cc]
> >
> > Josef, Chris, any ideas on the below issues?
> >
> > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > >
> > > - When I run ceph with btrfs snaps disabled, the situation is getting
> > > slightly better. I can run an OSD for about 3 days without problems,
> > > but then again the load increases. This time, I can see that the
> > > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > > than usual.
> >
> > FYI in this scenario you're exposed to the same journal replay issues that
> > ext4 and XFS are. The btrfs workload that ceph is generating will also
> > not be all that special, though, so this problem shouldn't be unique to
> > ceph.
> >
>
> Can you get sysrq+w when this happens? I'd like to see what btrfs-endio-write
> is up to.

Capturing this turns out not to be easy. I have a few traces (see
attachment), but with sysrq+w I do not get a stacktrace of
btrfs-endio-write. What I have is a "latencytop -c" output, which is
interesting:

In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
tries to balance the load over all OSDs, so all filesystems should get
a nearly equal load. At the moment one filesystem seems to have a
problem. When running iostat I see the following:

Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sdd        0.00    0.00  0.00    4.33    0.00    53.33     12.31      0.08   19.38  12.23   5.30
sdc        0.00    1.00  0.00  228.33    0.00  1957.33      8.57     74.33  380.76   2.74  62.57
sdb        0.00    0.00  0.00    1.33    0.00    16.00     12.00      0.03   25.00  19.75   2.63
sda        0.00    0.00  0.00    0.67    0.00     8.00     12.00      0.01   19.50  12.50   0.83

The PID of the ceph-osd that is running on sdc is 2053, and when I look
with top I see this process and a btrfs-endio-writer (PID 5447):

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2053 root  20   0  537m 146m 2364 S 33.2  0.6 43:31.24  ceph-osd
 5447 root  20   0     0    0    0 S 22.6  0.0 19:32.18  btrfs-endio-wri

In the latencytop output you can see that those processes have a much
higher latency than the other ceph-osd and btrfs-endio-writers.

Regards,
Christian
Josef Bacik
2011-Oct-25 12:23 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik <josef@redhat.com>:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > > >
> > > > - When I run ceph with btrfs snaps disabled, the situation is getting
> > > > slightly better. I can run an OSD for about 3 days without problems,
> > > > but then again the load increases. This time, I can see that the
> > > > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > > > than usual.
> > >
> > > FYI in this scenario you're exposed to the same journal replay issues that
> > > ext4 and XFS are. The btrfs workload that ceph is generating will also
> > > not be all that special, though, so this problem shouldn't be unique to
> > > ceph.
> > >
> >
> > Can you get sysrq+w when this happens? I'd like to see what btrfs-endio-write
> > is up to.
>
> Capturing this turns out not to be easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output, which is
> interesting:
>
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running iostat I see the following:
>
> Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
> sdd        0.00    0.00  0.00    4.33    0.00    53.33     12.31      0.08   19.38  12.23   5.30
> sdc        0.00    1.00  0.00  228.33    0.00  1957.33      8.57     74.33  380.76   2.74  62.57
> sdb        0.00    0.00  0.00    1.33    0.00    16.00     12.00      0.03   25.00  19.75   2.63
> sda        0.00    0.00  0.00    0.67    0.00     8.00     12.00      0.01   19.50  12.50   0.83
>
> The PID of the ceph-osd that is running on sdc is 2053, and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
>
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 root  20   0  537m 146m 2364 S 33.2  0.6 43:31.24  ceph-osd
>  5447 root  20   0     0    0    0 S 22.6  0.0 19:32.18  btrfs-endio-wri
>
> In the latencytop output you can see that those processes have a much
> higher latency than the other ceph-osd and btrfs-endio-writers.
>

I'm seeing a lot of this

[schedule]          1654.6 msec         96.4 %
    schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
    generic_write_sync blkdev_aio_write do_sync_readv_writev
    do_readv_writev vfs_writev sys_writev system_call_fastpath

where ceph-osd's latency is mostly coming from this fsync of a block device
directly, and not so much from being tied up by btrfs directly. With 22% CPU
being taken up by btrfs-endio-wri we must be doing something wrong. Can you
run perf record -ag when this is going on and then perf report so we can see
what btrfs-endio-wri is doing with the cpu. You can drill down in perf report
to get only what btrfs-endio-wri is doing, so that would be best. As far as
the rest of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing
anything horribly wrong or introducing a lot of latency. Most of it seems to
be when running the delayed refs and having to read in blocks. I've been
suspecting for a while that the delayed ref stuff ends up doing way more work
than it needs to per task, and it's possible that btrfs-endio-wri is simply
getting screwed by other people doing work.

At this point it seems like the biggest problem with latency in ceph-osd is
not related to btrfs; the latency seems to all be from the fact that ceph-osd
is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
like it's blowing a lot of CPU time, so perf record -ag is probably going to
be your best bet when it's using lots of cpu so we can figure out what it's
spinning on.

Thanks,

Josef
Christian Brunner
2011-Oct-25 14:25 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> > [...]
> > In the latencytop output you can see that those processes have a much
> > higher latency than the other ceph-osd and btrfs-endio-writers.
> >
>
> I'm seeing a lot of this
>
> [schedule]          1654.6 msec         96.4 %
>     schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>     generic_write_sync blkdev_aio_write do_sync_readv_writev
>     do_readv_writev vfs_writev sys_writev system_call_fastpath
>
> where ceph-osd's latency is mostly coming from this fsync of a block device
> directly, and not so much from being tied up by btrfs directly. With 22% CPU
> being taken up by btrfs-endio-wri we must be doing something wrong. Can you
> run perf record -ag when this is going on and then perf report so we can see
> what btrfs-endio-wri is doing with the cpu. You can drill down in perf report
> to get only what btrfs-endio-wri is doing, so that would be best. As far as
> the rest of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing
> anything horribly wrong or introducing a lot of latency. Most of it seems to
> be when running the delayed refs and having to read in blocks.
>
> At this point it seems like the biggest problem with latency in ceph-osd is
> not related to btrfs; the latency seems to all be from the fact that ceph-osd
> is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
> like it's blowing a lot of CPU time, so perf record -ag is probably going to
> be your best bet when it's using lots of cpu so we can figure out what it's
> spinning on.

Attached is a perf report. I have included the whole report, so that
you can see the difference between the good and the bad
btrfs-endio-wri.

Thanks,
Christian
Josef Bacik
2011-Oct-25 15:00 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > [...]
> > At this point it seems like the biggest problem with latency in ceph-osd is
> > not related to btrfs; the latency seems to all be from the fact that ceph-osd
> > is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
> > like it's blowing a lot of CPU time, so perf record -ag is probably going to
> > be your best bet when it's using lots of cpu so we can figure out what it's
> > spinning on.
>
> Attached is a perf report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

Oh shit, we're inserting xattrs in endio, that's not good. I'll look more into
this when I get back home, but this is definitely a problem; we're doing a lot
more work in endio than we should.

Thanks,

Josef
Josef Bacik
2011-Oct-25 15:05 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > [...]
> > At this point it seems like the biggest problem with latency in ceph-osd is
> > not related to btrfs; the latency seems to all be from the fact that ceph-osd
> > is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
> > like it's blowing a lot of CPU time, so perf record -ag is probably going to
> > be your best bet when it's using lots of cpu so we can figure out what it's
> > spinning on.
>
> Attached is a perf report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

We also shouldn't be running run_ordered_operations; man, this is screwed up.
Thanks so much for this, I should be able to nail this down pretty easily.

Thanks,

Josef
Christian Brunner
2011-Oct-25 15:13 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > 2011/10/25 Josef Bacik <josef@redhat.com>:
> > > [...]
> > > At this point it seems like the biggest problem with latency in ceph-osd is
> > > not related to btrfs; the latency seems to all be from the fact that ceph-osd
> > > is fsyncing a block dev for whatever reason. As for btrfs-endio-wri, it seems
> > > like it's blowing a lot of CPU time, so perf record -ag is probably going to
> > > be your best bet when it's using lots of cpu so we can figure out what it's
> > > spinning on.
> >
> > Attached is a perf report. I have included the whole report, so that
> > you can see the difference between the good and the bad
> > btrfs-endio-wri.
> >
>
> We also shouldn't be running run_ordered_operations; man, this is screwed up.
> Thanks so much for this, I should be able to nail this down pretty easily.

Please note that this is with "btrfs snaps disabled" in the ceph conf.

When I enable snaps our problems get worse (the btrfs-cleaner thing), but
I would be glad if this one thing gets solved. I can run debugging with
snaps enabled if you want, but I would suggest that we do this afterwards.

Thanks,
Christian
Sage Weil
2011-Oct-25 16:23 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Christoph Hellwig wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > > - When I run ceph with btrfs snaps disabled, the situation is getting
> > > slightly better. I can run an OSD for about 3 days without problems,
> > > but then again the load increases. This time, I can see that the
> > > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > > than usual.
> >
> > FYI in this scenario you're exposed to the same journal replay issues that
> > ext4 and XFS are. The btrfs workload that ceph is generating will also
> > not be all that special, though, so this problem shouldn't be unique to
> > ceph.
>
> What journal replay issues would ext4 and XFS be exposed to?

It's the ceph-osd journal replay, not the ext4/XFS journal... the #2 item
in http://marc.info/?l=ceph-devel&m=131942130322957&w=2

sage
Sage Weil
2011-Oct-25 16:36 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Josef Bacik wrote:
> At this point it seems like the biggest problem with latency in ceph-osd
> is not related to btrfs; the latency seems to all be from the fact that
> ceph-osd is fsyncing a block dev for whatever reason.

There is one place where we sync_file_range() on the journal block device,
but that should only happen if directio is disabled (it's on by default).

Christian, have you tweaked those settings in your ceph.conf? It would be
something like 'journal dio = false'. If not, can you verify that
directio shows true when the journal is initialized from your osd log?
E.g.,

2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1

If directio = 1 for you, something else funky is causing those
blkdev_fsync's...

sage
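To illustrate the distinction Sage is drawing, here is a rough sketch; it is not the actual Ceph FileJournal code, and the function names plus the use of fdatasync() in the fallback path are assumptions. With direct I/O the journal writes bypass the page cache, so no separate flush of the block device is needed per entry; without it, the writer has to issue an explicit sync, which is the kind of call that surfaces as blkdev_fsync/blkdev_issue_flush in the traces above.

/* Sketch only -- not Ceph's FileJournal code. Illustrates why
 * "journal dio = true" should avoid the blkdev_fsync path. */
#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

int open_journal(const char *dev, int use_dio)
{
	int flags = O_WRONLY;

	if (use_dio)
		flags |= O_DIRECT | O_DSYNC;	/* writes go straight to the device */
	return open(dev, flags);
}

/* buf must be block-size aligned (e.g. via posix_memalign) when O_DIRECT
 * is in use. */
int write_journal_entry(int fd, int use_dio, const void *buf, size_t len, off_t off)
{
	if (pwrite(fd, buf, len, off) != (ssize_t)len)
		return -1;
	if (!use_dio) {
		/* Buffered fallback: force the dirty pages and the device
		 * write cache out explicitly -- this is where a
		 * blkdev_fsync/blkdev_issue_flush would come from. */
		return fdatasync(fd);
	}
	return 0;
}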
Christian Brunner
2011-Oct-25 19:09 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/25 Sage Weil <sage@newdream.net>:
> On Tue, 25 Oct 2011, Josef Bacik wrote:
> > At this point it seems like the biggest problem with latency in ceph-osd
> > is not related to btrfs; the latency seems to all be from the fact that
> > ceph-osd is fsyncing a block dev for whatever reason.
>
> There is one place where we sync_file_range() on the journal block device,
> but that should only happen if directio is disabled (it's on by default).
>
> Christian, have you tweaked those settings in your ceph.conf? It would be
> something like 'journal dio = false'. If not, can you verify that
> directio shows true when the journal is initialized from your osd log?
> E.g.,
>
> 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>
> If directio = 1 for you, something else funky is causing those
> blkdev_fsync's...

I've looked it up in the logs - directio is 1:

Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
/dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
bytes, directio = 1

Regards,
Christian
Chris Mason
2011-Oct-25 20:15 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> >
> > Attached is a perf report. I have included the whole report, so that
> > you can see the difference between the good and the bad
> > btrfs-endio-wri.
> >
>
> We also shouldn't be running run_ordered_operations; man, this is screwed up.
> Thanks so much for this, I should be able to nail this down pretty easily.

Looks like we're getting there from reserve_metadata_bytes when we join
the transaction?

-chris
Josef Bacik
2011-Oct-25 20:22 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > >
> > > Attached is a perf report. I have included the whole report, so that
> > > you can see the difference between the good and the bad
> > > btrfs-endio-wri.
> > >
> >
> > We also shouldn't be running run_ordered_operations; man, this is screwed up.
> > Thanks so much for this, I should be able to nail this down pretty easily.
>
> Looks like we're getting there from reserve_metadata_bytes when we join
> the transaction?
>

We don't do reservations in the endio stuff; we assume you've reserved all the
space you need in delalloc, plus we would have seen reserve_metadata_bytes in
the trace. Though it does look like perf is lying to us in at least one case,
since btrfs_alloc_logged_file_extent is only called from log replay and not
during normal runtime, so it definitely shouldn't be showing up.

Thanks,

Josef
Sage Weil
2011-Oct-25 22:27 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, 25 Oct 2011, Christian Brunner wrote:
> 2011/10/25 Sage Weil <sage@newdream.net>:
> > On Tue, 25 Oct 2011, Josef Bacik wrote:
> > > At this point it seems like the biggest problem with latency in ceph-osd
> > > is not related to btrfs; the latency seems to all be from the fact that
> > > ceph-osd is fsyncing a block dev for whatever reason.
> >
> > There is one place where we sync_file_range() on the journal block device,
> > but that should only happen if directio is disabled (it's on by default).
> >
> > Christian, have you tweaked those settings in your ceph.conf? It would be
> > something like 'journal dio = false'. If not, can you verify that
> > directio shows true when the journal is initialized from your osd log?
> > E.g.,
> >
> > 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
> >
> > If directio = 1 for you, something else funky is causing those
> > blkdev_fsync's...
>
> I've looked it up in the logs - directio is 1:
>
> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
> bytes, directio = 1

Do you mind capturing an strace? I'd like to see where that blkdev_fsync
is coming from.

thanks!
sage
Christian Brunner
2011-Oct-26 00:16 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/25 Josef Bacik <josef@redhat.com>:
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > >
> > > > Attached is a perf report. I have included the whole report, so that
> > > > you can see the difference between the good and the bad
> > > > btrfs-endio-wri.
> > > >
> > >
> > > We also shouldn't be running run_ordered_operations; man, this is screwed up.
> > > Thanks so much for this, I should be able to nail this down pretty easily.
> >
> > Looks like we're getting there from reserve_metadata_bytes when we join
> > the transaction?
> >
>
> We don't do reservations in the endio stuff; we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace. Though it does look like perf is lying to us in at least one case,
> since btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.

Strange! I'll check if symbols got messed up in the report tomorrow.

Christian
Christian Brunner
2011-Oct-26 08:21 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/26 Christian Brunner <chb@muc.de>:
> 2011/10/25 Josef Bacik <josef@redhat.com>:
> > [...]
> > We don't do reservations in the endio stuff; we assume you've reserved all the
> > space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> > the trace. Though it does look like perf is lying to us in at least one case,
> > since btrfs_alloc_logged_file_extent is only called from log replay and not
> > during normal runtime, so it definitely shouldn't be showing up.
>
> Strange! I'll check if symbols got messed up in the report tomorrow.

I've checked this now: except for the missing symbols for the iomemory_vsl
module, everything is looking normal. I've also run the report on another
OSD again, but the results look quite similar.

Regards,
Christian

PS: This is what perf report -v is saying...

build id event received for [kernel.kallsyms]: 805ca93f4057cc0c8f53b061a849b3f847f2de40
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/fs/btrfs/btrfs.ko: 64a723e05af3908fb9593f4a3401d6563cb1a01b
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/lib/libcrc32c.ko: b1391be8d33b54b6de20e07b7f2ee8d777fc09d2
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/bonding/bonding.ko: 663392df0f407211ab8f9527c482d54fce890c5e
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/scsi/hpsa.ko: 676eecffd476aef1b0f2f8c1bf8c8e6120d369c9
build id event received for /lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko: db7c200894b27e71ae6fe5cf7adaebf787c90da9
build id event received for [iomemory_vsl]: 4ed417c9a815e6bbe77a1656bceda95d9f06cb13
build id event received for /lib64/libc-2.12.so: 2ab28d41242ede641418966ef08f9aacffd9e8c7
build id event received for /lib64/libpthread-2.12.so: c177389a6f119b3883ea0b3c33cb04df3f8e5cc7
build id event received for /sbin/rsyslogd: 1372ef1e2ec550967fe20d0bdddbc0aab0bb36dc
build id event received for /lib64/libglib-2.0.so.0.2200.5: d880be15bf992b5fbcc629e6bbf1c747a928ddd5
build id event received for /usr/sbin/irqbalance: 842de64f46ca9fde55efa29a793c08b197d58354
build id event received for /lib64/libm-2.12.so: 46ac89195918407d2937bd1450c0ec99c8d41a2a
build id event received for /usr/bin/ceph-osd: 9fcb36e020c49fc49171b4c88bd784b38eb0675b
build id event received for /usr/lib64/libstdc++.so.6.0.13: d1b2ca4e1ec8f81ba820e5f1375d960107ac7e50
build id event received for /usr/lib64/libtcmalloc.so.0.2.0: 02766551b2eb5a453f003daee0c5fc9cd176e831
Looking at the vmlinux_path (6 entries long)
dso__load_sym: cannot get elf header.
Using /proc/kallsyms for symbols
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id 4ed417c9a815e6bbe77a1656bceda95d9f06cb13 was found
[iomemory_vsl] with build id 4ed417c9a815e6bbe77a1656bceda95d9f06cb13 not found, continuing without symbols
Chris Mason
2011-Oct-26 13:23 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > >
> > > > Attached is a perf report. I have included the whole report, so that
> > > > you can see the difference between the good and the bad
> > > > btrfs-endio-wri.
> > > >
> > >
> > > We also shouldn't be running run_ordered_operations; man, this is screwed up.
> > > Thanks so much for this, I should be able to nail this down pretty easily.
> >
> > Looks like we're getting there from reserve_metadata_bytes when we join
> > the transaction?
> >
>
> We don't do reservations in the endio stuff; we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace. Though it does look like perf is lying to us in at least one case,
> since btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.

Whoops, I should have read that num_items > 0 check harder.

btrfs_end_transaction is doing it by setting ->blocked = 1

	if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
	    should_end_transaction(trans, root)) {
		trans->transaction->blocked = 1;
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
		smp_wmb();
	}

	if (lock && cur_trans->blocked && !cur_trans->in_commit) {
	            ^^^^^^^^^^^^^^^^^^^
		if (throttle) {
			/*
			 * We may race with somebody else here so end up having
			 * to call end_transaction on ourselves again, so inc
			 * our use_count.
			 */
			trans->use_count++;
			return btrfs_commit_transaction(trans, root);
		} else {
			wake_up_process(info->transaction_kthread);
		}
	}

perf is definitely lying a little bit about the trace ;)

-chris
Josef Bacik
2011-Oct-27 15:07 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote:> On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote: > > On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote: > > > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote: > > > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote: > > > > > > > > > > Attached is a perf-report. I have included the whole report, so that > > > > > you can see the difference between the good and the bad > > > > > btrfs-endio-wri. > > > > > > > > > > > > > We also shouldn''t be running run_ordered_operations, man this is screwed up, > > > > thanks so much for this, I should be able to nail this down pretty easily. > > > > Thanks, > > > > > > Looks like we''re getting there from reserve_metadata_bytes when we join > > > the transaction? > > > > > > > We don''t do reservations in the endio stuff, we assume you''ve reserved all the > > space you need in delalloc, plus we would have seen reserve_metadata_bytes in > > the trace. Though it does look like perf is lying to us in at least one case > > sicne btrfs_alloc_logged_file_extent is only called from log replay and not > > during normal runtime, so it definitely shouldn''t be showing up. Thanks, > > Whoops, I should have read that num_items > 0 check harder. > > btrfs_end_transaction is doing it by setting ->blocked = 1 > > if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) && > should_end_transaction(trans, root)) { > trans->transaction->blocked = 1; > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > smp_wmb(); > } > > if (lock && cur_trans->blocked && !cur_trans->in_commit) { > ^^^^^^^^^^^^^^^^^^^ > if (throttle) { > /* > * We may race with somebody else here so end up having > * to call end_transaction on ourselves again, so inc > * our use_count. > */ > trans->use_count++; > return btrfs_commit_transaction(trans, root); > } else { > wake_up_process(info->transaction_kthread); > } > } >Not sure what you are getting at here? Even if we set blocked, we''re not throttling so it will just wake up the transaction kthread, so we won''t do the commit in the endio case. Thanks Josef -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
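To make the throttle distinction concrete, here is a small standalone sketch. This is not btrfs code: the struct and function are simplified stand-ins invented for the example, and it only mirrors the branch quoted above. With ->blocked set but throttle clear, the end-transaction path merely wakes the transaction kthread; it never commits (and therefore never runs the ordered operations) from the endio context.

	/*
	 * Standalone sketch, not btrfs code: toy_trans/toy_end_transaction are
	 * simplified stand-ins for btrfs_trans_handle/__btrfs_end_transaction,
	 * kept only to show what the throttle flag changes.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	struct toy_trans {
		bool blocked;
		bool in_commit;
	};

	static const char *toy_end_transaction(struct toy_trans *t, bool throttle)
	{
		if (t->blocked && !t->in_commit) {
			if (throttle)
				return "commit transaction (runs ordered operations)";
			return "wake transaction kthread only";
		}
		return "plain end of transaction";
	}

	int main(void)
	{
		struct toy_trans t = { .blocked = true, .in_commit = false };

		/* endio-style caller: no throttling */
		printf("endio path    : %s\n", toy_end_transaction(&t, false));
		/* throttled caller: may commit itself */
		printf("throttled path: %s\n", toy_end_transaction(&t, true));
		return 0;
	}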
Josef Bacik
2011-Oct-27 18:14 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Thu, Oct 27, 2011 at 11:07:38AM -0400, Josef Bacik wrote:> On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote: > > On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote: > > > On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote: > > > > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote: > > > > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote: > > > > > > > > > > > > Attached is a perf-report. I have included the whole report, so that > > > > > > you can see the difference between the good and the bad > > > > > > btrfs-endio-wri. > > > > > > > > > > > > > > > > We also shouldn''t be running run_ordered_operations, man this is screwed up, > > > > > thanks so much for this, I should be able to nail this down pretty easily. > > > > > Thanks, > > > > > > > > Looks like we''re getting there from reserve_metadata_bytes when we join > > > > the transaction? > > > > > > > > > > We don''t do reservations in the endio stuff, we assume you''ve reserved all the > > > space you need in delalloc, plus we would have seen reserve_metadata_bytes in > > > the trace. Though it does look like perf is lying to us in at least one case > > > sicne btrfs_alloc_logged_file_extent is only called from log replay and not > > > during normal runtime, so it definitely shouldn''t be showing up. Thanks, > > > > Whoops, I should have read that num_items > 0 check harder. > > > > btrfs_end_transaction is doing it by setting ->blocked = 1 > > > > if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) && > > should_end_transaction(trans, root)) { > > trans->transaction->blocked = 1; > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > smp_wmb(); > > } > > > > if (lock && cur_trans->blocked && !cur_trans->in_commit) { > > ^^^^^^^^^^^^^^^^^^^ > > if (throttle) { > > /* > > * We may race with somebody else here so end up having > > * to call end_transaction on ourselves again, so inc > > * our use_count. > > */ > > trans->use_count++; > > return btrfs_commit_transaction(trans, root); > > } else { > > wake_up_process(info->transaction_kthread); > > } > > } > > > > Not sure what you are getting at here? Even if we set blocked, we''re not > throttling so it will just wake up the transaction kthread, so we won''t do the > commit in the endio case. Thanks >Oh I see what you were trying to say, that we''d set blocking and then commit the transaction from the endio process which would run ordered operations, but since throttle isn''t set that won''t happen. I think that the perf symbols are just lying to us. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Josef Bacik
2011-Oct-27 19:52 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik <josef@redhat.com>:
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are. The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens? I'd like to see what btrfs-endio-write
> > is up to.
>
> Capturing this seems not to be easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stack trace of
> btrfs-endio-write. What I do have is a "latencytop -c" output, which is
> interesting:
>
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following:
>
> Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sdd        0.00    0.00  0.00    4.33    0.00    53.33    12.31     0.08   19.38  12.23   5.30
> sdc        0.00    1.00  0.00  228.33    0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> sdb        0.00    0.00  0.00    1.33    0.00    16.00    12.00     0.03   25.00  19.75   2.63
> sda        0.00    0.00  0.00    0.67    0.00     8.00    12.00     0.01   19.50  12.50   0.83
>
> The PID of the ceph-osd that is running on sdc is 2053, and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
>
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
> 5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
>
> In the latencytop output you can see that those processes have a much
> higher latency than the other ceph-osd and btrfs-endio-writers.
>
> Regards,
> Christian

Ok, just a shot in the dark, but could you give this a whirl and see if it
helps you?
Thanks,

Josef

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 125cf76..fbc196e 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -210,9 +210,9 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-			   struct list_head *cluster, u64 start)
+			   struct list_head *cluster, u64 start, unsigned long max_count)
 {
-	int count = 0;
+	unsigned long count = 0;
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct rb_node *node;
 	struct btrfs_delayed_ref_node *ref;
@@ -242,7 +242,7 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
 		node = rb_first(&delayed_refs->root);
 	}
 again:
-	while (node && count < 32) {
+	while (node && count < max_count) {
 		ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
 		if (btrfs_delayed_ref_is_head(ref)) {
 			head = btrfs_delayed_node_to_head(ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index e287e3b..b15a6ad 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -169,7 +169,8 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr);
 int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 			   struct btrfs_delayed_ref_head *head);
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-			   struct list_head *cluster, u64 search_start);
+			   struct list_head *cluster, u64 search_start,
+			   unsigned long max_count);
 /*
  * a node might live in a head or a regular ref, this lets you
  * test for the proper type to use.
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 31d84e7..c190282 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -81,6 +81,7 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
 	u32 data_size;
 
 	BUG_ON(name_len + data_len > BTRFS_MAX_XATTR_SIZE(root));
+	WARN_ON(trans->endio);
 
 	key.objectid = objectid;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4eb7d2b..0977a10 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2295,7 +2295,7 @@ again:
 		 * lock
 		 */
 		ret = btrfs_find_ref_cluster(trans, &cluster,
-					     delayed_refs->run_delayed_start);
+					     delayed_refs->run_delayed_start, count);
 		if (ret)
 			break;
 
@@ -2338,7 +2338,8 @@ again:
 			node = rb_next(node);
 		}
 		spin_unlock(&delayed_refs->lock);
-		schedule_timeout(1);
+		if (need_resched())
+			schedule_timeout(1);
 		goto again;
 	}
 out:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f12747c..73a5e66 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1752,6 +1752,7 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end)
 	else
 		trans = btrfs_join_transaction(root);
 	BUG_ON(IS_ERR(trans));
+	trans->endio = 1;
 	trans->block_rsv = &root->fs_info->delalloc_block_rsv;
 
 	if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
@@ -2057,8 +2058,11 @@ void btrfs_run_delayed_iputs(struct btrfs_root *root)
 	LIST_HEAD(list);
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct delayed_iput *delayed;
+	struct btrfs_trans_handle *trans;
 	int empty;
 
+	trans = current->journal_info;
+	WARN_ON(trans && trans->endio);
 	spin_lock(&fs_info->delayed_iput_lock);
 	empty = list_empty(&fs_info->delayed_iputs);
 	spin_unlock(&fs_info->delayed_iput_lock);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a1c9404..ab68cfa 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -527,12 +527,15 @@ int btrfs_wait_ordered_extents(struct btrfs_root *root,
  */
 int btrfs_run_ordered_operations(struct btrfs_root *root, int wait)
 {
+	struct btrfs_trans_handle *trans;
 	struct btrfs_inode *btrfs_inode;
 	struct inode *inode;
 	struct list_head splice;
 
+	trans = (struct btrfs_trans_handle *)current->journal_info;
 	INIT_LIST_HEAD(&splice);
+	WARN_ON(trans && trans->endio);
 
 	mutex_lock(&root->fs_info->ordered_operations_mutex);
 	spin_lock(&root->fs_info->ordered_extent_lock);
 again:
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 29bef63..009d2db 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -310,6 +310,7 @@ again:
 	h->use_count = 1;
 	h->block_rsv = NULL;
 	h->orig_rsv = NULL;
+	h->endio = 0;
 	smp_mb();
 
 	if (cur_trans->blocked && may_wait_transaction(root, type)) {
@@ -467,20 +468,17 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	while (count < 4) {
 		unsigned long cur = trans->delayed_ref_updates;
 		trans->delayed_ref_updates = 0;
-		if (cur &&
-		    trans->transaction->delayed_refs.num_heads_ready > 64) {
-			trans->delayed_ref_updates = 0;
-
-			/*
-			 * do a full flush if the transaction is trying
-			 * to close
-			 */
-			if (trans->transaction->delayed_refs.flushing)
-				cur = 0;
-			btrfs_run_delayed_refs(trans, root, cur);
-		} else {
+		if (!cur ||
+		    trans->transaction->delayed_refs.num_heads_ready <= 64)
 			break;
-		}
+
+		/*
+		 * do a full flush if the transaction is trying
+		 * to close
+		 */
+		if (trans->transaction->delayed_refs.flushing && throttle)
+			cur = 0;
+		btrfs_run_delayed_refs(trans, root, cur);
 		count++;
 	}
 
@@ -498,6 +496,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 		 * our use_count.
 		 */
 		trans->use_count++;
+		WARN_ON(trans->endio);
 		return btrfs_commit_transaction(trans, root);
 	} else {
 		wake_up_process(info->transaction_kthread);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 02564e6..7eae404 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -55,6 +55,7 @@ struct btrfs_trans_handle {
 	struct btrfs_transaction *transaction;
 	struct btrfs_block_rsv *block_rsv;
 	struct btrfs_block_rsv *orig_rsv;
+	unsigned endio;
 };
 
 struct btrfs_pending_snapshot {
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
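Purely as an illustration of the batching change in the delayed-ref hunks above (this is not btrfs code; the function and numbers below are invented for the example): the caller now supplies its own cap instead of the fixed cluster size of 32, so a short-lived caller such as the endio path can bound how much delayed-ref work it pulls in per pass.

	#include <stdio.h>

	/*
	 * Toy model of the max_count change: process at most `max_count`
	 * pending items per pass instead of a hard-coded batch of 32.
	 */
	static unsigned long process_refs(unsigned long pending, unsigned long max_count)
	{
		unsigned long done = 0;

		while (pending && done < max_count) {	/* was: done < 32 */
			pending--;
			done++;
		}
		return done;
	}

	int main(void)
	{
		unsigned long pending = 1000;

		printf("capped caller (64): processed %lu\n", process_refs(pending, 64));
		printf("uncapped caller   : processed %lu\n", process_refs(pending, (unsigned long)-1));
		return 0;
	}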
Christian Brunner
2011-Oct-27 20:39 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/27 Josef Bacik <josef@redhat.com>:> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote: >> 2011/10/24 Josef Bacik <josef@redhat.com>: >> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote: >> >> [adding linux-btrfs to cc] >> >> >> >> Josef, Chris, any ideas on the below issues? >> >> >> >> On Mon, 24 Oct 2011, Christian Brunner wrote: >> >> > >> >> > - When I run ceph with btrfs snaps disabled, the situation is getting >> >> > slightly better. I can run an OSD for about 3 days without problems, >> >> > but then again the load increases. This time, I can see that the >> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work >> >> > than usual. >> >> >> >> FYI in this scenario you''re exposed to the same journal replay issues that >> >> ext4 and XFS are. The btrfs workload that ceph is generating will also >> >> not be all that special, though, so this problem shouldn''t be unique to >> >> ceph. >> >> >> > >> > Can you get sysrq+w when this happens? I''d like to see what btrfs-endio-write >> > is up to. >> >> Capturing this seems to be not easy. I have a few traces (see >> attachment), but with sysrq+w I do not get a stacktrace of >> btrfs-endio-write. What I have is a "latencytop -c" output which is >> interesting: >> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph >> tries to balance the load over all OSDs, so all filesystems should get >> an nearly equal load. At the moment one filesystem seems to have a >> problem. When running with iostat I see the following >> >> Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s >> avgrq-sz avgqu-sz await svctm %util >> sdd 0.00 0.00 0.00 4.33 0.00 53.33 >> 12.31 0.08 19.38 12.23 5.30 >> sdc 0.00 1.00 0.00 228.33 0.00 1957.33 >> 8.57 74.33 380.76 2.74 62.57 >> sdb 0.00 0.00 0.00 1.33 0.00 16.00 >> 12.00 0.03 25.00 19.75 2.63 >> sda 0.00 0.00 0.00 0.67 0.00 8.00 >> 12.00 0.01 19.50 12.50 0.83 >> >> The PID of the ceph-osd taht is running on sdc is 2053 and when I look >> with top I see this process and a btrfs-endio-writer (PID 5447): >> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND >> 2053 root 20 0 537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd >> 5447 root 20 0 0 0 0 S 22.6 0.0 19:32.18 btrfs-endio-wri >> >> In the latencytop output you can see that those processes have a much >> higher latency, than the other ceph-osd and btrfs-endio-writers. >> >> Regards, >> Christian > > Ok just a shot in the dark, but could you give this a whirl and see if it helps > you? ThanksThanks for the patch! I''ll install it tomorrow and I think that I can report back on Monday. It always takes a few days until the load goes up. Regards, Christian -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christian Brunner
2011-Oct-31 10:25 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/31 Christian Brunner <chb@muc.de>:
> 2011/10/31 Christian Brunner <chb@muc.de>:
>>
>> The patch didn't hurt, but I have to tell you that I'm still seeing the
>> same old problems. Load is going up again:
>>
>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 5502 root      20   0     0    0    0 S 52.5  0.0 106:29.97 btrfs-endio-wri
>> 1976 root      20   0  601m 211m 1464 S 28.3  0.9 115:10.62 ceph-osd
>>
>> And I have hit our warning again:
>>
>> [223560.970713] ------------[ cut here ]------------
>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>> [223560.985411] Hardware name: ProLiant DL180 G6
>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs [last unloaded: scsi_wait_scan]
>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P 3.0.6-1.fits.9.el6.x86_64 #1
>> [223561.023874] Call Trace:
>> [223561.026738] [<ffffffff8106344f>] warn_slowpath_common+0x7f/0xc0
>> [223561.033564] [<ffffffff810634aa>] warn_slowpath_null+0x1a/0x20
>> [223561.040272] [<ffffffffa0282120>] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]
>> [223561.048278] [<ffffffffa027ce55>] commit_fs_roots+0xc5/0x1b0 [btrfs]
>> [223561.055534] [<ffffffff8154c231>] ? mutex_lock+0x31/0x60
>> [223561.061666] [<ffffffffa027ddbe>] btrfs_commit_transaction+0x3ce/0x820 [btrfs]
>> [223561.069876] [<ffffffffa027d1b8>] ? wait_current_trans+0x28/0x110 [btrfs]
>> [223561.077582] [<ffffffffa027e325>] ? join_transaction+0x25/0x250 [btrfs]
>> [223561.085065] [<ffffffff81086410>] ? wake_up_bit+0x40/0x40
>> [223561.091251] [<ffffffffa025a329>] btrfs_sync_fs+0x59/0xd0 [btrfs]
>> [223561.098187] [<ffffffffa02abc65>] btrfs_ioctl+0x495/0xd50 [btrfs]
>> [223561.105120] [<ffffffff8125ed20>] ? inode_has_perm+0x30/0x40
>> [223561.111575] [<ffffffff81261a2c>] ? file_has_perm+0xdc/0xf0
>> [223561.117924] [<ffffffff8117086a>] do_vfs_ioctl+0x9a/0x5a0
>> [223561.124072] [<ffffffff81170e11>] sys_ioctl+0xa1/0xb0
>> [223561.129842] [<ffffffff81555702>] system_call_fastpath+0x16/0x1b
>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
>
> [ Not sending this to the lists, as the attachment is large ].
>
> I've spent a little time to do some tracing with ftrace. Its output
> seems to be right (at least as far as I can tell). I hope that its
> output can give you an insight into what's going on.
>
> The interesting PIDs in the trace are:
>
> 5502 root      20   0     0    0    0 S 33.6  0.0 118:28.37 btrfs-endio-wri
> 5518 root      20   0     0    0    0 S 29.3  0.0  41:23.58 btrfs-endio-wri
> 8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
> 7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
>

[ adding linux-btrfs again ]

I've been digging into this a bit further:

Attached is another ftrace report that I've filtered for "btrfs_*"
calls and limited to CPU0 (this is where PID 5502 was running).

From what I can see, a lot of time is consumed in
btrfs_reserve_extent(). Is this normal?

Thanks,
Christian
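For anyone wanting to reproduce this kind of capture, the following is only a rough sketch: it assumes root privileges, debugfs mounted at /sys/kernel/debug, and a kernel built with CONFIG_FUNCTION_GRAPH_TRACER; the traced function name is the only input. It records per-call latencies of btrfs_reserve_extent() for a ten-second window via the function_graph tracer.

	#include <stdio.h>
	#include <unistd.h>

	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return -1;
		}
		fputs(val, f);
		fclose(f);
		return 0;
	}

	int main(void)
	{
		const char *tdir = "/sys/kernel/debug/tracing";
		char path[256];

		/* trace only this function (and its children) */
		snprintf(path, sizeof(path), "%s/set_graph_function", tdir);
		if (write_str(path, "btrfs_reserve_extent"))
			return 1;

		snprintf(path, sizeof(path), "%s/current_tracer", tdir);
		if (write_str(path, "function_graph"))
			return 1;

		snprintf(path, sizeof(path), "%s/tracing_on", tdir);
		write_str(path, "1");
		sleep(10);		/* sample window */
		write_str(path, "0");

		printf("per-call durations are now in %s/trace\n", tdir);
		return 0;
	}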
Christian Brunner
2011-Oct-31 13:29 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
2011/10/31 Christian Brunner <chb@muc.de>:
> 2011/10/31 Christian Brunner <chb@muc.de>:
>> 2011/10/31 Christian Brunner <chb@muc.de>:
>>>
>>> The patch didn't hurt, but I have to tell you that I'm still seeing the
>>> same old problems. Load is going up again:
>>>
>>>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 5502 root      20   0     0    0    0 S 52.5  0.0 106:29.97 btrfs-endio-wri
>>> 1976 root      20   0  601m 211m 1464 S 28.3  0.9 115:10.62 ceph-osd
>>>
>>> And I have hit our warning again:
>>>
>>> [223560.970713] ------------[ cut here ]------------
>>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
>>> [... same trace as quoted in full above ...]
>>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]---
>>
>> [ Not sending this to the lists, as the attachment is large ].
>>
>> I've spent a little time to do some tracing with ftrace. Its output
>> seems to be right (at least as far as I can tell). I hope that its
>> output can give you an insight into what's going on.
>>
>> The interesting PIDs in the trace are:
>>
>> 5502 root      20   0     0    0    0 S 33.6  0.0 118:28.37 btrfs-endio-wri
>> 5518 root      20   0     0    0    0 S 29.3  0.0  41:23.58 btrfs-endio-wri
>> 8059 root      20   0  400m  48m 2756 S  8.0  0.2   8:31.56 ceph-osd
>> 7993 root      20   0  401m  41m 2808 S 13.6  0.2   7:58.38 ceph-osd
>>
>
> [ adding linux-btrfs again ]
>
> I've been digging into this a bit further:
>
> Attached is another ftrace report that I've filtered for "btrfs_*"
> calls and limited to CPU0 (this is where PID 5502 was running).
>
> From what I can see, a lot of time is consumed in
> btrfs_reserve_extent(). Is this normal?

Sorry for spamming, but in the meantime I'm almost certain that the
problem is inside find_free_extent (called from btrfs_reserve_extent).

When I run ftrace for a sample period of 10s, my system wastes a total
of 4.2 seconds inside find_free_extent(). Each call to find_free_extent()
takes an average of 4 milliseconds to complete. On a recently rebooted
system this is only 1-2 us!

I'm not sure whether the problem appears suddenly or builds up slowly
over time. (At the moment I suspect that it appears suddenly, but I
still have to investigate this.)

Thanks,
Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
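As a back-of-the-envelope check of those figures (the constants below are simply the numbers quoted in the mail above, so this is a sanity check rather than a new measurement):

	#include <stdio.h>

	int main(void)
	{
		const double sample_s      = 10.0;	/* ftrace sample window (s)           */
		const double busy_s        = 4.2;	/* total time in find_free_extent (s) */
		const double per_call_ms   = 4.0;	/* degraded average per call (ms)     */
		const double fresh_call_us = 2.0;	/* per call right after a reboot (us) */

		double calls    = busy_s / (per_call_ms / 1000.0);
		double cpu_pct  = 100.0 * busy_s / sample_s;
		double slowdown = (per_call_ms * 1000.0) / fresh_call_us;

		printf("~%.0f calls in the window, ~%.0f%% of one CPU, ~%.0fx slower than fresh\n",
		       calls, cpu_pct, slowdown);
		return 0;
	}

That works out to roughly 1050 calls in the 10 s window, about 42% of one CPU spent in find_free_extent() alone, and a roughly 2000x per-call slowdown compared with a freshly mounted filesystem, which is consistent with the high btrfs-endio-wri CPU usage shown in the top output above.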
Josef Bacik
2011-Oct-31 14:04 UTC
Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
On Mon, Oct 31, 2011 at 02:29:44PM +0100, Christian Brunner wrote:> 2011/10/31 Christian Brunner <chb@muc.de>: > > 2011/10/31 Christian Brunner <chb@muc.de>: > >> 2011/10/31 Christian Brunner <chb@muc.de>: > >>> > >>> The patch didn''t hurt, but I''ve to tell you that I''m still seeing the > >>> same old problems. Load is going up again: > >>> > >>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > >>> 5502 root 20 0 0 0 0 S 52.5 0.0 106:29.97 btrfs-endio-wri > >>> 1976 root 20 0 601m 211m 1464 S 28.3 0.9 115:10.62 ceph-osd > >>> > >>> And I have hit our warning again: > >>> > >>> [223560.970713] ------------[ cut here ]------------ > >>> [223560.976043] WARNING: at fs/btrfs/inode.c:2118 > >>> btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]() > >>> [223560.985411] Hardware name: ProLiant DL180 G6 > >>> [223560.990491] Modules linked in: btrfs zlib_deflate libcrc32c sunrpc > >>> bonding ipv6 sg serio_raw pcspkr ghes hed iTCO_wdt iTCO_vendor_support > >>> i7core_edac edac_core ixgbe dca mdio iomemory_vsl(P) hpsa squashfs > >>> [last unloaded: scsi_wait_scan] > >>> [223561.014748] Pid: 2079, comm: ceph-osd Tainted: P > >>> 3.0.6-1.fits.9.el6.x86_64 #1 > >>> [223561.023874] Call Trace: > >>> [223561.026738] [<ffffffff8106344f>] warn_slowpath_common+0x7f/0xc0 > >>> [223561.033564] [<ffffffff810634aa>] warn_slowpath_null+0x1a/0x20 > >>> [223561.040272] [<ffffffffa0282120>] btrfs_orphan_commit_root+0xb0/0xc0 [btrfs] > >>> [223561.048278] [<ffffffffa027ce55>] commit_fs_roots+0xc5/0x1b0 [btrfs] > >>> [223561.055534] [<ffffffff8154c231>] ? mutex_lock+0x31/0x60 > >>> [223561.061666] [<ffffffffa027ddbe>] > >>> btrfs_commit_transaction+0x3ce/0x820 [btrfs] > >>> [223561.069876] [<ffffffffa027d1b8>] ? wait_current_trans+0x28/0x110 [btrfs] > >>> [223561.077582] [<ffffffffa027e325>] ? join_transaction+0x25/0x250 [btrfs] > >>> [223561.085065] [<ffffffff81086410>] ? wake_up_bit+0x40/0x40 > >>> [223561.091251] [<ffffffffa025a329>] btrfs_sync_fs+0x59/0xd0 [btrfs] > >>> [223561.098187] [<ffffffffa02abc65>] btrfs_ioctl+0x495/0xd50 [btrfs] > >>> [223561.105120] [<ffffffff8125ed20>] ? inode_has_perm+0x30/0x40 > >>> [223561.111575] [<ffffffff81261a2c>] ? file_has_perm+0xdc/0xf0 > >>> [223561.117924] [<ffffffff8117086a>] do_vfs_ioctl+0x9a/0x5a0 > >>> [223561.124072] [<ffffffff81170e11>] sys_ioctl+0xa1/0xb0 > >>> [223561.129842] [<ffffffff81555702>] system_call_fastpath+0x16/0x1b > >>> [223561.136699] ---[ end trace 176e8be8996f25f6 ]--- > >> > >> [ Not sending this to the lists, as the attachment is large ]. > >> > >> I''ve spent a little time to do some tracing with ftrace. Its output > >> seems to be right (at least as far as I can tell). I hope that its > >> output can give you an insight on whats going on. > >> > >> The interesting PIDs in the trace are: > >> > >> 5502 root 20 0 0 0 0 S 33.6 0.0 118:28.37 btrfs-endio-wri > >> 5518 root 20 0 0 0 0 S 29.3 0.0 41:23.58 btrfs-endio-wri > >> 8059 root 20 0 400m 48m 2756 S 8.0 0.2 8:31.56 ceph-osd > >> 7993 root 20 0 401m 41m 2808 S 13.6 0.2 7:58.38 ceph-osd > >> > > > > [ adding linux-btrfs again ] > > > > I''ve been digging into this a bit further: > > > > Attached is another ftrace report that I''ve filtered for "btrfs_*" > > calls and limited to CPU0 (this is where PID 5502 was running). > > > > From what I can see there is a lot of time consumed in > > btrfs_reserve_extent(). I this normal? > > Sorry for spamming, but in the meantime I''m almost certain that the > problem is inside find_free_extent (called from btrfs_reserve_extent). 
> > When I''m running ftrace for a sample period of 10s my system is > wasting a total of 4,2 seconds inside find_free_extent(). Each call to > find_free_extent() is taking an average of 4 milliseconds to complete. > On a recently rebooted system this is only 1-2 us! > > I''m not sure if the problem is occurring suddenly or slowly over time. > (At the moment I suspect that its occurring suddenly, but I still have > to investigate this). >Ugh ok then this is lxo''s problem with our clustering stuff taking way too much time. I guess it''s time to actually take a hard look at that code. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html