I have re-run the raid tests, re-creating the fileset between each of the random write workloads, and performance now matches the previous newformat results. The bad news is that the huge gain I had attributed to the newformat release does not really exist. All of the previous results (except for the newformat run) were not re-creating the fileset, so the gain in performance was due only to having a fresh set of files, not to any code changes.

So, I have done 2 new sets of runs to look into this further. One is a 3 hour run of single threaded random write to the RAID system, which I have compared to ext3. Performance results are here:

http://btrfs.boxacle.net/repository/raid/longwrite/longwrite/Longrandomwrite.html

and graphing of all the iostat data can be found here:

http://btrfs.boxacle.net/repository/raid/longwrite/summary.html

The iostat graphs for btrfs are interesting for a number of reasons. First, it takes about 3000 seconds (or 50 minutes) for btrfs to reach steady state. Second, if you look at write throughput from the device view vs. the btrfs/application view, we see that an application throughput of 21.5MB/sec requires 63MB/sec of actual disk writes. That is an overhead of 3 to 1, vs. an overhead of ~0 for ext3. Also, looking at the change in iops vs. MB/sec, we see that while btrfs starts out with reasonably sized IOs, it quickly deteriorates to an average IO size of only 13KB. Remember, the starting file set is only 100GB on a 2.1TB filesystem, all data is overwrite, and this is single threaded, so there is no reason this should fragment. It seems like the allocator is having a problem doing sequential allocations.

Another set of runs I did was repetitive 5 minute random write runs. Results are here:

http://btrfs.boxacle.net/repository/raid/repeat/repeat/repeat.html

This shows a dramatic degradation after just a short time, but I believe there is a fair amount of overhead in btrfs after newly mounting the FS (which this test did between each run), so I repeated without unmounting and remounting; those results are here:

http://btrfs.boxacle.net/repository/raid/repeat-nomount/repeat/repeat-nomount.html

These results show a much less dramatic degradation, but btrfs still degrades by over 40% in just 30 minutes of run time. In fact, it was still degrading by 10% every 5 minutes when this test ended.

Steve
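[For reference: the 3-to-1 overhead above is just 63 / 21.5 ≈ 2.9, and the 13KB average IO size falls out of dividing device MB/sec by device IOPS. A minimal sketch of the single-threaded random-overwrite pattern being measured is below; this is a stand-in for the actual ffsb job file, and the path and sizes are placeholders, not the real configuration.]

/* Minimal sketch (not the actual ffsb profile) of the workload
 * above: single-threaded 4KB overwrites at random aligned offsets
 * inside a pre-created file.  Path and sizes are placeholders.
 */
#define _XOPEN_SOURCE 500
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t bs = 4096;               /* overwrite size */
	const off_t fsize = (off_t)100 << 30; /* 100GB fileset, one file here */
	static char buf[4096];
	long i;
	int fd = open("/mnt/btrfs/testfile", O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < 1000000; i++) {
		/* pick a random block-aligned offset and overwrite in place */
		off_t off = ((off_t)(random() % (fsize / bs))) * bs;
		if (pwrite(fd, buf, bs, off) != (ssize_t)bs) {
			perror("pwrite");
			return 1;
		}
	}
	close(fd);
	return 0;
}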
On Thu, Jul 23, 2009 at 01:35:21PM -0500, Steven Pratt wrote:
> I have re-run the raid tests, re-creating the fileset between each of the random write workloads, and performance now matches the previous newformat results. The bad news is that the huge gain I had attributed to the newformat release does not really exist. All of the previous results (except for the newformat run) were not re-creating the fileset, so the gain in performance was due only to having a fresh set of files, not to any code changes.

Thanks for doing all of these runs. This is still a little different than what I have here: my initial runs are very very fast and after 10 or so level out to a relatively low performance on random writes. With nodatacow, it stays even.

> So, I have done 2 new sets of runs to look into this further. [...] It seems like the allocator is having a problem doing sequential allocations.

There are two things happening. First, the default allocation scheme isn't very well suited to this; mount -o ssd will perform better. But over the long term, random overwrites to the file cause a lot of writes to the extent allocation tree. That's really what -o nodatacow is saving us. There are optimizations we can do, but we're holding off on that in favor of enospc and other pressing things.

But, with all of that said, Josef has some really important allocator improvements. I've put them out along with our pending patches into the experimental branch of the btrfs-unstable tree. Could you please give this branch a try both with and without the ssd mount option?

-chris
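[The variants under discussion here (default, -o ssd, -o nodatacow) are all plain btrfs mount options; a hedged sketch of cycling through them from C via mount(2) follows, with /dev/sdb and /mnt/btrfs as placeholder paths. The thread_pool option mentioned later in the thread is passed the same way, as mount data.]

/* Sketch: remounting btrfs with the options discussed above.
 * Device and mountpoint are placeholders.
 */
#include <stdio.h>
#include <sys/mount.h>

static int remount_with(const char *opts)
{
	umount("/mnt/btrfs");   /* ignore the error on the first pass */
	if (mount("/dev/sdb", "/mnt/btrfs", "btrfs", 0, opts)) {
		perror(opts);
		return -1;
	}
	return 0;
}

int main(void)
{
	remount_with("");           /* default allocator, data COW on */
	remount_with("ssd");        /* ssd allocation scheme          */
	remount_with("nodatacow");  /* skip COW for data extents      */
	return 0;
}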
Chris Mason wrote:
> On Thu, Jul 23, 2009 at 01:35:21PM -0500, Steven Pratt wrote:
>> I have re-run the raid tests, re-creating the fileset between each of the random write workloads [...]
>
> Thanks for doing all of these runs. This is still a little different than what I have here: my initial runs are very very fast and after 10 or so level out to a relatively low performance on random writes. With nodatacow, it stays even.

Right, I do not see this problem with nodatacow.

>> So, I have done 2 new sets of runs to look into this further. [...] It seems like the allocator is having a problem doing sequential allocations.
>
> There are two things happening. First, the default allocation scheme isn't very well suited to this; mount -o ssd will perform better. But over the long term, random overwrites to the file cause a lot of writes to the extent allocation tree. That's really what -o nodatacow is saving us. There are optimizations we can do, but we're holding off on that in favor of enospc and other pressing things.

Well, I have -o ssd data that I can upload, but it was worse than without. I do understand about timing and priorities.

> But, with all of that said, Josef has some really important allocator improvements. I've put them out along with our pending patches into the experimental branch of the btrfs-unstable tree. Could you please give this branch a try both with and without the ssd mount option?

Sure, will try to get to it tomorrow.

Steve
On Thu, Jul 23, 2009 at 05:04:49PM -0500, Steven Pratt wrote:
> Chris Mason wrote:
>> But, with all of that said, Josef has some really important allocator improvements. I've put them out along with our pending patches into the experimental branch of the btrfs-unstable tree. Could you please give this branch a try both with and without the ssd mount option?
>
> Sure, will try to get to it tomorrow.

Sorry, I missed a fix in the experimental branch. I'll push out a rebased version in a few minutes.

-chris
On Fri, Jul 24, 2009 at 09:24:07AM -0400, Chris Mason wrote:
>> Sure, will try to get to it tomorrow.
>
> Sorry, I missed a fix in the experimental branch. I'll push out a rebased version in a few minutes.

Ok, the rebased version is ready to use.

-chris
Chris Mason wrote:
> On Fri, Jul 24, 2009 at 09:24:07AM -0400, Chris Mason wrote:
>> Sorry, I missed a fix in the experimental branch. I'll push out a rebased version in a few minutes.
>
> Ok, the rebased version is ready to use.

Ok, good. Also, it seems I misspoke on the -o ssd results. I looked them over again this morning and they are slightly better than without: an initial score of 46MB/sec vs 44MB/sec without, degrading after 30 minutes to about 25MB/sec vs 20MB/sec without. So after 30 minutes of runtime -o ssd is running about 25% faster, but it still degrades significantly.

Steve
Chris Mason wrote:
> Ok, the rebased version is ready to use.

New results are up for both with and without nodatacow. Not much change.

http://btrfs.boxacle.net/repository/raid/history/History.html

Have another run going with nodatacow and ssd.

Steve
On Tue, Jul 28, 2009 at 03:12:38PM -0500, Steven Pratt wrote:
> New results are up for both with and without nodatacow. Not much change.
>
> http://btrfs.boxacle.net/repository/raid/history/History.html
>
> Have another run going with nodatacow and ssd.

Hi Steve,

I think I'm going to start tuning something other than the random-writes; there is definitely low hanging fruit in the large file creates workload ;) Thanks again for posting all of these.

The history graph has 2.6.31-rc btrfs against 2.6.29-rc ext4. Have you done more recent runs on ext4?

-chris
Chris Mason wrote:
> Hi Steve,
>
> I think I'm going to start tuning something other than the random-writes; there is definitely low hanging fruit in the large file creates workload ;) Thanks again for posting all of these.

Sure, no problem.

> The history graph has 2.6.31-rc btrfs against 2.6.29-rc ext4. Have you done more recent runs on ext4?

Yes, thanks for pointing that out; I had so many issues I forgot to update the graphs for the other file systems. Just pushed new graphs with data for 2.6.30-rc7 for all the other file systems. This was from your "newformat" branch from June 6th.

Steve
On Tue, Jul 28, 2009 at 04:10:41PM -0500, Steven Pratt wrote:
>> The history graph has 2.6.31-rc btrfs against 2.6.29-rc ext4. Have you done more recent runs on ext4?
>
> Yes, thanks for pointing that out [...] Just pushed new graphs with data for 2.6.30-rc7 for all the other file systems. This was from your "newformat" branch from June 6th.

I've been tuning the 128 thread large file streaming writes, and found some easy optimizations. While I'm fixing up these patches, could you please do a streaming O_DIRECT write test run for me? I think buffered writeback in general has some problems right now on high end arrays.

On my box, a 2.6.31-rc5 streaming buffered write with xfs only got to 200MB/s (with the 128 thread ffsb workload). Buffered btrfs goes at 175MB/s.

O_DIRECT btrfs runs at 390MB/s, while XFS varies a bit between 330MB/s and 250MB/s.

I'm using a 1MB write blocksize.

-chris
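[A minimal single-stream version of the O_DIRECT test being requested might look like the C sketch below; the path, file size, and 4KB buffer alignment are assumptions, and the real runs use 128 ffsb threads rather than one.]

/* Sketch of one stream of the O_DIRECT write test: 1MB aligned
 * writes to a new file.  Path, file size, and alignment are
 * assumptions; the actual workload is 128 ffsb threads.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	const size_t bs = 1 << 20;   /* 1MB write blocksize */
	const long nblocks = 1024;   /* 1GB file */
	void *buf;
	long i;
	int fd = open("/mnt/btrfs/stream0",
		      O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* O_DIRECT needs an aligned buffer; 4096 covers typical sectors */
	if (posix_memalign(&buf, 4096, bs)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	for (i = 0; i < nblocks; i++) {
		if (write(fd, buf, bs) != (ssize_t)bs) {
			perror("write");
			return 1;
		}
	}
	fsync(fd);
	close(fd);
	free(buf);
	return 0;
}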
Hi,

Do you have any benchmarks for common non-raid workloads, like, say, a desktop user? It would be great to compare against ext3, ext4, xfs, etc.

Thanks,

On Thu, Aug 6, 2009 at 2:05 AM, Chris Mason <chris.mason@oracle.com> wrote:
> I've been tuning the 128 thread large file streaming writes, and found some easy optimizations. While I'm fixing up these patches, could you please do a streaming O_DIRECT write test run for me? [...]
>
> I'm using a 1MB write blocksize.
>
> -chris
debian developer wrote:
> Hi,
>
> Do you have any benchmarks for common non-raid workloads, like, say, a desktop user? It would be great to compare against ext3, ext4, xfs, etc.

Yes. I have had a little trouble with that box recently, but there are plenty of results based on the 2.6.29 kernels here:

http://btrfs.boxacle.net/repository/single-disk/History/History.html

If you are not familiar with the runs I have been doing, you can find the details of the benchmarking machine and test procedures here:

http://btrfs.boxacle.net/

Steve
Chris Mason wrote:
> I've been tuning the 128 thread large file streaming writes, and found some easy optimizations. While I'm fixing up these patches, could you please do a streaming O_DIRECT write test run for me? I think buffered writeback in general has some problems right now on high end arrays.
>
> On my box, a 2.6.31-rc5 streaming buffered write with xfs only got to 200MB/s (with the 128 thread ffsb workload). Buffered btrfs goes at 175MB/s.
>
> O_DIRECT btrfs runs at 390MB/s, while XFS varies a bit between 330MB/s and 250MB/s.
>
> I'm using a 1MB write blocksize.

On my todo list, but am swamped this week trying to get ready for vacation. Will try to get to it as soon as I can.

Steve
On Fri, Aug 07, 2009 at 08:56:52AM -0500, Steven Pratt wrote:
> Chris Mason wrote:
>> I've been tuning the 128 thread large file streaming writes, and found some easy optimizations. While I'm fixing up these patches, could you please do a streaming O_DIRECT write test run for me? [...]
>
> On my todo list, but am swamped this week trying to get ready for vacation. Will try to get to it as soon as I can.

Ok, I've pushed out a very raw version of my buffered write fixes to a new branch named performance on btrfs-unstable.

Please try this with the streaming large file create workload. I'm also curious to see if it improves on your box when you mount with

mount -o thread_pool=128

-chris
Chris Mason wrote:
> Ok, I've pushed out a very raw version of my buffered write fixes to a new branch named performance on btrfs-unstable.
>
> Please try this with the streaming large file create workload. I'm also curious to see if it improves on your box when you mount with
>
> mount -o thread_pool=128

Better late than never; finally got this finished up. Mixed bag on this one. BTRFS lags significantly on single threaded; it seems unable to keep IO outstanding to the device: less than 60% busy on the DM device, compared to 97%+ for all the other filesystems. nodatacow helps out, increasing utilization to about 70%, but btrfs still trails by a large margin.

Results are more favorable for the multithreaded tests. nodatacow is actually the top performer here! However, cow still raises its ugly head and causes significant performance degradation (45%) and increased CPU (43%). Also, even without cow, BTRFS is consuming 8-10x more CPU than the other file systems. I don't have oprofile data for these runs, as that was causing some issues with BTRFS. Will retry and see if that problem is fixed.

thread_pool seemed to make no difference at all.

All runs were done against an August 20th pull of the experimental tree. These are 1M odirect file creates, with each file being 1G in size. Results can be found here:

http://btrfs.boxacle.net/repository/raid/large_create_test/write-test/1M_odirect_create.html

Steve
On Mon, Aug 31, 2009 at 12:49:13PM -0500, Steven Pratt wrote:
> Better late than never; finally got this finished up. Mixed bag on this one. BTRFS lags significantly on single threaded; it seems unable to keep IO outstanding to the device: less than 60% busy on the DM device, compared to 97%+ for all the other filesystems. nodatacow helps out, increasing utilization to about 70%, but btrfs still trails by a large margin.

Hi Steve,

Jens Axboe did some profiling on his big test rig and I think we found the biggest CPU problems. The end result is now sitting in the master branch of the btrfs-unstable repo.

On his boxes, btrfs went from around 400MB/s streaming writes to the 1GB/s limit, and we're now tied with XFS while using less CPU time.

Hopefully you will see similar results ;)

-chris
Chris Mason wrote:
> Jens Axboe did some profiling on his big test rig and I think we found the biggest CPU problems. The end result is now sitting in the master branch of the btrfs-unstable repo.
>
> On his boxes, btrfs went from around 400MB/s streaming writes to the 1GB/s limit, and we're now tied with XFS while using less CPU time.
>
> Hopefully you will see similar results ;)

Hmmm, well no, I didn't. Throughputs at 1 and 128 threads are pretty much unchanged, although I do see a good CPU savings in the 128 thread case (with cow). For 16 threads we actually regressed with cow enabled.

Results are here:

http://btrfs.boxacle.net/repository/raid/large_create_test/write-test/1M_odirect_create.html

I'll try to look more into this next week.

Steve
On Fri, Sep 11, 2009 at 04:35:50PM -0500, Steven Pratt wrote:
> Hmmm, well no, I didn't. Throughputs at 1 and 128 threads are pretty much unchanged, although I do see a good CPU savings in the 128 thread case (with cow). For 16 threads we actually regressed with cow enabled.
>
> I'll try to look more into this next week.

Hmmm, Jens was benchmarking buffered writes, but he was also testing on his new per-bdi writeback code. If your next run could be buffered instead of O_DIRECT, I'd be curious to see the results.

Thanks,
Chris
On Mon, Sep 14 2009, Chris Mason wrote:
> Hmmm, Jens was benchmarking buffered writes, but he was also testing on his new per-bdi writeback code. If your next run could be buffered instead of O_DIRECT, I'd be curious to see the results.

I found out today that a larger MAX_WRITEBACK_PAGES is still essential for me. It basically doubles throughput on btrfs. So I think we need to do something about that, sooner rather than later.

--
Jens Axboe
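[For context: MAX_WRITEBACK_PAGES caps how many pages the kernel pushes per writeback chunk before moving on to the next inode. If memory serves, stock kernels of this era defined it as 1024 pages (4MB with 4KB pages), a small batch for a big array; the exact location and value in Jens's per-bdi tree may differ, so treat the sketch below as an assumption about the kind of bump being tested, not his actual patch.]

/* Hedged sketch of the writeback-chunk tuning under discussion.
 * The stock value (assumed) was 1024 pages per chunk:
 */
#define MAX_WRITEBACK_PAGES	1024	/* 4MB chunks with 4KB pages */

/* Experimental: a larger value sends bigger contiguous batches,
 * which keeps a high-end array busy between writeback passes, e.g.:
 *
 * #define MAX_WRITEBACK_PAGES	16384	   64MB chunks
 */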
Chris Mason wrote:
> Hmmm, Jens was benchmarking buffered writes, but he was also testing on his new per-bdi writeback code. If your next run could be buffered instead of O_DIRECT, I'd be curious to see the results.

Buffered does look a lot better. I don't have a btrfs baseline before these latest changes for this exact workload, but these results are not bad at all. With cow, it beats just about everything except XFS, and with nocow it simply screams. CPU consumption looks good as well. I'll probably give the full set of tests a run tonight. Results are here:

http://btrfs.boxacle.net/repository/raid/buffered-creates/buffered-create/buffered-create.html

The only bit of bad news is I did get one error that crashed the system on the single threaded nocow run, so that data point is missing. Output below:

btrfs1 kernel: [251789.525886] ------------[ cut here ]------------
btrfs1 kernel: [251789.526574] invalid opcode: 0000 [#1] SMP
btrfs1 kernel: [251789.526654] last sysfs file: /sys/devices/pci0000:0c/0000:0c:01.0/local_cpus
btrfs1 kernel: [251789.526654] Stack:
btrfs1 kernel: [251789.526654]  ffff88013fc234c0 ffff88013fbcf400 0000000000000000 ffff88013fc01080
btrfs1 kernel: [251789.526654]  ffff880132e11d38 ffffffff802a5392 0000000000000001 ffff88013fbcf400
btrfs1 kernel: [251789.526654] Call Trace:
btrfs1 kernel: [251789.526654]  [<ffffffff802a5392>] cache_flusharray+0x7d/0xae
btrfs1 kernel: [251789.526654]  [<ffffffff802a5629>] kfree+0x192/0x1b1
btrfs1 kernel: [251789.526654]  [<ffffffffa0378c9f>] put_worker+0x14/0x16 [btrfs]
btrfs1 kernel: [251789.526654]  [<ffffffffa0378d55>] btrfs_stop_workers+0xb4/0xc9 [btrfs]
btrfs1 kernel: [251789.526654]  [<ffffffffa0355cbe>] close_ctree+0x210/0x288 [btrfs]
btrfs1 kernel: [251789.526654]  [<ffffffff802bd1a1>] ? invalidate_inodes+0x100/0x112
btrfs1 kernel: [251789.526654]  [<ffffffffa033f4cb>] btrfs_put_super+0x18/0x27 [btrfs]
btrfs1 kernel: [251789.526654]  [<ffffffff802ad12b>] generic_shutdown_super+0x73/0xe2
btrfs1 kernel: [251789.526654]  [<ffffffff802ad1e5>] kill_anon_super+0x11/0x3b
btrfs1 kernel: [251789.526654]  [<ffffffff802ad51d>] deactivate_super+0x62/0x77
btrfs1 kernel: [251789.526654]  [<ffffffff802bf9eb>] mntput_no_expire+0xec/0x12c
btrfs1 kernel: [251789.526654]  [<ffffffff802bff3a>] sys_umount+0x2c5/0x31c
btrfs1 kernel: [251789.526654]  [<ffffffff8020ba2b>] system_call_fastpath+0x16/0x1b
btrfs1 kernel: [251789.526654] Code: 89 f7 e8 48 07 f8 ff 48 c1 e8 0c 48 ba 00 00 00 00 00 e2 ff ff 48 6b c0 38 48 01 d0 66 83 38 00 79 04 48 8b 40 10 80 38 00 78 04 <0f> 0b eb fe 48 8b 58 30 48 63 45 c8 48 89 df 4d 8b a4 c5 60 08

Steve
On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
> Buffered does look a lot better. I don't have a btrfs baseline before these latest changes for this exact workload, but these results are not bad at all. With cow, it beats just about everything except XFS, and with nocow it simply screams. CPU consumption looks good as well. I'll probably give the full set of tests a run tonight.

Wow, good news at last ;)

For the oops, try the patch below (I need to push it out, but I think it'll help). I'll try to figure out the O_DIRECT problems.

-chris

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 6ea5cd0..ba28742 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -177,7 +177,7 @@ static int try_worker_shutdown(struct btrfs_worker_thread *worker)
 	int freeit = 0;
 
 	spin_lock_irq(&worker->lock);
-	spin_lock_irq(&worker->workers->lock);
+	spin_lock(&worker->workers->lock);
 	if (worker->workers->num_workers > 1 &&
 	    worker->idle &&
 	    !worker->working &&
@@ -188,7 +188,7 @@ static int try_worker_shutdown(struct btrfs_worker_thread *worker)
 		list_del_init(&worker->worker_list);
 		worker->workers->num_workers--;
 	}
-	spin_unlock_irq(&worker->workers->lock);
+	spin_unlock(&worker->workers->lock);
 	spin_unlock_irq(&worker->lock);
 
 	if (freeit)
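[For anyone reading along, the bug this patch addresses is the nested irq-lock pattern: the inner spin_unlock_irq() re-enables interrupts while the outer irq-disabling lock is still held. A stripped-down illustration follows; this is not btrfs code, and the lock names are made up.]

/* Illustration of the pattern the patch above fixes; not btrfs
 * code, lock names are invented for the example.
 */
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(outer);
static DEFINE_SPINLOCK(inner);

static void broken(void)
{
	spin_lock_irq(&outer);
	spin_lock_irq(&inner);    /* irqs are already off here */
	/* ... */
	spin_unlock_irq(&inner);  /* bug: irqs back on while outer held */
	spin_unlock_irq(&outer);
}

static void fixed(void)
{
	spin_lock_irq(&outer);    /* only the outermost lock ...  */
	spin_lock(&inner);        /* ... toggles the irq state    */
	/* ... */
	spin_unlock(&inner);
	spin_unlock_irq(&outer);
}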
On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
> The only bit of bad news is I did get one error that crashed the system on the single threaded nocow run, so that data point is missing. Output below:

I hope I've got this fixed. If you pull from the master branch of btrfs-unstable there are fixes for async thread races. The single patch I sent before is included, but not enough.

-chris
Chris Mason wrote:
> I hope I've got this fixed. If you pull from the master branch of btrfs-unstable there are fixes for async thread races. The single patch I sent before is included, but not enough.

Glad you said that. Keeps me from sending the email that said the patch didn't help :-)

Steve
Steven Pratt wrote:
> Chris Mason wrote:
>> I hope I've got this fixed. If you pull from the master branch of btrfs-unstable there are fixes for async thread races. The single patch I sent before is included, but not enough.
>
> Glad you said that. Keeps me from sending the email that said the patch didn't help :-)

Well, still getting oopses even with the new code. Lots of:

Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] BUG: soft lockup - CPU#10 stuck for 61s! [btrfs-endio-1:30250]
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] Pid: 30250, comm: btrfs-endio-1 Not tainted 2.6.31-autokern1 #1 IBM x3950-[88726RU]-
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RIP: 0010:[<ffffffff81153920>]  [<ffffffff81153920>] crc32c+0x20/0x26
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RSP: 0018:ffff88013a857cc8  EFLAGS: 00000217
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RAX: 0000000000000040 RBX: ffff88013a857cc8 RCX: ffff88013d8022c0
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RDX: 0000000000000010 RSI: ffff88001d349ff0 RDI: 0000000041703e71
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RBP: ffffffff8100c4ee R08: 0000000000000000 R09: 0000000000000000
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] R10: ffff88013a857d30 R11: 0000000000000002 R12: ffff88013a857d10
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] R13: 0000000000000002 R14: ffff88013a857cb0 R15: ffffffff8100c38e
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] FS: 0000000000000000(0000) GS:ffff880028159000(0000) knlGS:0000000000000000
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] CR2: 0000000000000043 CR3: 00000001368f7000 CR4: 00000000000006e0
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] Call Trace:
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754]  [<ffffffff8115397e>] ? chksum_update+0x10/0x18
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754]  [<ffffffff81150084>] ? crypto_shash_update+0x1a/0x1c
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754]  [<ffffffff81175c34>] ? crc32c+0x4c/0x60
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754]  [<ffffffffa0391d0f>] ? get_state_private+0x38/0x6f [btrfs]
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754]  [<ffffffffa0376688>] ? btrfs_csum_data+0xd/0xf [btrfs]
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754]  [<ffffffffa037fefc>] ? btrfs_readpage_end_io_hook+0x158/0x27b [btrfs]
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754]  [<ffffffffa0392a46>] ? end_bio_extent_readpage+0xb8/0x1c0 [btrfs]
Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754]  [<ffffffff810e5733>] ? bio_endio+0x26/0x28
Sep 16 11:07:27 btrfs1 kernel: [ 1862.947656]  [<ffffffffa037666e>] ? end_workqueue_fn+0x111/0x11e [btrfs]
Sep 16 11:07:27 btrfs1 kernel: [ 1862.947823]  [<ffffffffa039a490>] ? worker_loop+0x12a/0x3ea [btrfs]
Sep 16 11:07:27 btrfs1 kernel: [ 1862.947823]  [<ffffffffa039a366>] ? worker_loop+0x0/0x3ea [btrfs]
Sep 16 11:07:27 btrfs1 kernel: [ 1862.948800]  [<ffffffff810544e4>] ? kthread+0x8f/0x97
Sep 16 11:07:27 btrfs1 kernel: [ 1862.948800]  [<ffffffff8100ca1a>] ? child_rip+0xa/0x20
Sep 16 11:07:27 btrfs1 kernel: [ 1862.948800]  [<ffffffff81054455>] ? kthread+0x0/0x97
Sep 16 11:07:27 btrfs1 kernel: [ 1862.948800]  [<ffffffff8100ca10>] ? child_rip+0x0/0x20

Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] Pid: 31421, comm: btrfs-endio-wri Not tainted 2.6.31-autokern1 #1 IBM x3950-[88726RU]-
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] RIP: 0010:[<ffffffffa036afb3>]  [<ffffffffa036afb3>] alloc_reserved_file_extent+0x8d/0x1c3 [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] RSP: 0018:ffff8800aa555af0  EFLAGS: 00010282
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] RAX: 00000000ffffffef RBX: ffff88013b55e000 RCX: 0000000000000002
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88012f20a9a0
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] RBP: ffff8800aa555b60 R08: ffff8800aa555888 R09: ffff8800aa555880
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] R10: ffff880077937400 R11: 00000000fffffffa R12: 000000000000001d
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] R13: ffff880077937400 R14: 0000000000000000 R15: 0000000000000000
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] FS: 0000000000000000(0000) GS:ffff88002804b000(0000) knlGS:0000000000000000
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] CR2: 00000000007c0000 CR3: 000000013e038000 CR4: 00000000000006f0
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] Process btrfs-endio-wri (pid: 31421, threadinfo ffff8800aa554000, task ffff8801395447a0)
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] Stack:
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  ffff880077937400 0000000000000a7c 0000000000000005 0000000000000000
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] <0> ffff880101d0c800 ffff8801140bbd20 000000b2aa555b60 ffffffffa036a190
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] <0> 000000350000091d ffff8801090fdd40 ffff88013a4e9d40 0000000000000001
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621] Call Trace:
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa036a190>] ? update_reserved_extents+0xa7/0xbe [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa036f430>] run_one_delayed_ref+0x382/0x42f [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffff8100c4ee>] ? apic_timer_interrupt+0xe/0x20
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa03700b1>] run_clustered_refs+0x237/0x2b4 [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa03a5665>] ? btrfs_find_ref_cluster+0xdc/0x115 [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa03701da>] btrfs_run_delayed_refs+0xac/0x195 [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa0379a76>] __btrfs_end_transaction+0x59/0xfe [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa0379b36>] btrfs_end_transaction+0xb/0xd [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa037f29b>] btrfs_finish_ordered_io+0x23c/0x265 [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa037f2d9>] btrfs_writepage_end_io_hook+0x15/0x17 [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa0392901>] end_bio_extent_writepage+0xa5/0x132 [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffff810e5733>] bio_endio+0x26/0x28
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa037666e>] end_workqueue_fn+0x111/0x11e [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa039a490>] worker_loop+0x12a/0x3ea [btrfs]
Sep 16 11:54:47 btrfs1 kernel: [ 4703.082621]  [<ffffffffa039a366>] ? worker_loop+0x0/0x3ea [btrfs]
Sep 16 11:54:48 btrfs1 kernel: [ 4703.082621]  [<ffffffff810544e4>] kthread+0x8f/0x97
Sep 16 11:54:48 btrfs1 kernel: [ 4703.082621]  [<ffffffff8100ca1a>] child_rip+0xa/0x20
Sep 16 11:54:48 btrfs1 kernel: [ 4703.082621]  [<ffffffff81054455>] ? kthread+0x0/0x97
Sep 16 11:54:48 btrfs1 kernel: [ 4703.082621]  [<ffffffff8100ca10>] ? child_rip+0x0/0x20
Sep 16 11:54:48 btrfs1 kernel: [ 4703.082621] Code: 08 4c 8d 45 d4 41 8d 44 24 18 48 8b 73 20 48 8b 4d 18 41 b9 01 00 00 00 48 8b 7d b8 4c 89 ea 89 45 d4 e8 93 e3 ff ff 85 c0 74 04 <0f> 0b eb fe 49 63 75 40 4d 8b 65 00 49 83 cf 01 4c 89 e7 48 6b

Happened on 2 machines.

Steve
On Wed, Sep 16, 2009 at 12:57:22PM -0500, Steven Pratt wrote:
> Well, still getting oopses even with the new code. Lots of:
>
> Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] BUG: soft lockup - CPU#10 stuck for 61s! [btrfs-endio-1:30250]
> Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RIP: 0010:[<ffffffff81153920>]  [<ffffffff81153920>] crc32c+0x20/0x26

If I'm reading this right, you've got a softlockup in crc32c? Something has gone really wrong here. Are you reusing datasets from old runs?

-chris
Chris Mason wrote:
> If I'm reading this right, you've got a softlockup in crc32c? Something has gone really wrong here. Are you reusing datasets from old runs?

No, mkfs before every run.

Steve
Chris Mason wrote:
> On Wed, Sep 16, 2009 at 12:57:22PM -0500, Steven Pratt wrote:
>> Steven Pratt wrote:
>>> Chris Mason wrote:
>>>> On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
>>>>> Only bit of bad news is I did get one error that crashed the system
>>>>> on single threaded nocow run. So that data point is missing.
>>>>> Output below:
>>>>
>>>> I hope I've got this fixed. If you pull from the master branch of
>>>> btrfs-unstable there are fixes for async thread races. The single
>>>> patch I sent before is included, but not enough.
>>>
>>> Glad you said that. Keeps me from sending the email that said the
>>> patch didn't help :-)
>>>
>>> Steve
>>
>> Well, still getting oopses even with new code.
>>
>> Lots of:
>> Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] BUG: soft lockup - CPU#10 stuck for 61s! [btrfs-endio-1:30250]
>> Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] Pid: 30250, comm: btrfs-endio-1 Not tainted 2.6.31-autokern1 #1 IBM x3950-[88726RU]-
>> Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RIP: 0010:[<ffffffff81153920>] [<ffffffff81153920>] crc32c+0x20/0x26
>
> If I'm reading this right, you've got a softlockup in crc32c? Something
> has gone really wrong here. Are you reusing datasets from old runs?

From the second machine a single bug:

Sep 16 11:53:42 btrfs2 kernel: [ 3769.298240] ------------[ cut here ]------------
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] kernel BUG at fs/btrfs/extent-tree.c:4097!
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] invalid opcode: 0000 [#1] SMP
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] CPU 9
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler btrfs zlib_deflate oprofile autofs4 nfs lockd nfs_acl auth_rpcgss sunrpc dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg joydev serio_raw acpi_memhotplug rtc_cmos rtc_core rtc_lib button tg3 libphy i2c_piix4 i2c_core pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod lpfc scsi_transport_fc aic94xx libsas libata scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] Pid: 2106, comm: btrfs-endio-wri Not tainted 2.6.31-autokern1 #1 IBM x3950-[88726RU]-
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] RIP: 0010:[<ffffffffa0386fb3>] [<ffffffffa0386fb3>] alloc_reserved_file_extent+0x8d/0x1c3 [btrfs]
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] RSP: 0018:ffff88002758faf0 EFLAGS: 00010282
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] RAX: 00000000ffffffef RBX: ffff880136434000 RCX: 0000000000000002
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8800a7040370
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] RBP: ffff88002758fb60 R08: ffff88002758f958 R09: ffff88002758f950
Sep 16 11:53:42 btrfs2 kernel: [ 3769.298550] R10: 0000000000000004 R11: ffff8800a7040370 R12: 000000000000001d
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] R13: ffff8800b79e6910 R14: 0000000000000000 R15: 0000000000000000
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] FS: 0000000000000000(0000) GS:ffff88002813e000(0000) knlGS:0000000000000000
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] CR2: 00007f1f6915a000 CR3: 000000013dd4e000 CR4: 00000000000006e0
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] Process btrfs-endio-wri (pid: 2106, threadinfo ffff88002758e000, task ffff88013b94c100)
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] Stack:
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] ffff8800709fc760 0000000000000856 0000000000000005 0000000000000000
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] <0> ffff8801329d5000 ffff880102242de0 000000b22758fb60 ffffffffa0386190
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] <0> 00000035329d5000 ffff880128291440 ffff880108302340 0000000000000001
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] Call Trace:
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa0386190>] ? update_reserved_extents+0xa7/0xbe [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa038b430>] run_one_delayed_ref+0x382/0x42f [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa038c0b1>] run_clustered_refs+0x237/0x2b4 [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa03c1665>] ? btrfs_find_ref_cluster+0xdc/0x115 [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa038c1da>] btrfs_run_delayed_refs+0xac/0x195 [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa0395a76>] __btrfs_end_transaction+0x59/0xfe [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa0395b36>] btrfs_end_transaction+0xb/0xd [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa039b29b>] btrfs_finish_ordered_io+0x23c/0x265 [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa039b2d9>] btrfs_writepage_end_io_hook+0x15/0x17 [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa03ae901>] end_bio_extent_writepage+0xa5/0x132 [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffff810e5733>] bio_endio+0x26/0x28
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa039266e>] end_workqueue_fn+0x111/0x11e [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa03b6490>] worker_loop+0x12a/0x3ea [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffffa03b6366>] ? worker_loop+0x0/0x3ea [btrfs]
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffff810544e4>] kthread+0x8f/0x97
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffff8100ca1a>] child_rip+0xa/0x20
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffff81054455>] ? kthread+0x0/0x97
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] [<ffffffff8100ca10>] ? child_rip+0x0/0x20
Sep 16 11:53:43 btrfs2 kernel: [ 3769.298550] Code: 08 4c 8d 45 d4 41 8d 44 24 18 48 8b 73 20 48 8b 4d 18 41 b9 01 00 00 00 48 8b 7d b8 4c 89 ea 89 45 d4 e8 93 e3 ff ff 85 c0 74 04 <0f> 0b eb fe 49 63 75 40 4d 8b 65 00 49 83 cf 01 4c 89 e7 48 6b

Steve
On Wed, Sep 16, 2009 at 01:15:12PM -0500, Steven Pratt wrote:
> Chris Mason wrote:
> > On Wed, Sep 16, 2009 at 12:57:22PM -0500, Steven Pratt wrote:
> > > Steven Pratt wrote:
> > > > Chris Mason wrote:
> > > > > On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
> > > > > > Only bit of bad news is I did get one error that crashed the system
> > > > > > on single threaded nocow run. So that data point is missing.
> > > > > > Output below:
> > > > > I hope I've got this fixed. If you pull from the master branch of
> > > > > btrfs-unstable there are fixes for async thread races. The single
> > > > > patch I sent before is included, but not enough.
> > > > Glad you said that. Keeps me from sending the email that said the
> > > > patch didn't help :-)
> > > >
> > > > Steve
> > > Well, still getting oopses even with new code.
> > >
> > > Lots of:
> > > Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] BUG: soft lockup - CPU#10 stuck for 61s! [btrfs-endio-1:30250]
> > > Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] Pid: 30250, comm: btrfs-endio-1 Not tainted 2.6.31-autokern1 #1 IBM x3950-[88726RU]-
> > > Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RIP: 0010:[<ffffffff81153920>] [<ffffffff81153920>] crc32c+0x20/0x26
> >
> > If I'm reading this right, you've got a softlockup in crc32c? Something
> > has gone really wrong here. Are you reusing datasets from old runs?
> No, mkfs before every run.

Could you please send me the full softlockup output? It's hard to read when it's all line-wrapped, so the original files would help.

-chris
On Wed, Sep 16, 2009 at 01:16:56PM -0500, Steven Pratt wrote:
> Chris Mason wrote:
> > On Wed, Sep 16, 2009 at 12:57:22PM -0500, Steven Pratt wrote:
> > > Steven Pratt wrote:
> > > > Chris Mason wrote:
> > > > > On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
> > > > > > Only bit of bad news is I did get one error that crashed the system
> > > > > > on single threaded nocow run. So that data point is missing.
> > > > > > Output below:
> > > > > I hope I've got this fixed. If you pull from the master branch of
> > > > > btrfs-unstable there are fixes for async thread races. The single
> > > > > patch I sent before is included, but not enough.
> > > > Glad you said that. Keeps me from sending the email that said the
> > > > patch didn't help :-)
> > > >
> > > > Steve
> > > Well, still getting oopses even with new code.
> > >
> > > Lots of:
> > > Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] BUG: soft lockup - CPU#10 stuck for 61s! [btrfs-endio-1:30250]
> > > Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] Pid: 30250, comm: btrfs-endio-1 Not tainted 2.6.31-autokern1 #1 IBM x3950-[88726RU]-
> > > Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RIP: 0010:[<ffffffff81153920>] [<ffffffff81153920>] crc32c+0x20/0x26
> >
> > If I'm reading this right, you've got a softlockup in crc32c? Something
> > has gone really wrong here. Are you reusing datasets from old runs?
> From the second machine a single bug:
> Sep 16 11:53:42 btrfs2 kernel: [ 3769.298240] ------------[ cut here

Ok, which mount options and job file is this from?

-chris
Chris Mason wrote:
> On Wed, Sep 16, 2009 at 01:16:56PM -0500, Steven Pratt wrote:
>> Chris Mason wrote:
>>> On Wed, Sep 16, 2009 at 12:57:22PM -0500, Steven Pratt wrote:
>>>> Steven Pratt wrote:
>>>>> Chris Mason wrote:
>>>>>> On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
>>>>>>> Only bit of bad news is I did get one error that crashed the system
>>>>>>> on single threaded nocow run. So that data point is missing.
>>>>>>> Output below:
>>>>>> I hope I've got this fixed. If you pull from the master branch of
>>>>>> btrfs-unstable there are fixes for async thread races. The single
>>>>>> patch I sent before is included, but not enough.
>>>>> Glad you said that. Keeps me from sending the email that said the
>>>>> patch didn't help :-)
>>>>>
>>>>> Steve
>>>> Well, still getting oopses even with new code.
>>>>
>>>> Lots of:
>>>> Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] BUG: soft lockup - CPU#10 stuck for 61s! [btrfs-endio-1:30250]
>>>> Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] Pid: 30250, comm: btrfs-endio-1 Not tainted 2.6.31-autokern1 #1 IBM x3950-[88726RU]-
>>>> Sep 16 11:07:27 btrfs1 kernel: [ 1862.942754] RIP: 0010:[<ffffffff81153920>] [<ffffffff81153920>] crc32c+0x20/0x26
>>> If I'm reading this right, you've got a softlockup in crc32c? Something
>>> has gone really wrong here. Are you reusing datasets from old runs?
>> From the second machine a single bug:
>> Sep 16 11:53:42 btrfs2 kernel: [ 3769.298240] ------------[ cut here
>
> Ok, which mount options and job file is this from?

mount -t btrfs /dev/ffsbdev1 /mnt/ffsb1

[20090916-11:47:37.738883526] PROCESSING COMMAND : 'run random_writes__threads_0001 ffsb http://hks.austin.ibm.com/users/corry/btrfs/ffsb/profiles/btrfs2/random_writes.ffsb num_threads=1'

So, this is the single-disk machine, running the single-threaded random write workload. Buffered, not O_DIRECT. I'm packaging up the full messages file; the repeated errors make it big. Will send separately.

Steve
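[Editorial note: for readers without the ffsb profile handy, the workload above boils down to single-threaded, buffered, block-aligned random overwrites of pre-existing files. A rough standalone sketch in C; the file path, file size, block size, and iteration count are made-up placeholders, not values from the actual profile:]

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const off_t file_size = 1024 * 1024 * 1024; /* placeholder: 1GB file */
    const size_t bs = 4096;                     /* placeholder: 4k writes */
    char buf[4096];
    int fd, i;

    memset(buf, 0xab, sizeof(buf));

    /* the file is pre-created; this is pure overwrite, no O_DIRECT */
    fd = open("/mnt/ffsb1/data/testfile", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (i = 0; i < 100000; i++) {
        /* pick a block-aligned offset inside the existing file */
        off_t off = (off_t)(rand() % (file_size / bs)) * (off_t)bs;
        if (pwrite(fd, buf, bs, off) != (ssize_t)bs) {
            perror("pwrite");
            break;
        }
    }

    close(fd);
    return 0;
}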
Chris Mason wrote:
> On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
>> Only bit of bad news is I did get one error that crashed the system
>> on single threaded nocow run. So that data point is missing.
>> Output below:
>
> I hope I've got this fixed. If you pull from the master branch of
> btrfs-unstable there are fixes for async thread races. The single
> patch I sent before is included, but not enough.

Chris:

FYI - all five of my test systems have now finished my standard test cycle on the -unstable master branch, and I've not seen a single hang. So, your fix for the async thread shutdown race seems to have fixed my problems, even if Steve's still seeing trouble.

I'll note that the running times for fsstress on some of my systems have become rather longer with btrfs-unstable/master kernels - 3.5 rather than 2.5 hours on multidevice filesystems. Running times on single device filesystems are roughly the same.

I'm going to start another set of tests for thoroughness unless you've got more patches coming.

Thanks,
Eric
Eric Whitney wrote:
> Chris Mason wrote:
>> On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
>>> Only bit of bad news is I did get one error that crashed the system
>>> on single threaded nocow run. So that data point is missing.
>>> Output below:
>>
>> I hope I've got this fixed. If you pull from the master branch of
>> btrfs-unstable there are fixes for async thread races. The single
>> patch I sent before is included, but not enough.
>
> Chris:
>
> FYI - all five of my test systems have now finished my standard test
> cycle on the -unstable master branch, and I've not seen a single hang.
> So, your fix for the async thread shutdown race seems to have fixed my
> problems, even if Steve's still seeing trouble.
>
> I'll note that the running times for fsstress on some of my systems
> have become rather longer with btrfs-unstable/master kernels - 3.5
> rather than 2.5 hours on multidevice filesystems. Running times on
> single device filesystems are roughly the same.
>
> I'm going to start another set of tests for thoroughness unless you've
> got more patches coming.

I've had some offline discussions with Chris, and it seems the problem is triggered by unmounting and re-mounting the file system between tests (but not running mkfs again). I have also just verified that the problem does not occur if repeated tests are run without the unmount/mount cycle. So in case this is not clear (a scripted sketch of the failing recipe follows this message):

  mkfs
  mount
  create new files
  run test
  umount
  mount
  delete old files
  create new files
  run test
  BUG

but...

  mkfs
  mount
  create new files
  run test
  umount
  mkfs        <------ different
  mount
  delete old files
  create new files
  run test
  ... all is fine

or...

  mkfs
  mount
  create new files
  run test
  # no mounts or mkfs here
  delete old files
  create new files
  run test
  ... all is fine

Steve
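[Editorial note: a minimal shell sketch of the first (failing) recipe above. The device and mount point names are borrowed from Steve's earlier mail, and "run test" stands in for the ffsb random-write job, so treat this as an illustration rather than the exact harness:]

#!/bin/sh
DEV=/dev/ffsbdev1
MNT=/mnt/ffsb1

mkfs.btrfs $DEV
mount -t btrfs $DEV $MNT
# create new files, run test...
umount $MNT

# remount WITHOUT running mkfs again -- this is the case that hits the BUG
mount -t btrfs $DEV $MNT
# delete old files, create new files, run test...  -> kernel BUG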
On Thu, Sep 17, 2009 at 01:39:01PM -0500, Steven Pratt wrote:
> Eric Whitney wrote:
> > Chris Mason wrote:
> > > On Mon, Sep 14, 2009 at 04:41:48PM -0500, Steven Pratt wrote:
> > > > Only bit of bad news is I did get one error that crashed the system
> > > > on single threaded nocow run. So that data point is missing.
> > > > Output below:
> > >
> > > I hope I've got this fixed. If you pull from the master branch of
> > > btrfs-unstable there are fixes for async thread races. The single
> > > patch I sent before is included, but not enough.
> >
> > Chris:
> >
> > FYI - all five of my test systems have now finished my standard
> > test cycle on the -unstable master branch, and I've not seen a
> > single hang. So, your fix for the async thread shutdown race seems
> > to have fixed my problems, even if Steve's still seeing trouble.
> >
> > I'll note that the running times for fsstress on some of my
> > systems have become rather longer with btrfs-unstable/master
> > kernels - 3.5 rather than 2.5 hours on multidevice filesystems.
> > Running times on single device filesystems are roughly the same.
> >
> > I'm going to start another set of tests for thoroughness unless
> > you've got more patches coming.
> I've had some offline discussions with Chris, and it seems the
> problem is triggered by unmounting and re-mounting the file system
> between tests (but not running mkfs again). I have also just
> verified that the problem does not occur if repeated tests are run
> without the unmount/mount cycle. So in case this is not clear:

Ok, I've triggered it here. Next step is trying Yan Zheng's async caching update.

------------[ cut here ]------------
kernel BUG at fs/btrfs/extent-tree.c:4097!
invalid opcode: 0000 [#1] SMP

-chris
[ crashes on runs involving unmounts ] The run is still going here, but it has survived longer than before. I''m trying with Yan Zheng''s patch: From: Yan Zheng <zheng.yan@oracle.com> Date: Fri, 11 Sep 2009 16:11:19 -0400 Subject: [PATCH] Btrfs: improve async block group caching This patch gets rid of two limitations of async block group caching. The old code delays handling pinned extents when block group is in caching. To allocate logged file extents, the old code need wait until block group is fully cached. To get rid of the limitations, This patch introduces a data structure to track the progress of caching. Base on the caching progress, we know which extents should be added to the free space cache when handling the pinned extents. The logged file extents are also handled in a similar way. This patch also changes how pinned extents are tracked. The old code uses one tree to track pinned extents, and copy the pinned extents tree at transaction commit time. This patch makes it use two trees to track pinned extents. One tree for extents that are pinned in the running transaction, one tree for extents that can be unpinned. At transaction commit time, we swap the two trees. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> --- fs/btrfs/ctree.h | 29 ++- fs/btrfs/disk-io.c | 7 +- fs/btrfs/extent-tree.c | 586 +++++++++++++++++++++++++++++------------------- fs/btrfs/transaction.c | 15 +- fs/btrfs/tree-log.c | 4 +- 5 files changed, 382 insertions(+), 259 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 732d5b8..3b6df71 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -726,6 +726,15 @@ enum btrfs_caching_type { BTRFS_CACHE_FINISHED = 2, }; +struct btrfs_caching_control { + struct list_head list; + struct mutex mutex; + wait_queue_head_t wait; + struct btrfs_block_group_cache *block_group; + u64 progress; + atomic_t count; +}; + struct btrfs_block_group_cache { struct btrfs_key key; struct btrfs_block_group_item item; @@ -742,8 +751,9 @@ struct btrfs_block_group_cache { int dirty; /* cache tracking stuff */ - wait_queue_head_t caching_q; int cached; + struct btrfs_caching_control *caching_ctl; + u64 last_byte_to_unpin; struct btrfs_space_info *space_info; @@ -788,7 +798,8 @@ struct btrfs_fs_info { spinlock_t block_group_cache_lock; struct rb_root block_group_cache_tree; - struct extent_io_tree pinned_extents; + struct extent_io_tree freed_extents[2]; + struct extent_io_tree *pinned_extents; /* logical->physical extent mapping */ struct btrfs_mapping_tree mapping_tree; @@ -825,8 +836,6 @@ struct btrfs_fs_info { struct mutex drop_mutex; struct mutex volume_mutex; struct mutex tree_reloc_mutex; - struct rw_semaphore extent_commit_sem; - /* * this protects the ordered operations list only while we are * processing all of the entries on it. This way we make @@ -835,10 +844,12 @@ struct btrfs_fs_info { * before jumping into the main commit. 
*/ struct mutex ordered_operations_mutex; + struct rw_semaphore extent_commit_sem; struct list_head trans_list; struct list_head hashers; struct list_head dead_roots; + struct list_head caching_block_groups; atomic_t nr_async_submits; atomic_t async_submit_draining; @@ -1920,8 +1931,8 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache); int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_root *root, unsigned long count); int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len); -int btrfs_update_pinned_extents(struct btrfs_root *root, - u64 bytenr, u64 num, int pin); +int btrfs_pin_extent(struct btrfs_root *root, + u64 bytenr, u64 num, int reserved); int btrfs_drop_leaf_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *leaf); int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans, @@ -1971,9 +1982,10 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, u64 root_objectid, u64 owner, u64 offset); int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len); +int btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans, + struct btrfs_root *root); int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - struct extent_io_tree *unpin); + struct btrfs_root *root); int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u64 num_bytes, u64 parent, @@ -2006,7 +2018,6 @@ void btrfs_delalloc_reserve_space(struct btrfs_root *root, struct inode *inode, u64 bytes); void btrfs_delalloc_free_space(struct btrfs_root *root, struct inode *inode, u64 bytes); -void btrfs_free_pinned_extents(struct btrfs_fs_info *info); /* ctree.c */ int btrfs_bin_search(struct extent_buffer *eb, struct btrfs_key *key, int level, int *slot); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 253da7e..16dae12 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1563,6 +1563,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, INIT_LIST_HEAD(&fs_info->hashers); INIT_LIST_HEAD(&fs_info->delalloc_inodes); INIT_LIST_HEAD(&fs_info->ordered_operations); + INIT_LIST_HEAD(&fs_info->caching_block_groups); spin_lock_init(&fs_info->delalloc_lock); spin_lock_init(&fs_info->new_trans_lock); spin_lock_init(&fs_info->ref_cache_lock); @@ -1621,8 +1622,11 @@ struct btrfs_root *open_ctree(struct super_block *sb, spin_lock_init(&fs_info->block_group_cache_lock); fs_info->block_group_cache_tree.rb_node = NULL; - extent_io_tree_init(&fs_info->pinned_extents, + extent_io_tree_init(&fs_info->freed_extents[0], fs_info->btree_inode->i_mapping, GFP_NOFS); + extent_io_tree_init(&fs_info->freed_extents[1], + fs_info->btree_inode->i_mapping, GFP_NOFS); + fs_info->pinned_extents = &fs_info->freed_extents[0]; fs_info->do_barriers = 1; BTRFS_I(fs_info->btree_inode)->root = tree_root; @@ -2359,7 +2363,6 @@ int close_ctree(struct btrfs_root *root) free_extent_buffer(root->fs_info->csum_root->commit_root); btrfs_free_block_groups(root->fs_info); - btrfs_free_pinned_extents(root->fs_info); del_fs_roots(fs_info); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index edd86ae..9bcb9c0 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -32,12 +32,12 @@ #include "locking.h" #include "free-space-cache.h" -static int update_reserved_extents(struct btrfs_root *root, - u64 bytenr, u64 num, int reserve); static int update_block_group(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u64 num_bytes, int alloc, int 
mark_free); +static int update_reserved_extents(struct btrfs_block_group_cache *cache, + u64 num_bytes, int reserve); static int __btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u64 num_bytes, u64 parent, @@ -57,10 +57,17 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans, u64 parent, u64 root_objectid, u64 flags, struct btrfs_disk_key *key, int level, struct btrfs_key *ins); - static int do_chunk_alloc(struct btrfs_trans_handle *trans, struct btrfs_root *extent_root, u64 alloc_bytes, u64 flags, int force); +static int pin_down_bytes(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_path *path, + u64 bytenr, u64 num_bytes, + int is_data, int reserved, + struct extent_buffer **must_clean); +static int find_next_key(struct btrfs_path *path, int level, + struct btrfs_key *key); static noinline int block_group_cache_done(struct btrfs_block_group_cache *cache) @@ -153,34 +160,34 @@ block_group_cache_tree_search(struct btrfs_fs_info *info, u64 bytenr, return ret; } -/* - * We always set EXTENT_LOCKED for the super mirror extents so we don''t - * overwrite them, so those bits need to be unset. Also, if we are unmounting - * with pinned extents still sitting there because we had a block group caching, - * we need to clear those now, since we are done. - */ -void btrfs_free_pinned_extents(struct btrfs_fs_info *info) +static int add_excluded_extent(struct btrfs_root *root, + u64 start, u64 num_bytes) { - u64 start, end, last = 0; - int ret; + u64 end = start + num_bytes - 1; + set_extent_bits(&root->fs_info->freed_extents[0], + start, end, EXTENT_UPTODATE, GFP_NOFS); + set_extent_bits(&root->fs_info->freed_extents[1], + start, end, EXTENT_UPTODATE, GFP_NOFS); + return 0; +} - while (1) { - ret = find_first_extent_bit(&info->pinned_extents, last, - &start, &end, - EXTENT_LOCKED|EXTENT_DIRTY); - if (ret) - break; +static void free_excluded_extents(struct btrfs_root *root, + struct btrfs_block_group_cache *cache) +{ + u64 start, end; - clear_extent_bits(&info->pinned_extents, start, end, - EXTENT_LOCKED|EXTENT_DIRTY, GFP_NOFS); - last = end+1; - } + start = cache->key.objectid; + end = start + cache->key.offset - 1; + + clear_extent_bits(&root->fs_info->freed_extents[0], + start, end, EXTENT_UPTODATE, GFP_NOFS); + clear_extent_bits(&root->fs_info->freed_extents[1], + start, end, EXTENT_UPTODATE, GFP_NOFS); } -static int remove_sb_from_cache(struct btrfs_root *root, - struct btrfs_block_group_cache *cache) +static int exclude_super_stripes(struct btrfs_root *root, + struct btrfs_block_group_cache *cache) { - struct btrfs_fs_info *fs_info = root->fs_info; u64 bytenr; u64 *logical; int stripe_len; @@ -192,17 +199,41 @@ static int remove_sb_from_cache(struct btrfs_root *root, cache->key.objectid, bytenr, 0, &logical, &nr, &stripe_len); BUG_ON(ret); + while (nr--) { - try_lock_extent(&fs_info->pinned_extents, - logical[nr], - logical[nr] + stripe_len - 1, GFP_NOFS); + ret = add_excluded_extent(root, logical[nr], + stripe_len); + BUG_ON(ret); } + kfree(logical); } - return 0; } +static struct btrfs_caching_control * +get_caching_control(struct btrfs_block_group_cache *cache) +{ + struct btrfs_caching_control *ctl; + + spin_lock(&cache->lock); + if (cache->cached != BTRFS_CACHE_STARTED) { + spin_unlock(&cache->lock); + return NULL; + } + + ctl = cache->caching_ctl; + atomic_inc(&ctl->count); + spin_unlock(&cache->lock); + return ctl; +} + +static void put_caching_control(struct btrfs_caching_control *ctl) +{ + if 
(atomic_dec_and_test(&ctl->count)) + kfree(ctl); +} + /* * this is only called by cache_block_group, since we could have freed extents * we need to check the pinned_extents for any extents that can''t be used yet @@ -215,9 +246,9 @@ static u64 add_new_free_space(struct btrfs_block_group_cache *block_group, int ret; while (start < end) { - ret = find_first_extent_bit(&info->pinned_extents, start, + ret = find_first_extent_bit(info->pinned_extents, start, &extent_start, &extent_end, - EXTENT_DIRTY|EXTENT_LOCKED); + EXTENT_DIRTY | EXTENT_UPTODATE); if (ret) break; @@ -249,22 +280,24 @@ static int caching_kthread(void *data) { struct btrfs_block_group_cache *block_group = data; struct btrfs_fs_info *fs_info = block_group->fs_info; - u64 last = 0; + struct btrfs_caching_control *caching_ctl = block_group->caching_ctl; + struct btrfs_root *extent_root = fs_info->extent_root; struct btrfs_path *path; - int ret = 0; - struct btrfs_key key; struct extent_buffer *leaf; - int slot; + struct btrfs_key key; u64 total_found = 0; - - BUG_ON(!fs_info); + u64 last = 0; + u32 nritems; + int ret = 0; path = btrfs_alloc_path(); if (!path) return -ENOMEM; - atomic_inc(&block_group->space_info->caching_threads); + exclude_super_stripes(extent_root, block_group); + last = max_t(u64, block_group->key.objectid, BTRFS_SUPER_INFO_OFFSET); + /* * We don''t want to deadlock with somebody trying to allocate a new * extent for the extent root while also trying to search the extent @@ -277,74 +310,64 @@ static int caching_kthread(void *data) key.objectid = last; key.offset = 0; - btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY); + key.type = BTRFS_EXTENT_ITEM_KEY; again: + mutex_lock(&caching_ctl->mutex); /* need to make sure the commit_root doesn''t disappear */ down_read(&fs_info->extent_commit_sem); - ret = btrfs_search_slot(NULL, fs_info->extent_root, &key, path, 0, 0); + ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0); if (ret < 0) goto err; + leaf = path->nodes[0]; + nritems = btrfs_header_nritems(leaf); + while (1) { smp_mb(); - if (block_group->fs_info->closing > 1) { + if (fs_info->closing > 1) { last = (u64)-1; break; } - leaf = path->nodes[0]; - slot = path->slots[0]; - if (slot >= btrfs_header_nritems(leaf)) { - ret = btrfs_next_leaf(fs_info->extent_root, path); - if (ret < 0) - goto err; - else if (ret) + if (path->slots[0] < nritems) { + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); + } else { + ret = find_next_key(path, 0, &key); + if (ret) break; - if (need_resched() || - btrfs_transaction_in_commit(fs_info)) { - leaf = path->nodes[0]; - - /* this shouldn''t happen, but if the - * leaf is empty just move on. - */ - if (btrfs_header_nritems(leaf) == 0) - break; - /* - * we need to copy the key out so that - * we are sure the next search advances - * us forward in the btree. 
- */ - btrfs_item_key_to_cpu(leaf, &key, 0); - btrfs_release_path(fs_info->extent_root, path); - up_read(&fs_info->extent_commit_sem); + caching_ctl->progress = last; + btrfs_release_path(extent_root, path); + up_read(&fs_info->extent_commit_sem); + mutex_unlock(&caching_ctl->mutex); + if (btrfs_transaction_in_commit(fs_info)) schedule_timeout(1); - goto again; - } + else + cond_resched(); + goto again; + } + if (key.objectid < block_group->key.objectid) { + path->slots[0]++; continue; } - btrfs_item_key_to_cpu(leaf, &key, slot); - if (key.objectid < block_group->key.objectid) - goto next; if (key.objectid >= block_group->key.objectid + block_group->key.offset) break; - if (btrfs_key_type(&key) == BTRFS_EXTENT_ITEM_KEY) { + if (key.type == BTRFS_EXTENT_ITEM_KEY) { total_found += add_new_free_space(block_group, fs_info, last, key.objectid); last = key.objectid + key.offset; - } - if (total_found > (1024 * 1024 * 2)) { - total_found = 0; - wake_up(&block_group->caching_q); + if (total_found > (1024 * 1024 * 2)) { + total_found = 0; + wake_up(&caching_ctl->wait); + } } -next: path->slots[0]++; } ret = 0; @@ -352,33 +375,65 @@ next: total_found += add_new_free_space(block_group, fs_info, last, block_group->key.objectid + block_group->key.offset); + caching_ctl->progress = (u64)-1; spin_lock(&block_group->lock); + block_group->caching_ctl = NULL; block_group->cached = BTRFS_CACHE_FINISHED; spin_unlock(&block_group->lock); err: btrfs_free_path(path); up_read(&fs_info->extent_commit_sem); - atomic_dec(&block_group->space_info->caching_threads); - wake_up(&block_group->caching_q); + free_excluded_extents(extent_root, block_group); + + mutex_unlock(&caching_ctl->mutex); + wake_up(&caching_ctl->wait); + + put_caching_control(caching_ctl); + atomic_dec(&block_group->space_info->caching_threads); return 0; } static int cache_block_group(struct btrfs_block_group_cache *cache) { + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_caching_control *caching_ctl; struct task_struct *tsk; int ret = 0; + smp_mb(); + if (cache->cached != BTRFS_CACHE_NO) + return 0; + + caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_KERNEL); + BUG_ON(!caching_ctl); + + INIT_LIST_HEAD(&caching_ctl->list); + mutex_init(&caching_ctl->mutex); + init_waitqueue_head(&caching_ctl->wait); + caching_ctl->block_group = cache; + caching_ctl->progress = cache->key.objectid; + /* one for caching kthread, one for caching block group list */ + atomic_set(&caching_ctl->count, 2); + spin_lock(&cache->lock); if (cache->cached != BTRFS_CACHE_NO) { spin_unlock(&cache->lock); - return ret; + kfree(caching_ctl); + return 0; } + cache->caching_ctl = caching_ctl; cache->cached = BTRFS_CACHE_STARTED; spin_unlock(&cache->lock); + down_write(&fs_info->extent_commit_sem); + list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups); + up_write(&fs_info->extent_commit_sem); + + atomic_inc(&cache->space_info->caching_threads); + tsk = kthread_run(caching_kthread, cache, "btrfs-cache-%llu\n", cache->key.objectid); if (IS_ERR(tsk)) { @@ -1656,7 +1711,6 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans, parent, ref_root, flags, ref->objectid, ref->offset, &ins, node->ref_mod); - update_reserved_extents(root, ins.objectid, ins.offset, 0); } else if (node->action == BTRFS_ADD_DELAYED_REF) { ret = __btrfs_inc_extent_ref(trans, root, node->bytenr, node->num_bytes, parent, @@ -1782,7 +1836,6 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans, extent_op->flags_to_set, &extent_op->key, ref->level, &ins); - 
update_reserved_extents(root, ins.objectid, ins.offset, 0); } else if (node->action == BTRFS_ADD_DELAYED_REF) { ret = __btrfs_inc_extent_ref(trans, root, node->bytenr, node->num_bytes, parent, ref_root, @@ -1817,16 +1870,32 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, BUG_ON(extent_op); head = btrfs_delayed_node_to_head(node); if (insert_reserved) { + int mark_free = 0; + struct extent_buffer *must_clean = NULL; + + ret = pin_down_bytes(trans, root, NULL, + node->bytenr, node->num_bytes, + head->is_data, 1, &must_clean); + if (ret > 0) + mark_free = 1; + + if (must_clean) { + clean_tree_block(NULL, root, must_clean); + btrfs_tree_unlock(must_clean); + free_extent_buffer(must_clean); + } if (head->is_data) { ret = btrfs_del_csums(trans, root, node->bytenr, node->num_bytes); BUG_ON(ret); } - btrfs_update_pinned_extents(root, node->bytenr, - node->num_bytes, 1); - update_reserved_extents(root, node->bytenr, - node->num_bytes, 0); + if (mark_free) { + ret = btrfs_free_reserved_extent(root, + node->bytenr, + node->num_bytes); + BUG_ON(ret); + } } mutex_unlock(&head->mutex); return 0; @@ -3008,10 +3077,12 @@ static int update_block_group(struct btrfs_trans_handle *trans, num_bytes = min(total, cache->key.offset - byte_in_group); if (alloc) { old_val += num_bytes; + btrfs_set_block_group_used(&cache->item, old_val); + cache->reserved -= num_bytes; cache->space_info->bytes_used += num_bytes; + cache->space_info->bytes_reserved -= num_bytes; if (cache->ro) cache->space_info->bytes_readonly -= num_bytes; - btrfs_set_block_group_used(&cache->item, old_val); spin_unlock(&cache->lock); spin_unlock(&cache->space_info->lock); } else { @@ -3056,127 +3127,136 @@ static u64 first_logical_byte(struct btrfs_root *root, u64 search_start) return bytenr; } -int btrfs_update_pinned_extents(struct btrfs_root *root, - u64 bytenr, u64 num, int pin) +/* + * this function must be called within transaction + */ +int btrfs_pin_extent(struct btrfs_root *root, + u64 bytenr, u64 num_bytes, int reserved) { - u64 len; - struct btrfs_block_group_cache *cache; struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_block_group_cache *cache; - if (pin) - set_extent_dirty(&fs_info->pinned_extents, - bytenr, bytenr + num - 1, GFP_NOFS); - - while (num > 0) { - cache = btrfs_lookup_block_group(fs_info, bytenr); - BUG_ON(!cache); - len = min(num, cache->key.offset - - (bytenr - cache->key.objectid)); - if (pin) { - spin_lock(&cache->space_info->lock); - spin_lock(&cache->lock); - cache->pinned += len; - cache->space_info->bytes_pinned += len; - spin_unlock(&cache->lock); - spin_unlock(&cache->space_info->lock); - fs_info->total_pinned += len; - } else { - int unpin = 0; + cache = btrfs_lookup_block_group(fs_info, bytenr); + BUG_ON(!cache); - /* - * in order to not race with the block group caching, we - * only want to unpin the extent if we are cached. If - * we aren''t cached, we want to start async caching this - * block group so we can free the extent the next time - * around. 
- */ - spin_lock(&cache->space_info->lock); - spin_lock(&cache->lock); - unpin = (cache->cached == BTRFS_CACHE_FINISHED); - if (likely(unpin)) { - cache->pinned -= len; - cache->space_info->bytes_pinned -= len; - fs_info->total_pinned -= len; - } - spin_unlock(&cache->lock); - spin_unlock(&cache->space_info->lock); + spin_lock(&cache->space_info->lock); + spin_lock(&cache->lock); + cache->pinned += num_bytes; + cache->space_info->bytes_pinned += num_bytes; + if (reserved) { + cache->reserved -= num_bytes; + cache->space_info->bytes_reserved -= num_bytes; + } + spin_unlock(&cache->lock); + spin_unlock(&cache->space_info->lock); - if (likely(unpin)) - clear_extent_dirty(&fs_info->pinned_extents, - bytenr, bytenr + len -1, - GFP_NOFS); - else - cache_block_group(cache); + btrfs_put_block_group(cache); - if (unpin) - btrfs_add_free_space(cache, bytenr, len); - } - btrfs_put_block_group(cache); - bytenr += len; - num -= len; + set_extent_dirty(fs_info->pinned_extents, + bytenr, bytenr + num_bytes - 1, GFP_NOFS); + return 0; +} + +static int update_reserved_extents(struct btrfs_block_group_cache *cache, + u64 num_bytes, int reserve) +{ + spin_lock(&cache->space_info->lock); + spin_lock(&cache->lock); + if (reserve) { + cache->reserved += num_bytes; + cache->space_info->bytes_reserved += num_bytes; + } else { + cache->reserved -= num_bytes; + cache->space_info->bytes_reserved -= num_bytes; } + spin_unlock(&cache->lock); + spin_unlock(&cache->space_info->lock); return 0; } -static int update_reserved_extents(struct btrfs_root *root, - u64 bytenr, u64 num, int reserve) +int btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans, + struct btrfs_root *root) { - u64 len; - struct btrfs_block_group_cache *cache; struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_caching_control *next; + struct btrfs_caching_control *caching_ctl; + struct btrfs_block_group_cache *cache; - while (num > 0) { - cache = btrfs_lookup_block_group(fs_info, bytenr); - BUG_ON(!cache); - len = min(num, cache->key.offset - - (bytenr - cache->key.objectid)); + down_write(&fs_info->extent_commit_sem); - spin_lock(&cache->space_info->lock); - spin_lock(&cache->lock); - if (reserve) { - cache->reserved += len; - cache->space_info->bytes_reserved += len; + list_for_each_entry_safe(caching_ctl, next, + &fs_info->caching_block_groups, list) { + cache = caching_ctl->block_group; + if (block_group_cache_done(cache)) { + cache->last_byte_to_unpin = (u64)-1; + list_del_init(&caching_ctl->list); + put_caching_control(caching_ctl); } else { - cache->reserved -= len; - cache->space_info->bytes_reserved -= len; + cache->last_byte_to_unpin = caching_ctl->progress; } - spin_unlock(&cache->lock); - spin_unlock(&cache->space_info->lock); - btrfs_put_block_group(cache); - bytenr += len; - num -= len; } + + if (fs_info->pinned_extents == &fs_info->freed_extents[0]) + fs_info->pinned_extents = &fs_info->freed_extents[1]; + else + fs_info->pinned_extents = &fs_info->freed_extents[0]; + + up_write(&fs_info->extent_commit_sem); return 0; } -int btrfs_copy_pinned(struct btrfs_root *root, struct extent_io_tree *copy) +static int unpin_extent_range(struct btrfs_root *root, u64 start, u64 end) { - u64 last = 0; - u64 start; - u64 end; - struct extent_io_tree *pinned_extents = &root->fs_info->pinned_extents; - int ret; + struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_block_group_cache *cache = NULL; + u64 len; - while (1) { - ret = find_first_extent_bit(pinned_extents, last, - &start, &end, EXTENT_DIRTY); - if (ret) - break; + 
while (start <= end) { + if (!cache || + start >= cache->key.objectid + cache->key.offset) { + if (cache) + btrfs_put_block_group(cache); + cache = btrfs_lookup_block_group(fs_info, start); + BUG_ON(!cache); + } + + len = cache->key.objectid + cache->key.offset - start; + len = min(len, end + 1 - start); + + if (start < cache->last_byte_to_unpin) { + len = min(len, cache->last_byte_to_unpin - start); + btrfs_add_free_space(cache, start, len); + } + + spin_lock(&cache->space_info->lock); + spin_lock(&cache->lock); + cache->pinned -= len; + cache->space_info->bytes_pinned -= len; + spin_unlock(&cache->lock); + spin_unlock(&cache->space_info->lock); - set_extent_dirty(copy, start, end, GFP_NOFS); - last = end + 1; + start += len; } + + if (cache) + btrfs_put_block_group(cache); return 0; } int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - struct extent_io_tree *unpin) + struct btrfs_root *root) { + struct btrfs_fs_info *fs_info = root->fs_info; + struct extent_io_tree *unpin; u64 start; u64 end; int ret; + if (fs_info->pinned_extents == &fs_info->freed_extents[0]) + unpin = &fs_info->freed_extents[1]; + else + unpin = &fs_info->freed_extents[0]; + while (1) { ret = find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY); @@ -3185,10 +3265,8 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, ret = btrfs_discard_extent(root, start, end + 1 - start); - /* unlocks the pinned mutex */ - btrfs_update_pinned_extents(root, start, end + 1 - start, 0); clear_extent_dirty(unpin, start, end, GFP_NOFS); - + unpin_extent_range(root, start, end); cond_resched(); } @@ -3198,7 +3276,8 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, static int pin_down_bytes(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_path *path, - u64 bytenr, u64 num_bytes, int is_data, + u64 bytenr, u64 num_bytes, + int is_data, int reserved, struct extent_buffer **must_clean) { int err = 0; @@ -3230,15 +3309,15 @@ static int pin_down_bytes(struct btrfs_trans_handle *trans, } free_extent_buffer(buf); pinit: - btrfs_set_path_blocking(path); + if (path) + btrfs_set_path_blocking(path); /* unlocks the pinned mutex */ - btrfs_update_pinned_extents(root, bytenr, num_bytes, 1); + btrfs_pin_extent(root, bytenr, num_bytes, reserved); BUG_ON(err < 0); return 0; } - static int __btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u64 num_bytes, u64 parent, @@ -3412,7 +3491,7 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans, } ret = pin_down_bytes(trans, root, path, bytenr, - num_bytes, is_data, &must_clean); + num_bytes, is_data, 0, &must_clean); if (ret > 0) mark_free = 1; BUG_ON(ret < 0); @@ -3543,8 +3622,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, if (root_objectid == BTRFS_TREE_LOG_OBJECTID) { WARN_ON(owner >= BTRFS_FIRST_FREE_OBJECTID); /* unlocks the pinned mutex */ - btrfs_update_pinned_extents(root, bytenr, num_bytes, 1); - update_reserved_extents(root, bytenr, num_bytes, 0); + btrfs_pin_extent(root, bytenr, num_bytes, 1); ret = 0; } else if (owner < BTRFS_FIRST_FREE_OBJECTID) { ret = btrfs_add_delayed_tree_ref(trans, bytenr, num_bytes, @@ -3584,19 +3662,33 @@ static noinline int wait_block_group_cache_progress(struct btrfs_block_group_cache *cache, u64 num_bytes) { + struct btrfs_caching_control *caching_ctl; DEFINE_WAIT(wait); - prepare_to_wait(&cache->caching_q, &wait, TASK_UNINTERRUPTIBLE); - - if (block_group_cache_done(cache)) { - 
finish_wait(&cache->caching_q, &wait); + caching_ctl = get_caching_control(cache); + if (!caching_ctl) return 0; - } - schedule(); - finish_wait(&cache->caching_q, &wait); - wait_event(cache->caching_q, block_group_cache_done(cache) || + wait_event(caching_ctl->wait, block_group_cache_done(cache) || (cache->free_space >= num_bytes)); + + put_caching_control(caching_ctl); + return 0; +} + +static noinline int +wait_block_group_cache_done(struct btrfs_block_group_cache *cache) +{ + struct btrfs_caching_control *caching_ctl; + DEFINE_WAIT(wait); + + caching_ctl = get_caching_control(cache); + if (!caching_ctl) + return 0; + + wait_event(caching_ctl->wait, block_group_cache_done(cache)); + + put_caching_control(caching_ctl); return 0; } @@ -3880,6 +3972,8 @@ checks: search_start - offset); BUG_ON(offset > search_start); + update_reserved_extents(block_group, num_bytes, 1); + /* we are all good, lets return */ break; loop: @@ -3972,12 +4066,12 @@ static void dump_space_info(struct btrfs_space_info *info, u64 bytes) up_read(&info->groups_sem); } -static int __btrfs_reserve_extent(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 num_bytes, u64 min_alloc_size, - u64 empty_size, u64 hint_byte, - u64 search_end, struct btrfs_key *ins, - u64 data) +int btrfs_reserve_extent(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + u64 num_bytes, u64 min_alloc_size, + u64 empty_size, u64 hint_byte, + u64 search_end, struct btrfs_key *ins, + u64 data) { int ret; u64 search_start = 0; @@ -4043,25 +4137,8 @@ int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len) ret = btrfs_discard_extent(root, start, len); btrfs_add_free_space(cache, start, len); + update_reserved_extents(cache, len, 0); btrfs_put_block_group(cache); - update_reserved_extents(root, start, len, 0); - - return ret; -} - -int btrfs_reserve_extent(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 num_bytes, u64 min_alloc_size, - u64 empty_size, u64 hint_byte, - u64 search_end, struct btrfs_key *ins, - u64 data) -{ - int ret; - ret = __btrfs_reserve_extent(trans, root, num_bytes, min_alloc_size, - empty_size, hint_byte, search_end, ins, - data); - if (!ret) - update_reserved_extents(root, ins->objectid, ins->offset, 1); return ret; } @@ -4222,15 +4299,46 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, { int ret; struct btrfs_block_group_cache *block_group; + struct btrfs_caching_control *caching_ctl; + u64 start = ins->objectid; + u64 num_bytes = ins->offset; block_group = btrfs_lookup_block_group(root->fs_info, ins->objectid); cache_block_group(block_group); - wait_event(block_group->caching_q, - block_group_cache_done(block_group)); + caching_ctl = get_caching_control(block_group); - ret = btrfs_remove_free_space(block_group, ins->objectid, - ins->offset); - BUG_ON(ret); + if (!caching_ctl) { + BUG_ON(!block_group_cache_done(block_group)); + ret = btrfs_remove_free_space(block_group, start, num_bytes); + BUG_ON(ret); + } else { + mutex_lock(&caching_ctl->mutex); + + if (start >= caching_ctl->progress) { + ret = add_excluded_extent(root, start, num_bytes); + BUG_ON(ret); + } else if (start + num_bytes <= caching_ctl->progress) { + ret = btrfs_remove_free_space(block_group, + start, num_bytes); + BUG_ON(ret); + } else { + num_bytes = caching_ctl->progress - start; + ret = btrfs_remove_free_space(block_group, + start, num_bytes); + BUG_ON(ret); + + start = caching_ctl->progress; + num_bytes = ins->objectid + ins->offset - + caching_ctl->progress; + ret = 
add_excluded_extent(root, start, num_bytes); + BUG_ON(ret); + } + + mutex_unlock(&caching_ctl->mutex); + put_caching_control(caching_ctl); + } + + update_reserved_extents(block_group, ins->offset, 1); btrfs_put_block_group(block_group); ret = alloc_reserved_file_extent(trans, root, 0, root_objectid, 0, owner, offset, ins, 1); @@ -4254,9 +4362,9 @@ static int alloc_tree_block(struct btrfs_trans_handle *trans, int ret; u64 flags = 0; - ret = __btrfs_reserve_extent(trans, root, num_bytes, num_bytes, - empty_size, hint_byte, search_end, - ins, 0); + ret = btrfs_reserve_extent(trans, root, num_bytes, num_bytes, + empty_size, hint_byte, search_end, + ins, 0); if (ret) return ret; @@ -4267,7 +4375,6 @@ static int alloc_tree_block(struct btrfs_trans_handle *trans, } else BUG_ON(parent > 0); - update_reserved_extents(root, ins->objectid, ins->offset, 1); if (root_objectid != BTRFS_TREE_LOG_OBJECTID) { struct btrfs_delayed_extent_op *extent_op; extent_op = kmalloc(sizeof(*extent_op), GFP_NOFS); @@ -7164,8 +7271,18 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info) { struct btrfs_block_group_cache *block_group; struct btrfs_space_info *space_info; + struct btrfs_caching_control *caching_ctl; struct rb_node *n; + down_write(&info->extent_commit_sem); + while (!list_empty(&info->caching_block_groups)) { + caching_ctl = list_entry(info->caching_block_groups.next, + struct btrfs_caching_control, list); + list_del(&caching_ctl->list); + put_caching_control(caching_ctl); + } + up_write(&info->extent_commit_sem); + spin_lock(&info->block_group_cache_lock); while ((n = rb_last(&info->block_group_cache_tree)) != NULL) { block_group = rb_entry(n, struct btrfs_block_group_cache, @@ -7179,8 +7296,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info) up_write(&block_group->space_info->groups_sem); if (block_group->cached == BTRFS_CACHE_STARTED) - wait_event(block_group->caching_q, - block_group_cache_done(block_group)); + wait_block_group_cache_done(block_group); btrfs_remove_free_space_cache(block_group); @@ -7250,7 +7366,6 @@ int btrfs_read_block_groups(struct btrfs_root *root) spin_lock_init(&cache->lock); spin_lock_init(&cache->tree_lock); cache->fs_info = info; - init_waitqueue_head(&cache->caching_q); INIT_LIST_HEAD(&cache->list); INIT_LIST_HEAD(&cache->cluster_list); @@ -7272,8 +7387,6 @@ int btrfs_read_block_groups(struct btrfs_root *root) cache->flags = btrfs_block_group_flags(&cache->item); cache->sectorsize = root->sectorsize; - remove_sb_from_cache(root, cache); - /* * check for two cases, either we are full, and therefore * don''t need to bother with the caching work since we won''t @@ -7282,13 +7395,17 @@ int btrfs_read_block_groups(struct btrfs_root *root) * time, particularly in the full case. 
*/ if (found_key.offset == btrfs_block_group_used(&cache->item)) { + cache->last_byte_to_unpin = (u64)-1; cache->cached = BTRFS_CACHE_FINISHED; } else if (btrfs_block_group_used(&cache->item) == 0) { + exclude_super_stripes(root, cache); + cache->last_byte_to_unpin = (u64)-1; cache->cached = BTRFS_CACHE_FINISHED; add_new_free_space(cache, root->fs_info, found_key.objectid, found_key.objectid + found_key.offset); + free_excluded_extents(root, cache); } ret = update_space_info(info, cache->flags, found_key.offset, @@ -7345,7 +7462,6 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, atomic_set(&cache->count, 1); spin_lock_init(&cache->lock); spin_lock_init(&cache->tree_lock); - init_waitqueue_head(&cache->caching_q); INIT_LIST_HEAD(&cache->list); INIT_LIST_HEAD(&cache->cluster_list); @@ -7354,12 +7470,15 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, cache->flags = type; btrfs_set_block_group_flags(&cache->item, type); + cache->last_byte_to_unpin = (u64)-1; cache->cached = BTRFS_CACHE_FINISHED; - remove_sb_from_cache(root, cache); + exclude_super_stripes(root, cache); add_new_free_space(cache, root->fs_info, chunk_offset, chunk_offset + size); + free_excluded_extents(root, cache); + ret = update_space_info(root->fs_info, cache->flags, size, bytes_used, &cache->space_info); BUG_ON(ret); @@ -7428,8 +7547,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, up_write(&block_group->space_info->groups_sem); if (block_group->cached == BTRFS_CACHE_STARTED) - wait_event(block_group->caching_q, - block_group_cache_done(block_group)); + wait_block_group_cache_done(block_group); btrfs_remove_free_space_cache(block_group); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index cdbb502..6ed6186 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -874,7 +874,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, unsigned long timeout = 1; struct btrfs_transaction *cur_trans; struct btrfs_transaction *prev_trans = NULL; - struct extent_io_tree *pinned_copy; DEFINE_WAIT(wait); int ret; int should_grow = 0; @@ -915,13 +914,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, return 0; } - pinned_copy = kmalloc(sizeof(*pinned_copy), GFP_NOFS); - if (!pinned_copy) - return -ENOMEM; - - extent_io_tree_init(pinned_copy, - root->fs_info->btree_inode->i_mapping, GFP_NOFS); - trans->transaction->in_commit = 1; trans->transaction->blocked = 1; if (cur_trans->list.prev != &root->fs_info->trans_list) { @@ -1019,6 +1011,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, ret = commit_cowonly_roots(trans, root); BUG_ON(ret); + btrfs_prepare_extent_commit(trans, root); + cur_trans = root->fs_info->running_transaction; spin_lock(&root->fs_info->new_trans_lock); root->fs_info->running_transaction = NULL; @@ -1042,8 +1036,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, memcpy(&root->fs_info->super_for_commit, &root->fs_info->super_copy, sizeof(root->fs_info->super_copy)); - btrfs_copy_pinned(root, pinned_copy); - trans->transaction->blocked = 0; wake_up(&root->fs_info->transaction_wait); @@ -1059,8 +1051,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, */ mutex_unlock(&root->fs_info->tree_log_mutex); - btrfs_finish_extent_commit(trans, root, pinned_copy); - kfree(pinned_copy); + btrfs_finish_extent_commit(trans, root); /* do the directory inserts of any pending snapshot creations */ finish_pending_snapshots(trans, root->fs_info); diff --git a/fs/btrfs/tree-log.c 
b/fs/btrfs/tree-log.c index 8661a73..f4a7b62 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -263,8 +263,8 @@ static int process_one_buffer(struct btrfs_root *log, struct walk_control *wc, u64 gen) { if (wc->pin) - btrfs_update_pinned_extents(log->fs_info->extent_root, - eb->start, eb->len, 1); + btrfs_pin_extent(log->fs_info->extent_root, + eb->start, eb->len, 0); if (btrfs_buffer_uptodate(eb, gen)) { if (wc->write) -- 1.6.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
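[Editorial note: a standalone sketch (not the kernel code) of the core idea in the commit message above: two trees for pinned extents, with a pointer swapped at commit time. All names are simplified stand-ins for the real extent_io_tree machinery:]

#include <stdio.h>

struct extent_tree {
    int entries;                 /* stand-in for a real extent_io_tree */
};

struct fs_info {
    struct extent_tree freed_extents[2];
    struct extent_tree *pinned_extents;  /* tree receiving new pins */
};

/* analogous to btrfs_prepare_extent_commit(): flip the pointer so pins
 * from the next transaction land in the other tree */
static void prepare_extent_commit(struct fs_info *fs)
{
    if (fs->pinned_extents == &fs->freed_extents[0])
        fs->pinned_extents = &fs->freed_extents[1];
    else
        fs->pinned_extents = &fs->freed_extents[0];
}

/* analogous to btrfs_finish_extent_commit(): after the swap, the *other*
 * tree holds the previous transaction's pins, safe to drain */
static struct extent_tree *tree_to_unpin(struct fs_info *fs)
{
    if (fs->pinned_extents == &fs->freed_extents[0])
        return &fs->freed_extents[1];
    return &fs->freed_extents[0];
}

int main(void)
{
    struct fs_info fs = { { { 3 }, { 0 } }, &fs.freed_extents[0] };

    prepare_extent_commit(&fs);  /* new pins now go to tree 1 */
    printf("unpinning %d extents\n", tree_to_unpin(&fs)->entries);
    return 0;
}

[Compared with the copy-at-commit approach the patch removes, the pointer swap avoids duplicating the whole pinned tree at every commit.]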
On Thu, Sep 17, 2009 at 04:17:14PM -0400, Chris Mason wrote:
> [ crashes on runs involving unmounts ]
>
> The run is still going here, but it has survived longer than before.
> I'm trying with Yan Zheng's patch:
>
> From: Yan Zheng <zheng.yan@oracle.com>
> Date: Fri, 11 Sep 2009 16:11:19 -0400
> Subject: [PATCH] Btrfs: improve async block group caching

Quick update, I got through a full run of Steve's test with this applied. I'll start a few more ;)

-chris
Chris Mason wrote:
> On Thu, Sep 17, 2009 at 04:17:14PM -0400, Chris Mason wrote:
>> [ crashes on runs involving unmounts ]
>>
>> The run is still going here, but it has survived longer than before.
>> I'm trying with Yan Zheng's patch:
>>
>> From: Yan Zheng <zheng.yan@oracle.com>
>> Date: Fri, 11 Sep 2009 16:11:19 -0400
>> Subject: [PATCH] Btrfs: improve async block group caching
>
> Quick update, I got through a full run of Steve's test with this
> applied. I'll start a few more ;)

Seems to work for me too! Got through all the random write tests with no problems. Will kick off a full run overnight.

Steve
On Thu, Sep 17, 2009 at 05:04:11PM -0500, Steven Pratt wrote:
> Chris Mason wrote:
> > On Thu, Sep 17, 2009 at 04:17:14PM -0400, Chris Mason wrote:
> > > [ crashes on runs involving unmounts ]
> > >
> > > The run is still going here, but it has survived longer than before.
> > > I'm trying with Yan Zheng's patch:
> > >
> > > From: Yan Zheng <zheng.yan@oracle.com>
> > > Date: Fri, 11 Sep 2009 16:11:19 -0400
> > > Subject: [PATCH] Btrfs: improve async block group caching
> >
> > Quick update, I got through a full run of Steve's test with this
> > applied. I'll start a few more ;)
> Seems to work for me too! Got through all the random write tests
> with no problems. Will kick off full run overnight.

Thanks again.

I've updated the btrfs-unstable tree to include Yan Zheng's fix. I've also included two more buffered writeback speedups, and I expect these to make a difference on the multi-threaded tests.

I took a look at 1MB O_DIRECT writes, and the latencies of sending off checksumming to the checksum threads seem to be the biggest problem. I get full throughput at 8MB O_DIRECT writes, so for now I'm going to leave this one alone.

-chris
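[Editorial note: to make the 1MB-vs-8MB O_DIRECT comparison above concrete, a minimal sketch of issuing one aligned O_DIRECT write. The target path and 4096-byte alignment are assumptions (real code should align to the device's logical sector size); the write size is the knob to vary:]

#define _GNU_SOURCE   /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t len = 8 << 20;   /* try 1 << 20 (1MB) vs 8 << 20 (8MB) */
    void *buf;
    int fd;

    /* O_DIRECT requires an aligned buffer (and aligned length/offset) */
    if (posix_memalign(&buf, 4096, len))
        return 1;
    memset(buf, 0xab, len);

    fd = open("/mnt/ffsb1/odirect-test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (pwrite(fd, buf, len, 0) != (ssize_t)len) {
        perror("pwrite");
        return 1;
    }

    close(fd);
    free(buf);
    return 0;
}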
Chris Mason wrote:
> On Thu, Sep 17, 2009 at 05:04:11PM -0500, Steven Pratt wrote:
>> Chris Mason wrote:
>>> On Thu, Sep 17, 2009 at 04:17:14PM -0400, Chris Mason wrote:
>>>> [ crashes on runs involving unmounts ]
>>>>
>>>> The run is still going here, but it has survived longer than before.
>>>> I'm trying with Yan Zheng's patch:
>>>>
>>>> From: Yan Zheng <zheng.yan@oracle.com>
>>>> Date: Fri, 11 Sep 2009 16:11:19 -0400
>>>> Subject: [PATCH] Btrfs: improve async block group caching
>>>
>>> Quick update, I got through a full run of Steve's test with this
>>> applied. I'll start a few more ;)
>>
>> Seems to work for me too! Got through all the random write tests
>> with no problems. Will kick off full run overnight.
>
> Thanks again.
>
> I've updated the btrfs-unstable tree to include Yan Zheng's fix. I've
> also included two more buffered writeback speedups, and I expect these
> to make a difference on the multi-threaded tests.
>
> I took a look at 1MB O_DIRECT writes, and the latencies of sending off
> checksumming to the checksum threads seem to be the biggest problem. I
> get full throughput at 8MB O_DIRECT writes, so for now I'm going to leave
> this one alone.

Updated performance results are available. Ran both the 9/16 tree + async patches and the tree from 9/18. Results are a mixed bag, some faster, some slower.

http://btrfs.boxacle.net/repository/raid/history/History.html

Steve