Nik Markovic
2012-Feb-24 01:32 UTC
Strange performance degradation when COW writes happen at fixed offsets
Hi,

My kernel version is 32-bit 3.2.0-rc5 and I am using btrfs-tools 0.19. I was having performance issues with BTRFS fragmentation on HDDs, so I decided to switch to an SSD to see if these would go away. Performance was much better, but at times I would see a "freeze" happen which I can't really explain. The CPU would spike up to 100% at times. I decided to try to reproduce this.

Though it may or may not be related, while testing BTRFS performance I encountered this interesting problem where performance depends on whether a file is freshly copied onto a BTRFS filesystem or obtained via COW "children". This is all happening on a Crucial M4 SSD, so something in the SSD firmware could be causing the issue, but I feel it's related to BTRFS metadata.

Here is the test:

1. Write a fresh large file, called A, to the filesystem.
2. Make a reflink (COW) copy of A, called B.
3. Modify a set of random blocks in B.
4. Remove A.
5. Repeat steps 2-4, using the newly produced B as the new A.

Expected results: each iteration takes an equal amount of time to complete on an SSD, because there is no fragmentation involved and the system is in the same state at step 2 (there is always only one file on the filesystem). I used a 1GB file as my source.

I repeated the tests using different algorithms for the "write" in step 3 above:

Algorithm 1 (random): write 8 bytes at a random offset.
Algorithm 2 (fixed): write the first 8 bytes, then continue at 50k offsets.
Algorithm 3 (incremental): write the first 8 bytes at offset = random(50k), then continue at 50k offsets.

For each test there were 40k writes total. The algorithm is in the Java code below.

The following is observed with each iteration ONLY when using algorithm #3:

1. Over time, the time to modify the file increases.
2. Over time, the time to make the reflink copy increases.
3. Over time, the time to remove the file increases.
4. The first few writes take less than the normal time to complete.

Data for the 1st/5th/10th/15th/20th iteration (times in seconds):

Algorithms 1 and 2:
  Write:  always ~6
  Copy:   always ~0.5
  Remove: always ~0.1

Algorithm 3:
  Write:  2 / 6 / 9 / 10 / 11.5
  Copy:   0.5 / 3 / 4.5 / 5.5 / 6
  Remove: 0.1 / 1 / 2 / 2 / 2

As you can see, things degrade and taper off after the 10th iteration. This probably has to do with the 4k block size being near 50k/10. I don't think this has to do with SSD garbage collection, because I ran these tests multiple times.

To use this script, cd into an empty directory on a btrfs filesystem and run it with "incremental" as the argument. You can use the other modes to confirm the expected behavior. Script used to reproduce the bug:

#!/bin/bash
mode=$1
if [ -z "$mode" ]; then
    echo "Usage $0 <incremental|random|fixed>"
    exit -1
fi
mode=$1
src=`pwd`/test/src
dst=`pwd`/test/dst
srcfile=$src/test.tar
dstfile=$dst/test.tar
mkdir -p $src
mkdir -p $dst
filesize=100MB

# Build a 1GB file from a smaller download. You can tweak filesize and the loop below for lower bandwidth.
if [ ! -f $srcfile ]; then
    cd $src
    if [ ! -f $srcfile.dl ]; then
        wget http://download.thinkbroadband.com/${filesize}.zip --output-document=$srcfile.dl
    fi
    rm -rf tarbase
    mkdir tarbase
    for i in {1..10}; do
        cp --reflink=always $srcfile.dl tarbase/$i.dl
    done
    tar -cvf $srcfile tarbase
    rm -rf tarbase
fi

cat <<END > $src/FileTest.java
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileTest {
    public static final int BLOCK_SIZE = 50000;
    public static final int MAX_ITERATIONS = 40000;

    public static void main(String args[]) throws IOException {
        String mode = args[0];
        RandomAccessFile f = new RandomAccessFile(args[1], "rw");
        //int offset = 0;
        int i;
        int offset = new java.util.Random().nextInt(BLOCK_SIZE); // initializer ONLY for incremental mode
        for (i = 0; i < MAX_ITERATIONS; i++) {
            try {
                int writeOffset;
                if (mode.equals("incremental")) {
                    writeOffset = new java.util.Random().nextInt(offset + i * BLOCK_SIZE);
                } else { // mode.equals random
                    writeOffset = new java.util.Random().nextInt(((int)f.length() - 100));
                    offset = writeOffset; // for reporting it at the end
                }
                f.seek(writeOffset);
                f.writeBytes("DEADBEEF");
            } catch (java.io.IOException e) {
                System.out.println("EOF");
                break;
            }
        }
        System.out.print("Last offset=" + offset);
        System.out.println(". Made " + i + " random writes.");
        f.close();
    }
}
END

cd $src
javac FileTest.java

/usr/bin/time --format 'rm: %E' rm -rf $dst/*
cp --reflink=always $srcfile.dl $dst/1.tst
cd $dst
for i in {1..20}; do
    echo -n "$i."
    i_plus=`expr $i + 1`
    /usr/bin/time --format 'write: %E' java -cp $src FileTest $mode $i.tst
    /usr/bin/time --format 'cp: %E' cp --reflink=always $i.tst $i_plus.tst
    /usr/bin/time --format 'rm: %E' rm $i.tst
    /usr/bin/time --format 'sync: %E' sync
    sleep 1
done
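A quick way to check whether the slowdown tracks on-disk extent fragmentation, rather than anything the SSD is doing, is to count extents between iterations. This is only a rough sketch, not part of the script above, assuming filefrag from e2fsprogs is installed:

for f in $dst/*.tst; do        # $dst as in the script above
    filefrag "$f"              # prints "<file>: N extents found"
done

If the extent count keeps climbing from generation to generation while the cp and rm times climb with it, that points at extent/metadata growth in the filesystem rather than the SSD firmware.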
Nik Markovic
2012-Feb-24 02:31 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
I noticed a few errors in the script that I used. I corrected it, and it seems that degradation is occurring even at fully random writes:

#!/bin/bash
mode=$1
if [ -z "$mode" ]; then
    echo "Usage $0 <incremental|random|fixed>"
    exit -1
fi
mode=$1
src=`pwd`/test/src
dst=`pwd`/test/dst
srcfile=$src/test.tar
dstfile=$dst/test.tar
mkdir -p $src
mkdir -p $dst
filesize=100MB

# Build a 10GB file from a smaller download. You can tweak filesize and the loop below for lower bandwidth.
if [ ! -f $srcfile ]; then
    cd $src
    if [ ! -f $srcfile.dl ]; then
        wget http://download.thinkbroadband.com/${filesize}.zip --output-document=$srcfile.dl
    fi
    rm -rf tarbase
    mkdir tarbase
    for i in {1..100}; do
        cp --reflink=always $srcfile.dl tarbase/$i.dl
    done
    tar -cvf $srcfile tarbase
    rm -rf tarbase
fi

cat <<END > $src/FileTest.java
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileTest {
    public static final int BLOCK_SIZE = 50000;
    public static final int MAX_ITERATIONS = 40000;

    public static void main(String args[]) throws IOException {
        String mode = args[0];
        RandomAccessFile f = new RandomAccessFile(args[1], "rw");
        //int offset = 0;
        int i;
        int offset = new java.util.Random().nextInt(BLOCK_SIZE); // initializer ONLY for incremental mode
        for (i = 0; i < MAX_ITERATIONS; i++) {
            try {
                int writeOffset;
                if (mode.equals("incremental")) {
                    writeOffset = new java.util.Random().nextInt(offset + i * BLOCK_SIZE);
                } else if (mode.equals("fixed")) {
                    writeOffset = i * BLOCK_SIZE;
                    offset = writeOffset; // for reporting it at the end
                } else { // mode.equals random
                    writeOffset = new java.util.Random().nextInt(((int)f.length() - 100));
                    offset = writeOffset; // for reporting it at the end
                }
                if (writeOffset > (f.length() - 100)) {
                    break;
                }
                f.seek(writeOffset);
                f.writeBytes("DEADBEEF");
            } catch (java.io.IOException e) {
                System.out.println("EOF");
                break;
            }
        }
        System.out.print("Last offset=" + offset);
        System.out.println(". Made " + i + " random writes.");
        f.close();
    }
}
END

cd $src
javac FileTest.java

/usr/bin/time --format 'rm: %E' rm -rf $dst/*
cp --reflink=always $srcfile $dst/1.tst
cd $dst
for i in {1..100}; do
    echo -n "$i."
    i_plus=`expr $i + 1`
    /usr/bin/time --format 'write: %E' java -cp $src FileTest $mode $i.tst
    /usr/bin/time --format 'cp: %E' cp --reflink=always $i.tst $i_plus.tst
    /usr/bin/time --format 'rm: %E' rm $i.tst
    /usr/bin/time --format 'sync: %E' sync
    sleep 1
done
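If a JDK is not handy, the write step can also be approximated from the shell. This is only a rough stand-in for FileTest's "fixed" mode (8 bytes rewritten every 50000 bytes), much slower than the Java version because it execs dd once per write, and not the tool used for the timings above:

#!/bin/bash
# Rough shell equivalent of FileTest's "fixed" mode.
file=$1
size=$(stat -c %s "$file")
offset=0
while [ $offset -lt $((size - 100)) ]; do
    # Overwrite 8 bytes in place at $offset without truncating the file.
    printf 'DEADBEEF' | dd of="$file" bs=1 seek=$offset count=8 conv=notrunc 2>/dev/null
    offset=$((offset + 50000))
done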
Duncan
2012-Feb-24 06:38 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:

> I noticed a few errors in the script that I used. I corrected it and it
> seems that degradation is occurring even at fully random writes:

I don't have an ssd, but is it possible that you're simply seeing erase-block related degradation due to multi-write-block sized erase-blocks?

It seems to me that when originally written to the btrfs-on-ssd, the file will likely be written block-sequentially enough that the file as a whole takes up relatively few erase-blocks. As you COW-write individual blocks, they'll be written elsewhere, perhaps all the changed blocks to a new erase-block, perhaps each to a different erase block.

As you increase the successive COW generation count, the file's filesystem/write blocks will be spread thru more and more erase-blocks (basically fragmentation, but of the SSD-critical kind), thus affecting modification and removal time but not read time.

IIRC I saw a note about this on the wiki, in regard to the nodatacow mount-option. Let's see if I can find it again. Hmm... yes...

http://btrfs.ipv5.de/index.php?title=Getting_started#Mount_Options

In particular this (for nodatacow; read the rest, as there are additional implications):

>>>>>
Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large.
<<<<<

In addition to nodatacow, see the note on the autodefrag option.

IOW, with the repeated generations of random-writes to cow-copies, you're apparently triggering a cow-worst-case fragmentation situation. It shouldn't affect read time much on SSD, but it certainly will affect copy and erase time, as the data and metadata (which, as you'll recall, is 2X by default on btrfs) gets written to more and more blocks that need to be updated at copy/erase time.

That /might/ be the problem triggering the freezes you noted that set off the original investigation as well, if the SSD firmware is running out of erase blocks and having to pause access while it rearranges data to allow operations to continue. Since your original issue on "rotating rust" drives was fragmentation, rewriting would seem to be something you do quite a lot of, triggering different but similar-cause issues on SSDs as well.

FWIW, with that sort of database-style workload, large files constantly random-change rewritten, something like xfs might be more appropriate than btrfs. See the recent xfs presentations (were they at ScaleX or LinuxConf.au? both happened about the same time and were covered in the same LWN weekly edition) as covered a couple weeks ago on LWN for more.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
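For anyone wanting to A/B the two options mentioned above against the test script, remounting between runs is enough for a quick check. This is only a sketch, with /mnt/btrfs and /dev/sdX1 as stand-ins for the real mountpoint and device; note that nodatacow also implies nodatasum, and that reflinked extents are still COWed the first time they are written:

mnt=/mnt/btrfs          # assumption: the filesystem under test
dev=/dev/sdX1           # assumption: its device node
umount $mnt && mount -o autodefrag $dev $mnt   # autodefrag queues files hit by small random writes for background defrag
# ... re-run the test script ...
umount $mnt && mount -o nodatacow $dev $mnt    # disables data COW/checksums for ordinary overwrites
# ... re-run the test script ...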
Nik Markovic
2012-Feb-24 20:38 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>
>> I noticed a few errors in the script that I used. I corrected it and it
>> seems that degradation is occurring even at fully random writes:
>
> I don't have an ssd, but is it possible that you're simply seeing
> erase-block related degradation due to multi-write-block sized erase-blocks?
>
> It seems to me that when originally written to the btrfs-on-ssd, the file
> will likely be written block-sequentially enough that the file as a whole
> takes up relatively few erase-blocks. As you COW-write individual
> blocks, they'll be written elsewhere, perhaps all the changed blocks to a
> new erase-block, perhaps each to a different erase block.

This is a very interesting insight. I wasn't even aware of the erase-block issue, so I did some reading up on it...

> As you increase the successive COW generation count, the file's
> filesystem/write blocks will be spread thru more and more erase-blocks
> (basically fragmentation, but of the SSD-critical kind), thus affecting
> modification and removal time but not read time.

OK, so the time to write would increase due to fragmentation and writing; it now makes sense (though I don't see why small writes would affect this, but my concerns are not writes anyway). But why would cp --reflink time increase so much? Yes, new extents would be created, but btrfs doesn't write into data blocks, does it? I figured its metadata would be kept in one place. I figure the only thing BTRFS would do on cp --reflink=always is:

1. Take the collection of extents owned by the source.
2. Make the new copy use the same collection of extents.
3. Write the collection of extents to the "directory".

Now this process seems to be CPU intensive. When I remove or make a reflink copy, one core spikes up to 100%, which tells me that there's a performance issue there, not an ssd issue. Also, only one CPU thread is being used for this. I figured that I can improve this by some setting. Maybe the thread_pool mount option? Are there any updates in later kernels that I should possibly pick up?

> IIRC I saw a note about this on the wiki, in regard to the nodatacow
> mount-option. Let's see if I can find it again. Hmm... yes...
>
> http://btrfs.ipv5.de/index.php?title=Getting_started#Mount_Options
>
> In particular this (for nodatacow; read the rest, as there are additional
> implications):
>
> Performance gain is usually < 5% unless the workload is random writes to
> large database files, where the difference can become very large.

Unless I am wrong, this would disable COW completely, and reflink copy with it. Reflinks are a crucial component and the sole reason I picked BTRFS for the system that I am writing for my company. The autodefrag option addresses multiple writes. Writing is not the problem, but cp --reflink should be near-instant. That was the reason we chose BTRFS over ZFS, which seemed to be the only feasible alternative. ZFS snapshots complicate the design, and deduplication copy time is the same as (or not much better than) raw copy time.

> In addition to nodatacow, see the note on the autodefrag option.
>
> [...]
>
> FWIW, with that sort of database-style workload, large files constantly
> random-change rewritten, something like xfs might be more appropriate
> than btrfs. See the recent xfs presentations (were they at ScaleX or
> LinuxConf.au? both happened about the same time and were covered in the
> same LWN weekly edition) as covered a couple weeks ago on LWN for more.

As I mentioned above, COW is the crucial component of our system, so XFS won't do. Our system does not do random writes. In fact it is mainly heavy on read operations. The system does occasional "rotation of rust" on large files in the way a version control system would (large files are modified and then used as a new baseline).

Thanks for all your help on this issue. I hope that someone can point out some more tweaks or added features/fixes after 3.2 RC5 that I may do.
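One way to see where that 100% core is actually spending its time is to profile a single reflink copy and remove. This is not something from the thread, just a sketch, assuming perf is installed and matches the running kernel; 10.tst here stands in for whatever generation the test is currently on:

cd test/dst                                     # the $dst directory from the test script
perf record -a -g -- cp --reflink=always 10.tst 10-copy.tst
perf report --sort symbol | head -n 30          # top symbols hit during the copy
perf record -a -g -- rm 10-copy.tst
perf report --sort symbol | head -n 30          # top symbols hit during the remove

If most of the samples land in btrfs extent and backref handling in the kernel, that would support the idea that the cost is in metadata bookkeeping rather than in the SSD itself.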
Nik Markovic
2012-Feb-24 21:33 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
To add... I also tried the nodatasum (only) and nodatacow options. I found somewhere that nodatacow doesn't really mean that COW is disabled. Test data is still the same - CPU spikes and times are the same.

On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic <nmarkovi.navteq@gmail.com> wrote:
> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> [...]
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread
> is being used for this.
>
> [...]
>
> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.
Duncan
2012-Feb-25 03:34 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
Nik Markovic posted on Fri, 24 Feb 2012 14:38:57 -0600 as excerpted:

> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>
>>> I noticed a few errors in the script that I used. I corrected it and
>>> it seems that degradation is occurring even at fully random writes:
>>
>> I don't have an ssd, but is it possible that you're simply seeing
>> erase-block related degradation due to multi-write-block sized
>> erase-blocks?
>>
>> It seems to me that when originally written to the btrfs-on-ssd, the
>> file will likely be written block-sequentially enough that the file as
>> a whole takes up relatively few erase-blocks. As you COW-write
>> individual blocks, they'll be written elsewhere, perhaps all the
>> changed blocks to a new erase-block, perhaps each to a different erase
>> block.
>
> This is a very interesting insight. I wasn't even aware of the
> erase-block issue, so I did some reading up on it...

I take it you looked at TRIM/discard, then, as well? In theory, and for some SSD firmware, it works well at helping to alleviate the problem by informing the SSD of data areas that it no longer needs to care about (empty space), thus allowing more effective management of those erase-blocks.

Reality is however not quite so simple, and it doesn't help a lot with some SSDs. Plus there's a potential performance issue when doing the discard, especially on earlier devices, since TRIM is an unqueueable command in the earlier standards (I've read it's defined as queueable in the latest standards, however), thus forcing a flush of all activity in the queue before the discard, potentially triggering I/O freeze behavior. Additionally, when run on top of dm-crypt, there's a potential security issue (examination of the raw undecrypted storage reveals whether there's data there or not, and possibly the filesystem type used based on patterns, a potential deniability issue in that they know the data is there, tho it doesn't affect the strength of the encryption itself).

So since on a lot of firmware it doesn't make a lot of difference anyway, and there are a couple of downsides, the btrfs ssd mount option does NOT enable discard as well. However, there *IS* a discard option that you can experiment with if you like, and it probably WILL help with erase-block handling on SOME firmware.

See the FAQ, part 3 Features, # 3.4 on TRIM/discard, for a bit more. I've really covered what it says, above, but there's a link to the encryption security vs TRIM research, for instance. And the discard mount-option for whatever reason isn't listed in the mount options page, or at least I didn't see it, only in the FAQ. (This is one URL; my client is wrapping it and it's a hassle to fix.)

http://btrfs.ipv5.de/index.php?title=FAQ#Does_Btrfs_support_TRIM.2Fdiscard.3F

Bottom line, if it is indeed an erase-block issue, the discard mount option MIGHT help, or it might not, depending on your device firmware. It's an experiment-and-see thing.

>> As you increase the successive COW generation count, the file's
>> filesystem/write blocks will be spread thru more and more erase-blocks
>> (basically fragmentation, but of the SSD-critical kind), thus affecting
>> modification and removal time but not read time.
>
> OK, so the time to write would increase due to fragmentation and writing;
> it now makes sense (though I don't see why small writes would affect this,
> but my concerns are not writes anyway). But why would cp --reflink time
> increase so much? Yes, new extents would be created, but btrfs doesn't
> write into data blocks, does it? I figured its metadata would be kept in
> one place. I figure the only thing BTRFS would do on cp --reflink=always:
> 1. Take a collection of extents owned by source.
> 2. Make the new copy use the same collection of extents.
> 3. Write the collection of extents to the "directory".
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread is
> being used for this. I figured that I can improve this by some setting.
> Maybe thread_pool mount option? Are there any updates in later kernels
> that I should possibly pick up?

FWIW... I am by no means an expert on this. I /think/ I understand enough of it to somewhat guide trial-and-error testing to arrive at a reasonable if not best-case config for any setup I might deal with, and well enough to hopefully point you in the right direction for your own research and testing, but I'm not going to claim to be able to explain the whys of individual cases, or even necessarily to understand them myself, just enough to know of the issue and to trial-and-error my way to a hopefully reasonable situation on any hardware I might run.

However, I could speculate (enough to guide my own testing were I troubleshooting here) that it's one of several things, or more likely a combination of them.

One, I'm not sure if the metadata ends up being COW also, or not, but if it is, then your test case is fragmenting it too, thus explaining the reflink copy issue. And keep in mind that by default, btrfs uses DUP for metadata, so there are TWO copies of it written, thereby DOUBLING the performance effects of anything affecting metadata!

Two, see the FAQ deduplication question/answer a couple questions below the TRIM/discard one mentioned above. I'm rather fuzzy on the filesystem implications of this myself, but it seems to me that our COW assumptions might be wrong because they're assuming deduplication effects that simply aren't the way btrfs works presently, as it hasn't implemented deduplication. Admittedly, this is at best a handwavy black-box factor, but that's the best I can do with it, presently. I guess that at least gives you another place to do additional research, if it comes to that. (In this regard I do wish the COW subsection of the sysadmin guide page on the wiki was written; it's simply punted ATM. There's a fair chance that a good explanation there would cover the filesystem-viewpoint differences between full deduplication and the COW that btrfs does, perhaps clearing up some misconceptions people, including me, may have about it as well.)

Three, as evident in the discussion on the nodatacow and autodefrag options I mentioned before, there are known issues with some use cases involving large files and rewrites of data at random locations within them. But I'm not sure if these known issues are simply the ones we've been discussing, or if there are other factors I'm unaware of in this regard. Knowing more about just what those known issues are and the specific scenarios under which they occur could go a long way toward resolving the situation for you. But I'm only a recent list regular, joining a few weeks ago as part of my own research into btrfs (FWIW my use case involves N-way mirroring, with N=3-4; since only no-mirroring and N=2 is available today and 3-way/N-way is planned to layer on top of raid5/6, which is planned for kernel 3.4/3.5, I'm now waiting for that... while continuing to stay current on the list), so whatever research or test cases led to the remarks on the wiki regarding large files with random data rewrites predates my involvement, likely by quite some time.

Four, there are additional block alignment issues having to do with the alignment of the partition on the physical storage, as it relates to read-, write- and erase-block sizes and alignment. On SSDs, erase-block sizes are the biggest, so the optimum alignment would be to erase-block size. Getting it wrong can result in multiple block writes and/or erases where proper alignment would require only one. This phenomenon is called write-amplification (and less commonly, erase-amplification). However, depending on what you used to create the partition on which the filesystem resides (and loopback files do tend toward worst-case), it's quite possible you don't have block-alignment level control at all. FWIW, that's one use case for the mkbtrfs/mkfs.btrfs --alloc-start/-A option, since that allows you to align the allocation within the partition as necessary, regardless of the partition alignment.

FWIW2, gptfdisk (a gpt partitioner, as opposed to the old mbr style) has reasonable alignment defaults of 1 MiB on disks without an existing partition layout, and attempts 8-sector (4 KiB) alignment even on existing layouts, for disks >=300 GB at least. That's what I've been using for the last few years, having converted to gpt-based partitioning for everything, even USB thumb drives, if partitioned. (GPT was designed for EFI, but can be used on BIOS-based systems as well, which is what I'm doing. Grub2 understands gpt well and puts to good use any reserved BIOS partition it finds, and there are options in the kernel for it that need to be enabled as well.)

Block alignment is DEFINITELY something you can play with, in terms of testing whether it makes a difference on your drives, SSD or "spinning rust". There are probably other factors involved of which I'm unaware, as well.

>> IIRC I saw a note about this on the wiki, in regard to the nodatacow
>> mount-option.
>>
>> In addition to nodatacow, see the note on the autodefrag option.
>
> Unless I am wrong, this would disable COW completely, and reflink copy.
> Reflinks are a crucial component and the sole reason I picked BTRFS for
> the system that I am writing for my company.
> The autodefrag option addresses multiple writes. Writing is not the
> problem, but cp --reflink should be near-instant. That was the reason we
> chose BTRFS over ZFS, which seemed to be the only feasible alternative.
> ZFS snapshots complicate the design and deduplication copy time is the
> same as (or not much better than) raw copy.
>
> As I mentioned above, COW is the crucial component of our system, so
> XFS won't do. Our system does not do random writes. In fact it is mainly
> heavy on read operations. The system does occasional "rotation of rust"
> on large files in the way a version control system would (large files
> are modified and then used as a new baseline).

Pardon me, I think I might have been too vague with that "rotating rust" allusion and lost you. Either that or you're taking the allusion out even further and potentially lost me! =;^0

I meant spinning magnetic media with that "rotating rust" reference, the "rotating rust" bit being a double-entendre allusion both to the iron oxide (rust) used as the data storage layer, and to the fact that many view rotating magnetic media as a legacy technology (rusting out) compared to SSDs. =:^) As it happens, I saw that double-meaning wordplay used elsewhere recently with the same two allusions attached, and liked it enough to use it myself when I got the chance. Only I'm not sure you got the reference, because...

You used it quite differently, referring to file rotation. So either you saw my reference and upped the ante, so to speak, leaving me to pick up the pieces, or I lost you with the original reference, one of the two. But I guess we should be on the same page knowing each other's meaning, now.

Meanwhile... [I do see your followup mentioning that nodatacow doesn't actually disable /all/ COW, and that you tested it, without significant change in the results...]

FWIW, I wasn't so much SUGGESTING those options as noting the INFORMATION contained in their description, the bit about random writes to large db files and their effect on btrfs. But testing (which you did) is a good idea, just to see what difference it makes; little in your case, so either the nocow option isn't disabling it in your case (specific use of cp --reflink), or the cow isn't the problem at all.

While you're at testing, tho, the question occurred to me of whether simply using btrfs' snapshotting would make a difference. (I did say I don't claim a full understanding, and that trial-and-error testing would be my method here, that I really only understand enough to hopefully guide me a bit in what to test...) Snapshotting by definition uses the COW capacities, but it occurs to me that since it's doing it on a filesystem-wide basis instead of a single-file basis, that might allow more efficiency in metadata handling.

Note that I don't necessarily expect that snapshotting would be a workable final solution for you, but if in testing you discover that the speed stays reasonable with the snapshot method (still only changing the single file between snapshots), while it degrades (as you've found) with the single-file cp --reflink method, then that's important data for the test case. Given btrfs' state of development, it could well lead to major optimizations of the single-file cp --reflink case as well, which you presumably COULD use in final deployment.

> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.

Talking about which... since you mentioned 3.2-rc5, you do seem aware of the fact that btrfs is still very much experimental status, in active development, and of the need for staying current on the kernel. However, unless your testing is for a system with actual deployment scheduled for, say, a year or more out, I'd question btrfs as a reasonable solution in any case.

One of the things that a lot of people don't seem to realize is just how much active btrfs development is still going on, and that it's NOT just corner-case uses such as the multi-mirror raid1 that I'm waiting on ATM, but that there are still data corruption issues being traced and fixed, etc. IOW, btrfs isn't something I'd recommend on either a production system or even a general user's system, for the time being.

If the intent is to test btrfs, filling it with data that you are not only prepared to see destroyed but expect to be destroyed, so you either have backups or simply don't value the data enough to be worth backups, and you're not counting on the btrfs copy as anything but experimental "garbage" data, expected to be lost in testing, then that's FINE. Such testing, and hopefully bug reporting, and patching where possible, is what btrfs is out there for, ATM.

But if the intent is to actually put production data on the filesystem, or use it as the primary copy of data that you don't want to lose, btrfs isn't an appropriate choice at this point, and I'd say probably won't be until say Q4, or even next year. So if your production deployment is scheduled for before that, really, you shouldn't be looking at btrfs for it, as it's not fit for that purpose ATM and isn't likely to be for another year or so (and even then, it'll be suitable for only the early adopters; the cautious folk will wait another year or more after that, just as many of the cautious folk are only now warming to ext4 as opposed to ext3).

I just don't want to see you back here as one of those folks asking questions about recovering data on a screwed filesystem, because they had no backups or the backups weren't kept current, because they were using btrfs for real-life use beyond testing purposes, and that's simply not the sort of use btrfs is designed to or can properly deliver at this point!

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
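To make the snapshot experiment suggested above concrete, here is a rough, untested sketch of driving the same write/snapshot/remove cycle with subvolume snapshots instead of cp --reflink; the paths and the FileTest class are assumed to be the ones from Nik's test script:

cd /mnt/btrfs                      # assumption: the filesystem under test
src=$(pwd)/test/src                # directory holding FileTest.class from the test script
btrfs subvolume create gen1
cp $src/test.tar gen1/test.tar     # seed the first generation
for i in $(seq 1 20); do
    next=$((i + 1))
    /usr/bin/time --format 'write: %E' java -cp $src FileTest incremental gen$i/test.tar
    /usr/bin/time --format 'snap: %E'  btrfs subvolume snapshot gen$i gen$next
    /usr/bin/time --format 'del: %E'   btrfs subvolume delete gen$i
    sync
done

If the snapshot and delete times stay flat here while the per-file cp --reflink times keep degrading, that would be a useful data point for narrowing down where the extent bookkeeping gets expensive.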
Christian Brunner
2012-Feb-27 08:29 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
2012/2/24 Nik Markovic <nmarkovi.navteq@gmail.com>:
> To add... I also tried the nodatasum (only) and nodatacow options. I found
> somewhere that nodatacow doesn't really mean that COW is disabled.
> Test data is still the same - CPU spikes and times are the same.
>
> On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic <nmarkovi.navteq@gmail.com> wrote:
>> [...]
>>
>> Now this process seems to be CPU intensive. When I remove or make a
>> reflink copy, one core spikes up to 100%, which tells me that there's a
>> performance issue there, not an ssd issue. Also, only one CPU thread
>> is being used for this. I figured that I can improve this by some
>> setting. Maybe thread_pool mount option? Are there any updates in
>> later kernels that I should possibly pick up?
>>
>> [...]

The symptoms you are reporting are quite similar to what I'm seeing in our Ceph cluster:

http://comments.gmane.org/gmane.comp.file-systems.btrfs/15413

AFAIK, Chris and Josef are working on it, but you'll have to wait for kernel 3.4 for this to be available in mainline. If you are feeling adventurous, you could try the patches in Josef's git tree, but I think it's still experimental.

Regards,
Christian