Nik Markovic
2012-Feb-24 01:32 UTC
Strange performance degradation when COW writes happen at fixed offsets
Hi,

My kernel version is 32-bit 3.2.0-rc5 and I am using btrfs-tools 0.19. I was having performance issues with BTRFS fragmentation on HDDs, so I decided to switch to an SSD to see if these would go away. Performance was much better, but at times I would see a "freeze" happen which I can't really explain. The CPU would spike up to 100% at times. I decided to try to reproduce this.

Though it may or may not be related, while testing BTRFS performance I encountered this interesting problem where performance depends on whether a file is freshly copied onto a BTRFS filesystem or obtained via COW "children". This is all happening on a Crucial M4 SSD, so something in the SSD firmware could be causing the issue, but I feel it's related to BTRFS metadata.

Here is the test:

1. Write a fresh large file, called A, to the filesystem.
2. Make a reflink (COW) copy of A, called B.
3. Modify a set of random blocks in B.
4. Remove A.
5. Repeat steps 2-4, using the newly produced B as the new A.

Expected results: each iteration takes an equal amount of time to complete on an SSD, because there is no fragmentation involved and the system is in the same state at step 2 (there is always only one file on the filesystem). I used a 1GB file as my source.

I repeated the tests using different algorithms for the "write" in step 3 above:

Algorithm 1 (random): write 8 bytes at a random offset.
Algorithm 2 (fixed): write the first 8 bytes, then continue at 50k offsets.
Algorithm 3 (incremental): write the first 8 bytes at offset = random(50k), then continue at 50k offsets.

For each test there were 40k writes total. The algorithm is in the Java code below.

The following is observed with each iteration ONLY when using algorithm #3:

1. Over time, the time to modify the file increases.
2. Over time, the time to make the reflink copy increases.
3. Over time, the time to remove the file increases.
4. The first few writes take less than the normal time to complete.

Data for the 1st/5th/10th/15th/20th iteration (times in seconds):

Algorithms 1 and 2:
  Write:  always ~6
  Copy:   always ~0.5
  Remove: always ~0.1

Algorithm 3:
  Write:  2 / 6 / 9 / 10 / 11.5
  Copy:   0.5 / 3 / 4.5 / 5.5 / 6
  Remove: 0.1 / 1 / 2 / 2 / 2

As you can see, things degrade and taper off after the 10th iteration. This probably has to do with the 4k block size being near 50k/10. I don't think this has to do with SSD garbage collection, because I ran these tests multiple times.

To use this script, cd into an empty directory on a btrfs filesystem and run it with "incremental" as the argument. You can use the other modes to confirm the expected behavior. Script used to reproduce the bug:

#!/bin/bash
mode=$1
if [ -z "$mode" ]; then
    echo "Usage $0 <incremental|random|fixed>"
    exit -1
fi
mode=$1
src=`pwd`/test/src
dst=`pwd`/test/dst
srcfile=$src/test.tar
dstfile=$dst/test.tar
mkdir -p $src
mkdir -p $dst
filesize=100MB

# Build a 1GB file from a smaller download. You can tweak filesize and the loop below for lower bandwidth.
if [ ! -f $srcfile ]; then
    cd $src
    if [ ! -f $srcfile.dl ]; then
        wget http://download.thinkbroadband.com/${filesize}.zip --output-document=$srcfile.dl
    fi
    rm -rf tarbase
    mkdir tarbase
    for i in {1..10}; do
        cp --reflink=always $srcfile.dl tarbase/$i.dl
    done
    tar -cvf $srcfile tarbase
    rm -rf tarbase
fi

cat <<END > $src/FileTest.java
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileTest {
    public static final int BLOCK_SIZE = 50000;
    public static final int MAX_ITERATIONS = 40000;

    public static void main(String args[]) throws IOException {
        String mode = args[0];
        RandomAccessFile f = new RandomAccessFile(args[1], "rw");
        //int offset = 0;
        int i;
        int offset = new java.util.Random().nextInt(BLOCK_SIZE); // initializer ONLY for incremental mode
        for (i = 0; i < MAX_ITERATIONS; i++) {
            try {
                int writeOffset;
                if (mode.equals("incremental")) {
                    writeOffset = new java.util.Random().nextInt(offset + i * BLOCK_SIZE);
                } else { // mode.equals random
                    writeOffset = new java.util.Random().nextInt(((int)f.length() - 100));
                    offset = writeOffset; // for reporting it at the end
                }
                f.seek(writeOffset);
                f.writeBytes("DEADBEEF");
            } catch (java.io.IOException e) {
                System.out.println("EOF");
                break;
            }
        }
        System.out.print("Last offset=" + offset);
        System.out.println(". Made " + i + " random writes.");
        f.close();
    }
}
END

cd $src
javac FileTest.java

/usr/bin/time --format 'rm: %E' rm -rf $dst/*
cp --reflink=always $srcfile.dl $dst/1.tst
cd $dst
for i in {1..20}; do
    echo -n "$i."
    i_plus=`expr $i + 1`
    /usr/bin/time --format 'write: %E' java -cp $src FileTest $mode $i.tst
    /usr/bin/time --format 'cp: %E' cp --reflink=always $i.tst $i_plus.tst
    /usr/bin/time --format 'rm: %E' rm $i.tst
    /usr/bin/time --format 'sync: %E' sync
    sleep 1
done
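A quick way to check whether the slowdown tracks on-disk extent fragmentation, rather than anything the SSD is doing, is to count extents between iterations. This is only a rough sketch, not part of the script above, assuming filefrag from e2fsprogs is installed:

for f in $dst/*.tst; do        # $dst as in the script above
    filefrag "$f"              # prints "<file>: N extents found"
done

If the extent count keeps climbing from generation to generation while the cp and rm times climb with it, that points at extent/metadata growth in the filesystem rather than the SSD firmware.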
Nik Markovic
2012-Feb-24 02:31 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
I noticed a few errors in the script that I used. I corrected it, and it seems that degradation is occurring even at fully random writes:

#!/bin/bash
mode=$1
if [ -z "$mode" ]; then
    echo "Usage $0 <incremental|random|fixed>"
    exit -1
fi
mode=$1
src=`pwd`/test/src
dst=`pwd`/test/dst
srcfile=$src/test.tar
dstfile=$dst/test.tar
mkdir -p $src
mkdir -p $dst
filesize=100MB

# Build a 10GB file from a smaller download. You can tweak filesize and the loop below for lower bandwidth.
if [ ! -f $srcfile ]; then
    cd $src
    if [ ! -f $srcfile.dl ]; then
        wget http://download.thinkbroadband.com/${filesize}.zip --output-document=$srcfile.dl
    fi
    rm -rf tarbase
    mkdir tarbase
    for i in {1..100}; do
        cp --reflink=always $srcfile.dl tarbase/$i.dl
    done
    tar -cvf $srcfile tarbase
    rm -rf tarbase
fi

cat <<END > $src/FileTest.java
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileTest {
    public static final int BLOCK_SIZE = 50000;
    public static final int MAX_ITERATIONS = 40000;

    public static void main(String args[]) throws IOException {
        String mode = args[0];
        RandomAccessFile f = new RandomAccessFile(args[1], "rw");
        //int offset = 0;
        int i;
        int offset = new java.util.Random().nextInt(BLOCK_SIZE); // initializer ONLY for incremental mode
        for (i = 0; i < MAX_ITERATIONS; i++) {
            try {
                int writeOffset;
                if (mode.equals("incremental")) {
                    writeOffset = new java.util.Random().nextInt(offset + i * BLOCK_SIZE);
                } else if (mode.equals("fixed")) {
                    writeOffset = i * BLOCK_SIZE;
                    offset = writeOffset; // for reporting it at the end
                } else { // mode.equals random
                    writeOffset = new java.util.Random().nextInt(((int)f.length() - 100));
                    offset = writeOffset; // for reporting it at the end
                }
                if (writeOffset > (f.length() - 100)) {
                    break;
                }
                f.seek(writeOffset);
                f.writeBytes("DEADBEEF");
            } catch (java.io.IOException e) {
                System.out.println("EOF");
                break;
            }
        }
        System.out.print("Last offset=" + offset);
        System.out.println(". Made " + i + " random writes.");
        f.close();
    }
}
END

cd $src
javac FileTest.java

/usr/bin/time --format 'rm: %E' rm -rf $dst/*
cp --reflink=always $srcfile $dst/1.tst
cd $dst
for i in {1..100}; do
    echo -n "$i."
    i_plus=`expr $i + 1`
    /usr/bin/time --format 'write: %E' java -cp $src FileTest $mode $i.tst
    /usr/bin/time --format 'cp: %E' cp --reflink=always $i.tst $i_plus.tst
    /usr/bin/time --format 'rm: %E' rm $i.tst
    /usr/bin/time --format 'sync: %E' sync
    sleep 1
done
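If a JDK is not handy, the write step can also be approximated from the shell. This is only a rough stand-in for FileTest's "fixed" mode (8 bytes rewritten every 50000 bytes), much slower than the Java version because it execs dd once per write, and not the tool used for the timings above:

#!/bin/bash
# Rough shell equivalent of FileTest's "fixed" mode.
file=$1
size=$(stat -c %s "$file")
offset=0
while [ $offset -lt $((size - 100)) ]; do
    # Overwrite 8 bytes in place at $offset without truncating the file.
    printf 'DEADBEEF' | dd of="$file" bs=1 seek=$offset count=8 conv=notrunc 2>/dev/null
    offset=$((offset + 50000))
done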
Duncan
2012-Feb-24 06:38 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:

> I noticed a few errors in the script that I used. I corrected it and it
> seems that degradation is occurring even at fully random writes:

I don't have an ssd, but is it possible that you're simply seeing erase-block related degradation due to multi-write-block sized erase-blocks?

It seems to me that when originally written to the btrfs-on-ssd, the file will likely be written block-sequentially enough that the file as a whole takes up relatively few erase-blocks. As you COW-write individual blocks, they'll be written elsewhere, perhaps all the changed blocks to a new erase-block, perhaps each to a different erase block.

As you increase the successive COW generation count, the file's filesystem/write blocks will be spread thru more and more erase-blocks (basically fragmentation, but of the SSD-critical kind), thus affecting modification and removal time but not read time.

IIRC I saw a note about this on the wiki, in regard to the nodatacow mount-option. Let's see if I can find it again. Hmm... yes...

http://btrfs.ipv5.de/index.php?title=Getting_started#Mount_Options

In particular this (for nodatacow; read the rest, as there are additional implications):

>>>>>
Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large.
<<<<<

In addition to nodatacow, see the note on the autodefrag option.

IOW, with the repeated generations of random-writes to cow-copies, you're apparently triggering a cow-worst-case fragmentation situation. It shouldn't affect read time much on SSD, but it certainly will affect copy and erase time, as the data and metadata (which, as you'll recall, is 2X by default on btrfs) gets written to more and more blocks that need to be updated at copy/erase time.

That /might/ be the problem triggering the freezes you noted that set off the original investigation as well, if the SSD firmware is running out of erase blocks and having to pause access while it rearranges data to allow operations to continue. Since your original issue on "rotating rust" drives was fragmentation, rewriting would seem to be something you do quite a lot of, triggering different but similar-cause issues on SSDs as well.

FWIW, with that sort of database-style workload, large files constantly random-change rewritten, something like xfs might be more appropriate than btrfs. See the recent xfs presentations (were they at ScaleX or LinuxConf.au? both happened about the same time and were covered in the same LWN weekly edition) as covered a couple weeks ago on LWN for more.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
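For anyone wanting to A/B the two options mentioned above against the test script, remounting between runs is enough for a quick check. This is only a sketch, with /mnt/btrfs and /dev/sdX1 as stand-ins for the real mountpoint and device; note that nodatacow also implies nodatasum, and that reflinked extents are still COWed the first time they are written:

mnt=/mnt/btrfs          # assumption: the filesystem under test
dev=/dev/sdX1           # assumption: its device node
umount $mnt && mount -o autodefrag $dev $mnt   # autodefrag queues files hit by small random writes for background defrag
# ... re-run the test script ...
umount $mnt && mount -o nodatacow $dev $mnt    # disables data COW/checksums for ordinary overwrites
# ... re-run the test script ...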
Nik Markovic
2012-Feb-24 20:38 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>
>> I noticed a few errors in the script that I used. I corrected it and it
>> seems that degradation is occurring even at fully random writes:
>
> I don't have an ssd, but is it possible that you're simply seeing
> erase-block related degradation due to multi-write-block sized erase-blocks?
>
> It seems to me that when originally written to the btrfs-on-ssd, the file
> will likely be written block-sequentially enough that the file as a whole
> takes up relatively few erase-blocks. As you COW-write individual
> blocks, they'll be written elsewhere, perhaps all the changed blocks to a
> new erase-block, perhaps each to a different erase block.

This is a very interesting insight. I wasn't even aware of the erase-block issue, so I did some reading up on it...

> As you increase the successive COW generation count, the file's
> filesystem/write blocks will be spread thru more and more erase-blocks
> (basically fragmentation, but of the SSD-critical kind), thus affecting
> modification and removal time but not read time.

OK, so the time to write would increase due to fragmentation and writing; it now makes sense (though I don't see why small writes would affect this, but my concerns are not writes anyway). But why would cp --reflink time increase so much? Yes, new extents would be created, but btrfs doesn't write into data blocks, does it? I figured its metadata would be kept in one place. I figure the only thing BTRFS would do on cp --reflink=always is:

1. Take the collection of extents owned by the source.
2. Make the new copy use the same collection of extents.
3. Write the collection of extents to the "directory".

Now this process seems to be CPU intensive. When I remove or make a reflink copy, one core spikes up to 100%, which tells me that there's a performance issue there, not an ssd issue. Also, only one CPU thread is being used for this. I figured that I can improve this by some setting. Maybe the thread_pool mount option? Are there any updates in later kernels that I should possibly pick up?

> IIRC I saw a note about this on the wiki, in regard to the nodatacow
> mount-option. Let's see if I can find it again. Hmm... yes...
>
> http://btrfs.ipv5.de/index.php?title=Getting_started#Mount_Options
>
> In particular this (for nodatacow; read the rest, as there are additional
> implications):
>
> Performance gain is usually < 5% unless the workload is random writes to
> large database files, where the difference can become very large.

Unless I am wrong, this would disable COW completely, and reflink copy with it. Reflinks are a crucial component and the sole reason I picked BTRFS for the system that I am writing for my company. The autodefrag option addresses multiple writes. Writing is not the problem, but cp --reflink should be near-instant. That was the reason we chose BTRFS over ZFS, which seemed to be the only feasible alternative. ZFS snapshots complicate the design, and deduplication copy time is the same as (or not much better than) raw copy time.

> In addition to nodatacow, see the note on the autodefrag option.
>
> [...]
>
> FWIW, with that sort of database-style workload, large files constantly
> random-change rewritten, something like xfs might be more appropriate
> than btrfs. See the recent xfs presentations (were they at ScaleX or
> LinuxConf.au? both happened about the same time and were covered in the
> same LWN weekly edition) as covered a couple weeks ago on LWN for more.

As I mentioned above, COW is the crucial component of our system, so XFS won't do. Our system does not do random writes. In fact it is mainly heavy on read operations. The system does occasional "rotation of rust" on large files in the way a version control system would (large files are modified and then used as a new baseline).

Thanks for all your help on this issue. I hope that someone can point out some more tweaks or added features/fixes after 3.2 RC5 that I may do.
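One way to see where that 100% core is actually spending its time is to profile a single reflink copy and remove. This is not something from the thread, just a sketch, assuming perf is installed and matches the running kernel; 10.tst here stands in for whatever generation the test is currently on:

cd test/dst                                     # the $dst directory from the test script
perf record -a -g -- cp --reflink=always 10.tst 10-copy.tst
perf report --sort symbol | head -n 30          # top symbols hit during the copy
perf record -a -g -- rm 10-copy.tst
perf report --sort symbol | head -n 30          # top symbols hit during the remove

If most of the samples land in btrfs extent and backref handling in the kernel, that would support the idea that the cost is in metadata bookkeeping rather than in the SSD itself.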
Nik Markovic
2012-Feb-24 21:33 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
To add... I also tried the nodatasum (only) and nodatacow options. I found somewhere that nodatacow doesn't really mean that COW is disabled. Test data is still the same - CPU spikes and times are the same.

On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic <nmarkovi.navteq@gmail.com> wrote:
> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> [...]
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread
> is being used for this.
>
> [...]
>
> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.
Duncan
2012-Feb-25 03:34 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
Nik Markovic posted on Fri, 24 Feb 2012 14:38:57 -0600 as excerpted:

> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>
>>> I noticed a few errors in the script that I used. I corrected it and
>>> it seems that degradation is occurring even at fully random writes:
>>
>> I don't have an ssd, but is it possible that you're simply seeing
>> erase-block related degradation due to multi-write-block sized
>> erase-blocks?
>>
>> It seems to me that when originally written to the btrfs-on-ssd, the
>> file will likely be written block-sequentially enough that the file as
>> a whole takes up relatively few erase-blocks. As you COW-write
>> individual blocks, they'll be written elsewhere, perhaps all the
>> changed blocks to a new erase-block, perhaps each to a different erase
>> block.
>
> This is a very interesting insight. I wasn't even aware of the
> erase-block issue, so I did some reading up on it...

I take it you looked at TRIM/discard, then, as well? In theory, and for some SSD firmware, it works well at helping to alleviate the problem by informing the SSD of data areas that it no longer needs to care about (empty space), thus allowing more effective management of those erase-blocks.

Reality is however not quite so simple, and it doesn't help a lot with some SSDs. Plus there's a potential performance issue when doing the discard, especially on earlier devices, since TRIM is an unqueueable command in the earlier standards (I've read it's defined as queueable in the latest standards, however), thus forcing a flush of all activity in the queue before the discard, potentially triggering I/O freeze behavior. Additionally, when run on top of dm-crypt, there's a potential security issue (examination of the raw undecrypted storage reveals whether there's data there or not, and possibly the filesystem type used based on patterns, a potential deniability issue in that they know the data is there, tho it doesn't affect the strength of the encryption itself).

So since on a lot of firmware it doesn't make a lot of difference anyway, and there are a couple of downsides, the btrfs ssd mount option does NOT enable discard as well. However, there *IS* a discard option that you can experiment with if you like, and it probably WILL help with erase-block handling on SOME firmware.

See the FAQ, part 3 Features, # 3.4 on TRIM/discard, for a bit more. I've really covered what it says, above, but there's a link to the encryption security vs TRIM research, for instance. And the discard mount-option for whatever reason isn't listed in the mount options page, or at least I didn't see it, only in the FAQ. (This is one URL; my client is wrapping it and it's a hassle to fix.)

http://btrfs.ipv5.de/index.php?title=FAQ#Does_Btrfs_support_TRIM.2Fdiscard.3F

Bottom line, if it is indeed an erase-block issue, the discard mount option MIGHT help, or it might not, depending on your device firmware. It's an experiment-and-see thing.

>> As you increase the successive COW generation count, the file's
>> filesystem/write blocks will be spread thru more and more erase-blocks
>> (basically fragmentation, but of the SSD-critical kind), thus affecting
>> modification and removal time but not read time.
>
> OK, so the time to write would increase due to fragmentation and writing;
> it now makes sense (though I don't see why small writes would affect this,
> but my concerns are not writes anyway). But why would cp --reflink time
> increase so much? Yes, new extents would be created, but btrfs doesn't
> write into data blocks, does it? I figured its metadata would be kept in
> one place. I figure the only thing BTRFS would do on cp --reflink=always:
> 1. Take a collection of extents owned by source.
> 2. Make the new copy use the same collection of extents.
> 3. Write the collection of extents to the "directory".
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread is
> being used for this. I figured that I can improve this by some setting.
> Maybe thread_pool mount option? Are there any updates in later kernels
> that I should possibly pick up?

FWIW... I am by no means an expert on this. I /think/ I understand enough of it to somewhat guide trial-and-error testing to arrive at a reasonable if not best-case config for any setup I might deal with, and well enough to hopefully point you in the right direction for your own research and testing, but I'm not going to claim to be able to explain the whys of individual cases, or even necessarily to understand them myself, just enough to know of the issue and to trial-and-error my way to a hopefully reasonable situation on any hardware I might run.

However, I could speculate (enough to guide my own testing were I troubleshooting here) that it's one of several things, or more likely a combination of them.

One, I'm not sure if the metadata ends up being COW also, or not, but if it is, then your test case is fragmenting it too, thus explaining the reflink copy issue. And keep in mind that by default, btrfs uses DUP for metadata, so there are TWO copies of it written, thereby DOUBLING the performance effects of anything affecting metadata!

Two, see the FAQ deduplication question/answer a couple questions below the TRIM/discard one mentioned above. I'm rather fuzzy on the filesystem implications of this myself, but it seems to me that our COW assumptions might be wrong because they're assuming deduplication effects that simply aren't the way btrfs works presently, as it hasn't implemented deduplication. Admittedly, this is at best a handwavy black-box factor, but that's the best I can do with it, presently. I guess that at least gives you another place to do additional research, if it comes to that. (In this regard I do wish the COW subsection of the sysadmin guide page on the wiki was written; it's simply punted ATM. There's a fair chance that a good explanation there would cover the filesystem-viewpoint differences between full deduplication and the COW that btrfs does, perhaps clearing up some misconceptions people, including me, may have about it as well.)

Three, as evident in the discussion on the nodatacow and autodefrag options I mentioned before, there are known issues with some use cases involving large files and rewrites of data at random locations within them. But I'm not sure if these known issues are simply the ones we've been discussing, or if there are other factors I'm unaware of in this regard. Knowing more about just what those known issues are and the specific scenarios under which they occur could go a long way toward resolving the situation for you. But I'm only a recent list regular, joining a few weeks ago as part of my own research into btrfs (FWIW my use case involves N-way mirroring, with N=3-4; since only no-mirroring and N=2 is available today and 3-way/N-way is planned to layer on top of raid5/6, which is planned for kernel 3.4/3.5, I'm now waiting for that... while continuing to stay current on the list), so whatever research or test cases led to the remarks on the wiki regarding large files with random data rewrites predates my involvement, likely by quite some time.

Four, there are additional block alignment issues having to do with the alignment of the partition on the physical storage, as it relates to read-, write- and erase-block sizes and alignment. On SSDs, erase-block sizes are the biggest, so the optimum alignment would be to erase-block size. Getting it wrong can result in multiple block writes and/or erases where proper alignment would require only one. This phenomenon is called write-amplification (and less commonly, erase-amplification). However, depending on what you used to create the partition on which the filesystem resides (and loopback files do tend toward worst-case), it's quite possible you don't have block-alignment level control at all. FWIW, that's one use case for the mkbtrfs/mkfs.btrfs --alloc-start/-A option, since that allows you to align the allocation within the partition as necessary, regardless of the partition alignment.

FWIW2, gptfdisk (a gpt partitioner, as opposed to the old mbr style) has reasonable alignment defaults of 1 MiB on disks without an existing partition layout, and attempts 8-sector (4 KiB) alignment even on existing layouts, for disks >=300 GB at least. That's what I've been using for the last few years, having converted to gpt-based partitioning for everything, even USB thumb drives, if partitioned. (GPT was designed for EFI, but can be used on BIOS-based systems as well, which is what I'm doing. Grub2 understands gpt well and puts to good use any reserved BIOS partition it finds, and there are options in the kernel for it that need to be enabled as well.)

Block alignment is DEFINITELY something you can play with, in terms of testing whether it makes a difference on your drives, SSD or "spinning rust". There are probably other factors involved of which I'm unaware, as well.

>> IIRC I saw a note about this on the wiki, in regard to the nodatacow
>> mount-option.
>>
>> In addition to nodatacow, see the note on the autodefrag option.
>
> Unless I am wrong, this would disable COW completely, and reflink copy.
> Reflinks are a crucial component and the sole reason I picked BTRFS for
> the system that I am writing for my company.
> The autodefrag option addresses multiple writes. Writing is not the
> problem, but cp --reflink should be near-instant. That was the reason we
> chose BTRFS over ZFS, which seemed to be the only feasible alternative.
> ZFS snapshots complicate the design and deduplication copy time is the
> same as (or not much better than) raw copy.
>
> As I mentioned above, COW is the crucial component of our system, so
> XFS won't do. Our system does not do random writes. In fact it is mainly
> heavy on read operations. The system does occasional "rotation of rust"
> on large files in the way a version control system would (large files
> are modified and then used as a new baseline).

Pardon me, I think I might have been too vague with that "rotating rust" allusion and lost you. Either that or you're taking the allusion out even further and potentially lost me! =;^0

I meant spinning magnetic media with that "rotating rust" reference, the "rotating rust" bit being a double-entendre allusion both to the iron oxide (rust) used as the data storage layer, and to the fact that many view rotating magnetic media as a legacy technology (rusting out) compared to SSDs. =:^) As it happens, I saw that double-meaning wordplay used elsewhere recently with the same two allusions attached, and liked it enough to use it myself when I got the chance. Only I'm not sure you got the reference, because...

You used it quite differently, referring to file rotation. So either you saw my reference and upped the ante, so to speak, leaving me to pick up the pieces, or I lost you with the original reference, one of the two. But I guess we should be on the same page knowing each other's meaning, now.

Meanwhile... [I do see your followup mentioning that nodatacow doesn't actually disable /all/ COW, and that you tested it, without significant change in the results...]

FWIW, I wasn't so much SUGGESTING those options as noting the INFORMATION contained in their description, the bit about random writes to large db files and their effect on btrfs. But testing (which you did) is a good idea, just to see what difference it makes; little in your case, so either the nocow option isn't disabling it in your case (specific use of cp --reflink), or the cow isn't the problem at all.

While you're at testing, tho, the question occurred to me of whether simply using btrfs' snapshotting would make a difference. (I did say I don't claim a full understanding, and that trial-and-error testing would be my method here, that I really only understand enough to hopefully guide me a bit in what to test...) Snapshotting by definition uses the COW capacities, but it occurs to me that since it's doing it on a filesystem-wide basis instead of a single-file basis, that might allow more efficiency in metadata handling.

Note that I don't necessarily expect that snapshotting would be a workable final solution for you, but if in testing you discover that the speed stays reasonable with the snapshot method (still only changing the single file between snapshots), while it degrades (as you've found) with the single-file cp --reflink method, then that's important data for the test case. Given btrfs' state of development, it could well lead to major optimizations of the single-file cp --reflink case as well, which you presumably COULD use in final deployment.

> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.

Talking about which... since you mentioned 3.2-rc5, you do seem aware of the fact that btrfs is still very much experimental status, in active development, and of the need for staying current on the kernel. However, unless your testing is for a system with actual deployment scheduled for, say, a year or more out, I'd question btrfs as a reasonable solution in any case.

One of the things that a lot of people don't seem to realize is just how much active btrfs development is still going on, and that it's NOT just corner-case uses such as the multi-mirror raid1 that I'm waiting on ATM, but that there are still data corruption issues being traced and fixed, etc. IOW, btrfs isn't something I'd recommend on either a production system or even a general user's system, for the time being.

If the intent is to test btrfs, filling it with data that you are not only prepared to see destroyed but expect to be destroyed, so you either have backups or simply don't value the data enough to be worth backups, and you're not counting on the btrfs copy as anything but experimental "garbage" data, expected to be lost in testing, then that's FINE. Such testing, and hopefully bug reporting, and patching where possible, is what btrfs is out there for, ATM.

But if the intent is to actually put production data on the filesystem, or use it as the primary copy of data that you don't want to lose, btrfs isn't an appropriate choice at this point, and I'd say probably won't be until say Q4, or even next year. So if your production deployment is scheduled for before that, really, you shouldn't be looking at btrfs for it, as it's not fit for that purpose ATM and isn't likely to be for another year or so (and even then, it'll be suitable for only the early adopters; the cautious folk will wait another year or more after that, just as many of the cautious folk are only now warming to ext4 as opposed to ext3).

I just don't want to see you back here as one of those folks asking questions about recovering data on a screwed filesystem, because they had no backups or the backups weren't kept current, because they were using btrfs for real-life use beyond testing purposes, and that's simply not the sort of use btrfs is designed to or can properly deliver at this point!

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
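To make the snapshot experiment suggested above concrete, here is a rough, untested sketch of driving the same write/snapshot/remove cycle with subvolume snapshots instead of cp --reflink; the paths and the FileTest class are assumed to be the ones from Nik's test script:

cd /mnt/btrfs                      # assumption: the filesystem under test
src=$(pwd)/test/src                # directory holding FileTest.class from the test script
btrfs subvolume create gen1
cp $src/test.tar gen1/test.tar     # seed the first generation
for i in $(seq 1 20); do
    next=$((i + 1))
    /usr/bin/time --format 'write: %E' java -cp $src FileTest incremental gen$i/test.tar
    /usr/bin/time --format 'snap: %E'  btrfs subvolume snapshot gen$i gen$next
    /usr/bin/time --format 'del: %E'   btrfs subvolume delete gen$i
    sync
done

If the snapshot and delete times stay flat here while the per-file cp --reflink times keep degrading, that would be a useful data point for narrowing down where the extent bookkeeping gets expensive.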
Christian Brunner
2012-Feb-27 08:29 UTC
Re: Strange performance degradation when COW writes happen at fixed offsets
2012/2/24 Nik Markovic <nmarkovi.navteq@gmail.com>:
> To add... I also tried the nodatasum (only) and nodatacow options. I found
> somewhere that nodatacow doesn't really mean that COW is disabled.
> Test data is still the same - CPU spikes and times are the same.
>
> On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic <nmarkovi.navteq@gmail.com> wrote:
>> [...]
>>
>> Now this process seems to be CPU intensive. When I remove or make a
>> reflink copy, one core spikes up to 100%, which tells me that there's a
>> performance issue there, not an ssd issue. Also, only one CPU thread
>> is being used for this. I figured that I can improve this by some
>> setting. Maybe thread_pool mount option? Are there any updates in
>> later kernels that I should possibly pick up?
>>
>> [...]

The symptoms you are reporting are quite similar to what I'm seeing in our Ceph cluster:

http://comments.gmane.org/gmane.comp.file-systems.btrfs/15413

AFAIK, Chris and Josef are working on it, but you'll have to wait for kernel 3.4 for this to be available in mainline. If you are feeling adventurous, you could try the patches in Josef's git tree, but I think it's still experimental.

Regards,
Christian