Dave Chinner
2013-May-04 01:15 UTC
[3.9] parallel fsmark perf is real bad on sparse devices
Hi folks,

It's that time again - I ran fsmark on btrfs and found performance
was awful.

tl;dr: memory pressure causes random writeback of metadata ("bad"),
fragmenting the underlying sparse storage. This causes a downward
spiral as btrfs cycles through "good" IO patterns that get
fragmented at the device level due to the "bad" IO patterns
fragmenting the underlying sparse device.

FYI, the storage hardware is a DM RAID0 stripe across 4 SSDs sitting
behind 512MB of BBWC with an XFS filesystem on it. The only file on
the filesystem is the sparse 100TB file used for the device, and the
VM is using virtio,cache=none to access the filesystem image. i.e.
the storage I'm working on this time is a thinly provisioned 100TB
device fed to an 8p, 4GB RAM VM, and this script is then run:

$ cat fsmark-50-test-btrfs.sh
#!/bin/bash

sudo umount /mnt/scratch > /dev/null 2>&1
sudo mkfs.btrfs /dev/vdc
sudo mount /dev/vdc /mnt/scratch
sudo chmod 777 /mnt/scratch
cd /home/dave/src/fs_mark-3.3/
time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 63 \
        -d /mnt/scratch/0 -d /mnt/scratch/1 \
        -d /mnt/scratch/2 -d /mnt/scratch/3 \
        -d /mnt/scratch/4 -d /mnt/scratch/5 \
        -d /mnt/scratch/6 -d /mnt/scratch/7 \
        | tee >(stats --trim-outliers | tail -1 1>&2)
sync
$

$ ./fsmark-50-test-btrfs.sh

WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using

fs created label (null) on /dev/vdc
        nodesize 4096 leafsize 4096 sectorsize 4096 size 100.00TB
Btrfs Btrfs v0.19

#  ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 63 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
#       Version 3.3, 8 thread(s) starting at Fri May  3 17:08:46 2013
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            0      53498.9          7898900
     0      1600000            0      11186.5          9409278
     0      2400000            0      17026.1          7907599
     0      3200000            0      25815.6          9749980
     0      4000000            0      11503.0          8556349
     0      4800000            0      43561.9          8295238
     0      5600000            0      17175.3          8304668
^C
0 800000-5600000(3.2e+06+/-1.1e+06) 0 11186.500000-53498.900000(23016.4+/-1.1e+04) 7898900-9749980(8.49463e+06+/-5e+05)

What I'm seeing is that the underlying image file is getting badly,
badly fragmented. This short test created approximately 8 million
extents in the image file in about 10 minutes of runtime. Running
xfs_fsr on the image file pointed this out:

# xfs_fsr -d -v vm-100TB-sparse.img
vm-100TB-sparse.img
vm-100TB-sparse.img extents=7971773 can_save=7926036 tmp=./.fsr6198
DEBUG: fsize=109951162777600 blsz_dio=16773120 d_min=512 d_max=2147483136 pgsz=4096
Temporary file has 46107 extents (7971773 in original)
extents before:7971773 after:46107 vm-100TB-sparse.img
#

Most of the data written to the file is contiguous. This means that
btrfs is filling the filesystem in a contiguous manner, but its IO
is anything but contiguous.

So, what's happening here? It turns out that when the machine first
runs out of free memory (about 1.2m inodes in), btrfs goes from
running a couple of hundred nice large 512k IOs a second to an
intense 10s-long burst of 10-15k IOPS of tiny random IOs.
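The completion traces that follow look like standard blktrace/blkparse
output; the capture commands aren't shown, and it isn't stated whether
the trace was taken against /dev/vdc in the guest or against the host
DM volume. A minimal sketch of one way to collect equivalent data:

    # Capture only IO completion events on the device of interest while
    # the workload runs; blktrace writes per-CPU binary files named
    # vdc-trace.blktrace.<cpu>, which blkparse then decodes to text.
    sudo blktrace -d /dev/vdc -a complete -o vdc-trace &
    trace_pid=$!

    ./fsmark-50-test-btrfs.sh          # the workload under test

    sudo kill -INT "$trace_pid"
    wait "$trace_pid" 2>/dev/null
    blkparse -i vdc-trace | less       # same columns as the traces below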
Looking at it from the IO completion side of things:

253,32   4      238     5.936043934     0  C   W 103680 + 1024 [0]
253,32   4      239     5.936155917     0  C   W 2201728 + 1024 [0]
253,32   4      240     5.936172087     0  C   W 104704 + 1024 [0]
253,32   4      241     5.936283060     0  C   W 2202752 + 1024 [0]
253,32   4      242     5.936294881     0  C   W 105728 + 1024 [0]
253,32   4      243     5.936385182     0  C   W 106752 + 1024 [0]
253,32   4      244     5.936394695     0  C   W 107776 + 1024 [0]
253,32   4      245     5.936402936     0  C   W 108800 + 1024 [0]
253,32   4      246     5.936406721     0  C   W 109824 + 896 [0]
253,32   4      247     5.936414258     0  C   W 2203776 + 1024 [0]
253,32   4      248     5.936515302     0  C   W 2204800 + 1024 [0]
253,32   4      249     5.936606737     0  C   W 2205824 + 1024 [0]
253,32   4      250     5.936689345     0  C   W 2206848 + 1024 [0]

All nice and large, mostly sequential IO patterns. Fast forward to
where we've run out of memory:

253,32   3    59209    31.490788795     0  C  WS 1821992 + 16 [0]
253,32   3    59210    31.490790691     0  C  WS 1822024 + 24 [0]
253,32   3    59211    31.490792205     0  C  WS 1822056 + 16 [0]
253,32   3    59212    31.490793680     0  C  WS 1822080 + 8 [0]
253,32   3    59213    31.490794984     0  C  WS 1822096 + 32 [0]
253,32   3    59214    31.490796307     0  C  WS 1822136 + 8 [0]
253,32   3    59215    31.490798261     0  C  WS 1822152 + 16 [0]
253,32   3    59216    31.490799713     0  C  WS 3919120 + 8 [0]
253,32   3    59217    31.490831740     0  C  WS 3919144 + 16 [0]
253,32   3    59218    31.490835419     0  C  WS 3919176 + 24 [0]
253,32   3    59219    31.490838989     0  C  WS 3919208 + 16 [0]

You can see that there are lots of small IOs being completed, with
lots of tiny holes in between them. This is what causes the
fragmentation of the backing device image. Performance hasn't quite
tanked yet - that happens after the massive burst of IO when
reclaiming memory. btrfs goes back to nice IO patterns:

253,32   4   114006    40.036082347  6902  C   W 4268032 + 896 [0]
253,32   4   114007    40.036088989  6902  C   W 4268928 + 896 [0]
253,32   4   114008    40.036104027  6902  C   W 4269824 + 896 [0]
253,32   4   114009    40.036108753  6902  C   W 4270720 + 896 [0]
253,32   4   114010    40.036112097  6902  C   W 4271616 + 896 [0]
253,32   4   114011    40.036116985  6902  C   W 5316608 + 896 [0]
253,32   4   114012    40.036189985  6902  C   W 5317504 + 896 [0]
253,32   4   114013    40.036259904  6902  C   W 5318400 + 896 [0]

But because it has already fragmented the crap out of the underlying
file image, and thanks to the wonder of the kernel direct IO code
doing an individual allocation for every vector in the pwritev()
iovec (i.e. one allocation per 4k page), this further fragments the
underlying file as it fills small holes first. The result is that
btrfs is doing a couple of hundred IOPS, but the back end storage is
now doing 25,000 IOPS because of the fragmentation of the image
file.

Worth noting is that btrfs is filling the filesystem from block 0
upwards - punching the first 100GB out of the image file removes all
the fragmentation from the file (6m extents down to 21000) - all the
higher address space extents are from XFS....

So, BTRFS doesn't play at all well with sparse image files or
fine-grained thin provisioned devices, and the cause of the problem
is the IO behaviour in low memory situations.

Cheers,

Dave.

(*) The btrfs result when the underlying image file is not
fragmented is this:

FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            0      57842.6          6955654
     0      1600000            0      50669.6          7507264
     0      2400000            0      46375.2          7038246
     0      3200000            0      51564.7          7028544
     0      4000000            0      44751.0          7019479
     0      4800000            0      49647.7          7748393
     0      5600000            0      45121.4          6980789
     0      6400000            0      36758.9          8387095
     0      7200000            0      15014.6          8291624

Note that I'd only defragmented the first 6 million inode region, so
perf tanked at around that point as fragmentation started again.
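For reference, the hole punch described above (removing the first
100GB of the image file) can be done on the host with stock tools; a
minimal sketch using the image file name from the xfs_fsr output, not
necessarily the exact commands used here:

    # Count extents before, punch out the first 100GB of the image,
    # count again.  filefrag comes from e2fsprogs, xfs_io from xfsprogs;
    # the punched range is returned to the host filesystem as a hole.
    filefrag vm-100TB-sparse.img                   # prints "N extents found"
    xfs_io -c "fpunch 0 100g" vm-100TB-sparse.img
    filefrag vm-100TB-sparse.img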
Here's the equivalent XFS run:

FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            0     106989.2          6862546
     0      1600000            0      99506.5          7024546
     0      2400000            0      88726.5          8085128
     0      3200000            0      90616.9          7709196
     0      4000000            0      93900.0          7323644
     0      4800000            0      94869.2          7166322
     0      5600000            0      92693.4          7213337
     0      6400000            0      92217.6          7178681
     0      7200000            0      95983.9          7075248
     0      8000000            0      95096.8          7182689
     0      8800000            0      95350.1          7160214

which runs at about 500 IOPS and results in almost no underlying
device fragmentation at all. Hence BTRFS is running at roughly half
the speed of a debug XFS kernel on this workload on my setup.

I'd be remiss not to mention ext4 performance on this workload, too:

FSUse%        Count         Size    Files/sec     App Overhead
     5       800000            0      37948.7          5674131
     5      1600000            0      35918.5          5941488
     5      2400000            0      33313.1          6427143
     5      3200000            0      36491.2          6587327
     5      4000000            0      35426.2          6027680
     5      4800000            0      33323.9          6501011
     5      5600000            0      35292.6          6016546
     5      6400000            0      37851.4          6327824
     5      7200000            0      34384.9          5897006

Yeah, it sucks worse than btrfs when the underlying image is not
fragmented. However, ext4 is fragmenting the underlying device just
as badly as btrfs is - it's creating about 100k fragments per
million inodes allocated. The fragmentation is not affecting
performance, though, because there's a 1:1 ratio between ext4 IOs
and IOs to the physical device through the image file.

As it is, ext4 is sustaining about 6000 IOPS - an order of magnitude
more than XFS and the "good" BTRFS IO patterns - and is only
managing to use about 2 CPUs of the 8p in the system. The back end
storage is at about 50% utilisation, so it isn't the bottleneck -
there are other bottlenecks in ext4 that limit its performance under
these sorts of workloads.

IOWs, XFS runs this workload at about 2% storage utilisation and
750% CPU utilisation, btrfs at about 600% CPU utilisation, and ext4
at 50% storage and 200% CPU utilisation. This says a lot about the
inherent parallelism in the filesystem architectures...

FWIW, a comparison with the fsmark testing I did on this 8p/4GB RAM
VM and reported on 18 months ago at LCA:

- XFS is at roughly the same performance/efficiency point, but with
  added functionality
- btrfs is about 30% slower (on a non-fragmented device), consumes
  more CPU and has some interesting new warts, but it is definitely
  more stable as it is completing tests rather than hanging half way
  through
- ext4 performance has dropped by half for this 8-way workload....

-- 
Dave Chinner
david@fromorbit.com
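The IOPS and utilisation figures quoted in the comparison aren't shown
being measured; one common way to watch them on the host while a run
is in progress is iostat from sysstat - a sketch, not necessarily the
tool used above:

    # Extended per-device statistics, refreshed every 5 seconds.
    # r/s + w/s give the IOPS each backing device is actually doing,
    # and %util shows how busy it is.
    iostat -x 5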
Chris Mason
2013-May-04 11:20 UTC
Re: [3.9] parallel fsmark perf is real bad on sparse devices
Quoting Dave Chinner (2013-05-03 21:15:47)
> Hi folks,
> 
> It's that time again - I ran fsmark on btrfs and found performance
> was awful.
> 
> tl;dr: memory pressure causes random writeback of metadata ("bad"),
> fragmenting the underlying sparse storage. This causes a downward
> spiral as btrfs cycles through "good" IO patterns that get
> fragmented at the device level due to the "bad" IO patterns
> fragmenting the underlying sparse device.
> 

Really interesting Dave, thanks for all this analysis.

We're going to have a hard time matching xfs fragmentation just
because the files are zero size and we don't have the inode tables.
But I'll take a look at the metadata memory pressure based writeback;
it sounds like we need to push a bigger burst.

-chris
Dave Chinner
2013-May-05 01:45 UTC
Re: [3.9] parallel fsmark perf is real bad on sparse devices
On Sat, May 04, 2013 at 07:20:05AM -0400, Chris Mason wrote:
> Quoting Dave Chinner (2013-05-03 21:15:47)
> > Hi folks,
> > 
> > It's that time again - I ran fsmark on btrfs and found performance
> > was awful.
> > 
> > tl;dr: memory pressure causes random writeback of metadata ("bad"),
> > fragmenting the underlying sparse storage. This causes a downward
> > spiral as btrfs cycles through "good" IO patterns that get
> > fragmented at the device level due to the "bad" IO patterns
> > fragmenting the underlying sparse device.
> 
> Really interesting Dave, thanks for all this analysis.
> 
> We're going to have a hard time matching xfs fragmentation just
> because the files are zero size and we don't have the inode tables.
> But I'll take a look at the metadata memory pressure based writeback;
> it sounds like we need to push a bigger burst.

Yeah, I wouldn't expect it to behave like XFS does given all the
metadata writeback ordering optimisation XFS has, but the level of
fragmentation was a surprise. Fragmentation by itself isn't so much
of a problem - ext4 is just as bad as btrfs in terms of the amount
of image fragmentation, but it doesn't have the 100:1 IOPS explosion
in the backing device.

Run the test and have a look at the iowatcher movies - they are
quite instructive as they show the two separate phases that write
alternately over the same sections of the disk. A picture^Wmovie is
worth a thousand words ;)

FWIW, the main reason I thought it important enough to report is
that if the filesystem is being unfriendly to sparse files, then it
is almost certainly being unfriendly to the internal mapping tables
in modern SSDs....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
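iowatcher is Chris Mason's blktrace visualisation tool; the exact
invocation isn't given above, but rendering a movie from an existing
trace looks roughly like the following (flags recalled from the
iowatcher documentation - treat this as a sketch and check
iowatcher --help on your system):

    # Render an existing blktrace capture as a movie of IO offsets over
    # time.  iowatcher can also drive blktrace itself while running a
    # command, via -d <device> -p '<command>'.
    iowatcher -t vdc-trace --movie -o fsmark-btrfs.mp4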
Chris Mason
2013-May-05 14:56 UTC
Re: [BULK] Re: [3.9] parallel fsmark perf is real bad on sparse devices
Quoting Dave Chinner (2013-05-04 21:45:40)
> On Sat, May 04, 2013 at 07:20:05AM -0400, Chris Mason wrote:
> > Quoting Dave Chinner (2013-05-03 21:15:47)
> > > Hi folks,
> > > 
> > > It's that time again - I ran fsmark on btrfs and found performance
> > > was awful.
> > > 
> > > tl;dr: memory pressure causes random writeback of metadata ("bad"),
> > > fragmenting the underlying sparse storage. This causes a downward
> > > spiral as btrfs cycles through "good" IO patterns that get
> > > fragmented at the device level due to the "bad" IO patterns
> > > fragmenting the underlying sparse device.
> > 
> > Really interesting Dave, thanks for all this analysis.
> > 
> > We're going to have a hard time matching xfs fragmentation just
> > because the files are zero size and we don't have the inode tables.
> > But I'll take a look at the metadata memory pressure based writeback;
> > it sounds like we need to push a bigger burst.
> 
> Yeah, I wouldn't expect it to behave like XFS does given all the
> metadata writeback ordering optimisation XFS has, but the level of
> fragmentation was a surprise. Fragmentation by itself isn't so much
> of a problem - ext4 is just as bad as btrfs in terms of the amount
> of image fragmentation, but it doesn't have the 100:1 IOPS explosion
> in the backing device.

The frustrating part of fsmark is watching all those inodes and
dentries suck down our ram, while the FS gets slammed doing writes
on pages that we actually want to keep around.

> 
> Run the test and have a look at the iowatcher movies - they are
> quite instructive as they show the two separate phases that write
> alternately over the same sections of the disk. A picture^Wmovie is
> worth a thousand words ;)

;) Will do.

We already have a few checks to skip btree writeback if there isn't
much actually dirty. But that's a tricky knob to turn because
balance_dirty_pages gets angry when you ignore it. This is one of
those tests where keeping the metadata out of the page cache should
make life easier.

> 
> FWIW, the main reason I thought it important enough to report is
> that if the filesystem is being unfriendly to sparse files, then it
> is almost certainly being unfriendly to the internal mapping tables
> in modern SSDs....

Definitely. The other side of it is that memory pressure based
writeback means writing before we needed to, which means COWs that
we didn't need to do, which means more work for the allocator, which
also means more COWs that we didn't need to do. So, even without the
sparse backing store, we'll go faster if we tune this well.

-chris
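The inode and dentry memory consumption described above can be watched
directly while the test runs; a minimal sketch using /proc/slabinfo
(slab cache names vary by kernel version, so the grep pattern is only
an assumption):

    # Watch the slab caches that grow as fsmark creates millions of
    # zero-length files: the VFS dentry cache and the btrfs inode cache.
    watch -n 5 "grep -E '^(dentry|btrfs_inode) ' /proc/slabinfo"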