Dave Chinner
2013-May-04  01:15 UTC
[3.9] parallel fsmark perf is real bad on sparse devices
Hi folks,
It's that time again - I ran fsmark on btrfs and found performance
was awful.
tl;dr: memory pressure causes random writeback of metadata ("bad"),
fragmenting the underlying sparse storage. This causes a downward
spiral as btrfs cycles through "good" IO patterns that get
fragmented at the device level due to the "bad" IO patterns
fragmenting the underlying sparse device.
FYI, the storage hardware is a DM RAID0 stripe across 4 SSDs sitting
behind 512MB of BBWC, with an XFS filesystem on it. The only file on
that filesystem is the sparse 100TB image file backing the device, and
the VM accesses the image via virtio with cache=none.
i.e. the storage I'm working on this time is a thinly provisioned
100TB device fed to an 8p, 4GB RAM VM, and this script is then run:
$ cat fsmark-50-test-btrfs.sh 
#!/bin/bash
sudo umount /mnt/scratch > /dev/null 2>&1
sudo mkfs.btrfs /dev/vdc
sudo mount /dev/vdc /mnt/scratch
sudo chmod 777 /mnt/scratch
cd /home/dave/src/fs_mark-3.3/
time ./fs_mark  -D  10000  -S0  -n  100000  -s  0  -L  63 \
        -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
        -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
        -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
        -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
        | tee >(stats --trim-outliers | tail -1 1>&2)
sync
$
$ ./fsmark-50-test-btrfs.sh
WARNING! - Btrfs Btrfs v0.19 IS EXPERIMENTAL
WARNING! - see http://btrfs.wiki.kernel.org before using
fs created label (null) on /dev/vdc
        nodesize 4096 leafsize 4096 sectorsize 4096 size 100.00TB
Btrfs Btrfs v0.19
#  ./fs_mark  -D  10000  -S0  -n  100000  -s  0  -L  63  -d  /mnt/scratch/0  -d 
/mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d 
/mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
#       Version 3.3, 8 thread(s) starting at Fri May  3 17:08:46 2013
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000
subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24
random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per
write
#       App overhead is time in microseconds spent in the test not doing file
writing related system calls.
FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            0      53498.9          7898900
     0      1600000            0      11186.5          9409278
     0      2400000            0      17026.1          7907599
     0      3200000            0      25815.6          9749980
     0      4000000            0      11503.0          8556349
     0      4800000            0      43561.9          8295238
     0      5600000            0      17175.3          8304668
^C     0 800000-5600000(3.2e+06+/-1.1e+06)            0
11186.500000-53498.900000(23016.4+/-1.1e+04)
7898900-9749980(8.49463e+06+/-5e+05)
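(For anyone wanting to reproduce the setup: the backing file and guest
device boil down to something like the following. The path and the
exact qemu invocation are illustrative - they describe the
configuration above rather than being the literal commands I ran.)

# 100TB sparse image sitting on the host XFS filesystem
$ truncate -s 100T /vm-images/vm-100TB-sparse.img

# hand it to the guest as a virtio disk with cache=none, so the host
# side does O_DIRECT IO to the sparse image; it shows up as /dev/vdc
# in the guest
$ qemu-system-x86_64 -smp 8 -m 4096 \
        -drive file=/vm-images/vm-100TB-sparse.img,if=virtio,cache=none,format=raw \
        ...                             # rest of the VM config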
What I'm seeing is that the underlying image file is getting badly,
badly fragmented. This short test created approximately 8 million
extents in the image file in about 10 minutes of runtime. Running
xfs_fsr on the image file pointed this out:
# xfs_fsr -d -v vm-100TB-sparse.img
vm-100TB-sparse.img
vm-100TB-sparse.img extents=7971773 can_save=7926036 tmp=./.fsr6198
DEBUG: fsize=109951162777600 blsz_dio=16773120 d_min=512
d_max=2147483136 pgsz=4096
Temporary file has 46107 extents (7971773 in original)
extents before:7971773 after:46107      vm-100TB-sparse.img
#
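(If you just want to watch the extent count grow without actually
defragmenting anything, querying the extent map from the host gives
much the same picture - e.g.:)

# count the extents in the backing file via FIEMAP
$ filefrag vm-100TB-sparse.img

# or dump the layout extent by extent
$ xfs_bmap -v vm-100TB-sparse.img | less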
Most of the data written to the file is contiguous. This means that
btrfs is filling the filesystem in a contiguous manner, but its IO
is anything but contiguous. So, what's happening here?
It turns out that when the machine first runs out of free memory
(about 1.2m inodes in), btrfs goes from running a couple of hundred
nice large 512k IOs a second to an intense 10s-long burst of
10-15k IOPS of tiny random IOs. Looking at it from the IO completion
side of things:
253,32   4      238     5.936043934     0  C   W 103680 + 1024 [0]
253,32   4      239     5.936155917     0  C   W 2201728 + 1024 [0]
253,32   4      240     5.936172087     0  C   W 104704 + 1024 [0]
253,32   4      241     5.936283060     0  C   W 2202752 + 1024 [0]
253,32   4      242     5.936294881     0  C   W 105728 + 1024 [0]
253,32   4      243     5.936385182     0  C   W 106752 + 1024 [0]
253,32   4      244     5.936394695     0  C   W 107776 + 1024 [0]
253,32   4      245     5.936402936     0  C   W 108800 + 1024 [0]
253,32   4      246     5.936406721     0  C   W 109824 + 896 [0]
253,32   4      247     5.936414258     0  C   W 2203776 + 1024 [0]
253,32   4      248     5.936515302     0  C   W 2204800 + 1024 [0]
253,32   4      249     5.936606737     0  C   W 2205824 + 1024 [0]
253,32   4      250     5.936689345     0  C   W 2206848 + 1024 [0]
All nice and large, mostly sequential IO patterns. Fast forward to
where we've run out of memory:
253,32   3    59209    31.490788795     0  C  WS 1821992 + 16 [0]
253,32   3    59210    31.490790691     0  C  WS 1822024 + 24 [0]
253,32   3    59211    31.490792205     0  C  WS 1822056 + 16 [0]
253,32   3    59212    31.490793680     0  C  WS 1822080 + 8 [0]
253,32   3    59213    31.490794984     0  C  WS 1822096 + 32 [0]
253,32   3    59214    31.490796307     0  C  WS 1822136 + 8 [0]
253,32   3    59215    31.490798261     0  C  WS 1822152 + 16 [0]
253,32   3    59216    31.490799713     0  C  WS 3919120 + 8 [0]
253,32   3    59217    31.490831740     0  C  WS 3919144 + 16 [0]
253,32   3    59218    31.490835419     0  C  WS 3919176 + 24 [0]
253,32   3    59219    31.490838989     0  C  WS 3919208 + 16 [0]
You can see that there are lots of small IOs being completed, with
lots of tiny holes in between them. This is what causes the
fragmentation of the backing device image. Performance hasn't quite
tanked yet - that happens after the massive burst of IO from memory
reclaim, when btrfs goes back to nice IO patterns:
253,32   4   114006    40.036082347  6902  C   W 4268032 + 896 [0]
253,32   4   114007    40.036088989  6902  C   W 4268928 + 896 [0]
253,32   4   114008    40.036104027  6902  C   W 4269824 + 896 [0]
253,32   4   114009    40.036108753  6902  C   W 4270720 + 896 [0]
253,32   4   114010    40.036112097  6902  C   W 4271616 + 896 [0]
253,32   4   114011    40.036116985  6902  C   W 5316608 + 896 [0]
253,32   4   114012    40.036189985  6902  C   W 5317504 + 896 [0]
253,32   4   114013    40.036259904  6902  C   W 5318400 + 896 [0]
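(For the record, these excerpts are blkparse completion events. If you
want to capture your own, it's roughly the following - the trace name
is arbitrary, and you can point blktrace at the guest block device or
at the host's backing device depending on which side of the IO
amplification you want to watch:)

$ blktrace -d /dev/vdc -o fsmark-trace &
  ... run fs_mark, then kill blktrace ...
$ blkparse -i fsmark-trace | grep ' C ' | less

(iowatcher can chew on the same blktrace data to produce the graphs
and movies mentioned further down the thread.)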
But because btrfs has already fragmented the crap out of the
underlying image file, and thanks to the wonder of the kernel direct
IO path doing an individual allocation for every vector in the
pwritev() iovec (i.e. one allocation per 4k page), this further
fragments the underlying file as the allocations fill the small holes
first. The result is that btrfs is doing a couple of hundred IOPS,
but the back end storage is now doing 25,000 IOPS because of the
fragmentation of the image file.
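(A crude way to see that amplification is to run iostat on both sides
at once and compare the write IOPS - substitute whatever your guest
disk and host DM stripe devices are actually called:)

# in the guest, the device the filesystem under test sits on
$ iostat -x vdc 5

# on the host, the DM stripe the image file lives on
$ iostat -x dm-0 5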
Worth noting is that btrfs is filling the filesystem from block 0
upwards - punching the first 100GB out of the image file removes all
the fragmentation from the file (6m extents down to 21000) - all the
higher address space extents are from XFS....
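(For reference, punching that region out of the image is a one-liner
on the host; whether you use xfs_io or fallocate(1) doesn't matter,
both end up calling fallocate() with FALLOC_FL_PUNCH_HOLE:)

# punch the btrfs-written region (the first 100GB) out of the image,
# leaving the file size alone
$ xfs_io -c "fpunch 0 100g" vm-100TB-sparse.img

# or, equivalently
$ fallocate --punch-hole --offset 0 --length 100G vm-100TB-sparse.img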
So, BTRFS doesn't play at all well with sparse image files or
fine-grained thin provisioned devices, and the cause of the problem
is the IO behaviour in low memory situations.
Cheers,
Dave.
(*) The btrfs result when the underlying image file is not
fragmented is this:
FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            0      57842.6          6955654
     0      1600000            0      50669.6          7507264
     0      2400000            0      46375.2          7038246
     0      3200000            0      51564.7          7028544
     0      4000000            0      44751.0          7019479
     0      4800000            0      49647.7          7748393
     0      5600000            0      45121.4          6980789
     0      6400000            0      36758.9          8387095
     0      7200000            0      15014.6          8291624
Note that I'd only defragmented the first 6 million inode region,
so perf tanked at around that point as fragmentation started again.
Here's the equivalent XFS run:
FSUse%        Count         Size    Files/sec     App Overhead
     0       800000            0     106989.2          6862546
     0      1600000            0      99506.5          7024546
     0      2400000            0      88726.5          8085128
     0      3200000            0      90616.9          7709196
     0      4000000            0      93900.0          7323644
     0      4800000            0      94869.2          7166322
     0      5600000            0      92693.4          7213337
     0      6400000            0      92217.6          7178681
     0      7200000            0      95983.9          7075248
     0      8000000            0      95096.8          7182689
     0      8800000            0      95350.1          7160214
which runs at about 500 IOPS and results in almost no underlying
device fragmentation at all. Hence BTRFS is running at roughly half
the speed of a debug XFS kernel on this workload on my setup.
I'd be remiss not to mention ext4 performance on this workload, too:
FSUse%        Count         Size    Files/sec     App Overhead
     5       800000            0      37948.7          5674131
     5      1600000            0      35918.5          5941488
     5      2400000            0      33313.1          6427143
     5      3200000            0      36491.2          6587327
     5      4000000            0      35426.2          6027680
     5      4800000            0      33323.9          6501011
     5      5600000            0      35292.6          6016546
     5      6400000            0      37851.4          6327824
     5      7200000            0      34384.9          5897006
Yeah, it sucks worse than btrfs when the underlying image is not
fragmented. ext4 is fragmenting the underlying device just as badly
as btrfs is - it's creating about 100k fragments per million inodes
allocated. However, the fragmentation is not affecting performance,
as there's a 1:1 ratio between ext4 IOs and IOs to the physical
device through the image file.
As it is, ext4 is sustaining about 6000 IOPS - an order of
magnitude more than XFS and the "good" BTRFS IO patterns - while only
managing to use about 2 CPUs of the 8p in the system. The back end
storage is at about 50% utilisation, so it isn't the bottleneck -
there are other bottlenecks in ext4 that limit its performance under
these sorts of workloads.
IOWs, XFS runs this workload at about 2% storage utilisation and
750% CPU utilisation, btrfs at about 600% CPU utilisation and ext4
at 50% storage and 200% CPU utilisation. This says a lot about the
inherent parallelism in the filesystem architectures...
FWIW, a comparison with the fsmark testing I did on this 8p/4GB RAM
VM that I reported on at LCA 18 months ago:
	- XFS is at roughly the same performance/efficiency point,
	  but with added functionality
	- btrfs is about 30% slower (on a non-fragmented device),
	  consumes more CPU and has some interesting new warts, but
	  it is definitely more stable as it is completing tests
	  rather than hanging half way through.
	- ext4 performance has dropped by half for this 8-way
	  workload....
-- 
Dave Chinner
david@fromorbit.com
Chris Mason
2013-May-04  11:20 UTC
Re: [3.9] parallel fsmark perf is real bad on sparse devices
Quoting Dave Chinner (2013-05-03 21:15:47)
> Hi folks,
>
> It's that time again - I ran fsmark on btrfs and found performance
> was awful.
>
> tl;dr: memory pressure causes random writeback of metadata ("bad"),
> fragmenting the underlying sparse storage. This causes a downward
> spiral as btrfs cycles through "good" IO patterns that get
> fragmented at the device level due to the "bad" IO patterns
> fragmenting the underlying sparse device.

Really interesting Dave, thanks for all this analysis.

We're going to have a hard time matching xfs fragmentation just
because the files are zero size and we don't have the inode tables.
But, I'll take a look at the metadata memory pressure based writeback,
sounds like we need to push a bigger burst.

-chris
Dave Chinner
2013-May-05  01:45 UTC
Re: [3.9] parallel fsmark perf is real bad on sparse devices
On Sat, May 04, 2013 at 07:20:05AM -0400, Chris Mason wrote:
> Quoting Dave Chinner (2013-05-03 21:15:47)
> > Hi folks,
> >
> > It's that time again - I ran fsmark on btrfs and found performance
> > was awful.
> >
> > tl;dr: memory pressure causes random writeback of metadata ("bad"),
> > fragmenting the underlying sparse storage. This causes a downward
> > spiral as btrfs cycles through "good" IO patterns that get
> > fragmented at the device level due to the "bad" IO patterns
> > fragmenting the underlying sparse device.
>
> Really interesting Dave, thanks for all this analysis.
>
> We're going to have a hard time matching xfs fragmentation just
> because the files are zero size and we don't have the inode tables.
> But, I'll take a look at the metadata memory pressure based
> writeback, sounds like we need to push a bigger burst.

Yeah, I wouldn't expect it to behave like XFS does given all the
metadata writeback ordering optimisation XFS has, but the level of
fragmentation was a surprise. Fragmentation by itself isn't so much
of a problem - ext4 is just as bad as btrfs in terms of the amount of
image fragmentation, but it doesn't have the 100:1 IOPS explosion in
the backing device.

Run the test and have a look at the iowatcher movies - they are quite
instructive, as they show the two separate phases that write
alternately over the same sections of the disk. A picture^Wmovie is
worth a thousand words ;)

FWIW, the main reason I thought it important enough to report is that
if the filesystem is being unfriendly to sparse files, then it is
almost certainly being unfriendly to the internal mapping tables in
modern SSDs....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
Chris Mason
2013-May-05  14:56 UTC
Re: [BULK] Re: [3.9] parallel fsmark perf is real bad on sparse devices
Quoting Dave Chinner (2013-05-04 21:45:40)
> On Sat, May 04, 2013 at 07:20:05AM -0400, Chris Mason wrote:
> > Quoting Dave Chinner (2013-05-03 21:15:47)
> > > Hi folks,
> > >
> > > It's that time again - I ran fsmark on btrfs and found performance
> > > was awful.
> > >
> > > tl;dr: memory pressure causes random writeback of metadata ("bad"),
> > > fragmenting the underlying sparse storage. This causes a downward
> > > spiral as btrfs cycles through "good" IO patterns that get
> > > fragmented at the device level due to the "bad" IO patterns
> > > fragmenting the underlying sparse device.
> >
> > Really interesting Dave, thanks for all this analysis.
> >
> > We're going to have a hard time matching xfs fragmentation just
> > because the files are zero size and we don't have the inode tables.
> > But, I'll take a look at the metadata memory pressure based
> > writeback, sounds like we need to push a bigger burst.
>
> Yeah, I wouldn't expect it to behave like XFS does given all the
> metadata writeback ordering optimisation XFS has, but the level of
> fragmentation was a surprise. Fragmentation by itself isn't so much
> of a problem - ext4 is just as bad as btrfs in terms of the amount of
> image fragmentation, but it doesn't have the 100:1 IOPS explosion in
> the backing device.

The frustrating part of fsmark is watching all those inodes and
dentries suck down our ram, while the FS gets slammed doing writes on
pages that we actually want to keep around.

> Run the test and have a look at the iowatcher movies - they are quite
> instructive, as they show the two separate phases that write
> alternately over the same sections of the disk. A picture^Wmovie is
> worth a thousand words ;)

;) Will do.

We already have a few checks to skip btree writeback if there isn't
much actually dirty. But that's a tricky knob to turn because
balance_dirty_pages gets angry when you ignore it. This is one of
those tests where keeping the metadata out of the page cache should
make life easier.

> FWIW, the main reason I thought it important enough to report is that
> if the filesystem is being unfriendly to sparse files, then it is
> almost certainly being unfriendly to the internal mapping tables in
> modern SSDs....

Definitely. The other side of it is that memory pressure based
writeback means writing before we needed to, which means COWs that we
didn't need to do, which means more work for the allocator, which also
means more COWs that we didn't need to do.

So, even without the sparse backing store, we'll go faster if we tune
this well.

-chris