Hi,
We're interested in getting the highest possible read performance on a
server. To that end, we have a high-end server with multiple solid state
disks (SSDs). Since BtrFS outperformed other Linux filesystems, we chose
that. Unfortunately, there seems to be an upper bound on the read
performance of BtrFS of roughly 1 GiByte/s. Compare the
following results for BtrFS on Ubuntu versus ZFS on FreeBSD:
             ZFS             BtrFS
1 SSD      256 MiByte/s     256 MiByte/s
2 SSDs     505 MiByte/s     504 MiByte/s
3 SSDs     736 MiByte/s     756 MiByte/s
4 SSDs     952 MiByte/s     916 MiByte/s
5 SSDs    1226 MiByte/s     986 MiByte/s
6 SSDs    1450 MiByte/s     978 MiByte/s
8 SSDs    1653 MiByte/s     932 MiByte/s
16 SSDs   2750 MiByte/s     919 MiByte/s
The results were originally measured on a Dell PowerEdge T610, but were
repeated using a SuperMicro machine with 4 independent SAS+SATA
controllers. We made sure that the PCI-e slots were not the bottleneck.
The above results were for Ubuntu 10.04.1 server, with BtrFS v0.19,
although earlier tests with Ubuntu 9.10 showed the same results.
Apparently, the limitation is not in the hardware (the same hardware
with ZFS did scale nearly linearly). We also tested hardware RAID-0,
software RAID-0 (md), and using the btrfs built-in software RAID-0, but
the differences were small (<10%) (md-based software RAID was marginally
slower on Linux; RAIDZ was marginally faster on FreeBSD). So we presume
that the bottleneck is somewhere in the BtrFS (or kernel) software.
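
As a rough check of the scaling claim, the table can be turned into a "fraction of ideal linear scaling" per configuration. This is only a sketch: the figures are copied from the table in this message, and the names `RESULTS`, `efficiency` and `scaling_table` are made up for illustration.

```python
# Read throughput in MiByte/s, copied from the table in this message.
RESULTS = {
    # number of SSDs: (ZFS, BtrFS)
    1: (256, 256),
    2: (505, 504),
    3: (736, 756),
    4: (952, 916),
    5: (1226, 986),
    6: (1450, 978),
    8: (1653, 932),
    16: (2750, 919),
}


def efficiency(throughput, n_ssds, single):
    """Fraction of ideal linear scaling (n_ssds times the single-disk speed)."""
    return throughput / (n_ssds * single)


def scaling_table(results):
    """Return (n_ssds, zfs_efficiency, btrfs_efficiency) rows, sorted by n_ssds."""
    zfs_1, btrfs_1 = results[1]
    rows = []
    for n, (zfs, btrfs) in sorted(results.items()):
        rows.append((n, efficiency(zfs, n, zfs_1), efficiency(btrfs, n, btrfs_1)))
    return rows


if __name__ == "__main__":
    for n, zfs_eff, btrfs_eff in scaling_table(RESULTS):
        print(f"{n:2d} SSDs  ZFS {zfs_eff:6.1%}  BtrFS {btrfs_eff:6.1%}")
```

With 16 SSDs this gives roughly 67% of ideal for ZFS and roughly 22% for BtrFS, which is the gap the question is about.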
Are there suggestions how to tune the read performance? We would like to
scale this up to 32 solid state disks. The -o ssd option did not improve
overall performance, although it did give more stable results (less
fluctuation in repeated tests).
Note that the write speeds did scale fine. In the scenario with 16 solid
state disks, the write speed is 1596 MiByte/s (1.7 times as fast as the
read speed! Suffice to say that for a single disk, write is much slower
than read...).
Here are the exact settings:
~# mkfs.btrfs -d raid0 /dev/sdd /dev/sde /dev/sdf /dev/sdg \
     /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm \
     /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds
nodesize 4096 leafsize 4096 sectorsize 4096 size 2.33TB
Btrfs Btrfs v0.19
~# mount -t btrfs -o ssd /dev/sdd /mnt/ssd6
~# iozone -s 32G -r 1024 -i 0 -i 1 -w -f /mnt/ssd6/iozone.tmp
              KB  reclen   write rewrite    read    reread
        33554432    1024 1628475 1640349   943416   951135
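
For readers comparing the iozone line with the MiByte/s figures in the text, here is a small conversion sketch. It assumes that iozone's "KB" means 1024 bytes; the helper name `kb_per_s_to_mib_per_s` is hypothetical.

```python
def kb_per_s_to_mib_per_s(kb_per_s):
    """Convert KB/s (assumed to be 1024-byte units) to MiByte/s."""
    return kb_per_s / 1024


# Values copied from the iozone output line above.
IOZONE = {"write": 1628475, "rewrite": 1640349, "read": 943416, "reread": 951135}

if __name__ == "__main__":
    for name, value in IOZONE.items():
        print(f"{name:8s} {kb_per_s_to_mib_per_s(value):7.1f} MiByte/s")
```

This single run works out to about 921 MiByte/s read and 1590 MiByte/s write, close to but not identical to the 919 and 1596 MiByte/s quoted in the text; those may be averages over repeated runs, which is an assumption on my part.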
Regards,
Freek Dijkstra
SARA high performance- and computing
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Thu, Aug 05, 2010 at 04:05:33PM +0200, Freek Dijkstra wrote:
> Hi,
>
> We're interested in getting the highest possible read performance on a
> server. To that end, we have a high-end server with multiple solid state
> disks (SSDs). Since BtrFS outperformed other Linux filesystem, we choose
> that. Unfortunately, there seems to be an upper boundary in the
> performance of BtrFS of roughly 1 GiByte/s read speed. Compare the
> following results with either BTRFS on Ubuntu versus ZFS on FreeBSD:

Really cool, thanks for posting this.

>
>              ZFS             BtrFS
> 1 SSD      256 MiByte/s     256 MiByte/s
> 2 SSDs     505 MiByte/s     504 MiByte/s
> 3 SSDs     736 MiByte/s     756 MiByte/s
> 4 SSDs     952 MiByte/s     916 MiByte/s
> 5 SSDs    1226 MiByte/s     986 MiByte/s
> 6 SSDs    1450 MiByte/s     978 MiByte/s
> 8 SSDs    1653 MiByte/s     932 MiByte/s
> 16 SSDs   2750 MiByte/s     919 MiByte/s
>
> The results were originally measured on a Dell PowerEdge T610, but were
> repeated using a SuperMicro machine with 4 independent SAS+SATA
> controllers. We made sure that the PCI-e slots where not the bottleneck.
> The above results were for Ubuntu 10.04.1 server, with BtrFS v0.19,
> although earlier tests with Ubuntu 9.10 showed the same results.

Which kernels are those?

Basically we have two different things to tune. First the block layer
and then btrfs.

Can I ask you to do a few experiments? First grab fio:

http://brick.kernel.dk/snaps/fio-git-latest.tar.gz

And then we need to setup a fio job file that hammers on all the ssds at
once. I'd have it use aio/dio and talk directly to the drives. I'd do
something like this for the fio job file, but Jens Axboe is cc'd and he
might make another suggestion on the job file. I'd do something like
this in a file named ssd.fio

[global]
size=32g
direct=1
iodepth=8
bs=20m
rw=read

[f1]
filename=/dev/sdd

[f2]
filename=/dev/sde

[f3]
filename=/dev/sdf

repeat for all the drives, then run fio ssd.fio

fio should be able to push these devices up to the line speed. If it
doesn't I would suggest changing elevators (deadline, cfq, noop) and
bumping the max request size to the max supported by the device.

When we have a config that does so, we can tune the btrfs side of things
as well.

The btrfs job file would look something like this:

[global]
size=32g
direct=1
iodepth=8
bs=20m
rw=read

[btrfs]
directory=/btrfs_mount_point
# experiment with numjobs
numjobs=16

My first guess is just that your IOs are not large enough w/btrfs. The
iozone command below is doing buffered reads, so our performance is
going to be limited by the kernel readahead buffer size.

If you use a much larger IO size (the fio job above reads in 20M chunks)
and aio/dio instead, you can have more control over how the IO goes down
to the device.

-chris
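
The "repeat for all the drives" step for ssd.fio can be scripted rather than typed by hand. The sketch below writes out the same [global] options and one [fN] section per device; the device range /dev/sdd to /dev/sds is taken from the mkfs.btrfs command earlier in the thread, and the helper names `device_names` and `render_job_file` are hypothetical.

```python
import string

# [global] options copied from the ssd.fio example in this message.
GLOBAL_SECTION = {
    "size": "32g",
    "direct": "1",
    "iodepth": "8",
    "bs": "20m",
    "rw": "read",
}


def device_names(first="d", last="s"):
    """Return /dev/sdX paths for the letters first..last, inclusive."""
    letters = string.ascii_lowercase
    start, stop = letters.index(first), letters.index(last)
    return [f"/dev/sd{c}" for c in letters[start:stop + 1]]


def render_job_file(devices, global_opts=GLOBAL_SECTION):
    """Render a fio job file with one [fN] section per raw device."""
    lines = ["[global]"]
    lines += [f"{key}={value}" for key, value in global_opts.items()]
    for i, dev in enumerate(devices, start=1):
        lines += ["", f"[f{i}]", f"filename={dev}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Redirect the output to ssd.fio, then run: fio ssd.fio
    print(render_job_file(device_names()), end="")
```

The job only reads (rw=read), but it is pointed at raw devices, so the device list is worth double-checking before running it on another machine.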
On 5 August 2010 15:05, Freek Dijkstra <Freek.Dijkstra@sara.nl> wrote:
> Hi,
>
> We're interested in getting the highest possible read performance on a
> server. To that end, we have a high-end server with multiple solid state
> disks (SSDs). Since BtrFS outperformed other Linux filesystem, we choose
> that. Unfortunately, there seems to be an upper boundary in the
> performance of BtrFS of roughly 1 GiByte/s read speed. Compare the
> following results with either BTRFS on Ubuntu versus ZFS on FreeBSD:
>
>              ZFS             BtrFS
> 1 SSD      256 MiByte/s     256 MiByte/s
> 2 SSDs     505 MiByte/s     504 MiByte/s
> 3 SSDs     736 MiByte/s     756 MiByte/s
> 4 SSDs     952 MiByte/s     916 MiByte/s
> 5 SSDs    1226 MiByte/s     986 MiByte/s
> 6 SSDs    1450 MiByte/s     978 MiByte/s
> 8 SSDs    1653 MiByte/s     932 MiByte/s
> 16 SSDs   2750 MiByte/s     919 MiByte/s
>
> The results were originally measured on a Dell PowerEdge T610, but were
> repeated using a SuperMicro machine with 4 independent SAS+SATA
> controllers. We made sure that the PCI-e slots where not the bottleneck.
> The above results were for Ubuntu 10.04.1 server, with BtrFS v0.19,
> although earlier tests with Ubuntu 9.10 showed the same results.
>
> Apparently, the limitation is not in the hardware (the same hardware
> with ZFS did scale near linear). We also tested both hardware RAID-0,
> software RAID-0 (md), and using the btrfs built-in software RAID-0, but
> the differences were small (<10%) (md-based software RAID was marginally
> slower on Linux; RAIDZ was marginally faster on FreeBSD). So we presume
> that the bottleneck is somewhere in the BtrFS (or kernel) software.
>
> Are there suggestions how to tune the read performance? We like to scale
> this up to 32 solid state disks. The -o ssd option did not improve
> overall performance, although it did gave more stable results (less
> fluctuation in repeated tests).
>
> Note that the write speeds did scale fine. In the scenario with 16 solid
> state disks, the write speed is 1596 MiByte/s (1.7 times as fast as the
> read speed! Suffice to say that for a single disk, write is much slower
> than read...).
>
> Here are the exact settings:
> ~# mkfs.btrfs -d raid0 /dev/sdd /dev/sde /dev/sdf /dev/sdg \
>      /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm \
>      /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds
> nodesize 4096 leafsize 4096 sectorsize 4096 size 2.33TB
> Btrfs Btrfs v0.19
> ~# mount -t btrfs -o ssd /dev/sdd /mnt/ssd6
> ~# iozone -s 32G -r 1024 -i 0 -i 1 -w -f /mnt/ssd6/iozone.tmp
>               KB  reclen   write rewrite    read    reread
>         33554432    1024 1628475 1640349   943416   951135

Perhaps create a new filesystem and mount with 'nodatasum' - existing
extents which were previously created will be checked, so need to start
fresh.

Daniel
--
Daniel J Blueman
Mathieu Chouquet-Stringer
2010-Aug-05  16:21 UTC
Re: Poor read performance on high-end server
Hello,

Freek.Dijkstra@sara.nl (Freek Dijkstra) writes:
> [...]
>
> Here are the exact settings:
> ~# mkfs.btrfs -d raid0 /dev/sdd /dev/sde /dev/sdf /dev/sdg \
>      /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm \
>      /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds
> nodesize 4096 leafsize 4096 sectorsize 4096 size 2.33TB
> Btrfs Btrfs v0.19

Don't you need to stripe metadata too (with -m raid0)? Or you may be
limited by your metadata drive?

--
Mathieu Chouquet-Stringer                            mchouque@free.fr
            The sun itself sees not till heaven clears.
                     -- William Shakespeare --
Chris, Daniel and Mathieu,

Thanks for your constructive feedback!

> On Thu, Aug 05, 2010 at 04:05:33PM +0200, Freek Dijkstra wrote:
>>              ZFS             BtrFS
>> 1 SSD      256 MiByte/s     256 MiByte/s
>> 2 SSDs     505 MiByte/s     504 MiByte/s
>> 3 SSDs     736 MiByte/s     756 MiByte/s
>> 4 SSDs     952 MiByte/s     916 MiByte/s
>> 5 SSDs    1226 MiByte/s     986 MiByte/s
>> 6 SSDs    1450 MiByte/s     978 MiByte/s
>> 8 SSDs    1653 MiByte/s     932 MiByte/s
>> 16 SSDs   2750 MiByte/s     919 MiByte/s
>>
[...]
>> The above results were for Ubuntu 10.04.1 server, with BtrFS v0.19,
>
> Which kernels are those?

For BtrFS: Linux 2.6.32-21-server #32-Ubuntu SMP x86_64 GNU/Linux
For ZFS: FreeBSD 8.1-RELEASE (GENERIC)

(Note that we can currently not upgrade easily due to binary drivers for
the SAS+SATA controllers :(. I'd be happy to push the vendor though, if
you think it makes a difference.)


Daniel J Blueman wrote:

> Perhaps create a new filesystem and mount with 'nodatasum'

I get an improvement: 919 MiByte/s just became 1580 MiByte/s. Not as
fast as it can, but most certainly an improvement.

> existing extents which were previously created will be checked, so
> need to start fresh.

Indeed, also the other way around. I created two test files, while
mounted with and without the -o nodatasum option:
write w/o nodatasum; read w/o nodatasum:  919 ±  43 MiByte/s
write w/o nodatasum; read w/  nodatasum:  922 ±  72 MiByte/s
write w/  nodatasum; read w/o nodatasum: 1082 ±  46 MiByte/s
write w/  nodatasum; read w/  nodatasum: 1586 ± 126 MiByte/s

So even if I remount the disk in the normal way, and read a file created
without checksums, I still get a small improvement :)

(PS: the above tests were repeated 4 times, the last even 8 times. As
you can see from the standard deviation, the results are not always very
accurate. The cause is unknown; CPU load is low.)


Chris Mason wrote:

> Basically we have two different things to tune. First the block layer
> and then btrfs.

> And then we need to setup a fio job file that hammers on all the ssds at
> once. I'd have it use adio/dio and talk directly to the drives.
>
> [global]
> size=32g
> direct=1
> iodepth=8
> bs=20m
> rw=read
>
> [f1]
> filename=/dev/sdd
> [f2]
> filename=/dev/sde
> [f3]
> filename=/dev/sdf
[...]
> [f16]
> filename=/dev/sds

Thanks. First one disk:

> f1: (groupid=0, jobs=1): err= 0: pid=6273
>   read : io=32780MB, bw=260964KB/s, iops=12, runt=128626msec
>     clat (usec): min=74940, max=80721, avg=78449.61, stdev=923.24
>     bw (KB/s) : min=240469, max=269981, per=100.10%, avg=261214.77, stdev=2765.91
>   cpu : usr=0.01%, sys=2.69%, ctx=1747, majf=0, minf=5153
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w: total=1639/0, short=0/0
>
>      lat (msec): 100=100.00%
>
> Run status group 0 (all jobs):
>   READ: io=32780MB, aggrb=260963KB/s, minb=267226KB/s, maxb=267226KB/s, mint=128626msec, maxt=128626msec
>
> Disk stats (read/write):
>   sdd: ios=261901/0, merge=0/0, ticks=10135270/0, in_queue=10136460, util=99.30%

So 255 MiByte/s.
Out of curiosity, what is the distinction between the reported figures
of 260964 kiB/s, 261214.77 kiB/s, 267226 kiB/s and 260963 kiB/s?
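
The single-disk figures can be partly reproduced from the totals in the fio report. The sketch below divides the total I/O by the runtime and converts units. It is a reading of the numbers only, not a statement about how fio computes them internally; the function names are hypothetical, and treating fio's "MB" as 1024 KB is an assumption.

```python
def aggregate_bw_kib_s(io_mb, runtime_msec):
    """Total I/O divided by runtime, in units of 1024 bytes per second."""
    return io_mb * 1024 * 1000 / runtime_msec


def kib_s_to_mib_s(kib_s):
    """Convert 1024-byte units per second to MiByte/s."""
    return kib_s / 1024


def kib_s_to_kb_s(kib_s):
    """Convert 1024-byte units per second to 1000-byte units per second."""
    return kib_s * 1024 / 1000


if __name__ == "__main__":
    # io=32780MB and runt=128626msec are taken from the report above.
    bw = aggregate_bw_kib_s(io_mb=32780, runtime_msec=128626)
    print(f"{bw:.2f} KiB/s")
    print(f"{kib_s_to_mib_s(bw):.1f} MiByte/s")
    print(f"{kib_s_to_kb_s(bw):.2f} kB/s (1000-byte units)")
```

The first value truncates to 260963, matching aggrb, and it corresponds to about 255 MiByte/s. The same rate in 1000-byte units truncates to 267226, which matches minb/maxb. That suggests, as an inference from the arithmetic only, that this fio version printed minb/maxb in 1000-byte units.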


Now 16 disks (abbreviated):

> ~/fio# ./fio ssd.fio
> Starting 16 processes
> f1: (groupid=0, jobs=1): err= 0: pid=4756
>   read : io=32780MB, bw=212987KB/s, iops=10, runt=157600msec
>     clat (msec): min=75, max=138, avg=96.15, stdev= 4.47
>      lat (msec): min=75, max=138, avg=96.15, stdev= 4.47
>     bw (KB/s) : min=153121, max=268968, per=6.31%, avg=213181.15, stdev=9052.26
>   cpu : usr=0.00%, sys=1.71%, ctx=2737, majf=0, minf=5153
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w: total=1639/0, short=0/0
>
>      lat (msec): 100=97.99%, 250=2.01%

[..similar for f2 to f16..]

> f1:  read : io=32780MB, bw=212987KB/s, iops=10, runt=157600msec
>      bw (KB/s) : min=153121, max=268968, per=6.31%, avg=213181.15, stdev=9052.26
> f2:  read : io=32780MB, bw=213873KB/s, iops=10, runt=156947msec
>      bw (KB/s) : min=151143, max=251508, per=6.33%, avg=213987.34, stdev=8958.86
> f3:  read : io=32780MB, bw=214613KB/s, iops=10, runt=156406msec
>      bw (KB/s) : min=149216, max=219037, per=6.35%, avg=214779.89, stdev=9332.99
> f4:  read : io=32780MB, bw=214388KB/s, iops=10, runt=156570msec
>      bw (KB/s) : min=148675, max=226298, per=6.35%, avg=214576.51, stdev=8985.03
> f5:  read : io=32780MB, bw=213848KB/s, iops=10, runt=156965msec
>      bw (KB/s) : min=144479, max=241414, per=6.33%, avg=213935.81, stdev=10023.68
> f6:  read : io=32780MB, bw=213514KB/s, iops=10, runt=157211msec
>      bw (KB/s) : min=141730, max=264990, per=6.32%, avg=213656.75, stdev=10871.71
> f7:  read : io=32780MB, bw=213431KB/s, iops=10, runt=157272msec
>      bw (KB/s) : min=148137, max=254635, per=6.32%, avg=213493.12, stdev=9319.08
> f8:  read : io=32780MB, bw=213099KB/s, iops=10, runt=157517msec
>      bw (KB/s) : min=143467, max=267962, per=6.31%, avg=213267.60, stdev=11224.35
> f9:  read : io=32780MB, bw=211254KB/s, iops=10, runt=158893msec
>      bw (KB/s) : min=149489, max=267962, per=6.25%, avg=211257.05, stdev=9370.64
> f10: read : io=32780MB, bw=212251KB/s, iops=10, runt=158146msec
>      bw (KB/s) : min=150865, max=225882, per=6.28%, avg=212300.50, stdev=8431.06
> f11: read : io=32780MB, bw=212988KB/s, iops=10, runt=157599msec
>      bw (KB/s) : min=149489, max=221007, per=6.31%, avg=213123.72, stdev=9569.27
> f12: read : io=32780MB, bw=212788KB/s, iops=10, runt=157747msec
>      bw (KB/s) : min=154274, max=218647, per=6.30%, avg=212957.41, stdev=8233.52
> f13: read : io=32780MB, bw=212315KB/s, iops=10, runt=158099msec
>      bw (KB/s) : min=153696, max=256000, per=6.29%, avg=212482.68, stdev=9203.34
> f14: read : io=32780MB, bw=212033KB/s, iops=10, runt=158309msec
>      bw (KB/s) : min=150588, max=267962, per=6.28%, avg=212198.76, stdev=9572.31
> f15: read : io=32780MB, bw=211720KB/s, iops=10, runt=158543msec
>      bw (KB/s) : min=146024, max=268968, per=6.27%, avg=211846.40, stdev=10341.58
> f16: read : io=32780MB, bw=211637KB/s, iops=10, runt=158605msec
>      bw (KB/s) : min=148945, max=261605, per=6.26%, avg=211618.40, stdev=9240.64
>
> Run status group 0 (all jobs):
>   READ: io=524480MB, aggrb=3301MB/s, minb=216323KB/s, maxb=219763KB/s, mint=156406msec, maxt=158893msec
>
> Disk stats (read/write):
>   sdd: ios=261902/0, merge=0/0, ticks=12531810/0, in_queue=12532910, util=99.46%
>   sde: ios=262221/0, merge=0/0, ticks=12494200/0, in_queue=12495300, util=99.50%
>   sdf: ios=261867/0, merge=0/0, ticks=12427000/0, in_queue=12430530, util=99.47%
>   sdg: ios=261983/0, merge=0/0, ticks=12462320/0, in_queue=12466060, util=99.62%
>   sdh: ios=262184/0, merge=0/0, ticks=12487350/0, in_queue=12489960, util=99.49%
>   sdi: ios=262193/0, merge=0/0, ticks=12524400/0, in_queue=12526580, util=99.47%
>   sdj: ios=262044/0, merge=0/0, ticks=12511850/0, in_queue=12513840, util=99.50%
>   sdk: ios=262055/0, merge=0/0, ticks=12526560/0, in_queue=12527890, util=99.50%
>   sdl: ios=261789/0, merge=0/0, ticks=12609230/0, in_queue=12610400, util=99.54%
>   sdm: ios=261787/0, merge=0/0, ticks=12579000/0, in_queue=12581050, util=99.44%
>   sdn: ios=261941/0, merge=0/0, ticks=12524530/0, in_queue=12525790, util=99.48%
>   sdo: ios=262100/0, merge=0/0, ticks=12554650/0, in_queue=12555820, util=99.58%
>   sdp: ios=261877/0, merge=0/0, ticks=12572220/0, in_queue=12574610, util=99.54%
>   sdq: ios=261956/0, merge=0/0, ticks=12601480/0, in_queue=12603770, util=99.62%
>   sdr: ios=261991/0, merge=0/0, ticks=12599680/0, in_queue=12602190, util=99.49%
>   sds: ios=261852/0, merge=0/0, ticks=12624070/0, in_queue=12626580, util=99.58%

So, the maximum for these 16 disks is 3301 MiByte/s.

I also tried hardware RAID (2 sets of 8 disks), and got a similar result:

> Run status group 0 (all jobs):
>   READ: io=65560MB, aggrb=3024MB/s, minb=1548MB/s, maxb=1550MB/s, mint=21650msec, maxt=21681msec



> fio should be able to push these devices up to the line speed. If it
> doesn't I would suggest changing elevators (deadline, cfq, noop) and
> bumping the max request size to the max supported by the device.

3301 MiByte/s seems like a reasonable number, given the theoretical
maximum of 16 times the single disk performance of 16*256 MiByte/s =
4096 MiByte/s.

Based on this, I have not looked at tuning. Would you recommend that I do?

Our minimal goal is 2500 MiByte/s; that seems achievable as ZFS was able
to reach 2750 MiByte/s without tuning.

> When we have a config that does so, we can tune the btrfs side of things
> as well.

Some files are created in the root folder of the mount point, but I get
errors instead of results:

> ~/fio# ./fio btrfs16.fio
> btrfs: (g=0): rw=read, bs=20M-20M/20M-20M, ioengine=sync, iodepth=8
> Starting 16 processes
> btrfs: Laying out IO file(s) (1 file(s) / 32768MB)
> btrfs: Laying out IO file(s) (1 file(s) / 32768MB)
[...]
> btrfs: Laying out IO file(s) (1 file(s) / 32768MB)
> fio: first direct IO errored. File system may not support direct IO, or iomem_align= is bad.
> fio: first direct IO errored. File system may not support direct IO, or iomem_align= is bad.
> fio: first direct IO errored. File system may not support direct IO, or iomem_align= is bad.
> fio: pid=5958, err=22/file:engines/sync.c:62, func=xfer, error=Invalid argument
> fio: pid=5961, err=22/file:engines/sync.c:62, func=xfer, error=Invalid argument
> fio: pid=5962, err=22/file:engines/sync.c:62, func=xfer, error=Invalid argument
> fio: first direct IO errored. File system may not support direct IO, or iomem_align= is bad.
[...]
>
> btrfs: (groupid=0, jobs=1): err=22 (file:engines/sync.c:62, func=xfer, error=Invalid argument): pid=5956
>   cpu : usr=0.00%, sys=0.00%, ctx=1, majf=0, minf=52
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete : 0=50.0%, 4=50.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w: total=1/0, short=0/0
[no results]

What could be going on here?
(I get the same result from the github version of fio, fio 1.42, as well
as the one that came with Ubuntu, fio 1.33.1).

> My first guess is just that your IOs are not large enough w/btrfs. The
> iozone command below is doing buffered reads, so our performance is
> going to be limited by the kernel readahead buffer size.
>
> If you use a much larger IO size (the fio job above reads in 20M chunks)
> and aio/dio instead, you can have more control over how the IO goes down
> to the device.

I don't quite understand (I must warn you that I'm a novice here; I'm a
networking expert by origin, not a storage expert).
I reran the first fio test with other "bs" settings:

> 1 disk, 1M buffer:
>   READ: io=32768MB, aggrb=247817KB/s, minb=253764KB/s, maxb=253764KB/s, mint=135400msec, maxt=135400msec
>
> 1 disk, 20M buffer:
>   READ: io=32780MB, aggrb=260963KB/s, minb=267226KB/s, maxb=267226KB/s, mint=128626msec, maxt=128626msec
>
> 1 disk, 100M buffer:
>   READ: io=32800MB, aggrb=263776KB/s, minb=270107KB/s, maxb=270107KB/s, mint=127332msec, maxt=127332msec
>
> 16 disk, 1M buffer:
>   READ: io=524288MB, aggrb=3265MB/s, minb=213983KB/s, maxb=215761KB/s, mint=159249msec, maxt=160572msec
>
> 16 disk, 20M buffer:
>   READ: io=524480MB, aggrb=3301MB/s, minb=216323KB/s, maxb=219763KB/s, mint=156406msec, maxt=158893msec
>
> 16 disk, 100M buffer:
>   READ: io=524800MB, aggrb=3272MB/s, minb=214443KB/s, maxb=216446KB/s, mint=158900msec, maxt=160384msec

However, the buffer size does not seem to make that much of a difference.
Or am I adjusting the wrong buffers here?


Mathieu Chouquet-Stringer wrote:

> Don't you need to stripe metadata too (with -m raid0)? Or you may
> be limited by your metadata drive?

I presume that if this were the case, we would see good performance for
hardware RAID and mdadm based software RAID, and poor performance for
BtrFS. However, we saw poor performance for all three options.

Of course, seeing is believing.

Without metadata striping:
# mkfs.btrfs -d raid0 /dev/sdd ... /dev/sds
# mount -t btrfs -o ssd /dev/sdd /mnt/ssd6
# iozone -s 32G -r 1024 -i 0 -i 1 -w -f /mnt/ssd6/iozone.tmp
              KB  reclen   write rewrite    read    reread
        33554432    1024 1628475 1640349   943416   951135

With metadata striping:
# mkfs.btrfs -d raid0 -m raid0 /dev/sdd ... /dev/sds
# mount -t btrfs -o ssd /dev/sdd /mnt/ssd6
# iozone -s 32G -r 1024 -i 0 -i 1 -w -f /mnt/ssd6/iozone.tmp
              KB  reclen   write rewrite    read    reread
        33554432    1024 1631833 1564137   950405   954434

Unfortunately, no noticeable difference.
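
To put a number on "no noticeable difference", the two iozone rows can be compared column by column. This is a small sketch using only the figures above; the dictionary and function names are made up for illustration.

```python
# iozone results in KB/s, copied from the two runs above.
WITHOUT_META_STRIPING = {
    "write": 1628475, "rewrite": 1640349, "read": 943416, "reread": 951135,
}
WITH_META_STRIPING = {
    "write": 1631833, "rewrite": 1564137, "read": 950405, "reread": 954434,
}


def relative_change(before, after):
    """Per-column relative change from 'before' to 'after' (0.01 means +1%)."""
    return {key: (after[key] - before[key]) / before[key] for key in before}


if __name__ == "__main__":
    changes = relative_change(WITHOUT_META_STRIPING, WITH_META_STRIPING)
    for name, change in changes.items():
        print(f"{name:8s} {change:+.1%}")
```

The read and reread columns change by less than 1%, and the largest change (rewrite) is under 5%, well inside the run-to-run spread reported earlier for the nodatasum tests.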

With kind regards,
Freek Dijkstra
SARA High Performance Networking- and Computing
On 5 August 2010 22:21, Freek Dijkstra <Freek.Dijkstra@sara.nl> wrote:
> [...]
> Some files are created in the root folder of the mount point, but I get
> errors instead of results:
>
>> ~/fio# ./fio btrfs16.fio
>> btrfs: (g=0): rw=read, bs=20M-20M/20M-20M, ioengine=sync, iodepth=8
>> Starting 16 processes
>> btrfs: Laying out IO file(s) (1 file(s) / 32768MB)
>> btrfs: Laying out IO file(s) (1 file(s) / 32768MB)
> [...]
>
>> btrfs: Laying out IO file(s) (1 file(s) / 32768MB)
>> fio: first direct IO errored. File system may not support direct IO, or iomem_align= is bad.
>> fio: first direct IO errored. File system may not support direct IO, or iomem_align= is bad.
>> fio: first direct IO errored. File system may not support direct IO, or iomem_align= is bad.
>> fio: pid=5958, err=22/file:engines/sync.c:62, func=xfer, error=Invalid argument
>> fio: pid=5961, err=22/file:engines/sync.c:62, func=xfer, error=Invalid argument
>> fio: pid=5962, err=22/file:engines/sync.c:62, func=xfer, error=Invalid argument
>> fio: first direct IO errored. File system may not support direct IO, or iomem_align= is bad.
> [...]
>>
>> btrfs: (groupid=0, jobs=1): err=22 (file:engines/sync.c:62, func=xfer, error=Invalid argument): pid=5956
>>   cpu : usr=0.00%, sys=0.00%, ctx=1, majf=0, minf=52
>>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>      submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      complete : 0=50.0%, 4=50.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>      issued r/w: total=1/0, short=0/0
> [no results]
>
> What could be going on here?
> (I get the same result from the github version of fio, fio 1.42, as well
> as the one that came with Ubuntu, fio 1.33.1).

If you are using 2.6.32 (as above), BTRFS on this release doesn't
support direct I/O. It is supported on 2.6.35, so you could retry with
(eg):

http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.35-maverick/linux-image-2.6.35-020635-generic_2.6.35-020635_amd64.deb
--
Daniel J Blueman
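
Before rerunning the direct I/O job, the kernel release can be checked against the 2.6.35 threshold mentioned in this reply. The sketch below encodes only that one statement as an assumption; it does not probe the filesystem itself, and the names `MIN_DIO_KERNEL`, `parse_release` and `btrfs_dio_expected` are hypothetical.

```python
import re

# Assumption taken from the reply above: btrfs direct I/O needs >= 2.6.35.
MIN_DIO_KERNEL = (2, 6, 35)


def parse_release(release):
    """Parse the leading 'X.Y[.Z]' of a kernel release string into a tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def btrfs_dio_expected(release):
    """True if the release is at least the assumed minimum for btrfs O_DIRECT."""
    return parse_release(release) >= MIN_DIO_KERNEL


if __name__ == "__main__":
    # The two release strings that appear in this thread.
    for release in ("2.6.32-21-server", "2.6.35-020635-generic"):
        print(release, "->", btrfs_dio_expected(release))
```

On a live system the string from `uname -r` (or Python's `platform.release()`) could be passed in instead of the hard-coded examples.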
On Thu, Aug 05, 2010 at 11:21:06PM +0200, Freek Dijkstra wrote:
> Chris Mason wrote:
>
> > Basically we have two different things to tune. First the block layer
> > and then btrfs.
> >
> > And then we need to setup a fio job file that hammers on all the ssds at
> > once. I'd have it use adio/dio and talk directly to the drives.
>
> Thanks. First one disk:
>
> > f1: (groupid=0, jobs=1): err= 0: pid=6273
> > read : io=32780MB, bw=260964KB/s, iops=12, runt=128626msec
> > clat (usec): min=74940, max=80721, avg=78449.61, stdev=923.24
> > bw (KB/s) : min=240469, max=269981, per=100.10%, avg=261214.77, stdev=2765.91
> > cpu : usr=0.01%, sys=2.69%, ctx=1747, majf=0, minf=5153
> > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued r/w: total=1639/0, short=0/0
> >
> > lat (msec): 100=100.00%
> >
> > Run status group 0 (all jobs):
> > READ: io=32780MB, aggrb=260963KB/s, minb=267226KB/s, maxb=267226KB/s, mint=128626msec, maxt=128626msec
> >
> > Disk stats (read/write):
> > sdd: ios=261901/0, merge=0/0, ticks=10135270/0, in_queue=10136460, util=99.30%
>
> So 255 MiByte/s.
> Out of curiosity, what is the distinction between the reported figures
> of 260964 kiB/s, 261214.77 kiB/s, 267226 kiB/s and 260963 kiB/s?

When there is only one job, they should all be the same. aggr is the
total seen across all the jobs, min is the lowest, max is the highest.

>
> Now 16 disks (abbreviated):
>
> > ~/fio# ./fio ssd.fio
> > Starting 16 processes
> > f1: (groupid=0, jobs=1): err= 0: pid=4756
> > read : io=32780MB, bw=212987KB/s, iops=10, runt=157600msec
> > clat (msec): min=75, max=138, avg=96.15, stdev= 4.47
> > lat (msec): min=75, max=138, avg=96.15, stdev= 4.47
> > bw (KB/s) : min=153121, max=268968, per=6.31%, avg=213181.15, stdev=9052.26
> > cpu : usr=0.00%, sys=1.71%, ctx=2737, majf=0, minf=5153
> > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> > issued r/w: total=1639/0, short=0/0
> >
> > lat (msec): 100=97.99%, 250=2.01%
> >
> > Run status group 0 (all jobs):
> > READ: io=524480MB, aggrb=3301MB/s, minb=216323KB/s, maxb=219763KB/s, mint=156406msec, maxt=158893msec
>
> So, the maximum for these 16 disks is 3301 MiByte/s.
>
> I also tried hardware RAID (2 sets of 8 disks), and got a similar result:
>
> > Run status group 0 (all jobs):
> > READ: io=65560MB, aggrb=3024MB/s, minb=1548MB/s, maxb=1550MB/s, mint=21650msec, maxt=21681msec

Great, so we know the drives are fast.

>
> > fio should be able to push these devices up to the line speed. If it
> > doesn't I would suggest changing elevators (deadline, cfq, noop) and
> > bumping the max request size to the max supported by the device.
>
> 3301 MiByte/s seems like a reasonable number, given the theoretic
> maximum of 16 times the single disk performance of 16*256 MiByte/s =
> 4096 MiByte/s.
>
> Based on this, I have not looked at tuning. Would you recommend that I do?
>
> Our minimal goal is 2500 MiByte/s; that seems achievable as ZFS was able
> to reach 2750 MiByte/s without tuning.
>
> > When we have a config that does so, we can tune the btrfs side of things
> > as well.
>
> Some files are created in the root folder of the mount point, but I get
> errors instead of results:
>

Someone else mentioned that btrfs only gained DIO reads in 2.6.35. I
think you'll get the best results with that kernel if you can find an
update.

If not, you can change the fio job file to remove direct=1 and increase
the bs flag up to 20M.

I'd also suggest changing /sys/class/bdi/btrfs-1/read_ahead_kb to a
bigger number. Try 20480.

-chris
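The readahead suggestion above can be sketched as a small shell helper.
This is an illustration only: set_read_ahead_kb is a hypothetical name,
the bdi directory is passed as an argument because the btrfs-1 name may
differ from system to system, and writing the knob normally requires
root. The value is in KiB, so 20480 corresponds to 20 MiB, the same size
as the suggested 20M block size.

```shell
#!/bin/sh
# set_read_ahead_kb BDI_DIR NEW_KB
# Show the current readahead of a backing device, then write a new value.
#   BDI_DIR: e.g. /sys/class/bdi/btrfs-1 (the name used in this thread)
#   NEW_KB:  readahead in KiB, e.g. 20480 (= 20 MiB)
# (Hypothetical helper for illustration.)
set_read_ahead_kb() {
    knob="$1/read_ahead_kb"
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (wrong bdi name, or not root?)" >&2
        return 1
    fi
    echo "old: $(cat "$knob") KiB"
    echo "$2" > "$knob"
    echo "new: $(cat "$knob") KiB"
}

# Example call (as root, with the filesystem mounted):
#   set_read_ahead_kb /sys/class/bdi/btrfs-1 20480
```

The setting is not persistent, so it would have to be reapplied after
each mount.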
On 2010-08-05 16:51, Chris Mason wrote:
> And then we need to setup a fio job file that hammers on all the ssds at
> once. I'd have it use adio/dio and talk directly to the drives. I'd do
> something like this for the fio job file, but Jens Axboe is cc'd and he
> might make another suggestion on the job file. I'd do something like
> this in a file named ssd.fio
>
> [global]
> size=32g
> direct=1
> iodepth=8

iodepth=8 will have no effect if you don't also set a different IO
engine, otherwise you would be using read(2) to fetch the data. So add
ioengine=libaio to take advantage of a higher queue depth as well.

Also, I didn't see Chris mention this, but if you have a newer intel box
you can use hw accelerated crc32c instead. For some reason my test box
always loads crc32c and not crc32c-intel, so I need to do that manually.
That helps a lot with higher transfer rates. You can check support for
hw crc32c by checking for the 'sse4_2' flag in /proc/cpuinfo.

--
Jens Axboe
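The two checks described above can be sketched in shell as follows. The
function names are invented for this example, and both take an optional
file argument only so they can be tried against saved copies of
/proc/cpuinfo and /proc/modules. One caveat, stated as an assumption: if
crc32c_intel is built into the kernel rather than loaded as a module, it
will not show up in /proc/modules, so a negative result here is not
conclusive.

```shell
#!/bin/sh
# Succeeds if the cpuinfo file ($1, default /proc/cpuinfo) lists the sse4_2 flag.
has_sse4_2() {
    grep -E '^flags[[:space:]]*:' "${1:-/proc/cpuinfo}" | grep -qw 'sse4_2'
}

# Succeeds if crc32c_intel appears in the module list ($1, default /proc/modules).
crc32c_intel_loaded() {
    grep -q '^crc32c_intel ' "${1:-/proc/modules}"
}

# Report the case described in the message: capable CPU, module not loaded.
if has_sse4_2 && ! crc32c_intel_loaded; then
    echo "CPU has sse4_2 but crc32c_intel is not loaded; try: modprobe crc32c_intel"
fi
```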
On Fri, Aug 06, 2010 at 01:55:21PM +0200, Jens Axboe wrote:
> On 2010-08-05 16:51, Chris Mason wrote:
> > And then we need to setup a fio job file that hammers on all the ssds at
> > once. I'd have it use adio/dio and talk directly to the drives. I'd do
> > something like this for the fio job file, but Jens Axboe is cc'd and he
> > might make another suggestion on the job file. I'd do something like
> > this in a file named ssd.fio
> >
> > [global]
> > size=32g
> > direct=1
> > iodepth=8
>
> iodepth=8 will have no effect if you don't also set a different IO
> engine, otherwise you would be using read(2) to fetch the data. So add
> ioengine=libaio to take advantage of a higher queue depth as well.

Yeah, I just realized I messed up the suggested file, but it worked well
enough on the block devices, so I think just having 16 procs hitting the
array was enough. libaio will only help with O_DIRECT though, so this
only applies to 2.6.35 as well.

>
> Also, I didn't see Chris mention this, but if you have a newer intel box
> you can use hw accelerated crc32c instead. For some reason my test box
> always loads crc32c and not crc32c-intel, so I need to do that manually.
> That helps a lot with higher transfer rates. You can check support for
> hw crc32c by checking for the 'sse4_2' flag in /proc/cpuinfo.

Yeah, the HW assisted crc does make a huge difference.

-chris
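Putting the quoted [global] section together with the ioengine=libaio
correction, a job file could be generated along these lines. This is a
sketch under assumptions: only the [global] lines size, direct and
iodepth are quoted in this thread, while rw=read and bs=20m are taken
from the fio output shown elsewhere in the discussion, and the per-device
sections (f1, f2, ...) and the device list are guesses at how the rest of
ssd.fio looked. write_ssd_job is a hypothetical helper name.

```shell
#!/bin/sh
# write_ssd_job OUTFILE DEVICE...
# Write a fio job file with one sequential-read job per device.
# direct=1 plus ioengine=libaio is what makes iodepth=8 take effect.
write_ssd_job() {
    out=$1
    shift
    {
        printf '[global]\nsize=32g\ndirect=1\niodepth=8\nioengine=libaio\nrw=read\nbs=20m\n'
        n=1
        for dev in "$@"; do
            # One job section per device; section names are illustrative.
            printf '\n[f%d]\nfilename=%s\n' "$n" "$dev"
            n=$((n + 1))
        done
    } > "$out"
}

# Example (device names are placeholders for the 16 SSDs):
#   write_ssd_job ssd.fio /dev/sdd /dev/sde
#   ./fio ssd.fio
```

For a run against files on btrfs with a kernel older than 2.6.35, the
same generator would need direct=1 removed, at which point libaio no
longer helps, as noted above.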
Jens Axboe <axboe@kernel.dk> writes:
>
> Also, I didn't see Chris mention this, but if you have a newer intel box
> you can use hw accelerated crc32c instead. For some reason my test box
> always loads crc32c and not crc32c-intel, so I need to do that manually.

I have a patch for that, will post it later: autoloading of modules
based on x86 cpuinfo.

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
On 08/08/2010 03:18 AM, Andi Kleen wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>>
>> Also, I didn't see Chris mention this, but if you have a newer intel box
>> you can use hw accelerated crc32c instead. For some reason my test box
>> always loads crc32c and not crc32c-intel, so I need to do that manually.
>
> I have a patch for that, will post it later: autoloading of modules
> based on x86 cpuinfo.

Great, it is pretty annoying to have to do it manually. Sometimes
you forget. And it's not possible to de-select CRC32C and have
the intel variant loaded.

--
Jens Axboe
Hi all,

Thanks a lot for the great feedback from before the weekend. Since one
of my colleagues needed the machine, I could only do the tests today.

In short: just installing 2.6.35 did make some difference, but I was
mostly impressed with the speedup gained by the hardware acceleration of
the crc32c_intel module.

Here is some quick data.

Reference figures:
16* single disk (theoretical limit):      4092 MiByte/s
fio data layer tests (achievable limit):  3250 MiByte/s
ZFS performance:                          2505 MiByte/s

BtrFS figures:
IOzone on 2.6.32:                          919 MiByte/s
fio btrfs tests on 2.6.35:                1460 MiByte/s
IOzone on 2.6.35 with crc32c:             1250 MiByte/s
IOzone on 2.6.35 with crc32c_intel:       1629 MiByte/s
IOzone on 2.6.35, using -o nodatasum:     1955 MiByte/s

For those who find this message and want a howto: the easiest way to use
crc32c_intel is to add the module name to /etc/modules:
# echo "crc32c_intel" >> /etc/modules
# reboot

Now the next step for us is to tune the block sizes. We only did that
preliminarily, but now that we have a good knowledge of what software to
use, we can start tuning that in more detail.

If there is interest on this list, I'll gladly post our results here.

Jens Axboe wrote:
>>> Also, I didn't see Chris mention this, but if you have a newer intel box
>>> you can use hw accelerated crc32c instead. For some reason my test box
>>> always loads crc32c and not crc32c-intel, so I need to do that manually.
>
> it is pretty annoying to have to do it manually. Sometimes
> you forget. And it's not possible to de-select CRC32C and have
> the intel variant loaded.

You can, but only if you first unmount the partition:
# umount /mnt/mybtrfsdisk
# rmmod btrfs
# rmmod libcrc32c
# rmmod crc32c
# modprobe crc32c_intel
# mount -t btrfs /dev/sda1 /mnt/mybtrfsdisk

We encountered a small bug: the btrfs partition with RAID0 that was made
on 2.6.32 did not mount after a reboot or after unmounting. Running
btrfsck fixes this, but after the next umount, we had to run btrfsck
again. After recreating the btrfs partition on 2.6.35, all was well.
btrfs partitions that don't use (software) RAID work fine.

~# mount -t btrfs -o ssd /dev/sdd /mnt/ssd3
mount: wrong fs type, bad option, bad superblock on /dev/sdd,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so

~# dmesg | tail
device fsid ec4d518ec61d4496-81e5aeda2d8ef7b5 devid 1 transid 69 /dev/sdd
btrfs: use ssd allocation scheme
btrfs: failed to read the system array on sdd
btrfs: open_ctree failed

~# btrfsck /dev/sdd
found 550511136768 bytes used err is 0
total csum bytes: 536870912
total tree bytes: 755322880
total fs tree bytes: 77824
btree space waste bytes: 169152328
file data blocks allocated: 549755813888
referenced 549755813888
Btrfs Btrfs v0.19

~# mount -t btrfs -o ssd /dev/sdd /mnt/ssd3
[and it mounts fine now]

Regards,
Freek Dijkstra
SARA High Performance Computing and Networking
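To put the IOzone figures above in relative terms, the percentage gains
can be worked out with a few lines of shell. The pct_gain helper is
invented for this example, and the comparison assumes the listed runs
are otherwise comparable (same disks, same IOzone settings), which the
message does not state explicitly.

```shell
#!/bin/sh
# pct_gain BASE NEW
# Print the change from BASE to NEW as a percentage of BASE, rounded
# to the nearest whole number. (Hypothetical helper for illustration.)
pct_gain() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (new - base) * 100 / base }'
}

pct_gain 1250 1629   # crc32c -> crc32c_intel on 2.6.35: prints 30
pct_gain 1250 1955   # crc32c -> nodatasum on 2.6.35:    prints 56
pct_gain 919 1629    # 2.6.32 -> 2.6.35 + crc32c_intel:  prints 77
```

Read this way, hardware checksumming recovers roughly half of the gap
between software crc32c and running with no data checksums at all.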
On Mon, Aug 09, 2010 at 04:45:45PM +0200, Freek Dijkstra wrote:
> Hi all,
>
> Thanks a lot for the great feedback from before the weekend. Since one
> of my colleagues needed the machine, I could only do the tests today.
>
> In short: just installing 2.6.35 did make some difference, but I was
> mostly impressed with the speedup gained by the hardware acceleration of
> the crc32c_intel module.
>
> Here is some quick data.
>
> Reference figures:
> 16* single disk (theoretical limit):      4092 MiByte/s
> fio data layer tests (achievable limit):  3250 MiByte/s
> ZFS performance:                          2505 MiByte/s
>
> BtrFS figures:
> IOzone on 2.6.32:                          919 MiByte/s
> fio btrfs tests on 2.6.35:                1460 MiByte/s

Was this one with O_DIRECT?

> IOzone on 2.6.35 with crc32c:             1250 MiByte/s
> IOzone on 2.6.35 with crc32c_intel:       1629 MiByte/s
> IOzone on 2.6.35, using -o nodatasum:     1955 MiByte/s
>
> For those who find this message and want a howto: the easiest way to use
> crc32c_intel is to add the module name to /etc/modules:
> # echo "crc32c_intel" >> /etc/modules
> # reboot
>
> Now the next step for us is to tune the block sizes. We only did that
> preliminarily, but now that we have a good knowledge of what software to
> use, we can start tuning that in more detail.
>
> If there is interest on this list, I'll gladly post our results here.

Definitely, please do.

-chris
Chris Mason wrote (ao):
> On Fri, Aug 06, 2010 at 01:55:21PM +0200, Jens Axboe wrote:
> > Also, I didn't see Chris mention this, but if you have a newer intel box
> > you can use hw accelerated crc32c instead. For some reason my test box
> > always loads crc32c and not crc32c-intel, so I need to do that manually.
> > That helps a lot with higher transfer rates. You can check support for
> > hw crc32c by checking for the 'sse4_2' flag in /proc/cpuinfo.
>
> Yeah, the HW assisted crc does make a huge difference.

The above says "newer intel box". I did some googling and it seems to
really mean Intel CPUs only, not AMD, correct?

Is there a way to get hardware support for crc32c on ARM based systems?

Sander

--
Humilis IT Services and Solutions
http://www.humilis.net
On Fri, Aug 20, 2010 at 06:53:44AM +0200, Sander wrote:
> Chris Mason wrote (ao):
> > On Fri, Aug 06, 2010 at 01:55:21PM +0200, Jens Axboe wrote:
> > > Also, I didn't see Chris mention this, but if you have a newer intel box
> > > you can use hw accelerated crc32c instead. For some reason my test box
> > > always loads crc32c and not crc32c-intel, so I need to do that manually.
> > > That helps a lot with higher transfer rates. You can check support for
> > > hw crc32c by checking for the 'sse4_2' flag in /proc/cpuinfo.
> >
> > Yeah, the HW assisted crc does make a huge difference.
>
> The above says "newer intel box". I did some googling and it seems to
> mean really Intel CPUs only, not AMD, correct?
>
> Is there a way to get hardware support for crc32c on ARM based systems?

So far I only know of the intel sse4.2 systems that support this.

-chris