Hi All,

We have recently acquired hardware for a new fileserver, and my task, if I want to use OpenSolaris (osol or sxce) on it, is for it to perform at least as well as Linux (and our 5 year old fileserver) in our environment.

Our current file server is a whitebox Debian server with 8x 10,000 RPM SCSI drives behind an LSI MegaRAID controller with a BBU. The filesystem in use is XFS.

The raw performance tests that I have to use to compare them are as follows:

 * Create 100,000 0 byte files over NFS
 * Delete 100,000 0 byte files over NFS
 * Repeat the previous 2 tasks with 1k files
 * Untar a copy of our product with object files (quite a nasty test)
 * Rebuild the product "make -j"
 * Delete the build directory

The reason for the 100k files test is that it has proven to be a significant indicator of desktop performance on the developers' desktop systems.

Within the budget we had, we have purchased the following system to meet our goals - if the OpenSolaris tests do not meet our requirements, it is certain that the equivalent tests under Linux will. I'm the only person here who wants OpenSolaris specifically, so it is in my interest to try to get it working at least on par with, if not better than, our current system. So here I am begging for further help.

Dell R710
2x 2.40 GHz Xeon 5330 CPU
16GB RAM (4x 4GB)

mpt0 SAS 6/i (LSI 1068E)
2x 1TB SATA-II drives (rpool)
2x 50GB Enterprise SSD (slog) - Samsung MCCOE50G5MPQ-0VAD3

mpt1 SAS 5/E (LSI 1068E)
Dell MD1000 15-bay external storage chassis with 2 heads
10x 450GB Seagate Cheetah 15,000 RPM SAS

We also have a PERC 6/E w/512MB BBWC to test with or fall back to if we go with a Linux solution.

I have installed OpenSolaris 2009.06, updated to b117, and used mdb to modify the kernel to work around a current bug in b117 with the newer Dell systems:
http://bugs.opensolaris.org/bugdatabase/view_bug.do%3Bjsessionid=76a34f41df5bbbfc2578934eeff8?bug_id=6850943

Keep in mind that for these tests the external MD1000 chassis is connected with a single 4-lane SAS cable, which should give 12Gbps or 1.2GBps of throughput.

Individually, each disk exhibits about 170MB/s raw write performance, e.g.

jamver at scalzi:~$ pfexec dd if=/dev/zero of=/dev/rdsk/c8t5d0 bs=65536 count=32768
2147483648 bytes (2.1 GB) copied, 12.4934 s, 172 MB/s

A single spindle zpool seems to perform OK.

jamver at scalzi:~$ pfexec zpool create single c8t20d0
jamver at scalzi:~$ pfexec dd if=/dev/zero of=/single/foo bs=65536 count=327680
21474836480 bytes (21 GB) copied, 127.201 s, 169 MB/s

RAID10 tests seem to be quite slow - about half the speed I would have expected (170*5 = 850, so I would have expected to see around 800MB/s):

jamver at scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0 mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0 mirror c8t20d0 c8t21d0
jamver at scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072 count=163840
21474836480 bytes (21 GB) copied, 50.3066 s, 427 MB/s

A 5 disk stripe performed as expected:

jamver at scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0 c8t19d0 c8t21d0
jamver at scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072 count=163840
21474836480 bytes (21 GB) copied, 27.7972 s, 773 MB/s

But a 10 disk stripe did not increase throughput significantly:

jamver at scalzi:~$ pfexec zpool create fastdata c8t10d0 c8t15d0 c8t17d0 c8t19d0 c8t21d0 c8t20d0 c8t18d0 c8t16d0 c8t11d0 c8t9d0
jamver at scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072 count=163840
21474836480 bytes (21 GB) copied, 26.1189 s, 822 MB/s

The best sequential write result I could elicit with redundancy was a pool with 2x 5 disk RAIDZs striped:

jamver at scalzi:~$ pfexec zpool create fastdata raidz c8t10d0 c8t15d0 c8t16d0 c8t11d0 c8t9d0 raidz c8t17d0 c8t19d0 c8t21d0 c8t20d0 c8t18d0
jamver at scalzi:~$ pfexec dd if=/dev/zero of=/fastdata/foo bs=131072 count=163840
21474836480 bytes (21 GB) copied, 31.3934 s, 684 MB/s

Moving on to testing NFS and trying to perform the create of 100,000 0 byte files (aka the metadata and NFS sync test): the test seemed likely to take about half an hour without a slog, as I worked out when I killed it. Painfully slow. So I added one of the SSDs to the system as a slog, which improved matters. The client is a Red Hat Enterprise Linux server on modern hardware and has been used for all tests against our old fileserver.

The time to beat: RHEL5 client to Debian4+XFS server:

bash-3.2# time tar xf zeroes.tar

real    2m41.979s
user    0m0.420s
sys     0m5.255s

And on the currently configured system:

jamver at scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0 mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0 mirror c8t20d0 c8t21d0 log c7t2d0
jamver at scalzi:~$ pfexec zfs set sharenfs='rw,root=@10.1.0/23' fastdata

bash-3.2# time tar xf zeroes.tar

real    8m7.176s
user    0m0.438s
sys     0m5.754s

While this was running, I was looking at the output of zpool iostat fastdata 10 to see how it was going and was surprised to see the seemingly low IOPS.

jamver at scalzi:~$ zpool iostat fastdata 10
              capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
fastdata    10.0G  2.02T      0    312    268  3.89M
fastdata    10.0G  2.02T      0    818      0  3.20M
fastdata    10.0G  2.02T      0    811      0  3.17M
fastdata    10.0G  2.02T      0    860      0  3.27M

Strangely, when I added a second SSD as a second slog, it made no difference to the write operations.

I'm not sure where to go from here, these results are appalling (about 3x the time of the old system with 8x 10kRPM spindles) even with two Enterprise SSDs as separate log devices.

cheers,
James
This is something that I've run into as well across various installs very similar to the one described (PE2950 backed by an MD1000). I find that overall write performance across NFS is absolutely horrible on 2008.11 and 2009.06. Worse, while iSCSI under 2008.11 is just fine with near wire speeds in most cases, under 2009.06 I can't even format a VMFS volume from ESX without hitting a timeout - throughput over the iSCSI connection is mostly around 64K/s with 1 operation per second. I'm downgrading my new server back to 2008.11 until I can find a way to ensure decent performance, since this is really a showstopper. But in the meantime I've completely given up on NFS as a primary data store - it's strictly used for templates and ISO images and such, which I copy up via scp since that is literally 10 times faster than over NFS.

I have a 2008.11 OpenSolaris server with an MD1000 using 7 mirror vdevs. The networking is 4x GbE split into two trunked connections. Locally, I get 460 MB/s write and 1 GB/s read, so raw disk performance is not a problem. When I use iSCSI I get wire speed in both directions on the GbE from ESX and other clients. However, when I use NFS, write performance is limited to about 2 MB/s; read performance is close to wire speed. I'm using a pretty vanilla configuration, with only atime=off and sharenfs=anon=0. I've looked at various tuning guides for NFS with and without ZFS, but I haven't found anything that seems to address this type of issue.

Anyone have some tuning tips for this issue? Other than adding an SSD as a write log or disabling the ZIL... (although from James' experience this too seems to have a limited impact).

Cheers,
Erik

On 3 Jul 2009, at 08:39, James Lever wrote:

> While this was running, I was looking at the output of zpool iostat
> fastdata 10 to see how it was going and was surprised to see the
> seemingly low IOPS.
>
> jamver at scalzi:~$ zpool iostat fastdata 10
>               capacity     operations    bandwidth
> pool         used  avail   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> fastdata    10.0G  2.02T      0    312    268  3.89M
> fastdata    10.0G  2.02T      0    818      0  3.20M
> fastdata    10.0G  2.02T      0    811      0  3.17M
> fastdata    10.0G  2.02T      0    860      0  3.27M
>
> Strangely, when I added a second SSD as a second slog, it made no
> difference to the write operations.
>
> I'm not sure where to go from here, these results are appalling
> (about 3x the time of the old system with 8x 10kRPM spindles) even
> with two Enterprise SSDs as separate log devices.
On Thu, Jul 2, 2009 at 11:39 PM, James Lever <j at jamver.id.au> wrote:

> The time to beat: RHEL5 client to Debian4+XFS server:
>
> bash-3.2# time tar xf zeroes.tar
>
> real    2m41.979s
> user    0m0.420s
> sys     0m5.255s
>
> And on the currently configured system:
>
> jamver at scalzi:~$ pfexec zpool create fastdata mirror c8t9d0 c8t10d0
> mirror c8t11d0 c8t15d0 mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0
> mirror c8t20d0 c8t21d0 log c7t2d0
>
> bash-3.2# time tar xf zeroes.tar
>
> real    8m7.176s
> user    0m0.438s
> sys     0m5.754s
>
> While this was running, I was looking at the output of zpool iostat
> fastdata 10 to see how it was going and was surprised to see the
> seemingly low IOPS.
>
> Strangely, when I added a second SSD as a second slog, it made no
> difference to the write operations.
> I'm not sure where to go from here, these results are appalling
> (about 3x the time of the old system with 8x 10kRPM spindles) even
> with two Enterprise SSDs as separate log devices.

Are you sure the slog is working right? Try disabling the ZIL to see
if that helps with your NFS performance. If your performance increases
a hundred fold, I'm suspecting the slog isn't performing well, or even
doing its job at all.

-- 
Brent Jones
brent at servuhome.net
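For reference, the usual way to disable the ZIL for such a test on builds of this vintage is the zil_disable tunable - this is from memory, so verify it against the ZFS Evil Tuning Guide first, and treat it strictly as a test-only setting:

  echo zil_disable/W0t1 | pfexec mdb -kw
  (then remount, or export/import the pool, for it to take effect; write 0t0 to turn it back on)

or persistently, add "set zfs:zil_disable = 1" to /etc/system and reboot.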
On 03/07/2009, at 5:03 PM, Brent Jones wrote:

> Are you sure the slog is working right? Try disabling the ZIL to see
> if that helps with your NFS performance. If your performance increases
> a hundred fold, I'm suspecting the slog isn't performing well, or even
> doing its job at all.

The slog appears to be working fine - at ~800 IOPS it wasn't lighting up its activity light significantly, and when a second was added both activity lights were even dimmer. Without the slog, the pool was only providing ~200 IOPS for the NFS metadata test.

Speaking of which, can anybody point me at a good, valid test to measure the IOPS of these SSDs?

cheers,
James
Hi James,

ZFS SSD usage behaviour depends heavily on the access pattern, and for async ops ZFS will not use the SSDs. I'd suggest you disable the SSDs, create a ramdisk and use it as the slog device to compare the performance. If performance doesn't change, it means that the measurement method has some flaws or you haven't configured the slog correctly.

Please note that SSDs are way slower than DRAM-based write caches. SSDs will show a performance increase when you create load from multiple clients at the same time, as ZFS will be flushing the dirty cache sequentially. So I'd suggest running the test from a lot of clients simultaneously.

Best regards
Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at sun.com
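Concretely, the ramdisk slog experiment described above would look something like this (the name and size are only examples, and a ramdisk slog is volatile, so it is strictly for measurement and never for production data):

  pfexec ramdiskadm -a ziltest 2g
  pfexec zpool add fastdata log /dev/ramdisk/ziltest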
Hi,

James Lever wrote:

> Moving on to testing NFS and trying to perform the create of 100,000
> 0 byte files (aka the metadata and NFS sync test): the test seemed
> likely to take about half an hour without a slog, as I worked out
> when I killed it. Painfully slow. So I added one of the SSDs to the
> system as a slog, which improved matters. The client is a Red Hat
> Enterprise Linux server on modern hardware and has been used for all
> tests against our old fileserver.
>
> The time to beat: RHEL5 client to Debian4+XFS server:
>
> bash-3.2# time tar xf zeroes.tar
>
> real    2m41.979s
>
> And on the currently configured system:
>
> bash-3.2# time tar xf zeroes.tar
>
> real    8m7.176s
>
> While this was running, I was looking at the output of zpool iostat
> fastdata 10 to see how it was going and was surprised to see the
> seemingly low IOPS.

Have you tried running this locally on your OpenSolaris box - just to
get an idea of what it could deliver in terms of speed? Which NFS
version are you using?

-- 
Med venlig hilsen / Best Regards

Henrik Johansen
henrik at scannet.dk
Tlf. 75 53 35 00

ScanNet Group
A/S ScanNet
Hej Henrik,

On 03/07/2009, at 8:57 PM, Henrik Johansen wrote:

> Have you tried running this locally on your OpenSolaris box - just to
> get an idea of what it could deliver in terms of speed? Which NFS
> version are you using?

Most of the tests shown in my original message are local, except the explicitly NFS based metadata test shown at the very end (100k 0b files). The 100k/0b test is effectively atomic locally due to caching semantics and the lack of 100k explicit SYNC requests, so the transactions are able to be bundled together and written in one block.

I've just been using NFSv3 so far for these tests, as it is widely regarded as faster, even though less functional.

cheers,
James
Hi Mertol,

On 03/07/2009, at 6:49 PM, Mertol Ozyoney wrote:

> ZFS SSD usage behaviour depends heavily on the access pattern, and for
> async ops ZFS will not use the SSDs. I'd suggest you disable the SSDs,
> create a ramdisk and use it as the slog device to compare the
> performance. If performance doesn't change, it means that the
> measurement method has some flaws or you haven't configured the slog
> correctly.

I did some tests with a ramdisk slog and the write IOPS seemed to run at about the 4k/s mark, vs about 800/s when using the SSD as slog and 200/s without a slog.

# osol b117 RAID10 + ramdisk slog
bash-3.2# time tar xf zeroes.tar; rm -rf zeroes/; | tee /root/zeroes-test-scalzi-dell-ramdisk_slog.txt
# tar
real    1m32.343s
# rm
real    0m44.418s

# linux+XFS on hardware RAID
bash-3.2# time tar xf zeroes.tar; time rm -rf zeroes/; | tee /root/zeroes-test-linux-lsimegaraid_bbwc.txt
# tar
real    2m27.791s
# rm
real    0m46.112s

> Please note that SSDs are way slower than DRAM-based write caches.
> SSDs will show a performance increase when you create load from
> multiple clients at the same time, as ZFS will be flushing the dirty
> cache sequentially. So I'd suggest running the test from a lot of
> clients simultaneously.

I'm sure that it will be a more performant system in general; however, it is this explicit set of tests that I need to maintain or improve performance on.

cheers,
James
On 03.07.09 15:34, James Lever wrote:

> I did some tests with a ramdisk slog and the write IOPS seemed to run
> at about the 4k/s mark, vs about 800/s when using the SSD as slog and
> 200/s without a slog.
>
> # osol b117 RAID10 + ramdisk slog
> # tar
> real    1m32.343s
> # rm
> real    0m44.418s
>
> # linux+XFS on hardware RAID
> # tar
> real    2m27.791s
> # rm
> real    0m46.112s

Above results make me question whether your Linux NFS server is really honoring synchronous semantics or not...

A slog in a ramdisk is analogous to no slog at all with the ZIL disabled (well, it may actually be a bit worse). If your old system is 5 years old, the difference in the above numbers may be due to differences in CPU and memory speed, and so it suggests that your Linux NFS server appears to be working at memory speed - hence the question. Because if it does not honor sync semantics, you are really comparing apples with oranges here.

victor
On Fri, 3 Jul 2009, James Lever wrote:

> I did some tests with a ramdisk slog and the write IOPS seemed to run
> at about the 4k/s mark, vs about 800/s when using the SSD as slog and
> 200/s without a slog.

It seems like you may have selected the wrong SSD product to use. There seems to be a huge variation in performance (and cost) with so-called "enterprise" SSDs. SSDs with capacitor-backed write caches seem to be fastest.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
>>>>> "vl" == Victor Latushkin <Victor.Latushkin at Sun.COM> writes:

    vl> Above results make me question whether your Linux NFS server
    vl> is really honoring synchronous semantics or not...

Any idea how to test it?
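The only decisive check I can think of (a sketch only - the mount point is an example and the power pull is obviously a test-rig-only step): on the client, loop over synchronous writes against the suspect export, recording every file the server acknowledges, then cut power to the *server* mid-run:

  i=0
  while :; do
      i=$((i+1))
      dd if=/dev/zero of=/mnt/nfstest/f.$i bs=512 count=1 oflag=dsync 2>/dev/null \
          && echo $i >> /var/tmp/acked
  done

Once the server is back, any file listed in /var/tmp/acked that is missing from the export means the server replied before the data was on stable storage, i.e. it is not honouring sync semantics.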
On Fri, Jul 3, 2009 at 7:34 AM, James Lever <j at jamver.id.au> wrote:

> I did some tests with a ramdisk slog and the write IOPS seemed to run
> at about the 4k/s mark, vs about 800/s when using the SSD as slog and
> 200/s without a slog.

In my experience with the same setup as yours, but with iSCSI, the built-in write-back cache on the PERC 6/E controllers doesn't perform so well when spread out over so many logical drives. It's better than none, for sure, but not as good as an SSD ZIL, I believe.

-Ross
On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

> It seems like you may have selected the wrong SSD product to use.
> There seems to be a huge variation in performance (and cost) with
> so-called "enterprise" SSDs. SSDs with capacitor-backed write caches
> seem to be fastest.

How can I do a valid measurement of these SSDs' IOPS and latency that can be accurately compared with what others are using?

It was suggested (I think on IRC) that my issue may in fact be the latency of the SSDs in use, which may make some sense given the lack of change in performance when a second slog was added. Can anybody confirm/deny this?

zpool iostat reported the following while doing the metadata test of creating 10k 0b files over NFS:

just disks      ~200 IOPS
1 SSD slog      ~800 IOPS
2 SSD slog      ~800 IOPS
3GiB ramdisk    ~4k IOPS

SSDs are Samsung MCCOE50G5MPQ-0VAD3 (Dell branded).

For anybody wishing to play along at home (and perhaps add further data to this test), I have attached a 174k tbz which has 100k 0b files inside which can be used to replicate this test over NFS.

Has anybody been measuring the IOPS and latency of their SSDs before putting them into production? Care to share some numbers or reference URLs?

cheers,
James

[Attachment: zeroes.tar.bz2 (application/x-bzip2, 178029 bytes):
http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090704/6fe71fbf/attachment.bin]
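For a rough standalone number on the SSDs themselves, one sketch is to time small O_DSYNC writes with GNU dd against a device that is *not* part of any pool (the device name here is only an example, and this will destroy whatever is on it):

  pfexec dd if=/dev/zero of=/dev/rdsk/c7t3d0 bs=4k count=10000 oflag=dsync

Then 10000 / elapsed_seconds gives an approximate small synchronous write IOPS figure, which is roughly the pattern a slog sees from this NFS metadata workload. The drive's own volatile write cache (if enabled) can flatter the result, so treat it as a ballpark only.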
On 03/07/2009, at 10:37 PM, Victor Latushkin wrote:

> A slog in a ramdisk is analogous to no slog at all with the ZIL
> disabled (well, it may actually be a bit worse). If your old system is
> 5 years old, the difference in the above numbers may be due to
> differences in CPU and memory speed, and so it suggests that your
> Linux NFS server appears to be working at memory speed - hence the
> question. Because if it does not honor sync semantics, you are really
> comparing apples with oranges here.

The slog in ramdisk is in no way similar to disabling the ZIL. This is an NFS test, so if I had disabled the ZIL, writes would have to go direct to disk (not ZIL) before returning, which would potentially be even slower than ZIL on zpool.

The appearance of the Linux NFS server performing at memory speed may just be the BBWC in the LSI MegaRAID SCSI card. One of the developers here had explicitly performed tests to check these assumptions and found no evidence that the Linux/XFS sync implementation is lacking, even though there were previous issues with it in one kernel revision.

cheers,
James
Ross Walker - 2009-Jul-04 00:42 UTC - [zfs-discuss] [storage-discuss] surprisingly poor performance
On Jul 3, 2009, at 8:20 PM, James Lever <j at jamver.id.au> wrote:

> The appearance of the Linux NFS server performing at memory speed may
> just be the BBWC in the LSI MegaRAID SCSI card. One of the developers
> here had explicitly performed tests to check these assumptions and
> found no evidence that the Linux/XFS sync implementation is lacking,
> even though there were previous issues with it in one kernel revision.

XFS on LVM or EVMS volumes can't do barrier writes due to the lack of barrier support in LVM and EVMS, so it doesn't do a hard cache sync like it would on a raw disk partition, which makes the numbers higher. With battery backed write cache the risk is negligible, but the numbers are higher than those of file systems that do a hard cache sync.

Try XFS on a raw partition and NFS with sync writes enabled and see how it performs then.

-Ross
James Lever - 2009-Jul-04 01:47 UTC - [zfs-discuss] [storage-discuss] surprisingly poor performance
On 04/07/2009, at 10:42 AM, Ross Walker wrote:

> XFS on LVM or EVMS volumes can't do barrier writes due to the lack of
> barrier support in LVM and EVMS, so it doesn't do a hard cache sync
> like it would on a raw disk partition, which makes the numbers higher.
> With battery backed write cache the risk is negligible, but the
> numbers are higher than those of file systems that do a hard cache
> sync.

Do you have any references for this? And perhaps some published numbers that you may have seen?

> Try XFS on a raw partition and NFS with sync writes enabled and see
> how it performs then.

I cannot do this on the existing fileserver and do not have another system with a BBWC card to test against. The BBWC on the LSI MegaRAID is certainly the key factor here, I would expect.

I can test this assumption on this new hardware next week when I do a number of other tests and compare linux/XFS, and perhaps remove LVM (though I don't see why you would remove LVM from the equation).

cheers,
James
Ross Walker - 2009-Jul-04 03:49 UTC - [zfs-discuss] [storage-discuss] surprisingly poor performance
On Fri, Jul 3, 2009 at 9:47 PM, James Lever <j at jamver.id.au> wrote:

> Do you have any references for this? And perhaps some published
> numbers that you may have seen?

I ran some benchmarks back when verifying this, but didn't keep them unfortunately.

You can google: XFS Barrier LVM OR EVMS and see the threads about this.

> I cannot do this on the existing fileserver and do not have another
> system with a BBWC card to test against. The BBWC on the LSI MegaRAID
> is certainly the key factor here, I would expect.
>
> I can test this assumption on this new hardware next week when I do a
> number of other tests and compare linux/XFS, and perhaps remove LVM
> (though I don't see why you would remove LVM from the equation).

When you do, send me a copy. Try both on a straight partition and then on an LVM volume, and always use NFS sync; but when exporting, use the no_wdelay option if you don't already - it eliminates slowdowns with NFS sync on Linux.

-Ross
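Concretely, the Linux export line for such a test would look something like the following (the path and network are placeholders for your setup):

  /export/test  10.1.0.0/23(rw,sync,no_wdelay,no_root_squash)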
>>>>> "jl" == James Lever <j at jamver.id.au> writes:

    jl> if I had disabled the ZIL, writes would have to go direct to
    jl> disk (not ZIL) before returning, which would potentially be
    jl> even slower than ZIL on zpool.

no, I'm all but certain you are confused.

    jl> Has anybody been measuring the IOPS and latency of their SSDs

you might try:

  iostat -xcnXTdz c3t31d0 1

I haven't done this before though.

    jl> One of the developers here had explicitly performed tests to
    jl> check these assumptions and found no evidence that the
    jl> Linux/XFS sync implementation is lacking, even though there
    jl> were previous issues with it in one kernel revision.

Did he perform the same test on the one kernel revision with ``issues'', and if so what ``issues'' did the test find? Also note that it's not only Linux/XFS which must be tested but knfsd and LVM2 as well. I'm not saying it's broken, only that I've yet to hear of someone using a decisive test and getting conclusive results -- there are only anecdotal war stories and speculations about how one might test.
On 04/07/2009, at 2:08 PM, Miles Nordin wrote:

> iostat -xcnXTdz c3t31d0 1

On that device being used as a slog, a higher range of output looks like:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1477.8    0.0 2955.4  0.0  0.0    0.0    0.0   0   5 c7t2d0
Saturday, July 4, 2009 2:18:48 PM EST
     cpu
 us sy wt id
  0  1  0 99

I started a second task from the first server while using only a single slog and the performance of the SSD got up to 1900 w/s:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 1945.8    0.0 3891.7  0.0  0.1    0.0    0.0   0   6 c7t2d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c7t3d0
Saturday, July 4, 2009 2:23:11 PM EST
     cpu
 us sy wt id
  0  1  0 99

Interestingly, adding a second SSD into the mix and a 3rd writer (on a second client system) showed no further increase:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  942.3    0.0 1884.4  0.0  0.0    0.0    0.0   0   3 c7t2d0
    0.0  942.4    0.0 1884.4  0.0  0.0    0.0    0.0   0   3 c7t3d0

Adding the ramdisk as a 3rd slog with 3 writers gave only an increase of the speed of the slowest device:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  453.6    0.0 1814.4  0.0  0.0    0.0    0.0   0   1 ramdisk1
    0.0  907.2    0.0 1814.4  0.0  0.0    0.0    0.0   0   3 c7t2d0
    0.0  907.2    0.0 1814.4  0.0  0.0    0.0    0.0   0   3 c7t3d0
Saturday, July 4, 2009 2:29:08 PM EST
     cpu
 us sy wt id
  0  2  0 98

When only the ramdisk is used as a slog, it gives the following results:

                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 3999.4    0.0 15997.8  0.0  0.0    0.0    0.0   0   2 ramdisk1
Saturday, July 4, 2009 2:36:58 PM EST
     cpu
 us sy wt id
  0  3  0 96

Any insightful observations?

cheers,
James
James Lever - 2009-Jul-04 05:38 UTC - [zfs-discuss] [storage-discuss] surprisingly poor performance
On 04/07/2009, at 1:49 PM, Ross Walker wrote:

> I ran some benchmarks back when verifying this, but didn't keep them
> unfortunately.
>
> You can google: XFS Barrier LVM OR EVMS and see the threads about
> this.

Interesting reading. Testing seems to show that either it's not relevant, or there is something interesting going on with ext3 as a separate case.

> When you do, send me a copy. Try both on a straight partition and then
> on an LVM volume, and always use NFS sync; but when exporting, use the
> no_wdelay option if you don't already - it eliminates slowdowns with
> NFS sync on Linux.

The numbers below seem to indicate that either there are no barrier issues here, or the BBWC in the RAID controller makes them more-or-less invisible, as the ext3 volume below is directly on the exposed LUN while the XFS partition is on top of LVM2.

It does, however, show that XFS is much faster for deletes.

cheers,
James

bash-3.2# cd /nfs/xfs_on_LVM
bash-3.2# ( date ; time tar xf zeroes-10k.tar ; date ; time rm -rf zeroes/ ; date ) 2>&1
Sat Jul 4 15:31:13 EST 2009

real    0m18.145s
user    0m0.055s
sys     0m0.500s
Sat Jul 4 15:31:31 EST 2009

real    0m4.585s
user    0m0.004s
sys     0m0.261s
Sat Jul 4 15:31:36 EST 2009

bash-3.2# cd /nfs/ext3
bash-3.2# ( date ; time tar xf zeroes-10k.tar ; date ; time rm -rf zeroes/ ; date )
Sat Jul 4 15:32:43 EST 2009

real    0m15.509s
user    0m0.048s
sys     0m0.508s
Sat Jul 4 15:32:59 EST 2009

real    0m37.793s
user    0m0.006s
sys     0m0.225s
Sat Jul 4 15:33:37 EST 2009
On Sat, 4 Jul 2009, James Lever wrote:

> Any insightful observations?

Probably multiple slog devices are used to expand slog size and not used in parallel, since that would require somehow knowing the order. The principal bottleneck is likely the update rate of the first device in the chain, followed by the update rate of the underlying disks. If you put the ramdisk first in the slog chain, the performance is likely to jump.

Note that using the non-volatile log device is just a way to defer the writes to the underlying device, and the writes need to occur eventually or else the slog will fill up. Ideally the writes to the underlying devices can be ordered more sequentially for better throughput, or else the gain will be short-lived since the slog will fill up.

If you do a search, you will find that others have reported less than hoped for performance with these Samsung SSDs.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
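If you want to test that ordering theory, recreating the pool with the ramdisk listed first among the log devices would look something like the following (device names as used earlier in this thread; it destroys the pool, so it is a test-rig-only sketch):

  pfexec zpool destroy fastdata
  pfexec zpool create fastdata mirror c8t9d0 c8t10d0 mirror c8t11d0 c8t15d0 \
      mirror c8t16d0 c8t17d0 mirror c8t18d0 c8t19d0 mirror c8t20d0 c8t21d0 \
      log /dev/ramdisk/ziltest c7t2d0 c7t3d0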
Ross Walker - 2009-Jul-04 15:57 UTC - [zfs-discuss] [storage-discuss] surprisingly poor performance
On Sat, Jul 4, 2009 at 1:38 AM, James Lever <j at jamver.id.au> wrote:

> Interesting reading. Testing seems to show that either it's not
> relevant, or there is something interesting going on with ext3 as a
> separate case.

Barriers are disabled by default on ext3 mounts... Google it and you'll see interesting threads on the LKML. It seems there was some serious performance degradation in using them. A lot of decisions in Linux are made in favor of performance over data consistency.

> The numbers below seem to indicate that either there are no barrier
> issues here, or the BBWC in the RAID controller makes them more-or-less
> invisible, as the ext3 volume below is directly on the exposed LUN
> while the XFS partition is on top of LVM2.
>
> It does, however, show that XFS is much faster for deletes.

Actually it's LVM/EVMS that hides the barrier performance problems, because they act as a barrier filter (they don't support barriers), so running on LVM/EVMS shows great performance - but that is also the #1 reason people complain XFS isn't reliable during a system failure, which is all because logging isn't done properly without barriers!

-Ross
Miles Nordin - 2009-Jul-04 18:30 UTC - [zfs-discuss] [storage-discuss] surprisingly poor performance
>>>>> "rw" == Ross Walker <rswwalker at gmail.com> writes:

    rw> Barriers are disabled by default on ext3 mounts...

http://lwn.net/Articles/283161/
https://bugzilla.redhat.com/show_bug.cgi?id=458936

enabled by default on SLES.  to enable on other distros:

  mount -t ext3 -o barrier=1 <device> <mount point>

(no help if using LVM2)

    rw> A lot of decisions in Linux are made in favor of performance
    rw> over data consistency.

yes, which is why it's worth suspecting knfsd as well.  However I don't think you can sell a Solaris system that performs 1/3 as well on better hardware without a real test case showing the fast system's broken.

The fantastic thing to my view (and I'm NOT being sarcastic) is that, if the fast system really is broken, you've the option of breaking ZFS to match its performance (by disabling the ZIL).  And after you've done this, ZFS is still ahead of the old system, because your cheat hasn't put you at greater risk of corrupting the whole pool, while the other system's cheating _has_.  ...ZFS may still be more fragile overall, but as a tradeoff, it's interesting.

But having this choice puts you in the position of really wanting to know for sure if the other system's broken before you cripple your own system, perhaps destroying your reputation, when you find out some ``unfair'' corner case was rescuing the old system: e.g. suppose the contiguous journal doesn't get reordered by the drive, and disk write buffers still get flushed if the OS crashes, which turns out to be the common failure mode, not cord-yanking.
David Magda - 2009-Jul-04 18:35 UTC - [zfs-discuss] [storage-discuss] surprisingly poor performance
On Jul 4, 2009, at 14:30, Miles Nordin wrote:

> yes, which is why it's worth suspecting knfsd as well.  However I
> don't think you can sell a Solaris system that performs 1/3 as well on
> better hardware without a real test case showing the fast system's
> broken.

It should be noted that RAID-0 performs better than any other RAID level. :)
On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:

> It seems like you may have selected the wrong SSD product to use.
> There seems to be a huge variation in performance (and cost) with
> so-called "enterprise" SSDs. SSDs with capacitor-backed write caches
> seem to be fastest.

Do you have any methods to "correctly" measure the performance of an SSD for the purpose of a slog, and any information on others (other than anecdotal evidence)?

cheers,
James
James Lever wrote:

> Do you have any methods to "correctly" measure the performance of an
> SSD for the purpose of a slog, and any information on others (other
> than anecdotal evidence)?

First, determine the ZIL pattern for your workload using zilstat. Then buy an SSD which efficiently handles a sequential workload that is similar to your workload. For example, if your workload creates a lot of small ZIL iops, then you'll want to favor SSDs which have high small-write iop performance.
 -- richard
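For reference, zilstat is a DTrace-based script (available from Richard's blog); the exact invocation below is from memory, so check the script's usage output. Run it on the server while replaying the NFS tar test, e.g.:

  pfexec ./zilstat 10

and look at the number of ZIL ops per interval and the size distribution of the ZIL writes to see what the slog actually has to absorb.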
On Jul 5, 2009, at 6:06 AM, James Lever <j at jamver.id.au> wrote:

> Do you have any methods to "correctly" measure the performance of an
> SSD for the purpose of a slog, and any information on others (other
> than anecdotal evidence)?

There are two types of SSD drives on the market: the fast-write SLC (single level cell) and the slow-write MLC (multi level cell). MLC is usually used in laptops, as SLC drives over 16GB usually go for $1000+, which isn't cost effective in a laptop. MLC is good for read caching though, and most use it for L2ARC.

I just ordered a bunch of 16GB Imation Pro 7500's (formerly Mtron) from CDW lately for $290 a pop. They are supposed to be fast sequential write SLC drives and so-so at random write. We'll see.

-Ross
James Lever - 2009-Jul-05 23:31 UTC - [zfs-discuss] [storage-discuss] surprisingly poor performance
On 05/07/2009, at 1:57 AM, Ross Walker wrote:

> Barriers are disabled by default on ext3 mounts... Google it and
> you'll see interesting threads on the LKML. It seems there was some
> serious performance degradation in using them. A lot of decisions in
> Linux are made in favor of performance over data consistency.

After doing a fair bit of reading about Linux and write barriers, I'm sure that it's an issue for traditional direct attach storage and for non-battery-backed write cache in RAID cards when the cache is enabled.

Is it actually an issue if you have a hardware RAID controller w/ BBWC enabled and the cache disabled on the HDDs (i.e. correctly configured for data safety)? Should a correctly performing RAID card be ignoring barrier write requests because the write is already on stable storage?

cheers,
James
Ross Walker wrote:

> There are two types of SSD drives on the market: the fast-write SLC
> (single level cell) and the slow-write MLC (multi level cell). MLC is
> usually used in laptops, as SLC drives over 16GB usually go for
> $1000+, which isn't cost effective in a laptop. MLC is good for read
> caching though, and most use it for L2ARC.

Please don't classify them as MLC vs SLC or you'll find yourself totally confused by the modern MLC designs which use SLC as a cache. Be happy with specs: random write iops, slow or fast.
 -- richard
On 06/07/2009, at 9:31 AM, Ross Walker wrote:

> I just ordered a bunch of 16GB Imation Pro 7500's (formerly Mtron)
> from CDW lately for $290 a pop. They are supposed to be fast
> sequential write SLC drives and so-so at random write. We'll see.

That will be interesting to see. The Samsung drives we have are 50GB (64GB) SLC and apparently 2nd generation.

For a slog, is random write even an issue? Or is it just the mechanism used to measure the IOPS performance of a typical device? AFAIUI, the ZIL is used as a ring buffer - how does that work with an SSD?

All this pain really makes me think the only sane slog is one that is RAM based and has enough capacitance to either make itself permanent or move the data to something permanent before failing (FusionIO, DDRdrive, for example).
On Jul 5, 2009, at 7:47 PM, Richard Elling <richard.elling at gmail.com> wrote:

> Please don't classify them as MLC vs SLC or you'll find yourself
> totally confused by the modern MLC designs which use SLC as a cache.
> Be happy with specs: random write iops, slow or fast.

Thanks for the info. SSD is still very much a moving target.

I worry about SSD drives' long term reliability. If I mirror two of the same drives, what do you think the probability of a double failure will be in 3, 4, 5 years?

What I would really like to see is zpool's ability to fail back to an inline ZIL in the event an external one fails or is missing. Then one can remove a slog from a pool and add a different one if necessary, or just remove it altogether.

-Ross
Ross Walker wrote:

> I worry about SSD drives' long term reliability. If I mirror two of
> the same drives, what do you think the probability of a double failure
> will be in 3, 4, 5 years?

Assuming there are no common cause faults (eg firmware), you should expect an MTBF of 2-4M hours. But I can't answer the question without knowing more info. It seems to me that you are really asking for the MTTDL, which is a representation of the probability of data loss. I describe these algorithms here:
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

Since the vendors do not report UER rates, which makes sense for flash devices, the MTTDL[1] model applies. You can do the math yourself, once you figure out what your MTTR might be. For enterprise systems, we usually default to 8 hour response, but for home users you might plan on a few days, so you can take a vacation every once in a while. For 48 hours MTTR:
  2M hours MTBF -> MTTDL[1] = 4,756,469 years
  4M hours MTBF -> MTTDL[1] = 19,025,875 years

Most folks find it more intuitive to look at the probability per year in the form of a percent, so
  2M hours MTBF -> Annual DL rate = 0.000021%
  4M hours MTBF -> Annual DL rate = 0.000005%

If you want to more accurately model based on endurance, then you'll need to know the expected write rate and the nature of the wear leveling mechanism. It can be done, but the probability is really, really small.

> What I would really like to see is zpool's ability to fail back to an
> inline ZIL in the event an external one fails or is missing. Then one
> can remove a slog from a pool and add a different one if necessary,
> or just remove it altogether.

It already does this, with caveats. What you might also want is CR 6574286, removing a slog doesn't work.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286
 -- richard
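For anyone who wants to reproduce the arithmetic, the figures above appear to come from the standard two-way mirror model MTTDL[1] = MTBF^2 / (N * (N-1) * MTTR); e.g. with MTBF = 2M hours, N = 2 and MTTR = 48 hours:

  (2,000,000 h)^2 / (2 * 1 * 48 h) = ~4.17e10 hours = ~4,756,469 years (at 8,760 h/year)
  annual data loss rate = 1 / 4,756,469 = ~0.000021% per year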
On Jul 5, 2009, at 9:20 PM, Richard Elling <richard.elling at gmail.com> wrote:

> Assuming there are no common-cause faults (e.g. firmware), you should
> expect an MTBF of 2-4M hours. [...]
>
> If you want to model more accurately based on endurance, then you'll
> need to know the expected write rate and the nature of the
> wear-leveling mechanism. It can be done, but the probability is
> really, really small.

Wow, that's detailed - interested in a career in actuarial analysis? :-)

Thanks, I'll try to wrap my mind around this during daylight hours
after my caffeine fix.

>> What I would really like to see is zpool's ability to fail back to
>> an inline ZIL in the event an external one fails or is missing.
>> Then one could remove a slog from a pool and add a different one if
>> necessary, or just remove it altogether.
>
> It already does this, with caveats. What you might also want is
> CR 6574286, removing a slog doesn't work.
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286

Well, I'll keep an eye on when the fix gets out, and then for it to get
into Solaris 10.

-Ross
James Lever wrote:
> We also have a PERC 6/E w/512MB BBWC to test with or fall back to if
> we go with a Linux solution.

Have you tried putting the slog on this controller, either as an SSD or
regular disk? It's supported by the mega_sas driver, x86 and amd64 only.

-- James Andrewartha | Sysadmin, Data Analysis Australia Pty Ltd
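Concretely, the suggestion amounts to exporting one disk (or a small
virtual disk) from the PERC 6/E as its own LUN and attaching it to the
pool as a log vdev; a minimal sketch, with c9t0d0 as a placeholder for
however the mega_sas virtual disk actually enumerates on the system:

  pfexec zpool add fastdata log c9t0d0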
On 07/07/2009, at 8:20 PM, James Andrewartha wrote:

> Have you tried putting the slog on this controller, either as an SSD
> or regular disk? It's supported by the mega_sas driver, x86 and amd64
> only.

What exactly are you suggesting here? Configure one disk on this array
as a dedicated ZIL? Would that improve performance any over using all
disks with an internal ZIL?

I have now done some tests with the PERC 6/E, both as RAID10 (all
devices as single-disk RAID0 LUNs, ZFS mirror/stripe config) and as a
hardware RAID5, both with an internal ZIL.

RAID10 (10 disks, 5 mirror vdevs)
  create  2m14.448s
  unlink  0m54.503s

RAID5 (9 disks, 1 hot spare)
  create  1m58.819s
  unlink  0m48.509s

Unfortunately, Linux on the same RAID5 array using XFS still seems
significantly faster.

Linux RAID5 (9 disks, 1 hot spare), XFS
  create  1m30.911s
  unlink  0m38.953s

Is there a way to disable the write barrier in ZFS in the way you can
with Linux filesystems (-o barrier=0)? Would this make any difference?
After much consideration, the lack of barrier capability makes no
difference to filesystem stability in the scenario where you have a
battery-backed write cache.

With identical hardware and configurations, I think this is now a fair
apples-to-apples test. I'm now wondering if XFS is just the faster
filesystem... (not the most practical management solution, just speed).

cheers,
James
James Lever wrote:
> On 07/07/2009, at 8:20 PM, James Andrewartha wrote:
>
>> Have you tried putting the slog on this controller, either as an SSD
>> or regular disk? It's supported by the mega_sas driver, x86 and
>> amd64 only.
>
> What exactly are you suggesting here? Configure one disk on this
> array as a dedicated ZIL? Would that improve performance any over
> using all disks with an internal ZIL?

I was mainly thinking about using the battery-backed write cache to
eliminate the NFS latency. There's not much difference between an
internal and a dedicated ZIL if the disks are the same and on the same
controller - dedicated ZIL wins come from using SSDs and battery-backed
cache.
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Separate_Log_Devices

> Is there a way to disable the write barrier in ZFS in the way you can
> with Linux filesystems (-o barrier=0)? Would this make any difference?

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
might help if the RAID card is still flushing to disk when ZFS asks it
to, even though the data is safe in the battery-backed cache.

-- James Andrewartha | Sysadmin, Data Analysis Australia Pty Ltd
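The coarse workaround described there is the system-wide
zfs_nocacheflush tunable, which stops ZFS from issuing cache-flush
requests at all. A minimal sketch, assuming every device in the pool
(including the slog) sits behind non-volatile or battery-backed cache -
otherwise this risks losing recent writes, or worse, on power failure.

To change it at runtime (reverts at reboot):

  # echo zfs_nocacheflush/W0t1 | mdb -kw

Or persistently, by adding this line to /etc/system:

  set zfs:zfs_nocacheflush = 1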
You might want to try one thing I just noticed - wrap the log device
inside an SVM (DiskSuite) metadevice. It does wonders for performance
on my test server (Sun Fire X4240)... I do wonder what the downsides
might be (except for having to fiddle with DiskSuite again). I.e.:

# zpool create TEST c1t12d0
# format c1t13d0    (create a 4GB partition 0)
# metadb -f -a -c 3 c1t13d0s0
# metainit d0 1 1 c1t13d0s0
# zpool add TEST log /dev/md/dsk/d0

In my case the disks involved above are:
  c1t12d0  146GB 10krpm SAS disk
  c1t13d0  32GB Intel X25-E SLC SSD SATA disk

Without the log added, running 'gtar zxf emacs-22.3.tar.gz' over NFS
from another server takes 1:39.2 (almost 2 minutes). With c1t13d0s0
added directly as a log it takes 1:04.2, but with the same c1t13d0s0
wrapped inside an SVM metadevice the same operation takes 10.4
seconds...

1:39 vs 0:10 is a pretty good speedup, I think...
Oh, and for completeness: if I wrap 'c1t12d0s0' inside an SVM
metadevice too, and use that to create the "TEST" zpool (without a
log), the same test command runs in 36.3 seconds... I.e.:

# metadb -f -a -c 3 c1t13d0s0
# metainit d0 1 1 c1t13d0s0
# metainit d2 1 1 c1t12d0s0
# zpool create TEST /dev/md/dsk/d2

If I then add a log to that pool:

# zpool add TEST log /dev/md/dsk/d0

the same test (gtar zxf emacs-22.3.tar.gz) runs in 10.1 seconds...
(I.e., not much better than just using a raw disk + an SVM-encapsulated
log.)
>>>>> "pe" == Peter Eriksson <no-reply at opensolaris.org> writes:pe> With c1t15d0s0 added as log it takes 1:04.2, but with the same pe> c1t15d0s0 added, but wrapped inside a SVM metadevice the same pe> operation takes 10.4 seconds... so now SVM discards cache flushes, too? great. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090708/1f87d6a4/attachment.bin>
I wonder exactly what's going on. Perhaps it is the cache flushes that
are causing the SCSI errors when trying to use the SSD (Intel X25-E and
X25-M) disks?

Btw, I'm seeing the same behaviour on both an X4500 (SATA/Marvell
controller) and the X4240 (SAS/LSI controller). Well, almost - on the
X4500 I didn't see the errors printed on the console, but things
behaved strangely, and I did see the same speedup.

If SVM silently disables cache flushes, then perhaps there should be a
HUGE warning printed somewhere (ZFS FAQ? Solaris documentation? In
zpool when creating/adding devices?) about using ZFS with SVM.

I wonder what the potential danger might be _if_ SVM disables cache
flushes for the slog... Sure, that might mean a missed update on the
filesystem, but since the data disks in the pool are raw disk devices,
the ZFS filesystem itself should be stable (minus any missed updates).
I think I can live with that. What I don't want is a corrupt 16TB zpool
after a power outage...
The things I'd pay most attention to would be single-threaded 4K, 32K,
and 128K writes to the raw device. Make sure the SSD has a capacitor,
and enable the write cache on the device.

-r

On 5 Jul 09, at 12:06, James Lever wrote:

> On 04/07/2009, at 3:08 AM, Bob Friesenhahn wrote:
>
>> It seems like you may have selected the wrong SSD product to use.
>> There seems to be a huge variation in performance (and cost) with
>> so-called "enterprise" SSDs. SSDs with capacitor-backed write
>> caches seem to be fastest.
>
> Do you have any methods to "correctly" measure the performance of an
> SSD for the purpose of a slog, and any information on others (other
> than anecdotal evidence)?
>
> cheers,
> James
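One crude way to get those single-threaded numbers is dd against the
candidate slog's rdsk node at each block size - a rough sketch only,
with cXtYdZs0 as a placeholder for the slice under test (this
overwrites the slice and does not reproduce the ZIL's synchronous
write/cache-flush pattern, so treat the results as an upper bound):

  for bs in 4k 32k 128k
  do
    echo "bs=$bs"
    pfexec dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=$bs count=10000
  done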
> SSDs with capacitor-backed write caches seem to be fastest.

How do you distinguish them from SSDs without one? I have never seen
this explicitly mentioned in the specs.
roland writes:
>> SSDs with capacitor-backed write caches seem to be fastest.
>
> How do you distinguish them from SSDs without one? I have never seen
> this explicitly mentioned in the specs.

They probably don't have one, then (or they should fire their entire
marketing department). A capacitor allows the device to function safely
with the write cache enabled even while ignoring the cache flushes sent
by ZFS. If the device firmware is not set up to ignore the flushes,
better make sure that sd.conf is set up to not send them, otherwise you
lose the benefit.

Setting up sd.conf is described in the ZFS Evil Tuning Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#How_to_Tune_Cache_Sync_Handling_Per_Storage_Device

-r
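To confirm that the write cache is actually enabled on the SSD (per
Roch's earlier advice), format's expert mode can be used; a sketch,
assuming the device exposes its caching mode pages to the sd driver
(the exact menu entries can vary by device and firmware):

  # format -e
  (select the SSD from the device list)
  format> cache
  cache> write_cache
  write_cache> display
  write_cache> enable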