I have a 24 x 1TB system being used as an NFS file server. Seagate SAS disks connected via an LSI 9211-8i SAS controller, disk layout 2 x 11-disk RAIDZ2 + 2 spares. I am using 2 x DDRdrive X1s as the ZIL. When we write anything to it, the writes are always very bursty, like this:

               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0    232      0  29.0M
xpool        488K  20.0T      0    101      0  12.7M
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0     50      0  6.37M
xpool        488K  20.0T      0    477      0  59.7M
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool        488K  20.0T      0      0      0      0
xpool       74.7M  20.0T      0    702      0  76.2M
xpool       74.7M  20.0T      0    577      0  72.2M
xpool       74.7M  20.0T      0    110      0  13.9M
xpool       74.7M  20.0T      0      0      0      0
xpool       74.7M  20.0T      0      0      0      0
xpool       74.7M  20.0T      0      0      0      0
xpool       74.7M  20.0T      0      0      0      0

Whenever you see 0, the write is just hanging. What I would like to see is at least some writing happening every second. What can I look at for this issue?

Thanks
--
This message posted from opensolaris.org
I think you are seeing ZFS store up the writes, coalesce them, then flush them to disk every 30 seconds. Unless the writes are synchronous, the ZIL won't be used; the writes will be cached instead, then flushed. If you think about it, this is far more sane than flushing to disk every time the write() system call is used.
--
This message posted from opensolaris.org
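A minimal way to confirm this transaction-group behaviour on the server itself, as a sketch: this assumes an OpenSolaris build where the flush interval is exposed as the zfs_txg_timeout kernel variable, and reuses the xpool name from the original post.

    # Watch the pool commit in bursts; 1-second samples:
    zpool iostat xpool 1

    # Read the current transaction group timeout (in seconds) from the live kernel:
    echo "zfs_txg_timeout/D" | mdb -k

If the bursts in zpool iostat line up with that interval (or arrive sooner when enough dirty data accumulates), it is just the normal txg flush rather than anything hanging.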
On Wed, 6 Oct 2010, Marty Scholes wrote:
> If you think about it, this is far more sane than flushing to disk
> every time the write() system call is used.

Yes, it dramatically diminishes the number of copy-on-write writes and improves the pool layout efficiency. It also saves energy.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
The NFS client that we're using always uses O_SYNC, which is why it was critical for us to use the DDRdrive X1 as the ZIL. I wasn't clear about the full system we're using; my apologies. It is:

OpenSolaris snv_134
Motherboard: SuperMicro X8DAH
RAM: 72GB
CPU: Dual Intel 5503 @ 2.0GHz
ZIL: DDRdrive X1 (two of these, independent and not mirrored)
Drives: 24 x Seagate 1TB SAS, 7200 RPM
Network: 3 x gigabit links as LACP + 1 gigabit backup, IPMP on top of those

The output I posted is from zpool iostat, and I used that because it corresponds to what users are seeing. Whenever zpool iostat shows write activity, file copies to the system proceed as expected. As soon as zpool iostat shows no activity, the writes all pause. The simple test case is to copy a CD-ROM ISO image to the server while running zpool iostat.
--
This message posted from opensolaris.org
Figured it out - it was the NFS client. I used snoop and then some dtrace magic to prove that the client (which was using O_SYNC) was sending very bursty requests to the server. I tried a number of other NFS clients with O_SYNC as well and got excellent performance when they were configured correctly. Just for fun I disabled the DDRdrive X1s (a pair of them) that I use for the ZIL, and performance tanked across the board when using O_SYNC. I can't recommend the DDRdrive X1 enough as a ZIL! There is a great article on this behavior here:

http://blogs.sun.com/brendan/entry/slog_screenshots

Thanks for the help all!
--
This message posted from opensolaris.org
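In case it helps anyone else chasing the same symptom, here is a sketch of the sort of one-liner that makes the client burstiness visible on the server. It assumes a build that ships the NFSv3 DTrace provider; use the nfsv4 probes instead if your clients mount v4.

    # Count NFSv3 WRITE requests arriving per second:
    dtrace -n '
      nfsv3:::op-write-start { @writes = count(); }
      tick-1sec { printa("NFSv3 writes/sec: %@d\n", @writes); clear(@writes); }'

A well-behaved client shows a fairly steady count; a bursty one shows a pile of writes followed by seconds of nothing, matching the gaps in zpool iostat.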
Thanks for posting your findings. What was incorrect about the client's config?

On Oct 7, 2010 4:15 PM, "Eff Norwood" <smith at jsvp.com> wrote:
> Figured it out - it was the NFS client. I used snoop and then some dtrace
> magic to prove that the client (which was using O_SYNC) was sending very
> bursty requests to the server. [...]
The NFS client in this case was the VMware ESXi 4.1 release build. What happened is that the file uploader behavior was changed in 4.1 to prevent I/O contention with the VM guests. That means when you go to upload something to the datastore, it only sends chunks of the file instead of streaming it all at once like it did in ESXi 4.0. To end users something appeared to be broken, because file uploads now took 95 seconds instead of 30. It turns out that is by design in 4.1. This is the behavior *only* for the uploader and not for the VM guests; their I/O is as expected.

I have to say as a side note, the DDRdrive X1s make a night and day difference with VMware. If you use VMware via NFS, I highly recommend the X1s as the ZIL. Otherwise the VMware O_SYNC (Stable = FSYNC) will kill your performance dead. We also tried SSDs as the ZIL, which worked ok until they got full, then performance tanked. As I have posted before, SSDs as your ZIL - don't do it!
--
This message posted from opensolaris.org
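If anyone wants to see the stable flag on the wire, a rough sketch (the interface name e1000g0 and the client address 192.168.1.50 are placeholders for your own ESXi host): snoop's verbose NFS decode prints the Stable field on each WRITE call.

    # Capture NFS traffic from the ESXi host and pull out the WRITE calls
    # and their stable flag (FSYNC / DATA_SYNC / ASYNC):
    snoop -d e1000g0 -v host 192.168.1.50 and port 2049 | egrep 'Proc = 7|Stable'

The exact decode strings can vary a little between builds, so treat the egrep pattern as a starting point.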
On Tue, Oct 12, 2010 at 12:09:44PM -0700, Eff Norwood wrote:
> The NFS client in this case was the VMware ESXi 4.1 release build. What
> happened is that the file uploader behavior was changed in 4.1 to prevent
> I/O contention with the VM guests. That means when you go to upload
> something to the datastore, it only sends chunks of the file instead of
> streaming it all at once like it did in ESXi 4.0. To end users something
> appeared to be broken, because file uploads now took 95 seconds instead
> of 30. It turns out that is by design in 4.1. This is the behavior *only*
> for the uploader and not for the VM guests; their I/O is as expected.

Interesting.

> I have to say as a side note, the DDRdrive X1s make a night and day
> difference with VMware. If you use VMware via NFS, I highly recommend
> the X1s as the ZIL. Otherwise the VMware O_SYNC (Stable = FSYNC) will
> kill your performance dead. We also tried SSDs as the ZIL, which worked
> ok until they got full, then performance tanked. As I have posted
> before, SSDs as your ZIL - don't do it!

We run SSDs as ZIL here exclusively on what I'd consider fairly busy VMware datastores and have never encountered this. How would one know how "full" an SSD being used as a ZIL is? I was under the impression that even using a full 32GB X25-E was overkill, space-wise, for typical ZIL functionality...

Ray
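On the "how full" question, one easy thing to watch from the ZFS side is per-vdev iostat, which lists a dedicated log device on its own line with its allocated space (the pool name xpool is just carried over from the original post):

    # Per-vdev view, refreshed every second; the log device appears under
    # "logs" with its own alloc/free and ops columns:
    zpool iostat -v xpool 1

That shows how much of the slog is holding not-yet-committed ZIL blocks at any instant, which is normally tiny compared to the device. It says nothing about how much of the flash has been written over its lifetime, which is what the wear and garbage-collection argument is really about.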
>>>>> "en" == Eff Norwood <smith at jsvp.com> writes:en> We also tried SSDs as the ZIL which worked ok until they got en> full, then performance tanked. As I have posted before, SSDs en> as your ZIL - don''t do it! yeah, iirc the thread went back and forth between you and I for a few days, something like this, you: SSD''s work fine at first, then slow down, see this anandtech article. We got bit by this. me: That article is two years old. Read this other article which is one year old and explains the problem is fixed if you buy current gen2 intel or sandforce-based SSD. you: Well absent test results from you I think we will just have to continue believing that all SSD''s gradually slow down like I said, though I would love to be proved wrong. me: You haven''t provided any test results yourself nor even said what drive you''re using. We''ve both just cited anandtech, and my citation''s newer than yours. you: I welcome further tests that prove the DDRDrive is not the only suitable ZIL, but absent these tests we have to assume I''m right that it is. silly! slowdowns with age: http://www.pcper.com/article.php?aid=669 http://www.anandtech.com/show/2738/15 slowdowns fixed: http://www.anandtech.com/show/2899/8 ``With the X25-M G2 Intel managed to virtually eliminate the random-write performance penalty on a sequentially filled drive. In other words, if you used an X25-M G2 as a normal desktop drive, 4KB random write performance wouldnC"BB http://www.anandtech.com/show/2738/25 t really degrade over time. Even without TRIM.'''' note this is not advice to buy sandforce for slog because I don''t know if anyone''s tested it respects flush-cache commands and suspect it may drop them. sumary: There''s probably been major, documented shifts in the industry between when you tested and now, but no one knows because you don''t even tell what you tested or how---you just spread FUD and flog the DDRDrive and then say ``do research to prove me wrong or else my hazy statement stands.'''' bad science. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20101012/2b328299/attachment.bin>
> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Miles Nordin
> Sent: Tuesday, October 12, 2010 5:15 PM
> To: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] Bursty writes - why?
>
> >>>>> "en" == Eff Norwood <smith at jsvp.com> writes:
>
>     en> We also tried SSDs as the ZIL which worked ok until they got
>     en> full, then performance tanked. As I have posted before, SSDs
>     en> as your ZIL - don't do it!
>
> [...]
>
> summary: There have probably been major, documented shifts in the industry
> between when you tested and now, but no one knows, because you don't even
> say what you tested or how---you just spread FUD and flog the DDRdrive,
> and then say ``do research to prove me wrong or else my hazy statement
> stands.'' bad science.

Another article concerning SandForce performance:

http://www.anandtech.com/show/3667/6

Evidently, since the SandForce controllers do deduplication to reduce writes, write performance with highly random data suffers relative to 'normal' data. In particular:

"Sequential write speed actually takes the biggest hit of them all. At only 144.4MB/s if you're writing highly random data sequentially you'll find that the SF-1200/SF-1500 performs worse than just about any other SSD controller on the market. Only the X25-M is slower. While the impact to read performance and random write performance isn't terrible, sequential performance does take a significant hit on these SandForce drives."

Unfortunately, this article doesn't actually compare random-data write performance between different controllers; it just says that the SandForce's random-data write performance drops more, relative to its 'normal' data write performance, than everything but the X25-M. Do other controllers dedup written data like SandForce does?
When I read this I thought that it kind of eliminated SandForce drives from consideration as SLOG devices, which is a pity, because the OCZ Vertex 2 EX or Vertex 2 Pro SAS otherwise look like good candidates.

-Will
On Tue, 12 Oct 2010, Saxon, Will wrote:
> When I read this I thought that it kind of eliminated SandForce
> drives from consideration as SLOG devices, which is a pity, because
> the OCZ Vertex 2 EX or Vertex 2 Pro SAS otherwise look like good
> candidates.

For obvious reasons, the SLOG is designed to write sequentially. Otherwise it would offer much less benefit. Maybe this random-write issue with SandForce would not be a problem?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Oct 12, 2010, at 3:31 PM, Bob Friesenhahn wrote:
> For obvious reasons, the SLOG is designed to write sequentially.
> Otherwise it would offer much less benefit. Maybe this random-write
> issue with SandForce would not be a problem?

Isn't writing from cache to disk designed to be sequential, while writes to the ZIL/SLOG will be more random (in order to commit quickly)?

Scott Meilicke
> Maybe this random-write issue with SandForce would not be a
> problem?

It is most definitely a problem, as one needs to question the conventional assertion of a sequential write pattern. I presented some findings recently at the Nexenta Training Seminar in Rotterdam. Here is a link to an excerpt (the full presentation is available to those interested, email cgeorge at ddrdrive dot com):

http://www.ddrdrive.com/zil_iopattern_excerpt.pdf

In summary, a sequential write pattern is found for a pool with only a single file system. But as additional file systems are added, the resultant (or aggregate) write pattern trends toward random - over 50% random with a pool containing just 5 file systems. This makes intuitive sense, knowing that each file system has its own ZIL and they all share the dedicated log (ZIL Accelerator).

Best regards,

Christopher George
Founder/CTO
www.ddrdrive.com
--
This message posted from opensolaris.org
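For anyone wanting to check the write pattern on their own slog, a rough sketch using the generic io provider; the device name sd3 is only a placeholder for whatever your log device shows up as (see iostat -xn). Monotonically increasing block numbers suggest a sequential pattern; jumping offsets suggest a random one.

    # Print the starting block number of every write issued to the log device.
    # Post-process (sort, diff consecutive values) to judge sequential vs. random.
    dtrace -n '
      io:::start
      /!(args[0]->b_flags & B_READ) && args[1]->dev_statname == "sd3"/
      {
        printf("%d\n", args[0]->b_blkno);
      }'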
Bob Friesenhahn wrote:
> On Tue, 12 Oct 2010, Saxon, Will wrote:
>> When I read this I thought that it kind of eliminated SandForce
>> drives from consideration as SLOG devices, which is a pity, because
>> the OCZ Vertex 2 EX or Vertex 2 Pro SAS otherwise look like good
>> candidates.
>
> For obvious reasons, the SLOG is designed to write sequentially.
> Otherwise it would offer much less benefit. Maybe this random-write
> issue with SandForce would not be a problem?

The observation was that the SandForce controllers perform more poorly than others when sequentially writing highly random data, not with random writes of 'normal' data.

-Will
On Tue, October 12, 2010 18:31, Bob Friesenhahn wrote:
> On Tue, 12 Oct 2010, Saxon, Will wrote:
>> Another article concerning SandForce performance:
>>
>> http://www.anandtech.com/show/3667/6
>>
>> [...]
>>
>> When I read this I thought that it kind of eliminated SandForce
>> drives from consideration as SLOG devices, which is a pity, because
>> the OCZ Vertex 2 EX or Vertex 2 Pro SAS otherwise look like good
>> candidates.
>
> For obvious reasons, the SLOG is designed to write sequentially.
> Otherwise it would offer much less benefit. Maybe this random-write
> issue with SandForce would not be a problem?

The other thing is that the article talks about an SF-1200-based drive, and an MLC one to boot. When SandForce originally came up on this list a while ago, I got the general impression that while SF-1200-based devices were fine for L2ARC caches, the consensus was that you would want an SF-1500-based device for slogs. Not only does the SF-1500 get you better write IOPS, the devices that used it also tended to have batteries or super-caps as well, which helped with data integrity in the case of unexpected power outages. SF-1500 units were also usually available with SLC flash, which would help with longevity given the write-oriented nature of slogs. See:

http://www.anandtech.com/show/3661/

So while the 'dedupe article' is informative, and the conclusions about slogs and SF-1200-based devices appear sound, it's a bit beside the point IMHO. Sadly there don't seem to be many SSDs out there that you /really/ want to use for slogs: there are many that you can make do with (especially in mirrored configurations), but few that are ideal.