So, I'm still beating my head against the wall trying to find our performance bottleneck with NFS on our Thors.

We've got two Intel SSDs as ZIL devices. Cache flushing is still enabled, as are the write caches on all 48 disk devices.

What I'm thinking of doing is disabling all the write caches, and disabling cache flushing.

What would this mean for the safety of data in the pool? And would this even do anything to address the performance issue?

-Greg
If all write caches are truly disabled, then disabling the cache flush won't affect the safety of your data. It will change your performance characteristics, almost certainly for the worse.

-- This message posted from opensolaris.org
Multiple Thors (more than 2?), with performance problems. Maybe it's the common denominator - the network.

Can you run local ZFS IO loads and determine if performance is expected when NFS and the network are out of the picture?

Thanks,
/jim

Greg Mason wrote:
> So, I'm still beating my head against the wall, trying to find our
> performance bottleneck with NFS on our Thors.
>
> We've got two Intel SSDs as ZIL devices. Cache flushing is still
> enabled, as are the write caches on all 48 disk devices.
>
> What I'm thinking of doing is disabling all write caches, and disabling
> the cache flushing.
>
> What would this mean for the safety of data in the pool?
>
> And, would this even do anything to address the performance issue?
>
> -Greg
The funny thing is that I'm showing a performance improvement over write caches + cache flushes.

The only way these pools are being accessed is over NFS - well, at least the only way I care about when it comes to high performance. I'm pretty sure it would mean a performance hit locally, but I don't care about local disk performance, only about performance over NFS.

Anton B. Rang wrote:
> If all write caches are truly disabled, then disabling the cache flush won't affect the safety of your data.
>
> It will change your performance characteristics, almost certainly for the worse.
This problem only manifests itself when dealing with many small files over NFS. There is no throughput problem with the network.

I've run tests with the write cache disabled on all disks and the cache flush disabled. I'm using two Intel SSDs as ZIL devices. This setup is faster than using the two Intel SSDs with write caches enabled on all disks and with the cache flush enabled.

My test used to run around 3.5 to 4 minutes; now it completes in about 2.5 minutes. I still think this is a bit slow, but I still have quite a bit of testing to perform. I'll keep the list updated with my findings.

I've already established, both via this list and through other research, that ZFS has performance issues over NFS when dealing with many small files. This may be an issue with NFS itself, where NVRAM-backed storage is needed for decent performance with small files. Typically such an NVRAM cache is supplied by a hardware RAID controller in a disk shelf.

I find it very hard to explain to a user why an "upgrade" is a step down in performance. For the users these Thors are going to serve, such a drastic performance hit is a deal breaker...

I've done my homework on this issue: I've ruled out the network as an issue, as well as the NFS clients. I've narrowed my particular performance issue down to the ZIL, and how well ZFS plays with NFS.

-Greg

Jim Mauro wrote:
> Multiple Thors (more than 2?), with performance problems.
> Maybe it's the common denominator - the network.
>
> Can you run local ZFS IO loads and determine if performance
> is expected when NFS and the network are out of the picture?
>
> Thanks,
> /jim
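[For readers wanting to reproduce the shape of this workload: Greg's actual test isn't shown, so the mount point and file count below are made-up placeholders. A minimal NFS-client sketch - every create and close forces synchronous work on the server, so elapsed time tracks ZIL latency rather than raw throughput.]

    # run on an NFS client against the Thor export (hypothetical mount point)
    cd /mnt/thor/smallfile-test || exit 1
    time sh -c '
        i=0
        while [ $i -lt 10000 ]; do
            printf "x" > file.$i     # create + write + close of a tiny file
            i=$((i+1))
        done
    '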
On 01/29/09 21:32, Greg Mason wrote:
> This problem only manifests itself when dealing with many small files
> over NFS. There is no throughput problem with the network.
<snip>
> I find it very hard to explain to a user why an "upgrade" is a step down
> in performance. For the users these Thors are going to serve, such a
> drastic performance hit is a deal breaker...

Perhaps I missed something, but what was your previous setup? I.e. what did you upgrade from?

Neil.
A Linux NFS file server, with a few terabytes of fibre-attached disk, using XFS.

I'm trying to get these Thors to perform at least as well as the current setup. A performance hit is very hard to explain to our users.

> Perhaps I missed something, but what was your previous setup?
> I.e. what did you upgrade from?
> Neil.
On Fri, Jan 30, 2009 at 8:24 AM, Greg Mason <gmason at msu.edu> wrote:
> A Linux NFS file server, with a few terabytes of fibre-attached disk,
> using XFS.
>
> I'm trying to get these Thors to perform at least as well as the current
> setup. A performance hit is very hard to explain to our users.

What type of spindles were in the FC attached disk?

--Tim
I should also add that this "creating many small files" issue is the ONLY case where the Thors are performing poorly, which is why I'm focusing on it.

Greg Mason wrote:
> A Linux NFS file server, with a few terabytes of fibre-attached disk,
> using XFS.
>
> I'm trying to get these Thors to perform at least as well as the current
> setup. A performance hit is very hard to explain to our users.
>
>> Perhaps I missed something, but what was your previous setup?
>> I.e. what did you upgrade from?
>> Neil.
7200 RPM SATA disks.

Tim wrote:
> On Fri, Jan 30, 2009 at 8:24 AM, Greg Mason <gmason at msu.edu> wrote:
>
>     A Linux NFS file server, with a few terabytes of fibre-attached disk,
>     using XFS.
>
>     I'm trying to get these Thors to perform at least as well as the current
>     setup. A performance hit is very hard to explain to our users.
>
> What type of spindles were in the FC attached disk?
>
> --Tim
> This problem only manifests itself when dealing with many small files
> over NFS. There is no throughput problem with the network.

But there could be a _latency_ issue with the network.

[snip]

> I've done my homework on this issue, I've ruled out the network as an
> issue, as well as the NFS clients. I've narrowed my particular
> performance issue down to the ZIL, and how well ZFS plays with NFS.

Great. Good luck.

/jim
Jim Mauro wrote:
>> This problem only manifests itself when dealing with many small files
>> over NFS. There is no throughput problem with the network.
>
> But there could be a _latency_ issue with the network.

If there was a latency issue, we would see such a problem with our existing file server as well, which we do not. We'd also have much greater problems than just file server performance.

So, like I've said, we've ruled out the network as an issue.
> If there was a latency issue, we would see such a problem with our
> existing file server as well, which we do not. We'd also have much
> greater problems than just file server performance.
>
> So, like I've said, we've ruled out the network as an issue.

I should also add that I've tested these Thors with the ZIL disabled, and they scream! With the cache flush disabled, they also do quite well.

The specific issue I'm trying to solve is the ZIL being slow when using NFS.

I really don't want to have to do something drastic like disabling the ZIL to get the performance I need...
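[For anyone wanting to repeat that kind of test: on builds of this vintage the usual knob is the zil_disable tunable. A sketch only, not a recommendation - it applies to every dataset on the host and only takes effect for filesystems mounted after it is set.]

    # live, until next reboot -- remount or re-share the filesystems afterwards:
    echo zil_disable/W0t1 | mdb -kw

    # persistent -- add to /etc/system (where comment lines start with "*"):
    set zfs:zil_disable = 1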
You have SSDs for the ZIL (logzilla) enabled, and ZIL IO is what is hurting your performance... Hmmm....

I'll ask the stupid question (just to get it out of the way) - is it possible that the logzilla is undersized?

Did you gather data using Richard Elling's zilstat (included below)?

Thanks,
/jim

#! /usr/bin/ksh -p
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License (the "License").
# You may not use this file except in compliance with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# CDDL HEADER END
# Portions Copyright 2009 Sun Microsystems, Inc.
#
# File: zilstat.d
# Author: Richard.Elling at sun.com
#
# This dtrace program will help identify the ZIL activity by sampling
# writes sent to the ZIL.
# output:
#   [TIME]
#   BYTES       - total bytes written to ZIL over the interval
#   BYTES/S     - bytes/s written to ZIL over the interval
#   MAX-BYTES/S - maximum rate during any 1-second sample
#

##############################
# --- Process Arguments ---
#
# TODO: clean up args

### default variables
opt_pool=0
opt_time=0
filter=0
pool=
lines=-1
interval=1
count=-1

### process options
while getopts hl:p:t name
do
    case $name in
    l)  lines=$OPTARG ;;
    p)  opt_pool=1; pool=$OPTARG ;;
    t)  opt_time=1 ;;
    h|?) ME=$(basename $0)
        cat <<-END >&2
    USAGE: $ME [-t] [-l linecount] [-p poolname] [interval [count]]
        -t              # print timestamp
        -l linecount    # print header every linecount lines (default=only once)
        -p poolname     # only look at poolname
    examples:
        $ME             # default output, 1 second samples
        $ME 10          # 10 second samples
        $ME 10 6        # print 6 x 10 second samples
        $ME -p rpool    # show ZIL stats for rpool only
    output:
        [TIME]
        BYTES       - total bytes written to ZIL over the interval
        BYTES/S     - bytes/s written to ZIL over the interval
        MAX-BYTES/S - maximum rate during any 1-second sample
END
        exit 1
    esac
done
shift $(( $OPTIND - 1 ))

### option logic
if [[ "$1" > 0 ]]; then
    interval=$1; shift
fi
if [[ "$1" > 0 ]]; then
    count=$1; shift
fi
if (( opt_pool )); then
    filter=1
fi

##############################
# --- Main Program, DTrace ---
/usr/sbin/dtrace -n '
#pragma D option quiet
inline int OPT_time = '$opt_time';
inline int OPT_pool = '$opt_pool';
inline int INTERVAL = '$interval';
inline int LINES = '$lines';
inline int COUNTER = '$count';
inline int FILTER = '$filter';
inline string POOL = "'$pool'";

dtrace:::BEGIN
{
    /* starting values */
    counts = COUNTER;
    secs = INTERVAL;
    line = 0;
    last_event[""] = 0;
    nused = 0;
    max_per_sec = 0;
    nused_per_sec = 0;
}

/*
 * collect info when zil_lwb_write_start fires
 */
fbt::zil_lwb_write_start:entry
/OPT_pool == 0 || POOL == args[0]->zl_dmu_pool->dp_spa->spa_name/
{
    nused += args[1]->lwb_nused;
    nused_per_sec += args[1]->lwb_nused;
}

/*
 * Timer
 */
profile:::tick-1sec
{
    secs--;
    nused_per_sec > max_per_sec ? max_per_sec = nused_per_sec : 1;
    nused_per_sec = 0;
}

/*
 * Print header
 */
profile:::tick-1sec
/line == 0/
{
    /* print optional headers */
    OPT_time ? printf("%-20s ", "TIME") : 1;
    /* print header */
    printf("%10s %10s %10s\n", "BYTES", "BYTES/S", "MAX-BYTES/S");
    line = LINES;
}

/*
 * Print Output
 */
profile:::tick-1sec
/secs == 0/
{
    OPT_time ? printf("%-20Y ", walltimestamp) : 1;
    printf("%10d %10d %10d\n", nused, nused/INTERVAL, max_per_sec);
    nused = 0;
    nused_per_sec = 0;
    max_per_sec = 0;
    secs = INTERVAL;
    counts--;
    line--;
}

/*
 * End of program
 */
profile:::tick-1sec
/counts == 0/
{
    exit(0);
}
'

Greg Mason wrote:
>> If there was a latency issue, we would see such a problem with our
>> existing file server as well, which we do not. We'd also have much
>> greater problems than just file server performance.
>>
>> So, like I've said, we've ruled out the network as an issue.
>
> I should also add that I've tested these Thors with the ZIL disabled,
> and they scream! With the cache flush disabled, they also do quite well.
>
> The specific issue I'm trying to solve is the ZIL being slow when
> using NFS.
>
> I really don't want to have to do something drastic like disabling the
> ZIL to get the performance I need...
I'll give this script a shot a little bit later today.

For ZIL sizing, I'm using either 1 or 2 32G Intel X25-E SSDs in my tests, which, according to what I've read, is 2-4 times larger than the maximum that ZFS can possibly use. We've got 32G of system memory in these Thors, and (if I'm not mistaken) the maximum amount of in-play data can be 16G, 1/2 the system memory.

Also, because I know people will be asking, has anybody ever tried to recover from something like a system crash with a ZFS pool that has the ZIL disabled? What kind of nightmares would I be facing in such a situation? Would I simply risk losing that in-play data, or could more serious things happen? I know disabling the ZIL is an Extremely Bad Idea, but I need to tell people exactly why...

-Greg

Jim Mauro wrote:
> You have SSDs for the ZIL (logzilla) enabled, and ZIL IO
> is what is hurting your performance... Hmmm....
>
> I'll ask the stupid question (just to get it out of the way) - is
> it possible that the logzilla is undersized?
>
> Did you gather data using Richard Elling's zilstat (included below)?
>
> Thanks,
> /jim
<snip>
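[As a sanity check on that sizing claim, the rule of thumb at the time was that a slog only needs to hold the sync writes that can arrive between transaction group commits. The numbers below are illustrative assumptions, not measurements from Greg's Thors.]

    # assume ~10 s between txg commits and NFS ingest bounded by one GbE link (~120 MB/s)
    echo "120 * 10" | bc     # => 1200 MB, i.e. roughly 1.2 GB in flight, worst case
    # even several bonded GbE links stay far below one 32 GB X25-E, so slog capacity
    # is unlikely to be the bottleneck; per-commit latency is the thing to chase.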
On Fri, 30 Jan 2009, Greg Mason wrote:
> A Linux NFS file server, with a few terabytes of fibre-attached disk,
> using XFS.
>
> I'm trying to get these Thors to perform at least as well as the current
> setup. A performance hit is very hard to explain to our users.

I have heard that Linux NFS service does not perform synchronous writes to disk (in violation of the NFS specification). If that is true, then it would make a huge difference to write performance.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
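[One quick way to check Bob's theory on the old Linux box, if it is still around, is to list the effective export options; the path and client range shown are placeholders. An "async" export acknowledges writes before they reach stable storage, which is fast but gives up exactly the durability the Thors' ZIL is paying for.]

    # on the Linux server
    exportfs -v
    # /export/home   192.168.1.0/24(rw,async,wdelay,root_squash)   <- async would explain the speed
    # /export/home   192.168.1.0/24(rw,sync,wdelay,root_squash)    <- sync is the spec-compliant mode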
Jim Mauro wrote:
> You have SSDs for the ZIL (logzilla) enabled, and ZIL IO
> is what is hurting your performance... Hmmm....
>
> I'll ask the stupid question (just to get it out of the way) - is
> it possible that the logzilla is undersized?
>
> Did you gather data using Richard Elling's zilstat (included below)?

Jim stole a little bit of my thunder :-) To get him back, I've now blogged about my updated (better) version. The goal is to try to answer the question, "will adding a low-latency separate log device improve the performance of my workload?" To answer it, we look at the amount of work asked of the ZIL.

http://richardelling.blogspot.com/2009/01/zilstat.html

Enjoy!
-- richard
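[A usage sketch based on the option summary in the script Jim posted: run it on the Thor while the small-file test executes on a client, and compare BYTES/S against what a single X25-E can commit. The pool name "tank" is a placeholder.]

    # six 10-second samples, with timestamps, while the NFS test runs
    ./zilstat -t 10 6

    # restrict sampling to one pool
    ./zilstat -p tank 10 6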
Sounds like the device is not ignoring the cache flush requests sent down by ZFS/zil commit. If the SSD is able to drain its internal buffer to flash on a power outage, then it needs to ignore the cache flush. You can do this on a per-device basis. It's kludgy tuning, but I hope the instructions in the Evil Tuning Guide will be enough.

-r

On Jan 30, 2009, at 16:03, Greg Mason wrote:

>> If there was a latency issue, we would see such a problem with our
>> existing file server as well, which we do not. We'd also have much
>> greater problems than just file server performance.
>>
>> So, like I've said, we've ruled out the network as an issue.
>
> I should also add that I've tested these Thors with the ZIL disabled,
> and they scream! With the cache flush disabled, they also do quite
> well.
>
> The specific issue I'm trying to solve is the ZIL being slow when
> using NFS.
>
> I really don't want to have to do something drastic like disabling the
> ZIL to get the performance I need...
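[For reference, a sketch of the two knobs the Evil Tuning Guide describes; the exact sd.conf property name and the padding of the vendor/product string should be double-checked against the guide for your build, and the global tunable is only safe when every cache in the pool is nonvolatile or disabled.]

    * /etc/system -- global: stop ZFS from issuing cache-flush requests at all
    set zfs:zfs_nocacheflush = 1

    # /kernel/drv/sd.conf -- per-device: mark the SSD's cache nonvolatile so only its
    # flushes are ignored (vendor/product ID shown is illustrative, padding matters)
    sd-config-list = "ATA     INTEL SSDSA2SH03", "cache-nonvolatile:true";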
>>>>> "gm" == Greg Mason <gmason at msu.edu> writes: >>>>> "g" == Gary Mills <mills at cc.umanitoba.ca> writes:gm> I know disabling the ZIL is an Extremely Bad Idea, but maybe you don''t care about trashed thunderbird databases. You just don''t want to lose the whole pool to ``status: The pool metadata is corrupted and cannot be opened. / action: Destroy the pool and restore from backup.'''' I''ve no answer for that---maybe someone else? The known problem with ZIL disabling, AIUI, is that it breaks the statelessness of NFS. If the server reboots and the NFS clients do not, then assumptions on which the NFS protocol is built could be broken, and files could get corrupted. Behind this dire warning is an expectation I''m not sure everyone shares: if the NFS server reboots, and the clients do not, then (modulo bugs) no data is lost---once the clients unfreeze, it''s like nothing ever happened. I don''t think other file sharing protocols like SMB or AFP attempt to keep that promise, so maybe people are being warned about something most assumed would happen anyway. will disabling the ZIL make NFS corrupt files worse than SMB or AFP would when the server reboots? not sure---at least SMB or AFP _should_ give an error to the userland when the server reboots, sort of like NFS ''hard,intr'' when you press ^C, so applications using sqllite or berkeleydb or whatever can catch that error and perform their own user-level recovery, and if they call fsync() and get success they can trust it absolutely no matter server or client reboots. while the ZIL-less NFS problems would probably be more silent, more analagous to the ZFS-iSCSI problems except one layer higher in the stack so programs think they''ve written to these .db files but they haven''t, and blindly scribble on, not knowing that a batch of writes in the past was silently discarded. In practice everyone always says to run filemaker or Mail.app or Thunderbird or anything with database files on ``a local disk'''' only, so I think the SMB and AFP error paths are not working right either and the actual expectation is very low. g> Consider a file server running ZFS that exports a volume with g> Iscsi. Consider also an application server that imports the g> LUN with Iscsi and runs a ZFS filesystem on that LUN. I was pretty sure there was a bug for the iscsitadm target ignoring SYNCHRONIZE_CACHE, but I cannot find the bug number now and may be wrong. Also there is a separate problem with remote storage and filesystems highly dependent on SYNCHRONIZE_CACHE. Even if not for the bug I can''t find, remote storage adds a failure case. Normally you have three main cases to handle: SYNCHRONIZE CACHE returns success after some delay SYNCHRONIZE CACHE never returns because someone yanked the cord---the whole system goes down. You deal with it at boot, when mounting the filesystem. SYNCHRONIZE CACHE never returns because a drive went bad. iSCSI adds a fourth: SYNCHRONIZE CACHE returns success SYNCHRONIZE CACHE returns success SYNCHRONIZE CACHE returns failure SYNCHRONIZE CACHE returns success I think ZFS probably does not understand this case. The others are easier, because either you have enough raidz/mirror redundancy, or else you are allowed handle the ``returns failure'''' by implicitly unmounting the filesystem and killing everything that held an open file. NFS works around this with the COMMIT op and client-driven replay in v3, or by making everything synchronous in v2. 
iSCSI is _not_ v2-like because, even if there is no write caching in the initiator/target (there probably ought to be), if the underlying physical disk in the target has a write cache, still the entire target chassis can reboot and lose the contents of that cache. And I suspect iSCSI is not using NFS-v3-like workarounds right now. I think this hole is probably still open. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090202/aef2ce9a/attachment.bin>
Hello Miles,

Monday, February 2, 2009, 7:20:49 PM, you wrote:

>>>>>> "gm" == Greg Mason <gmason at msu.edu> writes:
>>>>>> "g" == Gary Mills <mills at cc.umanitoba.ca> writes:

MN>     gm> I know disabling the ZIL is an Extremely Bad Idea,

MN> but maybe you don't care about trashed thunderbird databases. You
MN> just don't want to lose the whole pool to ``status: The pool metadata
MN> is corrupted and cannot be opened. / action: Destroy the pool and
MN> restore from backup.'' I've no answer for that---maybe someone else?

It will not cause the above. Disabling the ZIL has nothing to do with pool consistency.

--
Best regards,
Robert Milkowski
http://milek.blogspot.com