I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.

zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.

Automount is in use both locally and remotely (Linux clients). Locally, /home/* is remounted from the zpool; remotely, /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.

The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and recurrence of the fault: one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).

I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.

Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on the list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB of write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)

My DTrace is very poor, but I'm suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and was able to share it, it would be much appreciated.

Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

cheers,
James
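A minimal DTrace sketch for the kind of help being asked for here (not something posted in the thread; it assumes only the stock syscall provider) is to bucket the latency of the stat()-family and getdents() calls that ls issues, keyed by syscall and command, so a 2-30 second stall shows up as an outlier in the histogram. Values are in nanoseconds; run it as root on the server while the fault is being reproduced:

    dtrace -n '
    syscall::stat*:entry, syscall::lstat*:entry, syscall::getdents*:entry
    {
            self->ts = timestamp;
    }
    syscall::stat*:return, syscall::lstat*:return, syscall::getdents*:return
    /self->ts/
    {
            /* latency histogram, keyed by syscall name and command */
            @lat[probefunc, execname] = quantize(timestamp - self->ts);
            self->ts = 0;
    }
    tick-30s
    {
            printa(@lat);
            trunc(@lat);
    }'

Anything with counts up in the billions-of-nanoseconds buckets corresponds to the multi-second ls behaviour described above.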
On Sun, Sep 6, 2009 at 9:15 AM, James Lever <j at jamver.id.au> wrote:

> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.
>
> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>
> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.
>
> Automount is in use both locally and remotely (Linux clients). Locally, /home/* is remounted from the zpool; remotely, /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.
>
> The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and recurrence of the fault: one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).
>
> I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.
>
> Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on the list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB of write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)
>
> My DTrace is very poor, but I'm suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and was able to share it, it would be much appreciated.
>
> Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.

You have iSCSI, NFS, CIFS to choose from (most obvious); try restarting them one at a time during down time and see if performance improves after each restart to find the culprit.

-Ross
On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:

> On Sun, Sep 6, 2009 at 9:15 AM, James Lever <j at jamver.id.au> wrote:
>> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.

I'm confused. If "This problem has only been noticed via NFS (v3" then how is it "observed locally?"

>> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>>
>> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.
>>
>> Automount is in use both locally and remotely (Linux clients). Locally, /home/* is remounted from the zpool; remotely, /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.
>>
>> The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and recurrence of the fault: one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).

iostat(1m) is the program for troubleshooting performance issues related to latency. It will show the latency of nfs mounts as well as other devices.

>> I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.
>>
>> Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on the list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB of write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)

stat(2) doesn't write, so you can stop worrying about the slog.

>> My DTrace is very poor, but I'm suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and was able to share it, it would be much appreciated.
>> Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

Rule out the network by looking at retransmissions and ioerrors with netstat(1m) on both the client and server.

> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.

See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with the "Physical Memory Control Using the Resource Capping Daemon" chapter in System Administration Guide: Solaris Containers-Resource Management, and Solaris Zones.
 -- richard

> You have iSCSI, NFS, CIFS to choose from (most obvious); try restarting them one at a time during down time and see if performance improves after each restart to find the culprit.
>
> -Ross
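Purely as a hedged illustration of the resource-capping route Richard points at (the cap value and project name below are placeholders, not anything from this thread), enabling and watching rcapd looks roughly like:

    rcapadm -E                                    # enable the resource capping daemon
    rcapstat 5                                    # watch per-project RSS against its cap every 5 seconds
    # caps are set per project, e.g. (assumed syntax):
    # projmod -a -K "rcap.max-rss=8GB" user.builder

rcapstat output over a day or two would show whether any project's resident set is growing without bound, which is the leak pattern Ross describes.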
On 07/09/2009, at 12:53 AM, Ross Walker wrote:

> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.

If it was this type of behaviour, where would it be logged when the process was killed/restarted? If it's not logged by default, can that be enabled?

I have not seen any evidence of this in /var/adm/messages, /var/log/syslog, or my /var/log/debug (*.debug), but perhaps I'm not looking for the right clues.

> You have iSCSI, NFS, CIFS to choose from (most obvious); try restarting them one at a time during down time and see if performance improves after each restart to find the culprit.

The downtime is being reported by users, and I have only seen it once (while in their office), so this method of debugging isn't going to help, I'm afraid. (This is why I asked about alternate root cause analysis methods.)

cheers,
James
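If a service process had died and been restarted by SMF, it may show up in that service's log under /var/svc/log and in svcs output, so a quick, non-disruptive check (all stock OpenSolaris commands; nothing here is specific to this system) might be:

    svcs -xv                      # services in maintenance or with recent failures
    ls -lrt /var/svc/log | tail   # recently written service logs record restarts
    echo ::memstat | mdb -k       # kernel/ZFS versus anon (process) memory split
    prstat -s rss -n 10 1 1       # one sample of the largest resident processes

Comparing the ::memstat and prstat numbers across a few days would also show whether anything is leaking towards the memory-pressure scenario Ross suggests.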
On 07/09/2009, at 6:24 AM, Richard Elling wrote:

> On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
>> On Sun, Sep 6, 2009 at 9:15 AM, James Lever <j at jamver.id.au> wrote:
>>> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.
>
> I'm confused. If "This problem has only been noticed via NFS (v3" then how is it "observed locally?"

Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI.

It has been observed in client:/home/user (NFSv3 automount from server:/home/user, redirected to server:/zpool/home/user) and also in server:/home/user (local automount) and server:/zpool/home/user (origin).

> iostat(1m) is the program for troubleshooting performance issues related to latency. It will show the latency of nfs mounts as well as other devices.

What specifically should I be looking for here? (using 'iostat -xen -T d') And I'm guessing I'll require a high level of granularity (1s intervals) to see the issue if it is a single disk or similar.

> stat(2) doesn't write, so you can stop worrying about the slog.

My concern here was that I may have been trying to write (via other concurrent processes) at the same time as there was a memory fault from the ARC to L2ARC.

> Rule out the network by looking at retransmissions and ioerrors with netstat(1m) on both the client and server.

No errors or collisions from either server or clients observed.

>> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.
>
> See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with the "Physical Memory Control Using the Resource Capping Daemon" chapter in System Administration Guide: Solaris Containers-Resource Management, and Solaris Zones.

Thanks Richard, I'll have a look at that today and see where I get.

cheers,
James
Sorry for my earlier post, I responded prematurely.

On Sep 6, 2009, at 9:15 AM, James Lever <j at jamver.id.au> wrote:

> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.

Have you tried snoop/tcpdump/wireshark on the client side and server side to figure out what is being sent and exactly how long it is taking to get a response?

> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

This config might lead to heavy sync writes (NFS) starving reads, due to the fact that the whole RAIDZ2 behaves as a single disk on writes. How about 2 5-disk RAIDZ2s or 3 4-disk RAIDZs?

Just one or two other vdevs to spread the load can make the world of difference.

> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.

There are a lot of services here, all off one pool? You might be trying to bite off more than the config can chew.

> Automount is in use both locally and remotely (Linux clients). Locally, /home/* is remounted from the zpool; remotely, /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.

Try taking a particularly bad problem station and configuring it static for a bit to see if it is.

> The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and recurrence of the fault: one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).

Sounds like the user is pre-fetching his attribute cache to overcome poor performance.

> I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.

That doesn't make a lot of sense to me: the L2ARC is a secondary read cache, so if writes are starving reads then the L2ARC would only help here.

> Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on the list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB of write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)

It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.

> My DTrace is very poor, but I'm suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and was able to share it, it would be much appreciated.

DTrace would tell you, but I wish the learning curve wasn't so steep to get it going.

> Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

I hope I gave some good pointers. I'd first look at the pool configuration.

-Ross
On Sep 6, 2009, at 5:06 PM, James Lever wrote:

> On 07/09/2009, at 6:24 AM, Richard Elling wrote:
>> On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
>> On Sun, Sep 6, 2009 at 9:15 AM, James Lever <j at jamver.id.au> wrote:
>>>> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.
>>
>> I'm confused. If "This problem has only been noticed via NFS (v3" then how is it "observed locally?"
>
> Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI.
>
> It has been observed in client:/home/user (NFSv3 automount from server:/home/user, redirected to server:/zpool/home/user) and also in server:/home/user (local automount) and server:/zpool/home/user (origin).

Ok, just so I am clear, when you mean "local automount" you are on the server and using the loopback -- no NFS or network involved?

>> iostat(1m) is the program for troubleshooting performance issues related to latency. It will show the latency of nfs mounts as well as other devices.
>
> What specifically should I be looking for here? (using 'iostat -xen -T d') And I'm guessing I'll require a high level of granularity (1s intervals) to see the issue if it is a single disk or similar.

You are looking for I/O that takes seconds to complete or is stuck in the device. This is in the actv column stuck > 1 and the asvc_t >> 1000.

>> stat(2) doesn't write, so you can stop worrying about the slog.
>
> My concern here was that I may have been trying to write (via other concurrent processes) at the same time as there was a memory fault from the ARC to L2ARC.

stat(2) looks at metadata, which is generally small and compressed. It is also cached in the ARC, by default. If this is repeatable in a short period of time, then it is not an I/O problem and you need to look at:
1. the number of files in the directory
2. the locale (ls sorts by default, and your locale affects the sort time)

>> Rule out the network by looking at retransmissions and ioerrors with netstat(1m) on both the client and server.
>
> No errors or collisions from either server or clients observed.

retrans? As Ross mentioned, wireshark, snoop, or most other network monitors will show network traffic in detail.
 -- richard

>>> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.
>>
>> See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with the "Physical Memory Control Using the Resource Capping Daemon" chapter in System Administration Guide: Solaris Containers-Resource Management, and Solaris Zones.
>
> Thanks Richard, I'll have a look at that today and see where I get.
>
> cheers,
> James
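Using the thresholds Richard gives, a small filter over the suggested iostat run could be left running until a user reports a stall. Field positions assume the iostat -xen layout shown later in this thread (actv is field 6, asvc_t field 8, device name last), so treat this as a sketch rather than a drop-in tool:

    iostat -xen -T d 1 | nawk '
        / [AP]M /                               { print; next }   # keep the -T d timestamp lines
        $1 ~ /^[0-9]/ && ($6 > 1 || $8 > 1000)  { print }         # actv stuck, or asvc_t well over 1000 ms
    '

Redirecting that to a file gives a timestamped record of exactly which device was saturated when the next multi-second ls happens.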
On 07/09/2009, at 11:08 AM, Richard Elling wrote:> Ok, just so I am clear, when you mean "local automount" you are > on the server and using the loopback -- no NFS or network involved?Correct. And the behaviour has been seen locally as well as remotely.> You are looking for I/O that takes seconds to complete or is stuck in > the device. This is in the actv column stuck > 1 and the asvc_t >> > 1000Just started having some slow responsiveness reported form a user using emacs (autosave, start of a build) so a small file write request. The second or so before they went to do this, it appears as if the raid cache in front of the slog devices was nearly filled and the SSDs were being utilised quite heavily, but then there was a break where I am seeing relatively light usage on the slog but 100% busy on the device reported. The iostat output is at the end of this message - I can?t make any real sense out of why a user would have seen a ~4s delay at about 2:39:17-18. Only one of the two slog devices are being used at all. Is there some tunable about how multiple slogs are used? c7t[01] are rpool c7t[23] are slog devices in the data pool c11t* are the primary storage devices for the data pool cheers, James Monday, 7 September 2009 2:39:17 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 1475.0 0.0 188799.0 0.0 30.2 0.0 20.5 2 90 0 0 0 0 c7t2d0 0.0 232.0 0.0 29571.8 0.0 33.8 0.0 145.9 0 98 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:18 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t2d0 0.0 0.0 0.0 0.0 0.0 35.0 0.0 0.0 0 100 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:19 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t2d0 0.0 341.0 0.0 43650.1 0.0 35.0 0.0 102.5 0 100 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:20 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t2d0 0.0 342.0 0.0 43774.8 0.0 35.0 0.0 102.2 0 100 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:21 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t2d0 0.0 349.0 0.0 44546.8 0.0 35.0 0.0 100.2 0 100 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:22 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 2.0 0.0 32.0 0.0 0.0 0.0 0.1 0 0 0 0 0 0 c7t2d0 0.0 214.0 0.0 27168.6 0.0 19.7 0.0 91.8 0 61 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:23 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 2.0 0.0 132.0 0.0 0.0 0.0 0.2 0 0 0 0 0 0 c7t2d0 0.0 3.0 0.0 356.1 0.0 0.0 0.0 0.4 0 0 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0
On 07/09/2009, at 10:46 AM, Ross Walker wrote:

>> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>
> This config might lead to heavy sync writes (NFS) starving reads, due to the fact that the whole RAIDZ2 behaves as a single disk on writes. How about 2 5-disk RAIDZ2s or 3 4-disk RAIDZs?
>
> Just one or two other vdevs to spread the load can make the world of difference.

This was a management decision. I wanted to go down the striped mirrored pair solution, but the amount of space lost was considered too great. RAIDZ2 was considered the best value option for our environment.

>> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.
>
> There are a lot of services here, all off one pool? You might be trying to bite off more than the config can chew.

That's not a lot of services, really. We have 6 users doing builds on multiple platforms and using the storage as their home directory (Windows and Unix).

The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.

> Try taking a particularly bad problem station and configuring it static for a bit to see if it is.

That has been considered also, but the issue has also been observed locally on the fileserver.

> That doesn't make a lot of sense to me: the L2ARC is a secondary read cache, so if writes are starving reads then the L2ARC would only help here.

I was suggesting that slog writes were possibly starving reads from the l2arc as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.

> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.

Possible, but hard to tell. Have a look at the iostat results I've posted.

cheers,
James
On Sep 7, 2009, at 1:32 AM, James Lever <j at jamver.id.au> wrote:

> On 07/09/2009, at 10:46 AM, Ross Walker wrote:
>
>>> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>>
>> This config might lead to heavy sync writes (NFS) starving reads, due to the fact that the whole RAIDZ2 behaves as a single disk on writes. How about 2 5-disk RAIDZ2s or 3 4-disk RAIDZs?
>>
>> Just one or two other vdevs to spread the load can make the world of difference.
>
> This was a management decision. I wanted to go down the striped mirrored pair solution, but the amount of space lost was considered too great. RAIDZ2 was considered the best value option for our environment.

Well, an MD1000 holds 15 drives; a good compromise might be 2 7-drive RAIDZ2s with a hot spare... That should provide 320 IOPS instead of 160, a big difference.

>>> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.
>>
>> There are a lot of services here, all off one pool? You might be trying to bite off more than the config can chew.
>
> That's not a lot of services, really. We have 6 users doing builds on multiple platforms and using the storage as their home directory (Windows and Unix).

Ok, six users, but what happens during a build?

> The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.

Look at the write IOPS of the pool with zpool iostat -v and look at how many are happening on the RAIDZ2 vdev.

>> Try taking a particularly bad problem station and configuring it static for a bit to see if it is.
>
> That has been considered also, but the issue has also been observed locally on the fileserver.

Then I suppose you have eliminated the automounter as a culprit at this point.

>> That doesn't make a lot of sense to me: the L2ARC is a secondary read cache, so if writes are starving reads then the L2ARC would only help here.
>
> I was suggesting that slog writes were possibly starving reads from the l2arc as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.

The SSD will handle a lot more IOPS than the pool, and the L2ARC is a lazy reader; it mostly just holds on to read cache data.

>> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.
>
> Possible, but hard to tell. Have a look at the iostat results I've posted.

The busy times of the disks while the issue is occurring should let you know.

-Ross
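As a concrete form of that suggestion (the pool name is a placeholder):

    zpool iostat -v tank 5    # per-vdev read/write operations and bandwidth every 5 seconds

During a build, compare the write operations landing on the raidz2 vdev against the rough ceiling of a single disk's worth of IOPS that Ross describes for a single raidz2 vdev.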
On 08/09/2009, at 2:01 AM, Ross Walker wrote:

> On Sep 7, 2009, at 1:32 AM, James Lever <j at jamver.id.au> wrote:
>
> Well, an MD1000 holds 15 drives; a good compromise might be 2 7-drive RAIDZ2s with a hot spare... That should provide 320 IOPS instead of 160, a big difference.
>
>> The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.
>
> Look at the write IOPS of the pool with zpool iostat -v and look at how many are happening on the RAIDZ2 vdev.
>
>> I was suggesting that slog writes were possibly starving reads from the l2arc as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.
>
> The SSD will handle a lot more IOPS than the pool, and the L2ARC is a lazy reader; it mostly just holds on to read cache data.
>
>>> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.
>>
>> Possible, but hard to tell. Have a look at the iostat results I've posted.
>
> The busy times of the disks while the issue is occurring should let you know.

So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of this SLC SSD) and, until the load drops, all new write requests are blocked, causing a noticeable delay (which has been observed to be up to 20s, but is generally only 2-4s).

I can reproduce this behaviour by copying a large file (hundreds of MB in size) using 'cp src dst' on an NFS (still currently v3) client and observing that all data is pushed through the slog device (10GB partition of a Samsung 50GB SSD behind a PERC 6/i w/256MB BBC) rather than going direct to the primary storage disks.

On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device made an improvement on performance, but did not remove the occasional observed pauses.

I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?

The clients are (mostly) RHEL5.

Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?

I have investigated using the logbias setting, but that will just kill small file performance also on any filesystem using it and defeat the purpose of having a slog device at all.

cheers,
James
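One point worth noting about the logbias route (illustrative commands only; the dataset names are placeholders) is that it is a per-dataset property, so it does not have to be traded off globally - a dataset that only ever sees large streaming writes can be pointed away from the slog while home directories keep the default:

    zfs set logbias=throughput tank/builds   # ZIL blocks for this dataset are allocated from the main pool, not the slog
    zfs get -r logbias tank                  # everything else stays at the default, latency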
On Thu, 24 Sep 2009, James Lever wrote:

> I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?

What would cause you to understand that?

> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?

Synchronous writes are needed by NFS to support its atomic write requirement. It sounds like your SSD is write-bandwidth bottlenecked rather than IOPS bottlenecked. Replacing your SSD with a more performant one seems like the first step.

NFS client tunings can make a big difference when it comes to performance. Check the nfs(5) manual page for your Linux systems to see what options are available. An obvious tunable is 'wsize', which should ideally match (or be a multiple of) the zfs filesystem block size. The /proc/mounts file for my Debian install shows that 1048576 is being used. This is quite large and perhaps a smaller value would help. If you are willing to accept the risk, using the Linux 'async' mount option may make things seem better.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
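A hedged example of the client-side tunables Bob describes, for a RHEL5 client (the server name and export path are placeholders; 32768 is simply a common choice that also lines up with the ZFS threshold mentioned later in the thread):

    mount -t nfs -o vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768 \
        server:/zpool/home /home

Adding async to that option list is the riskier variant Bob refers to.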
comment below...

On Sep 23, 2009, at 10:00 PM, James Lever wrote:

> On 08/09/2009, at 2:01 AM, Ross Walker wrote:
>> On Sep 7, 2009, at 1:32 AM, James Lever <j at jamver.id.au> wrote:
>>
>> Well, an MD1000 holds 15 drives; a good compromise might be 2 7-drive RAIDZ2s with a hot spare... That should provide 320 IOPS instead of 160, a big difference.
>>
>>> The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.
>>
>> Look at the write IOPS of the pool with zpool iostat -v and look at how many are happening on the RAIDZ2 vdev.
>>
>>> I was suggesting that slog writes were possibly starving reads from the l2arc as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.
>>
>> The SSD will handle a lot more IOPS than the pool, and the L2ARC is a lazy reader; it mostly just holds on to read cache data.
>>
>>>> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.
>>>
>>> Possible, but hard to tell. Have a look at the iostat results I've posted.
>>
>> The busy times of the disks while the issue is occurring should let you know.
>
> So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of this SLC SSD) and, until the load drops, all new write requests are blocked, causing a noticeable delay (which has been observed to be up to 20s, but is generally only 2-4s).

Thank you sir, can I have another?
If you add (not attach) more slogs, the workload will be spread across them. But...

> I can reproduce this behaviour by copying a large file (hundreds of MB in size) using 'cp src dst' on an NFS (still currently v3) client and observing that all data is pushed through the slog device (10GB partition of a Samsung 50GB SSD behind a PERC 6/i w/256MB BBC) rather than going direct to the primary storage disks.
>
> On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device made an improvement on performance, but did not remove the occasional observed pauses.

...this is not surprising, when you add a slow slog device. This is the weakest link rule.

> I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?

The threshold is 32 kBytes, which is unfortunately the same as the default NFS write size. See CR 6686887
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887

If you have a slog and logbias=latency (default) then the writes go to the slog. So there is some interaction here that can affect NFS workloads in particular.

> The clients are (mostly) RHEL5.
>
> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?

You can change the IOP size on the client.
 -- richard

> I have investigated using the logbias setting, but that will just kill small file performance also on any filesystem using it and defeat the purpose of having a slog device at all.
>
> cheers,
> James
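For what it's worth, the 32 kByte threshold Richard describes corresponds, as far as I understand it (treat the tunable name as an assumption rather than something verified in this thread), to zfs_immediate_write_sz, which can be read on a live system with mdb:

    echo zfs_immediate_write_sz/E | mdb -k   # prints the current threshold in bytes

With a dedicated slog and the default logbias=latency, though, sync write payloads are logged to the slog regardless of size, which matches the behaviour reported above.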
On 25/09/2009, at 2:58 AM, Richard Elling wrote:

> On Sep 23, 2009, at 10:00 PM, James Lever wrote:
>
>> So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of this SLC SSD) and, until the load drops, all new write requests are blocked, causing a noticeable delay (which has been observed to be up to 20s, but is generally only 2-4s).
>
> Thank you sir, can I have another?
> If you add (not attach) more slogs, the workload will be spread across them. But...

My log configuration is:

	logs
	  c7t2d0s0  ONLINE   0  0  0
	  c7t3d0s0  OFFLINE  0  0  0

I'm going to test the now removed SSD and see if I can get it to perform significantly worse than the first one, but my memory of testing these at pre-production time was that they were both equally slow but not significantly different.

>> On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device made an improvement on performance, but did not remove the occasional observed pauses.
>
> ...this is not surprising, when you add a slow slog device. This is the weakest link rule.

So, in theory, even if one of the two SSDs was only slightly slower than the other, it would just appear that it was more heavily affected?

Here is part of what I'm not understanding - unless one SSD is significantly worse than the other, how can the following scenario be true?

Here is some iostat output from the two slog devices at 1s intervals when it gets a large series of write requests. Idle at start.

  0.0 1462.0  0.0 187010.2  0.0 28.6  0.0  19.6  2  83  0 0 0 0 c7t2d0
  0.0  233.0  0.0  29823.7  0.0 28.7  0.0 123.3  0  83  0 0 0 0 c7t3d0

NVRAM cache close to full. (256MB BBC)

  0.0   84.0  0.0  10622.0  0.0  3.5  0.0  41.2  0  12  0 0 0 0 c7t2d0
  0.0    0.0  0.0      0.0  0.0 35.0  0.0   0.0  0 100  0 0 0 0 c7t3d0
  0.0    0.0  0.0      0.0  0.0  0.0  0.0   0.0  0   0  0 0 0 0 c7t2d0
  0.0  305.0  0.0  39039.3  0.0 35.0  0.0 114.7  0 100  0 0 0 0 c7t3d0
  0.0    0.0  0.0      0.0  0.0  0.0  0.0   0.0  0   0  0 0 0 0 c7t2d0
  0.0  361.0  0.0  46208.1  0.0 35.0  0.0  96.8  0 100  0 0 0 0 c7t3d0
  0.0    0.0  0.0      0.0  0.0  0.0  0.0   0.0  0   0  0 0 0 0 c7t2d0
  0.0  329.0  0.0  42114.0  0.0 35.0  0.0 106.3  0 100  0 0 0 0 c7t3d0
  0.0    0.0  0.0      0.0  0.0  0.0  0.0   0.0  0   0  0 0 0 0 c7t2d0
  0.0  317.0  0.0  40449.6  0.0 27.4  0.0  86.5  0  85  0 0 0 0 c7t3d0
  0.0    4.0  0.0    263.8  0.0  0.0  0.0   0.2  0   0  0 0 0 0 c7t2d0
  0.0    4.0  0.0    367.8  0.0  0.0  0.0   0.3  0   0  0 0 0 0 c7t3d0

What determines the size of the writes or the distribution between slog devices? It looks like ZFS decided to send a large chunk to one slog, which nearly filled the NVRAM, and then continued writing to the other one, which meant that it had to go at device speed (whatever that is for the data size/write size).

Is there a way to tune the writes to multiple slogs to be (for argument's sake) 10MB slices?

>> I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?
>
> The threshold is 32 kBytes, which is unfortunately the same as the default NFS write size. See CR 6686887
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887
>
> If you have a slog and logbias=latency (default) then the writes go to the slog. So there is some interaction here that can affect NFS workloads in particular.

Interesting CR.

nfsstat -m output on one of the Linux hosts (Ubuntu):

  Flags: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.1.0.17,mountvers=3,mountproto=tcp,addr=10.1.0.17

rsize and wsize are auto-tuned to 1MB. How does this affect the sync request threshold?

>> The clients are (mostly) RHEL5.
>>
>> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?
>
> You can change the IOP size on the client.

You're suggesting modifying rsize/wsize? Or something else?

cheers,
James
On 25/09/2009, at 1:24 AM, Bob Friesenhahn wrote:

> On Thu, 24 Sep 2009, James Lever wrote:
>> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?
>
> Synchronous writes are needed by NFS to support its atomic write requirement. It sounds like your SSD is write-bandwidth bottlenecked rather than IOPS bottlenecked. Replacing your SSD with a more performant one seems like the first step.
>
> NFS client tunings can make a big difference when it comes to performance. Check the nfs(5) manual page for your Linux systems to see what options are available. An obvious tunable is 'wsize', which should ideally match (or be a multiple of) the zfs filesystem block size. The /proc/mounts file for my Debian install shows that 1048576 is being used. This is quite large and perhaps a smaller value would help. If you are willing to accept the risk, using the Linux 'async' mount option may make things seem better.

From the Linux NFS FAQ (http://nfs.sourceforge.net/): "NFS Version 3 introduces the concept of 'safe asynchronous writes.'" And it continues from there.

My rsize and wsize are negotiating to 1MB.

James
On Fri, 25 Sep 2009, James Lever wrote:
>
> NFS Version 3 introduces the concept of "safe asynchronous writes."

Being "safe" then requires a responsibility level on the client which is often not present. For example, if the server crashes, and then the client crashes, how does the client resend the uncommitted data? If the client had a non-volatile storage cache, then it would be able to responsibly finish the writes that failed.

The commentary says that normally the COMMIT operations occur during the close(2) or fsync(2) system call, or when encountering memory pressure. If the problem is slow copying of many small files, this COMMIT approach does not help very much, since very little data is sent per file and most time is spent creating directories and files.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:> The commentary says that normally the COMMIT operations occur during > close(2) or fsync(2) system call, or when encountering memory > pressure. If the problem is slow copying of many small files, this > COMMIT approach does not help very much since very little data is > sent per file and most time is spent creating directories and files.The problem appears to be slog bandwidth exhaustion due to all data being sent via the slog creating a contention for all following NFS or locally synchronous writes. The NFS writes do not appear to be synchronous in nature - there is only a COMMIT being issued at the very end, however, all of that data appears to be going via the slog and it appears to be inflating to twice its original size. For a test, I just copied a relatively small file (8.4MB in size). Looking at a tcpdump analysis using wireshark, there is a SETATTR which ends with a V3 COMMIT and no COMMIT messages during the transfer. iostat output that matches looks like this: slog write of the data (17MB appears to hit the slog) Friday, 25 September 2009 1:01:00 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 135.0 0.0 17154.5 0.0 0.8 0.0 6.0 0 3 0 0 0 0 c7t2d0 then a few seconds later, the transaction group gets flushed to primary storage writing nearly 11.4MB which is inline with raid Z2 (expect around 10.5MB; 8.4/8*10): Friday, 25 September 2009 1:01:13 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 91.0 0.0 1170.4 0.0 0.1 0.0 1.3 0 2 0 0 0 0 c11t0d0 0.0 84.0 0.0 1171.4 0.0 0.1 0.0 1.2 0 2 0 0 0 0 c11t1d0 0.0 92.0 0.0 1172.4 0.0 0.1 0.0 1.2 0 2 0 0 0 0 c11t2d0 0.0 84.0 0.0 1172.4 0.0 0.1 0.0 1.3 0 2 0 0 0 0 c11t3d0 0.0 81.0 0.0 1176.4 0.0 0.1 0.0 1.4 0 2 0 0 0 0 c11t4d0 0.0 86.0 0.0 1176.4 0.0 0.1 0.0 1.4 0 2 0 0 0 0 c11t5d0 0.0 89.0 0.0 1175.4 0.0 0.1 0.0 1.4 0 2 0 0 0 0 c11t6d0 0.0 84.0 0.0 1175.4 0.0 0.1 0.0 1.3 0 2 0 0 0 0 c11t7d0 0.0 91.0 0.0 1168.9 0.0 0.1 0.0 1.3 0 2 0 0 0 0 c11t8d0 0.0 89.0 0.0 1170.9 0.0 0.1 0.0 1.4 0 2 0 0 0 0 c11t9d0 So I performed the same test with a much larger file (533MB) to see what it would do, being larger than the NVRAM cache in front of the SSD. Note that after the second second of activity the NVRAM is full and only allowing in about the sequential write speed of the SSD (~70MB/s). 
Friday, 25 September 2009 1:13:14 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 640.9 0.0 81782.9 0.0 4.2 0.0 6.5 1 14 0 0 0 0 c7t2d0 0.0 1065.7 0.0 136408.1 0.0 18.6 0.0 17.5 1 78 0 0 0 0 c7t2d0 0.0 579.0 0.0 74113.3 0.0 30.7 0.0 53.1 1 100 0 0 0 0 c7t2d0 0.0 588.7 0.0 75357.0 0.0 33.2 0.0 56.3 1 100 0 0 0 0 c7t2d0 0.0 532.0 0.0 68096.3 0.0 31.5 0.0 59.1 1 100 0 0 0 0 c7t2d0 0.0 559.0 0.0 71428.0 0.0 32.5 0.0 58.1 1 100 0 0 0 0 c7t2d0 0.0 542.0 0.0 68755.9 0.0 25.1 0.0 46.4 1 100 0 0 0 0 c7t2d0 0.0 542.0 0.0 69376.4 0.0 35.0 0.0 64.6 1 100 0 0 0 0 c7t2d0 0.0 581.0 0.0 74368.0 0.0 30.6 0.0 52.6 1 100 0 0 0 0 c7t2d0 0.0 567.0 0.0 72574.1 0.0 33.2 0.0 58.6 1 100 0 0 0 0 c7t2d0 0.0 564.0 0.0 72194.1 0.0 31.1 0.0 55.2 1 100 0 0 0 0 c7t2d0 0.0 573.0 0.0 73343.5 0.0 33.2 0.0 57.9 1 100 0 0 0 0 c7t2d0 0.0 536.3 0.0 68640.5 0.0 33.1 0.0 61.7 1 100 0 0 0 0 c7t2d0 0.0 121.9 0.0 15608.9 0.0 2.7 0.0 22.1 0 22 0 0 0 0 c7t2d0 Again, the slog wrote about double the file size (1022.6MB) and a few seconds later, the data was pushed to the primary storage (684.9MB with an expectation of 666MB = 533MB/8*10) so again about the right number hit the spinning platters. Friday, 25 September 2009 1:13:43 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 338.3 0.0 32794.4 0.0 13.7 0.0 40.6 1 47 0 0 0 0 c11t0d0 0.0 325.3 0.0 31399.8 0.0 13.7 0.0 42.0 1 47 0 0 0 0 c11t1d0 0.0 339.3 0.0 33273.3 0.0 13.7 0.0 40.3 1 47 0 0 0 0 c11t2d0 0.0 332.3 0.0 32009.0 0.0 13.7 0.0 41.4 0 47 0 0 0 0 c11t3d0 0.0 352.3 0.0 34364.0 0.0 13.7 0.0 39.0 1 47 0 0 0 0 c11t4d0 0.0 355.2 0.0 33788.7 0.0 13.7 0.0 38.6 1 47 0 0 0 0 c11t5d0 0.0 352.3 0.0 33452.3 0.0 13.8 0.0 39.3 1 47 0 0 0 0 c11t6d0 0.0 339.3 0.0 32873.5 0.0 13.7 0.0 40.4 1 47 0 0 0 0 c11t7d0 0.0 337.3 0.0 32889.0 0.0 13.5 0.0 40.0 1 47 0 0 0 0 c11t8d0 0.0 336.3 0.0 32441.9 0.0 13.7 0.0 40.9 1 47 0 0 0 0 c11t9d0 Friday, 25 September 2009 1:13:44 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 349.7 0.0 35677.0 0.0 16.1 0.0 45.9 0 48 0 0 0 0 c11t0d0 0.0 367.7 0.0 37078.3 0.0 16.1 0.0 43.8 0 49 0 0 0 0 c11t1d0 0.0 348.7 0.0 35197.1 0.0 16.3 0.0 46.9 0 49 0 0 0 0 c11t2d0 0.0 360.7 0.0 36467.7 0.0 15.9 0.0 44.1 0 48 0 0 0 0 c11t3d0 0.0 342.7 0.0 34103.9 0.0 16.2 0.0 47.2 0 48 0 0 0 0 c11t4d0 0.0 347.7 0.0 34682.1 0.0 16.0 0.0 46.0 0 48 0 0 0 0 c11t5d0 0.0 349.7 0.0 35018.3 0.0 16.3 0.0 46.7 0 49 0 0 0 0 c11t6d0 0.0 353.7 0.0 35600.5 0.0 16.1 0.0 45.6 0 49 0 0 0 0 c11t7d0 0.0 350.7 0.0 35580.5 0.0 16.2 0.0 46.1 0 49 0 0 0 0 c11t8d0 0.0 356.7 0.0 36031.0 0.0 15.9 0.0 44.4 0 48 0 0 0 0 c11t9d0 Can anybody explain what is going on with the slog device in that all data is being shunted via it and why about double the data size is being written to it per transaction? cheers, James -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090925/65ca4620/attachment.html>
I thought I would try the same test using dd bs=131072 if=source of=/ path/to/nfs to see what the results looked liked? It is very similar to before, about 2x slog usage and same timing and write totals. Friday, 25 September 2009 1:49:48 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/ w trn tot device 0.0 1538.7 0.0 196834.0 0.0 23.1 0.0 15.0 2 67 0 0 0 0 c7t2d0 0.0 562.0 0.0 71942.3 0.0 35.0 0.0 62.3 1 100 0 0 0 0 c7t2d0 0.0 590.7 0.0 75614.4 0.0 35.0 0.0 59.2 1 100 0 0 0 0 c7t2d0 0.0 600.9 0.0 76920.0 0.0 35.0 0.0 58.2 1 100 0 0 0 0 c7t2d0 0.0 546.0 0.0 69887.9 0.0 35.0 0.0 64.1 1 100 0 0 0 0 c7t2d0 0.0 554.0 0.0 70913.9 0.0 35.0 0.0 63.2 1 100 0 0 0 0 c7t2d0 0.0 598.0 0.0 76549.2 0.0 35.0 0.0 58.5 1 100 0 0 0 0 c7t2d0 0.0 563.0 0.0 72065.1 0.0 35.0 0.0 62.1 1 100 0 0 0 0 c7t2d0 0.0 588.1 0.0 75282.6 0.0 31.5 0.0 53.5 1 100 0 0 0 0 c7t2d0 0.0 564.0 0.0 72195.7 0.0 34.8 0.0 61.7 1 100 0 0 0 0 c7t2d0 0.0 582.8 0.0 74599.8 0.0 35.0 0.0 60.0 1 100 0 0 0 0 c7t2d0 0.0 544.0 0.0 69633.3 0.0 35.0 0.0 64.3 1 100 0 0 0 0 c7t2d0 0.0 530.0 0.0 67191.5 0.0 30.6 0.0 57.7 0 90 0 0 0 0 c7t2d0 And then the write to primary storage a few seconds later: Friday, 25 September 2009 1:50:14 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 426.3 0.0 32196.3 0.0 12.7 0.0 29.8 1 45 0 0 0 0 c11t0d0 0.0 410.4 0.0 31857.1 0.0 12.4 0.0 30.3 1 45 0 0 0 0 c11t1d0 0.0 426.3 0.0 30698.1 0.0 13.0 0.0 30.5 1 45 0 0 0 0 c11t2d0 0.0 429.3 0.0 31392.3 0.0 12.6 0.0 29.4 1 45 0 0 0 0 c11t3d0 0.0 443.2 0.0 33280.8 0.0 12.9 0.0 29.1 1 45 0 0 0 0 c11t4d0 0.0 424.3 0.0 33872.4 0.0 12.7 0.0 30.0 1 45 0 0 0 0 c11t5d0 0.0 432.3 0.0 32903.2 0.0 12.6 0.0 29.2 1 45 0 0 0 0 c11t6d0 0.0 418.3 0.0 32562.0 0.0 12.5 0.0 29.9 1 45 0 0 0 0 c11t7d0 0.0 417.3 0.0 31746.2 0.0 12.4 0.0 29.8 1 44 0 0 0 0 c11t8d0 0.0 424.3 0.0 31270.6 0.0 12.7 0.0 29.9 1 45 0 0 0 0 c11t9d0 Friday, 25 September 2009 1:50:15 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 434.9 0.0 37028.5 0.0 17.3 0.0 39.7 1 52 0 0 0 0 c11t0d0 1.0 436.9 64.3 37372.1 0.0 17.1 0.0 39.0 1 51 0 0 0 0 c11t1d0 1.0 442.9 64.3 38543.2 0.0 17.2 0.0 38.7 1 52 0 0 0 0 c11t2d0 1.0 436.9 64.3 37834.2 0.0 17.3 0.0 39.6 1 52 0 0 0 0 c11t3d0 1.0 412.8 64.3 35935.0 0.0 16.8 0.0 40.7 0 52 0 0 0 0 c11t4d0 1.0 413.8 64.3 35342.5 0.0 16.6 0.0 40.1 0 51 0 0 0 0 c11t5d0 2.0 418.8 128.6 36321.3 0.0 16.5 0.0 39.3 0 52 0 0 0 0 c11t6d0 1.0 425.8 64.3 36660.4 0.0 16.6 0.0 39.0 1 51 0 0 0 0 c11t7d0 1.0 437.9 64.3 37484.0 0.0 17.2 0.0 39.2 1 52 0 0 0 0 c11t8d0 0.0 437.9 0.0 37968.1 0.0 17.2 0.0 39.2 1 52 0 0 0 0 c11t9d0 So, 533MB source file, 13 seconds to write to the slog (14 before, no appreciable change), 1071.5MB written to the slog, 692.3MB written to primary storage. Just another data point. cheers, James -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090925/35534724/attachment.html>
>On Fri, 25 Sep 2009, James Lever wrote:
>> NFS Version 3 introduces the concept of "safe asynchronous writes."
>
>Being "safe" then requires a responsibilty level on the client which is often not present.  For example, if the server crashes, and then the client crashes, how does the client resend the uncommitted data?  If the client had a non-volatile storage cache, then it would be able to responsibly finish the writes that failed.

If the client crashes, it is clear that "work will be lost" up to the point of the client's last successful commit.  Other than supporting the NFSv3 COMMIT operation and resending the uncommitted operations, there is little more the client can do: if the client crashes, we know that non-committed operations may be dropped on the floor.

>The commentary says that normally the COMMIT operations occur during close(2) or fsync(2) system call, or when encountering memory pressure.  If the problem is slow copying of many small files, this COMMIT approach does not help very much since very little data is sent per file and most time is spent creating directories and files.

Indeed; the commit is mostly there to make sure that the pipe between the server and the client can be kept filled for write operations.

Casper
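If it helps to pin down exactly when the client issues COMMITs relative to the slog bursts, a packet capture taken alongside the iostat loop is straightforward to line up. A sketch only: e1000g0 and eth0 are stand-in interface names, and the Wireshark display filter assumes the NFSv3 COMMIT procedure number (21).

  # On the OpenSolaris server, capture NFS traffic during one copy test:
  snoop -d e1000g0 -o /var/tmp/nfs-test.snoop port 2049

  # Or capture on a Linux client with tcpdump, as in the earlier test:
  tcpdump -i eth0 -s 0 -w /var/tmp/nfs-test.pcap port 2049

  # Then open the capture in Wireshark and filter on NFSv3 COMMIT calls:
  #   nfs.procedure_v3 == 21

Correlating the COMMIT timestamps with the slog write bursts shows whether the data is being logged as it arrives or only at the final commit.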
On Thu, Sep 24, 2009 at 11:29 PM, James Lever <j at jamver.id.au> wrote:
>
> On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:
>
> The commentary says that normally the COMMIT operations occur during close(2) or fsync(2) system call, or when encountering memory pressure.  If the problem is slow copying of many small files, this COMMIT approach does not help very much since very little data is sent per file and most time is spent creating directories and files.
>
> The problem appears to be slog bandwidth exhaustion due to all data being sent via the slog, creating contention for all following NFS or locally synchronous writes.  The NFS writes do not appear to be synchronous in nature - there is only a COMMIT being issued at the very end; however, all of that data appears to be going via the slog and it appears to be inflating to twice its original size.
>
> For a test, I just copied a relatively small file (8.4MB in size).  Looking at a tcpdump analysis using wireshark, there is a SETATTR which ends with a V3 COMMIT and no COMMIT messages during the transfer.
>
> iostat output that matches looks like this: slog write of the data (17MB appears to hit the slog)
[snip]
> then a few seconds later, the transaction group gets flushed to primary storage writing nearly 11.4MB, which is in line with RAIDZ2 (expect around 10.5MB; 8.4/8*10):
[snip]
> So I performed the same test with a much larger file (533MB) to see what it would do, being larger than the NVRAM cache in front of the SSD.  Note that after the second second of activity the NVRAM is full and only allowing in about the sequential write speed of the SSD (~70MB/s).
[snip]
> Again, the slog wrote about double the file size (1022.6MB) and a few seconds later, the data was pushed to the primary storage (684.9MB with an expectation of 666MB = 533MB/8*10) so again about the right number hit the spinning platters.
[snip]
> Can anybody explain what is going on with the slog device in that all data is being shunted via it and why about double the data size is being written to it per transaction?

By any chance do you have copies=2 set?  That would make two transactions out of one.

Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half, depending on how long it takes to flush):

echo zfs_write_limit_override/W0t268435456 | mdb -kw

Set the PERC flush interval to, say, 1 second.

As a side note, an slog device will not be too beneficial for large sequential writes, because they are throughput bound, not latency bound.  slog devices really help when you have lots of small sync writes.  A RAIDZ2 with the ZIL spread across it will provide much higher throughput than an SSD.  An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (though an L2ARC would be beneficial).

Better workload analysis is really what it is about.

-Ross
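For completeness, acting on those two suggestions might look roughly like the following. The pool name "tank" is a stand-in, the 256MB value simply mirrors the example above, and whether zfs_write_limit_override is honoured from /etc/system depends on the build, so treat this as a sketch rather than a recipe.

  # Check whether any dataset in the pool has copies set above 1:
  zfs get -r copies tank

  # Live change (as above): cap the amount of dirty data per txg at 256MB.
  echo zfs_write_limit_override/W0t268435456 | mdb -kw

  # Read the current value back to confirm:
  echo zfs_write_limit_override/D | mdb -k

  # Rough persistent equivalent in /etc/system (takes effect at next boot):
  #   set zfs:zfs_write_limit_override=0x10000000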
On Fri, 25 Sep 2009, Ross Walker wrote:
>
> As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much

Surely this depends on the origin of the large sequential writes.  If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win.  If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping.

If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second.  Since data is buffered in the Ethernet, TCP/IP, NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Fri, 25 Sep 2009, Ross Walker wrote:
>>
>> As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much
>
> Surely this depends on the origin of the large sequential writes.  If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win.  If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping.
>
> If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second.  Since data is buffered in the Ethernet, TCP/IP, NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance.

Specifically I was talking NFS, as that was what the OP was talking about, but yes, it does depend on the origin.  You also assume that NFS IO goes over only a single 1Gbe interface, when it could be over multiple 1Gbe interfaces, a 10Gbe interface, or even multiple 10Gbe interfaces.  You also assume the IO recorded in the ZIL is just the raw IO, when there is also metadata and possibly multiple transaction copies as well.

Personally I still prefer to spread the ZIL across the pool and have a large NVRAM-backed HBA, as opposed to an slog, which really puts all my IO in one basket.  If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste.

-Ross
On Sep 25, 2009, at 9:14 AM, Ross Walker wrote:
> On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>> On Fri, 25 Sep 2009, Ross Walker wrote:
>>>
>>> As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much
>>
>> Surely this depends on the origin of the large sequential writes.  If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win.  If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping.
>>
>> If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second.  Since data is buffered in the Ethernet, TCP/IP, NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance.
>
> Specifically I was talking NFS, as that was what the OP was talking about, but yes, it does depend on the origin.  You also assume that NFS IO goes over only a single 1Gbe interface, when it could be over multiple 1Gbe interfaces, a 10Gbe interface, or even multiple 10Gbe interfaces.  You also assume the IO recorded in the ZIL is just the raw IO, when there is also metadata and possibly multiple transaction copies as well.
>
> Personally I still prefer to spread the ZIL across the pool and have a large NVRAM-backed HBA, as opposed to an slog, which really puts all my IO in one basket.  If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste.

Back of the envelope math says:
	10 Gbe = ~1 GByte/sec of I/O capacity

If the SSD can only sink 70 MByte/s, then you will need:
	int(1000/70) + 1 = 15 SSDs for the slog

For capacity, you need:
	1 GByte/sec * 30 sec = 30 GBytes

Ross's idea has merit, if the size of the NVRAM in the array is 30 GBytes or so.

Both of the above assume there is lots of memory in the server.  This is increasingly becoming easier to do as memory costs come down and you can physically fit 512 GBytes in a 4u server.  By default, the txg commit will occur when 1/8 of memory is used for writes.  For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers.

However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays.  So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit.  Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet.  [Cue for Neil to chime in... :-)]

 -- richard
On 26/09/2009, at 1:14 AM, Ross Walker wrote:

> By any chance do you have copies=2 set?

No, only 1.  So the double data going to the slog (as reported by iostat) is still confusing me, and clearly potentially causing significant harm to my performance.

> Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half, depending on how long it takes to flush):
>
> echo zfs_write_limit_override/W0t268435456 | mdb -kw

That's an interesting concept.  All data still appears to go via the slog device; however, under heavy load my response time for a new write is typically below 2s (a few outliers at about 3.5s) and a read (directory listing of a non-cached entry) is about 2s.

What will this do once it hits the limit?  Will streaming writes then be sent directly to a txg and streamed to the primary storage devices?  (That is what I would like to see happen.)

> As a side an slog device will not be too beneficial for large sequential writes, because they are throughput bound, not latency bound.  slog devices really help when you have lots of small sync writes.  A RAIDZ2 with the ZIL spread across it will provide much higher throughput than an SSD.  An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (though an L2ARC would be beneficial).
>
> Better workload analysis is really what it is about.

It seems that it doesn't matter what the workload is if the NFS pipe can sustain more continuous throughput than the slog chain can support.

I suppose some creative use of the logbias setting might assist this situation and force all potentially heavy writers directly to the primary storage.  This would, however, negate any benefit of having a fast, low latency device for those filesystems at the times when it is desirable (any large batch of small writes, for example).

Is there a way to have a dynamic, auto-logbias type setting depending on the transaction currently presented to the server, such that if it is clearly a large streaming write it gets treated as logbias=throughput, and if it is a small transaction it gets treated as logbias=latency?  (i.e. such that NFS transactions can be effectively treated as if they were local storage, while only slightly breaking the benefits of the txg scheduling)

On 26/09/2009, at 3:39 AM, Richard Elling wrote:

> Back of the envelope math says:
> 	10 Gbe = ~1 GByte/sec of I/O capacity
>
> If the SSD can only sink 70 MByte/s, then you will need:
> 	int(1000/70) + 1 = 15 SSDs for the slog
>
> For capacity, you need:
> 	1 GByte/sec * 30 sec = 30 GBytes
>
> Ross's idea has merit, if the size of the NVRAM in the array is 30 GBytes or so.

At this point, enter the fusionIO cards or similar devices.  Unfortunately there does not seem to be anything on the market with infinitely fast write capacity (memory speeds) that is also supported under OpenSolaris as a slog device.  I think this is precisely what I (and anybody running a general purpose NFS server) need for a general purpose slog device.

> Both of the above assume there is lots of memory in the server.  This is increasingly becoming easier to do as memory costs come down and you can physically fit 512 GBytes in a 4u server.  By default, the txg commit will occur when 1/8 of memory is used for writes.  For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers.
> However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays.  So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit.  Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet.  [Cue for Neil to chime in... :-)]

How does reducing the txg commit interval really help?  Will data no longer go via the slog once it is streaming to disk, or will all data still be pushed through the slog regardless?

For a predominantly NFS server purpose, it really looks like the slog has to outperform your main pool for continuous write speed, as well as offering instant response time, as the primary criteria.  That might as well be a fast SSD (or group of fast SSDs), or 15kRPM drives with some NVRAM in front of them.

Is there also a way to throttle synchronous writes to the slog device, much like the ZFS write throttling that is already implemented, so that there is a gap for new writers to enter when writing to the slog device?  (Or is this the norm, and it already includes slog writes?)

cheers,
James
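On the logbias idea and the txg-interval question above, the knobs being discussed would be exercised roughly as follows. A sketch only: "tank/home" is a stand-in dataset name, the logbias property requires a build recent enough to include it (it postdates b118), and the txg tunable names and defaults varied between builds, so verify them against your kernel before relying on this.

  # Send bulk/streaming datasets straight to the pool, keeping the slog
  # for latency-sensitive datasets (per-dataset and reversible):
  zfs set logbias=throughput tank/home
  zfs get logbias tank/home

  # Shorten the txg commit interval from the default 30s to, say, 5s
  # (runtime change via mdb; the variable name is build-dependent):
  echo zfs_txg_timeout/W0t5 | mdb -kw
  echo zfs_txg_timeout/D | mdb -k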
On Fri, Sep 25, 2009 at 5:24 PM, James Lever <j at jamver.id.au> wrote:
>
> On 26/09/2009, at 1:14 AM, Ross Walker wrote:
>
>> By any chance do you have copies=2 set?
>
> No, only 1.  So the double data going to the slog (as reported by iostat) is still confusing me, and clearly potentially causing significant harm to my performance.

Weird, then; I thought that would be an easy explanation.

>> Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half, depending on how long it takes to flush):
>>
>> echo zfs_write_limit_override/W0t268435456 | mdb -kw
>
> That's an interesting concept.  All data still appears to go via the slog device; however, under heavy load my response time for a new write is typically below 2s (a few outliers at about 3.5s) and a read (directory listing of a non-cached entry) is about 2s.
>
> What will this do once it hits the limit?  Will streaming writes then be sent directly to a txg and streamed to the primary storage devices?  (That is what I would like to see happen.)

It sets the maximum size of a txg to the given value.  When it hits that number, it flushes to disk.

>> As a side an slog device will not be too beneficial for large sequential writes, because they are throughput bound, not latency bound.  slog devices really help when you have lots of small sync writes.  A RAIDZ2 with the ZIL spread across it will provide much higher throughput than an SSD.  An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (though an L2ARC would be beneficial).
>>
>> Better workload analysis is really what it is about.
>
> It seems that it doesn't matter what the workload is if the NFS pipe can sustain more continuous throughput than the slog chain can support.

Only on large sequentials; small sync IO should benefit from the slog.

> I suppose some creative use of the logbias setting might assist this situation and force all potentially heavy writers directly to the primary storage.  This would, however, negate any benefit of having a fast, low latency device for those filesystems at the times when it is desirable (any large batch of small writes, for example).
>
> Is there a way to have a dynamic, auto-logbias type setting depending on the transaction currently presented to the server, such that if it is clearly a large streaming write it gets treated as logbias=throughput, and if it is a small transaction it gets treated as logbias=latency?  (i.e. such that NFS transactions can be effectively treated as if they were local storage, while only slightly breaking the benefits of the txg scheduling)

I'll leave that to the Sun guys to answer.

-Ross
On Fri, Sep 25, 2009 at 1:39 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Sep 25, 2009, at 9:14 AM, Ross Walker wrote:
>> On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>>> On Fri, 25 Sep 2009, Ross Walker wrote:
>>>>
>>>> As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much
>>>
>>> Surely this depends on the origin of the large sequential writes.  If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win.  If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping.
>>>
>>> If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second.  Since data is buffered in the Ethernet, TCP/IP, NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance.
>>
>> Specifically I was talking NFS, as that was what the OP was talking about, but yes, it does depend on the origin.  You also assume that NFS IO goes over only a single 1Gbe interface, when it could be over multiple 1Gbe interfaces, a 10Gbe interface, or even multiple 10Gbe interfaces.  You also assume the IO recorded in the ZIL is just the raw IO, when there is also metadata and possibly multiple transaction copies as well.
>>
>> Personally I still prefer to spread the ZIL across the pool and have a large NVRAM-backed HBA, as opposed to an slog, which really puts all my IO in one basket.  If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste.
>
> Back of the envelope math says:
> 	10 Gbe = ~1 GByte/sec of I/O capacity
>
> If the SSD can only sink 70 MByte/s, then you will need:
> 	int(1000/70) + 1 = 15 SSDs for the slog
>
> For capacity, you need:
> 	1 GByte/sec * 30 sec = 30 GBytes

Where did the 30 seconds come in here?  The amount of time to hold cache depends on how fast you can fill it.

> Ross's idea has merit, if the size of the NVRAM in the array is 30 GBytes or so.

I'm thinking you can do with less if you don't need to hold it for 30 seconds.

> Both of the above assume there is lots of memory in the server.  This is increasingly becoming easier to do as memory costs come down and you can physically fit 512 GBytes in a 4u server.  By default, the txg commit will occur when 1/8 of memory is used for writes.  For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers.
>
> However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays.  So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit.  Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet.  [Cue for Neil to chime in... :-)]

I'm sorry, did I miss something Bob said about the txg commit interval?  I looked back and didn't see it; maybe it was off-list?

-Ross
j at jamver.id.au said:
> For a predominantly NFS server purpose, it really looks like the slog has to outperform your main pool for continuous write speed, as well as offering instant response time, as the primary criteria.  That might as well be a fast SSD (or group of fast SSDs), or 15kRPM drives with some NVRAM in front of them.

I wonder if you ran Richard Elling's "zilstat" while running your workload.  That should tell you how much ZIL bandwidth is needed, and it would be interesting to see if its stats match with your other measurements of slog-device traffic.

I did some filebench and "tar extract over NFS" tests of a J4400 (500GB, 7200RPM SATA drives), with and without a slog, where the slog was using the internal 2.5" 10kRPM SAS drives in an X4150.  These drives were behind the standard Sun/Adaptec internal RAID controller, 256MB battery-backed cache memory, all on Solaris-10U7.

We saw slight differences on the filebench oltp profile, and a huge speedup for the "tar extract over NFS" tests with the slog present.  Granted, the latter was with only one NFS client, so it likely did not fill NVRAM.  Pretty good results for a poor-person's slog, though:
	http://acc.ohsu.edu/~hakansom/j4400_bench.html

Just as an aside, and based on my experience as a user/admin of various NFS-server vendors, the old Prestoserve cards and NetApp filers seem to get very good improvements with relatively small amounts of NVRAM (128K, 1MB, 256MB, etc.).  None of the filers I've seen have ever had tens of GB of NVRAM.

Regards,
Marion
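For anyone wanting to follow that suggestion, zilstat is a DTrace-based ksh script published by Richard Elling rather than part of the base OS, so it has to be fetched separately; the invocation below is only a sketch of typical usage and the exact options may differ between versions of the script.

  # Run as root while the NFS copy test is in progress; it reports the
  # bytes handed to the ZIL per interval, which can then be compared
  # against iostat's kw/s figures for the slog device.
  ./zilstat.ksh 10 6      # 10-second intervals, 6 samples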
On Fri, Sep 25, 2009 at 5:47 PM, Marion Hakanson <hakansom at ohsu.edu> wrote:
> j at jamver.id.au said:
>> For a predominantly NFS server purpose, it really looks like the slog has to outperform your main pool for continuous write speed, as well as offering instant response time, as the primary criteria.  That might as well be a fast SSD (or group of fast SSDs), or 15kRPM drives with some NVRAM in front of them.
>
> I wonder if you ran Richard Elling's "zilstat" while running your workload.  That should tell you how much ZIL bandwidth is needed, and it would be interesting to see if its stats match with your other measurements of slog-device traffic.

Yes, but if it's on NFS you can just figure out the workload in MB/s and use that as a rough guideline.

The problem is that most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput.

> I did some filebench and "tar extract over NFS" tests of a J4400 (500GB, 7200RPM SATA drives), with and without a slog, where the slog was using the internal 2.5" 10kRPM SAS drives in an X4150.  These drives were behind the standard Sun/Adaptec internal RAID controller, 256MB battery-backed cache memory, all on Solaris-10U7.
>
> We saw slight differences on the filebench oltp profile, and a huge speedup for the "tar extract over NFS" tests with the slog present.  Granted, the latter was with only one NFS client, so it likely did not fill NVRAM.  Pretty good results for a poor-person's slog, though:
> 	http://acc.ohsu.edu/~hakansom/j4400_bench.html

I did a similar test with a 512MB BBU controller and saw no difference with or without the SSD slog, so I didn't end up using it.

Does your BBU controller ignore the ZFS flushes?

> Just as an aside, and based on my experience as a user/admin of various NFS-server vendors, the old Prestoserve cards and NetApp filers seem to get very good improvements with relatively small amounts of NVRAM (128K, 1MB, 256MB, etc.).  None of the filers I've seen have ever had tens of GB of NVRAM.

They don't hold on to the cache for a long time, just as long as it takes to write it all to disk.

-Ross
On Fri, 25 Sep 2009, Richard Elling wrote:
> By default, the txg commit will occur when 1/8 of memory is used for writes.  For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers.

Ahem.  We were advised that 7/8s of memory is currently what is allowed for writes.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Fri, 25 Sep 2009, Ross Walker wrote:
> The problem is that most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput.

Who said that the slog SSD is written to in 128K chunks?  That seems wrong to me.  Previously we were advised that the slog is basically a log of uncommitted system calls, so the size of the data chunks written to the slog should be similar to the data sizes in the system calls.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
rswwalker at gmail.com said:
> Yes, but if it's on NFS you can just figure out the workload in MB/s and use that as a rough guideline.

I wonder if that's the case.  We have an NFS server without NVRAM cache (X4500), and it gets huge MB/sec throughput on large-file writes over NFS.  But it's painfully slow on the "tar extract lots of small files" test, where many tiny, synchronous metadata operations are performed.

> I did a similar test with a 512MB BBU controller and saw no difference with or without the SSD slog, so I didn't end up using it.
>
> Does your BBU controller ignore the ZFS flushes?

I believe it does (it would be slow otherwise).  It's the Sun StorageTek internal SAS RAID HBA.

Regards,
Marion
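If there is any doubt about whether a particular controller honours or ignores the ZFS cache-flush requests, the workaround usually mentioned for genuinely non-volatile write caches is the zfs_nocacheflush tunable. A sketch only, and a blunt one: it disables flushes globally, so it is only safe when every device in every pool sits behind battery-backed or otherwise non-volatile cache.

  # Check the current setting (0 = cache flushes are sent, the default):
  echo zfs_nocacheflush/D | mdb -k

  # Runtime change (reverts at reboot):
  echo zfs_nocacheflush/W1 | mdb -kw

  # Persistent form in /etc/system:
  #   set zfs:zfs_nocacheflush=1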
On Sep 25, 2009, at 6:19 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Fri, 25 Sep 2009, Ross Walker wrote:
>
>> The problem is that most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput.
>
> Who said that the slog SSD is written to in 128K chunks?  That seems wrong to me.  Previously we were advised that the slog is basically a log of uncommitted system calls, so the size of the data chunks written to the slog should be similar to the data sizes in the system calls.

Are these not broken into recordsize chunks?

-Ross
On 09/25/09 16:19, Bob Friesenhahn wrote:
> On Fri, 25 Sep 2009, Ross Walker wrote:
>
>> The problem is that most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput.
>
> Who said that the slog SSD is written to in 128K chunks?  That seems wrong to me.  Previously we were advised that the slog is basically a log of uncommitted system calls, so the size of the data chunks written to the slog should be similar to the data sizes in the system calls.

Log blocks are variable in size, depending on what needs to be committed.  The minimum size is 4KB and the maximum is 128KB.  Log records are aggregated and written together as much as possible.

Neil.