I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.

zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.

Automount is in use both locally and remotely (Linux clients). Locally, /home/* is remounted from the zpool; remotely, /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.

The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and recurrence of the fault: one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).

I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.

Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on the list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB of write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)

My DTrace is very poor, but I'm suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and was able to share it, it would be much appreciated.

Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

cheers,
James
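A minimal DTrace sketch for the kind of help being asked for here (not something posted in the thread; it assumes only the stock syscall provider) is to bucket the latency of the stat()-family and getdents() calls that ls issues, keyed by syscall and command, so a 2-30 second stall shows up as an outlier in the histogram. Values are in nanoseconds; run it as root on the server while the fault is being reproduced:

    dtrace -n '
    syscall::stat*:entry, syscall::lstat*:entry, syscall::getdents*:entry
    {
            self->ts = timestamp;
    }
    syscall::stat*:return, syscall::lstat*:return, syscall::getdents*:return
    /self->ts/
    {
            /* latency histogram, keyed by syscall name and command */
            @lat[probefunc, execname] = quantize(timestamp - self->ts);
            self->ts = 0;
    }
    tick-30s
    {
            printa(@lat);
            trunc(@lat);
    }'

Anything with counts up in the billions-of-nanoseconds buckets corresponds to the multi-second ls behaviour described above.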
On Sun, Sep 6, 2009 at 9:15 AM, James Lever <j at jamver.id.au> wrote:

> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.
>
> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>
> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.
>
> Automount is in use both locally and remotely (Linux clients). Locally, /home/* is remounted from the zpool; remotely, /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.
>
> The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and recurrence of the fault: one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).
>
> I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.
>
> Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on the list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB of write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)
>
> My DTrace is very poor, but I'm suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and was able to share it, it would be much appreciated.
>
> Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.

You have iSCSI, NFS, CIFS to choose from (most obvious); try restarting them one at a time during down time and see if performance improves after each restart to find the culprit.

-Ross
On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:

> On Sun, Sep 6, 2009 at 9:15 AM, James Lever <j at jamver.id.au> wrote:
>> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.

I'm confused. If "This problem has only been noticed via NFS (v3" then how is it "observed locally?"

>> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>>
>> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.
>>
>> Automount is in use both locally and remotely (Linux clients). Locally, /home/* is remounted from the zpool; remotely, /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.
>>
>> The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and recurrence of the fault: one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).

iostat(1m) is the program for troubleshooting performance issues related to latency. It will show the latency of nfs mounts as well as other devices.

>> I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.
>>
>> Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on the list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB of write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)

stat(2) doesn't write, so you can stop worrying about the slog.

>> My DTrace is very poor, but I'm suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and was able to share it, it would be much appreciated.
>> Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

Rule out the network by looking at retransmissions and ioerrors with netstat(1m) on both the client and server.

> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.

See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with the "Physical Memory Control Using the Resource Capping Daemon" chapter in System Administration Guide: Solaris Containers-Resource Management, and Solaris Zones.
 -- richard

> You have iSCSI, NFS, CIFS to choose from (most obvious); try restarting them one at a time during down time and see if performance improves after each restart to find the culprit.
>
> -Ross
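Purely as a hedged illustration of the resource-capping route Richard points at (the cap value and project name below are placeholders, not anything from this thread), enabling and watching rcapd looks roughly like:

    rcapadm -E                                    # enable the resource capping daemon
    rcapstat 5                                    # watch per-project RSS against its cap every 5 seconds
    # caps are set per project, e.g. (assumed syntax):
    # projmod -a -K "rcap.max-rss=8GB" user.builder

rcapstat output over a day or two would show whether any project's resident set is growing without bound, which is the leak pattern Ross describes.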
On 07/09/2009, at 12:53 AM, Ross Walker wrote:

> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.

If it was this type of behaviour, where would it be logged when the process was killed/restarted? If it's not logged by default, can that be enabled?

I have not seen any evidence of this in /var/adm/messages, /var/log/syslog, or my /var/log/debug (*.debug), but perhaps I'm not looking for the right clues.

> You have iSCSI, NFS, CIFS to choose from (most obvious); try restarting them one at a time during down time and see if performance improves after each restart to find the culprit.

The downtime is being reported by users, and I have only seen it once (while in their office), so this method of debugging isn't going to help, I'm afraid. (This is why I asked about alternate root cause analysis methods.)

cheers,
James
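If a service process had died and been restarted by SMF, it may show up in that service's log under /var/svc/log and in svcs output, so a quick, non-disruptive check (all stock OpenSolaris commands; nothing here is specific to this system) might be:

    svcs -xv                      # services in maintenance or with recent failures
    ls -lrt /var/svc/log | tail   # recently written service logs record restarts
    echo ::memstat | mdb -k       # kernel/ZFS versus anon (process) memory split
    prstat -s rss -n 10 1 1       # one sample of the largest resident processes

Comparing the ::memstat and prstat numbers across a few days would also show whether anything is leaking towards the memory-pressure scenario Ross suggests.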
On 07/09/2009, at 6:24 AM, Richard Elling wrote:

> On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
>> On Sun, Sep 6, 2009 at 9:15 AM, James Lever <j at jamver.id.au> wrote:
>>> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.
>
> I'm confused. If "This problem has only been noticed via NFS (v3" then how is it "observed locally?"

Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI.

It has been observed in client:/home/user (NFSv3 automount from server:/home/user, redirected to server:/zpool/home/user) and also in server:/home/user (local automount) and server:/zpool/home/user (origin).

> iostat(1m) is the program for troubleshooting performance issues related to latency. It will show the latency of nfs mounts as well as other devices.

What specifically should I be looking for here? (using 'iostat -xen -T d') And I'm guessing I'll require a high level of granularity (1s intervals) to see the issue if it is a single disk or similar.

> stat(2) doesn't write, so you can stop worrying about the slog.

My concern here was that I may have been trying to write (via other concurrent processes) at the same time as there was a memory fault from the ARC to L2ARC.

> Rule out the network by looking at retransmissions and ioerrors with netstat(1m) on both the client and server.

No errors or collisions from either server or clients observed.

>> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.
>
> See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with the "Physical Memory Control Using the Resource Capping Daemon" chapter in System Administration Guide: Solaris Containers-Resource Management, and Solaris Zones.

Thanks Richard, I'll have a look at that today and see where I get.

cheers,
James
Sorry for my earlier post, I responded prematurely.

On Sep 6, 2009, at 9:15 AM, James Lever <j at jamver.id.au> wrote:

> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.

Have you tried snoop/tcpdump/wireshark on the client side and server side to figure out what is being sent and exactly how long it is taking to get a response?

> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).

This config might lead to heavy sync writes (NFS) starving reads, due to the fact that the whole RAIDZ2 behaves as a single disk on writes. How about 2 5-disk RAIDZ2s or 3 4-disk RAIDZs?

Just one or two other vdevs to spread the load can make the world of difference.

> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.

There are a lot of services here, all off one pool? You might be trying to bite off more than the config can chew.

> Automount is in use both locally and remotely (Linux clients). Locally, /home/* is remounted from the zpool; remotely, /home and another filesystem (and children) are mounted using autofs. There was some suspicion that automount is the problem, but no definitive evidence as of yet.

Try taking a particularly bad problem station and configuring it static for a bit to see if it is.

> The problem has definitely been observed with stats (of some form, typically '/usr/bin/ls' output) both remotely, locally in /home/* and locally in /zpool/home/* (the true source location). There is a clear correlation between recency of reads of the directories in question and recurrence of the fault: one user has scripted a regular (15m/30m/hourly tests so far) 'ls' of the filesystems of interest, and this has reduced the fault to minimal noted impact since starting down this path (just for themselves).

Sounds like the user is pre-fetching his attribute cache to overcome poor performance.

> I have removed the l2arc(s) (cache devices) from the pool and the same behaviour has been observed. My suspicion here was that there was perhaps occasional high synchronous load causing heavy writes to the slog devices, and when a stat was requested it may have been faulting from ARC to L2ARC prior to going to the primary data store. The slowness has still been reported since removing the extra cache devices.

That doesn't make a lot of sense to me: the L2ARC is a secondary read cache, so if writes are starving reads then the L2ARC would only help here.

> Another thought I had was along the lines of filesystem caching and heavy writes causing read blocking. I have no evidence that this is the case, but there have been some suggestions on the list recently of limiting the ZFS memory usage for write caching. Can anybody comment on the effectiveness of this? (I have 256MB of write cache in front of the slog SSDs and 512MB in front of the primary storage devices.)

It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.

> My DTrace is very poor, but I'm suspicious that this is the best way to root cause this problem. If somebody has any code that may assist in debugging this problem and was able to share it, it would be much appreciated.

DTrace would tell you, but I wish the learning curve wasn't so steep to get it going.

> Any other suggestions for how to identify this fault and work around it would be greatly appreciated.

I hope I gave some good pointers. I'd first look at the pool configuration.

-Ross
On Sep 6, 2009, at 5:06 PM, James Lever wrote:

> On 07/09/2009, at 6:24 AM, Richard Elling wrote:
>> On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
>> On Sun, Sep 6, 2009 at 9:15 AM, James Lever <j at jamver.id.au> wrote:
>>>> I'm experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an 'ls' (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated - anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo) and remotely via NFS.
>>
>> I'm confused. If "This problem has only been noticed via NFS (v3" then how is it "observed locally?"
>
> Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI.
>
> It has been observed in client:/home/user (NFSv3 automount from server:/home/user, redirected to server:/zpool/home/user) and also in server:/home/user (local automount) and server:/zpool/home/user (origin).

Ok, just so I am clear, when you mean "local automount" you are on the server and using the loopback -- no NFS or network involved?

>> iostat(1m) is the program for troubleshooting performance issues related to latency. It will show the latency of nfs mounts as well as other devices.
>
> What specifically should I be looking for here? (using 'iostat -xen -T d') And I'm guessing I'll require a high level of granularity (1s intervals) to see the issue if it is a single disk or similar.

You are looking for I/O that takes seconds to complete or is stuck in the device. This is in the actv column stuck > 1 and the asvc_t >> 1000.

>> stat(2) doesn't write, so you can stop worrying about the slog.
>
> My concern here was that I may have been trying to write (via other concurrent processes) at the same time as there was a memory fault from the ARC to L2ARC.

stat(2) looks at metadata, which is generally small and compressed. It is also cached in the ARC, by default. If this is repeatable in a short period of time, then it is not an I/O problem and you need to look at:
1. the number of files in the directory
2. the locale (ls sorts by default, and your locale affects the sort time)

>> Rule out the network by looking at retransmissions and ioerrors with netstat(1m) on both the client and server.
>
> No errors or collisions from either server or clients observed.

retrans? As Ross mentioned, wireshark, snoop, or most other network monitors will show network traffic in detail.
 -- richard

>>> That behavior sounds a lot like a process has a memory leak and is filling the VM. On Linux there is an OOM killer for these, but on OpenSolaris, you're the OOM killer.
>>
>> See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with the "Physical Memory Control Using the Resource Capping Daemon" chapter in System Administration Guide: Solaris Containers-Resource Management, and Solaris Zones.
>
> Thanks Richard, I'll have a look at that today and see where I get.
>
> cheers,
> James
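Using the thresholds Richard gives, a small filter over the suggested iostat run could be left running until a user reports a stall. Field positions assume the iostat -xen layout shown later in this thread (actv is field 6, asvc_t field 8, device name last), so treat this as a sketch rather than a drop-in tool:

    iostat -xen -T d 1 | nawk '
        / [AP]M /                               { print; next }   # keep the -T d timestamp lines
        $1 ~ /^[0-9]/ && ($6 > 1 || $8 > 1000)  { print }         # actv stuck, or asvc_t well over 1000 ms
    '

Redirecting that to a file gives a timestamped record of exactly which device was saturated when the next multi-second ls happens.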
On 07/09/2009, at 11:08 AM, Richard Elling wrote:> Ok, just so I am clear, when you mean "local automount" you are > on the server and using the loopback -- no NFS or network involved?Correct. And the behaviour has been seen locally as well as remotely.> You are looking for I/O that takes seconds to complete or is stuck in > the device. This is in the actv column stuck > 1 and the asvc_t >> > 1000Just started having some slow responsiveness reported form a user using emacs (autosave, start of a build) so a small file write request. The second or so before they went to do this, it appears as if the raid cache in front of the slog devices was nearly filled and the SSDs were being utilised quite heavily, but then there was a break where I am seeing relatively light usage on the slog but 100% busy on the device reported. The iostat output is at the end of this message - I can?t make any real sense out of why a user would have seen a ~4s delay at about 2:39:17-18. Only one of the two slog devices are being used at all. Is there some tunable about how multiple slogs are used? c7t[01] are rpool c7t[23] are slog devices in the data pool c11t* are the primary storage devices for the data pool cheers, James Monday, 7 September 2009 2:39:17 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 1475.0 0.0 188799.0 0.0 30.2 0.0 20.5 2 90 0 0 0 0 c7t2d0 0.0 232.0 0.0 29571.8 0.0 33.8 0.0 145.9 0 98 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:18 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t2d0 0.0 0.0 0.0 0.0 0.0 35.0 0.0 0.0 0 100 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:19 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t2d0 0.0 341.0 0.0 43650.1 0.0 35.0 0.0 102.5 0 100 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:20 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t2d0 0.0 342.0 0.0 43774.8 0.0 35.0 0.0 102.2 0 100 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:21 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t2d0 0.0 349.0 0.0 44546.8 0.0 35.0 0.0 100.2 0 100 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:22 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 2.0 0.0 32.0 0.0 0.0 0.0 0.1 0 0 0 0 0 0 c7t2d0 0.0 214.0 0.0 27168.6 0.0 19.7 0.0 91.8 0 61 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0 Monday, 7 September 2009 2:39:23 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 10 0 10 c9t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0 0 0 0 0 0 c7t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c7t1d0 0.0 2.0 0.0 132.0 0.0 0.0 0.0 0.2 0 0 0 0 0 0 c7t2d0 0.0 3.0 0.0 356.1 0.0 0.0 0.0 0.4 0 0 0 0 0 0 c7t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t0d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t1d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t2d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t3d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t4d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t5d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t6d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t7d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t8d0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0 0 0 0 c11t9d0
On 07/09/2009, at 10:46 AM, Ross Walker wrote:

>> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>
> This config might lead to heavy sync writes (NFS) starving reads, due to the fact that the whole RAIDZ2 behaves as a single disk on writes. How about 2 5-disk RAIDZ2s or 3 4-disk RAIDZs?
>
> Just one or two other vdevs to spread the load can make the world of difference.

This was a management decision. I wanted to go down the striped mirrored pair solution, but the amount of space lost was considered too great. RAIDZ2 was considered the best value option for our environment.

>> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.
>
> There are a lot of services here, all off one pool? You might be trying to bite off more than the config can chew.

That's not a lot of services, really. We have 6 users doing builds on multiple platforms and using the storage as their home directory (Windows and Unix).

The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.

> Try taking a particularly bad problem station and configuring it static for a bit to see if it is.

That has been considered also, but the issue has also been observed locally on the fileserver.

> That doesn't make a lot of sense to me: the L2ARC is a secondary read cache, so if writes are starving reads then the L2ARC would only help here.

I was suggesting that slog writes were possibly starving reads from the l2arc as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.

> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.

Possible, but hard to tell. Have a look at the iostat results I've posted.

cheers,
James
On Sep 7, 2009, at 1:32 AM, James Lever <j at jamver.id.au> wrote:

> On 07/09/2009, at 10:46 AM, Ross Walker wrote:
>
>>> zpool is RAIDZ2 comprised of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB BBWC exposed as RAID0 LUNs (Dell MD1000 behind PERC 6/E), with 2x SSDs each partitioned as a 10GB slog and the 36GB remainder as l2arc, behind another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
>>
>> This config might lead to heavy sync writes (NFS) starving reads, due to the fact that the whole RAIDZ2 behaves as a single disk on writes. How about 2 5-disk RAIDZ2s or 3 4-disk RAIDZs?
>>
>> Just one or two other vdevs to spread the load can make the world of difference.
>
> This was a management decision. I wanted to go down the striped mirrored pair solution, but the amount of space lost was considered too great. RAIDZ2 was considered the best value option for our environment.

Well, an MD1000 holds 15 drives; a good compromise might be 2 7-drive RAIDZ2s with a hot spare... That should provide 320 IOPS instead of 160, a big difference.

>>> The system is configured as an NFS (currently serving NFSv3), iSCSI (COMSTAR) and CIFS server (using the Sun SFW package running Samba 3.0.34), with authentication taking place against a remote OpenLDAP server.
>>
>> There are a lot of services here, all off one pool? You might be trying to bite off more than the config can chew.
>
> That's not a lot of services, really. We have 6 users doing builds on multiple platforms and using the storage as their home directory (Windows and Unix).

Ok, six users, but what happens during a build?

> The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.

Look at the write IOPS of the pool with zpool iostat -v and look at how many are happening on the RAIDZ2 vdev.

>> Try taking a particularly bad problem station and configuring it static for a bit to see if it is.
>
> That has been considered also, but the issue has also been observed locally on the fileserver.

Then I suppose you have eliminated the automounter as a culprit at this point.

>> That doesn't make a lot of sense to me: the L2ARC is a secondary read cache, so if writes are starving reads then the L2ARC would only help here.
>
> I was suggesting that slog writes were possibly starving reads from the l2arc as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.

The SSD will handle a lot more IOPS than the pool, and the L2ARC is a lazy reader; it mostly just holds on to read cache data.

>> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.
>
> Possible, but hard to tell. Have a look at the iostat results I've posted.

The busy times of the disks while the issue is occurring should let you know.

-Ross
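As a concrete form of that suggestion (the pool name is a placeholder):

    zpool iostat -v tank 5    # per-vdev read/write operations and bandwidth every 5 seconds

During a build, compare the write operations landing on the raidz2 vdev against the rough ceiling of a single disk's worth of IOPS that Ross describes for a single raidz2 vdev.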
On 08/09/2009, at 2:01 AM, Ross Walker wrote:

> On Sep 7, 2009, at 1:32 AM, James Lever <j at jamver.id.au> wrote:
>
> Well, an MD1000 holds 15 drives; a good compromise might be 2 7-drive RAIDZ2s with a hot spare... That should provide 320 IOPS instead of 160, a big difference.
>
>> The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.
>
> Look at the write IOPS of the pool with zpool iostat -v and look at how many are happening on the RAIDZ2 vdev.
>
>> I was suggesting that slog writes were possibly starving reads from the l2arc as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.
>
> The SSD will handle a lot more IOPS than the pool, and the L2ARC is a lazy reader; it mostly just holds on to read cache data.
>
>>> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.
>>
>> Possible, but hard to tell. Have a look at the iostat results I've posted.
>
> The busy times of the disks while the issue is occurring should let you know.

So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of this SLC SSD) and, until the load drops, all new write requests are blocked, causing a noticeable delay (which has been observed to be up to 20s, but is generally only 2-4s).

I can reproduce this behaviour by copying a large file (hundreds of MB in size) using 'cp src dst' on an NFS (still currently v3) client and observing that all data is pushed through the slog device (10GB partition of a Samsung 50GB SSD behind a PERC 6/i w/256MB BBC) rather than going direct to the primary storage disks.

On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device made an improvement on performance, but did not remove the occasional observed pauses.

I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?

The clients are (mostly) RHEL5.

Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?

I have investigated using the logbias setting, but that will just kill small file performance also on any filesystem using it and defeat the purpose of having a slog device at all.

cheers,
James
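One point worth noting about the logbias route (illustrative commands only; the dataset names are placeholders) is that it is a per-dataset property, so it does not have to be traded off globally - a dataset that only ever sees large streaming writes can be pointed away from the slog while home directories keep the default:

    zfs set logbias=throughput tank/builds   # ZIL blocks for this dataset are allocated from the main pool, not the slog
    zfs get -r logbias tank                  # everything else stays at the default, latency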
On Thu, 24 Sep 2009, James Lever wrote:

> I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?

What would cause you to understand that?

> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?

Synchronous writes are needed by NFS to support its atomic write requirement. It sounds like your SSD is write-bandwidth bottlenecked rather than IOPS bottlenecked. Replacing your SSD with a more performant one seems like the first step.

NFS client tunings can make a big difference when it comes to performance. Check the nfs(5) manual page for your Linux systems to see what options are available. An obvious tunable is 'wsize', which should ideally match (or be a multiple of) the zfs filesystem block size. The /proc/mounts file for my Debian install shows that 1048576 is being used. This is quite large and perhaps a smaller value would help. If you are willing to accept the risk, using the Linux 'async' mount option may make things seem better.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
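A hedged example of the client-side tunables Bob describes, for a RHEL5 client (the server name and export path are placeholders; 32768 is simply a common choice that also lines up with the ZFS threshold mentioned later in the thread):

    mount -t nfs -o vers=3,proto=tcp,hard,intr,rsize=32768,wsize=32768 \
        server:/zpool/home /home

Adding async to that option list is the riskier variant Bob refers to.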
comment below...

On Sep 23, 2009, at 10:00 PM, James Lever wrote:

> On 08/09/2009, at 2:01 AM, Ross Walker wrote:
>> On Sep 7, 2009, at 1:32 AM, James Lever <j at jamver.id.au> wrote:
>>
>> Well, an MD1000 holds 15 drives; a good compromise might be 2 7-drive RAIDZ2s with a hot spare... That should provide 320 IOPS instead of 160, a big difference.
>>
>>> The issue is interactive responsiveness and whether there is a way to tune the system to give that while still having good performance for builds when they are run.
>>
>> Look at the write IOPS of the pool with zpool iostat -v and look at how many are happening on the RAIDZ2 vdev.
>>
>>> I was suggesting that slog writes were possibly starving reads from the l2arc as they were on the same device. This appears not to have been the issue, as the problem has persisted even with the l2arc devices removed from the pool.
>>
>> The SSD will handle a lot more IOPS than the pool, and the L2ARC is a lazy reader; it mostly just holds on to read cache data.
>>
>>>> It just may be that the pool configuration just can't handle the write IOPS needed and reads are starving.
>>>
>>> Possible, but hard to tell. Have a look at the iostat results I've posted.
>>
>> The busy times of the disks while the issue is occurring should let you know.
>
> So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of this SLC SSD) and, until the load drops, all new write requests are blocked, causing a noticeable delay (which has been observed to be up to 20s, but is generally only 2-4s).

Thank you sir, can I have another?
If you add (not attach) more slogs, the workload will be spread across them. But...

> I can reproduce this behaviour by copying a large file (hundreds of MB in size) using 'cp src dst' on an NFS (still currently v3) client and observing that all data is pushed through the slog device (10GB partition of a Samsung 50GB SSD behind a PERC 6/i w/256MB BBC) rather than going direct to the primary storage disks.
>
> On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device made an improvement on performance, but did not remove the occasional observed pauses.

...this is not surprising, when you add a slow slog device. This is the weakest link rule.

> I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?

The threshold is 32 kBytes, which is unfortunately the same as the default NFS write size. See CR 6686887
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887

If you have a slog and logbias=latency (default) then the writes go to the slog. So there is some interaction here that can affect NFS workloads in particular.

> The clients are (mostly) RHEL5.
>
> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?

You can change the IOP size on the client.
 -- richard

> I have investigated using the logbias setting, but that will just kill small file performance also on any filesystem using it and defeat the purpose of having a slog device at all.
>
> cheers,
> James
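For what it's worth, the 32 kByte threshold Richard describes corresponds, as far as I understand it (treat the tunable name as an assumption rather than something verified in this thread), to zfs_immediate_write_sz, which can be read on a live system with mdb:

    echo zfs_immediate_write_sz/E | mdb -k   # prints the current threshold in bytes

With a dedicated slog and the default logbias=latency, though, sync write payloads are logged to the slog regardless of size, which matches the behaviour reported above.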
On 25/09/2009, at 2:58 AM, Richard Elling wrote:

> On Sep 23, 2009, at 10:00 PM, James Lever wrote:
>
>> So it turns out that the problem is that all writes coming via NFS are going through the slog. When that happens, the transfer speed to the device drops to ~70MB/s (the write speed of this SLC SSD) and, until the load drops, all new write requests are blocked, causing a noticeable delay (which has been observed to be up to 20s, but is generally only 2-4s).
>
> Thank you sir, can I have another?
> If you add (not attach) more slogs, the workload will be spread across them. But...

My log configuration is:

	logs
	  c7t2d0s0  ONLINE   0  0  0
	  c7t3d0s0  OFFLINE  0  0  0

I'm going to test the now removed SSD and see if I can get it to perform significantly worse than the first one, but my memory of testing these at pre-production time was that they were both equally slow but not significantly different.

>> On a related note, I had 2 of these devices (both using just 10GB partitions) connected as log devices (so the pool had 2 separate log devices) and the second one was consistently running significantly slower than the first. Removing the second device made an improvement on performance, but did not remove the occasional observed pauses.
>
> ...this is not surprising, when you add a slow slog device. This is the weakest link rule.

So, in theory, even if one of the two SSDs was only slightly slower than the other, it would just appear that it was more heavily affected?

Here is part of what I'm not understanding - unless one SSD is significantly worse than the other, how can the following scenario be true?

Here is some iostat output from the two slog devices at 1s intervals when it gets a large series of write requests. Idle at start.

  0.0 1462.0  0.0 187010.2  0.0 28.6  0.0  19.6  2  83  0 0 0 0 c7t2d0
  0.0  233.0  0.0  29823.7  0.0 28.7  0.0 123.3  0  83  0 0 0 0 c7t3d0

NVRAM cache close to full. (256MB BBC)

  0.0   84.0  0.0  10622.0  0.0  3.5  0.0  41.2  0  12  0 0 0 0 c7t2d0
  0.0    0.0  0.0      0.0  0.0 35.0  0.0   0.0  0 100  0 0 0 0 c7t3d0
  0.0    0.0  0.0      0.0  0.0  0.0  0.0   0.0  0   0  0 0 0 0 c7t2d0
  0.0  305.0  0.0  39039.3  0.0 35.0  0.0 114.7  0 100  0 0 0 0 c7t3d0
  0.0    0.0  0.0      0.0  0.0  0.0  0.0   0.0  0   0  0 0 0 0 c7t2d0
  0.0  361.0  0.0  46208.1  0.0 35.0  0.0  96.8  0 100  0 0 0 0 c7t3d0
  0.0    0.0  0.0      0.0  0.0  0.0  0.0   0.0  0   0  0 0 0 0 c7t2d0
  0.0  329.0  0.0  42114.0  0.0 35.0  0.0 106.3  0 100  0 0 0 0 c7t3d0
  0.0    0.0  0.0      0.0  0.0  0.0  0.0   0.0  0   0  0 0 0 0 c7t2d0
  0.0  317.0  0.0  40449.6  0.0 27.4  0.0  86.5  0  85  0 0 0 0 c7t3d0
  0.0    4.0  0.0    263.8  0.0  0.0  0.0   0.2  0   0  0 0 0 0 c7t2d0
  0.0    4.0  0.0    367.8  0.0  0.0  0.0   0.3  0   0  0 0 0 0 c7t3d0

What determines the size of the writes or the distribution between slog devices? It looks like ZFS decided to send a large chunk to one slog, which nearly filled the NVRAM, and then continued writing to the other one, which meant that it had to go at device speed (whatever that is for the data size/write size).

Is there a way to tune the writes to multiple slogs to be (for argument's sake) 10MB slices?

>> I was of the (mis)understanding that only metadata and writes smaller than 64k went via the slog device in the event of an O_SYNC write request?
>
> The threshold is 32 kBytes, which is unfortunately the same as the default NFS write size. See CR 6686887
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6686887
>
> If you have a slog and logbias=latency (default) then the writes go to the slog. So there is some interaction here that can affect NFS workloads in particular.

Interesting CR.

nfsstat -m output on one of the Linux hosts (Ubuntu):

  Flags: rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.1.0.17,mountvers=3,mountproto=tcp,addr=10.1.0.17

rsize and wsize are auto-tuned to 1MB. How does this affect the sync request threshold?

>> The clients are (mostly) RHEL5.
>>
>> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?
>
> You can change the IOP size on the client.

You're suggesting modifying rsize/wsize? Or something else?

cheers,
James
On 25/09/2009, at 1:24 AM, Bob Friesenhahn wrote:

> On Thu, 24 Sep 2009, James Lever wrote:
>> Is there a way to tune this on the NFS server or clients such that when I perform a large synchronous write, the data does not go via the slog device?
>
> Synchronous writes are needed by NFS to support its atomic write requirement. It sounds like your SSD is write-bandwidth bottlenecked rather than IOPS bottlenecked. Replacing your SSD with a more performant one seems like the first step.
>
> NFS client tunings can make a big difference when it comes to performance. Check the nfs(5) manual page for your Linux systems to see what options are available. An obvious tunable is 'wsize', which should ideally match (or be a multiple of) the zfs filesystem block size. The /proc/mounts file for my Debian install shows that 1048576 is being used. This is quite large and perhaps a smaller value would help. If you are willing to accept the risk, using the Linux 'async' mount option may make things seem better.

From the Linux NFS FAQ (http://nfs.sourceforge.net/): "NFS Version 3 introduces the concept of 'safe asynchronous writes.'" And it continues from there.

My rsize and wsize are negotiating to 1MB.

James
On Fri, 25 Sep 2009, James Lever wrote:
>
> NFS Version 3 introduces the concept of "safe asynchronous writes."

Being "safe" then requires a responsibility level on the client which is often not present. For example, if the server crashes, and then the client crashes, how does the client resend the uncommitted data? If the client had a non-volatile storage cache, then it would be able to responsibly finish the writes that failed.

The commentary says that normally the COMMIT operations occur during the close(2) or fsync(2) system call, or when encountering memory pressure. If the problem is slow copying of many small files, this COMMIT approach does not help very much, since very little data is sent per file and most time is spent creating directories and files.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:> The commentary says that normally the COMMIT operations occur during > close(2) or fsync(2) system call, or when encountering memory > pressure. If the problem is slow copying of many small files, this > COMMIT approach does not help very much since very little data is > sent per file and most time is spent creating directories and files.The problem appears to be slog bandwidth exhaustion due to all data being sent via the slog creating a contention for all following NFS or locally synchronous writes. The NFS writes do not appear to be synchronous in nature - there is only a COMMIT being issued at the very end, however, all of that data appears to be going via the slog and it appears to be inflating to twice its original size. For a test, I just copied a relatively small file (8.4MB in size). Looking at a tcpdump analysis using wireshark, there is a SETATTR which ends with a V3 COMMIT and no COMMIT messages during the transfer. iostat output that matches looks like this: slog write of the data (17MB appears to hit the slog) Friday, 25 September 2009 1:01:00 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 135.0 0.0 17154.5 0.0 0.8 0.0 6.0 0 3 0 0 0 0 c7t2d0 then a few seconds later, the transaction group gets flushed to primary storage writing nearly 11.4MB which is inline with raid Z2 (expect around 10.5MB; 8.4/8*10): Friday, 25 September 2009 1:01:13 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 91.0 0.0 1170.4 0.0 0.1 0.0 1.3 0 2 0 0 0 0 c11t0d0 0.0 84.0 0.0 1171.4 0.0 0.1 0.0 1.2 0 2 0 0 0 0 c11t1d0 0.0 92.0 0.0 1172.4 0.0 0.1 0.0 1.2 0 2 0 0 0 0 c11t2d0 0.0 84.0 0.0 1172.4 0.0 0.1 0.0 1.3 0 2 0 0 0 0 c11t3d0 0.0 81.0 0.0 1176.4 0.0 0.1 0.0 1.4 0 2 0 0 0 0 c11t4d0 0.0 86.0 0.0 1176.4 0.0 0.1 0.0 1.4 0 2 0 0 0 0 c11t5d0 0.0 89.0 0.0 1175.4 0.0 0.1 0.0 1.4 0 2 0 0 0 0 c11t6d0 0.0 84.0 0.0 1175.4 0.0 0.1 0.0 1.3 0 2 0 0 0 0 c11t7d0 0.0 91.0 0.0 1168.9 0.0 0.1 0.0 1.3 0 2 0 0 0 0 c11t8d0 0.0 89.0 0.0 1170.9 0.0 0.1 0.0 1.4 0 2 0 0 0 0 c11t9d0 So I performed the same test with a much larger file (533MB) to see what it would do, being larger than the NVRAM cache in front of the SSD. Note that after the second second of activity the NVRAM is full and only allowing in about the sequential write speed of the SSD (~70MB/s). 
Friday, 25 September 2009 1:13:14 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 640.9 0.0 81782.9 0.0 4.2 0.0 6.5 1 14 0 0 0 0 c7t2d0 0.0 1065.7 0.0 136408.1 0.0 18.6 0.0 17.5 1 78 0 0 0 0 c7t2d0 0.0 579.0 0.0 74113.3 0.0 30.7 0.0 53.1 1 100 0 0 0 0 c7t2d0 0.0 588.7 0.0 75357.0 0.0 33.2 0.0 56.3 1 100 0 0 0 0 c7t2d0 0.0 532.0 0.0 68096.3 0.0 31.5 0.0 59.1 1 100 0 0 0 0 c7t2d0 0.0 559.0 0.0 71428.0 0.0 32.5 0.0 58.1 1 100 0 0 0 0 c7t2d0 0.0 542.0 0.0 68755.9 0.0 25.1 0.0 46.4 1 100 0 0 0 0 c7t2d0 0.0 542.0 0.0 69376.4 0.0 35.0 0.0 64.6 1 100 0 0 0 0 c7t2d0 0.0 581.0 0.0 74368.0 0.0 30.6 0.0 52.6 1 100 0 0 0 0 c7t2d0 0.0 567.0 0.0 72574.1 0.0 33.2 0.0 58.6 1 100 0 0 0 0 c7t2d0 0.0 564.0 0.0 72194.1 0.0 31.1 0.0 55.2 1 100 0 0 0 0 c7t2d0 0.0 573.0 0.0 73343.5 0.0 33.2 0.0 57.9 1 100 0 0 0 0 c7t2d0 0.0 536.3 0.0 68640.5 0.0 33.1 0.0 61.7 1 100 0 0 0 0 c7t2d0 0.0 121.9 0.0 15608.9 0.0 2.7 0.0 22.1 0 22 0 0 0 0 c7t2d0 Again, the slog wrote about double the file size (1022.6MB) and a few seconds later, the data was pushed to the primary storage (684.9MB with an expectation of 666MB = 533MB/8*10) so again about the right number hit the spinning platters. Friday, 25 September 2009 1:13:43 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 338.3 0.0 32794.4 0.0 13.7 0.0 40.6 1 47 0 0 0 0 c11t0d0 0.0 325.3 0.0 31399.8 0.0 13.7 0.0 42.0 1 47 0 0 0 0 c11t1d0 0.0 339.3 0.0 33273.3 0.0 13.7 0.0 40.3 1 47 0 0 0 0 c11t2d0 0.0 332.3 0.0 32009.0 0.0 13.7 0.0 41.4 0 47 0 0 0 0 c11t3d0 0.0 352.3 0.0 34364.0 0.0 13.7 0.0 39.0 1 47 0 0 0 0 c11t4d0 0.0 355.2 0.0 33788.7 0.0 13.7 0.0 38.6 1 47 0 0 0 0 c11t5d0 0.0 352.3 0.0 33452.3 0.0 13.8 0.0 39.3 1 47 0 0 0 0 c11t6d0 0.0 339.3 0.0 32873.5 0.0 13.7 0.0 40.4 1 47 0 0 0 0 c11t7d0 0.0 337.3 0.0 32889.0 0.0 13.5 0.0 40.0 1 47 0 0 0 0 c11t8d0 0.0 336.3 0.0 32441.9 0.0 13.7 0.0 40.9 1 47 0 0 0 0 c11t9d0 Friday, 25 September 2009 1:13:44 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 349.7 0.0 35677.0 0.0 16.1 0.0 45.9 0 48 0 0 0 0 c11t0d0 0.0 367.7 0.0 37078.3 0.0 16.1 0.0 43.8 0 49 0 0 0 0 c11t1d0 0.0 348.7 0.0 35197.1 0.0 16.3 0.0 46.9 0 49 0 0 0 0 c11t2d0 0.0 360.7 0.0 36467.7 0.0 15.9 0.0 44.1 0 48 0 0 0 0 c11t3d0 0.0 342.7 0.0 34103.9 0.0 16.2 0.0 47.2 0 48 0 0 0 0 c11t4d0 0.0 347.7 0.0 34682.1 0.0 16.0 0.0 46.0 0 48 0 0 0 0 c11t5d0 0.0 349.7 0.0 35018.3 0.0 16.3 0.0 46.7 0 49 0 0 0 0 c11t6d0 0.0 353.7 0.0 35600.5 0.0 16.1 0.0 45.6 0 49 0 0 0 0 c11t7d0 0.0 350.7 0.0 35580.5 0.0 16.2 0.0 46.1 0 49 0 0 0 0 c11t8d0 0.0 356.7 0.0 36031.0 0.0 15.9 0.0 44.4 0 48 0 0 0 0 c11t9d0 Can anybody explain what is going on with the slog device in that all data is being shunted via it and why about double the data size is being written to it per transaction? cheers, James -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090925/65ca4620/attachment.html>
I thought I would try the same test using dd bs=131072 if=source of=/ path/to/nfs to see what the results looked liked? It is very similar to before, about 2x slog usage and same timing and write totals. Friday, 25 September 2009 1:49:48 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/ w trn tot device 0.0 1538.7 0.0 196834.0 0.0 23.1 0.0 15.0 2 67 0 0 0 0 c7t2d0 0.0 562.0 0.0 71942.3 0.0 35.0 0.0 62.3 1 100 0 0 0 0 c7t2d0 0.0 590.7 0.0 75614.4 0.0 35.0 0.0 59.2 1 100 0 0 0 0 c7t2d0 0.0 600.9 0.0 76920.0 0.0 35.0 0.0 58.2 1 100 0 0 0 0 c7t2d0 0.0 546.0 0.0 69887.9 0.0 35.0 0.0 64.1 1 100 0 0 0 0 c7t2d0 0.0 554.0 0.0 70913.9 0.0 35.0 0.0 63.2 1 100 0 0 0 0 c7t2d0 0.0 598.0 0.0 76549.2 0.0 35.0 0.0 58.5 1 100 0 0 0 0 c7t2d0 0.0 563.0 0.0 72065.1 0.0 35.0 0.0 62.1 1 100 0 0 0 0 c7t2d0 0.0 588.1 0.0 75282.6 0.0 31.5 0.0 53.5 1 100 0 0 0 0 c7t2d0 0.0 564.0 0.0 72195.7 0.0 34.8 0.0 61.7 1 100 0 0 0 0 c7t2d0 0.0 582.8 0.0 74599.8 0.0 35.0 0.0 60.0 1 100 0 0 0 0 c7t2d0 0.0 544.0 0.0 69633.3 0.0 35.0 0.0 64.3 1 100 0 0 0 0 c7t2d0 0.0 530.0 0.0 67191.5 0.0 30.6 0.0 57.7 0 90 0 0 0 0 c7t2d0 And then the write to primary storage a few seconds later: Friday, 25 September 2009 1:50:14 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 426.3 0.0 32196.3 0.0 12.7 0.0 29.8 1 45 0 0 0 0 c11t0d0 0.0 410.4 0.0 31857.1 0.0 12.4 0.0 30.3 1 45 0 0 0 0 c11t1d0 0.0 426.3 0.0 30698.1 0.0 13.0 0.0 30.5 1 45 0 0 0 0 c11t2d0 0.0 429.3 0.0 31392.3 0.0 12.6 0.0 29.4 1 45 0 0 0 0 c11t3d0 0.0 443.2 0.0 33280.8 0.0 12.9 0.0 29.1 1 45 0 0 0 0 c11t4d0 0.0 424.3 0.0 33872.4 0.0 12.7 0.0 30.0 1 45 0 0 0 0 c11t5d0 0.0 432.3 0.0 32903.2 0.0 12.6 0.0 29.2 1 45 0 0 0 0 c11t6d0 0.0 418.3 0.0 32562.0 0.0 12.5 0.0 29.9 1 45 0 0 0 0 c11t7d0 0.0 417.3 0.0 31746.2 0.0 12.4 0.0 29.8 1 44 0 0 0 0 c11t8d0 0.0 424.3 0.0 31270.6 0.0 12.7 0.0 29.9 1 45 0 0 0 0 c11t9d0 Friday, 25 September 2009 1:50:15 PM EST extended device statistics ---- errors --- r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device 0.0 434.9 0.0 37028.5 0.0 17.3 0.0 39.7 1 52 0 0 0 0 c11t0d0 1.0 436.9 64.3 37372.1 0.0 17.1 0.0 39.0 1 51 0 0 0 0 c11t1d0 1.0 442.9 64.3 38543.2 0.0 17.2 0.0 38.7 1 52 0 0 0 0 c11t2d0 1.0 436.9 64.3 37834.2 0.0 17.3 0.0 39.6 1 52 0 0 0 0 c11t3d0 1.0 412.8 64.3 35935.0 0.0 16.8 0.0 40.7 0 52 0 0 0 0 c11t4d0 1.0 413.8 64.3 35342.5 0.0 16.6 0.0 40.1 0 51 0 0 0 0 c11t5d0 2.0 418.8 128.6 36321.3 0.0 16.5 0.0 39.3 0 52 0 0 0 0 c11t6d0 1.0 425.8 64.3 36660.4 0.0 16.6 0.0 39.0 1 51 0 0 0 0 c11t7d0 1.0 437.9 64.3 37484.0 0.0 17.2 0.0 39.2 1 52 0 0 0 0 c11t8d0 0.0 437.9 0.0 37968.1 0.0 17.2 0.0 39.2 1 52 0 0 0 0 c11t9d0 So, 533MB source file, 13 seconds to write to the slog (14 before, no appreciable change), 1071.5MB written to the slog, 692.3MB written to primary storage. Just another data point. cheers, James -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090925/35534724/attachment.html>
>On Fri, 25 Sep 2009, James Lever wrote:
>> NFS Version 3 introduces the concept of "safe asynchronous writes."
>
>Being "safe" then requires a responsibilty level on the client which is often not present.  For example, if the server crashes, and then the client crashes, how does the client resend the uncommitted data?  If the client had a non-volatile storage cache, then it would be able to responsibly finish the writes that failed.

If the client crashes, it is clear that "work will be lost" up to the point of the client's last successful commit.  Other than supporting the NFSv3 COMMIT operation and resending the uncommitted operations, there is little more the client can do: if the client crashes, we know that non-committed operations may be dropped on the floor.

>The commentary says that normally the COMMIT operations occur during close(2) or fsync(2) system call, or when encountering memory pressure.  If the problem is slow copying of many small files, this COMMIT approach does not help very much since very little data is sent per file and most time is spent creating directories and files.

Indeed; the commit is mostly there to make sure that the pipe between the server and the client can be kept filled for write operations.

Casper
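If it helps to pin down exactly when the client issues COMMITs relative to the slog bursts, a packet capture taken alongside the iostat loop is straightforward to line up. A sketch only: e1000g0 and eth0 are stand-in interface names, and the Wireshark display filter assumes the NFSv3 COMMIT procedure number (21).

  # On the OpenSolaris server, capture NFS traffic during one copy test:
  snoop -d e1000g0 -o /var/tmp/nfs-test.snoop port 2049

  # Or capture on a Linux client with tcpdump, as in the earlier test:
  tcpdump -i eth0 -s 0 -w /var/tmp/nfs-test.pcap port 2049

  # Then open the capture in Wireshark and filter on NFSv3 COMMIT calls:
  #   nfs.procedure_v3 == 21

Correlating the COMMIT timestamps with the slog write bursts shows whether the data is being logged as it arrives or only at the final commit.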
On Thu, Sep 24, 2009 at 11:29 PM, James Lever <j at jamver.id.au> wrote:
>
> On 25/09/2009, at 11:49 AM, Bob Friesenhahn wrote:
>
> The commentary says that normally the COMMIT operations occur during close(2) or fsync(2) system call, or when encountering memory pressure.  If the problem is slow copying of many small files, this COMMIT approach does not help very much since very little data is sent per file and most time is spent creating directories and files.
>
> The problem appears to be slog bandwidth exhaustion due to all data being sent via the slog, creating contention for all following NFS or locally synchronous writes.  The NFS writes do not appear to be synchronous in nature - there is only a COMMIT being issued at the very end; however, all of that data appears to be going via the slog and it appears to be inflating to twice its original size.
>
> For a test, I just copied a relatively small file (8.4MB in size).  Looking at a tcpdump analysis using wireshark, there is a SETATTR which ends with a V3 COMMIT and no COMMIT messages during the transfer.
>
> iostat output that matches looks like this: slog write of the data (17MB appears to hit the slog)
[snip]
> then a few seconds later, the transaction group gets flushed to primary storage writing nearly 11.4MB, which is in line with RAIDZ2 (expect around 10.5MB; 8.4/8*10):
[snip]
> So I performed the same test with a much larger file (533MB) to see what it would do, being larger than the NVRAM cache in front of the SSD.  Note that after the second second of activity the NVRAM is full and only allowing in about the sequential write speed of the SSD (~70MB/s).
[snip]
> Again, the slog wrote about double the file size (1022.6MB) and a few seconds later, the data was pushed to the primary storage (684.9MB with an expectation of 666MB = 533MB/8*10) so again about the right number hit the spinning platters.
[snip]
> Can anybody explain what is going on with the slog device in that all data is being shunted via it and why about double the data size is being written to it per transaction?

By any chance do you have copies=2 set?  That would make two transactions out of one.

Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half, depending on how long it takes to flush):

echo zfs_write_limit_override/W0t268435456 | mdb -kw

Set the PERC flush interval to, say, 1 second.

As a side note, an slog device will not be too beneficial for large sequential writes, because they are throughput bound, not latency bound.  slog devices really help when you have lots of small sync writes.  A RAIDZ2 with the ZIL spread across it will provide much higher throughput than an SSD.  An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (though an L2ARC would be beneficial).

Better workload analysis is really what it is about.

-Ross
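For completeness, acting on those two suggestions might look roughly like the following. The pool name "tank" is a stand-in, the 256MB value simply mirrors the example above, and whether zfs_write_limit_override is honoured from /etc/system depends on the build, so treat this as a sketch rather than a recipe.

  # Check whether any dataset in the pool has copies set above 1:
  zfs get -r copies tank

  # Live change (as above): cap the amount of dirty data per txg at 256MB.
  echo zfs_write_limit_override/W0t268435456 | mdb -kw

  # Read the current value back to confirm:
  echo zfs_write_limit_override/D | mdb -k

  # Rough persistent equivalent in /etc/system (takes effect at next boot):
  #   set zfs:zfs_write_limit_override=0x10000000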
On Fri, 25 Sep 2009, Ross Walker wrote:
>
> As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much

Surely this depends on the origin of the large sequential writes.  If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win.  If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping.

If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second.  Since data is buffered in the Ethernet, TCP/IP, NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Fri, 25 Sep 2009, Ross Walker wrote:
>>
>> As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much
>
> Surely this depends on the origin of the large sequential writes.  If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win.  If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping.
>
> If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second.  Since data is buffered in the Ethernet, TCP/IP, NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance.

Specifically I was talking NFS, as that was what the OP was talking about, but yes, it does depend on the origin.  You also assume that NFS IO goes over only a single 1Gbe interface, when it could be over multiple 1Gbe interfaces, a 10Gbe interface, or even multiple 10Gbe interfaces.  You also assume the IO recorded in the ZIL is just the raw IO, when there is also metadata and possibly multiple transaction copies as well.

Personally I still prefer to spread the ZIL across the pool and have a large NVRAM-backed HBA, as opposed to an slog, which really puts all my IO in one basket.  If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste.

-Ross
On Sep 25, 2009, at 9:14 AM, Ross Walker wrote:
> On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>> On Fri, 25 Sep 2009, Ross Walker wrote:
>>>
>>> As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much
>>
>> Surely this depends on the origin of the large sequential writes.  If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win.  If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping.
>>
>> If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second.  Since data is buffered in the Ethernet, TCP/IP, NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance.
>
> Specifically I was talking NFS, as that was what the OP was talking about, but yes, it does depend on the origin.  You also assume that NFS IO goes over only a single 1Gbe interface, when it could be over multiple 1Gbe interfaces, a 10Gbe interface, or even multiple 10Gbe interfaces.  You also assume the IO recorded in the ZIL is just the raw IO, when there is also metadata and possibly multiple transaction copies as well.
>
> Personally I still prefer to spread the ZIL across the pool and have a large NVRAM-backed HBA, as opposed to an slog, which really puts all my IO in one basket.  If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste.

Back of the envelope math says:
	10 Gbe = ~1 GByte/sec of I/O capacity

If the SSD can only sink 70 MByte/s, then you will need:
	int(1000/70) + 1 = 15 SSDs for the slog

For capacity, you need:
	1 GByte/sec * 30 sec = 30 GBytes

Ross's idea has merit, if the size of the NVRAM in the array is 30 GBytes or so.

Both of the above assume there is lots of memory in the server.  This is increasingly becoming easier to do as memory costs come down and you can physically fit 512 GBytes in a 4u server.  By default, the txg commit will occur when 1/8 of memory is used for writes.  For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers.

However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays.  So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit.  Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet.  [Cue for Neil to chime in... :-)]

 -- richard
On 26/09/2009, at 1:14 AM, Ross Walker wrote:

> By any chance do you have copies=2 set?

No, only 1.  So the double data going to the slog (as reported by iostat) is still confusing me, and clearly potentially causing significant harm to my performance.

> Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half, depending on how long it takes to flush):
>
> echo zfs_write_limit_override/W0t268435456 | mdb -kw

That's an interesting concept.  All data still appears to go via the slog device; however, under heavy load my response time for a new write is typically below 2s (a few outliers at about 3.5s) and a read (directory listing of a non-cached entry) is about 2s.

What will this do once it hits the limit?  Will streaming writes then be sent directly to a txg and streamed to the primary storage devices?  (That is what I would like to see happen.)

> As a side an slog device will not be too beneficial for large sequential writes, because they are throughput bound, not latency bound.  slog devices really help when you have lots of small sync writes.  A RAIDZ2 with the ZIL spread across it will provide much higher throughput than an SSD.  An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (though an L2ARC would be beneficial).
>
> Better workload analysis is really what it is about.

It seems that it doesn't matter what the workload is if the NFS pipe can sustain more continuous throughput than the slog chain can support.

I suppose some creative use of the logbias setting might assist this situation and force all potentially heavy writers directly to the primary storage.  This would, however, negate any benefit of having a fast, low latency device for those filesystems at the times when it is desirable (any large batch of small writes, for example).

Is there a way to have a dynamic, auto-logbias type setting depending on the transaction currently presented to the server, such that if it is clearly a large streaming write it gets treated as logbias=throughput, and if it is a small transaction it gets treated as logbias=latency?  (i.e. such that NFS transactions can be effectively treated as if they were local storage, while only slightly breaking the benefits of the txg scheduling)

On 26/09/2009, at 3:39 AM, Richard Elling wrote:

> Back of the envelope math says:
> 	10 Gbe = ~1 GByte/sec of I/O capacity
>
> If the SSD can only sink 70 MByte/s, then you will need:
> 	int(1000/70) + 1 = 15 SSDs for the slog
>
> For capacity, you need:
> 	1 GByte/sec * 30 sec = 30 GBytes
>
> Ross's idea has merit, if the size of the NVRAM in the array is 30 GBytes or so.

At this point, enter the fusionIO cards or similar devices.  Unfortunately there does not seem to be anything on the market with infinitely fast write capacity (memory speeds) that is also supported under OpenSolaris as a slog device.  I think this is precisely what I (and anybody running a general purpose NFS server) need for a general purpose slog device.

> Both of the above assume there is lots of memory in the server.  This is increasingly becoming easier to do as memory costs come down and you can physically fit 512 GBytes in a 4u server.  By default, the txg commit will occur when 1/8 of memory is used for writes.  For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers.
> However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays.  So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit.  Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet.  [Cue for Neil to chime in... :-)]

How does reducing the txg commit interval really help?  Will data no longer go via the slog once it is streaming to disk, or will all data still be pushed through the slog regardless?

For a predominantly NFS server purpose, it really looks like the slog has to outperform your main pool for continuous write speed, as well as offering instant response time, as the primary criteria.  That might as well be a fast SSD (or group of fast SSDs), or 15kRPM drives with some NVRAM in front of them.

Is there also a way to throttle synchronous writes to the slog device, much like the ZFS write throttling that is already implemented, so that there is a gap for new writers to enter when writing to the slog device?  (Or is this the norm, and it already includes slog writes?)

cheers,
James
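On the logbias idea and the txg-interval question above, the knobs being discussed would be exercised roughly as follows. A sketch only: "tank/home" is a stand-in dataset name, the logbias property requires a build recent enough to include it (it postdates b118), and the txg tunable names and defaults varied between builds, so verify them against your kernel before relying on this.

  # Send bulk/streaming datasets straight to the pool, keeping the slog
  # for latency-sensitive datasets (per-dataset and reversible):
  zfs set logbias=throughput tank/home
  zfs get logbias tank/home

  # Shorten the txg commit interval from the default 30s to, say, 5s
  # (runtime change via mdb; the variable name is build-dependent):
  echo zfs_txg_timeout/W0t5 | mdb -kw
  echo zfs_txg_timeout/D | mdb -k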
On Fri, Sep 25, 2009 at 5:24 PM, James Lever <j at jamver.id.au> wrote:
>
> On 26/09/2009, at 1:14 AM, Ross Walker wrote:
>
>> By any chance do you have copies=2 set?
>
> No, only 1.  So the double data going to the slog (as reported by iostat) is still confusing me, and clearly potentially causing significant harm to my performance.

Weird, then; I thought that would be an easy explanation.

>> Also, try setting zfs_write_limit_override equal to the size of the NVRAM cache (or half, depending on how long it takes to flush):
>>
>> echo zfs_write_limit_override/W0t268435456 | mdb -kw
>
> That's an interesting concept.  All data still appears to go via the slog device; however, under heavy load my response time for a new write is typically below 2s (a few outliers at about 3.5s) and a read (directory listing of a non-cached entry) is about 2s.
>
> What will this do once it hits the limit?  Will streaming writes then be sent directly to a txg and streamed to the primary storage devices?  (That is what I would like to see happen.)

It sets the maximum size of a txg to the given value.  When it hits that number, it flushes to disk.

>> As a side an slog device will not be too beneficial for large sequential writes, because they are throughput bound, not latency bound.  slog devices really help when you have lots of small sync writes.  A RAIDZ2 with the ZIL spread across it will provide much higher throughput than an SSD.  An example of a workload that benefits from an slog device is ESX over NFS, which does a COMMIT for each block written, so it benefits from an slog, but a standard media server will not (though an L2ARC would be beneficial).
>>
>> Better workload analysis is really what it is about.
>
> It seems that it doesn't matter what the workload is if the NFS pipe can sustain more continuous throughput than the slog chain can support.

Only on large sequentials; small sync IO should benefit from the slog.

> I suppose some creative use of the logbias setting might assist this situation and force all potentially heavy writers directly to the primary storage.  This would, however, negate any benefit of having a fast, low latency device for those filesystems at the times when it is desirable (any large batch of small writes, for example).
>
> Is there a way to have a dynamic, auto-logbias type setting depending on the transaction currently presented to the server, such that if it is clearly a large streaming write it gets treated as logbias=throughput, and if it is a small transaction it gets treated as logbias=latency?  (i.e. such that NFS transactions can be effectively treated as if they were local storage, while only slightly breaking the benefits of the txg scheduling)

I'll leave that to the Sun guys to answer.

-Ross
On Fri, Sep 25, 2009 at 1:39 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Sep 25, 2009, at 9:14 AM, Ross Walker wrote:
>> On Fri, Sep 25, 2009 at 11:34 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>>> On Fri, 25 Sep 2009, Ross Walker wrote:
>>>>
>>>> As a side an slog device will not be too beneficial for large sequential writes, because it will be throughput bound not latency bound. slog devices really help when you have lots of small sync writes. A RAIDZ2 with the ZIL spread across it will provide much
>>>
>>> Surely this depends on the origin of the large sequential writes.  If the origin is NFS and the SSD has considerably more sustained write bandwidth than the ethernet transfer bandwidth, then using the SSD is a win.  If the SSD accepts data slower than the ethernet can deliver it (which seems to be this particular case) then the SSD is not helping.
>>>
>>> If the ethernet can pass 100MB/second, then the sustained write specification for the SSD needs to be at least 100MB/second.  Since data is buffered in the Ethernet, TCP/IP, NFS stack prior to sending it to ZFS, the SSD should support write bursts of at least double that or else it will not be helping bulk-write performance.
>>
>> Specifically I was talking NFS, as that was what the OP was talking about, but yes, it does depend on the origin.  You also assume that NFS IO goes over only a single 1Gbe interface, when it could be over multiple 1Gbe interfaces, a 10Gbe interface, or even multiple 10Gbe interfaces.  You also assume the IO recorded in the ZIL is just the raw IO, when there is also metadata and possibly multiple transaction copies as well.
>>
>> Personally I still prefer to spread the ZIL across the pool and have a large NVRAM-backed HBA, as opposed to an slog, which really puts all my IO in one basket.  If I had a pure NVRAM device I might consider using that as an slog device, but SSDs are too variable for my taste.
>
> Back of the envelope math says:
> 	10 Gbe = ~1 GByte/sec of I/O capacity
>
> If the SSD can only sink 70 MByte/s, then you will need:
> 	int(1000/70) + 1 = 15 SSDs for the slog
>
> For capacity, you need:
> 	1 GByte/sec * 30 sec = 30 GBytes

Where did the 30 seconds come in here?  The amount of time to hold cache depends on how fast you can fill it.

> Ross's idea has merit, if the size of the NVRAM in the array is 30 GBytes or so.

I'm thinking you can do with less if you don't need to hold it for 30 seconds.

> Both of the above assume there is lots of memory in the server.  This is increasingly becoming easier to do as memory costs come down and you can physically fit 512 GBytes in a 4u server.  By default, the txg commit will occur when 1/8 of memory is used for writes.  For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers.
>
> However, most folks won't stomach 15 SSDs for slog or 30 GBytes of NVRAM in their arrays.  So Bob's recommendation of reducing the txg commit interval below 30 seconds also has merit.  Or, to put it another way, the dynamic sizing of the txg commit interval isn't quite perfect yet.  [Cue for Neil to chime in... :-)]

I'm sorry, did I miss something Bob said about the txg commit interval?  I looked back and didn't see it; maybe it was off-list?

-Ross
j at jamver.id.au said:
> For a predominantly NFS server purpose, it really looks like the slog has to outperform your main pool for continuous write speed, as well as offering instant response time, as the primary criteria.  That might as well be a fast SSD (or group of fast SSDs), or 15kRPM drives with some NVRAM in front of them.

I wonder if you ran Richard Elling's "zilstat" while running your workload.  That should tell you how much ZIL bandwidth is needed, and it would be interesting to see if its stats match with your other measurements of slog-device traffic.

I did some filebench and "tar extract over NFS" tests of a J4400 (500GB, 7200RPM SATA drives), with and without a slog, where the slog was using the internal 2.5" 10kRPM SAS drives in an X4150.  These drives were behind the standard Sun/Adaptec internal RAID controller, 256MB battery-backed cache memory, all on Solaris-10U7.

We saw slight differences on the filebench oltp profile, and a huge speedup for the "tar extract over NFS" tests with the slog present.  Granted, the latter was with only one NFS client, so it likely did not fill NVRAM.  Pretty good results for a poor-person's slog, though:
	http://acc.ohsu.edu/~hakansom/j4400_bench.html

Just as an aside, and based on my experience as a user/admin of various NFS-server vendors, the old Prestoserve cards and NetApp filers seem to get very good improvements with relatively small amounts of NVRAM (128K, 1MB, 256MB, etc.).  None of the filers I've seen have ever had tens of GB of NVRAM.

Regards,
Marion
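For anyone wanting to follow that suggestion, zilstat is a DTrace-based ksh script published by Richard Elling rather than part of the base OS, so it has to be fetched separately; the invocation below is only a sketch of typical usage and the exact options may differ between versions of the script.

  # Run as root while the NFS copy test is in progress; it reports the
  # bytes handed to the ZIL per interval, which can then be compared
  # against iostat's kw/s figures for the slog device.
  ./zilstat.ksh 10 6      # 10-second intervals, 6 samples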
On Fri, Sep 25, 2009 at 5:47 PM, Marion Hakanson <hakansom at ohsu.edu> wrote:
> j at jamver.id.au said:
>> For a predominantly NFS server purpose, it really looks like the slog has to outperform your main pool for continuous write speed, as well as offering instant response time, as the primary criteria.  That might as well be a fast SSD (or group of fast SSDs), or 15kRPM drives with some NVRAM in front of them.
>
> I wonder if you ran Richard Elling's "zilstat" while running your workload.  That should tell you how much ZIL bandwidth is needed, and it would be interesting to see if its stats match with your other measurements of slog-device traffic.

Yes, but if it's on NFS you can just figure out the workload in MB/s and use that as a rough guideline.

The problem is that most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput.

> I did some filebench and "tar extract over NFS" tests of a J4400 (500GB, 7200RPM SATA drives), with and without a slog, where the slog was using the internal 2.5" 10kRPM SAS drives in an X4150.  These drives were behind the standard Sun/Adaptec internal RAID controller, 256MB battery-backed cache memory, all on Solaris-10U7.
>
> We saw slight differences on the filebench oltp profile, and a huge speedup for the "tar extract over NFS" tests with the slog present.  Granted, the latter was with only one NFS client, so it likely did not fill NVRAM.  Pretty good results for a poor-person's slog, though:
> 	http://acc.ohsu.edu/~hakansom/j4400_bench.html

I did a similar test with a 512MB BBU controller and saw no difference with or without the SSD slog, so I didn't end up using it.

Does your BBU controller ignore the ZFS flushes?

> Just as an aside, and based on my experience as a user/admin of various NFS-server vendors, the old Prestoserve cards and NetApp filers seem to get very good improvements with relatively small amounts of NVRAM (128K, 1MB, 256MB, etc.).  None of the filers I've seen have ever had tens of GB of NVRAM.

They don't hold on to the cache for a long time, just as long as it takes to write it all to disk.

-Ross
On Fri, 25 Sep 2009, Richard Elling wrote:
> By default, the txg commit will occur when 1/8 of memory is used for writes.  For 30 GBytes, that would mean a main memory of only 240 GBytes... feasible for modern servers.

Ahem.  We were advised that 7/8s of memory is currently what is allowed for writes.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Fri, 25 Sep 2009, Ross Walker wrote:
> The problem is that most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput.

Who said that the slog SSD is written to in 128K chunks?  That seems wrong to me.  Previously we were advised that the slog is basically a log of uncommitted system calls, so the size of the data chunks written to the slog should be similar to the data sizes in the system calls.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
rswwalker at gmail.com said:
> Yes, but if it's on NFS you can just figure out the workload in MB/s and use that as a rough guideline.

I wonder if that's the case.  We have an NFS server without NVRAM cache (X4500), and it gets huge MB/sec throughput on large-file writes over NFS.  But it's painfully slow on the "tar extract lots of small files" test, where many tiny, synchronous metadata operations are performed.

> I did a similar test with a 512MB BBU controller and saw no difference with or without the SSD slog, so I didn't end up using it.
>
> Does your BBU controller ignore the ZFS flushes?

I believe it does (it would be slow otherwise).  It's the Sun StorageTek internal SAS RAID HBA.

Regards,
Marion
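If there is any doubt about whether a particular controller honours or ignores the ZFS cache-flush requests, the workaround usually mentioned for genuinely non-volatile write caches is the zfs_nocacheflush tunable. A sketch only, and a blunt one: it disables flushes globally, so it is only safe when every device in every pool sits behind battery-backed or otherwise non-volatile cache.

  # Check the current setting (0 = cache flushes are sent, the default):
  echo zfs_nocacheflush/D | mdb -k

  # Runtime change (reverts at reboot):
  echo zfs_nocacheflush/W1 | mdb -kw

  # Persistent form in /etc/system:
  #   set zfs:zfs_nocacheflush=1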
On Sep 25, 2009, at 6:19 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Fri, 25 Sep 2009, Ross Walker wrote:
>
>> The problem is that most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput.
>
> Who said that the slog SSD is written to in 128K chunks?  That seems wrong to me.  Previously we were advised that the slog is basically a log of uncommitted system calls, so the size of the data chunks written to the slog should be similar to the data sizes in the system calls.

Are these not broken into recordsize chunks?

-Ross
On 09/25/09 16:19, Bob Friesenhahn wrote:
> On Fri, 25 Sep 2009, Ross Walker wrote:
>
>> The problem is that most SSD manufacturers list sustained throughput with large IO sizes, say 4MB, and not 128K, so it is tricky buying a good SSD that can handle the throughput.
>
> Who said that the slog SSD is written to in 128K chunks?  That seems wrong to me.  Previously we were advised that the slog is basically a log of uncommitted system calls, so the size of the data chunks written to the slog should be similar to the data sizes in the system calls.

Log blocks are variable in size, depending on what needs to be committed.  The minimum size is 4KB and the maximum is 128KB.  Log records are aggregated and written together as much as possible.

Neil.