So, I'm still beating my head against the wall trying to find our performance bottleneck with NFS on our Thors.

We've got two Intel SSDs as ZIL devices. Cache flushing is still enabled, as are the write caches on all 48 disk devices.

What I'm thinking of doing is disabling all the write caches, and disabling cache flushing.

What would this mean for the safety of data in the pool? And would this even do anything to address the performance issue?

-Greg
If all write caches are truly disabled, then disabling the cache flush won't affect the safety of your data. It will change your performance characteristics, almost certainly for the worse.

-- This message posted from opensolaris.org
Multiple Thors (more than 2?), with performance problems. Maybe it's the common denominator - the network.

Can you run local ZFS IO loads and determine if performance is expected when NFS and the network are out of the picture?

Thanks,
/jim

Greg Mason wrote:
> So, I'm still beating my head against the wall, trying to find our
> performance bottleneck with NFS on our Thors.
>
> We've got two Intel SSDs as ZIL devices. Cache flushing is still
> enabled, as are the write caches on all 48 disk devices.
>
> What I'm thinking of doing is disabling all write caches, and disabling
> the cache flushing.
>
> What would this mean for the safety of data in the pool?
>
> And, would this even do anything to address the performance issue?
>
> -Greg
The funny thing is that I'm showing a performance improvement over write caches + cache flushes.

The only way these pools are being accessed is over NFS - well, at least the only way I care about when it comes to high performance. I'm pretty sure it would mean a performance hit locally, but I don't care about local disk performance, only about performance over NFS.

Anton B. Rang wrote:
> If all write caches are truly disabled, then disabling the cache flush won't affect the safety of your data.
>
> It will change your performance characteristics, almost certainly for the worse.
This problem only manifests itself when dealing with many small files over NFS. There is no throughput problem with the network.

I've run tests with the write cache disabled on all disks and the cache flush disabled. I'm using two Intel SSDs as ZIL devices. This setup is faster than using the two Intel SSDs with write caches enabled on all disks and with the cache flush enabled.

My test used to run around 3.5 to 4 minutes; now it completes in about 2.5 minutes. I still think this is a bit slow, but I still have quite a bit of testing to perform. I'll keep the list updated with my findings.

I've already established, both via this list and through other research, that ZFS has performance issues over NFS when dealing with many small files. This may be an issue with NFS itself, where NVRAM-backed storage is needed for decent performance with small files. Typically such an NVRAM cache is supplied by a hardware RAID controller in a disk shelf.

I find it very hard to explain to a user why an "upgrade" is a step down in performance. For the users these Thors are going to serve, such a drastic performance hit is a deal breaker...

I've done my homework on this issue: I've ruled out the network as an issue, as well as the NFS clients. I've narrowed my particular performance issue down to the ZIL, and how well ZFS plays with NFS.

-Greg

Jim Mauro wrote:
> Multiple Thors (more than 2?), with performance problems.
> Maybe it's the common denominator - the network.
>
> Can you run local ZFS IO loads and determine if performance
> is expected when NFS and the network are out of the picture?
>
> Thanks,
> /jim
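[For readers wanting to reproduce the shape of this workload: Greg's actual test isn't shown, so the mount point and file count below are made-up placeholders. A minimal NFS-client sketch - every create and close forces synchronous work on the server, so elapsed time tracks ZIL latency rather than raw throughput.]

    # run on an NFS client against the Thor export (hypothetical mount point)
    cd /mnt/thor/smallfile-test || exit 1
    time sh -c '
        i=0
        while [ $i -lt 10000 ]; do
            printf "x" > file.$i     # create + write + close of a tiny file
            i=$((i+1))
        done
    '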
On 01/29/09 21:32, Greg Mason wrote:
> This problem only manifests itself when dealing with many small files
> over NFS. There is no throughput problem with the network.
<snip>
> I find it very hard to explain to a user why an "upgrade" is a step down
> in performance. For the users these Thors are going to serve, such a
> drastic performance hit is a deal breaker...

Perhaps I missed something, but what was your previous setup? I.e. what did you upgrade from?

Neil.
A Linux NFS file server, with a few terabytes of fibre-attached disk, using XFS.

I'm trying to get these Thors to perform at least as well as the current setup. A performance hit is very hard to explain to our users.

> Perhaps I missed something, but what was your previous setup?
> I.e. what did you upgrade from?
> Neil.
On Fri, Jan 30, 2009 at 8:24 AM, Greg Mason <gmason at msu.edu> wrote:
> A Linux NFS file server, with a few terabytes of fibre-attached disk,
> using XFS.
>
> I'm trying to get these Thors to perform at least as well as the current
> setup. A performance hit is very hard to explain to our users.

What type of spindles were in the FC attached disk?

--Tim
I should also add that this "creating many small files" issue is the ONLY case where the Thors are performing poorly, which is why I'm focusing on it.

Greg Mason wrote:
> A Linux NFS file server, with a few terabytes of fibre-attached disk,
> using XFS.
>
> I'm trying to get these Thors to perform at least as well as the current
> setup. A performance hit is very hard to explain to our users.
>
>> Perhaps I missed something, but what was your previous setup?
>> I.e. what did you upgrade from?
>> Neil.
7200 RPM SATA disks.

Tim wrote:
> On Fri, Jan 30, 2009 at 8:24 AM, Greg Mason <gmason at msu.edu> wrote:
>
>     A Linux NFS file server, with a few terabytes of fibre-attached disk,
>     using XFS.
>
>     I'm trying to get these Thors to perform at least as well as the current
>     setup. A performance hit is very hard to explain to our users.
>
> What type of spindles were in the FC attached disk?
>
> --Tim
> This problem only manifests itself when dealing with many small files
> over NFS. There is no throughput problem with the network.

But there could be a _latency_ issue with the network.

[snip]

> I've done my homework on this issue, I've ruled out the network as an
> issue, as well as the NFS clients. I've narrowed my particular
> performance issue down to the ZIL, and how well ZFS plays with NFS.

Great. Good luck.

/jim
Jim Mauro wrote:
>> This problem only manifests itself when dealing with many small files
>> over NFS. There is no throughput problem with the network.
>
> But there could be a _latency_ issue with the network.

If there was a latency issue, we would see such a problem with our existing file server as well, which we do not. We'd also have much greater problems than just file server performance.

So, like I've said, we've ruled out the network as an issue.
> If there was a latency issue, we would see such a problem with our
> existing file server as well, which we do not. We'd also have much
> greater problems than just file server performance.
>
> So, like I've said, we've ruled out the network as an issue.

I should also add that I've tested these Thors with the ZIL disabled, and they scream! With the cache flush disabled, they also do quite well.

The specific issue I'm trying to solve is the ZIL being slow when using NFS.

I really don't want to have to do something drastic like disabling the ZIL to get the performance I need...
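[For anyone wanting to repeat that kind of test: on builds of this vintage the usual knob is the zil_disable tunable. A sketch only, not a recommendation - it applies to every dataset on the host and only takes effect for filesystems mounted after it is set.]

    # live, until next reboot -- remount or re-share the filesystems afterwards:
    echo zil_disable/W0t1 | mdb -kw

    # persistent -- add to /etc/system (where comment lines start with "*"):
    set zfs:zil_disable = 1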
You have SSDs for the ZIL (logzilla) enabled, and ZIL IO is what is hurting your performance... Hmmm....

I'll ask the stupid question (just to get it out of the way) - is it possible that the logzilla is undersized?

Did you gather data using Richard Elling's zilstat (included below)?

Thanks,
/jim

#! /usr/bin/ksh -p
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License (the "License").
# You may not use this file except in compliance with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# CDDL HEADER END
# Portions Copyright 2009 Sun Microsystems, Inc.
#
# File: zilstat.d
# Author: Richard.Elling at sun.com
#
# This dtrace program will help identify the ZIL activity by sampling
# writes sent to the ZIL.
# output:
#   [TIME]
#   BYTES       - total bytes written to ZIL over the interval
#   BYTES/S     - bytes/s written to ZIL over the interval
#   MAX-BYTES/S - maximum rate during any 1-second sample
#

##############################
# --- Process Arguments ---
#
# TODO: clean up args

### default variables
opt_pool=0
opt_time=0
filter=0
pool=
lines=-1
interval=1
count=-1

### process options
while getopts hl:p:t name
do
    case $name in
    l)  lines=$OPTARG ;;
    p)  opt_pool=1; pool=$OPTARG ;;
    t)  opt_time=1 ;;
    h|?) ME=$(basename $0)
        cat <<-END >&2
    USAGE: $ME [-t] [-l linecount] [-p poolname] [interval [count]]
        -t              # print timestamp
        -l linecount    # print header every linecount lines (default=only once)
        -p poolname     # only look at poolname
    examples:
        $ME             # default output, 1 second samples
        $ME 10          # 10 second samples
        $ME 10 6        # print 6 x 10 second samples
        $ME -p rpool    # show ZIL stats for rpool only
    output:
        [TIME]
        BYTES       - total bytes written to ZIL over the interval
        BYTES/S     - bytes/s written to ZIL over the interval
        MAX-BYTES/S - maximum rate during any 1-second sample
END
        exit 1
    esac
done
shift $(( $OPTIND - 1 ))

### option logic
if [[ "$1" > 0 ]]; then
    interval=$1; shift
fi
if [[ "$1" > 0 ]]; then
    count=$1; shift
fi
if (( opt_pool )); then
    filter=1
fi

##############################
# --- Main Program, DTrace ---
/usr/sbin/dtrace -n '
#pragma D option quiet
inline int OPT_time = '$opt_time';
inline int OPT_pool = '$opt_pool';
inline int INTERVAL = '$interval';
inline int LINES = '$lines';
inline int COUNTER = '$count';
inline int FILTER = '$filter';
inline string POOL = "'$pool'";

dtrace:::BEGIN
{
    /* starting values */
    counts = COUNTER;
    secs = INTERVAL;
    line = 0;
    last_event[""] = 0;
    nused = 0;
    max_per_sec = 0;
    nused_per_sec = 0;
}

/*
 * collect info when zil_lwb_write_start fires
 */
fbt::zil_lwb_write_start:entry
/OPT_pool == 0 || POOL == args[0]->zl_dmu_pool->dp_spa->spa_name/
{
    nused += args[1]->lwb_nused;
    nused_per_sec += args[1]->lwb_nused;
}

/*
 * Timer
 */
profile:::tick-1sec
{
    secs--;
    nused_per_sec > max_per_sec ? max_per_sec = nused_per_sec : 1;
    nused_per_sec = 0;
}

/*
 * Print header
 */
profile:::tick-1sec
/line == 0/
{
    /* print optional headers */
    OPT_time ? printf("%-20s ", "TIME") : 1;
    /* print header */
    printf("%10s %10s %10s\n", "BYTES", "BYTES/S", "MAX-BYTES/S");
    line = LINES;
}

/*
 * Print Output
 */
profile:::tick-1sec
/secs == 0/
{
    OPT_time ? printf("%-20Y ", walltimestamp) : 1;
    printf("%10d %10d %10d\n", nused, nused/INTERVAL, max_per_sec);
    nused = 0;
    nused_per_sec = 0;
    max_per_sec = 0;
    secs = INTERVAL;
    counts--;
    line--;
}

/*
 * End of program
 */
profile:::tick-1sec
/counts == 0/
{
    exit(0);
}
'

Greg Mason wrote:
>> If there was a latency issue, we would see such a problem with our
>> existing file server as well, which we do not. We'd also have much
>> greater problems than just file server performance.
>>
>> So, like I've said, we've ruled out the network as an issue.
>
> I should also add that I've tested these Thors with the ZIL disabled,
> and they scream! With the cache flush disabled, they also do quite well.
>
> The specific issue I'm trying to solve is the ZIL being slow when
> using NFS.
>
> I really don't want to have to do something drastic like disabling the
> ZIL to get the performance I need...
I'll give this script a shot a little bit later today.

For ZIL sizing, I'm using either 1 or 2 32G Intel X25-E SSDs in my tests, which, according to what I've read, is 2-4 times larger than the maximum that ZFS can possibly use. We've got 32G of system memory in these Thors, and (if I'm not mistaken) the maximum amount of in-play data can be 16G, 1/2 the system memory.

Also, because I know people will be asking, has anybody ever tried to recover from something like a system crash with a ZFS pool that has the ZIL disabled? What kind of nightmares would I be facing in such a situation? Would I simply risk losing that in-play data, or could more serious things happen? I know disabling the ZIL is an Extremely Bad Idea, but I need to tell people exactly why...

-Greg

Jim Mauro wrote:
> You have SSDs for the ZIL (logzilla) enabled, and ZIL IO
> is what is hurting your performance... Hmmm....
>
> I'll ask the stupid question (just to get it out of the way) - is
> it possible that the logzilla is undersized?
>
> Did you gather data using Richard Elling's zilstat (included below)?
>
> Thanks,
> /jim
<snip>
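[As a sanity check on that sizing claim, the rule of thumb at the time was that a slog only needs to hold the sync writes that can arrive between transaction group commits. The numbers below are illustrative assumptions, not measurements from Greg's Thors.]

    # assume ~10 s between txg commits and NFS ingest bounded by one GbE link (~120 MB/s)
    echo "120 * 10" | bc     # => 1200 MB, i.e. roughly 1.2 GB in flight, worst case
    # even several bonded GbE links stay far below one 32 GB X25-E, so slog capacity
    # is unlikely to be the bottleneck; per-commit latency is the thing to chase.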
On Fri, 30 Jan 2009, Greg Mason wrote:
> A Linux NFS file server, with a few terabytes of fibre-attached disk,
> using XFS.
>
> I'm trying to get these Thors to perform at least as well as the current
> setup. A performance hit is very hard to explain to our users.

I have heard that Linux NFS service does not perform synchronous writes to disk (in violation of the NFS specification). If that is true, then it would make a huge difference to write performance.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
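[One quick way to check Bob's theory on the old Linux box, if it is still around, is to list the effective export options; the path and client range shown are placeholders. An "async" export acknowledges writes before they reach stable storage, which is fast but gives up exactly the durability the Thors' ZIL is paying for.]

    # on the Linux server
    exportfs -v
    # /export/home   192.168.1.0/24(rw,async,wdelay,root_squash)   <- async would explain the speed
    # /export/home   192.168.1.0/24(rw,sync,wdelay,root_squash)    <- sync is the spec-compliant mode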
Jim Mauro wrote:
> You have SSDs for the ZIL (logzilla) enabled, and ZIL IO
> is what is hurting your performance... Hmmm....
>
> I'll ask the stupid question (just to get it out of the way) - is
> it possible that the logzilla is undersized?
>
> Did you gather data using Richard Elling's zilstat (included below)?

Jim stole a little bit of my thunder :-) To get him back, I've now blogged about my updated (better) version. The goal is to try to answer the question, "will adding a low-latency separate log device improve the performance of my workload?" To answer it, we look at the amount of work asked of the ZIL.

http://richardelling.blogspot.com/2009/01/zilstat.html

Enjoy!
-- richard
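[A usage sketch based on the option summary in the script Jim posted: run it on the Thor while the small-file test executes on a client, and compare BYTES/S against what a single X25-E can commit. The pool name "tank" is a placeholder.]

    # six 10-second samples, with timestamps, while the NFS test runs
    ./zilstat -t 10 6

    # restrict sampling to one pool
    ./zilstat -p tank 10 6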
Sounds like the device is not ignoring the cache flush requests sent down by ZFS/zil commit. If the SSD is able to drain its internal buffer to flash on a power outage, then it needs to ignore the cache flush. You can do this on a per-device basis. It's kludgy tuning, but I hope the instructions in the Evil Tuning Guide will be enough.

-r

On Jan 30, 2009, at 16:03, Greg Mason wrote:

>> If there was a latency issue, we would see such a problem with our
>> existing file server as well, which we do not. We'd also have much
>> greater problems than just file server performance.
>>
>> So, like I've said, we've ruled out the network as an issue.
>
> I should also add that I've tested these Thors with the ZIL disabled,
> and they scream! With the cache flush disabled, they also do quite
> well.
>
> The specific issue I'm trying to solve is the ZIL being slow when
> using NFS.
>
> I really don't want to have to do something drastic like disabling the
> ZIL to get the performance I need...
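[For reference, a sketch of the two knobs the Evil Tuning Guide describes; the exact sd.conf property name and the padding of the vendor/product string should be double-checked against the guide for your build, and the global tunable is only safe when every cache in the pool is nonvolatile or disabled.]

    * /etc/system -- global: stop ZFS from issuing cache-flush requests at all
    set zfs:zfs_nocacheflush = 1

    # /kernel/drv/sd.conf -- per-device: mark the SSD's cache nonvolatile so only its
    # flushes are ignored (vendor/product ID shown is illustrative, padding matters)
    sd-config-list = "ATA     INTEL SSDSA2SH03", "cache-nonvolatile:true";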
>>>>> "gm" == Greg Mason <gmason at msu.edu> writes: >>>>> "g" == Gary Mills <mills at cc.umanitoba.ca> writes:gm> I know disabling the ZIL is an Extremely Bad Idea, but maybe you don''t care about trashed thunderbird databases. You just don''t want to lose the whole pool to ``status: The pool metadata is corrupted and cannot be opened. / action: Destroy the pool and restore from backup.'''' I''ve no answer for that---maybe someone else? The known problem with ZIL disabling, AIUI, is that it breaks the statelessness of NFS. If the server reboots and the NFS clients do not, then assumptions on which the NFS protocol is built could be broken, and files could get corrupted. Behind this dire warning is an expectation I''m not sure everyone shares: if the NFS server reboots, and the clients do not, then (modulo bugs) no data is lost---once the clients unfreeze, it''s like nothing ever happened. I don''t think other file sharing protocols like SMB or AFP attempt to keep that promise, so maybe people are being warned about something most assumed would happen anyway. will disabling the ZIL make NFS corrupt files worse than SMB or AFP would when the server reboots? not sure---at least SMB or AFP _should_ give an error to the userland when the server reboots, sort of like NFS ''hard,intr'' when you press ^C, so applications using sqllite or berkeleydb or whatever can catch that error and perform their own user-level recovery, and if they call fsync() and get success they can trust it absolutely no matter server or client reboots. while the ZIL-less NFS problems would probably be more silent, more analagous to the ZFS-iSCSI problems except one layer higher in the stack so programs think they''ve written to these .db files but they haven''t, and blindly scribble on, not knowing that a batch of writes in the past was silently discarded. In practice everyone always says to run filemaker or Mail.app or Thunderbird or anything with database files on ``a local disk'''' only, so I think the SMB and AFP error paths are not working right either and the actual expectation is very low. g> Consider a file server running ZFS that exports a volume with g> Iscsi. Consider also an application server that imports the g> LUN with Iscsi and runs a ZFS filesystem on that LUN. I was pretty sure there was a bug for the iscsitadm target ignoring SYNCHRONIZE_CACHE, but I cannot find the bug number now and may be wrong. Also there is a separate problem with remote storage and filesystems highly dependent on SYNCHRONIZE_CACHE. Even if not for the bug I can''t find, remote storage adds a failure case. Normally you have three main cases to handle: SYNCHRONIZE CACHE returns success after some delay SYNCHRONIZE CACHE never returns because someone yanked the cord---the whole system goes down. You deal with it at boot, when mounting the filesystem. SYNCHRONIZE CACHE never returns because a drive went bad. iSCSI adds a fourth: SYNCHRONIZE CACHE returns success SYNCHRONIZE CACHE returns success SYNCHRONIZE CACHE returns failure SYNCHRONIZE CACHE returns success I think ZFS probably does not understand this case. The others are easier, because either you have enough raidz/mirror redundancy, or else you are allowed handle the ``returns failure'''' by implicitly unmounting the filesystem and killing everything that held an open file. NFS works around this with the COMMIT op and client-driven replay in v3, or by making everything synchronous in v2. 
iSCSI is _not_ v2-like because, even if there is no write caching in the initiator/target (there probably ought to be), if the underlying physical disk in the target has a write cache, still the entire target chassis can reboot and lose the contents of that cache. And I suspect iSCSI is not using NFS-v3-like workarounds right now. I think this hole is probably still open. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090202/aef2ce9a/attachment.bin>
Hello Miles,

Monday, February 2, 2009, 7:20:49 PM, you wrote:

>>>>>> "gm" == Greg Mason <gmason at msu.edu> writes:
>>>>>> "g" == Gary Mills <mills at cc.umanitoba.ca> writes:

MN>     gm> I know disabling the ZIL is an Extremely Bad Idea,

MN> but maybe you don't care about trashed thunderbird databases. You
MN> just don't want to lose the whole pool to ``status: The pool metadata
MN> is corrupted and cannot be opened. / action: Destroy the pool and
MN> restore from backup.'' I've no answer for that---maybe someone else?

It will not cause the above. Disabling the ZIL has nothing to do with pool consistency.

--
Best regards,
Robert Milkowski
http://milek.blogspot.com