I'm in the planning stages of a rather large ZFS system to house approximately 1 PB of data.

I have only one system with SSDs for L2ARC and ZIL, and the ZIL seems to be the bottleneck for large bursts of data being written. I can't confirm this for sure, but when I throw enough data at my storage pool that write latency starts rising, the ZIL write speed hangs close to the max sustained throughput I've measured on the SSD (~200 MB/s).

The pool, when empty and without L2ARC or ZIL, was tested with Bonnie++ and showed ~1300 MB/s serial read and ~800 MB/s serial write speed.

How can I determine for sure that my ZIL is my bottleneck? If it is the bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.?

Thanks for any input,
-Chip
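For a first check, the device-level statistics already on the system can show whether the log device is pegged during a burst. A minimal sketch, assuming an illumos-era box and a pool named tank (the pool name is a placeholder, not from the post):

    # Per-vdev throughput; the slog appears under its own "logs" section, so
    # it is easy to see when it sits near its ~200 MB/s ceiling while the
    # data vdevs stay mostly idle.
    zpool iostat -v tank 5

    # Cross-check service times and %busy on the SSD itself.
    iostat -xn 5

If the log device shows sustained writes near its measured ceiling while pool write latency climbs, that points at the slog rather than the data vdevs.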
I found something similar happening when writing over NFS (at significantly lower throughput than available on the system directly), specifically that effectively all data, even asynchronous writes, were being written to the ZIL, which I eventually traced (with help from Richard Elling and others on this list) at least partially to the Linux NFS client issuing commit requests before ZFS wanted to write the asynchronous data to a txg. I tried fiddling with zfs_write_limit_override to get more data onto normal vdevs faster, but this reduced performance (perhaps setting a tunable to make ZFS not throttle writes while hitting the write limit could fix that), and it didn't cause it to go significantly easier on the ZIL devices. I decided to live with the default behavior, since my main bottleneck is Ethernet anyway, and the projected lifespan of the ZIL devices was fairly large due to our workload.

I did find that setting logbias=throughput on a ZFS filesystem caused it to act as though the ZIL devices weren't there, which actually reduced commit times under continuous streaming writes. This is mostly due to having more throughput for the same amount of data to commit, in large chunks, but the zilstat script also reported less writing to ZIL blocks (which are allocated from normal vdevs when there is no log device, or with logbias=throughput) under this condition, so perhaps there is more to the story. If you have different workloads for different datasets, this could help, since it isn't a pool-wide setting. Obviously, small synchronous writes to that ZFS filesystem will take a large hit from this setting.

It would be nice if there was a feature in ZFS that could direct small commits to ZIL blocks on log devices, but behave like logbias=throughput for large commits. It would probably need manual tuning, but it would treat SSD log devices more gently, and increase performance for large contiguous writes.

If you can't configure ZFS to write less data to the ZIL, I think a RAM-based ZIL device would be a good way to get throughput up higher (with fewer worries about flash endurance, etc.).

Tim

On Wed, Oct 3, 2012 at 1:28 PM, Schweiss, Chip <chip at innovates.com> wrote:
> How can I determine for sure that my ZIL is my bottleneck? If it is the
> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL
> to make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.?
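A sketch of how that per-dataset setting might be applied, using tank/streaming and tank/vms as stand-in dataset names (logbias is a standard ZFS property; the dataset names are assumptions):

    # Large streaming commits for this dataset bypass the slog and go to the
    # main pool vdevs.
    zfs set logbias=throughput tank/streaming

    # Latency-sensitive datasets keep the default and continue to use the slog.
    zfs set logbias=latency tank/vms

    # Confirm the settings.
    zfs get logbias tank/streaming tank/vms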
To answer your questions more directly: zilstat is what I used to check what the ZIL was doing: http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

While I have added a mirrored log device, I haven't tried adding multiple sets of mirrored log devices, but I think it should work. I believe that a failed unmirrored log device is only a problem if the pool is ungracefully closed before ZFS notices that the log device failed (i.e., simultaneous power failure and log device failure), so mirroring them may not be required.

Tim

On Wed, Oct 3, 2012 at 2:54 PM, Timothy Coalson <tsc5yc at mst.edu> wrote:
> I found something similar happening when writing over NFS (at
> significantly lower throughput than available on the system directly),
> specifically that effectively all data, even asynchronous writes, were
> being written to the ZIL ...
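A sketch of both suggestions, assuming the zilstat script has been downloaded to the current directory and that c4t2d0/c4t3d0 are spare SSDs (the invocation, pool name, and device names are assumptions, not from the thread):

    # Report ZIL activity (bytes and ops going to ZIL blocks) every second.
    ./zilstat 1

    # Add a second mirrored SSD pair as another log vdev; the ZIL then
    # spreads its block allocations across both log mirrors.
    zpool add tank log mirror c4t2d0 c4t3d0
    zpool status tank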
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 01:33 UTC
[zfs-discuss] Making ZIL faster
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Schweiss, Chip
>
> How can I determine for sure that my ZIL is my bottleneck? If it is the
> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
> make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.

Temporarily set sync=disabled

Or, depending on your application, leave it that way permanently. I know, for the work I do, most systems I support at most locations have sync=disabled. It all depends on the workload.
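A sketch of that test on a single dataset, with tank/nfsshare as a placeholder name (sync is a standard ZFS property; the dataset name is an assumption):

    # Bypass the ZIL entirely for this dataset, rerun the burst workload,
    # and compare throughput.
    zfs set sync=disabled tank/nfsshare

    # Restore normal synchronous-write semantics afterwards.
    zfs set sync=standard tank/nfsshare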
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Schweiss, Chip
>>
>> How can I determine for sure that my ZIL is my bottleneck? If it is the
>> bottleneck, is it possible to keep adding mirrored pairs of SSDs to the ZIL to
>> make it faster? Or should I be looking for a DDR drive, ZeusRAM, etc.
>
> Temporarily set sync=disabled
> Or, depending on your application, leave it that way permanently. I know, for
> the work I do, most systems I support at most locations have sync=disabled.
> It all depends on the workload.

Noting of course that this means that in the case of an unexpected system outage or loss of connectivity to the disks, synchronous writes since the last txg commit will be lost, even though the applications will believe they are secured to disk. (The ZFS filesystem won't be corrupted, but it will look like it's been wound back by up to 30 seconds when you reboot.)

This is fine for some workloads, such as those where you would start again with fresh data and those which can look closely at the data to see how far they got before being rudely interrupted, but not for those which rely on the POSIX semantics of synchronous writes/syncs meaning data is secured on non-volatile storage when the function returns.

--
Andrew
Thanks for all the input. It seems information on the performance of the ZIL is sparse and scattered. I've spent significant time researching this the past day. I'll summarize what I've found. Please correct me if I'm wrong.

- The ZIL can have any number of SSDs attached, either mirrored or individually. ZFS will stripe across these in a raid0 or raid10 fashion depending on how you configure them.

- To determine the true maximum streaming performance of the ZIL, setting sync=disabled will only use the in-RAM ZIL. This gives up power protection for synchronous writes.

- Many SSDs do not help protect against power failure because they have their own RAM cache for writes. This effectively makes the SSD useless for this purpose and potentially introduces a false sense of security. (These SSDs are fine for L2ARC.)

- Mirroring SSDs is only helpful if one SSD fails at the time of a power failure. This leaves several unanswered questions: How good is ZFS at detecting that an SSD is no longer a reliable write target? The chance of silent data corruption is well documented for spinning disks; what chance of data corruption does this introduce, with up to 10 seconds of data written on the SSD? Does ZFS read the ZIL during a scrub to determine if our SSD is returning what we write to it?

- Zpool versions 19 and higher should be able to survive a ZIL failure, only losing the uncommitted data. However, I haven't seen good enough information that I would necessarily trust this yet.

- Several threads seem to suggest a ZIL throughput limit of 1 Gb/s with SSDs. I'm not sure if that is current, but I can't find any reports of better performance. I would suspect that a DDR drive or Zeus RAM as ZIL would push past this.

Anyone care to post their performance numbers on current hardware with E5 processors and RAM-based ZIL solutions?

Thanks to everyone who has responded and contacted me directly on this issue.

-Chip

On Thu, Oct 4, 2012 at 3:03 AM, Andrew Gabriel <andrew.gabriel at cucumber.demon.co.uk> wrote:
> Noting of course that this means that in the case of an unexpected system
> outage or loss of connectivity to the disks, synchronous writes since the
> last txg commit will be lost, even though the applications will believe
> they are secured to disk. ...
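One of those bullets (ZIL failure survival at pool version 19 and later) can be checked directly on the running pool. A sketch, again using tank as a placeholder pool name:

    # The claim in the thread applies to pool version 19 and later; check
    # what this pool is running and what each version adds.
    zpool get version tank
    zpool upgrade -v

    # Show how the log vdevs are currently laid out (mirrored or not).
    zpool status tank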
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 11:40 UTC
[zfs-discuss] Making ZIL faster
> From: Andrew Gabriel [mailto:andrew.gabriel at cucumber.demon.co.uk]
>
> > Temporarily set sync=disabled
> > Or, depending on your application, leave it that way permanently. I know,
> > for the work I do, most systems I support at most locations have
> > sync=disabled. It all depends on the workload.
>
> Noting of course that this means that in the case of an unexpected system
> outage or loss of connectivity to the disks, synchronous writes since the last
> txg commit will be lost, even though the applications will believe they are
> secured to disk. (ZFS filesystem won't be corrupted, but it will look like it's
> been wound back by up to 30 seconds when you reboot.)
>
> This is fine for some workloads, such as those where you would start again
> with fresh data

It's fine for any load where you don't have clients keeping track of your state.

Examples where it's not fine:

You're processing credit card transactions. You just finished processing a transaction, the system crashes, and you forget about it. Not fine, because systems external to yourself are aware of state that is in the future of your state, and you aren't aware of it.

You're an NFS server. Some clients write some files, you say they're written, you crash, and forget about it. Now you reboot, start serving NFS again, and the client still has a file handle for something it thinks exists ... but according to you in your new state, it doesn't exist.

You're a database server, and your clients are external to yourself. They do transactions against you, you say they're complete, and you forget about it.

But it's ok when:

You're an NFS server, and you're configured to NOT restart NFS automatically upon reboot. In the event of an ungraceful crash, admin intervention is required, and the admin is aware that he needs to reboot the NFS clients before starting the NFS services again.

You're a database server, and your clients are all inside yourself, either as VMs or services of various kinds.

You're a VM server, and all of your VMs are desktop clients, like a Windows 7 machine for example. None of your guests are servers in and of themselves maintaining state with external entities (such as processing credit card transactions, serving a database, or serving files). By mere virtue of the fact that you crash ungracefully, your guests also crash ungracefully. You all reboot, rewind a few seconds, no problem.
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 11:57 UTC
[zfs-discuss] Making ZIL faster
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Schweiss, Chip
>
> - The ZIL can have any number of SSDs attached, either mirrored or
> individually. ZFS will stripe across these in a raid0 or raid10 fashion
> depending on how you configure them.

I'm regurgitating something somebody else said - but I don't know where. I believe multiple ZIL devices don't get striped. They get round-robin'd. This means your ZIL can absolutely become a bottleneck, if you're doing sustained high-throughput (not high-IOPS) sync writes. But the way to prevent that bottleneck is by tuning the ... I don't know the names of the parameters. Some parameters that indicate "a sync write larger than X should skip the ZIL and go directly to pool."

> - To determine the true maximum streaming performance of the ZIL, setting
> sync=disabled will only use the in-RAM ZIL. This gives up power protection
> for synchronous writes.

There is no RAM ZIL. The basic idea behind the ZIL is this: some applications simply tell the system to "write," the system buffers these writes in memory, and the application continues processing. But some applications do not want the OS to buffer writes, so they issue writes in "sync" mode. These applications issue the write command and block there until the OS says it's written to nonvolatile storage. In ZFS, this means the transaction gets written to the ZIL, and then it gets put into the memory buffer just like any other write. Upon reboot, when the filesystem is mounting, ZFS will always look in the ZIL to see if there are any transactions that have not yet been played to disk.

So, when you set sync=disabled, you're just bypassing that step. You're lying to the applications: if they say "I want to know when this is written to disk," you just immediately say "Yup, it's done" unconditionally. This is the highest-performance thing you could possibly do - but depending on your system workload, it could put you at risk for data loss.

> - Mirroring SSDs is only helpful if one SSD fails at the time of a power
> failure. This leaves several unanswered questions: How good is ZFS at
> detecting that an SSD is no longer a reliable write target? What chance of
> data corruption does this introduce with up to 10 seconds of data written
> on the SSD? Does ZFS read the ZIL during a scrub to determine if our SSD is
> returning what we write to it?

Not just power loss -- any ungraceful crash.

ZFS doesn't have any way to scrub ZIL devices, so it's not very good at detecting failed ZIL devices. There is definitely the possibility for an SSD to enter a failure mode where you write to it, it doesn't complain, but you wouldn't be able to read it back if you tried. Also, upon an ungraceful crash, even if you try to read that data and fail to get it back, there's no way to know that you should have expected something. So you still don't detect the failure.

If you want to maintain your SSD periodically, you should do something like: remove it as a ZIL device, create a new pool with just this disk in it, write a bunch of random data to the new junk pool, scrub the pool, then destroy the junk pool and return the disk as a ZIL device to the main pool. This does not guarantee anything - but then - nothing anywhere guarantees anything. This is a good practice, and it definitely puts you into a territory of reliability better than the competing alternatives.

> - Zpool versions 19 and higher should be able to survive a ZIL failure, only
> losing the uncommitted data. However, I haven't seen good enough
> information that I would necessarily trust this yet.

That was a very long time ago. (What, 2-3 years?) It's very solid now.

> - Several threads seem to suggest a ZIL throughput limit of 1 Gb/s with
> SSDs. I'm not sure if that is current, but I can't find any reports of better
> performance. I would suspect that a DDR drive or Zeus RAM as ZIL would push
> past this.

Whenever I measure the sustainable throughput of an SSD, HDD, DDRDrive, or anything else, I find very few devices can actually sustain faster than 1Gb/s, for use as a ZIL or anything else. Published specs are often higher, but not realistic. If you are ZIL bandwidth limited, you should consider tuning the size of stuff that goes to ZIL.
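A sketch of that maintenance cycle, assuming a standalone (unmirrored) log device named c4t2d0 and a pool named tank (both placeholder names, not from the thread):

    # Pull the SSD out of the main pool (log devices can be removed).
    zpool remove tank c4t2d0

    # Build a throwaway pool on it, write a few GB of data, then scrub so
    # ZFS reads back and checksums everything that was just written.
    zpool create junk c4t2d0
    dd if=/dev/urandom of=/junk/testfile bs=1024k count=4096
    zpool scrub junk
    zpool status -v junk      # wait for the scrub to finish; checksum errors show here

    # Tear the test pool down and give the SSD back to the main pool as a slog.
    zpool destroy junk
    zpool add tank log c4t2d0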
On 10/04/12 05:30, Schweiss, Chip wrote:
> Thanks for all the input. It seems information on the performance of the ZIL
> is sparse and scattered. I've spent significant time researching this the past
> day. I'll summarize what I've found. Please correct me if I'm wrong.
>
> - The ZIL can have any number of SSDs attached, either mirrored or individually.
> ZFS will stripe across these in a raid0 or raid10 fashion depending on how you
> configure them.

The ZIL code chains blocks together and these are allocated round robin among slogs or, if they don't exist, then the main pool devices.

> - To determine the true maximum streaming performance of the ZIL, setting
> sync=disabled will only use the in-RAM ZIL. This gives up power protection
> for synchronous writes.

There is no RAM ZIL. If sync=disabled then all writes are asynchronous and are written as part of the periodic ZFS transaction group (txg) commit that occurs every 5 seconds.

> - Many SSDs do not help protect against power failure because they have their
> own RAM cache for writes. This effectively makes the SSD useless for this
> purpose and potentially introduces a false sense of security. (These SSDs are
> fine for L2ARC.)

The ZIL code issues a write cache flush to all devices it has written before returning from the system call. I've heard that not all devices obey the flush, but we consider them as broken hardware. I don't have a list to avoid.

> - Mirroring SSDs is only helpful if one SSD fails at the time of a power
> failure. This leaves several unanswered questions: How good is ZFS at detecting
> that an SSD is no longer a reliable write target? What chance of data corruption
> does this introduce with up to 10 seconds of data written on the SSD? Does ZFS
> read the ZIL during a scrub to determine if our SSD is returning what we write
> to it?

If the ZIL code gets a block write failure it will force the txg to commit before returning. It will depend on the drivers and IO subsystem as to how hard it tries to write the block.

> - Zpool versions 19 and higher should be able to survive a ZIL failure, only
> losing the uncommitted data. However, I haven't seen good enough information
> that I would necessarily trust this yet.

This has been available for quite a while and I haven't heard of any bugs in this area.

> - Several threads seem to suggest a ZIL throughput limit of 1 Gb/s with SSDs.
> I'm not sure if that is current, but I can't find any reports of better
> performance. I would suspect that a DDR drive or Zeus RAM as ZIL would push
> past this.

1GB/s seems very high, but I don't have any numbers to share.
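For reference, the 5-second txg interval mentioned above corresponds to the zfs_txg_timeout kernel tunable on illumos/Solaris-derived systems; a quick way to read it from a live kernel (a sketch based on that assumption, not taken from the thread):

    # Print the current txg commit interval, in seconds.
    echo zfs_txg_timeout/D | mdb -k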
Thanks Neil, we always appreciate your comments on ZIL implementation. One additional comment below...

On Oct 4, 2012, at 8:31 AM, Neil Perrin <neil.perrin at oracle.com> wrote:

> On 10/04/12 05:30, Schweiss, Chip wrote:
>> - Several threads seem to suggest a ZIL throughput limit of 1 Gb/s with SSDs.
>> I'm not sure if that is current, but I can't find any reports of better
>> performance. I would suspect that a DDR drive or Zeus RAM as ZIL would push
>> past this.
>
> 1GB/s seems very high, but I don't have any numbers to share.

It is not unusual for workloads to exceed the performance of a single device. For example, if you have a device that can achieve 700 MB/sec, but a workload generated by lots of clients accessing the server via 10GbE (1 GB/sec), then it should be immediately obvious that the slog needs to be striped. Empirically, this is also easy to measure.
-- richard
Again, thanks for the input and clarifications.

I would like to clarify the numbers I was talking about with the ZIL performance specs I was seeing discussed on other forums. Right now I'm getting streaming performance of sync writes at about 1 Gbit/s. My target is closer to 10 Gbit/s. If I get to build this system, it will house a decent-sized VMware NFS store with 200+ VMs, which will be dual connected via 10GbE. This is all medical imaging research. We move data around by the TB and fast streaming is imperative.

The system I've been testing with is 10GbE connected and I have about 50 VMs running very happily, and I haven't yet found my random I/O limit. However, every time I storage vMotion a handful of additional VMs, the ZIL seems to max out its write speed to the SSDs and random I/O also suffers. Without the SSD ZIL, random I/O is very poor. I will be doing some testing with sync=disabled tomorrow and see how things perform.

If anyone can testify to a ZIL device (or devices) that can keep up with 10GbE or more of streaming synchronous writes, please let me know.

-Chip

On Thu, Oct 4, 2012 at 1:33 PM, Richard Elling <richard.elling at gmail.com> wrote:
> It is not unusual for workloads to exceed the performance of a single device.
> For example, if you have a device that can achieve 700 MB/sec, but a workload
> generated by lots of clients accessing the server via 10GbE (1 GB/sec), then it
> should be immediately obvious that the slog needs to be striped. Empirically,
> this is also easy to measure.
> -- richard
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 21:57 UTC
[zfs-discuss] Making ZIL faster
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Schweiss, Chip
>
> If I get to build this system, it will house a decent-sized VMware
> NFS store with 200+ VMs, which will be dual connected via 10GbE. This is all
> medical imaging research. We move data around by the TB and fast
> streaming is imperative.

This might not carry over to vmware, iscsi vs nfs. But with virtualbox, using a local file versus using a local zvol, I have found the zvol is much faster for the guest OS. Also, by default the zvol will have smarter reservations (refreservation), which seems to be a good thing.
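A sketch of the zvol approach, with tank/guest1 as a placeholder name (zfs create -V and the refreservation property are standard; the size and name are assumptions):

    # A 40 GB zvol for a guest; a non-sparse zvol automatically gets a
    # refreservation sized to cover the whole volume.
    zfs create -V 40G tank/guest1
    zfs get volsize,refreservation tank/guest1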
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-04 21:59 UTC
[zfs-discuss] Making ZIL faster
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Neil Perrin
>
> The ZIL code chains blocks together and these are allocated round robin
> among slogs or, if they don't exist, then the main pool devices.

So, if somebody is doing sync writes as fast as possible, would they gain more bandwidth by adding multiple slog devices?
On Oct 4, 2012, at 1:33 PM, "Schweiss, Chip" <chip at innovates.com> wrote:

> Again, thanks for the input and clarifications.
> ...
> If anyone can testify to a ZIL device (or devices) that can keep up with 10GbE
> or more of streaming synchronous writes, please let me know.

Quick datapoint: with qty 3 ZeusRAMs as a striped slog, we could push 1.3 GBytes/sec of storage vMotion on a relatively modest system. To sustain that sort of thing often requires full system-level tuning and proper systems engineering design. Fortunately, people tend to not do storage vMotion on a continuous basis.
-- richard
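A sketch of what a striped slog like that looks like at the pool level, using placeholder pool and device names (the thread does not give the actual layout):

    # Devices listed after "log" without the mirror keyword become separate
    # log vdevs, and the ZIL round-robins its block allocations across them.
    zpool add tank log c5t0d0 c5t1d0 c5t2d0

    # Striped mirrored pairs are also possible if redundancy is wanted:
    # zpool add tank log mirror c5t0d0 c5t1d0 mirror c5t2d0 c5t3d0

    zpool status tank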
On 10/04/12 15:59, Edward Ned Harvey (opensolarisisdeadlongliveopensolaris) wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Neil Perrin
>>
>> The ZIL code chains blocks together and these are allocated round robin
>> among slogs or, if they don't exist, then the main pool devices.
>
> So, if somebody is doing sync writes as fast as possible, would they gain more
> bandwidth by adding multiple slog devices?

In general - yes, but it really depends. Multiple synchronous writes of any size across multiple file systems will fan out across the log devices. That is because there is a separate independent log chain for each file system.

Also, large synchronous writes (e.g. 1MB) within a specific file system will be spread out. The ZIL code will try to allocate a block to hold all the records it needs to commit, up to the largest block size - which currently for you should be 128KB. Anything larger will allocate a new block - on a different device if there are multiple devices.

However, lots of small synchronous writes to the same file system might not use more than one 128K block, and so might not benefit from multiple slog devices.

Neil.
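Given the per-filesystem log chains described above, one practical consequence is that splitting a large NFS datastore into several datasets lets sync commits fan out across multiple slogs. A sketch with placeholder dataset names (the NFS sharing step assumes the stock sharenfs property; adjust options for a real VMware datastore):

    # Each file system gets its own independent ZIL chain, so commits from
    # different datastores can land on different log devices in parallel.
    zfs create tank/vmfs01
    zfs create tank/vmfs02
    zfs set sharenfs=on tank/vmfs01
    zfs set sharenfs=on tank/vmfs02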
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Oct-05 13:50 UTC
[zfs-discuss] Making ZIL faster
> From: Neil Perrin [mailto:neil.perrin at oracle.com]
>
> In general - yes, but it really depends. Multiple synchronous writes of any
> size across multiple file systems will fan out across the log devices. That is
> because there is a separate independent log chain for each file system.
>
> Also, large synchronous writes (e.g. 1MB) within a specific file system will be
> spread out. The ZIL code will try to allocate a block to hold all the records it
> needs to commit, up to the largest block size - which currently for you should
> be 128KB. Anything larger will allocate a new block - on a different device if
> there are multiple devices.
>
> However, lots of small synchronous writes to the same file system might not
> use more than one 128K block, and so might not benefit from multiple slog
> devices.

That is an awesome explanation. Thank you.