Hi,

I did some tests on a Sun Fire X4540 with an external J4500 array (connected via two HBA ports), i.e. there are 96 disks in total, configured as seven 12-disk raidz2 vdevs (plus system, spares, and unused disks), providing a ~63 TB pool with fletcher4 checksums. The system was recently equipped with a Sun Flash Accelerator F20 with 4 FMod modules to be used as log devices (ZIL). I was using the latest snv_134 software release.

Here are some first performance numbers for the extraction of an uncompressed 50 MB tarball on a Linux (CentOS 5.4 x86_64) NFS client which mounted the test filesystem (no compression or dedup) via NFSv3 (rsize=wsize=32k,sync,tcp,hard):

standard ZIL:         7m40s   (ZFS default)
1x SSD ZIL:           4m07s   (Flash Accelerator F20)
2x SSD ZIL:           2m42s   (Flash Accelerator F20)
2x SSD mirrored ZIL:  3m59s   (Flash Accelerator F20)
3x SSD ZIL:           2m47s   (Flash Accelerator F20)
4x SSD ZIL:           2m57s   (Flash Accelerator F20)
disabled ZIL:         0m15s
(local extraction     0m0.269s)

I was not so much interested in the absolute numbers but rather in the relative performance differences between the standard ZIL, the SSD ZIL, and the disabled ZIL cases.

Any opinions on the results? I wish the SSD ZIL performance was closer to the disabled ZIL case than it is right now.

At the moment I tend to use two F20 FMods for the log and the other two FMods as L2ARC cache devices (although the system has lots of system memory, i.e. the L2ARC is not really necessary). But the speedup of disabling the ZIL altogether is appealing (and would probably be acceptable in this environment).
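For reference, the client-side test was essentially of this form (server name, mount point and tarball path below are placeholders, not the exact ones used):

    # on the Linux NFS client
    mount -t nfs -o vers=3,rsize=32768,wsize=32768,sync,tcp,hard server:/tank/test /mnt/test
    cd /mnt/test
    time tar xf /var/tmp/test-50MB.tar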
Hey Karsten,

Very interesting data. Your test is inherently single-threaded so I'm not surprised that the benefits aren't more impressive -- the flash modules on the F20 card are optimized more for concurrent IOPS than single-threaded latency.

Adam

--
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
On 3/30/2010 2:44 PM, Adam Leventhal wrote:
> Very interesting data. Your test is inherently single-threaded so I'm not surprised
> that the benefits aren't more impressive -- the flash modules on the F20 card are
> optimized more for concurrent IOPS than single-threaded latency.

Yes, it would be interesting to see the average numbers for 10 or more clients (or jobs on one client) all performing that same test.

-Kyle
Hi Karsten, Adam, List,

Adam Leventhal wrote:
> Very interesting data. Your test is inherently single-threaded so I'm not surprised
> that the benefits aren't more impressive -- the flash modules on the F20 card are
> optimized more for concurrent IOPS than single-threaded latency.

Well, I actually wanted to do a bit more bottleneck searching, but let me weigh in with some measurements of our own :)

We're on a single X4540 with quad-core CPUs, so we're on the older HyperTransport bus. We connected it up to two X2200s running CentOS 5, each on its own 1Gb link. Switched write caching off with the following addition to the /kernel/drv/sd.conf file (Karsten: if you didn't do this already, you _really_ want to :) ):

# http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
# Add whitespace to make the vendor ID (VID) 8 ... and Product ID (PID) 16 characters long...
sd-config-list = "ATA     MARVELL SD88SA02", "cache-nonvolatile";
cache-nonvolatile=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

As a test we've found that untarring an Eclipse source tarball is a good use case, so we use that, called from a shell script that creates a directory, pushes into it and does the unpacking, 40 times on each machine.

Now for the interesting bit:

When we use one vmod, both machines are finished in about 6min45; zilstat maxes out at about 4200 IOPS.
Using four vmods it takes about 6min55; zilstat maxes out at 2200 IOPS.

In both cases, probing the HyperTransport bus seems to show no bottleneck there (although I'd like to see the bidirectional flow, but I know we can't :) ). The network stays comfortably under 400 Mbit/s, and that's peak load when using 1 vmod.

Looking at the I/O-connection architecture, it figures that in this setup we traverse the different HT busses quite a lot. So we've also placed an Intel dual 1Gb NIC in another PCIe slot, so that ZIL traffic should only have to use 1 HT bus (not counting offloading intelligence). That helped a bit, but not much:

Around 6min35 using one vmod and 6min45 using four vmods.

It made looking at the HT DTrace more telling though, since the outgoing HT bus to the F20 (and the e1000s) is now, expectedly, a better indication of the ZIL traffic.

We didn't do the 40 x 2 untar test whilst not using an SSD device. As an indication: unpacking a single tarball then takes about 1min30.

In case it means anything, a single tarball unpack for no_zil, 1vmod, 1vmod_Intel, 4vmods, 4vmods_Intel measures around (decimals only used as an indication!): 4s, 12s, 11.2s, 12.5s, 11.6s.

Taking this all into account, I still don't see what's holding it up. Interestingly enough, the client-side times are close within about 10 secs, but zilstat shows something different. Hypothesis: zilstat shows only one vmod and we're capped in a layer above the ZIL? Can't rule out networking just yet, but my gut tells me we're not network bound here. That leaves the ZFS ZPL/VFS layer?

I'm very open to suggestions on how to proceed... :)

With kind regards,

Jeroen
--
Jeroen Roodhart
ICT Consultant
University of Amsterdam
j.r.roodhart uva.nl          Informatiseringscentrum
Technical support/ATG
--
See http://www.science.uva.nl/~jeroen for openPGP public key
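(In case anyone wants to repeat this: the VID/PID strings for sd-config-list can be read from the device inquiry data, roughly like this; the grep pattern is just an example for these particular modules:)

    # list vendor/product strings as the sd driver sees them
    iostat -En | grep -i marvell
    #   Vendor: ATA      Product: MARVELL SD88SA02 ...

    # or interactively: format -> select the FMod device -> inquiry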
On Mar 30, 2010, at 2:50 PM, Jeroen Roodhart wrote:

> Switched write caching off with the following addition to the /kernel/drv/sd.conf file
> (Karsten: if you didn't do this already, you _really_ want to :) ):
>
> sd-config-list = "ATA     MARVELL SD88SA02", "cache-nonvolatile";
> cache-nonvolatile=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

If you are going to trick the system into thinking a volatile cache is nonvolatile, you might as well disable the ZIL -- the data corruption potential is the same.

> Taking this all into account, I still don't see what's holding it up. Interestingly
> enough, the client-side times are close within about 10 secs, but zilstat shows
> something different. Hypothesis: zilstat shows only one vmod and we're capped in a
> layer above the ZIL? Can't rule out networking just yet, but my gut tells me we're
> not network bound here. That leaves the ZFS ZPL/VFS layer?

The difference between writing to the ZIL and not writing to the ZIL is perhaps thousands of CPU cycles. For a latency-sensitive workload this will be noticed.
 -- richard
ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
> If you are going to trick the system into thinking a volatile cache is nonvolatile, you
> might as well disable the ZIL -- the data corruption potential is the same.

I'm sorry? I believe the F20 has a supercap or the like? The advice on:

http://wikis.sun.com/display/Performance/Tuning+ZFS+for+the+F5100#TuningZFSfortheF5100-ZFSF5100

is to disable write caching altogether. We opted not to do _that_ though... :)

Are you sure that disabling the write cache on the F20 is a bad thing to do?

With kind regards,

Jeroen
On Mar 30, 2010, at 3:32 PM, Jeroen Roodhart wrote:
>> If you are going to trick the system into thinking a volatile cache is nonvolatile, you
>> might as well disable the ZIL -- the data corruption potential is the same.
>
> I'm sorry? I believe the F20 has a supercap or the like? The advice on:

You are correct, I misread the Marvell (as in F20) and X4540 (as in not X4500) combination.

> http://wikis.sun.com/display/Performance/Tuning+ZFS+for+the+F5100#TuningZFSfortheF5100-ZFSF5100
>
> is to disable write caching altogether. We opted not to do _that_ though... :)

Good idea. That recommendation is flawed for the general case and only applies when all devices have nonvolatile caches.

> Are you sure that disabling the write cache on the F20 is a bad thing to do?

I agree that it is a reasonable choice. For this case, what is the average latency to the F20?
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Richard Elling wrote:
> On Mar 30, 2010, at 3:32 PM, Jeroen Roodhart wrote:
>> Are you sure that disabling the write cache on the F20 is a bad thing to do?
>
> I agree that it is a reasonable choice.

For those following along at home, I'm pretty sure that the terminology being used is confusing at best, and just plain wrong at worst.

The write cache is _not_ being disabled. The write cache is being marked as non-volatile. By marking the write cache as non-volatile, one is telling ZFS to not issue cache flush commands.

BTW, why is a Sun/Oracle branded product not properly respecting the NV bit in the cache flush command? This seems remarkably broken, and leads to the amazingly bad advice given on the wiki referenced above.

-- Carson
> But the speedup of disabling the ZIL altogether is appealing (and would
> probably be acceptable in this environment).

Just to make sure you know ... if you disable the ZIL altogether, and you have a power interruption, failed cpu, or kernel halt, then you're likely to have a corrupt unusable zpool, or at least data corruption. If that is indeed acceptable to you, go nuts. ;-)
> I was not so much interested in the absolute numbers but rather in the relative
> performance differences between the standard ZIL, the SSD ZIL and the disabled
> ZIL cases.

Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD goes bad, you lose your whole pool. Or at least suffer data corruption.
On Tue, 30 Mar 2010, Edward Ned Harvey wrote:
>> But the speedup of disabling the ZIL altogether is appealing (and would
>> probably be acceptable in this environment).
>
> Just to make sure you know ... if you disable the ZIL altogether, and you
> have a power interruption, failed cpu, or kernel halt, then you're likely to
> have a corrupt unusable zpool, or at least data corruption. If that is
> indeed acceptable to you, go nuts. ;-)

I believe that the above is wrong information as long as the devices involved do flush their caches when requested to. Zfs still writes data in order (at the TXG level) and advances to the next transaction group when the devices written to affirm that they have flushed their cache. Without the ZIL, data claimed to be synchronously written since the previous transaction group may be entirely lost.

If the devices don't flush their caches appropriately, the ZIL is irrelevant to pool corruption.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> Again, we can't get a straight answer on this one......
> (or at least not 1 straight answer...)
>
> Since the ZIL logs are committed atomically they are either committed
> in FULL, or NOT at all (by way of rollback of incomplete ZIL applies at
> zpool mount time / or transaction rollbacks if things go exceptionally
> bad), the only LOST data would be what hasn't been transferred from ZIL
> to the primary pool......
>
> But the pool should be "sane".

If this is true ... Suppose you shut down a system, remove the ZIL device, and power back on again. What will happen? I'm informed that with current versions of Solaris, you simply can't remove a ZIL device once it's added to a pool. (That's changed in recent versions of OpenSolaris.) ... but in any system where removing the ZIL isn't allowed, what happens if the ZIL is removed? I have to assume something which isn't quite sane happens.
On Tue, 30 Mar 2010, Edward Ned Harvey wrote:
> If this is true ... Suppose you shut down a system, remove the ZIL device,
> and power back on again. What will happen? I'm informed that with current
> versions of Solaris, you simply can't remove a ZIL device once it's added to
> a pool. (That's changed in recent versions of OpenSolaris.) ... but in any
> system where removing the ZIL isn't allowed, what happens if the ZIL is
> removed?

If the ZIL device goes away then zfs might refuse to use the pool without user affirmation (due to potential loss of uncommitted transactions), but if the dedicated ZIL device is gone, zfs will use disks in the main pool for the ZIL.

This has been clarified before on the list by top zfs developers.

Bob
> If the ZIL device goes away then zfs might refuse to use the pool
> without user affirmation (due to potential loss of uncommitted
> transactions), but if the dedicated ZIL device is gone, zfs will use
> disks in the main pool for the ZIL.
>
> This has been clarified before on the list by top zfs developers.

Here's a snippet from man zpool (latest version available today in Solaris):

     zpool remove pool device ...

         Removes the specified device from the pool. This command
         currently only supports removing hot spares and cache
         devices. Devices that are part of a mirrored configura-
         tion can be removed using the zpool detach command.
         Non-redundant and raidz devices cannot be removed from a
         pool.

So you think it would be ok to shut down, physically remove the log device, and then power back on again, and force import the pool? So although there may be no "live" way to remove a log device from a pool, it might still be possible if you offline the pool to ensure writes are all completed before removing the device?

If it were really just that simple ... if zfs only needed to stop writing to the log device and ensure the cache were flushed, and then you could safely remove the log device ... doesn't it seem silly that there was ever a time when that wasn't implemented? Like ... today. (Still not implemented in Solaris, only OpenSolaris.)

I know I am not going to put the health of my pool on the line, assuming this line of thought.
> if you disable the ZIL altogether, and you have a power interruption, failed cpu,
> or kernel halt, then you're likely to have a corrupt unusable zpool

the pool will always be fine, no matter what.

> or at least data corruption.

yea, it's a good bet that data sent to your file or zvol will not be there when the box comes back, even though your program had finished seconds before the crash.

Rob
Allow me to clarify a little further why I care about this so much. I have a Solaris file server with all the company jewels on it. I had a pair of Intel X25 SSD mirrored log devices. One of them failed. The replacement device came with a newer version of firmware on it. Now, instead of appearing as 29.802 GB, it appears as 29.801 GB. I cannot zpool attach. New device is too small.

So apparently I'm the first guy this happened to. Oracle is caught totally off guard. They're pulling their inventory of X25s from dispatch warehouses, and inventorying all the firmware versions, and trying to figure it all out. Meanwhile, I'm still degraded. Or at least, I think I am.

Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until they can locate a drive with the correct firmware). All the support people I have on the phone are just as scared as I am. "Well, we could upgrade the firmware of your existing drive, but that'll reduce it by 0.001 GB, and that might just create a time bomb to destroy your pool at a later date." So we don't do it.

Nobody has suggested that I simply shut down and remove my unmirrored SSD, and power back on.
On 03/30/10 20:00, Bob Friesenhahn wrote:
> I believe that the above is wrong information as long as the devices
> involved do flush their caches when requested to. Zfs still writes
> data in order (at the TXG level) and advances to the next transaction
> group when the devices written to affirm that they have flushed their
> cache. Without the ZIL, data claimed to be synchronously written
> since the previous transaction group may be entirely lost.
>
> If the devices don't flush their caches appropriately, the ZIL is
> irrelevant to pool corruption.

Yes, Bob is correct - that is exactly how it works.

Neil.
> I believe that the above is wrong information as long as the devices
> involved do flush their caches when requested to. [...]
>
> If the devices don't flush their caches appropriately, the ZIL is
> irrelevant to pool corruption.

I stand corrected. You don't lose your pool. You don't have a corrupted filesystem. But you lose whatever writes were not yet completed, so if those writes happen to be things like database transactions, you could have corrupted databases or files, or missing files if you were creating them at the time, and stuff like that. AKA, data corruption.

But not pool corruption, and not filesystem corruption.
> Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD
> goes bad, you lose your whole pool. Or at least suffer data corruption.

Hmmm, I thought that in that case ZFS reverts to the "regular on disks" ZIL?

With kind regards,

Jeroen
> The write cache is _not_ being disabled. The write cache is being marked
> as non-volatile.

Of course you're right :) Please filter my postings with a "sed 's/write cache/write cache flush/g'" ;)

> BTW, why is a Sun/Oracle branded product not properly respecting the NV
> bit in the cache flush command? This seems remarkably broken, and leads
> to the amazingly bad advice given on the wiki referenced above.

I suspect it has something to do with "emulating disk semantics" over PCIe. Anyway, this did get us stumped in the beginning; performance wasn't better than when using an OCZ Vertex Turbo ;)

By the way, the URL to the reference is part of the official F20 product documentation (that's how we found it in the first place)...

With kind regards,

Jeroen
> I stand corrected. You don't lose your pool. You don't have a corrupted
> filesystem. But you lose whatever writes were not yet completed [...]
>
> But not pool corruption, and not filesystem corruption.

Yeah, that's a big difference! :) Of course we could not live with pool or fs corruption. However, we can live with the fact that the NFS-written data is not all on disk in case of a server crash, even though the NFS client could otherwise rely on the write guarantee of the NFS protocol. I.e. we do not use it for db transactions or something like that.
Hi Adam,

> Very interesting data. Your test is inherently single-threaded so I'm not
> surprised that the benefits aren't more impressive -- the flash modules
> on the F20 card are optimized more for concurrent IOPS than
> single-threaded latency.

Thanks for your reply. I'll probably test the multiple-writer case, too.

But frankly, at the moment I care the most about the single-threaded case, because if we put e.g. user homes on this server I think they would be severely disappointed if they had to wait 2m42s just to extract a rather small 50 MB tarball. The default 7m40s without SSD log was unacceptable, and we were hoping that the F20 would make a big difference and bring the performance down to acceptable runtimes. But IMHO 2m42s is still too slow and disabling the ZIL seems to be the only option.

Knowing that 100s of users could do this in parallel with good performance is nice, but it does not improve the situation for the single user who only cares about his own tar run. If there's anything else we can do/try to improve the single-threaded case I'm all ears.
On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss <k.weiss at science-computing.de> wrote:
> Knowing that 100s of users could do this in parallel with good performance
> is nice, but it does not improve the situation for the single user who only
> cares about his own tar run. If there's anything else we can do/try to improve
> the single-threaded case I'm all ears.

Use something other than Open/Solaris with ZFS as an NFS server? :)

I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds, of threads. Getting halfway decent performance from NFS and ZFS is impossible unless you disable the ZIL.

You'd be better off getting a NetApp.

--
Brent Jones
brent at servuhome.net
Brent Jones wrote:
> I don't think you'll find the performance you paid for with ZFS and
> Solaris at this time. I've been trying for more than a year, and
> watching dozens, if not hundreds, of threads. Getting halfway decent
> performance from NFS and ZFS is impossible unless you disable the ZIL.
>
> You'd be better off getting a NetApp.

A few days ago I posted to nfs-discuss with a proposal to add some mount/share options to change the semantics of an NFS-mounted filesystem so that they parallel those of a local filesystem. The main point is that data gets flushed to stable storage only if the client explicitly requests so via fsync or O_DSYNC, not implicitly with every close(). That would give you the performance you are seeking without sacrificing data integrity for applications that need it.

I get the impression that I'm not the only one who could be interested in that ;)

-Arne
> Nobody knows any way for me to remove my unmirrored
> log device. Nobody knows any way for me to add a mirror to it (until

Since snv_125 you can remove log devices. See
http://bugs.opensolaris.org/view_bug.do?bug_id=6574286

I've used this all the time during my testing and was able to remove both mirrored and unmirrored log devices without any problems (and without reboot). I'm using snv_134.
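For what it's worth, the operations involved look roughly like this (pool and device names are only examples, not my actual configuration):

    # add a single log device, or a mirrored log
    zpool add tank log c1t2d0
    zpool add tank log mirror c1t2d0 c1t3d0

    # remove a (non-mirrored) log device again -- works since snv_125
    zpool remove tank c1t2d0

    # detach one side of a mirrored log
    zpool detach tank c1t3d0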
> Use something other than Open/Solaris with ZFS as an NFS server? :)
>
> I don't think you'll find the performance you paid for with ZFS and
> Solaris at this time. I've been trying for more than a year, and
> watching dozens, if not hundreds, of threads. Getting halfway decent
> performance from NFS and ZFS is impossible unless you disable the ZIL.

Well, for lots of environments disabling the ZIL is perfectly acceptable. And frankly, the reason you get better performance out of the box on Linux as an NFS server is that it actually behaves as if the ZIL were disabled - so disabling the ZIL on ZFS for NFS shares is no worse than using Linux here, or any other OS which behaves in the same manner. Actually it makes it better, as even with the ZIL disabled the ZFS filesystem is always consistent on disk and you still get all the other benefits of ZFS.

What would be useful, though, is to be able to easily disable the ZIL per dataset instead of an OS-wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. Should be rather sooner than later.

> You'd be better off getting a NetApp.

Well, spend some extra money on a really fast NVRAM solution for the ZIL and you will get a much faster ZFS environment than NetApp and still spend much less money. Not to mention all the extra flexibility compared to NetApp.

--
Robert Milkowski
http://milek.blogspot.com
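For reference, the OS-wide switch I mean is the zil_disable tunable; roughly (with the usual caveats, and note that filesystems may need to be remounted for a live change to take effect):

    # in /etc/system (takes effect at next boot)
    set zfs:zil_disable = 1

    # or on a live system via mdb
    echo zil_disable/W0t1 | mdb -kw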
> I stand corrected. You don't lose your pool. You don't have a corrupted
> filesystem. But you lose whatever writes were not yet completed, so if
> those writes happen to be things like database transactions, you could have
> corrupted databases or files, or missing files if you were creating them at
> the time, and stuff like that. AKA, data corruption.
>
> But not pool corruption, and not filesystem corruption.

Which is the expected behavior when you break NFS requirements, as Linux does out of the box. Disabling the ZIL on an NFS server makes it no worse than the standard Linux behaviour - now you get decent performance at the cost of some data possibly getting corrupted from an NFS client's point of view.

But then there are environments where this is perfectly acceptable, as you are not running critical databases there but rather user home directories, and ZFS will currently flush a transaction group after at most 30s, so a user won't be able to lose more than roughly the last 30s of writes if the NFS server were to suddenly lose power.

To clarify - if the ZIL is disabled it makes no difference at all to pool/filesystem-level consistency.

--
Robert Milkowski
http://milek.blogspot.com
> Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD
> goes bad, you lose your whole pool. Or at least suffer data corruption.

This is not true. If the ZIL device were to die while the pool is imported, ZFS would start using a ZIL within the pool and continue to operate.

On the other hand, if your server were to suddenly lose power, and then when you power it up later on ZFS detects that the ZIL is broken/gone, it will require a sysadmin intervention to force the pool import and yes, possibly lose some data. But how is that different from any other solution where your log is put on a separate device? Well, it is actually different. With ZFS you can still guarantee it to be consistent on disk, while others generally can't, and often you will have to run fsck to even mount a filesystem read/write...

--
Robert Milkowski
http://milek.blogspot.com
> What would be useful, though, is to be able to easily disable the ZIL per
> dataset instead of an OS-wide switch. This feature has already been coded
> and tested and awaits a formal process to be completed in order to get
> integrated. Should be rather sooner than later.

I agree.

> Well, spend some extra money on a really fast NVRAM solution for the ZIL and
> you will get a much faster ZFS environment than NetApp and still spend much
> less money. Not to mention all the extra flexibility compared to NetApp.

Do you have a concrete recommendation we could use in the X4540 instead of the F20?
Hi Jeroen, Adam!

> link. Switched write caching off with the following
> addition to the /kernel/drv/sd.conf file (Karsten: if
> you didn't do this already, you _really_ want to :)

Okay, I bite! :)

format -> inquiry on the F20 FMod disks returns:

# Vendor:   ATA
# Product:  MARVELL SD88SA02

So I put this in /kernel/drv/sd.conf and rebooted:

# KAW, 2010-03-31
# Set F20 FMod devices to non-volatile mode
# See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes
sd-config-list = "ATA     MARVELL SD88SA02", "nvcache1";
nvcache1=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

Now the tarball extraction test with active ZIL finishes in ~0m32s! I've tested with a mirrored SSD log and two separate SSD log devices. The runtime is nearly the same. Compared to the 2m42s before the /kernel/drv/sd.conf modification this is a huge improvement. The performance with active ZIL would be acceptable now.

But is this mode of operation *really* safe?

FWIW, zilstat during the test shows this:

   N-Bytes  N-Bytes/s  N-Max-Rate    B-Bytes  B-Bytes/s  B-Max-Rate    ops  <=4kB  4-32kB  >=32kB
         0          0           0          0          0           0      0      0       0       0
   1039072    1039072     1039072    3772416    3772416     3772416    610    299     311       0
   1522496    1522496     1522496    5402624    5402624     5402624    874    429     445       0
   2292952    2292952     2292952    6746112    6746112     6746112    931    215     716       0
   2321272    2321272     2321272    6774784    6774784     6774784    931    208     723       0
   2303472    2303472     2303472    6549504    6549504     6549504    897    195     702       0
   2222632    2222632     2222632    6733824    6733824     6733824    935    226     709       0
   2198328    2198328     2198328    6668288    6668288     6668288    926    224     702       0
   2170000    2170000     2170000    6373376    6373376     6373376    878    200     678       0
   2185416    2185416     2185416    6352896    6352896     6352896    874    197     677       0
   2218040    2218040     2218040    6516736    6516736     6516736    897    203     694       0
   2436984    2436984     2436984    6549504    6549504     6549504    885    171     714       0

I.e. ~900 ops/s.
> Use something other than Open/Solaris with ZFS as an NFS server? :)
> [...]
> You'd be better off getting a NetApp.

Hah hah. I have a Sun X4275 server exporting NFS. We have clients on all 4 of the Gb ethers, and the Gb ethers are the bottleneck, not the disks or filesystem.

I suggest you either enable the WriteBack cache on your HBA, or add SSDs for ZIL. Performance is 5-10x higher this way than using "naked" disks. But of course, not as high as it is with a disabled ZIL.
Hi Karsten,

> But is this mode of operation *really* safe?

As far as I can tell it is.

- The F20 uses some form of power backup that should provide power to the interface card long enough to get the cache onto solid state in case of a power failure.
- Recollecting from earlier threads here: in case the card fails (but not the host), there should be enough data residing in memory for ZFS to safely switch to the regular on-disk ZIL.
- According to my contacts at Sun, the F20 is a viable replacement solution for the X25-E.
- Switching write caching off seems to be officially recommended on the Sun performance wiki (translated to "more sane defaults").

If I'm wrong here I'd like to know too, 'cause this is probably the way we're taking it into production. :)

With kind regards,

Jeroen
> > Nobody knows any way for me to remove my unmirrored
> > log device. Nobody knows any way for me to add a mirror to it (until
>
> Since snv_125 you can remove log devices. See
> http://bugs.opensolaris.org/view_bug.do?bug_id=6574286
>
> I've used this all the time during my testing and was able to remove both
> mirrored and unmirrored log devices without any problems (and without
> reboot). I'm using snv_134.

Aware. OpenSolaris can remove log devices. Solaris cannot. Yet. But if you want your server in production, you can get a support contract for Solaris. OpenSolaris cannot.
Hi Richard,

> For this case, what is the average latency to the F20?

I'm not giving the average since I only performed a single run here (still need to get autopilot set up :) ). However, here is a graph of iostat IOPS/svc_t sampled in 10-second intervals during a run of untarring an Eclipse tarball 40 times from two hosts. I'm using 1 vmod here.

http://www.science.uva.nl/~jeroen/zil_1slog_e1000_iostat_iops_svc_t_10sec_interval.pdf

Maximum svc_t is around 2.7ms averaged over 10s.

Still wondering why this won't scale out though. We don't seem to be CPU bound, unless ZFS limits itself to max 30% CPU time?

With kind regards,

Jeroen
> > Oh, one more comment. If you don't mirror your ZIL, and your unmirrored SSD
> > goes bad, you lose your whole pool. Or at least suffer data corruption.
>
> Hmmm, I thought that in that case ZFS reverts to the "regular on disks" ZIL?

I see the source for some confusion. On the ZFS Best Practices page:
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

It says:

    Failure of the log device may cause the storage pool to be inaccessible if you are
    running the Solaris Nevada release prior to build 96 and a release prior to the
    Solaris 10 10/09 release.

It also says:

    If a separate log device is not mirrored and the device that contains the log fails,
    storing log blocks reverts to the storage pool. ...

At the time when I built my system (Oct 2009) this is what it said:

    At present, until [http://bugs.opensolaris.org/view_bug.do?bug_id=6707530 CR 6707530]
    is integrated, failure of the log device may cause the storage pool to be
    inaccessible. Protecting the log device by mirroring will allow you to access the
    storage pool even if a log device has failed.
On Wed, Mar 31, 2010 at 6:31 AM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
> Aware. OpenSolaris can remove log devices. Solaris cannot. Yet. But if
> you want your server in production, you can get a support contract for
> Solaris. OpenSolaris cannot.

According to who?

http://www.opensolaris.com/learn/features/availability/

"Full production level support

Both Standard and Premium support offerings are available for deployment of Open HA Cluster 2009.06 with OpenSolaris 2009.06 with following configurations:"

--Tim
On Wed, 31 Mar 2010, Tim Cook wrote:
> http://www.opensolaris.com/learn/features/availability/
>
> "Full production level support
>
> Both Standard and Premium support offerings are available for deployment of
> Open HA Cluster 2009.06 with OpenSolaris 2009.06 with following configurations:"

This formal OpenSolaris release is too ancient to do him any good. In fact, zfs-wise, it lags the Solaris 10 releases.

If there is ever another OpenSolaris formal release, then the situation will be different.

Bob
On Wed, 31 Mar 2010, Karsten Weiss wrote:
> But frankly, at the moment I care the most about the single-threaded case,
> because if we put e.g. user homes on this server I think they would be
> severely disappointed if they had to wait 2m42s just to extract a rather
> small 50 MB tarball. The default 7m40s without SSD log was unacceptable,
> and we were hoping that the F20 would make a big difference and bring the
> performance down to acceptable runtimes. But IMHO 2m42s is still too slow
> and disabling the ZIL seems to be the only option.

Is extracting 50 MB tarballs something that your users do quite a lot of? Would your users be concerned if there was a possibility that after extracting a 50 MB tarball files are incomplete, whole subdirectories are missing, or file permissions are incorrect?

The Sun Flash Accelerator F20 was not strictly designed as a zfs log device. It was originally designed to be a database accelerator. It was repurposed for zfs slog use because it works. It is a bit wimpy for bulk data. If you need fast support for bulk writes, perhaps you need something like STEC's very expensive ZEUS SSD drive.

Bob
On Tue, March 30, 2010 22:40, Edward Ned Harvey wrote:
> Here's a snippet from man zpool (latest version available today in Solaris):
>
>      zpool remove pool device ...
>
>          Removes the specified device from the pool. This command
>          currently only supports removing hot spares and cache
>          devices. Devices that are part of a mirrored configura-
>          tion can be removed using the zpool detach command.
>          Non-redundant and raidz devices cannot be removed from a
>          pool.
>
> So you think it would be ok to shut down, physically remove the log device,
> and then power back on again, and force import the pool? So although

A "cache device" is for the L2ARC; a "log device" is for the ZIL. Log devices are removable as of snv_125 (mentioned in another e-mail).

If you want log removal in Solaris proper, and you have a support account, call up and ask that CR 6574286 be fixed:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286
On Wed, Mar 31, 2010 at 9:47 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> This formal OpenSolaris release is too ancient to do him any good. In
> fact, zfs-wise, it lags the Solaris 10 releases.
>
> If there is ever another OpenSolaris formal release, then the situation
> will be different.

C'mon now, have a little faith. It hasn't even slipped past March yet :) Of course it'd be way more fun if someone from Sun threw caution to the wind and told us what the hold-up is *cough*oracle*cough*.

--Tim
> I had a pair of Intel X25 SSD mirrored log devices. One of them failed. The
> replacement device came with a newer version of firmware on it. [...]
>
> So apparently I'm the first guy this happened to. Oracle is caught totally
> off guard.

This isn't the only problem that SnOracle has had with the X25s. We managed to reproduce a problem with the SSDs as ZIL on an X4250. An I/O error of some sort caused a retryable write error ... which brought throughput to 0 as if a PCI bus reset had occurred. Here's a sample of our output... you might want to check and see if you're getting similar errors.

Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,25f8 at 4/pci111d,801c at 0/pci111d,801c at 4/pci1000,3150 at 0 (mpt1):
Jan 10 21:36:52 tips-fs1.tamu.edu   Log info 31126000 received for target 15.
Jan 10 21:36:52 tips-fs1.tamu.edu   scsi_status=0, ioc_status=804b, scsi_state=c
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 365881 kern.info] /pci at 0,0/pci8086,25f8 at 4/pci111d,801c at 0/pci111d,801c at 4/pci1000,3150 at 0 (mpt1):
Jan 10 21:36:52 tips-fs1.tamu.edu   Log info 31126000 received for target 15.
Jan 10 21:36:52 tips-fs1.tamu.edu   scsi_status=0, ioc_status=804b, scsi_state=c
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.warning] WARNING: /pci at 0,0/pci8086,25f8 at 4/pci111d,801c at 0/pci111d,801c at 4/pci1000,3150 at 0/sd at f,0 (sd28):
Jan 10 21:36:52 tips-fs1.tamu.edu   Error for Command: write    Error Level: Retryable
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Requested Block: 8448    Error Block: 8448
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Vendor: ATA    Serial Number: CVEM902401BA
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   Sense Key: Unit Attention
Jan 10 21:36:52 tips-fs1.tamu.edu scsi: [ID 107833 kern.notice]   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

We were lucky to catch the problem before we went live. There was an exceptionally large number of I/O errors. Sun has not gotten back to me with a resolution for this problem yet, but they were able to reproduce the issue.

-K

Karl Katzke
Systems Analyst II
TAMU / DRGS
> Would your users be concerned if there was a possibility that
> after extracting a 50 MB tarball files are incomplete, whole
> subdirectories are missing, or file permissions are incorrect?

Correction: "Would your users be concerned if there was a possibility that after extracting a 50 MB tarball *and having a server crash* then files could be corrupted as described above."

If you disable the ZIL, the filesystem still stays correct in RAM, and the only way you lose any data such as you've described is to have an ungraceful power down or reboot.

The advice I would give is: do zfs auto-snapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, roll back to the latest snapshot ... and roll back once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption.

Obviously, if you cannot accept 5-10 minutes of data loss, such as credit card transactions, this would not be acceptable. You'd need to keep your ZIL enabled. Also, if you have an svn server on the ZFS server, and you have svn clients on other systems ... you should never allow your clients to advance beyond the current rev of the server. So again, you'd have to keep the ZIL enabled on the server.

It all depends on your workload. For some, the disabled ZIL is worth the risk.
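A rough sketch of what I mean (dataset and snapshot names are made up; the zfs-auto-snapshot SMF service can handle the scheduling too):

    # take a snapshot every few minutes, e.g. from a small script run by cron
    zfs snapshot tank/home@auto-$(date +%Y%m%d-%H%M)

    # after an ungraceful reboot: list the snapshots, then roll back
    # (-r also destroys any snapshots newer than the one you roll back to)
    zfs list -t snapshot -o name -s creation -r tank/home
    zfs rollback -r tank/home@auto-20100331-1155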
On Wed, 31 Mar 2010, Tim Cook wrote:
>> If there is ever another OpenSolaris formal release, then the situation will be different.
>
> C'mon now, have a little faith. It hasn't even slipped past March yet :) Of course
> it'd be way more fun if someone from Sun threw caution to the wind and told us what
> the hold-up is *cough*oracle*cough*.

Oracle is a total "cold boot" for me. Everything they have put on their web site seems carefully designed to cast fear and panic into the former Sun customer base and cause substantial doubt, dismay, and even terror. I don't know what I can and can't trust. Every bit of trust that Sun earned with me over the past 19 years is clean-slated.

Regardless, it seems likely that Oracle is taking time to change all of the copyrights, documentation, and logos to reflect the new ownership. They are probably re-evaluating which parts should be included for free in OpenSolaris. The name "Sun" is deeply embedded in Solaris. All of the Solaris 10 packages include "SUN" in their name.

Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.

Bob
On Wed, 31 Mar 2010, Edward Ned Harvey wrote:
>> Would your users be concerned if there was a possibility that
>> after extracting a 50 MB tarball files are incomplete, whole
>> subdirectories are missing, or file permissions are incorrect?
>
> Correction: "Would your users be concerned if there was a possibility that
> after extracting a 50 MB tarball *and having a server crash* then files could
> be corrupted as described above."
>
> If you disable the ZIL, the filesystem still stays correct in RAM, and the
> only way you lose any data such as you've described is to have an
> ungraceful power down or reboot.

Yes, of course. Suppose that you are a system administrator. The server spontaneously reboots. A corporate VP (CFO) comes to you and says that he had just saved the critical presentation to be given to the board of the company (and all shareholders) later that day, and now it is gone due to your spontaneous server reboot. Due to a delayed financial statement, the corporate stock plummets. What are you to do? Do you expect that your employment will continue?

Reliable NFS synchronous writes are good for the system administrators.

Bob
On Wed, March 31, 2010 12:23, Bob Friesenhahn wrote:
> Yesterday I noticed that the Sun Studio 12 compiler (used to build
> OpenSolaris) now costs a minimum of $1,015/year. The "Premium"
> service plan costs $200 more.

I feel a great disturbance in the force. It is as if a great multitude of developers screamed and then went out and downloaded GCC.
On Wed, Mar 31, 2010 at 11:23 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Yesterday I noticed that the Sun Studio 12 compiler (used to build
> OpenSolaris) now costs a minimum of $1,015/year. The "Premium"
> service plan costs $200 more.

Where did you see that? It looks to be free to me:

"Sun Studio 12 Update 1 - FREE for SDN members. SDN members can download a free, full-license copy of Sun Studio 12 Update 1."

--Tim
On 31 Mar 2010, at 17:23, Bob Friesenhahn wrote:> Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.The download still seems to be a "free, full-license copy" for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10? Cheers, Chris
On Wed, Mar 31, 2010 at 11:39 AM, Chris Ridd <chrisridd at mac.com> wrote:
> On 31 Mar 2010, at 17:23, Bob Friesenhahn wrote:
>> Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.
>
> The download still seems to be a "free, full-license copy" for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10?
>
> Cheers,
>
> Chris

It's still available in the opensolaris repo, and I see no license reference stating you have to have a support contract, so I'm guessing no...

"Several releases of Sun Studio Software are available in the OpenSolaris repositories. The following list shows you how to download and install each release, and where you can find the documentation for the release:

- Sun Studio 12 Update 1: The Sun Studio 12 Update 1 release is the latest full production release of Sun Studio software. It has recently been added to the OpenSolaris IPS repository. To install this release in your OpenSolaris 2009.06 environment using the Package Manager: [...]"

--Tim
On Wed, 31 Mar 2010, Chris Ridd wrote:
>> Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.
>
> The download still seems to be a "free, full-license copy" for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10?

There is no telling. Everything is subject to evaluation by Oracle and it is not clear which parts of the web site are confirmed and which parts are still subject to change. In the past it was free to join SDN, but if one was to put an 'M' in front of that SDN, then there would be a substantial yearly charge for membership (up to $10,939 USD per year according to Wikipedia). This is a world that Oracle has been commonly exposed to in the past. Not everyone who uses a compiler qualifies as a "developer".

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 31 Mar 2010, at 17:50, Bob Friesenhahn wrote:
> On Wed, 31 Mar 2010, Chris Ridd wrote:
>>> Yesterday I noticed that the Sun Studio 12 compiler (used to build OpenSolaris) now costs a minimum of $1,015/year. The "Premium" service plan costs $200 more.
>>
>> The download still seems to be a "free, full-license copy" for SDN members; the $1015 you quote is for the standard Sun Software service plan. Is a service plan now *required*, a la Solaris 10?
>
> There is no telling. Everything is subject to evaluation by Oracle and it is not clear which parts of the web site are confirmed and which parts are still subject to change. In the past it was free to join SDN but if one was to put an 'M' in front of that SDN, then there would be a substantial yearly charge for membership (up to $10,939 USD per year according to Wikipedia). This is a world that Oracle has been commonly exposed to in the past. Not everyone who uses a compiler qualifies as a "developer".

Indeed, but Microsoft still give out free "express" versions of their tools. If memory serves, you're not allowed to distribute binaries built with them but otherwise they're not broken in any significant way. Maybe this will also be the difference between Sun Studio and Sun Studio Express.

Perhaps we should take this to tools-compilers.

Cheers,

Chris
>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:

    rm> This is not true. If ZIL device would die *while pool is
    rm> imported* then ZFS would start using a ZIL within the pool and
    rm> continue to operate.

what you do not say is that a pool with a dead zil cannot be 'import -f'd. So, for example, if your rpool and slog are on the same SSD, and it dies, you have just lost your whole pool.
>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:

    rm> the reason you get better performance out of the box on Linux
    rm> as NFS server is that it actually behaves like with disabled
    rm> ZIL

careful. Solaris people have been slinging mud at linux for things unfsd did in spite of the fact knfsd has been around for a decade. and ``has options to behave like the ZIL is disabled (sync/async in /etc/exports)'' != ``always behaves like the ZIL is disabled''.

If you are certain about Linux NFS servers not preserving data for hard mounts when the server reboots even with the 'sync' option which is the default, please confirm, but otherwise I do not believe you.

    rm> Which is an expected behavior when you break NFS requirements
    rm> as Linux does out of the box.

wrong. The default is 'sync' in /etc/exports. The default has changed, but the default is 'sync', and the whole thing is well-documented.

    rm> What would be useful though is to be able to easily disable
    rm> ZIL per dataset instead of OS wide switch.

yeah, Linux NFS servers have that granularity for their equivalent option.
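For reference, the per-export switch being argued about here lives in /etc/exports on the Linux server side; a minimal sketch (the paths and address range below are made up for illustration):

    # /etc/exports on a Linux NFS server
    /export/home     192.168.0.0/24(rw,sync,no_subtree_check)
    /export/scratch  192.168.0.0/24(rw,async,no_subtree_check)
    # 'sync'  = reply only after data/metadata reach stable storage (protocol-compliant)
    # 'async' = reply before flushing; faster, but breaks the server-reboot guarantee

    # apply and verify which behaviour each export actually got
    exportfs -ra
    exportfs -v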
Karsten Weiss wrote:> Knowing that 100s of users could do this in parallel with good performance > is nice but it does not improve the situation for the single user which only > cares for his own tar run. If there''s anything else we can do/try to improve > the single-threaded case I''m all ears.A MegaRAID card with write-back cache? It should also be cheaper than the F20. Wes Felter
Edward Ned Harvey <solaris2 <at> nedharvey.com> writes:
> Allow me to clarify a little further, why I care about this so much. I have a solaris file server, with all the company jewels on it. I had a pair of intel X.25 SSD mirrored log devices. One of them failed. The replacement device came with a newer version of firmware on it. Now, instead of appearing as 29.802 Gb, it appears at 29.801 Gb. I cannot zpool attach. New device is too small.
>
> So apparently I'm the first guy this happened to. Oracle is caught totally off guard. They're pulling their inventory of X25's from dispatch warehouses, and inventorying all the firmware versions, and trying to figure it all out. Meanwhile, I'm still degraded. Or at least, I think I am.
>
> Nobody knows any way for me to remove my unmirrored log device. Nobody knows any way for me to add a mirror to it (until they can locate a drive with the correct firmware.) All the support people I have on the phone are just as scared as I am. "Well we could upgrade the firmware of your existing drive, but that'll reduce it by 0.001 Gb, and that might just create a time bomb to destroy your pool at a later date." So we don't do it.
>
> Nobody has suggested that I simply shutdown and remove my unmirrored SSD, and power back on.

We ran into something similar with these drives in an X4170 that turned out to be an issue of the preconfigured logical volumes on the drives. Once we made sure all of our Sun PCI HBAs were running the exact same version of firmware and recreated the volumes on new drives arriving from Sun, we got back into sync on the X25-E device sizes.
On 31/03/2010 17:31, Bob Friesenhahn wrote:
> Yes, of course. Suppose that you are a system administrator. The server spontaneously reboots. A corporate VP (CFO) comes to you and says that he had just saved the critical presentation to be given to the board of the company (and all shareholders) later that day, and now it is gone due to your spontaneous server reboot. Due to a delayed financial statement, the corporate stock plummets. What are you to do? Do you expect that your employment will continue?
>
> Reliable NFS synchronous writes are good for the system administrators.

Well, it really depends on your environment. There is a place for Oracle database and there is a place for MySQL, then you don't really need to cluster everything, and then there are environments where disabling the ZIL is perfectly acceptable.

One of such cases is that you need to re-import a database or recover lots of files over NFS - your service is down and disabling the ZIL makes the recovery MUCH faster. Then there are cases when leaving the ZIL disabled is acceptable as well.

--
Robert Milkowski
http://milek.blogspot.com
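For completeness, on the builds being discussed the only switch is the OS-wide zil_disable tunable (the per-dataset control mentioned elsewhere in this thread had not integrated yet); a sketch of the two usual ways to flip it:

    # /etc/system -- persistent, takes effect at next boot, affects every pool/dataset
    set zfs:zil_disable = 1

    # or on a live system (reverts at reboot); filesystems must be re-mounted
    # before the change takes effect for them
    echo zil_disable/W0t1 | mdb -kw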
On 31/03/2010 17:22, Edward Ned Harvey wrote:
> The advice I would give is: Do zfs autosnapshots frequently (say ... every 5 minutes, keeping the most recent 2 hours of snaps) and then run with no ZIL. If you have an ungraceful shutdown or reboot, rollback to the latest snapshot ... and rollback once more for good measure. As long as you can afford to risk 5-10 minutes of the most recent work after a crash, then you can get a 10x performance boost most of the time, and no risk of the aforementioned data corruption.

I don't really get it - rolling back to a last snapshot doesn't really improve things here, it actually makes it worse as now you are going to lose even more data.

Keep in mind that currently the maximum time after which ZFS commits a transaction is 30s - ZIL or not. So with a disabled ZIL, in the worst case scenario you should lose no more than the last 30-60s. You can tune it down if you want. Rolling back to a snapshot will only make it worse. Then also keep in mind that it is a worst case scenario here - it may well be there were no outstanding transactions at all - it all comes down basically to a risk assessment, impact assessment and a cost.

Unless you are talking about doing regular snapshots and making sure that the application is consistent while doing so - for example putting all Oracle tablespaces in hot backup mode and taking a snapshot... otherwise it doesn't really make sense.

--
Robert Milkowski
http://milek.blogspot.com
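The commit interval referred to above is controlled by the zfs_txg_timeout tunable on these builds (the name and default have moved around between releases, so treat this as a sketch rather than gospel); the value of 5 seconds is just an example:

    # /etc/system -- force a txg commit at least every 5 seconds instead of 30
    set zfs:zfs_txg_timeout = 5

    # inspect / change the live value
    echo zfs_txg_timeout/D | mdb -k
    echo zfs_txg_timeout/W0t5 | mdb -kw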
On 31/03/2010 21:38, Miles Nordin wrote:> rm> Which is an expected behavior when you break NFS requirements > rm> as Linux does out of the box. > > wrong. The default is ''sync'' in /etc/exports. The default has > changed, but the default is ''sync'', and the whole thing is > well-documented. >I double checked the documentation and you''re right - the default has changed to sync. I haven''t found in which RH version it happened but it doesn''t really matter. So yes, I was wrong - the current default it seems to be sync on Linux as well. -- Robert Milkowski http://milek.blogspot.com
On Mar 31, 2010, at 19:41, Robert Milkowski wrote:> I double checked the documentation and you''re right - the default > has changed to sync. > I haven''t found in which RH version it happened but it doesn''t > really matter.From the SourceForge site:> Since version 1.0.1 of the NFS utilities tarball has changed the > server export default to "sync", then, if no behavior is specified > in the export list (thus assuming the default behavior), a warning > will be generated at export time.http://nfs.sourceforge.net/
On Mar 31, 2010, at 5:39 AM, Robert Milkowski <milek at task.gda.pl> wrote:
>> On Wed, Mar 31, 2010 at 1:00 AM, Karsten Weiss
>> Use something other than Open/Solaris with ZFS as an NFS server? :)
>>
>> I don't think you'll find the performance you paid for with ZFS and Solaris at this time. I've been trying for more than a year, and watching dozens, if not hundreds of threads. Getting half-ways decent performance from NFS and ZFS is impossible unless you disable the ZIL.
>
> Well, for lots of environments disabling ZIL is perfectly acceptable. And frankly the reason you get better performance out of the box on Linux as NFS server is that it actually behaves like with disabled ZIL - so disabling ZIL on ZFS for NFS shares is no worse than using Linux here or any other OS which behaves in the same manner. Actually it makes it better as even if ZIL is disabled the ZFS filesystem is always consistent on disk and you still get all the other benefits from ZFS.
>
> What would be useful though is to be able to easily disable ZIL per dataset instead of an OS wide switch. This feature has already been coded and tested and awaits a formal process to be completed in order to get integrated. Should be rather sooner than later.

Well, being fair to Linux, the default for NFS exports is to export them 'sync' now, which syncs to disk on close or fsync. It has been many years since they exported 'async' by default. Now if Linux admins set their shares 'async' and lose important data then it's operator error and not Linux's fault.

If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally?

-Ross
On Mar 31, 2010, at 7:11 PM, Ross Walker wrote:
> If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally?

This is not true for sync data written locally, unless you disable the ZIL locally.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Mar 31, 2010, at 10:25 PM, Richard Elling <richard.elling at gmail.com> wrote:
> On Mar 31, 2010, at 7:11 PM, Ross Walker wrote:
>> If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally?
>
> This is not true for sync data written locally, unless you disable the ZIL locally.

No, of course if it's written sync with ZIL; it just seems that over Solaris NFS all writes are delayed, not just sync writes.

-Ross
> I see the source for some confusion. On the ZFS Best Practices page:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
>
> It says:
> Failure of the log device may cause the storage pool to be inaccessible if you are running the Solaris Nevada release prior to build 96 and a release prior to the Solaris 10 10/09 release.
>
> It also says:
> If a separate log device is not mirrored and the device that contains the log fails, storing log blocks reverts to the storage pool.

I have some more concrete data on this now. Running Solaris 10u8 (which is 10/09), fully updated last weekend. We want to explore the consequences of adding or failing a non-mirrored log device. We created a pool with a non-mirrored ZIL log device, and experimented with it:

(a) Simply yank out the non-mirrored log device while the system is live. The result was: Any zfs or zpool command would hang permanently. Even "zfs list" hangs permanently. The system cannot shutdown, cannot reboot, cannot "zfs send" or "zfs snapshot" or anything ... It's a bad state. You're basically hosed. Power cycle is the only option.

(b) After power cycling, the system won't boot. It gets part way through the boot process, and eventually just hangs there, infinitely cycling error messages about services that couldn't start. Random services, such as inetd, which seem unrelated to some random data pool that failed. So we power cycle again, and go into failsafe mode, to clean up and destroy the old messed up pool ... Boot up totally clean again, and create a new totally clean pool with a non-mirrored log device. Just to ensure we really are clean, we simply "zpool export" and "zpool import" with no trouble, and reboot once for good measure. "zfs list" and everything are all working great...

(c) Do a "zpool export." Obviously, the ZIL log device is clean and flushed at this point, not being used. We simply yank out the log device, and do "zpool import." Well ... Without that log device, I forget the terminology, it said something like "missing disk." Plain and simple, you *can* *not* import the pool without the log device. It does not say "to force use -f" and even if you specify the -f, it still just throws the same error message, missing disk or whatever. Won't import. Period.

... So, to anybody who said the failed log device will simply fail over to blocks within the main pool: Sorry. That may be true in some later version, but it is not the slightest bit true in the absolute latest solaris (proper) available today.

I'm going to venture a guess this is no longer a problem after zpool version 19. This is when "ZFS log device removal" was introduced. Unfortunately, the latest version of solaris only goes up to zpool version 15.
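For anyone who wants to repeat the experiment, the setup boils down to a pool with a single, unmirrored slog; a rough sketch with made-up device names (the failure behaviour is as described above, not something the commands themselves change):

    # create a pool with an unmirrored log device
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 log c2t0d0
    zpool status tank

    # step (c): export, physically remove the log device, try to re-import
    zpool export tank
    zpool import tank      # on s10u8 this fails with a missing-device error
    zpool import -f tank   # -f does not help either, as noted above

    # pool version 19 and later ("ZFS log device removal") at least allow
    zpool remove tank c2t0d0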
> A MegaRAID card with write-back cache? It should also be cheaper than > the F20.I haven''t posted results yet, but I just finished a few weeks of extensive benchmarking various configurations. I can say this: WriteBack cache is much faster than "naked" disks, but if you can buy an SSD or two for ZIL log device, the dedicated ZIL is yet again much faster than WriteBack. It doesn''t have to be F20. You could use the Intel X25 for example. If you''re running solaris proper, you better mirror your ZIL log device. If you''re running opensolaris ... I don''t know if that''s important. I''ll probably test it, just to be sure, but I might never get around to it because I don''t have a justifiable business reason to build the opensolaris machine just for this one little test. Seriously, all disks configured WriteThrough (spindle and SSD disks alike) using the dedicated ZIL SSD device, very noticeably faster than enabling the WriteBack.
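For reference, the mirrored-slog plus L2ARC layout being recommended here looks roughly like this (the controller/target names are placeholders):

    # add two SSDs as a mirrored ZIL log device
    zpool add tank log mirror c3t0d0 c3t1d0

    # remaining flash can go to L2ARC; cache devices need no redundancy
    zpool add tank cache c3t2d0 c3t3d0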
> We ran into something similar with these drives in an X4170 that turned > out to > be an issue of the preconfigured logical volumes on the drives. Once > we made > sure all of our Sun PCI HBAs where running the exact same version of > firmware > and recreated the volumes on new drives arriving from Sun we got back > into sync > on the X25-E devices sizes.Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive, and "create simple volume" in the storagetek raid utility, the new drive is 0.001 Gb smaller than the old drive. I''m still hosed. Are you saying I might benefit by sticking the SSD into some laptop, and zero''ing the disk? And then attach to the sun server? Are you saying I might benefit by finding some other way to make the drive available, instead of using the storagetek raid utility? Thanks for the suggestions...
On Mar 31, 2010, at 8:58 PM, Edward Ned Harvey wrote:
> Can you elaborate? Just today, we got the replacement drive that has precisely the right version of firmware and everything. Still, when we plugged in that drive, and "create simple volume" in the storagetek raid utility, the new drive is 0.001 Gb smaller than the old drive. I'm still hosed.
>
> Are you saying I might benefit by sticking the SSD into some laptop, and zero'ing the disk? And then attach to the sun server?
>
> Are you saying I might benefit by finding some other way to make the drive available, instead of using the storagetek raid utility?

Assuming you are also using a PCI LSI HBA from Sun that is managed with a utility called /opt/StorMan/arcconf and reports itself as the amazingly informative model number "Sun STK RAID INT", what worked for me was to run:

arcconf delete (to delete the pre-configured volume shipped on the drive)
arcconf create (to create a new volume)

What I observed was that

arcconf getconfig 1

would show the same physical device size for our existing drives and new ones from Sun, but they reported a slightly different logical volume size. I am fairly sure that was due to the Sun factory creating the initial volume with a different version of the HBA controller firmware than we were using to create our own volumes.

If I remember the sign correctly, the newer firmware creates larger logical volumes, and you really want to upgrade the firmware if you are going to be running multiple X25-E drives from the same controller.

I hope that helps.

--
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
On Mar 31, 2010, at 9:22 AM, Edward Ned Harvey wrote:>> Would your users be concerned if there was a possibility that >> after extracting a 50 MB tarball that files are incomplete, whole >> subdirectories are missing, or file permissions are incorrect? > > Correction: "Would your users be concerned if there was a possibility that > after extracting a 50MB tarball *and having a server crash* then files could > be corrupted as described above." > > If you disable the ZIL, the filesystem still stays correct in RAM, and the > only way you lose any data such as you''ve described, is to have an > ungraceful power down or reboot. > > The advice I would give is: Do zfs autosnapshots frequently (say ... every > 5 minutes, keeping the most recent 2 hours of snaps) and then run with no > ZIL. If you have an ungraceful shutdown or reboot, rollback to the latest > snapshot ... and rollback once more for good measure. As long as you can > afford to risk 5-10 minutes of the most recent work after a crash, then you > can get a 10x performance boost most of the time, and no risk of the > aforementioned data corruption.This approach does not solve the problem. When you do a snapshot, the txg is committed. If you wish to reduce the exposure to loss of sync data and run with ZIL disabled, then you can change the txg commit interval -- however changing the txg commit interval will not eliminate the possibility of data loss. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
Casper.Dik at Sun.COM (2010-Apr-01 07:09 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>If you disable the ZIL, the filesystem still stays correct in RAM, and the >only way you lose any data such as you''ve described, is to have an >ungraceful power down or reboot.>The advice I would give is: Do zfs autosnapshots frequently (say ... every >5 minutes, keeping the most recent 2 hours of snaps) and then run with no >ZIL. If you have an ungraceful shutdown or reboot, rollback to the latest >snapshot ... and rollback once more for good measure. As long as you can >afford to risk 5-10 minutes of the most recent work after a crash, then you >can get a 10x performance boost most of the time, and no risk of the >aforementioned data corruption.Why do you need the rollback? The current filesystems have correct and consistent data; not different from the last two snapshots. (Snapshots can happen in the middle of untarring) The difference between running with or without ZIL is whether the client has lost data when the server reboots; not different from using Linux as an NFS server. Casper
Casper.Dik at Sun.COM (2010-Apr-01 08:43 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
> Well, being fair to Linux, the default for NFS exports is to export them 'sync' now, which syncs to disk on close or fsync. It has been many years since they exported 'async' by default. Now if Linux admins set their shares 'async' and lose important data then it's operator error and not Linux's fault.

Is that what "sync" means in Linux? As NFS doesn't use "close" or "fsync", what exactly are the semantics?

(For NFSv2/v3 each *operation* is sync and the client needs to make sure it can continue; for NFSv4, some operations are async and the client needs to use COMMIT)

> If apps don't care about their data consistency and don't sync their data I don't see why the file server has to care for them. I mean if it were a local file system and the machine rebooted the data would be lost too. Should we care more for data written remotely than locally?

If the system crashes the application is also gone, but if the server reboots, data should *never* be lost; the sync may just miss the window. The application continues to run so clearly we must handle this differently. What you're saying sounds like the kernel can forget what you wrote because you didn't call fsync().

Casper
> >If you disable the ZIL, the filesystem still stays correct in RAM, and > the > >only way you lose any data such as you''ve described, is to have an > >ungraceful power down or reboot. > > >The advice I would give is: Do zfs autosnapshots frequently (say ... > every > >5 minutes, keeping the most recent 2 hours of snaps) and then run with > no > >ZIL. If you have an ungraceful shutdown or reboot, rollback to the > latest > >snapshot ... and rollback once more for good measure. As long as you > can > >afford to risk 5-10 minutes of the most recent work after a crash, > then you > >can get a 10x performance boost most of the time, and no risk of the > >aforementioned data corruption. > > Why do you need the rollback? The current filesystems have correct and > consistent data; not different from the last two snapshots. > (Snapshots can happen in the middle of untarring) > > The difference between running with or without ZIL is whether the > client has lost data when the server reboots; not different from using > Linux as an NFS server.If you have an ungraceful shutdown in the middle of writing stuff, while the ZIL is disabled, then you have corrupt data. Could be files that are partially written. Could be wrong permissions or attributes on files. Could be missing files or directories. Or some other problem. Some changes from the last 1 second of operation before crash might be written, while some changes from the last 4 seconds might be still unwritten. This is data corruption, which could be worse than losing a few minutes of changes. At least, if you rollback, you know the data is consistent, and you know what you lost. You won''t continue having more losses afterward caused by inconsistent data on disk.
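For what it's worth, the rollback step of that recipe amounts to the following (the dataset and snapshot names here are made up; real names depend on how the auto-snapshot service is configured):

    # see what snapshots exist for the dataset
    zfs list -t snapshot -r tank/data

    # roll back to the chosen one; -r also destroys any snapshots newer than it
    zfs rollback -r tank/data@zfs-auto-snap_frequent-2010-04-01-12h05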
> > Can you elaborate? Just today, we got the replacement drive that has > > precisely the right version of firmware and everything. Still, when > we > > plugged in that drive, and "create simple volume" in the storagetek > raid > > utility, the new drive is 0.001 Gb smaller than the old drive. I''m > still > > hosed. > > > > Are you saying I might benefit by sticking the SSD into some laptop, > and > > zero''ing the disk? And then attach to the sun server? > > > > Are you saying I might benefit by finding some other way to make the > drive > > available, instead of using the storagetek raid utility? > > Assuming you are also using a PCI LSI HBA from Sun that is managed with > a utility called /opt/StorMan/arcconf and reports itself as the > amazingly > informative model number "Sun STK RAID INT" what worked for me was to > run, > arcconf delete (to delete the pre-configured volume shipped on the > drive) > arcconf create (to create a new volume) > > What I observed was that > arcconf getconfig 1 > would show the same physical device size for our existing drives and > new > ones from Sun, but they reported a slightly different logical volume > size. > I am fairly sure that was due to the Sun factory creating the initial > volume > with a different version of the HBA controller firmware then we where > using > to create our own volumes. > > If I remember the sign correctly, the newer firmware creates larger > logical > volumes, and you really want to upgrade the firmware if you are going > to > be running multiple X25-E drives from the same controller. > > I hope that helps.Uggh. This is totally different than my system. But thanks for writing. I''ll take this knowledge, and see if we can find some analogous situation with the StorageTek controller. It still may be helpful, so again, thanks.
Casper.Dik at Sun.COM (2010-Apr-01 11:19 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>If you have an ungraceful shutdown in the middle of writing stuff, while the >ZIL is disabled, then you have corrupt data. Could be files that are >partially written. Could be wrong permissions or attributes on files. >Could be missing files or directories. Or some other problem. > >Some changes from the last 1 second of operation before crash might be >written, while some changes from the last 4 seconds might be still >unwritten. This is data corruption, which could be worse than losing a few >minutes of changes. At least, if you rollback, you know the data is >consistent, and you know what you lost. You won''t continue having more >losses afterward caused by inconsistent data on disk.How exactly is this different from "rolling back to some other point of time?". I think you don''t quite understand how ZFS works; all operations are grouped in transaction groups; all the transactions in a particular group are commit in one operation. I don''t know what partial ordering ZFS uses when creating transaction groups, but a snapshot just picks one transaction group as the last group included in the snapshot. When the system reboots, ZFS picks the most recent, valid uberblock; so the data available is "correct upto transaction group N1". If you rollback to a snapshot, you get data "correct upto transaction group N2". But N2 < N1 so you lose more data. Why do you think that a "Snapshot" has a "better quality" than the last snapshot available? Casper
> >If you have an ungraceful shutdown in the middle of writing stuff, > while the > >ZIL is disabled, then you have corrupt data. Could be files that are > >partially written. Could be wrong permissions or attributes on files. > >Could be missing files or directories. Or some other problem. > > > >Some changes from the last 1 second of operation before crash might be > >written, while some changes from the last 4 seconds might be still > >unwritten. This is data corruption, which could be worse than losing > a few > >minutes of changes. At least, if you rollback, you know the data is > >consistent, and you know what you lost. You won''t continue having > more > >losses afterward caused by inconsistent data on disk. > > How exactly is this different from "rolling back to some other point of > time?". > > I think you don''t quite understand how ZFS works; all operations are > grouped in transaction groups; all the transactions in a particular > group > are commit in one operation. I don''t know what partial ordering ZFSDude, don''t be so arrogant. Acting like you know what I''m talking about better than I do. Face it that you have something to learn here. Yes, all the transactions in a transaction group are either committed entirely to disk, or not at all. But they''re not necessarily committed to disk in the same order that the user level applications requested. Meaning: If I have an application that writes to disk in "sync" mode intentionally ... perhaps because my internal file format consistency would be corrupt if I wrote out-of-order ... If the sysadmin has disabled ZIL, my "sync" write will not block, and I will happily issue more write operations. As long as the OS remains operational, no problem. The OS keeps the filesystem consistent in RAM, and correctly manages all the open file handles. But if the OS dies for some reason, some of my later writes may have been committed to disk while some of my earlier writes could be lost, which were still being buffered in system RAM for a later transaction group. This is particularly likely to happen, if my application issues a very small sync write, followed by a larger async write, followed by a very small sync write, and so on. Then the OS will buffer my small sync writes and attempt to aggregate them into a larger sequential block for the sake of accelerated performance. The end result is: My larger async writes are sometimes committed to disk before my small sync writes. But the only reason I would ever know or care about that would be if the ZIL were disabled, and the OS crashed. Afterward, my file has internal inconsistency. Perfect examples of applications behaving this way would be databases and virtual machines.> Why do you think that a "Snapshot" has a "better quality" than the last > snapshot available?If you rollback to a snapshot from several minutes ago, you can rest assured all the transaction groups that belonged to that snapshot have been committed. So although you''re losing the most recent few minutes of data, you can rest assured you haven''t got file corruption in any of the existing files.
> This approach does not solve the problem. When you do a snapshot, > the txg is committed. If you wish to reduce the exposure to loss of > sync data and run with ZIL disabled, then you can change the txg commit > interval -- however changing the txg commit interval will not eliminate > the > possibility of data loss.The default commit interval is what, 30 seconds? Doesn''t that guarantee that any snapshot taken more than 30 seconds ago will have been fully committed to disk? Therefore, any snapshot older than 30 seconds old is guaranteed to be consistent on disk. While anything less than 30 seconds old could possibly have some later writes committed to disk before some older writes from a few seconds before. If I''m wrong about this, please explain. I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file. If you rollback to a snapshot that''s at least 30 seconds old, then all the writes for that snapshot are guaranteed to be committed to disk already, and in the right order. You''re acknowledging the loss of some known time worth of data. But you''re gaining a guarantee of internal file consistency.
> Is that what "sync" means in Linux?A sync write is one in which the application blocks until the OS acks that the write has been committed to disk. An async write is given to the OS, and the OS is permitted to buffer the write to disk at its own discretion. Meaning the async write function call returns sooner, and the application is free to continue doing other stuff, including issuing more writes. Async writes are faster from the point of view of the application. But sync writes are done by applications which need to satisfy a race condition for the sake of internal consistency. Applications which need to know their next commands will not begin until after the previous sync write was committed to disk.
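A minimal C sketch of that distinction (file names are illustrative and error handling is omitted): the O_DSYNC descriptor gives the blocking behaviour described above, the plain descriptor does not.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* synchronous: write() does not return until the data is on stable
           storage (this is the path the ZIL is supposed to make fast) */
        int sfd = open("journal", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        write(sfd, "commit record\n", 14);

        /* asynchronous: write() returns once the data is cached in memory;
           it reaches disk whenever the next transaction group commits */
        int afd = open("table", O_WRONLY | O_CREAT, 0644);
        write(afd, "bulk data\n", 10);
        /* an explicit fsync(afd) here would turn it into a sync write too */

        close(sfd);
        close(afd);
        return 0;
    }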
Casper.Dik at Sun.COM (2010-Apr-01 12:22 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>Dude, don''t be so arrogant. Acting like you know what I''m talking about >better than I do. Face it that you have something to learn here.You may say that, but then you post this:>> Why do you think that a "Snapshot" has a "better quality" than the last >> snapshot available? > >If you rollback to a snapshot from several minutes ago, you can rest assured >all the transaction groups that belonged to that snapshot have been >committed. So although you''re losing the most recent few minutes of data, >you can rest assured you haven''t got file corruption in any of the existing >files.But the actual fact is that there is *NO* difference between the last uberblock and an uberblock named as "snapshot-such-and-so". All changes made after the uberblock was written are discarded by rolling back. All the transaction groups referenced by "last uberblock" *are* written to disk. Disabling the ZIL makes sure that fsync() and sync() no longer work; whether you take a named snapshot or the uberblock is immaterial; your strategy will cause more data to be lost. Casper
Casper.Dik at Sun.COM (2010-Apr-01 12:42 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>> Is that what "sync" means in Linux? > >A sync write is one in which the application blocks until the OS acks that >the write has been committed to disk. An async write is given to the OS, >and the OS is permitted to buffer the write to disk at its own discretion. >Meaning the async write function call returns sooner, and the application is >free to continue doing other stuff, including issuing more writes. > >Async writes are faster from the point of view of the application. But sync >writes are done by applications which need to satisfy a race condition for >the sake of internal consistency. Applications which need to know their >next commands will not begin until after the previous sync write was >committed to disk.We''re talking about the "sync" for NFS exports in Linux; what do they mean with "sync" NFS exports? Casper
Casper.Dik at Sun.COM (2010-Apr-01 12:50 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>> This approach does not solve the problem. When you do a snapshot, the txg is committed. If you wish to reduce the exposure to loss of sync data and run with ZIL disabled, then you can change the txg commit interval -- however changing the txg commit interval will not eliminate the possibility of data loss.
>
> The default commit interval is what, 30 seconds? Doesn't that guarantee that any snapshot taken more than 30 seconds ago will have been fully committed to disk?

When a system boots and it finds the snapshot, then all the data referred to by the snapshot is on disk. But the snapshot doesn't guarantee more than the last valid uberblock.

> Therefore, any snapshot older than 30 seconds old is guaranteed to be consistent on disk. While anything less than 30 seconds old could possibly have some later writes committed to disk before some older writes from a few seconds before.
>
> If I'm wrong about this, please explain.

When a pointer to data is committed to disk by ZFS, then the data is also on disk. (If the pointer is reachable from the uberblock, then the data is also on disk and reachable from the uberblock.) You don't need to wait 30 seconds. If it's there, it's there.

> I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file.
>
> If you rollback to a snapshot that's at least 30 seconds old, then all the writes for that snapshot are guaranteed to be committed to disk already, and in the right order. You're acknowledging the loss of some known time worth of data. But you're gaining a guarantee of internal file consistency.

I don't know what ZFS guarantees when you disable the zil; the one broken promise is that the data may not have been committed to stable storage when fsync() returns. I'm not sure whether there is a "barrier" when there is a sync()/fsync(); if that is the case, then ZFS is still safe for your application.

Casper
On Mar 31, 2010, at 11:51 PM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:>> A MegaRAID card with write-back cache? It should also be cheaper than >> the F20. > > I haven''t posted results yet, but I just finished a few weeks of > extensive > benchmarking various configurations. I can say this: > > WriteBack cache is much faster than "naked" disks, but if you can > buy an SSD > or two for ZIL log device, the dedicated ZIL is yet again much > faster than > WriteBack. > > It doesn''t have to be F20. You could use the Intel X25 for > example. If > you''re running solaris proper, you better mirror your ZIL log > device. If > you''re running opensolaris ... I don''t know if that''s important. I''ll > probably test it, just to be sure, but I might never get around to it > because I don''t have a justifiable business reason to build the > opensolaris > machine just for this one little test. > > Seriously, all disks configured WriteThrough (spindle and SSD disks > alike) > using the dedicated ZIL SSD device, very noticeably faster than > enabling the > WriteBack.What do you get with both SSD ZIL and WriteBack disks enabled? I mean if you have both why not use both? Then both async and sync IO benefits. -Ross
On Mar 31, 2010, at 11:58 PM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:>> We ran into something similar with these drives in an X4170 that >> turned >> out to >> be an issue of the preconfigured logical volumes on the drives. Once >> we made >> sure all of our Sun PCI HBAs where running the exact same version of >> firmware >> and recreated the volumes on new drives arriving from Sun we got back >> into sync >> on the X25-E devices sizes. > > Can you elaborate? Just today, we got the replacement drive that has > precisely the right version of firmware and everything. Still, when > we > plugged in that drive, and "create simple volume" in the storagetek > raid > utility, the new drive is 0.001 Gb smaller than the old drive. I''m > still > hosed. > > Are you saying I might benefit by sticking the SSD into some laptop, > and > zero''ing the disk? And then attach to the sun server? > > Are you saying I might benefit by finding some other way to make the > drive > available, instead of using the storagetek raid utility?I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit. -Ross
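A rough sketch of that trick, with made-up device names and sizes: slice the SSD a little below its nominal capacity with format(1M), then hand ZFS the slice instead of the whole disk.

    # in format(1M): select the SSD, label it, then partition -> modify and
    # size slice 0 to a round figure below the nominal capacity (e.g. 29gb)
    format -e c3t0d0

    # sanity-check the resulting slice geometry
    prtvtoc /dev/rdsk/c3t0d0s0

    # give ZFS the slice rather than the whole disk
    zpool add tank log c3t0d0s0

The usual trade-off applies: when handed a slice instead of a whole disk, ZFS will not manage the drive's write cache for you.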
On Apr 1, 2010, at 8:42 AM, Casper.Dik at Sun.COM wrote:> >>> Is that what "sync" means in Linux? >> >> A sync write is one in which the application blocks until the OS >> acks that >> the write has been committed to disk. An async write is given to >> the OS, >> and the OS is permitted to buffer the write to disk at its own >> discretion. >> Meaning the async write function call returns sooner, and the >> application is >> free to continue doing other stuff, including issuing more writes. >> >> Async writes are faster from the point of view of the application. >> But sync >> writes are done by applications which need to satisfy a race >> condition for >> the sake of internal consistency. Applications which need to know >> their >> next commands will not begin until after the previous sync write was >> committed to disk. > > > We''re talking about the "sync" for NFS exports in Linux; what do > they mean > with "sync" NFS exports?See section A1 in the FAQ: http://nfs.sourceforge.net/ -Ross
On 01/04/2010 14:49, Ross Walker wrote:>> We''re talking about the "sync" for NFS exports in Linux; what do they >> mean >> with "sync" NFS exports? > > See section A1 in the FAQ: > > http://nfs.sourceforge.net/I think B4 is the answer to Casper''s question: ---- BEGIN QUOTE ---- Linux servers (although not the Solaris reference implementation) allow this requirement to be relaxed by setting a per-export option in /etc/exports. The name of this export option is "[a]sync" (note that there is also a client-side mount option by the same name, but it has a different function, and does not defeat NFS protocol compliance). When set to "sync," Linux server behavior strictly conforms to the NFS protocol. This is default behavior in most other server implementations. When set to "async," the Linux server replies to NFS clients before flushing data or metadata modifying operations to permanent storage, thus improving performance, but breaking all guarantees about server reboot recovery. ---- END QUOTE ---- For more info the whole of section B4 though B6. -- Darren J Moffat
On Thu, Apr 1, 2010 at 10:03 AM, Darren J Moffat <darrenm at opensolaris.org> wrote:> On 01/04/2010 14:49, Ross Walker wrote: >>> >>> We''re talking about the "sync" for NFS exports in Linux; what do they >>> mean >>> with "sync" NFS exports? >> >> See section A1 in the FAQ: >> >> http://nfs.sourceforge.net/ > > I think B4 is the answer to Casper''s question: > > ---- BEGIN QUOTE ---- > Linux servers (although not the Solaris reference implementation) allow this > requirement to be relaxed by setting a per-export option in /etc/exports. > The name of this export option is "[a]sync" (note that there is also a > client-side mount option by the same name, but it has a different function, > and does not defeat NFS protocol compliance). > > When set to "sync," Linux server behavior strictly conforms to the NFS > protocol. This is default behavior in most other server implementations. > When set to "async," the Linux server replies to NFS clients before flushing > data or metadata modifying operations to permanent storage, thus improving > performance, but breaking all guarantees about server reboot recovery. > ---- END QUOTE ---- > > For more info the whole of section B4 though B6.True, I was thinking more of the protocol summary.> Is that what "sync" means in Linux? As NFS doesn''t use "close" or > "fsync", what exactly are the semantics. > > (For NFSv2/v3 each *operation* is sync and the client needs to make sure > it can continue; for NFSv4, some operations are async and the client > needs to use COMMIT)Actually the COMMIT command was introduced in NFSv3. The full details: NFS Version 3 introduces the concept of "safe asynchronous writes." A Version 3 client can specify that the server is allowed to reply before it has saved the requested data to disk, permitting the server to gather small NFS write operations into a single efficient disk write operation. A Version 3 client can also specify that the data must be written to disk before the server replies, just like a Version 2 write. The client specifies the type of write by setting the stable_how field in the arguments of each write operation to UNSTABLE to request a safe asynchronous write, and FILE_SYNC for an NFS Version 2 style write. Servers indicate whether the requested data is permanently stored by setting a corresponding field in the response to each NFS write operation. A server can respond to an UNSTABLE write request with an UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the requested data resides on permanent storage yet. An NFS protocol-compliant server must respond to a FILE_SYNC request only with a FILE_SYNC reply. Clients ensure that data that was written using a safe asynchronous write has been written onto permanent storage using a new operation available in Version 3 called a COMMIT. Servers do not send a response to a COMMIT operation until all data specified in the request has been written to permanent storage. NFS Version 3 clients must protect buffered data that has been written using a safe asynchronous write but not yet committed. If a server reboots before a client has sent an appropriate COMMIT, the server can reply to the eventual COMMIT request in a way that forces the client to resend the original write operation. Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure.
On Thu, 1 Apr 2010, Edward Ned Harvey wrote:> If I''m wrong about this, please explain. > > I am envisioning a database, which issues a small sync write, followed by a > larger async write. Since the sync write is small, the OS would prefer to > defer the write and aggregate into a larger block. So the possibility of > the later async write being committed to disk before the older sync write is > a real risk. The end result would be inconsistency in my database file.Zfs writes data in transaction groups and each bunch of data which gets written is bounded by a transaction group. The current state of the data at the time the TXG starts will be the state of the data once the TXG completes. If the system spontaneously reboots then it will restart at the last completed TXG so any residual writes which might have occured while a TXG write was in progress will be discarded. Based on this, I think that your ordering concerns (sync writes getting to disk "faster" than async writes) are unfounded for normal file I/O. However, if file I/O is done via memory mapped files, then changed memory pages will not necessarily be written. The changes will not be known to ZFS until the kernel decides that a dirty page should be written or there is a conflicting traditional I/O which would update the same file data. Use of msync(3C) is necessary to assure that file data updated via mmap() will be seen by ZFS and comitted to disk in an orderly fashion. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
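A minimal C sketch of the msync() point, assuming a pre-existing file of at least 8 KB and omitting error checks:

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("mapped.dat", O_RDWR);
        char *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        memcpy(p, "update", 6);      /* dirties a page; the filesystem has not
                                        necessarily seen the change yet */
        msync(p, 8192, MS_SYNC);     /* blocks until the dirtied pages have been
                                        handed to the filesystem and committed */
        munmap(p, 8192);
        close(fd);
        return 0;
    }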
On 01/04/2010 13:01, Edward Ned Harvey wrote:>> Is that what "sync" means in Linux? >> > A sync write is one in which the application blocks until the OS acks that > the write has been committed to disk. An async write is given to the OS, > and the OS is permitted to buffer the write to disk at its own discretion. > Meaning the async write function call returns sooner, and the application is > free to continue doing other stuff, including issuing more writes. > > Async writes are faster from the point of view of the application. But sync > writes are done by applications which need to satisfy a race condition for > the sake of internal consistency. Applications which need to know their > next commands will not begin until after the previous sync write was > committed to disk. > >ROTFL!!! I think you should explain it even further for Casper :) :) :) :) :) :) :) -- Robert Milkowski http://milek.blogspot.com
Casper.Dik at Sun.COM (2010-Apr-01 15:47 UTC, [zfs-discuss] Sun Flash Accelerator F20 numbers) wrote:
>On 01/04/2010 13:01, Edward Ned Harvey wrote: >>> Is that what "sync" means in Linux? >>> >> A sync write is one in which the application blocks until the OS acks that >> the write has been committed to disk. An async write is given to the OS, >> and the OS is permitted to buffer the write to disk at its own discretion. >> Meaning the async write function call returns sooner, and the application is >> free to continue doing other stuff, including issuing more writes. >> >> Async writes are faster from the point of view of the application. But sync >> writes are done by applications which need to satisfy a race condition for >> the sake of internal consistency. Applications which need to know their >> next commands will not begin until after the previous sync write was >> committed to disk. >> >> >ROTFL!!! > >I think you should explain it even further for Casper :) :) :) :) :) :) :) >:-) So what I *really* wanted to know what "sync" meant for the NFS server in the case of Linux. Apparently it means "implement the NFS protocol to the letter". I''m happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers. Casper
On Thu, 1 Apr 2010, Edward Ned Harvey wrote:> > Dude, don''t be so arrogant. Acting like you know what I''m talking about > better than I do. Face it that you have something to learn here.Geez!> Yes, all the transactions in a transaction group are either committed > entirely to disk, or not at all. But they''re not necessarily committed to > disk in the same order that the user level applications requested. Meaning: > If I have an application that writes to disk in "sync" mode intentionally > ... perhaps because my internal file format consistency would be corrupt if > I wrote out-of-order ... If the sysadmin has disabled ZIL, my "sync" write > will not block, and I will happily issue more write operations. As long as > the OS remains operational, no problem. The OS keeps the filesystem > consistent in RAM, and correctly manages all the open file handles. But if > the OS dies for some reason, some of my later writes may have been committed > to disk while some of my earlier writes could be lost, which were still > being buffered in system RAM for a later transaction group.The purpose of the ZIL is to act like a fast "log" for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work. As you say, "OS keeps the filesystem consistent in RAM". There is no 1:1 ordering between application write requests and zfs writes and in fact, if the same portion of file is updated many times, or the file is created/deleted many times, zfs only writes the updated data which is current when the next TXG is written. For a synchronous write, zfs advances its index in the slog once the corresponding data has been committed in a TXG. In other words, the "sync" and "async" write paths are the same when it comes to writing final data to disk. There is however the recovery case where synchronous writes were affirmed which were not yet written in a TXG and the system spontaneously reboots. In this case the synchronous writes will occur based on the slog, and uncommitted async writes will have been lost. Perhaps this is the case you are worried about. It does seem like rollback to a snapshot does help here (to assure that sync & async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Casper.Dik at Sun.COM
2010-Apr-01 15:54 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>It does seem like rollback to a snapshot does help here (to assure that sync & async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times.

But doesn't that snapshot possibly have the same issues?

Casper
On Thu, 1 Apr 2010, Casper.Dik at Sun.COM wrote:
>> It does seem like rollback to a snapshot does help here (to assure that sync & async data is consistent), but it certainly does not help any NFS clients. Only a broken application uses sync writes sometimes, and async writes at other times.
>
> But doesn't that snapshot possibly have the same issues?

No, at least not based on my understanding. My understanding is that zfs uses uniform prioritization of updates and performs writes in order (at least to the level of a TXG). If this is true, then each normal TXG will be a coherent representation of the filesystem. If the slog is used to recover uncommitted writes, then the TXG based on that may not match the in-memory filesystem at the time of the crash since async writes may have been lost.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
hello

i have had this problem this week. our zil ssd died (apt slc ssd 16gb). because we had no spare drive in stock, we ignored it.

then we decided to update our nexenta 3 alpha to beta, exported the pool and made a fresh install to have a clean system, and tried to import the pool. we only got an error message about a missing drive.

we googled about this and it seems there is no way to access the pool!!! (hope this will be fixed in future)

we had a backup and the data are not so important, but that could be a real problem: you have a valid zfs pool and you cannot access your data due to a missing zil.

gea
www.napp-it.org zfs server
--
This message posted from opensolaris.org
Hi Casper,

> :-)

Nice to see that your stream still carries just as far :-)

> I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers.

Interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...

Anyway, we seem to be getting off topic here :-) The thread was started to get insight into the behaviour of the F20 as ZIL. _My_ particular interest would be to be able to answer why performance doesn't seem to scale up when adding vmod-s...

With kind regards,

Jeroen
--
This message posted from opensolaris.org
Jeroen Roodhart wrote:
> The thread was started to get insight into the behaviour of the F20 as ZIL. _My_ particular interest would be to be able to answer why performance doesn't seem to scale up when adding vmod-s...

My best guess would be latency. If you are latency bound, adding additional parallel devices with the same latency will make no difference. It will improve throughput, but may actually make latency worse (additional time to select which parallel device to use).

But one of the ZFS gurus may be able to provide a better answer, or some dtrace foo to confirm/deny my thesis.

--
Carson
> It doesn't have to be F20. You could use the Intel X25 for example.

The MLC-based disks are bound to be too slow (we tested with an OCZ Vertex Turbo). So you're stuck with the X25-E (which Sun stopped supporting for some reason). I believe most "normal" SSDs do have some sort of cache and usually no supercap or other backup power solution, so be wary of that. Having said all this, the new Sandforce-based SSDs look promising...

> If you're running solaris proper, you better mirror your ZIL log device.

Absolutely true. I forgot this 'cause we're running OSOL nv130... (we constantly seem to need features that haven't landed in Solaris proper :) ).

> If you're running opensolaris ... I don't know if that's important.

At least I can confirm the ability to add and remove ZIL devices on the fly with OSOL of a sufficiently recent build.

> I'll probably test it, just to be sure, but I might never get around to it because I don't have a justifiable business reason to build the opensolaris machine just for this one little test.

I plan to test this as well, though it won't be until late next week.

With kind regards,

Jeroen
--
This message posted from opensolaris.org
>>>>> "enh" == Edward Ned Harvey <solaris2 at nedharvey.com> writes:

enh> Dude, don't be so arrogant. Acting like you know what I'm
enh> talking about better than I do. Face it that you have
enh> something to learn here.

funny! AIUI you are wrong and Casper is right.

ZFS recovers to a crash-consistent state, even without the slog, meaning it recovers to some state through which the filesystem passed in the seconds leading up to the crash. This isn't what UFS or XFS do.

The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state (a state not equal to a snapshot you might have hypothetically taken in the seconds leading up to the crash), because files that were recently fsync()'d may be of newer versions than files that weren't---that is, fsync() durably commits only the file it references, by copying that *part* of the in-RAM ZIL to the durable slog. fsync() is not equivalent to 'lockfs -fa' committing every file on the system (is it?). I guess I could be wrong about that.

If I'm right, this isn't a bad thing because apps that call fsync() are supposed to expect the inconsistency, but it's still important to understanding what's going on.
On 01/04/2010 20:58, Jeroen Roodhart wrote:
>> I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers.
>
> Interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...

Which is to be expected, as it is not the nfs client which requests the behavior but rather the nfs server. Currently on Linux you can export a share as sync (the default) or async, while on Solaris you can't currently force the NFS server to work in async mode.

--
Robert Milkowski
http://milek.blogspot.com
Casper.Dik at Sun.COM
2010-Apr-02 09:21 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>On 01/04/2010 20:58, Jeroen Roodhart wrote:
>>> I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers.
>>
>> Interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...
>
>Which is to be expected, as it is not the nfs client which requests the behavior but rather the nfs server. Currently on Linux you can export a share as sync (the default) or async, while on Solaris you can't currently force the NFS server to work in async mode.

The other part of the issue is that the Solaris clients have been developed against a "sync" server. The client writes behind more and continues caching the non-acked data. The Linux client has been developed against an "async" server and has some catching up to do.

Casper
Robert Milkowski writes:
> On 01/04/2010 20:58, Jeroen Roodhart wrote:
>>> I'm happy to see that it is now the default and I hope this will cause the Linux NFS client implementation to be faster for conforming NFS servers.
>>
>> Interesting thing is that apparently the defaults on Solaris and Linux are chosen such that one can't signal the desired behaviour to the other. At least we didn't manage to get a Linux client to asynchronously mount a Solaris (ZFS backed) NFS export...
>
> Which is to be expected, as it is not the nfs client which requests the behavior but rather the nfs server. Currently on Linux you can export a share as sync (the default) or async, while on Solaris you can't currently force the NFS server to work in async mode.

True, and there is an entrenched misconception (not you) that this is a ZFS-specific problem, which it is not. It's really an NFS protocol feature which can be circumvented using zil_disable, which therefore reinforces the misconception. It's further reinforced by testing an NFS server on disk drives with WCE=1 with a filesystem other than ZFS.

All the fast options cause the NFS client to become inconsistent after a server reboot. Whatever was being done in the moments prior to the server reboot will need to be wiped out by users if they are told that the server did reboot. That's manageable for home use, not for the enterprise.

-r
> > Seriously, all disks configured WriteThrough (spindle and SSD disks alike) using the dedicated ZIL SSD device, very noticeably faster than enabling the WriteBack.
>
> What do you get with both SSD ZIL and WriteBack disks enabled?
>
> I mean if you have both why not use both? Then both async and sync IO benefits.

Interesting, but unfortunately false. Soon I'll post the results here. I just need to package them in a way suitable to give the public, and stick it on a website. But I'm fighting IT fires for now and haven't had the time yet. Roughly speaking, the following are approximately representative. Of course it varies based on tweaks of the benchmark and stuff like that.

	Stripe 3 mirrors write through: 450-780 IOPS
	Stripe 3 mirrors write back: 1030-2130 IOPS
	Stripe 3 mirrors write back + SSD ZIL: 1220-2480 IOPS
	Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS

Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4 times faster than naked disk. And for some reason, having the WriteBack enabled while you have SSD ZIL actually hurts performance by approx 10%. You're better off to use the SSD ZIL with disks in Write Through mode.

That result is surprising to me. But I have a theory to explain it. When you have WriteBack enabled, the OS issues a small write, and the HBA immediately returns to the OS: "Yes, it's on nonvolatile storage." So the OS quickly gives it another, and another, until the HBA write cache is full. Now the HBA faces the task of writing all those tiny writes to disk, and the HBA must simply follow orders, writing a tiny chunk to the sector it said it would write, and so on. The HBA cannot effectively consolidate the small writes into a larger sequential block write. But if you have the WriteBack disabled, and you have an SSD for ZIL, then ZFS can log the tiny operation on SSD, and immediately return to the process: "Yes, it's on nonvolatile storage." So the application can issue another, and another, and another. ZFS is smart enough to aggregate all these tiny write operations into a single larger sequential write before sending it to the spindle disks.

Long story short, the evidence suggests if you have SSD ZIL, you're better off without WriteBack on the HBA. And I conjecture the reasoning behind it is because ZFS can write-buffer better than the HBA can.
> I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.

It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion.

If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.

I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.
When we use one vmod, both machines are finished in about 6min45; zilstat maxes out at about 4200 IOPS. Using four vmods it takes about 6min55; zilstat maxes out at 2200 IOPS.

Can you try 4 concurrent tars to four different ZFS filesystems (same pool)?

-r
> > http://nfs.sourceforge.net/
>
> I think B4 is the answer to Casper's question:

We were talking about ZFS, and under what circumstances data is flushed to disk, in what way "sync" and "async" writes are handled by the OS, and what happens if you disable ZIL and lose power to your system.

We were talking about C/C++ sync and async. Not NFS sync and async. I don't think anything relating to NFS is the answer to Casper's question, or else Casper was simply jumping context by asking it.

Don't get me wrong, I have no objection to his question or anything, it's just that the conversation has derailed and now people are talking about NFS sync/async instead of what happens when a C/C++ application is doing sync/async writes to a disabled ZIL.
> > I am envisioning a database, which issues a small sync write, followed by a larger async write. Since the sync write is small, the OS would prefer to defer the write and aggregate into a larger block. So the possibility of the later async write being committed to disk before the older sync write is a real risk. The end result would be inconsistency in my database file.
>
> Zfs writes data in transaction groups and each bunch of data which gets written is bounded by a transaction group. The current state of the data at the time the TXG starts will be the state of the data once the TXG completes. If the system spontaneously reboots then it will restart at the last completed TXG so any residual writes which might have occurred while a TXG write was in progress will be discarded. Based on this, I think that your ordering concerns (sync writes getting to disk "faster" than async writes) are unfounded for normal file I/O.

So you're saying that while the OS is building txgs to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txgs. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way?

If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system?

The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. It's a blocking call that doesn't return until the sync is completed. The only reason you would ever do this is if order matters. If you cannot allow the next command to begin until after the previous one was completed. Such is the situation with databases and sometimes virtual machines.
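To make the scenario under discussion concrete, here is a minimal C sketch (file names are hypothetical) of the pattern being debated: a small synchronous journal write followed by a larger asynchronous data write. The open question in the thread is whether, with the ZIL disabled, the second write can ever land in an earlier TXG than the first.

/* sketch of the write pattern under discussion (illustrative only) */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int log_fd  = open("/tank/db/journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int data_fd = open("/tank/db/datafile", O_WRONLY | O_CREAT, 0644);
    if (log_fd < 0 || data_fd < 0) { perror("open"); return 1; }

    /* small synchronous write: the database depends on this record
     * being durable before the data pages that follow it */
    const char rec[] = "begin txn 42\n";
    write(log_fd, rec, strlen(rec));
    fsync(log_fd);                   /* blocks until on stable storage */

    /* larger asynchronous write: no fsync(); the OS may flush it
     * whenever it likes, in the same or a later TXG */
    char page[128 * 1024];
    memset(page, 0xab, sizeof(page));
    write(data_fd, page, sizeof(page));

    close(log_fd);
    close(data_fd);
    return 0;
}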
> hello
>
> i have had this problem this week. our zil ssd died (apt slc ssd 16gb). because we had no spare drive in stock, we ignored it.
>
> then we decided to update our nexenta 3 alpha to beta, exported the pool and made a fresh install to have a clean system, and tried to import the pool. we only got an error message about a missing drive.
>
> we googled about this and it seems there is no way to access the pool!!! (hope this will be fixed in future)
>
> we had a backup and the data are not so important, but that could be a real problem: you have a valid zfs pool and you cannot access your data due to a missing zil.

If you have a zpool less than version 19 (when the ability to remove log devices was introduced) and you have a non-mirrored log device that failed, you had better treat the situation as an emergency. Normally you can find your current zpool version by doing "zpool upgrade," but you cannot now if you're in this failure state. Do not attempt "zfs send" or "zfs list" or any other zpool or zfs command. Instead, do "man zpool" and look for "zpool remove." If it says it supports removing log devices, then you had better use it to remove your log device. If it says it only supports removing hotspares or cache, then your zpool is lost permanently.

If you are running Solaris, take it as given that you do not have zpool version 19. If you are running OpenSolaris, I don't know at which point zpool version 19 was introduced.

Your only hope is to "zpool remove" the log device. Use tar or cp or something to try and salvage your data out of there. Your zpool is lost, and if it's functional at all right now, it won't stay that way for long. Your system will soon hang, and then you will not be able to import your pool. Ask me how I know.
> ZFS recovers to a crash-consistent state, even without the slog, meaning it recovers to some state through which the filesystem passed in the seconds leading up to the crash. This isn't what UFS or XFS do.
>
> The on-disk log (slog or otherwise), if I understand right, can actually make the filesystem recover to a crash-INconsistent state (a

You're speaking the opposite of common sense. If disabling the ZIL makes the system faster *and* less prone to data corruption, please explain why we don't all disable the ZIL?
> If you have a zpool less than version 19 (when the ability to remove log devices was introduced) and you have a non-mirrored log device that failed, you had better treat the situation as an emergency.

> Instead, do "man zpool" and look for "zpool remove." If it says it supports removing log devices, then you had better use it to remove your log device. If it says it only supports removing hotspares or cache, then your zpool is lost permanently.

I take it back. If you lost your log device on a zpool which is less than version 19, then you *might* have a possible hope if you migrate your disks to a later system. You *might* be able to "zpool import" on a later version of the OS.
Casper.Dik at Sun.COM
2010-Apr-02 13:00 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>> > http://nfs.sourceforge.net/
>>
>> I think B4 is the answer to Casper's question:
>
>We were talking about ZFS, and under what circumstances data is flushed to disk, in what way "sync" and "async" writes are handled by the OS, and what happens if you disable ZIL and lose power to your system.
>
>We were talking about C/C++ sync and async. Not NFS sync and async.

I don't think so.

http://www.mail-archive.com/zfs-discuss at opensolaris.org/msg36783.html

(This discussion was started, I think, in the context of NFS performance.)

Casper
Casper.Dik at Sun.COM
2010-Apr-02 13:20 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>So you're saying that while the OS is building txgs to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txgs. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way?

The question is not how the writes are ordered but whether an earlier write can be in a later txg. A transaction group is committed atomically.

In http://arc.opensolaris.org/caselog/PSARC/2010/108/mail I ask a similar question to make sure I understand it correctly, and the answer was ("> " = Casper, the answer is from Neil Perrin):

> Is there a partial order defined for all filesystem operations?

File system operations will be written in order for all settings of the sync flag.

> Specifically, will ZFS guarantee that when fsync()/O_DATA happens on a file,

(I assume by O_DATA you meant O_DSYNC).

> that later transactions will not be in an earlier transaction group?
> (Or is this already the case?)

This is already the case.

So what I assumed was true, but what you made me doubt, was apparently still true: later transactions cannot be committed in an earlier txg.

>If that's true, if there's no increased risk of data corruption, then why doesn't everybody just disable their ZIL all the time on every system?

For an application running on the file server, there is no difference. When the system panics you know that data might be lost. The application also dies. (The snapshot and the last valid uberblock are equally valid.)

But for an application on an NFS client, without the ZIL data will be lost while the NFS client believes the data is written, and it will not try again. With the ZIL, when the NFS server says that data is written, then it is actually on stable storage.

>The reason to have a sync() function in C/C++ is so you can ensure data is written to disk before you move on. It's a blocking call that doesn't return until the sync is completed. The only reason you would ever do this is if order matters. If you cannot allow the next command to begin until after the previous one was completed. Such is the situation with databases and sometimes virtual machines.

So the question is: when is your data invalid? What happens with the data when the system dies before the fsync() call? What happens with the data when the system dies after the fsync() call? What happens with the data when the system dies after more I/O operations?

With the zil disabled, you call fsync() but you may encounter data from before the call to fsync(). That could happen before, so I assume you can actually recover from that situation.

Casper
> >Dude, don't be so arrogant. Acting like you know what I'm talking about better than I do. Face it that you have something to learn here.
>
> You may say that, but then you post this:

Acknowledged. I read something arrogant, and I replied even more arrogantly. That was dumb of me.
> Only a broken application uses sync writes sometimes, and async writes at other times.

Suppose there is a virtual machine, with virtual processes inside it. Some virtual process issues a sync write to the virtual OS; meanwhile another virtual process issues an async write. Then the virtual OS will sometimes issue sync writes and sometimes async writes to the host OS. Are you saying this makes qemu, and vbox, and vmware "broken applications?"
> The purpose of the ZIL is to act like a fast "log" for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work.

Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code, or at least have read it?"

Questions to answer would be:

Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes?

Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct?

At boot time, or "zpool import" time, what is taken to be "the current filesystem?" The latest uberblock? Something else?

My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of the OS halting or an ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.

Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.

Somebody (Casper?) said it before, and now I'm starting to realize ... this is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything.

The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes.

If the OS does give priority for sync writes going into TXGs before async writes (even with ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.
Casper.Dik at Sun.COM
2010-Apr-02 15:04 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>Questions to answer would be:
>
>Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes?

There are quite a few "sync" writes, specifically when you mix in the NFS server.

>Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

What I quoted from the other discussion seems to be that later writes cannot be committed in an earlier TXG than your sync write or other earlier writes.

>I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct?

The "uberblock" is the "root of all the data". All the data in a ZFS pool is referenced by it; after the txg is in stable storage, then the uberblock is updated.

>At boot time, or "zpool import" time, what is taken to be "the current filesystem?" The latest uberblock? Something else?

The current "zpool" and the filesystems as referenced by the last uberblock.

>My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool. My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of the OS halting or an ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.

sync() is actually *async*, and returning from sync() says nothing about stable storage. After fsync() returns, it signals that all the data is in stable storage (except if you disable the ZIL), or, apparently, in Linux when the write caches for your disks are enabled (the default for PC drives). ZFS doesn't care about the write cache; it makes sure it is flushed. (There's fsync() and open(..., O_DSYNC|O_SYNC).)

>Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.
>
>Somebody (Casper?) said it before, and now I'm starting to realize ... this is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything.
>
>The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes.
>
>If the OS does give priority for sync writes going into TXGs before async writes (even with ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.

I believe that the writes are still ordered, so the consistency you want is actually delivered even without the ZIL enabled.

Casper
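As a concrete illustration of the distinction being drawn here, the following is a minimal C sketch (paths are hypothetical): sync(2) merely schedules a flush as far as POSIX is concerned, while fsync(2) and the O_DSYNC open flag block until the named file's data is on stable storage, and those are the calls a separate log device can accelerate.

/* sketch: three ways an application can ask for durability */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "record\n";

    /* 1. sync(): schedules a flush of all dirty data; per POSIX it may
     *    return before anything reaches disk (ZFS goes further and forces
     *    txg commits, but portable code cannot rely on that). */
    sync();

    /* 2. fsync(fd): blocks until this file's data is on stable storage. */
    int fd = open("/tank/fs/file1", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }
    write(fd, buf, strlen(buf));
    fsync(fd);
    close(fd);

    /* 3. O_DSYNC: every write() on this descriptor is synchronous. */
    int dfd = open("/tank/fs/file2", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    if (dfd < 0) { perror("open"); return 1; }
    write(dfd, buf, strlen(buf));   /* returns only after the data is stable */
    close(dfd);

    return 0;
}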
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:
> So you're saying that while the OS is building txgs to write to disk, the OS will never reorder the sequence in which individual write operations get ordered into the txgs. That is, an application performing a small sync write, followed by a large async write, will never have the second operation flushed to disk before the first. Can you support this belief in any way?

I am like a "pool" or "tank" of regurgitated zfs knowledge. I simply pay attention when someone who really knows explains something (e.g. Neil Perrin, as Casper referred to) so I can regurgitate it later. I try to do so faithfully. If I had behaved this way in school, I would have been a good student. Sometimes I am wrong or the design has somewhat changed since the original information was provided.

There are indeed popular filesystems (e.g. Linux EXT4) which write data to disk in a different order than chronologically requested, so it is good that you are paying attention to these issues. While in the slog-based recovery scenario it is possible for a TXG to be generated which lacks async data, this only happens after a system crash, and if all of the critical data is written as a sync request, it will be faithfully preserved.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
>> I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.
>
> It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion.
>
> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.

Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs.

The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number.

This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same reasons. I'm a little surprised that the engineers would suddenly stop doing it only on SSDs. But who knows.

-Kyle

> I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.
On Fri, Apr 2, 2010 at 16:24, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
>> The purpose of the ZIL is to act like a fast "log" for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work.
>
> Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code, or at least have read it?"
>
> Questions to answer would be:
>
> Is a ZIL log device used only by sync() and fsync() system calls? Is it ever used to accelerate async writes?

sync() will tell the filesystems to flush writes to disk. sync() will not use the ZIL; it will just start a new TXG, and could return before the writes are done. fsync() is what you are interested in.

> Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

Writes from a TXG will not be used until the whole TXG is committed to disk. Everything from a half-written TXG will be ignored after a crash. This means that the order of writes within a TXG is not important. The only way to do a sync write without the ZIL is to start a new TXG after the write. That costs a lot, so we have the ZIL for sync writes.
On Fri, 2 Apr 2010, Edward Ned Harvey wrote:
> were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.
>
> Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.

You seem to be assuming that Solaris is an incoherent operating system. With ZFS, the filesystem in memory is coherent, and transaction groups are constructed in simple chronological order (capturing combined changes up to that point in time), without regard to SYNC options. The only possible exception to the coherency is for memory mapped files, where the mapped memory is a copy of data (originally) from the ZFS ARC and needs to be reconciled with the ARC if an application has dirtied it. This differs from UFS and the way Solaris worked prior to Solaris 10.

Synchronous writes are not "faster" than asynchronous writes. If you drop heavy and light objects from the same height, they fall at the same rate. This was proven long ago.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Apr 2, 2010, at 5:08 AM, Edward Ned Harvey wrote:
>> I know it is way after the fact, but I find it best to coerce each drive down to the whole GB boundary using format (create Solaris partition just up to the boundary). Then if you ever get a drive a little smaller it still should fit.
>
> It seems like it should be unnecessary. It seems like extra work. But based on my present experience, I reached the same conclusion.
>
> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.
>
> I take it back. Me. I am to prevent it from happening. And the technique to do so is precisely as you've said. First slice every drive to be a little smaller than actual. Then later if I get a replacement device for the mirror that's slightly smaller than the others, I have no reason to care.

However, I believe there are some downsides to letting ZFS manage just a slice rather than an entire drive, but perhaps those do not apply as significantly to SSD devices?

Thanks

--
Stuart Anderson  anderson at ligo.caltech.edu
http://www.ligo.caltech.edu/~anderson
On Fri, Apr 2, 2010 at 8:03 AM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
>>> Seriously, all disks configured WriteThrough (spindle and SSD disks alike) using the dedicated ZIL SSD device, very noticeably faster than enabling the WriteBack.
>>
>> What do you get with both SSD ZIL and WriteBack disks enabled?
>>
>> I mean if you have both why not use both? Then both async and sync IO benefits.
>
> Interesting, but unfortunately false. Soon I'll post the results here. [...]
>
>	Stripe 3 mirrors write through: 450-780 IOPS
>	Stripe 3 mirrors write back: 1030-2130 IOPS
>	Stripe 3 mirrors write back + SSD ZIL: 1220-2480 IOPS
>	Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS
>
> Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4 times faster than naked disk. And for some reason, having the WriteBack enabled while you have SSD ZIL actually hurts performance by approx 10%. You're better off to use the SSD ZIL with disks in Write Through mode.
>
> [...] ZFS is smart enough to aggregate all these tiny write operations into a single larger sequential write before sending it to the spindle disks.

Hmm, when you did the write-back test, was the ZIL SSD included in the write-back? What I was proposing was write-back only on the disks, and the ZIL SSD with no write-back. Not all operations hit the ZIL, so it would still be nice to have the non-ZIL operations return quickly.

-Ross
On 02/04/2010 16:04, Casper.Dik at Sun.COM wrote:
> sync() is actually *async* and returning from sync() says nothing about

To clarify - in the case of ZFS, sync() is actually synchronous.

--
Robert Milkowski
http://milek.blogspot.com
> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors?

There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):

LBA count = 97696368 + (1953504 * (desired capacity in GBytes - 50.0))

Sizes should match exactly if the manufacturer follows the standard.

See:
http://opensolaris.org/jive/message.jspa?messageID=393336#393336
http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=download&data_file_id=1066
--
This message posted from opensolaris.org
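For convenience, here is a small C sketch of that formula. Note the follow-up later in this thread: the IDEMA LBA1-02 formula is defined for capacities of 50 GB and up, so it does not cover a 32 GB device.

/* IDEMA LBA1-02: LBA count for a nominal capacity in GB (>= 50 GB) */
#include <stdio.h>

static long long idema_lba_count(double capacity_gb)
{
    /* 97,696,368 LBAs at 50 GB, plus 1,953,504 LBAs per additional GB */
    return 97696368LL + (long long)(1953504.0 * (capacity_gb - 50.0));
}

int main(void)
{
    double sizes[] = { 50.0, 500.0, 1000.0 };
    for (int i = 0; i < 3; i++)
        printf("%6.0f GB -> %lld LBAs (512-byte sectors)\n",
               sizes[i], idema_lba_count(sizes[i]));
    return 0;
}

For example, the 1000 GB case works out to 1,953,525,168 LBAs, i.e. exactly 1,000,204,886,016 bytes, which is why 1 TB drives from different vendors can mirror each other.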
>>>>> "enh" == Edward Ned Harvey <solaris2 at nedharvey.com> writes:

enh> If you have zpool less than version 19 (when ability to remove
enh> log device was introduced) and you have a non-mirrored log
enh> device that failed, you had better treat the situation as an
enh> emergency.

Ed the log device removal support is only good for adding a slog to try it out, then changing your mind and removing the slog (which was not possible before). It doesn't change the reliability situation one bit: pools with dead slogs are not importable. There've been threads on this for a while.

It's well-discussed because it's an example of IMHO broken process of ``obviously a critical requirement but not technically part of the original RFE which is already late,'' as well as a dangerous pitfall for ZFS admins. I imagine the process works well in other cases to keep stuff granular enough that it can be prioritized effectively, but in this case it's made the slog feature significantly incomplete for a couple years and put many production systems in a precarious spot, and the whole mess was predicted before the slog feature was integrated.

>> The on-disk log (slog or otherwise), if I understand right, can
>> actually make the filesystem recover to a crash-INconsistent
>> state

enh> You're speaking the opposite of common sense.

Yeah, I'm doing it on purpose to suggest that just guessing how you feel things ought to work based on vague notions of economy isn't a good idea.

enh> If disabling the ZIL makes the system faster *and* less prone
enh> to data corruption, please explain why we don't all disable
enh> the ZIL?

I said complying with fsync can make the system recover to a state not equal to one you might have hypothetically snapshotted in a moment leading up to the crash. Elsewhere I might've said disabling the ZIL does not make the system more prone to data corruption, *iff* you are not an NFS server. If you are, disabling the ZIL can lead to lost writes if an NFS server reboots and an NFS client does not, which can definitely cause app-level data corruption.

Disabling the ZIL breaks the D requirement of ACID databases, which might screw up apps that replicate, or keep databases on several separate servers in sync, and it might lead to lost mail on an MTA, but because, unlike non-COW filesystems, it costs nothing extra for ZFS to preserve write ordering even without fsync(), AIUI you will not get corrupted application-level data by disabling the ZIL. You just get missing data that the app has a right to expect should be there. The dire warnings written by kernel developers in the wikis of ``don't EVER disable the ZIL'' are totally ridiculous and inappropriate IMO. I think they probably just worked really hard to write the ZIL piece of ZFS, and don't want people telling their brilliant code to fuckoff just because it makes things a little slower. So we get all this ``enterprise'' snobbery and so on.

``crash consistent'' is a technical term not a common-sense term, and I may have used it incorrectly:

http://oraclestorageguy.typepad.com/oraclestorageguy/2007/07/why-emc-technol.html

With a system that loses power on which fsync() had been in use, the files getting fsync()'ed will probably recover to more recent versions than the rest of the files, which means the recovered state achieved by yanking the cord couldn't have been emulated by cloning a snapshot and not actually having lost power.

However, the app calling fsync() will expect this, so it's not supposed to lead to application-level inconsistency. If you test your app's recovery ability in just that way, by cloning snapshots of filesystems on which the app is actively writing and then seeing if the app can recover the clone, then you're unfortunately not testing the app quite hard enough if fsync() is involved, so yeah, I guess disabling the ZIL might in theory make incorrectly-written apps less prone to data corruption. Likewise, no testing of the app on a ZFS will be aggressive enough to make the app powerfail-proof on a non-COW POSIX system because ZFS keeps more ordering than the API actually guarantees to the app. I'm repeating myself though. I wish you'd just read my posts with at least paragraph granularity instead of just picking out individual sentences and discarding everything that seems too complicated or too awkwardly stated.

I'm basing this all on the ``common sense'' that to do otherwise, fsync() would have to completely ignore its file descriptor argument. It'd have to copy the entire in-memory ZIL to the slog and behave the same as 'lockfs -fa', which I think would perform too badly compared to non-ZFS filesystems' fsync()s, and would lead to emphatic performance advice like ``segregate files that get lots of fsync()s into separate ZFS datasets from files that get high write bandwidth,'' and we don't have advice like that in the blogs/lists/wikis, which makes me think it's not beneficial (the benefit would be dramatic if it were!) and that fsync() works the way I think it does. It's a slightly more convoluted type of ``common sense'' than yours, but mine could still be wrong.
On Fri, Apr 2, 2010 at 10:08 AM, Kyle McDonald <kmcdonald at egenera.com> wrote:
> On 4/2/2010 8:08 AM, Edward Ned Harvey wrote:
>> [...]
>> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors? Nothing. That's what.
>
> Actually, it's my experience that Sun (and other vendors) do exactly that for you when you buy their parts - at least for rotating drives; I have no experience with SSDs.
>
> The Sun disk label shipped on all the drives is set up to make the drive the standard size for that Sun part number. They have to do this since they (for many reasons) have many sources (diff. vendors, even diff. parts from the same vendor) for the actual disks they use for a particular Sun part number.
>
> This isn't new; I believe IBM, EMC, HP, etc. all do it also for the same reasons. I'm a little surprised that the engineers would suddenly stop doing it only on SSDs. But who knows.
>
> -Kyle

If I were forced to ignorantly cast a stone, it would be into Intel's lap (if the SSDs indeed came directly from Sun). Sun's "normal" drive vendors have been in this game for decades, and know the expectations. Intel, on the other hand, may not have quite the same QC in place yet.

--Tim
On Fri, Apr 2 at 11:14, Tirso Alonso wrote:
>> If my new replacement SSD with identical part number and firmware is 0.001 Gb smaller than the original and hence unable to mirror, what's to prevent the same thing from happening to one of my 1TB spindle disk mirrors?
>
> There is a standard for sizes that many manufacturers use (IDEMA LBA1-02):
>
> LBA count = 97696368 + (1953504 * (desired capacity in GBytes - 50.0))
>
> Sizes should match exactly if the manufacturer follows the standard.
>
> See:
> http://opensolaris.org/jive/message.jspa?messageID=393336#393336
> http://www.idema.org/_smartsite/modules/local/data_file/show_file.php?cmd=download&data_file_id=1066

The problem is that it only applies to devices that are >= 50GB in size, and the X25 in question is only 32GB.

That being said, I'd be skeptical of either the sourcing of the parts, or else some other configuration feature on the drives (like HPA or DCO) that is changing the capacity. It's possible one of these is in effect.

--eric

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
On 04/02/10 08:24, Edward Ned Harvey wrote:
>> The purpose of the ZIL is to act like a fast "log" for synchronous writes. It allows the system to quickly confirm a synchronous write request with the minimum amount of work.
>
> Bob and Casper and some others clearly know a lot here. But I'm hearing conflicting information, and don't know what to believe. Does anyone here work on ZFS as an actual ZFS developer for Sun/Oracle? Can claim "I can answer this question, I wrote that code, or at least have read it?"

I'm one of the ZFS developers. I wrote most of the zil code. Still I don't have all the answers. There are a lot of knowledgeable people on this alias. I usually monitor this alias and sometimes chime in when there's some misinformation being spread, but sometimes the volume is so high. Since I started this reply there have been 20 new posts on this thread alone!

> Questions to answer would be:
>
> Is a ZIL log device used only by sync() and fsync() system calls?

- The intent log (separate device(s) or not) is only used by fsync, O_DSYNC, O_SYNC, O_RSYNC. NFS commits are seen by ZFS as fsyncs. Note, sync(1m) and sync(2) do not use the intent log. They force transaction group (txg) commits on all pools. So zfs goes beyond the requirement for sync(), which only requires that the writing be scheduled but not necessarily completed before returning. The zfs interpretation is rather expensive, but the alternative seemed broken, so we fixed it.

> Is it ever used to accelerate async writes?

The zil is not used to accelerate async writes.

> Suppose there is an application which sometimes does sync writes, and sometimes async writes. In fact, to make it easier, suppose two processes open two files, one of which always writes asynchronously, and one of which always writes synchronously. Suppose the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order? Meaning, can a large block async write be put into a TXG and committed to disk before a small sync write to a different file is committed to disk, even though the small sync write was issued by the application before the large async write? Remember, the point is: ZIL is disabled. Question is whether the async could possibly be committed to disk before the sync.

Threads can be pre-empted in the OS at any time. So even though thread A issued W1 before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2. Multi-threaded applications have to handle this. If this was a single thread issuing W1 then W2, then yes, the order is guaranteed, regardless of whether W1 or W2 are synchronous or asynchronous. Of course if the system crashes then the async operations might not be there.

> I make the assumption that an uberblock is the term for a TXG after it is committed to disk. Correct?

- Kind of. The uberblock contains the root of the txg.

> At boot time, or "zpool import" time, what is taken to be "the current filesystem?" The latest uberblock? Something else?

A txg is for the whole pool, which can contain many filesystems. The latest txg defines the current state of the pool and each individual fs.

> My understanding is that enabling a dedicated ZIL device guarantees sync() and fsync() system calls block until the write has been committed to nonvolatile storage, and attempts to accelerate by using a physical device which is faster or more idle than the main storage pool.

Correct (except replace sync() with O_DSYNC, etc). This also assumes hardware that, for example, correctly handles the flushing of its caches.

> My understanding is that this provides two implicit guarantees: (1) sync writes are always guaranteed to be committed to disk in order, relative to other sync writes. (2) In the event of the OS halting or an ungraceful shutdown, sync writes committed to disk are guaranteed to be equal or greater than the async writes that were taking place at the same time. That is, if two processes both complete a write operation at the same time, one in sync mode and the other in async mode, then it is guaranteed the data on disk will never have the async data committed before the sync data.

The ZIL doesn't make such guarantees. It's the DMU that handles transactions and their grouping into txgs. It ensures that writes are committed in order by its transactional nature. The function of the zil is merely to ensure that synchronous operations are stable and replayed after a crash/power fail onto the latest txg.

> Based on this understanding, if you disable ZIL, then there is no guarantee about the order of writes being committed to disk. Neither of the above guarantees is valid anymore. Sync writes may be completed out of order. Async writes that supposedly happened after sync writes may be committed to disk before the sync writes.

No, disabling the ZIL does not disable the DMU.

> Somebody (Casper?) said it before, and now I'm starting to realize ... this is also true of the snapshots. If you disable your ZIL, then there is no guarantee your snapshots are consistent either. Rolling back doesn't necessarily gain you anything.

No, a snapshot forces a txg, which is a consistent, up-to-date view of the pool and its file systems. The zil is not involved.

See also http://blogs.sun.com/perrin/entry/the_lumberjack - which is a bit dated and simplistic but still largely true.

Neil.
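To illustrate the threading point made above, here is a minimal C sketch (file names are hypothetical): when two unsynchronized threads each issue a write, nothing guarantees which call reaches the filesystem first; only program order within a single thread is preserved.

/* sketch: write ordering is only defined within a single thread */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *writer(void *arg)
{
    const char *path = arg;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    const char rec[] = "record\n";
    write(fd, rec, strlen(rec));   /* arrival order vs. the other thread is undefined */
    fsync(fd);                     /* durable, but says nothing about the other file */
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    /* thread A is started first, yet once both are running the kernel
     * may service B's write before A's */
    pthread_create(&a, NULL, writer, "/tank/fs/file-a");
    pthread_create(&b, NULL, writer, "/tank/fs/file-b");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}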
Hi Jeroen,

Have you tried the DDRdrive from Christopher George <cgeorge at ddrdrive.com>? Looks to me like a much better fit for your application than the F20. It would not hurt to check it out. Looks to me like you need a product with low *latency* - and a RAM based cache would be a much better performer than any solution based solely on flash.

Let us know (on the list) how this works out for you.

Regards,

--
Al Hopper  Logical Approach Inc, Plano, TX
al at logical-approach.com  Voice: 214.233.5089  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Casper.Dik at Sun.COM
2010-Apr-03 10:28 UTC
[zfs-discuss] Sun Flash Accelerator F20 numbers
>The only way to guarantee consistency in the snapshot is to always (regardless of ZIL enabled/disabled) give priority for sync writes to get into the TXG before async writes.
>
>If the OS does give priority for sync writes going into TXGs before async writes (even with ZIL disabled), then after a spontaneous ungraceful reboot, the latest uberblock is guaranteed to be consistent.

This is what Jeff Bonwick says in the "zil synchronicity" ARC case:

"What I mean is that the barrier semantic is implicit even with no ZIL at all. In ZFS, if event A happens before event B, and you lose power, then what you'll see on disk is either nothing, A, or both A and B. Never just B. It is impossible for us not to have at least barrier semantics."

So there's no chance that a *later* async write will overtake an earlier sync *or* async write.

Casper
Hi Al,

> Have you tried the DDRdrive from Christopher George <cgeorge at ddrdrive.com>?
> Looks to me like a much better fit for your application than the F20?
>
> It would not hurt to check it out. Looks to me like you need a product with low
> *latency* - and a RAM-based cache would be a much better performer than any solution
> based solely on flash.
>
> Let us know (on the list) how this works out for you.

Well, I did look at it, but at that time there was no Solaris support yet. Right now it seems there is only a beta driver? I kind of remember that if you'd want reliable fallback to nvram, you'd need a UPS feeding the card. I could be very wrong there, but the product documentation isn't very clear on this (at least to me ;) )

Also, we'd kind of like to have a SnOracle-supported option. But yeah, on paper it does seem it could be an attractive solution...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
> Well, I did look at it, but at that time there was no Solaris support yet. Right now it
> seems there is only a beta driver?

Correct, we just completed functional validation of the OpenSolaris driver. Our focus has now turned to performance tuning and benchmarking. We expect to formally introduce the DDRdrive X1 to the ZFS community later this quarter. It is our goal to focus exclusively on the dedicated ZIL device market going forward.

> I kind of remember that if you'd want reliable fallback to nvram, you'd need a UPS
> feeding the card.

Currently, a dedicated external UPS is required for correct operation. Based on community feedback, we will be offering automatic backup/restore prior to release. This guarantees the UPS will only be required for 60 seconds to successfully back up the drive contents on a host power or hardware failure. Dutifully, on the next reboot, the restore will occur prior to the OS loading, for seamless non-volatile operation. Also, we have heard loud and clear the requests for an internal power option. It is our intention that the X1 will be the first in a family of products all dedicated to ZIL acceleration, for not only OpenSolaris but also Solaris 10 and FreeBSD.

> Also, we'd kind of like to have a SnOracle-supported option.

Although a much smaller company, we believe our singular focus and absolute passion for ZFS and the potential of Hybrid Storage Pools will serve our customers well. We are actively designing our soon-to-be-available support plans. Your voice will be heard; please email directly at <cgeorge at ddrdrive dot com> with requests, comments and/or questions.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
-- 
This message posted from opensolaris.org
On 1 apr 2010, at 06.15, Stuart Anderson wrote:

> Assuming you are also using a PCI LSI HBA from Sun that is managed with a utility
> called /opt/StorMan/arcconf and reports itself as the amazingly informative model
> number "Sun STK RAID INT", what worked for me was to run,
>   arcconf delete (to delete the pre-configured volume shipped on the drive)
>   arcconf create (to create a new volume)

Just to sort things out (or not? :-): I more than agree that this product is highly confusing, but I don't think there is anything LSI in or about that card. I believe it is an Adaptec card, developed, manufactured and supported by Intel for Adaptec, licensed (or something) to StorageTek, and later included in Sun machines (since Sun bought StorageTek, I suppose). Now we could add Oracle to this name-dropping inferno, if we wanted to.

I am not sure why they (Sun) put those in there; they don't seem very fast or smart or anything.

/ragge
On 2 apr 2010, at 22.47, Neil Perrin wrote:

>> Suppose there is an application which sometimes does sync writes, and sometimes async
>> writes. In fact, to make it easier, suppose two processes open two files, one of which
>> always writes asynchronously, and one of which always writes synchronously. Suppose the
>> ZIL is disabled. Is it possible for writes to be committed to disk out-of-order?
>> Meaning, can a large block async write be put into a TXG and committed to disk before a
>> small sync write to a different file is committed to disk, even though the small sync
>> write was issued by the application before the large async write? Remember, the point
>> is: ZIL is disabled. Question is whether the async could possibly be committed to disk
>> before the sync.
>
> Threads can be pre-empted in the OS at any time. So even though thread A issued W1
> before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2.
> Multi-threaded applications have to handle this.
>
> If this was a single thread issuing W1 then W2, then yes, the order is guaranteed
> regardless of whether W1 or W2 are synchronous or asynchronous.
> Of course if the system crashes then the async operations might not be there.

Could you please clarify this last paragraph a little: Do you mean that this is in the case where you have the ZIL enabled and the txg for W1 and W2 hasn't been committed, so that upon reboot the ZIL is replayed, and therefore only the sync writes are eventually there?

If, let's say, W1 is an async small write, W2 is a sync small write, W1 arrives at ZFS before W2, and W2 arrives before the txg is committed, will both writes always be in the txg on disk?

If so, it would mean that ZFS itself never buffers up async writes into larger blurbs to write at a later txg, correct?

I take it that ZIL enabled or not does not make any difference here (we pretend the system did _not_ crash), correct?

Thanks!

/ragge
On Apr 3, 2010, at 5:47 PM, Ragnar Sundblad wrote:

> On 2 apr 2010, at 22.47, Neil Perrin wrote:
>
>>> Suppose there is an application which sometimes does sync writes, and sometimes async
>>> writes. In fact, to make it easier, suppose two processes open two files, one of which
>>> always writes asynchronously, and one of which always writes synchronously. Suppose
>>> the ZIL is disabled. Is it possible for writes to be committed to disk out-of-order?
>>> Meaning, can a large block async write be put into a TXG and committed to disk before
>>> a small sync write to a different file is committed to disk, even though the small
>>> sync write was issued by the application before the large async write? Remember, the
>>> point is: ZIL is disabled. Question is whether the async could possibly be committed
>>> to disk before the sync.
>>
>> Threads can be pre-empted in the OS at any time. So even though thread A issued W1
>> before thread B issued W2, the order is not guaranteed to arrive at ZFS as W1, W2.
>> Multi-threaded applications have to handle this.
>>
>> If this was a single thread issuing W1 then W2, then yes, the order is guaranteed
>> regardless of whether W1 or W2 are synchronous or asynchronous.
>> Of course if the system crashes then the async operations might not be there.
>
> Could you please clarify this last paragraph a little:
> Do you mean that this is in the case where you have the ZIL enabled and the txg for W1
> and W2 hasn't been committed, so that upon reboot the ZIL is replayed, and therefore
> only the sync writes are eventually there?

Yes. The ZIL needs to be replayed on import after an unclean shutdown.

> If, let's say, W1 is an async small write, W2 is a sync small write, W1 arrives at ZFS
> before W2, and W2 arrives before the txg is committed, will both writes always be in
> the txg on disk?

Yes.

> If so, it would mean that ZFS itself never buffers up async writes into larger blurbs
> to write at a later txg, correct?

Correct.

> I take it that ZIL enabled or not does not make any difference here (we pretend the
> system did _not_ crash), correct?

For import following a clean shutdown, there are no transactions in the ZIL to apply. For async-only workloads, there are no transactions in the ZIL to apply. Do not assume that power outages are the only cause of unclean shutdowns.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 4 apr 2010, at 06.01, Richard Elling wrote:

Thank you for your reply! Just wanted to make sure.

> Do not assume that power outages are the only cause of unclean shutdowns.
> -- richard

Thanks, I have seen that mistake several times with other (file)systems, and hope I'll never ever make it myself! :-)

/ragge s
> Hmm, when you did the write-back test was the ZIL SSD included in the write-back?
>
> What I was proposing was write-back only on the disks, and ZIL SSD with no write-back.

The tests I did were:
  All disks write-through
  All disks write-back
  With/without SSD for ZIL
and all the permutations of the above. So, unfortunately, no, I didn't test with WriteBack enabled only for the spindles and WriteThrough on the SSD.

It has been suggested, and this is actually what I now believe based on my experience, that precisely the opposite would be the better configuration: spindles configured WriteThrough while the SSD is configured WriteBack would, I believe, be optimal.

If I get the opportunity to test further, I'm interested and I will. But who knows when/if that will happen.
> Actually, it's my experience that Sun (and other vendors) do exactly that for you when
> you buy their parts - at least for rotating drives, I have no experience with SSDs.
>
> The Sun disk label shipped on all the drives is set up to make the drive the standard
> size for that Sun part number. They have to do this since they (for many reasons) have
> many sources (diff. vendors, even diff. parts from the same vendor) for the actual
> disks they use for a particular Sun part number.

Actually, if there is an fdisk partition and/or disk label on a drive when it arrives, I'm pretty sure that's irrelevant. When I first connect a new drive to the HBA, the HBA has to sign and initialize the drive at a lower level than what the OS normally sees. So unless I do some sort of special operation to tell the HBA to preserve/import a foreign disk, the HBA will make the disk blank before the OS sees it anyway.
Richard Elling
2010-Apr-05 03:32 UTC
[zfs-discuss] writeback vs writethrough [was: Sun Flash Accelerator F20 numbers]
On Apr 2, 2010, at 5:03 AM, Edward Ned Harvey wrote:

>>> Seriously, all disks configured WriteThrough (spindle and SSD disks alike)
>>> using the dedicated ZIL SSD device, very noticeably faster than enabling the
>>> WriteBack.
>>
>> What do you get with both SSD ZIL and WriteBack disks enabled?
>>
>> I mean if you have both why not use both? Then both async and sync IO benefits.
>
> Interesting, but unfortunately false. Soon I'll post the results here. I just need to
> package them in a way suitable to give the public, and stick it on a website. But I'm
> fighting IT fires for now and haven't had the time yet.
>
> Roughly speaking, the following are approximately representative. Of course it varies
> based on tweaks of the benchmark and stuff like that.
>   Stripe 3 mirrors write through:            450-780 IOPS
>   Stripe 3 mirrors write back:              1030-2130 IOPS
>   Stripe 3 mirrors write back + SSD ZIL:    1220-2480 IOPS
>   Stripe 3 mirrors write through + SSD ZIL: 1840-2490 IOPS

Thanks for sharing these interesting numbers.

> Overall, I would say WriteBack is 2-3 times faster than naked disks. SSD ZIL is 3-4
> times faster than naked disk. And for some reason, having the WriteBack enabled while
> you have SSD ZIL actually hurts performance by approx 10%. You're better off to use
> the SSD ZIL with disks in WriteThrough mode.

YMMV. The write workload for ZFS is best characterized by looking at the txg commit. In a very short period of time ZFS sends a lot [1] of write I/O to the vdevs. It is not surprising that this can blow through the relatively small caches on controllers. Once you blow through the cache, then the [in]efficiency of the disks behind the cache is experienced as well as the [in]efficiency of the cache controller. Alas, little public information seems to be published regarding how those caches work.

Changing to write-through effectively changes the G/M/1 queue [2] at the controller to a G/M/n queue at the disks. Sorta like:

1. write-back controller
   (ZFS) N*#vdev I/Os --> controller --> disks
   (ZFS) M/M/n        --> G/M/1      --> M/M/n

2. write-through controller
   (ZFS) N*#vdev I/Os --> disks
   (ZFS) M/M/n        --> G/M/n

This can simply be a case of the middleman becoming the bottleneck.

[1] a "lot" means up to 35 I/Os per vdev for older releases, 4-10 I/Os per vdev for more recent releases
[2] queuing theory enthusiasts will note that ZFS writes do not exhibit an exponential arrival rate at the controller or disks except for sync writes.

> That result is surprising to me. But I have a theory to explain it. When you have
> WriteBack enabled, the OS issues a small write, and the HBA immediately returns to the
> OS: "Yes, it's on nonvolatile storage." So the OS quickly gives it another, and
> another, until the HBA write cache is full. Now the HBA faces the task of writing all
> those tiny writes to disk, and the HBA must simply follow orders, writing a tiny chunk
> to the sector it said it would write, and so on. The HBA cannot effectively consolidate
> the small writes into a larger sequential block write. But if you have the WriteBack
> disabled, and you have an SSD for ZIL, then ZFS can log the tiny operation on SSD, and
> immediately return to the process: "Yes, it's on nonvolatile storage." So the
> application can issue another, and another, and another. ZFS is smart enough to
> aggregate all these tiny write operations into a single larger sequential write before
> sending it to the spindle disks.

I agree, though this paragraph has 3 different thoughts embedded. Taken separately:
  1. queuing surprises people :-)
  2. writeback inserts a middleman with its own queue
  3. separate logs radically change the write workload seen by the controller and disks

> Long story short, the evidence suggests if you have SSD ZIL, you're better off without
> WriteBack on the HBA. And I conjecture the reasoning behind it is because ZFS can write
> buffer better than the HBA can.

I think the way the separate log works is orthogonal. However, not having a separate log can influence the ability of the controller and disks to respond to read requests during this workload.

Perhaps this is a long way around to saying that a well tuned system will have harmony among its parts.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
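A simplified way to see the "middleman becomes the bottleneck" effect described above, using M/M/1 in place of G/M/1 purely as an illustration (the real arrival process is burstier, as footnote [2] notes), and treating the write-back controller as a single server with service rate $\mu$ under offered load $\lambda$: its mean response time is

$$W_{M/M/1} \;=\; \frac{1}{\mu - \lambda},$$

which grows without bound as $\lambda \to \mu$, whereas the write-through path spreads the same burst across $n$ disks and stays stable up to roughly $\lambda < n\mu$. A txg commit that dumps many I/Os per vdev into the single controller queue can therefore saturate the middleman long before the disks themselves would saturate.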
On 4/4/2010 11:04 PM, Edward Ned Harvey wrote:

>> Actually, it's my experience that Sun (and other vendors) do exactly that for you
>> when you buy their parts - at least for rotating drives, I have no experience with
>> SSDs.
>>
>> The Sun disk label shipped on all the drives is set up to make the drive the standard
>> size for that Sun part number. They have to do this since they (for many reasons)
>> have many sources (diff. vendors, even diff. parts from the same vendor) for the
>> actual disks they use for a particular Sun part number.
>
> Actually, if there is an fdisk partition and/or disk label on a drive when it arrives,
> I'm pretty sure that's irrelevant. When I first connect a new drive to the HBA, the
> HBA has to sign and initialize the drive at a lower level than what the OS normally
> sees. So unless I do some sort of special operation to tell the HBA to preserve/import
> a foreign disk, the HBA will make the disk blank before the OS sees it anyway.

That may be true. Though these days they may be spec'ing the drives to the manufacturers at an even lower level.

So does your HBA have newer firmware now than it did when the first disk was connected? Maybe it's the HBA that is handling the new disks differently now than it did when the first one was plugged in?

Can you down-rev the HBA FW? Do you have another HBA that might still have the older rev you could test it on?

-Kyle
> From: Kyle McDonald [mailto:kmcdonald at egenera.com]
>
> So does your HBA have newer firmware now than it did when the first disk was connected?
> Maybe it's the HBA that is handling the new disks differently now than it did when the
> first one was plugged in?
>
> Can you down-rev the HBA FW? Do you have another HBA that might still have the older
> rev you could test it on?

I'm planning to get the support guys more involved tomorrow; things have been pretty stagnant for several days now, and I think it's time to start putting more effort into this. Long story short, I don't know yet. But there is one glaring clue: prior to OS installation, I don't know how to configure the HBA. This means the HBA must have been preconfigured with the factory-installed disks, and I followed a different process with my new disks, because I was using the GUI within the OS.

My best hope right now is to find some other way to configure the HBA, possibly through the ILOM, but I already searched there and looked at everything. Maybe I have to shut down (power cycle) the system and attach a keyboard & monitor. I don't know yet...
Hi Roch,

> Can you try 4 concurrent tars to four different ZFS filesystems (same pool).

Hmmm, you're on to something here:

http://www.science.uva.nl/~jeroen/zil_compared_e1000_iostat_iops_svc_t_10sec_interval.pdf

In short: when using two exported file systems, total time goes down to around 4 minutes (IOPS maxes out at around 5500 when adding all four vmods together). When using four file systems, total time goes down to around 3min30s (IOPS maxing out at about 9500).

I figured it is either NFS or a per-file-system data structure in the ZFS/ZIL interface. To rule out NFS, I tried exporting two directories using "default NFS" shares (via /etc/dfs/dfstab entries). To my surprise this seems to bypass the ZIL altogether (dropping to 100 IOPS, which results from our RAIDZ2 configuration). So clearly "ZFS sharenfs" is more than a nice front end for NFS configuration :).

But back to your suggestion: you clearly had a hypothesis behind your question. Care to elaborate?

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
>> We ran into something similar with these drives in an X4170 that turned out to be an
>> issue of the preconfigured logical volumes on the drives. Once we made sure all of our
>> Sun PCI HBAs were running the exact same version of firmware and recreated the volumes
>> on new drives arriving from Sun, we got back into sync on the X25-E device sizes.
>
> Can you elaborate? Just today, we got the replacement drive that has precisely the
> right version of firmware and everything. Still, when we plugged in that drive and
> did "create simple volume" in the StorageTek RAID utility, the new drive is 0.001 GB
> smaller than the old drive. I'm still hosed.
>
> Are you saying I might benefit by sticking the SSD into some laptop and zeroing the
> disk? And then attaching it to the Sun server?
>
> Are you saying I might benefit by finding some other way to make the drive available,
> instead of using the StorageTek RAID utility?
>
> Thanks for the suggestions...

Sorry for the double post. Since the wrong-sized drive was discussed in two separate threads, I want to stick a link here to the other one, where the question was answered. Just in case anyone comes across this discussion by search or whatever...

http://mail.opensolaris.org/pipermail/zfs-discuss/2010-April/039669.html
Hi list,

> If you're running solaris proper, you better mirror your ZIL log device.
...
> I plan to get to test this as well, won't be until late next week though.

Running OSOL nv130. Powered off the machine, removed the F20 and powered back on. The machine boots OK and comes up "normally" with the following message in 'zpool status':

...
  pool: mypool
 state: FAULTED
status: An intent log record could not be read.
        Waiting for administrator intervention to fix the faulted pool.
action: Either restore the affected device(s) and run 'zpool online',
        or ignore the intent log records by running 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-K4
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      FAULTED      0     0     0  bad intent log
...

Nice! Running a later version of ZFS seems to lessen the need for ZIL mirroring...

With kind regards,

Jeroen
-- 
This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
> On Behalf Of Jeroen Roodhart
>
>> If you're running solaris proper, you better mirror your ZIL log device.
> ...
>> I plan to get to test this as well, won't be until late next week though.
>
> Running OSOL nv130. Powered off the machine, removed the F20 and powered back on.
> The machine boots OK and comes up "normally" [...]
>
> Nice! Running a later version of ZFS seems to lessen the need for ZIL mirroring...

Yes. Since zpool version 19 - which is not available in any version of Solaris proper yet, and is not available in OSOL 2009.06 unless you update to "developer builds" - you have the ability to "zpool remove" log devices. And if a log device fails during operation, the system is supposed to fall back and just start using ZIL blocks from the main pool instead.

So for zpool <19, mirroring is *strongly* recommended: mirror your log device if you care about using your pool. And for zpool >=19 the recommendation would be ... don't mirror your log device. If you have more than one, just add them both unmirrored.

I edited the ZFS Best Practices guide yesterday to reflect these changes.

I always have a shade of doubt about things that are "supposed to" do something. Later this week, I am building an OSOL machine, updating it, adding an unmirrored log device, starting a sync-write benchmark (to ensure the log device is heavily in use), and then I'm going to yank out the log device and see what happens.
On 7 apr 2010, at 14.28, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
>> On Behalf Of Jeroen Roodhart
>>
>>> If you're running solaris proper, you better mirror your ZIL log device.
>> ...
>>> I plan to get to test this as well, won't be until late next week though.
>>
>> Running OSOL nv130. Powered off the machine, removed the F20 and powered back on.
>> The machine boots OK and comes up "normally" [...]
>>
>> Nice! Running a later version of ZFS seems to lessen the need for ZIL mirroring...
>
> Yes. Since zpool version 19 - which is not available in any version of Solaris proper
> yet, and is not available in OSOL 2009.06 unless you update to "developer builds" -
> you have the ability to "zpool remove" log devices. And if a log device fails during
> operation, the system is supposed to fall back and just start using ZIL blocks from
> the main pool instead.
>
> So for zpool <19, mirroring is *strongly* recommended: mirror your log device if you
> care about using your pool. And for zpool >=19 the recommendation would be ... don't
> mirror your log device. If you have more than one, just add them both unmirrored.

Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds before the crash, you don't have to mirror your log device.

For a file server, mail server, etc. etc., where things are stored and supposed to be available later, you almost certainly want redundancy on your slog too. (There may be file servers where this doesn't apply, but they are special cases that should not be mentioned in the general documentation.)

> I edited the ZFS Best Practices guide yesterday to reflect these changes.

I'd say that "In zpool version 19 or greater, it is recommended not to mirror log devices." is not very good advice and should be changed.

/ragge
On 07/04/2010 13:58, Ragnar Sundblad wrote:

> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
> before the crash, you don't have to mirror your log device.
>
> For a file server, mail server, etc. etc., where things are stored and supposed to be
> available later, you almost certainly want redundancy on your slog too. (There may be
> file servers where this doesn't apply, but they are special cases that should not be
> mentioned in the general documentation.)

While I agree with you, I want to mention that it is all about understanding a risk. In this case, not only does your server have to crash in such a way that data has not been synced (sudden power loss, for example), but there would also have to be some data committed to the slog device(s) which was not yet written to the main pool, and when your server restarts the slog device would have to have completely died as well. Other than that, you are fine even with an unmirrored slog device.

-- 
Robert Milkowski
http://milek.blogspot.com
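To make the risk calculus above explicit, here is a back-of-the-envelope framing (no measured numbers, just the structure of the argument): losing committed sync data with an unmirrored slog requires the coincidence of two events, so roughly

$$P(\text{sync-data loss}) \;\approx\; P(\text{unclean shutdown with unreplayed slog records}) \times P(\text{slog unreadable at import} \mid \text{such a shutdown}),$$

and the product is only small if the two events are approximately independent; a shared power or hardware fault can correlate them, which is precisely the case where a mirrored slog earns its keep.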
On Wed, 7 Apr 2010, Ragnar Sundblad wrote:

>> So for zpool <19, mirroring is *strongly* recommended: mirror your log device if you
>> care about using your pool. And for zpool >=19 the recommendation would be ... don't
>> mirror your log device. If you have more than one, just add them both unmirrored.
>
> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
> before the crash, you don't have to mirror your log device.

It is also worth pointing out that in normal operation the slog is essentially a write-only device which is only read at boot time. The writes are assumed to work if the device claims "success". If the log device fails to read (oops!), then a mirror would be quite useful.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 07/04/2010 15:35, Bob Friesenhahn wrote:

> On Wed, 7 Apr 2010, Ragnar Sundblad wrote:
>>> So for zpool <19, mirroring is *strongly* recommended: mirror your log device if you
>>> care about using your pool. And for zpool >=19 the recommendation would be ... don't
>>> mirror your log device. If you have more than one, just add them both unmirrored.
>>
>> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
>> before the crash, you don't have to mirror your log device.
>
> It is also worth pointing out that in normal operation the slog is essentially a
> write-only device which is only read at boot time. The writes are assumed to work if
> the device claims "success". If the log device fails to read (oops!), then a mirror
> would be quite useful.

It is only read at boot if there is uncommitted data on it - during normal reboots ZFS won't read data from the slog.

-- 
Robert Milkowski
http://milek.blogspot.com
On Wed, 7 Apr 2010, Robert Milkowski wrote:

> It is only read at boot if there is uncommitted data on it - during normal reboots ZFS
> won't read data from the slog.

How does ZFS know if there is uncommitted data on the slog device without reading it? The minimal read would be quite small, but it seems that a read is still required.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 04/07/10 09:19, Bob Friesenhahn wrote:

> On Wed, 7 Apr 2010, Robert Milkowski wrote:
>> It is only read at boot if there is uncommitted data on it - during normal reboots
>> ZFS won't read data from the slog.
>
> How does ZFS know if there is uncommitted data on the slog device without reading it?
> The minimal read would be quite small, but it seems that a read is still required.
>
> Bob

If there's ever been synchronous activity, then there is an empty tail block ("stubby") that will be read even after a clean shutdown.

Neil.
> From: Ragnar Sundblad [mailto:ragge at csc.kth.se]
>
> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
> before the crash, you don't have to mirror your log device.

If you have a system crash *and* a failed log device at the same time, this is an important consideration. But if you have either a system crash or a failed log device that don't happen at the same time, then your sync writes are safe, right up to the nanosecond, using an unmirrored nonvolatile log device on zpool >= 19.

> I'd say that "In zpool version 19 or greater, it is recommended not to mirror log
> devices." is not very good advice and should be changed.

See above. Still disagree?

If desired, I could clarify the statement by basically pasting what's written above.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
> On Behalf Of Bob Friesenhahn
>
> It is also worth pointing out that in normal operation the slog is essentially a
> write-only device which is only read at boot time. The writes are assumed to work if
> the device claims "success". If the log device fails to read (oops!), then a mirror
> would be quite useful.

An excellent point.

BTW, does the system *ever* read from the log device during normal operation? Such as, perhaps, during a scrub? It really would be nice to detect, in advance, the failure of log devices that are claiming to write correctly but which are really unreadable.
On 04/07/10 10:18, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org]
>> On Behalf Of Bob Friesenhahn
>>
>> It is also worth pointing out that in normal operation the slog is essentially a
>> write-only device which is only read at boot time. The writes are assumed to work if
>> the device claims "success". If the log device fails to read (oops!), then a mirror
>> would be quite useful.
>
> An excellent point.
>
> BTW, does the system *ever* read from the log device during normal operation? Such as,
> perhaps, during a scrub? It really would be nice to detect, in advance, the failure of
> log devices that are claiming to write correctly but which are really unreadable.

A scrub will read the log blocks, but only for unplayed logs. Because of the transient nature of the log, and because it operates outside of the transaction group model, it's hard to read the in-flight log blocks to validate them.

There have previously been suggestions to read slogs periodically. I don't know if there's a CR raised for this though.

Neil.
On Wed, 7 Apr 2010, Neil Perrin wrote:> There have previously been suggestions to read slogs periodically. I > don''t know if there''s a CR raised for this though.Roch wrote up CR 6938883 "Need to exercise read from slog dynamically" Regards, markm
On Wed, 7 Apr 2010, Edward Ned Harvey wrote:

>> From: Ragnar Sundblad [mailto:ragge at csc.kth.se]
>>
>> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
>> before the crash, you don't have to mirror your log device.
>
> If you have a system crash *and* a failed log device at the same time, this is an
> important consideration. But if you have either a system crash or a failed log device
> that don't happen at the same time, then your sync writes are safe, right up to the
> nanosecond, using an unmirrored nonvolatile log device on zpool >= 19.

The point is that the slog is a write-only device, and a device which fails such that it acks each write but fails to read the data that it "wrote" could silently fail at any time during the normal operation of the system. It is not necessary for the slog device to fail at the exact same time that the system spontaneously reboots. I don't know if Solaris implements a background scrub of the slog as a normal course of operation, which would cause a device with this sort of failure to be exposed quickly.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Wed, 7 Apr 2010, Edward Ned Harvey wrote:

> BTW, does the system *ever* read from the log device during normal operation? Such as,
> perhaps, during a scrub? It really would be nice to detect, in advance, the failure of
> log devices that are claiming to write correctly but which are really unreadable.

To make matters worse, an SSD with a large cache might satisfy such reads from its cache, so a "scrub" of the (possibly) tiny bit of pending synchronous writes may not validate anything. A lightly loaded slog should usually be empty. We already know that some (many?) SSDs are not very good about persisting writes to FLASH, even after acking a cache flush request.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Apr 7, 2010, at 10:19 AM, Bob Friesenhahn wrote:

> On Wed, 7 Apr 2010, Edward Ned Harvey wrote:
>>> From: Ragnar Sundblad [mailto:ragge at csc.kth.se]
>>>
>>> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30
>>> seconds before the crash, you don't have to mirror your log device.
>>
>> If you have a system crash *and* a failed log device at the same time, this is an
>> important consideration. But if you have either a system crash or a failed log device
>> that don't happen at the same time, then your sync writes are safe, right up to the
>> nanosecond, using an unmirrored nonvolatile log device on zpool >= 19.
>
> The point is that the slog is a write-only device, and a device which fails such that
> it acks each write but fails to read the data that it "wrote" could silently fail at
> any time during the normal operation of the system. It is not necessary for the slog
> device to fail at the exact same time that the system spontaneously reboots. I don't
> know if Solaris implements a background scrub of the slog as a normal course of
> operation, which would cause a device with this sort of failure to be exposed quickly.

You are playing against marginal returns. An ephemeral storage requirement is very different from a permanent storage requirement. For permanent storage services, scrubs work well -- you can have good assurance that if you read the data once, then you will likely be able to read the same data again, with some probability based on the expected decay of the data. For ephemeral data, you do not read the same data more than once, so there is no correlation between reading once and reading again later. In other words, testing the readability of an ephemeral storage service is like a cat chasing its tail. IMHO, this is particularly problematic for contemporary SSDs that implement wear leveling.

<sidebar>
For clusters the same sort of problem exists for path monitoring. If you think about paths (networks, SANs, cups-n-strings) then there is no assurance that a failed transfer means all subsequent transfers will also fail. Some other permanence test is required to predict future transfer failures. s/fail/pass/g
</sidebar>

Bottom line: if you are more paranoid, mirror the separate log devices and sleep through the night. Pleasant dreams! :-)
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
>>>>> "jr" == Jeroen Roodhart <j.r.roodhart at uva.nl> writes:jr> Running OSOL nv130. Power off the machine, removed the F20 and jr> power back on. Machines boots OK and comes up "normally" with jr> the following message in ''zpool status'': yeah, but try it again and this time put rpool on the F20 as well and try to import the pool from a LiveCD: if you lose zpool.cache at this stage, your pool is toast.</end repeat mode> -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100407/3f4076b4/attachment.bin>
On 7 apr 2010, at 18.13, Edward Ned Harvey wrote:

>> From: Ragnar Sundblad [mailto:ragge at csc.kth.se]
>>
>> Rather: ... >=19 would be ... if you don't mind losing data written in the ~30 seconds
>> before the crash, you don't have to mirror your log device.
>
> If you have a system crash *and* a failed log device at the same time, this is an
> important consideration. But if you have either a system crash or a failed log device
> that don't happen at the same time, then your sync writes are safe, right up to the
> nanosecond, using an unmirrored nonvolatile log device on zpool >= 19.

Right, but if you have a power or a hardware problem, chances are that more things really break at the same time, including the slog device(s).

>> I'd say that "In zpool version 19 or greater, it is recommended not to mirror log
>> devices." is not very good advice and should be changed.
>
> See above. Still disagree?
>
> If desired, I could clarify the statement by basically pasting what's written above.

I believe that for a mail server, an NFS server (to be spec compliant), a general-purpose file server and the like, where the last written data is as important as older data (maybe even more so), it would be wise to have at least as good redundancy on the slog as on the data disks. If one can stand the (pretty small) risk of losing the last transaction group before a crash, at the moment typically up to the last 30 seconds of changes, you may have less redundancy on the slog. (And if you don't care at all, like on a web cache perhaps, you could of course disable the ZIL altogether - that is kind of the other end of the scale, which puts this in perspective.)

As Robert M so wisely and simply put it: it is all about understanding a risk. I think the documentation should help people make educated decisions, though I am not right now sure how to put the words to describe this in an easily understandable way.

/ragge