Hi all,

I'm planning a new build based on a SuperMicro chassis with 16 bays. I am looking to use up to 4 of the bays for SSD devices. After reading many posts about SSDs I believe I have a _basic_ understanding of a reasonable approach to utilizing SSDs for ZIL and L2ARC. Namely:

  ZIL:   Intel X-25E
  L2ARC: Intel X-25M

So, I am somewhat unclear about a couple of details surrounding the deployment of these devices.

1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be mirrored?

2) ZIL write cache. It appears some have disabled the write cache on the X-25E. This results in a 5 fold performance hit but it eliminates a potential mechanism for data loss. Is this valid? If I can mirror ZIL, I imagine this is no longer a concern?

3) SATA devices on a SAS backplane. Assuming the main drives are SAS, what impact do the SATA SSDs have? Any performance impact? I realize I could use an onboard SATA controller for the SSDs, however this complicates things in terms of the mounting of these drives.

thanks!
On Sat, 17 Apr 2010, Dave Vrona wrote:

> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs
> be mirrored ?

Mirroring the intent log is a good idea, particularly for ZFS versions which don't support removing the intent log device.

> 2) ZIL write cache. It appears some have disabled the write cache
> on the X-25E. This results in a 5 fold performance hit but it
> eliminates a potential mechanism for data loss. Is this valid? If
> I can mirror ZIL, I imagine this is no longer a concern?

It is not necessary to disable the write cache if the device responds correctly to cache flush requests. The intent log is flushed frequently. Previously some have reported (based on testing) that the X-25E does not flush the write cache reliably when it is enabled. It may be that some X-25E versions work better than others.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
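For reference, a mirrored intent log is added with the zpool add subcommand. A minimal sketch; the pool name "tank" and the c#t#d# device names below are placeholders, not anything from this thread:

  # Add a mirrored intent-log (slog) pair to an existing pool.
  zpool add tank log mirror c2t0d0 c2t1d0

  # Verify the layout; the log devices appear under a "logs" section.
  zpool status tank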
On 04/17/10 07:59, Dave Vrona wrote:

> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs
> be mirrored ?

L2ARC cannot be mirrored -- and doesn't need to be. The contents are checksummed; if the checksum doesn't match, it's treated as a cache miss and the block is re-read from the main pool disks.

The ZIL can be mirrored, and mirroring it improves your ability to recover the pool in the face of multiple failures.

> 2) ZIL write cache. It appears some have disabled the write cache on
> the X-25E. This results in a 5 fold performance hit but it
> eliminates a potential mechanism for data loss. Is this valid?

With the ZIL disabled, you may lose the last ~30s of writes to the pool (the transaction group being assembled and written at the time of the crash).

With the ZIL on a device with a write cache that ignores cache flush requests, you may lose the tail of some of the intent logs, starting with the first block in each log which wasn't readable after the restart. (I say "may" rather than "will" because some failures may not result in the loss of the write cache.) Depending on how quickly your ZIL device pushes writes from cache to stable storage, this may narrow the window from ~30s to less than 1s, but doesn't close the window entirely.

> If I can mirror ZIL, I imagine this is no longer a concern?

Mirroring a ZIL device with a volatile write cache doesn't eliminate this risk. Whether it reduces the risk depends on precisely *what* caused your system to crash and reboot; if the failure also causes loss of the write cache contents on both sides of the mirror, mirroring won't help.

- Bill
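Since L2ARC devices cannot be mirrored, they are simply added as cache vdevs and reads are spread across them. A minimal sketch with placeholder pool and device names:

  # Add two L2ARC devices; cached data is striped across them.
  zpool add tank cache c2t2d0 c2t3d0

  # A failed cache device only costs cache hits; cache devices can be
  # removed again at any time on any pool version that supports L2ARC.
  zpool remove tank c2t3d0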
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dave Vrona
>
> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be
> mirrored ?

IMHO, the best answer to this question is the one from the ZFS Best Practices guide. (I wrote part of it.) In short:

You have no need to mirror your L2ARC cache device, and it's impossible even if you want to for some bizarre reason.

For zpool < 19, which includes all present releases of Solaris 10 and OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed unmirrored log device would be the permanent death of the pool.

For zpool >= 19, which is available in the developer builds downloadable from genunix, you need to make your own decision: If you have an unmirrored log device fail, *or* an ungraceful system crash, there is no problem. But if you have both, then you lose the latest writes leading up to the crash. You don't lose your whole pool. There are some scenarios where it's possible for the failing log device to go undetected until after the ungraceful reboot, in which case you lose the latest data, but not the whole pool.

Personally, I recommend the latest build from genunix, and I recommend no mirroring for log devices, except in the most critical of situations, such as a machine that processes credit card transactions or stuff like that.

> 2) ZIL write cache. It appears some have disabled the write cache on
> the X-25E. This results in a 5 fold performance hit but it eliminates
> a potential mechanism for data loss. Is this valid? If I can mirror
> ZIL, I imagine this is no longer a concern?

This disagrees with my measurements. If you have a dedicated log device, I found the best performance by disabling all the write cache on all the devices (disk and HBA). This is because ZFS has inner knowledge of the filesystem, and knowledge of the block-level devices, while the HBA only has knowledge of the block-level devices and no knowledge of the filesystem. Long story short, ZFS does a better job of write buffering and utilizing the devices available. Details are in the ZFS Best Practices guide.

> 3) SATA devices on a SAS backplane. Assuming the main drives are SAS,
> what impact do the SATA SSDs have? Any performance impact? I realize
> I could use an onboard SATA controller for the SSDs however this
> complicates things in terms of the mounting of these drives.

SATA SSD devices on the SAS backplane is precisely what you should do. This works perfectly, and this is the configuration I used when I produced the measurements described above.
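To check which of those categories a given system falls into, the pool version can be queried directly. A small sketch, assuming a pool named tank and placeholder device names:

  # Show the pool's on-disk version and the highest version this build supports.
  zpool get version tank
  zpool upgrade -v

  # On pool version 19 or later, a dedicated log device can be removed again,
  # for example if you reconfigure or the SSD starts misbehaving:
  zpool remove tank c2t5d0        # unmirrored slog
  zpool detach tank c2t1d0        # drop one side of a mirrored slog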
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Dave Vrona
>>
>> 2) ZIL write cache. It appears some have disabled the write cache on
>> the X-25E. This results in a 5 fold performance hit but it eliminates
>> a potential mechanism for data loss. Is this valid? If I can mirror
>> ZIL, I imagine this is no longer a concern?

Ahh, I see there may have been some confusion there, because your question wasn't asked right. ;-)

"Disabling ZIL" is not the same thing as "disabling write cache." Those two terms are not to be mixed.

The write cache is either the volatile memory on the disk, or the presumably nonvolatile memory in the HBA. You should never enable volatile disk write cache. You should only enable the HBA writeback cache if (a) the HBA has nonvolatile memory, such as battery backed up, and (b) you don't have a dedicated ZIL log device.

The ZIL, generally speaking, should not be disabled. There are some cases where it's OK, but generally speaking don't do it. The justification is thus: Disabling the ZIL makes would-be sync writes into async writes, which are faster, but prone to disappearance caused by ungraceful system shutdown. If you trust your applications to only issue sync writes when they actually need to, and to do async writes whenever that's OK, then you should not disable your ZIL. The only time to disable your ZIL is when you believe your applications are performing sync writes unnecessarily, hurting their own performance, and you're not worried about losing the latest ~30 seconds of supposedly already-written data after an ungraceful shutdown.

PS: if you are in the latter case and you do disable your ZIL, then there's no point in either HBA writeback cache or a ZIL log device. The ZIL log device is only used for sync writes, and it will be 100% unused if you disable the ZIL. Also, HBA writeback does not benefit ZFS for async writes.
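The distinction is easier to see from how each is controlled. On the Solaris 10 / OpenSolaris builds of this era, disabling the ZIL is a system-wide kernel tunable (this is the mechanism described in the ZFS Evil Tuning Guide), while the write cache is a per-device setting. A hedged sketch, for illustration only:

  # /etc/system -- disables the ZIL for ALL pools after the next reboot.
  # Sync write semantics are lost; only consider this if you accept losing
  # the last ~30 seconds of "committed" data after a crash.
  set zfs:zil_disable = 1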
On 17 apr 2010, at 20.51, Edward Ned Harvey wrote:

>> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be
>> mirrored ?
...
> Personally, I recommend the latest build from genunix, and I recommend no
> mirroring for log devices, except in the most critical of situations, such
> as a machine that processes credit card transactions or stuff like that.

It depends on whether you think this is a risk or not. Personally I do - disk systems are always acting up at the same time as you have other problems, in addition to the other times. Those SSDs I have tried and read about seem to be even more crappy than disks in general, and I wouldn't trust them for about anything that I want to keep.

If you handle something that you really must not lose, you should probably have some other redundancy too, like parallel servers in different locations, and you may skip redundancy on many components in each individual server.

If you have a more standard application, a restricted budget, and still don't want to lose file system transactions, I believe you should have at least as good redundancy on your ZILs as on the rest of the disk system. Examples are:

- Mail servers (you are not allowed to lose email).
- NFS servers (to follow the protocol, not lose user data, and not leave clients in undefined/bad states).
- A general file server or other application server where people expect the bits they have put in there to be there, even though the server happened to crash.
- All other applications where you want to take as many steps as possible to not lose data.

On 17 apr 2010, at 21.09, Edward Ned Harvey wrote:

>>> 2) ZIL write cache. It appears some have disabled the write cache on
>>> the X-25E. This results in a 5 fold performance hit but it eliminates
>>> a potential mechanism for data loss. Is this valid? If I can mirror
>>> ZIL, I imagine this is no longer a concern?
...

I'd say it is of concern - the X25-E just straight-out ignores cache flush commands (idiots!), but disabling the write cache *seems* to put it in more of a write-through mode, so that cache flushing shouldn't be needed. Some tests have shown that it *may* lose one or a few transactions anyway if it suddenly loses power.

The X25-E is, sadly, not a storage device worth the name, at least not until Intel has fixed the problems with it, which doesn't seem to be happening. Sadly, Intel seems to just keep quiet and ignore those few that are actually checking out what their disks are doing - they still sell a lot to all the others, I guess.

...
> The write cache is either the volatile memory on the disk, or the presumably
> nonvolatile memory in the HBA. You should never enable volatile disk write
> cache.

This is not correct. ZFS normally enables the write cache, and assumes the devices correctly honor cache flush commands. Sadly, there are devices out there that ignore them, because they want to look like they have higher performance than they have, or because of bugs, or in some cases because of just plain ignorance from the implementors. The Intel X25-E is sadly one of those bad devices.

[Traditionally, Solaris has always tried to disable volatile disk write caches; zfs changes this.]

> You should only enable the HBA writeback cache if (a) the HBA has
> nonvolatile memory, such as battery backed up, and (b) you don't have a
> dedicated ZIL log device.

And (c) you don't mind having the same problem with non-redundancy as for the ZIL device: If you have writeback caching enabled in your HBA and your HBA fails, some of your data will be lost in the HBA cache, a bit similar to the ZIL case. Your file system may be in a little worse state, since zfs is always consistent on storage, but if some of the recently written storage is lost in the HBA cache, it won't be consistent on disk.

And HBAs do seem to fail a bit more often than most other computer boards, for some strange reason.

/ragge
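If you do want to experiment with turning a drive's volatile write cache off, Solaris exposes this through the expert mode of format(1M) for devices whose driver supports it. A hedged sketch of the interactive menu path; whether the cache menu appears for a given SATA SSD depends on how the HBA driver presents the device, so treat this as an assumption to verify on your own hardware:

  # Expert mode of format(1M).
  format -e
  #   > select the SSD from the disk list
  #   format> cache
  #   cache> write_cache
  #   write_cache> display     # show current state
  #   write_cache> disable     # turn the volatile write cache off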
>>> 2) ZIL write cache. It appears some have disabled the write cache on
>>> the X-25E. This results in a 5 fold performance hit but it eliminates
>>> a potential mechanism for data loss. Is this valid? If I can mirror
>>> ZIL, I imagine this is no longer a concern?
>
> Ahh, I see there may have been some confusion there, because your question
> wasn't asked right. ;-)
>
> "Disabling ZIL" is not the same thing as "disabling write cache." Those two
> terms are not to be mixed.

My statement was less than perfectly worded. I specifically meant disabling the write cache on the X-25E that is holding the ZIL. I certainly didn't mean to imply disabling the ZIL.
Ok, so originally I presented the X-25E as a "reasonable" approach. After reading the follow-ups, I'm second-guessing my statement.

Any decent alternatives at a reasonable price?
On 18 apr 2010, at 00.52, Dave Vrona wrote:

> Ok, so originally I presented the X-25E as a "reasonable" approach.
> After reading the follow-ups, I'm second-guessing my statement.
>
> Any decent alternatives at a reasonable price?

How much is reasonable? :-)

I guess there are STEC drives that should work for slogs (ZIL devices), but I haven't tried them yet, and haven't read about many that have, except in euphoric reports from Sun users that got them from Sun. Would be really interesting to try.

I think Sun/Oracle actually has sold X25-Es themselves, possibly with Sun/Oracle firmware. I don't know if those drives are as bad as the Intel-branded ones.

For L2ARC, about any drive would do, I think; it is just not critical for the file system in any way (but could be critical for your application).

I'd also really like to hear the zfs developers' view on this subject; I guess they have tested many of these drives and problems in their labs.

/ragge
On Apr 17, 2010, at 11:51 AM, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Dave Vrona
>>
>> 1) Mirroring. Leaving cost out of it, should ZIL and/or L2ARC SSDs be
>> mirrored ?
>
> IMHO, the best answer to this question is the one from the ZFS Best
> Practices guide. (I wrote part of it.) In short:
>
> You have no need to mirror your L2ARC cache device, and it's impossible even
> if you want to for some bizarre reason.
>
> For zpool < 19, which includes all present releases of Solaris 10 and
> OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed
> unmirrored log device would be the permanent death of the pool.

I do not believe this is a true statement. In large part it will depend on the nature of the failure -- all failures are not created equal. It has also been shown that such pools are recoverable, albeit with tedious, manual procedures required. Rather than saying this is a "critical" issue, I could say it is "preferred." Indeed, there are *many* SPOFs in the typical system (any x86 system) which can be considered similarly "critical."

Finally, you have choices -- you can use an HBA with nonvolatile write cache and avoid the need for a separate log device.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On 18 apr 2010, at 06.43, Richard Elling wrote:

>> For zpool < 19, which includes all present releases of Solaris 10 and
>> OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed
>> unmirrored log device would be the permanent death of the pool.
>
> I do not believe this is a true statement. In large part it will depend on
> the nature of the failure -- all failures are not created equal. It has also
> been shown that such pools are recoverable, albeit with tedious, manual
> procedures required. Rather than saying this is a "critical" issue, I could
> say it is "preferred." Indeed, there are *many* SPOFs in the typical system
> (any x86 system) which can be considered similarly "critical."

Yes there are. The thing is that a common situation is that the most valuable thing is the data itself, often more than either 0.999[0-9]* uptime and certainly more than the machine itself. If so, you want very good redundancy on your data, and don't care much about (live) redundancy on the machine. You just take the disks and slam them into another machine - physically, by means of FC or SAS, virtually, or whatever. (You may want to have a spare machine standing by to save time, though.) It is often not very expensive to get quite a bit of redundancy in your data; running parallel systems is often much more complicated and expensive.

That the data possibly could be recovered with tedious procedures, with experts doing it by hand, is not good enough a crash recovery plan for many of us - in a crash situation you want your data to be there and be safe, and you just have to figure out how to access it, and you are probably interested in making that happen as quickly as possible. Hopefully you have planned for the procedure already. That said, it is good that the manual option is there if you get in deep trouble.

At least this is our reasoning when we set up our server machines...

> Finally, you have choices -- you can use an HBA with nonvolatile write
> cache and avoid the need for separate log device.

Except that then that HBA is a non-redundant place, a SPOF, where you store your data, and a place where you could lose data. As long as you know that and know that you can take that, everything is fine.

Again, it all depends on the application, I guess, and giving general advice is nearly impossible.

/ragge
>> On 18 apr 2010, at 00.52, Dave Vrona wrote:
>>
>>> Ok, so originally I presented the X-25E as a "reasonable" approach.
>>> After reading the follow-ups, I'm second-guessing my statement.
>>>
>>> Any decent alternatives at a reasonable price?
>>
>> How much is reasonable? :-)

How about $1000 per device? $2000 for a mirrored pair.
The Acard device mentioned in this thread looks interesting:

http://opensolaris.org/jive/thread.jspa?messageID=401719
> From: Richard Elling [mailto:richard.elling at gmail.com]
>
> On Apr 17, 2010, at 11:51 AM, Edward Ned Harvey wrote:
>
>> For zpool < 19, which includes all present releases of Solaris 10 and
>> OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed
>> unmirrored log device would be the permanent death of the pool.
>
> I do not believe this is a true statement. In large part it will depend on
> the nature of the failure -- all failures are not created equal. It has also
> been shown that such pools are recoverable, albeit with tedious, manual
> procedures required. Rather than saying this is a "critical" issue, I could
> say it is "preferred." Indeed, there are *many* SPOFs in the typical system
> (any x86 system) which can be considered similarly "critical."

Could you please describe a type of failure of an unmirrored log device in zpool < 19 which does not result in the pool being faulted and unable to import? I don't know of any.

If you have a faulted zpool < 19, due to a faulted nonmirrored log device, could you describe how it's possible to recover that pool? I know I tried and couldn't do it, but then again, it was only a test pool. I only dedicated an hour of labor to trying.

> Finally, you have choices -- you can use an HBA with nonvolatile write
> cache and avoid the need for separate log device.

The HBA with nonvolatile cache gains a lot over just plain disks. By my measurement, 2x-3x faster for sync writes, but no improvement for async writes, or any reads. But it's not as effective as using a dedicated SSD for the log device. By my measurement, using an SSD for the log device (with all the HBA write cache disabled) was about 3x-4x faster than just plain disks for sync writes, but no different for async writes, or any reads.

I agree with you, HBA nonvolatile write cache is an option. It's cheaper than buying an SSD, and it doesn't consume a slot. Better than nothing. Depends on what your design requirements are, and how much you care about sync write performance.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Dave Vrona
>
>> How much is reasonable? :-)
>
> How about $1000 per device? $2000 for a mirrored pair.

That's how much I paid for my Intel SSDs, Sun branded. I think the Intel SSDs are the industry standard, at least for now.
Or, a DDRdrive X1? Would the X1 need to be mirrored?
IMHO, whether a dedicated log device needs redundancy (mirrored) should be determined by the dynamics of each end-user environment (zpool version, goals/priorities, and budget).

If mirroring is deemed important, a key benefit of the DDRdrive X1 is the HBA / storage device integration. For example, to approach the redundancy of a mirrored DDRdrive X1 pair, a SATA Flash based SSD solution would require each SSD to have a dedicated HBA controller, as sharing an HBA between the two mirrored SSDs would introduce a single point of failure not existing in the X1 configuration. Even with dedicated HBAs, removing the need for SATA cables while halving both the controller count and data path travel will notably increase reliability.

It should be mentioned, one plus for a mirrored Flash SSD with dedicated HBAs (no cache, or write-through) is the lack of required power protection.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

 >> A failed unmirrored log device would be the
 >> permanent death of the pool.

 re> It has also been shown that such pools are recoverable, albeit
 re> with tedious, manual procedures required.

for the 100th time, No, they're not, not if you lose zpool.cache also.
> IMHO, whether a dedicated log device needs redundancy (mirrored), should
> be determined by the dynamics of each end-user environment (zpool version,
> goals/priorities, and budget).

Well, I populate a chassis with dual HBAs because my _perception_ is they tend to fail more than other cards. Please help me with my perception of the X1. :-)
On Apr 18, 2010, at 10:48 AM, Miles Nordin wrote:

>>> A failed unmirrored log device would be the
>>> permanent death of the pool.
>
> re> It has also been shown that such pools are recoverable, albeit
> re> with tedious, manual procedures required.
>
> for the 100th time, No, they're not, not if you lose zpool.cache also.

It is disingenuous to complain about multiple failures in a system which has so many single points of failure. Also, a well managed system will not lose zpool.cache or any other file.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Apr 18, 2010, at 5:23 AM, Edward Ned Harvey wrote:

>> From: Richard Elling [mailto:richard.elling at gmail.com]
>>
>> On Apr 17, 2010, at 11:51 AM, Edward Ned Harvey wrote:
>>
>>> For zpool < 19, which includes all present releases of Solaris 10 and
>>> OpenSolaris 2009.06, it is critical to mirror your ZIL log device. A failed
>>> unmirrored log device would be the permanent death of the pool.
>>
>> I do not believe this is a true statement. In large part it will depend on
>> the nature of the failure -- all failures are not created equal. [...]
>
> Could you please describe a type of failure of an unmirrored log device in
> zpool < 19 which does not result in the pool being faulted and unable to
> import? I don't know of any.

The most common failure mode on HDDs and, it seems, SSDs is a nonrecoverable read. A nonrecoverable read failure on your separate log device will not cause the pool to fail import.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
There is no definitive answer (yes or no) on whether to mirror a dedicated log device, as reliability is one of many variables. This leads me to the frequently given but never satisfying "it depends".

In a time when too many good questions go unanswered, let me take advantage of our less rigid "rules of engagement" and share some facts about the DDRdrive X1 which are uncommonly shared:

- 12 Layer PCB (layman translation - more layers, better SI, higher cost)
- Nelco N4000-13 EP Laminate (extremely high quality, a price to match)
- Solid Via Construction (hold a X1 in front of a bright light - no holes :-)
- "Best of Breed" components, all 520 of them
- Assembled and validated in Northern CA, USA
- 1.5 weeks of test/burn-in of every X1 (extensive DRAM validation)

In summary, the DDRdrive X1 is designed, built and tested with immense pride and an overwhelming attention to detail.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
On Sun, 18 Apr 2010, Christopher George wrote:

> In summary, the DDRdrive X1 is designed, built and tested with immense
> pride and an overwhelming attention to detail.

Sounds great. What performance does the DDRdrive X1 provide for this simple NFS write test from a single client over gigabit ethernet? This seems to be the test of the day.

  time tar jxf gcc-4.4.3.tar.bz2

I get 22 seconds locally and about 6-1/2 minutes from an NFS client.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
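A useful companion to this kind of client-side timing is watching the server's ZIL traffic while the test runs. One option is Richard Elling's zilstat, a DTrace-based script that is downloaded separately (it is not part of the base OS); another is plain iostat against the slog device. A sketch, with the device name below being a placeholder:

  # zilstat (third-party script): bytes and ops going through the ZIL,
  # one line per second; run it on the NFS server during the extraction.
  ./zilstat 1

  # Or watch the slog device itself (c2t0d0 is an example device name).
  iostat -xn c2t0d0 1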
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> a well managed system will not lose zpool.cache or any other
    re> file.

I would complain this was circular reasoning if it weren't such obvious chest-puffing bullshit. It's normal, even to the extent of being a best practice, to have no redundancy for rpool on systems that can tolerate gaps in availability, because you can reinstall from the livecd relatively quickly.

    re> It is disingenuous to complain about multiple failures

strongly disagree. I'm quite genuine.

A really common and really terrible suggestion is, ``get an SSD, and put your rpool in one slice and your slog in another.'' If you do that and lose the SSD, you've lost the whole pool. You cannot recover with 'zpool clear' or any number of -f -F -FFF flags. This common scenario doesn't require any multiple failure.

Now, even among those who don't do this, people following your suggestions will not design their systems realizing the rpool and the SSD make up a redundant pair. They will not see: you can lose the rpool and import the pool IFF you have the SSD, and you can lose the SSD and force-online the pool IFF you have the rpool with the missing-slog pool already imported to it. They will instead design following the raidz/mirroring failure rules, treating the slog as disposable, like you've told them, and this is flat wrong.

Hiding behind fuzzy glossary terms like ``multiple failures'' is useless, IMHO to the point of being deliberately obtuse. Besides that, you don't need any multiple failures---all you need to do is make the mistake of typing the perfectly reasonable command 'zpool export' in the course of trying to fix your problem, and poof, your whole pool is gone.

A pool that runs fine until you try to export and re-import it, after which it is permanently lost, is a ticking time bomb. I don't think it's a good idea to run that way at all because of the flexible tools one needs to have available for maintenance in a disaster (ex., livecd of newer version with special import -F rescue-magic in it, WON'T WORK. moving drives to a different controller causing them to have a different devid, WON'T WORK. accumulate enough of these and not only does your toolkit get smaller and weaker, but you must move slowly and with great fear because the slightest move can make everything explode in totally unobvious ways.)

If you do want to run this way, as an absolute MINIMUM, you need to discuss this cannot-import case at moments like this one so that it can influence people's designs.

It seems if I say it the long way, I get ignored. If I say it the short way, you dive into every corner case. I don't know how to be any more clear, so... good luck out there, y'all.
So if the Intel X25-E is a bad device - can anyone recommend an SLC device with good firmware? (Or an MLC drive that performs as well?)

I've got 80 spindles in 5 16-bay drive shelves (76 15k RPM SAS drives in 19 4-disk raidz sets, 2 hot spares, and 2 bays set aside for a mirrored ZIL) connected to two servers (so if one fails I can import on the other one). Host based cards are not an option for my ZIL - I need something that sits in the array and can be imported by the other system.

I was planning on using a pair of mirrored SLC based Intel X25-Es because of their superior write performance, but if it's going to destroy my pool then it's useless.

Does anyone else have something that can match their write performance without breaking ZFS?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bob Friesenhahn
>
> Sounds great. What performance does the DDRdrive X1 provide for this
> simple NFS write test from a single client over gigabit ethernet?
> This seems to be the test of the day.
>
>   time tar jxf gcc-4.4.3.tar.bz2
>
> I get 22 seconds locally and about 6-1/2 minutes from an NFS client.

There's no point trying to accelerate your disks if you're only going to use a single client over gigabit. Assuming you've got some sort of nontrivial server infrastructure, and you've got many clients doing things simultaneously and more than a 1Gb network connection, then it can become worthwhile. Also, if you do work on the physical server on local disks, that can also be worthwhile.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Don
>
> I've got 80 spindles in 5 16-bay drive shelves (76 15k RPM SAS drives
> in 19 4-disk raidz sets, 2 hot spares, and 2 bays set aside for a
> mirrored ZIL) connected to two servers (so if one fails I can import on
> the other one). Host based cards are not an option for my ZIL - I need
> something that sits in the array and can be imported by the other
> system.
>
> I was planning on using a pair of mirrored SLC based Intel X25-Es
> because of their superior write performance, but if it's going to
> destroy my pool then it's useless.
>
> Does anyone else have something that can match their write performance
> without breaking ZFS?

You're not going to break ZFS with the X25's, if you just get 2 of them and make them a mirror. But be aware that all sync writes will go to these devices, and if you've got 80 spindles, it's possible that 1 mirror might not be enough for your optimal performance. You might gain more by using more than one pair. I don't know any way to test it other than getting your hands on more than one pair and seeing what results you get.
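If one pair does turn out to be the bottleneck, additional mirrored pairs can be added later and ZFS will spread log writes across the log vdevs. A minimal sketch, again with placeholder pool and device names:

  # Start with one mirrored slog pair...
  zpool add tank log mirror c3t0d0 c3t1d0

  # ...and add a second pair later if zilstat/iostat show it saturating;
  # log writes are then balanced across both mirrors.
  zpool add tank log mirror c3t2d0 c3t3d0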
If you have a pair of heads talking to shared disks with ZFS - what can you do to ensure the second head always has a current copy of the zpool.cache file? I'd prefer not to lose the ZIL, fail over, and then suddenly find out I can't import the pool on my second head.
But if the X25-E doesn't honor cache flushes then it really doesn't matter if they are mirrored - they both may cache the data, not write it out, and leave me screwed. I'm running 2009.06 and not one of the newer developer candidates that handle ZIL losses gracefully (or at all - at least as far as I understand things).

As for optimal performance - a single pair probably won't give me optimal performance, but based on all the numbers I've seen it's still going to beat using the pool disks. If I find the ZIL is still a bottleneck I'll definitely add a second set of SSDs - but I've got a lot of testing to do before I get there.
On Sun, Apr 18, 2010 at 07:02:38PM -0700, Don wrote:
> If you have a pair of heads talking to shared disks with ZFS - what can
> you do to ensure the second head always has a current copy of the
> zpool.cache file? I'd prefer not to lose the ZIL, fail over, and then
> suddenly find out I can't import the pool on my second head.

Replicated backups of your running BE, like for many other reasons.

--
Dan.
I'm not sure to what you are referring when you say my "running BE".

I haven't looked at the zpool.cache file too closely, but if the devices don't match between the two systems for some reason - isn't that going to cause a problem? I was really asking if there is a way to build the cache file without importing the disks.
On Sun, 18 Apr 2010, Edward Ned Harvey wrote:

>> This seems to be the test of the day.
>>
>>   time tar jxf gcc-4.4.3.tar.bz2
>>
>> I get 22 seconds locally and about 6-1/2 minutes from an NFS client.
>
> There's no point trying to accelerate your disks if you're only going to use
> a single client over gigabit.

This is a really strange statement. It does not make any sense. It makes about as much sense as saying that if you have only one car there is no need for it to be able to go faster than 10 mph, but if you have 60 cars, then it is worthwhile for the cars to each be able to go 60 mph. The driver of that lone 10 mph car will not be very happy.

On a different discussion thread, one fellow was able to drop the tar file extraction time from 92 minutes to just under 7 minutes. As a user of the client system, he is much happier.

Probably the DDRDrive is able to go faster since it should have lower latency than a FLASH SSD drive. However, it may have some bandwidth limits on its interface.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Sun, Apr 18, 2010 at 10:33:36PM -0500, Bob Friesenhahn wrote:
> Probably the DDRDrive is able to go faster since it should have lower
> latency than a FLASH SSD drive. However, it may have some bandwidth
> limits on its interface.

It clearly has some. They're just as clearly well in excess of those applicable to a SATA-interface SSD, even a DRAM-based one like the acard. In return, the SATA SSD has some deployment options (in an external JBOD, for example) not as readily accessible to a PCI device.

I'd be curious to compare mirroring these kinds of devices across server heads, using comstar and some suitable interconnect, as a comparison to slogs colocated with the drives.

--
Dan.
On Apr 18, 2010, at 7:02 PM, Don wrote:

> If you have a pair of heads talking to shared disks with ZFS - what can you
> do to ensure the second head always has a current copy of the zpool.cache
> file?

By definition, the zpool.cache file is always up to date.

> I'd prefer not to lose the ZIL, fail over, and then suddenly find out I
> can't import the pool on my second head.

I'd rather not have multiple failures, either. But the information needed in the zpool.cache file for reconstructing a missing (as in destroyed) top-level vdev is easily recovered from a backup or snapshot.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
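Keeping such a backup is a one-liner; a sketch, with the backup paths below being placeholders only. zdb -C prints the cached configuration in readable form, which is worth stashing alongside the binary file:

  # Periodically copy the cache file and a human-readable dump of it
  # somewhere off the root pool (destination paths are examples).
  cp /etc/zfs/zpool.cache /backup/zpool.cache.node1
  zdb -C > /backup/zpool-config.node1.txt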
On Sun, Apr 18, 2010 at 07:37:10PM -0700, Don wrote:
> I'm not sure to what you are referring when you say my "running BE"

Running boot environment - the filesystem holding /etc/zpool.cache

--
Dan.
On Mon, Apr 19, 2010 at 03:37:43PM +1000, Daniel Carosone wrote:
> the filesystem holding /etc/zpool.cache

or, indeed, /etc/zfs/zpool.cache :-)

--
Dan.
By the way, I would like to chip in about how informative this thread has been, at least for me, despite (and actually because of) the strong opinions on some of the posts about the issues involved.

From what I gather, there is still an interesting failure possibility with ZFS, although probably rare. In the case where a zil (aka slog) device fails, AND the zpool.cache information is not available, basically folks are toast?

In addition, the zpool.cache itself exhibits the following behaviors (and I could be totally wrong, this is why I ask):

A. It is not written to frequently, i.e., it is not a performance impact unless new zfs file systems (pardon me if I have the incorrect terminology) are not being fabricated and supplied to the underlying operating system.

B. The current implementation stores that cache file on the zil device, so if for some reason, that device is totally lost (along with said .cache file), it is nigh impossible to recover the entire pool it correlates with.

possible solutions:

1. Why not have an option to mirror that darn cache file (like to the root file system of the boot device at least as an initial implementation) no matter what intent log devices are present? Presuming that most folks at least want enough redundancy that their machine will boot, and if it boots - then they have a shot at recovery of the balance of the associated (zfs) directly attached storage, and with my other presumptions above, there is little reason not to offer a feature like this?

Respectfully,
- mike

On Apr 18, 2010, at 10:10 PM, Richard Elling wrote:

> On Apr 18, 2010, at 7:02 PM, Don wrote:
>
>> If you have a pair of heads talking to shared disks with ZFS - what can you
>> do to ensure the second head always has a current copy of the zpool.cache
>> file?
>
> By definition, the zpool.cache file is always up to date.
>
>> I'd prefer not to lose the ZIL, fail over, and then suddenly find out I
>> can't import the pool on my second head.
>
> I'd rather not have multiple failures, either. But the information needed in
> the zpool.cache file for reconstructing a missing (as in destroyed) top-level
> vdev is easily recovered from a backup or snapshot.
> -- richard
Also, pardon my typos, and my lack of re-titling my subject to note that it is a fork from the original topic. Corrections in text that I noticed after finally sorting out getting on the mailing list are below...

On Apr 19, 2010, at 3:26 AM, Michael DeMan wrote:

> A. It is not written to frequently, i.e., it is not a performance impact
> unless new zfs file systems (pardon me if I have the incorrect terminology)
> are not being fabricated and supplied to the underlying operating system.

The above 'are not being fabricated' should be 'are regularly being fabricated'.

> B. The current implementation stores that cache file on the zil device, so
> if for some reason, that device is totally lost (along with said .cache
> file), it is nigh impossible to recover the entire pool it correlates with.

The above, 'on the zil device', should say 'on the fundamental zfs file system itself, or a zil device if one is provisioned'.

> 1. Why not have an option to mirror that darn cache file (like to the root
> file system of the boot device at least as an initial implementation) no
> matter what intent log devices are present? Presuming that most folks at
> least want enough redundancy that their machine will boot, and if it boots -
> then they have a shot at recovery of the balance of the associated (zfs)
> directly attached storage, and with my other presumptions above, there is
> little reason not to offer a feature like this?

Missing final sentence: The vast amount of problems with computer and network reliability is typically related to human error. The more '9s' that can be intrinsically provided by the systems themselves helps mitigate this.
I would advise getting familiar with the basic terminology and vocabulary of ZFS first. Start with the Solaris 10 ZFS Administration Guide. It's a bit more complete for a newbie.

http://docs.sun.com/app/docs/doc/819-5461?l=en

You can then move on to the Best Practices Guide, Configuration Guide, Troubleshooting Guide and Evil Tuning Guide on solarisinternals.com:

http://www.solarisinternals.com//wiki/index.php?title=Category:ZFS

All of the features in ZFS on Solaris 10 appear in OpenSolaris; the inverse does not necessarily hold true, as active development occurs on the OpenSolaris trunk and updates take about a year to filter back down into Solaris due to integration concerns, testing, etc.

A Separate Log (SLOG) device can be used for a ZIL, but they are not necessarily the same thing. The ZIL always exists, and is part of the pool if you have not defined a SLOG device.

The zpool.cache file does not reside in the pool. It lives in /etc/zfs in the root file system of your OpenSolaris system. Thus, it does not reside "on the ZIL device" either, since there may not necessarily be a SLOG (what you would term a "ZIL device") anyway. (There is always a ZIL, though. See remarks above.)

Hopefully that clears up some of the misconceptions and misunderstandings you have.

Cheers!

On Mon, Apr 19, 2010 at 06:52, Michael DeMan <solaris at deman.com> wrote:

> Also, pardon my typos, and my lack of re-titling my subject to note that it
> is a fork from the original topic. Corrections in text that I noticed after
> finally sorting out getting on the mailing list are below...
> [...]

--
"You can choose your friends, you can choose the deals." - Equity Private
"If Linux is faster, it's a Solaris bug." - Phil Harman

Blog - http://whatderass.blogspot.com/
Twitter - @khyron4eva
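To see what the cache file actually contains, and to compare it with what is written on the disks themselves, zdb can read both. A small sketch; the device path is a placeholder, and slice 0 is assumed to hold the ZFS label:

  # Dump the cached pool configuration(s) from /etc/zfs/zpool.cache.
  zdb -C

  # Dump the ZFS labels written on a device itself; the same configuration
  # data lives on every vdev, which is what "zpool import" scans for.
  zdb -l /dev/dsk/c2t0d0s0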
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
>
> On Sun, 18 Apr 2010, Edward Ned Harvey wrote:
>
>>> This seems to be the test of the day.
>>>
>>>   time tar jxf gcc-4.4.3.tar.bz2
>>>
>>> I get 22 seconds locally and about 6-1/2 minutes from an NFS client.
>>
>> There's no point trying to accelerate your disks if you're only going
>> to use a single client over gigabit.
>
> This is a really strange statement. It does not make any sense.

I'm saying that even a single pair of disks (maybe 4 disks if you're using cheap slow disks) will outperform a 1Gb Ethernet. So if your bottleneck is the 1Gb Ethernet, you won't gain anything (significant) by accelerating the stuff that isn't the bottleneck.
In all honesty, I haven't done much at sysadmin level with Solaris since it was SunOS 5.2. I found ZFS after becoming concerned with the reliability of traditional RAID5 and RAID6 systems once drives exceeded 500GB. I have a few months running ZFS on FreeBSD lately on a test/augmentation basis with 1TB drives in older hardware. Thus far, it seems very promising. As other people have pointed out though, one's mileage may vary. I am interested in a blend of performance, reliability and cost. I think ZFS can deliver all three across the board.

You are right - if I am not aware enough yet of the docs to know the difference between a zil device and a slog device, I guess I need to finally hit the books on this one some more. ZFS seems both stable enough and I think also has enough 'cool factor' to it, that it's probably about time there were some books available? Perhaps if/when Solaris 10 gets de-dupe that will be the breaker/maker?

I have a couple more comments down below. Thanks for the response, and once more - I have very much been enjoying the 'SSD best practices' thread.

On Apr 19, 2010, at 4:12 AM, Khyron wrote:

> All of the features in ZFS on Solaris 10 appear in OpenSolaris; the inverse
> does not necessarily hold true, as active development occurs on the
> OpenSolaris trunk and updates take about a year to filter back down into
> Solaris due to integration concerns, testing, etc.

Yes, I understand this. When the heck is de-dupe coming into Solaris 10? People could save enough money on disks (not to mention the power bills and the cooling costs) to upgrade maybe?

> The zpool.cache file does not reside in the pool. It lives in /etc/zfs in
> the root file system of your OpenSolaris system. Thus, it does not reside
> "on the ZIL device" either, since there may not necessarily be a SLOG (what
> you would term a "ZIL device") anyway. (There is always a ZIL, though. See
> remarks above.)

I have one test box, running FreeBSD 8, not Solaris, and have no /etc/zfs/zpool.cache or /usr/local/etc/zpool.cache. I will check on another list about that and how they are handling it.
Yes yes- /etc/zfs/zpool.cache - we all hate typos :) -- This message posted from opensolaris.org
I must note that you haven't answered my question... If the zpool.cache file differs between the two heads for some reason- how do I ensure that the second head has an accurate copy without importing the ZFS pool?
-- This message posted from opensolaris.org
I'm not certain if I'm misunderstanding you- or if you didn't read my post carefully.

Why would the zpool.cache file be current on the _second_ node? The first node is where I've added my zpools and so on. The second node isn't going to have an updated cache file until I export the zpool from the first system and import it to the second system, no?

In my case- I believe both nodes have exactly the same view of the disks- all the controllers and targets are identical- but there is no reason they have to be as far as I know. As such- simply backing up the primary system's zpool.cache to the secondary could cause problems.

I'm simply curious if there is a way for a node to keep its zpool.cache up to date without actually importing the zpool. i.e. is there a scandisks command that can scan for a zpool without importing it?

Am I misunderstanding something here?
-- This message posted from opensolaris.org
On Mon, April 19, 2010 07:32, Edward Ned Harvey wrote:

> I'm saying that even a single pair of disks (maybe 4 disks if you're using
> cheap slow disks) will outperform a 1Gb Ethernet.  So if your bottleneck
> is the 1Gb Ethernet, you won't gain anything (significant) by accelerating
> the stuff that isn't the bottleneck.

Would it help in improving IOPS or latency for more random workloads? It may not be that you're pushing bandwidth, but rather a lot of (say) NFS writes. It could potentially cause a lot of seeks, even with striped mirrors.
On Mon, April 19, 2010 06:26, Michael DeMan wrote:

> B. The current implementation stores that cache file on the zil device,
> so if for some reason, that device is totally lost (along with said .cache
> file), it is nigh impossible to recover the entire pool it correlates
> with.

Given that ZFS is always consistent on-disk, why would you lose a pool if you lose the ZIL and/or cache file? Theoretically shouldn't you lose, at most, the last few transactions?

With recent updates to ZFS you can do a forced import giving "informed consent" to go back to a previous uber-block.
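For reference, here is roughly what that forced import looks like on builds recent enough to support pool recovery (a sketch only; the pool name "tank" is just an example):

    # Ask ZFS what rolling back would discard, without actually doing it
    zpool import -F -n tank

    # Then perform the recovery import, discarding the last few
    # transactions and returning to an earlier consistent uberblock
    zpool import -F tank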
On Mon, 19 Apr 2010, Don wrote:

> If the zpool.cache file differs between the two heads for some
> reason- how do I ensure that the second head has an accurate copy
> without importing the ZFS pool?

The zpool.cache file can only be valid for one system at a time. If the pool is imported to a different system, then the zpool.cache file generated on that system will be different due to differing device names and a different host name.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Ok- I think perhaps I'm failing to explain myself.

I want to know if there is a way for a second node- connected to a set of shared disks- to keep its zpool.cache up to date _without_ actually importing the ZFS pool.

As I understand it- keeping the zpool.cache up to date on the second node would provide additional protection should the slog fail at the same time my primary head failed (it should also improve import times, if what I've read is true).

I understand that importing the disks to the second node will update the cache file- but by that time it may be too late. I'd like to update the cache file _before_ then. I see no reason why the second node couldn't scan the disks being used by the first node and then update its zpool.cache.
-- This message posted from opensolaris.org
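One partial answer, for the "scandisks" part at least (a sketch; this only discovers pools, it does not keep any cache file in sync): running "zpool import" with no pool name scans the devices and reports what it finds, without importing anything.

    # Scan the default /dev/dsk for pools that could be imported
    zpool import

    # Or point the scan at a specific device directory
    zpool import -d /dev/dsk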
On Mon, 19 Apr 2010, Edward Ned Harvey wrote:

>>> There's no point trying to accelerate your disks if you're only
>>> going to use a single client over gigabit.
>>
>> This is a really strange statement.  It does not make any sense.
>
> I'm saying that even a single pair of disks (maybe 4 disks if you're using
> cheap slow disks) will outperform a 1Gb Ethernet.  So if your bottleneck is
> the 1Gb Ethernet, you won't gain anything (significant) by accelerating the
> stuff that isn't the bottleneck.

That is true.

For the record, this is the size of the uncompressed tarball:

  % du -sh gcc-4.4.3.tar
  409M    gcc-4.4.3.tar

Expecting close to 100 MB/second for the data transfer over gigabit, this may place a cap on achievable performance at around 4 seconds. The test included bzip2 decompression, and it is not clear (without testing) whether the bzip2 decompression increases or decreases the available data flow to the network.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 19/04/2010 16:46, Don wrote:

> I want to know if there is a way for a second node- connected to
> a set of shared disks- to keep its zpool.cache up to date
> _without_ actually importing the ZFS pool.

See zpool(1M):

     cachefile=path | none

         Controls the location of where the pool configuration is
         cached. Discovering all pools on system startup requires a
         cached copy of the configuration data that is stored on the
         root file system. All pools in this cache are automatically
         imported when the system boots. Some environments, such as
         install and clustering, need to cache this information in a
         different location so that pools are not automatically
         imported. Setting this property caches the pool configuration
         in a different location that can later be imported with
         "zpool import -c". Setting it to the special value "none"
         creates a temporary pool that is never cached, and the
         special value "" (empty string) uses the default location.

         Multiple pools can share the same cache file. Because the
         kernel destroys and recreates this file when pools are added
         and removed, care should be taken when attempting to access
         this file. When the last pool using a cachefile is exported
         or destroyed, the file is removed.

--
Darren J Moffat
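To make that concrete, a rough sketch of how the property tends to be used in a two-headed setup (the pool name, cachefile path and devices below are only examples):

    # Create (or re-import) the shared pool with a non-default cachefile,
    # so neither head auto-imports it at boot
    zpool create -o cachefile=/etc/cluster/zpool.cache tank mirror c2t0d0 c2t1d0

    # On failover, the other head can import it with a plain device scan
    zpool import tank

    # Or, if it has a copy of the cachefile, read the config from that
    zpool import -c /etc/cluster/zpool.cache tank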
That section of the man page is actually helpful- as I wasn't sure what I was going to do to ensure the nodes didn't try to bring up the zpool on their own- outside of clustering software or my own intervention.

That said- it still doesn't explain how I would keep the secondary node's zpool.cache up to date.

If I create a zpool on the first node, import it on the second, then move it back to the first, now they both have a current zpool.cache. If I add additional disks to the first node- how do I get the second node's cache file current without first importing the disks?
-- This message posted from opensolaris.org
On 19/04/2010 17:13, Don wrote:

> That section of the man page is actually helpful- as I wasn't sure what I was
> going to do to ensure the nodes didn't try to bring up the zpool on their own-
> outside of clustering software or my own intervention.
>
> That said- it still doesn't explain how I would keep the secondary node's
> zpool.cache up to date.

That is the job of the cluster software.

> If I create a zpool on the first node, import it on the second,
> then move it back to the first, now they both have a
> current zpool.cache. If I add additional disks to the first node- how do I
> get the second node's cache file current without first importing the disks?

The point of the cachefile zpool option is that there aren't two copies of the zpool.cache file; there is only one.

--
Darren J Moffat
Now I'm simply confused.

Do you mean one cachefile shared between the two nodes for this zpool? How, may I ask, would this work?

The rpool should be in /etc/zfs/zpool.cache.

The shared pool should be in /etc/cluster/zpool.cache (or wherever you prefer to put it) so it won't come up on system start.

What I don't understand is how the second node is either a) supposed to share the first node's cachefile or b) create its own without importing the pool.

You say this is the job of the cluster software- does ha-cluster already handle this with their ZFS modules?

I've asked this question 5 different ways and I either still haven't gotten an answer- or still don't understand the problem. Is there a way for a passive node to generate its _own_ zpool.cache without importing the file system? If so- how? If not- why is this unimportant?
-- This message posted from opensolaris.org
On 19/04/2010 17:50, Don wrote:

> Now I'm simply confused.
>
> Do you mean one cachefile shared between the two nodes for this zpool? How, may I ask, would this work?

Either that or a way for the nodes to update each other's copy very quickly, such as a parallel filesystem. It is the job of the cluster software to provide a mechanism to do that.

> What I don't understand is how the second node is either a) supposed to share the first node's cachefile or b) create its own without importing the pool.

Are you writing your own cluster framework?

> You say this is the job of the cluster software- does ha-cluster already handle this with their ZFS modules?

Searching google for "solaris ha-cluster zfs zpool.cache" I found this opensolaris.org thread on the ha-cluster list:

http://opensolaris.org/jive/thread.jspa?messageID=338413

That thread has information on which cluster release/patch is needed.

> I've asked this question 5 different ways and I either still haven't gotten an answer- or still don't understand the problem.

My apologies, but I jumped in partway through the thread. Are you writing your own cluster software, or are you trying to use an already existing cluster framework that already supports ZFS?

--
Darren J Moffat
On Apr 19, 2010, at 9:50 AM, Don wrote:

> Now I'm simply confused.

In one sentence, the cachefile keeps track of what is currently imported.

> Do you mean one cachefile shared between the two nodes for this zpool? How, may I ask, would this work?

Each OS instance has a default cachefile.

> The rpool should be in /etc/zfs/zpool.cache.

Yes, this is the default for OpenSolaris distributions.

> The shared pool should be in /etc/cluster/zpool.cache (or wherever you prefer to put it) so it won't come up on system start.

Correct.

> What I don't understand is how the second node is either a) supposed to share the first node's cachefile or b) create its own without importing the pool.

a) it doesn't
b) it doesn't

> You say this is the job of the cluster software- does ha-cluster already handle this with their ZFS modules?

Yes.

> I've asked this question 5 different ways and I either still haven't gotten an answer- or still don't understand the problem.

See below.

> Is there a way for a passive node to generate its _own_ zpool.cache without importing the file system? If so- how? If not- why is this unimportant?

No. And this is unimportant.

The bit of context missing here is the answer to the question: why do we want to keep a backup of the cache file? The answer is that the cache file contains a record of the GUID for each disk. In the event that a disk is destroyed, there are some cases where the pool can be brought online if the GUIDs of the destroyed disks are known. This is not a typical recovery method and has rarely been needed.

Please do not confuse the discussion of the desire to keep a copy of the cachefile with the greater desire to keep a record of the GUIDs. In this context, the cachefile is a convenient record of the GUIDs.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
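If all you really want is a record of the GUIDs kept somewhere safe, you can collect them without copying zpool.cache at all. A sketch (pool and device names are only examples):

    # Dump the on-disk label of a pool device; the vdev tree it prints
    # includes the guid of every disk and log device
    zdb -l /dev/rdsk/c2t0d0s0

    # Or dump the cached configuration of the imported pool
    zdb -C tank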
On Apr 19, 2010, at 12:50 PM, Don <don at blacksun.org> wrote:

> Now I'm simply confused.
>
> Do you mean one cachefile shared between the two nodes for this
> zpool? How, may I ask, would this work?
>
> The rpool should be in /etc/zfs/zpool.cache.
>
> The shared pool should be in /etc/cluster/zpool.cache (or wherever
> you prefer to put it) so it won't come up on system start.
>
> What I don't understand is how the second node is either a) supposed
> to share the first node's cachefile or b) create its own without
> importing the pool.
>
> You say this is the job of the cluster software- does ha-cluster
> already handle this with their ZFS modules?
>
> I've asked this question 5 different ways and I either still haven't
> gotten an answer- or still don't understand the problem.
>
> Is there a way for a passive node to generate its _own_ zpool.cache
> without importing the file system? If so- how? If not- why is this
> unimportant?

I don't run the cluster suite, but I'd be surprised if the software doesn't copy the cache to the passive node whenever it's updated.

-Ross
I apologize- I didn't mean to come across as rude- I'm just not sure if I'm asking the right question.

I'm not ready to use the ha-cluster software yet as I haven't finished testing it. For now I'm manually failing over from the primary to the backup node. That will change- but I'm not ready to go there yet.

As such I'm trying to make sure both my nodes have a current cache file so that the targets and GUIDs are ready.
-- This message posted from opensolaris.org
Edward Ned Harvey wrote:

> I'm saying that even a single pair of disks (maybe 4 disks if you're using
> cheap slow disks) will outperform a 1Gb Ethernet.  So if your bottleneck is
> the 1Gb Ethernet, you won't gain anything (significant) by accelerating the
> stuff that isn't the bottleneck.

And you are confusing throughput with latency (in a sense). Yes, my raidz2 pool is much faster than GigE, measured in pipelined, streaming throughput (for those old fogeys like me, think good kermit / zmodem). Yet it still sucks for small non-pipelined synchronous writes, such as those triggered by NFS + tar (think xmodem *shudder*).

So spending around $150 for a greater than 10x performance improvement makes a hell of a lot of sense. And yes, spending even more on an ACARD flash-backed DRAM solution would probably make things even faster, but I was feeling cheap ;-)

As for the DDRdrive X1, it is not a solution I would recommend to anyone in its current state. If it gets a supercap / battery that enables it to dump its data to non-volatile storage following a power cut, that will change, and I know they have taken our feedback and are revising the product.

-- Carson
I understand that the important bit about having the cachefile is the GUIDs (although the disk record is, I believe, helpful in improving import speeds) so we can recover in certain oddball cases. As such- I'm still confused why you say it's unimportant.

Is it enough to simply copy the /etc/cluster/zpool.cache file from the primary node to the secondary so that I at least have the GUIDs, even if the disk references (the /dev/dsk sections) might not match?
-- This message posted from opensolaris.org
To clarify, the DDRdrive X1 is not an option for OpenSolaris today, irrespective of specific features, because the driver is not yet available. When our OpenSolaris device driver is released, later this quarter, the X1 will have updated firmware to automatically provide backup/restore based on an external power source. We hope the X1 will be the first in a family of products, where future iterations will also offer an internal power source option. Feedback from this list also played a decisive role in our forthcoming strategy to focus exclusively on serving the ZFS dedicated log market. Thanks, Christopher George Founder/CTO www.ddrdrive.com -- This message posted from opensolaris.org
>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:dm> Given that ZFS is always consistent on-disk, why would you dm> lose a pool if you lose the ZIL and/or cache file? because of lazy assertions inside ''zpool import''. you are right there is no fundamental reason for it---it''s just code that doesn''t exist. If you are a developer you can probably still recover your pool, but there aren''t any commands with a supported interface to do it. ''zpool.cache'' doesn''t contain magical information, but it allows you to pass through a different code path that doesn''t include the ``BrrkBrrk, omg panic device missing, BAIL OUT HERE'''' checks. I don''t think squirreling away copies of zpool.cache is a great way to make your pool safe from slog failures because there may be other things about the different manual ''zpool import'' codepath that you need during a disaster, like -F, which will remain inaccessible to you if you rely on some saving-your-zpool.cache hack, even if your hack ends up actually working when the time comes, which it might not. I think is really interesting, the case of an HA cluster using a single-device slog made from a ramdisk on the passive node. This case would also become safer if slogs were fully disposeable. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100419/1cddb626/attachment.bin>
I think the DDR drive has a battery and can dump to a CF card.

-B

Sent from my Nexus One.

On Apr 19, 2010 10:41 AM, "Carson Gaspar" <carson at taltos.org> wrote:

> Edward Ned Harvey wrote:
> > I'm saying that even a single pair of disks (maybe 4 disks if you're
> > using cheap slow disks) will outperform a 1Gb Ethernet.
>
> And you are confusing throughput with latency (in a sense). Yes, my
> raidz2 pool is much faster than GigE, measured in pipelined, streaming
> throughput (for those old fogeys like me, think good kermit / zmodem).
> Yet it still sucks for small non-pipelined synchronous writes, such as
> those triggered by NFS + tar (think xmodem *shudder*).
>
> So spending around $150 for a greater than 10x performance improvement
> makes a hell of a lot of sense. And yes, spending even more on an ACARD
> flash-backed DRAM solution would probably make things even faster, but I
> was feeling cheap ;-)
>
> As for the DDRdrive X1, it is not a solution I would recommend to anyone
> in its current state. If it gets a supercap / battery that enables it to
> dump its data to non-volatile storage following a power cut, that will
> change, and I know they have taken our feedback and are revising the
> product.
>
> -- Carson
Continuing on the best practices theme- how big should the ZIL slog disk be?

The ZFS evil tuning guide suggests enough space for 10 seconds of my synchronous write load- even assuming I could cram 20 gigabits/sec into the host (2 x 10 GigE NICs), that only comes out to 200 gigabits, which = 25 gigabytes.

I'm currently planning to use 4 x 32GB SSDs arranged in two 2-way mirrors, which should give me 64GB of log space. Is there any reason to believe that this would be insufficient (especially considering I can't begin to imagine being able to cram 5 Gb/s into the host- let alone 20)?

Are there any guidelines on how much ZIL performance should increase with 2 SSD slogs (4 disks with mirrors) over a single SSD slog (2 disks mirrored)?
-- This message posted from opensolaris.org
> I think the DDR drive has a battery and can dump to a CF card.

The DDRdrive X1's automatic backup/restore feature utilizes on-board SLC NAND (high quality Flash) and is completely self-contained. Neither the backup nor restore feature involves data transfer over the PCIe bus or to/from removable media.

Thanks,

Christopher George
Founder/CTO
www.ddrdrive.com
-- This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Don
>
> Continuing on the best practices theme- how big should the ZIL slog
> disk be?
>
> The ZFS evil tuning guide suggests enough space for 10 seconds of my
> synchronous write load- even assuming I could cram 20 gigabits/sec into
> the host (2 x 10 GigE NICs), that only comes out to 200 gigabits, which
> = 25 gigabytes.
>
> I'm currently planning to use 4 x 32GB SSDs arranged in two 2-way mirrors,
> which should give me 64GB of log space. Is there any reason to believe
> that this would be insufficient (especially considering I can't begin
> to imagine being able to cram 5 Gb/s into the host- let alone 20)?
>
> Are there any guidelines on how much ZIL performance should increase
> with 2 SSD slogs (4 disks with mirrors) over a single SSD slog (2 disks
> mirrored)?

I think the size of the ZIL log is basically irrelevant ... For example, I remember reading somewhere that the system refuses to use more than 50% of the size of RAM, yet you can hardly even think about buying an SSD smaller than 32G. If you've got a 64G RAM system, you're probably not going to use only a single SSD, just due to the fact that you've probably got dozens of disks attached, and you'll probably use multiple log devices striped just for the sake of performance.

Improbability assessment aside, suppose you use something like the DDRdrive X1 ... which might be more like 4G instead of 32G ... Is it even physically possible to write 4G to any device in less than 10 seconds? Remember, to achieve worst case, highest demand on the ZIL log device, these would all have to be <32kbyte writes (default configuration), because larger writes will go directly to primary storage, with only the intent landing on the ZIL.

To try and quantify this a little closer, suppose all your writes are 31K (worst case for a typical setup) ... meaning they're as large as possible while still going to the log device instead of primary storage. Suppose you get 2000 IOPS (which is roughly typical according to my benchmarks); then you're writing a little less than 64 Mbytes/sec, and you won't even come close to reaching 1G within 10 seconds.

As a cross-check, assume it's a PCIe 2.0 x1 bus. This is 500 Mbytes/sec theoretical maximum. So in 10 seconds, 5 Gbyte theoretical maximum. How about if you're using SAS 6Gbit devices, with the unrealistic assumption that you can write 6Gbits? Well, that's 750 Mbytes/sec, unrealistically high, so 7.5G in 10 seconds. Which I know to be not just unrealistic, but ridiculously overestimated, by at least one order of magnitude.

So, although I don't have any physical machine to test or verify this on, I have a very educated guess which says even the smallest nonvolatile device will be more than you can use for your ZIL log. Size doesn't matter. Just speed. (And reliability, price, etc.)
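Spelling out the arithmetic behind that worst case (these numbers simply restate the assumptions above; they are not measurements):

    31 KB/write x 2000 writes/sec  ~=  62 MB/sec to the log device
    62 MB/sec x 10 sec             ~= 620 MB in a 10-second window

So even the worst-case intent-log stream stays well under 1 GB in that window.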
> I think the size of the ZIL log is basically irrelevant

That was the understanding I got from reading the various blog posts and the tuning guide.

> only a single SSD, just due to the fact that you've probably got dozens of
> disks attached, and you'll probably use multiple log devices striped just
> for the sake of performance.

I've got 72 (possibly 76) 15k RPM 300GB and 600GB SAS drives, and my head has 16 GB of RAM, though that can be increased at any time to 32GB. My current plan is to use 4 x 32GB SLC write-optimized SSDs in a striped-mirrors configuration.

I'm curious if anyone knows how ZIL slog performance scales. For example- how much benefit would you expect from 2 SSD slogs over 1? Would there be a significant benefit to 3 over 2, or does it begin to taper off? I'm sure a lot of this is dependent on the environment- but rough ideas are good to know.

Is it safe to assume that a stripe across two mirrored write-optimized SSDs is going to give me the best performance for 4 available drive bays (assuming I want the ZIL to remain safe)?

> Is it even physically possible to write 4G to any device in less than 10 seconds?

I wasn't actually sure the 10 second number was still accurate- that was definitely part of my question. If it is- then yes- I could never fill a 32 GB ZIL, let alone a 64GB one.

Thanks for all of the help and advice.
-- This message posted from opensolaris.org
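For what it's worth, the layout described above would be added like this (hypothetical pool and device names); log writes are striped across the two mirrors automatically:

    # Two mirrored log devices = a stripe of two 2-way slog mirrors
    zpool add tank log mirror c3t0d0 c3t1d0 mirror c3t2d0 c3t3d0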
On Mon, 19 Apr 2010, Don wrote:

> Continuing on the best practices theme- how big should the ZIL slog disk be?
>
> The ZFS evil tuning guide suggests enough space for 10 seconds of my
> synchronous write load- even assuming I could cram 20 gigabits/sec
> into the host (2 x 10 GigE NICs), that only comes out to 200 gigabits,
> which = 25 gigabytes.

Note that large writes bypass the dedicated intent log device entirely and go directly to a ZIL on primary disk. This is because SSDs typically have much less raw bandwidth than primary disk does. If you are writing bulk data, the larger chunks will go directly to primary disk, and the smaller bits (e.g. smaller writes, metadata, filenames, directories, etc.) will go to the dedicated intent log device. This means that the device does not need to be as large as you may think it should be.

Use the 'zilstat' DTrace script to evaluate what really happens on your system before you invest in extra hardware.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
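A quick sketch of running it, assuming the downloaded script is saved as zilstat.ksh and run as root while a representative workload is active:

    # Sample ZIL activity every 10 seconds for 30 samples; the output
    # shows the bytes and operations going through the in-memory ZIL,
    # which is what a dedicated log device would have to absorb
    ./zilstat.ksh 10 30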
On Mon, 19 Apr 2010, Edward Ned Harvey wrote:

> Improbability assessment aside, suppose you use something like the DDRdrive
> X1 ... which might be more like 4G instead of 32G ... Is it even physically
> possible to write 4G to any device in less than 10 seconds? Remember, to
> achieve worst case, highest demand on the ZIL log device, these would all have
> to be <32kbyte writes (default configuration), because larger writes will go
> directly to primary storage, with only the intent landing on the ZIL.

Note that ZFS always writes data in order, so I believe that the statement "larger writes will go directly to primary storage" really should be "larger writes will go directly to the ZIL implemented in primary storage (which always exists)". Otherwise, ZFS would need to write a new TXG whenever a new "large" block of data appeared (which may be puny as far as the underlying store is concerned) in order to assure proper ordering. This would result in a very high TXG issue rate. Pool fragmentation would be increased.

I am sure that someone will correct me if this is wrong.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
I always try to plan for the worst case- I just wasn't sure how to arrive at the worst case. Thanks for providing the information- and I will definitely check out the zilstat DTrace script.

Considering the smallest SSD I can buy from a manufacturer that I trust seems to be 32GB- that's probably going to be my choice.

As for the choice of striping across two mirrored pairs- I want every last IOP I can get my hands on. An extra $700 isn't going to make much of a difference in a system involving 2 heads, 5 storage shelves, and 76 SAS drives. If I could think of something better to spend that money on, I would- but right now it seems like the best option.
-- This message posted from opensolaris.org
On Mon, 19 Apr 2010, Don wrote:

> I'm curious if anyone knows how ZIL slog performance scales. For
> example- how much benefit would you expect from 2 SSD slogs over 1?
> Would there be a significant benefit to 3 over 2, or does it begin to
> taper off? I'm sure a lot of this is dependent on the environment-
> but rough ideas are good to know.

I don't know the answer, but I expect that the answer depends quite a lot on the nature of the SSDs used. A STEC Zeus IOPS SSD (45K IOPS) will behave quite differently than an Intel X-25E (~3.3K IOPS). A SRAM- or DRAM-based "drive" (with FLASH backup) will behave dramatically differently than a typical SSD. If the SSD employed supports sufficient IOPS and bandwidth, then adding more will not help, since it is not the bottleneck.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> A STEC Zeus IOPS SSD (45K IOPS) will behave quite differently than an Intel X-25E (~3.3K IOPS).

Where can you even get the Zeus drives? I thought they were only in the OEM market, and last time I checked they were ludicrously expensive. I'm looking for between 5k and 10k IOPS using up to 4 drive bays (so a 2 x 2 striped mirror would be fine). Right now we peak at about 3k IOPS (though that's not to a ZFS system), but I would like to be able to burst to double that. We do have a lot of small-size burst writes, hence our ZIL concerns.

> A SRAM- or DRAM-based "drive" (with FLASH backup) will behave dramatically differently than a typical SSD.

As long as it can speak SAS or SATA and I can put it in a drive shelf, I'd happily consider using it. All the DRAM devices I know of are host-based, and that won't help my cluster.

On that note- what write-optimized SSDs do you recommend? I don't actually know where to buy the Zeus drives even if they've become more reasonably priced.

Thanks for taking the time to share- it's been very informative.
-- This message posted from opensolaris.org
On Apr 19, 2010, at 7:02 PM, Bob Friesenhahn wrote:

> On Mon, 19 Apr 2010, Don wrote:
>
>> Continuing on the best practices theme- how big should the ZIL slog disk be?
>>
>> The ZFS evil tuning guide suggests enough space for 10 seconds of my
>> synchronous write load- even assuming I could cram 20 gigabits/sec into
>> the host (2 x 10 GigE NICs), that only comes out to 200 gigabits,
>> which = 25 gigabytes.
>
> Note that large writes bypass the dedicated intent log device entirely and
> go directly to a ZIL on primary disk. This is because SSDs typically have
> much less raw bandwidth than primary disk does.

That was last year. This year there are many SSDs which have sustained write bandwidth greater than the media speed on HDDs. The newer models with 6Gbps SAS can write > 200 MB/sec and read > 300 MB/sec. For comparison, a 15krpm Seagate Cheetah with 4 platters is rated at 116-195 MB/sec sustainable disk transfer rate. When it comes to performance, game over.

> If you are writing bulk data, the larger chunks will go directly to primary
> disk, and the smaller bits (e.g. smaller writes, metadata, filenames,
> directories, etc.) will go to the dedicated intent log device. This means
> that the device does not need to be as large as you may think it should be.
>
> Use the 'zilstat' DTrace script to evaluate what really happens on your
> system before you invest in extra hardware.

Yes, good idea.
http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Apr 19, 2010, at 7:11 PM, Bob Friesenhahn wrote:

> On Mon, 19 Apr 2010, Edward Ned Harvey wrote:
>> Improbability assessment aside, suppose you use something like the DDRdrive
>> X1 ... which might be more like 4G instead of 32G ... Is it even physically
>> possible to write 4G to any device in less than 10 seconds? Remember, to
>> achieve worst case, highest demand on the ZIL log device, these would all have
>> to be <32kbyte writes (default configuration), because larger writes will go
>> directly to primary storage, with only the intent landing on the ZIL.
>
> Note that ZFS always writes data in order, so I believe that the statement
> "larger writes will go directly to primary storage" really should be "larger
> writes will go directly to the ZIL implemented in primary storage (which
> always exists)". Otherwise, ZFS would need to write a new TXG whenever a new
> "large" block of data appeared (which may be puny as far as the underlying
> store is concerned) in order to assure proper ordering. This would result in
> a very high TXG issue rate. Pool fragmentation would be increased.
>
> I am sure that someone will correct me if this is wrong.

Actually, when (you are not using a separate log device and the block size is > 32kB) or (you are using a separate log and logbias=throughput), then the data is written once to the main pool and a reference record is written to the ZIL. When the txg commits, the reference record is discarded and the committed block pointer is correct. Upon rollback, all you need is the real data and the reference record from the ZIL to reconstruct.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Apr 19, 2010, at 12:44 PM, Miles Nordin wrote:

>>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes:
>
> dm> Given that ZFS is always consistent on-disk, why would you
> dm> lose a pool if you lose the ZIL and/or cache file?
>
> because of lazy assertions inside 'zpool import'. you are right there
> is no fundamental reason for it---it's just code that doesn't exist.

No, there is not a different code path. The information in the cache file is the pools to be imported and their configuration. The configuration contains the GUIDs for all disks in the pool. Disks are identified by GUID and not by their path, because as we all know, the paths can and do change.

> If you are a developer you can probably still recover your pool, but
> there aren't any commands with a supported interface to do it.

ZFS requires that non-optional top-level vdevs be accessible at import time. These include pool vdevs and log vdevs. In the case of a single-disk log, the log vdev will have only one disk of interest. A mirrored vdev will have two disks (children of the mirror vdev). Since the disks are referenced by GUID rather than path, the knowledge of which GUIDs are used to build the vdevs can be useful when you have to reconstruct by hand.

> 'zpool.cache' doesn't contain magical information, but it allows you
> to pass through a different code path that doesn't include the
> "BrrkBrrk, omg panic device missing, BAIL OUT HERE" checks. I don't
> think squirreling away copies of zpool.cache is a great way to make
> your pool safe from slog failures because there may be other things
> about the different manual 'zpool import' codepath that you need
> during a disaster, like -F, which will remain inaccessible to you if
> you rely on some saving-your-zpool.cache hack, even if your hack ends
> up actually working when the time comes, which it might not.

If there is but a single log disk, and it gets destroyed, and you are on b125 or older, then ZFS will not allow the pool to be imported. ZFS is looking for the GUID, but you won't know what the GUID is unless you have a copy of it somewhere (e.g. a backup of zpool.cache, or you wrote it on the bathroom wall :-)

> I think what is really interesting is the case of an HA cluster using a
> single-device slog made from a ramdisk on the passive node. This case
> would also become safer if slogs were fully disposable.

More interesting is the "look Ma, no directly connected shared storage for a shared storage cluster!" case, where each node acts as an iSCSI target for the mirrored storage. I don't have any direct experience with this, but you can read about it here:
http://docs.sun.com/app/docs/doc/820-7821/girgb?a=view
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
> On Mon, 19 Apr 2010, Edward Ned Harvey wrote:
>> Improbability assessment aside, suppose you use something like the DDRdrive
>> X1 ... which might be more like 4G instead of 32G ... Is it even physically
>> possible to write 4G to any device in less than 10 seconds? Remember, to
>> achieve worst case, highest demand on the ZIL log device, these would all have
>> to be <32kbyte writes (default configuration), because larger writes will go
>> directly to primary storage, with only the intent landing on the ZIL.
>
> Note that ZFS always writes data in order, so I believe that the
> statement "larger writes will go directly to primary storage" really
> should be "larger writes will go directly to the ZIL implemented in
> primary storage (which always exists)". Otherwise, ZFS would need to
> write a new TXG whenever a new "large" block of data appeared (which
> may be puny as far as the underlying store is concerned) in order to
> assure proper ordering. This would result in a very high TXG issue
> rate. Pool fragmentation would be increased.
>
> I am sure that someone will correct me if this is wrong.

There's a difference between "written" and "the data is referenced by the uberblock". There is no need to start a new TXG when a large data block is written. (If the system resets, the data will be on disk but not referenced, and is lost unless the TXG it belongs to is committed.)

Casper
On Mon, April 19, 2010 23:05, Don wrote:

>> A STEC Zeus IOPS SSD (45K IOPS) will behave quite differently than an
>> Intel X-25E (~3.3K IOPS).
>
> Where can you even get the Zeus drives? I thought they were only in the
> OEM market, and last time I checked they were ludicrously expensive. I'm
> looking for between 5k and 10k IOPS using up to 4 drive bays (so a 2 x 2
> striped mirror would be fine). Right now we peak at about 3k IOPS (though
> that's not to a ZFS system), but I would like to be able to burst to
> double that. We do have a lot of small-size burst writes, hence our ZIL
> concerns.

They do have distributors:

http://www.stec-inc.com/support/global_contact.php
http://tinyurl.com/y2lrse2
http://www.stec-inc.com/support/oem_regional_sales_contacts.php?region=USA&subregion=New%20York

And though they do cost a pretty penny, getting the same number of IOPS out of a stack of 15 krpm disks would probably cost a lot more in hardware, power, and cooling.
I looked through that distributor page already, and none of the ones I visited listed the IOPS SSDs- they all listed DRAM and other memory from STEC- but not the SSDs.

I'm not looking to get the same number of IOPS out of 15k RPM drives. I'm looking for an appropriate number of IOPS for my environment- that is to say- twice what I'm currently getting. That would be 6k-10k IOPS. If I can do that with four Intel drives for 1/10th of what a pair of ZEUS SSDs are going to cost me- then that would seem to make a lot more sense. It would also be nice to be able to have a couple of spares on hand- just in case a mirror fails. That's a lot harder when the drives are as expensive as the ZEUS.

Who else, besides STEC, is making write-optimized drives, and what kind of IOPS performance can be expected?
-- This message posted from opensolaris.org
> From: casper at holland.sun.com [mailto:casper at holland.sun.com] On Behalf
> Of Casper.Dik at Sun.COM
>
> > On Mon, 19 Apr 2010, Edward Ned Harvey wrote:
> >> Improbability assessment aside, suppose you use something like the DDRdrive
> >> X1 ... which might be more like 4G instead of 32G ... Is it even physically
> >> possible to write 4G to any device in less than 10 seconds? Remember, to
> >> achieve worst case, highest demand on the ZIL log device, these would all have
> >> to be <32kbyte writes (default configuration), because larger writes will go
> >> directly to primary storage, with only the intent landing on the ZIL.
> >
> > Note that ZFS always writes data in order, so I believe that the
> > statement "larger writes will go directly to primary storage" really
> > should be "larger writes will go directly to the ZIL implemented in
> > primary storage (which always exists)". Otherwise, ZFS would need to
> > write a new TXG whenever a new "large" block of data appeared (which
> > may be puny as far as the underlying store is concerned) in order to
> > assure proper ordering. This would result in a very high TXG issue
> > rate. Pool fragmentation would be increased.
> >
> > I am sure that someone will correct me if this is wrong.
>
> There's a difference between "written" and "the data is referenced by the
> uberblock". There is no need to start a new TXG when a large data block
> is written. (If the system resets, the data will be on disk but not
> referenced, and is lost unless the TXG it belongs to is committed.)

*Also* it turns out, what I said was not strictly correct either. I think I'm too sleepy to get this correct right now, but ... My (hopefully corrected) understanding is now:

By default, all sync writes will go to the ZIL entirely, regardless of size. Only if you change the ... what is it ... logbias to ... throughput. Then, if you have a large sync write, the bulk of the data will be written to primary storage, while just a tiny little intent record will be written to the SSD.

I think I misunderstood the default. I previously thought throughput was the default, not latency.
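In practice that is a per-dataset property, and latency is the default; a small sketch (the dataset name is only an example):

    # Check the current setting
    zfs get logbias tank/nfs

    # Send the bulk of large synchronous writes to the main pool,
    # leaving only a small intent record on the slog
    zfs set logbias=throughput tank/nfs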
On 04/20/10 11:06 AM, Don wrote:

> Who else, besides STEC, is making write-optimized drives, and what
> kind of IOPS performance can be expected?

Just got a distributor email about Texas Memory Systems' RamSan-630, one of a range of huge non-volatile SAN products they make. Other than that it has a capacity of 4-10TB, looks like a 4U, and consumes an amazing 450W, I don't know anything about them. The IOPS are pretty impressive, but power-wise, at 45W/TB, even mirrored disks use quite a bit less power. But 500K random IOPS and 8GB/s might be worth it if the specs are to be believed...
On Apr 21, 2010, at 7:24 AM, Frank Middleton wrote:

> On 04/20/10 11:06 AM, Don wrote:
>
>> Who else, besides STEC, is making write-optimized drives, and what
>> kind of IOPS performance can be expected?
>
> Just got a distributor email about Texas Memory Systems' RamSan-630,
> one of a range of huge non-volatile SAN products they make. Other
> than that it has a capacity of 4-10TB, looks like a 4U, and consumes
> an amazing 450W, I don't know anything about them. The IOPS are
> pretty impressive, but power-wise, at 45W/TB even mirrored disks
> use quite a bit less power. But 500K random IOPS and 8GB/s might
> be worth it if the specs are to be believed...

They have been around for a long time and have a good track record in important markets. They do not cater to the home/hobbyist market.
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com
On Wed, Apr 21, 2010 at 7:24 AM, Frank Middleton <f.middleton at apogeect.com> wrote:

> On 04/20/10 11:06 AM, Don wrote:
>
> Just got a distributor email about Texas Memory Systems' RamSan-630,
> one of a range of huge non-volatile SAN products they make. Other
> than that it has a capacity of 4-10TB, looks like a 4U, and consumes
> an amazing 450W, I don't know anything about them. The IOPS are
> pretty impressive, but power-wise, at 45W/TB even mirrored disks
> use quite a bit less power. But 500K random IOPS and 8GB/s might
> be worth it if the specs are to be believed...

We use the RamSan 400 and 440 with some systems at work. I'm trying to get some RamSan 630 eval units for testing. Unfortunately, we're using Linux LVM and ext3, so I can't answer how well they work with ZFS.

TMS did caution us that the 630 would be slower than the 440 for our use case, which is a lot of synchronous random IOPS. The same probably holds true for use as a slog.

-B
--
Brandon High : bhigh at freaks.com
Someone on this list threw out the idea a year or so ago to just set up 2 ramdisk servers, export a ramdisk from each, and create a mirror slog from them. Assuming newer-version zpools, this sounds like it could be even safer, since there is (supposedly) less of a chance of catastrophic failure if your ramdisk setup fails. Use just one remote ramdisk or two with battery backup... whatever meets your paranoia level.

It's not SSD cheap, but I'm sure you could dream up several options that are less than STEC prices. You also could probably use these machines on multiple pools if you've got them.

I know, it still probably sounds a bit too cowboy for most on this list though.
-- This message posted from opensolaris.org
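For anyone curious what that would actually look like, a very rough sketch using a ramdisk exported via a COMSTAR iSCSI target (all names and addresses are made up, and this is exactly the sort of volatile slog the rest of this thread warns about unless the ramdisk boxes are battery/UPS protected):

    # On each ramdisk server: create a 4GB ramdisk and export it
    ramdiskadm -a slog0 4g
    svcadm enable stmf
    sbdadm create-lu /dev/ramdisk/slog0        # note the LU GUID it prints
    stmfadm add-view <lu-guid>
    itadm create-target
    svcadm enable -r svc:/network/iscsi/target:default

    # On the ZFS head: discover both targets, then mirror them as the slog
    iscsiadm add discovery-address 192.168.1.10
    iscsiadm add discovery-address 192.168.1.11
    iscsiadm modify discovery --sendtargets enable
    zpool add tank log mirror <remote-ramdisk-1> <remote-ramdisk-2>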
On Thu, Apr 22, 2010 at 09:58:12PM -0700, thomas wrote:

> Assuming newer-version zpools, this sounds like it could be even
> safer, since there is (supposedly) less of a chance of catastrophic
> failure if your ramdisk setup fails. Use just one remote ramdisk or
> two with battery backup... whatever meets your paranoia level.

If the iSCSI initiator worked for me at all, I would be trying this. I liked the idea, but it's just not accessible now.

--
Dan.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of thomas
>
> Someone on this list threw out the idea a year or so ago to just set up
> 2 ramdisk servers, export a ramdisk from each, and create a mirror slog
> from them.

Isn't the whole point of a ramdisk to be fast? And now it's going to be at the other end of an Ethernet, with TCP and ... some additional filesystem overhead? No thank you.
On 23/04/2010 12:24, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of thomas
>>
>> Someone on this list threw out the idea a year or so ago to just set up
>> 2 ramdisk servers, export a ramdisk from each, and create a mirror slog
>> from them.
>
> Isn't the whole point of a ramdisk to be fast?
> And now it's going to be at the other end of an Ethernet, with TCP and ...
> some additional filesystem overhead?

iSCSI over 1G or even 10G Ethernet to something on the remote side can be very fast, faster than a 7200rpm drive and possibly faster than a 15k rpm drive. Or maybe it isn't Ethernet but InfiniBand; then we are looking at very fast.

The point of the ZFS L2ARC cache devices is to be faster than your main pool devices. In particular, the idea is to allow you to use cheaper 7200 rpm (or maybe even slower) disks rather than expensive 15k rpm drives, but to get equivalent or better performance for certain types of workload that have traditionally been dominated by 15k rpm drives.

If you are using this as a ZFS log device then you need to be more careful, as the log device does need to persist; otherwise there is no point in having it.

I remember many years ago, on SPARCstation ELC (sun4c) systems with only 8MB of RAM and local swap (IIRC local / but remote /usr too, so a dataless client), it was better to run some X applications remotely on another machine (that someone else was using) than to let them swap locally. The idea being that you had to be unlucky for both machines to need to swap and both to swap out the same program at the same time. That was only over 10BaseT.

What I'm saying is that this isn't new; don't assume that the path to/from local storage is faster than networking.

--
Darren J Moffat