So I was looking into the boot flash feature of the newer x4540, and evidently it is simply a CompactFlash slot, with all of the disadvantages and limitations of that type of media. The Sun deployment guide recommends minimizing writes to a CF boot device, in particular by NFS mounting /var from a different server, disabling swap or swapping to a different device, and doing all logging over the network. Not exactly a configuration I would prefer. My sales SE said most people weren't utilizing the CF boot feature. The concept is nice, but an implementation with SSD-quality flash rather than basic CF (also, preferably redundant devices) would have been better.

If I had an x4540 (which I don't, unfortunately; we picked up a half dozen x4500's just before they were end-of-sale'd), what I think would be interesting to do is install two of the 32GB SSD disks in the boot slots, use a 1-5GB sliced mirror as a slog, and use the remaining 27-31GB as a sliced mirrored root pool. From what I understand you don't need very much space for an effective slog, and SSDs don't have the write failure limitations of CF. Also, the recommendation for giving ZFS entire disks rather than slices evidently isn't applicable to SSDs, as they don't have a write cache. It seems this approach would give you a blazing fast slog, as well as a redundant boot mirror, without having to waste an additional two SATA slots.

If anybody would like to donate an x4540 to a budget-stricken California State University I'd be happy to test it out and report back ;). Given we just found out today that the entire summer quarter schedule of classes has been canceled due to budget cuts :(, I don't see new hardware in our future anytime soon <sigh>...

-- 
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | henson at csupomona.edu
California State Polytechnic University | Pomona CA 91768
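For concreteness, a minimal sketch of the layout Paul describes, assuming hypothetical device names (c1t0d0 and c1t1d0 for the two boot-slot SSDs, s0 for root, s1 for the slog) and an existing data pool named tank; in practice the root pool and its boot blocks would be set up by the installer rather than by hand:

    # slice both SSDs identically with format(1M): s0 = ~27GB (root), s1 = ~4GB (slog)
    zpool create rpool mirror c1t0d0s0 c1t1d0s0    # mirrored root pool on the large slices
    zpool add tank log mirror c1t0d0s1 c1t1d0s1    # mirrored slog for the data pool
    zpool status tank                              # confirm the "logs" mirror is attached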
Paul B. Henson wrote:
> So I was looking into the boot flash feature of the newer x4540, and
> evidently it is simply a CompactFlash slot, with all of the disadvantages
> and limitations of that type of media. The sun deployment guide recommends
> minimizing writes to a CF boot device, in particular by NFS mounting /var
> from a different server, disabling swap or swapping to a different device,
> and doing all logging over the network.

argv. So we've had the discussion many times over the past 4 years about why these "recommendations" are largely bogus. Alas, once published, they seem to live forever.

The presumption is that you are using UFS for the CF, not ZFS. UFS is not COW, so there is a potential endurance problem for blocks which are known to be rewritten many times. ZFS will not have this problem, so if you use ZFS root, you are better served by ignoring the previous "advice."

For additional background, if you worry about UFS and endurance, then you want to avoid all writes, because metadata is at fixed locations, and you could potentially hit endurance problems at those locations. Some people think that /var collects a lot of writes, and it might if you happen to be running a high-volume e-mail server using sendmail. Since almost nobody does that in today's internet, the risk is quite small.

The second thought was that you will be swapping often and therefore you want to avoid the endurance problem which affects swap (where the swap device is raw, not a file system). In practice, if you have a lot of swap activity, then your performance will stink and you will be more likely to actually buy some RAM to solve the problem. Also, most modern machines are overconfigured for RAM, so the actual swap device usage for modern machines is typically low. I had some data which validated this assumption, about 4 years ago. It is easy to monitor swap usage, so see for yourself if your workload does a lot of writes to the swap device. For OpenSolaris (enterprise support contracts now available!) which uses ZFS for swap, don't worry, be happy.

In short, if you use ZFS for root, ignore the warnings.

> Not exactly a configuration I would
> prefer. My sales SE said most people weren't utilizing the CF boot feature.
> The concept is nice, but an implementation with SSD quality flash rather
> than basic CF (also, preferably redundant devices) would have been better.

It depends on the market. In telco, many people use CF for boot because they are much more reliable under much more diverse environmental conditions than magnetic media.

> If I had an x4540 (which I don't, unfortunately, we picked up a half dozen
> x4500's just before they were end of sale'd), what I think would be
> interesting to do would be install two of the 32GB SSD disks in the boot
> slots, use a 1-5GB sliced mirror as a slog, and the remaining 27-31GB as a
> sliced mirrored root pool.

5 GBytes seems pretty large for a slog, but yes, I think this is a good idea.

> From what I understand you don't need very much
> space for an effective slog, and SSD's don't have the write failure
> limitations of CF.

CFs designed for the professional photography market have better specifications than CFs designed for the consumer market.

> Also, the recommendation for giving ZFS entire discs
> rather than slices evidently isn't applicable to SSD's as they don't have a
> write cache. It seems this approach would give you a blazing fast slog, as
> well as a redundant boot mirror without having to waste an additional two
> SATA slots.

This is not an accurate statement. Enterprise-class SSDs (eg. STEC Zeus) have DRAM write buffers. The Flash Mini-DIMMs Sun uses also have DRAM write buffers. These offer very low write latency for slogs.
 -- richard
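A quick sketch of how one might check swap write activity for oneself, per the suggestion above, using stock Solaris tools (intervals are just examples, and the device to watch depends on where your swap lives):

    swap -l         # list swap devices and how many blocks are actually in use
    vmstat 5        # watch the pi/po (page-in/page-out) columns under "page"
    iostat -xn 5    # per-device I/O; look for writes hitting the swap slice or zvol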
On Sat, 6 Jun 2009, Richard Elling wrote:

> The presumption is that you are using UFS for the CF, not ZFS.
> UFS is not COW, so there is a potential endurance problem for
> blocks which are known to be rewritten many times. ZFS will not
> have this problem, so if you use ZFS root, you are better served by
> ignoring the previous "advice."

My understanding was that all modern CF cards incorporate wear leveling, and I was interpreting the recommendation as trying to prevent wearing out the entire card, not necessarily particular blocks.

> of writes to the swap device. For OpenSolaris (enterprise support
> contracts now available!) which uses ZFS for swap, don't worry, be

As of U6, even luddite S10 users can avail themselves of ZFS for boot/swap/dump:

root at ike ~ # uname -a
SunOS ike 5.10 Generic_138889-08 i86pc i386 i86pc

root at ike ~ # swap -l
swapfile                   dev    swaplo   blocks      free
/dev/zvol/dsk/ospool/swap  181,2       8  8388600   8388600

> In short, if you use ZFS for root, ignore the warnings.

How about the lack of redundancy? Is the failure rate for CF so low there's no risk in running a critical server without a mirrored root pool? And what about bit rot? Without redundancy ZFS can only detect but not correct read errors (unless, I suppose, configured with copies>1). How much more would it have cost to include two CF slots that it wasn't warranted?

> 5 GBytes seems pretty large for a slog, but yes, I think this is a good
> idea.

What is the best formula to calculate slog size? I found a recent thread:

http://jp.opensolaris.org/jive/thread.jspa?threadID=78758&tstart=1

in which a Sun engineer (presumably unofficially of course ;) ) mentioned 10-18GB as more than sufficient. On the other hand:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

says "A rule of thumb is that you should size the separate log to be able to handle 10 seconds of your expected synchronous write workload. It would be rare to need more than 100 MBytes in a separate log device, but the separate log must be at least 64 MBytes."

Big gap between 100MB and 10-18GB. The first thread also mentioned in passing that splitting up an SSD between slog and root pool might have undesirable performance issues, although I don't think that was discussed to resolution.

> CFs designed for the professional photography market have better
> specifications than CFs designed for the consumer market.

CF is pretty cheap; you can pick up 16GB-32GB for $80-$200 depending on brand/quality. Assuming they do incorporate wear leveling, and considering even a fairly busy server isn't going to use up *that* much space (I have a couple of E3000's still running which have 4GB disk mirrors for the OS), if you get a decent CF card I suppose it would quite possibly outlast the server. But I think I'd still rather have two 8-/. Show of hands, anybody with an x4540 that's booting off non-redundant CF?

> This is not an accurate statement. Enterprise-class SSDs (eg. STEC Zeus)
> have DRAM write buffers. The Flash Mini-DIMMs Sun uses also have DRAM
> write buffers. These offer very low write latency for slogs.

Yah, that misconception has already been pointed out to me offlist. I actually came upon it in correspondence with you; I had asked about using a slice of an SSD for a slog rather than the whole disk, and you mentioned that the advice for using the whole disk rather than a slice was only for traditional spinning hard drives and didn't apply to SSDs, I thought because of something to do with the write cache, but I guess I misunderstood. I didn't save that message; perhaps you could be kind enough to refresh my memory as to why slices of SSDs are ok while slices of hard disks are best avoided?

-- 
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | henson at csupomona.edu
California State Polytechnic University | Pomona CA 91768
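Working the wiki's rule of thumb through with some made-up numbers may explain part of the gap:

    10 s x 5 MB/s of sync writes       = ~50 MB    (a light NFS workload; fits "rarely >100 MBytes")
    10 s x 100 MB/s of sync writes     = ~1 GB     (a heavy, hypothetical workload)
    10 s x ~1.25 GB/s (10GbE line rate) = ~12.5 GB (worst case; roughly where 10-18GB could come from)

The last line is only a guess about how the larger figure was derived, but the arithmetic itself is just expected sync write rate times the number of seconds you want to be able to buffer.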
Paul B. Henson wrote:
> On Sat, 6 Jun 2009, Richard Elling wrote:
>
>> The presumption is that you are using UFS for the CF, not ZFS.
>> UFS is not COW, so there is a potential endurance problem for
>> blocks which are known to be rewritten many times. ZFS will not
>> have this problem, so if you use ZFS root, you are better served by
>> ignoring the previous "advice."
>
> My understanding was that all modern CF cards incorporate wear leveling,
> and I was interpreting the recommendation as trying to prevent wearing out
> the entire card, not necessarily particular blocks.

Wear leveling is an attempt to solve the problem of multiple writes to the same physical block.

>> of writes to the swap device. For OpenSolaris (enterprise support
>> contracts now available!) which uses ZFS for swap, don't worry, be
>
> As of U6, even luddite S10 users can avail themselves of ZFS for boot/swap/dump:
>
> root at ike ~ # uname -a
> SunOS ike 5.10 Generic_138889-08 i86pc i386 i86pc
>
> root at ike ~ # swap -l
> swapfile                   dev    swaplo   blocks      free
> /dev/zvol/dsk/ospool/swap  181,2       8  8388600   8388600

Yes, and as you can see, my attempts to get the verbiage changed have failed :-(

>> In short, if you use ZFS for root, ignore the warnings.
>
> How about the lack of redundancy? Is the failure rate for CF so low there's
> no risk in running a critical server without a mirrored root pool? And what
> about bit rot? Without redundancy ZFS can only detect but not correct read
> errors (unless, I suppose, configured with copies>1). How much more would
> it have cost to include two CF slots that it wasn't warranted?

The failure rate is much lower than disks, with the exception of the endurance problem. Flash memory is not susceptible to the bit rot that plagues magnetic media. Nor is flash memory susceptible to the radiation-induced bit flips that plague DRAMs. Or, to look at this another way, billions of consumer electronics devices use a single flash "boot disk" and there doesn't seem to be many people complaining they aren't mirrored. Indeed, even if you have a mirrored OS on flash, you don't have a mirrored OBP or BIOS (which is also on flash). So, the risk here is significantly lower than HDDs.

>> 5 GBytes seems pretty large for a slog, but yes, I think this is a good
>> idea.
>
> What is the best formula to calculate slog size? I found a recent thread:
>
> http://jp.opensolaris.org/jive/thread.jspa?threadID=78758&tstart=1
>
> in which a Sun engineer (presumably unofficially of course ;) ) mentioned
> 10-18GB as more than sufficient. On the other hand:
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
>
> says "A rule of thumb is that you should size the separate log to be able
> to handle 10 seconds of your expected synchronous write workload. It would
> be rare to need more than 100 MBytes in a separate log device, but the
> separate log must be at least 64 MBytes."

This was a ROT when the default txg sync time was 5 seconds... I'll update this soon because that is no longer the case.

> Big gap between 100MB and 10-18GB. The first thread also mentioned in
> passing that splitting up an SSD between slog and root pool might have
> undesirable performance issues, although I don't think that was discussed
> to resolution.

Yep, big gap. This is why I wrote zilstat, so that you can see what your workload might use before committing to a slog.

There may be a good zilstat RFE here: I can see when the txg commits, so zilstat should be able to collect per-txg rather than per-time-period. Consider it added to my todo list.

http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

>> CFs designed for the professional photography market have better
>> specifications than CFs designed for the consumer market.
>
> CF is pretty cheap, you can pick up 16GB-32GB from $80-$200 depending on
> brand/quality. Assuming they do incorporate wear leveling, and considering
> even a fairly busy server isn't going to use up *that* much space (I have a
> couple E3000's still running which have 4GB disk mirrors for the OS), if
> you get a decent CF card I suppose it would quite possibly outlast the
> server.

I think the dig against CF is that they tend to have a low write speed for small iops. They are optimized for writing large files, like photos.

> But I think I'd still rather have two 8-/. Show of hands, anybody with an
> x4540 that's booting off non-redundant CF?
>
>> This is not an accurate statement. Enterprise-class SSDs (eg. STEC Zeus)
>> have DRAM write buffers. The Flash Mini-DIMMs Sun uses also have DRAM
>> write buffers. These offer very low write latency for slogs.
>
> Yah, that misconception has already been pointed out to me offlist. I
> actually came upon it in correspondence with you, I had asked about using a
> slice of an SSD for a slog rather than the whole disk, and you mentioned
> that the advice for using the whole disk rather than a slice was only for
> traditional spinning hard drives and didn't apply to SSDs, I thought
> because of something to do with the write cache but I guess I
> misunderstood. I didn't save that message, perhaps you could be kind
> enough to refresh my memory as to why slices of SSDs are ok while slices
> of hard disks are best avoided?

In the enterprise class SSDs, the DRAM buffer is nonvolatile. In HDDs, the DRAM buffer is volatile. HDDs will flush their DRAM buffer if you give it the command to do so, which is what ZFS will do when it owns the whole disk. This "design feature" is the cause of much confusion over the years, though.
 -- richard
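As a footnote to the whole-disk vs. slice discussion: one way to see whether ZFS enabled a drive's volatile write cache (which it only does when it is given the whole disk) is format's expert mode; the menu items below are from memory and may vary slightly by release and disk type:

    format -e              # then select the disk in question
    format> cache
    cache> write_cache
    write_cache> display   # reports whether the write cache is currently enabled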
>> CFs designed for the professional photography market have better
>> specifications than CFs designed for the consumer market.
>
> CF is pretty cheap, you can pick up 16GB-32GB from $80-$200 depending on
> brand/quality. Assuming they do incorporate wear leveling, and considering
> even a fairly busy server isn't going to use up *that* much space (I have a
> couple E3000's still running which have 4GB disk mirrors for the OS), if
> you get a decent CF card I suppose it would quite possibly outlast the
> server.

> I think the dig against CF is that they tend to have a low write speed
> for small iops. They are optimized for writing large files, like photos.

Would a 32GB SanDisk Extreme CompactFlash card, 60MB/s (SDCFX-032G-P61), or a 16GB SanDisk Extreme CompactFlash card, 60MB/s (SDCFX-016G-A61), qualify as a decent card, or is there another brand I should look for? Are 32GB cards "supported" at this point? How about UDMA 400x?
-- 
This message posted from opensolaris.org