So I was looking into the boot flash feature of the newer x4540, and evidently it is simply a CompactFlash slot, with all of the disadvantages and limitations of that type of media. The Sun deployment guide recommends minimizing writes to a CF boot device, in particular by NFS mounting /var from a different server, disabling swap or swapping to a different device, and doing all logging over the network. Not exactly a configuration I would prefer. My sales SE said most people weren't utilizing the CF boot feature. The concept is nice, but an implementation with SSD-quality flash rather than basic CF (also, preferably redundant devices) would have been better.

If I had an x4540 (which I don't, unfortunately; we picked up a half dozen x4500's just before they were end-of-sale'd), what I think would be interesting to do is install two of the 32GB SSD disks in the boot slots, use a 1-5GB sliced mirror as a slog, and use the remaining 27-31GB as a sliced mirrored root pool. From what I understand you don't need very much space for an effective slog, and SSDs don't have the write failure limitations of CF. Also, the recommendation for giving ZFS entire disks rather than slices evidently isn't applicable to SSDs, as they don't have a write cache. It seems this approach would give you a blazing fast slog, as well as a redundant boot mirror, without having to waste an additional two SATA slots.

If anybody would like to donate an x4540 to a budget-stricken California State University I'd be happy to test it out and report back ;). Given we just found out today that the entire summer quarter schedule of classes has been canceled due to budget cuts :(, I don't see new hardware in our future anytime soon <sigh>...

-- 
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | henson at csupomona.edu
California State Polytechnic University | Pomona CA 91768
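For concreteness, a minimal sketch of the layout Paul describes, assuming hypothetical device names (c1t0d0 and c1t1d0 for the two boot-slot SSDs, s0 for root, s1 for the slog) and an existing data pool named tank; in practice the root pool and its boot blocks would be set up by the installer rather than by hand:

    # slice both SSDs identically with format(1M): s0 = ~27GB (root), s1 = ~4GB (slog)
    zpool create rpool mirror c1t0d0s0 c1t1d0s0    # mirrored root pool on the large slices
    zpool add tank log mirror c1t0d0s1 c1t1d0s1    # mirrored slog for the data pool
    zpool status tank                              # confirm the "logs" mirror is attached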
Paul B. Henson wrote:
> So I was looking into the boot flash feature of the newer x4540, and
> evidently it is simply a CompactFlash slot, with all of the disadvantages
> and limitations of that type of media. The sun deployment guide recommends
> minimizing writes to a CF boot device, in particular by NFS mounting /var
> from a different server, disabling swap or swapping to a different device,
> and doing all logging over the network.

argv. So we've had the discussion many times over the past 4 years about why these "recommendations" are largely bogus. Alas, once published, they seem to live forever.

The presumption is that you are using UFS for the CF, not ZFS. UFS is not COW, so there is a potential endurance problem for blocks which are known to be rewritten many times. ZFS will not have this problem, so if you use ZFS root, you are better served by ignoring the previous "advice."

For additional background, if you worry about UFS and endurance, then you want to avoid all writes, because metadata is at fixed locations, and you could potentially hit endurance problems at those locations. Some people think that /var collects a lot of writes, and it might if you happen to be running a high-volume e-mail server using sendmail. Since almost nobody does that in today's internet, the risk is quite small.

The second thought was that you will be swapping often and therefore you want to avoid the endurance problem which affects swap (where the swap device is raw, not a file system). In practice, if you have a lot of swap activity, then your performance will stink and you will be more likely to actually buy some RAM to solve the problem. Also, most modern machines are overconfigured for RAM, so the actual swap device usage for modern machines is typically low. I had some data which validated this assumption, about 4 years ago. It is easy to monitor swap usage, so see for yourself if your workload does a lot of writes to the swap device. For OpenSolaris (enterprise support contracts now available!) which uses ZFS for swap, don't worry, be happy.

In short, if you use ZFS for root, ignore the warnings.

> Not exactly a configuration I would
> prefer. My sales SE said most people weren't utilizing the CF boot feature.
> The concept is nice, but an implementation with SSD quality flash rather
> than basic CF (also, preferably redundant devices) would have been better.

It depends on the market. In telco, many people use CF for boot because they are much more reliable under much more diverse environmental conditions than magnetic media.

> If I had an x4540 (which I don't, unfortunately, we picked up a half dozen
> x4500's just before they were end of sale'd), what I think would be
> interesting to do would be install two of the 32GB SSD disks in the boot
> slots, use a 1-5GB sliced mirror as a slog, and the remaining 27-31GB as a
> sliced mirrored root pool.

5 GBytes seems pretty large for a slog, but yes, I think this is a good idea.

> From what I understand you don't need very much
> space for an effective slog, and SSD's don't have the write failure
> limitations of CF.

CFs designed for the professional photography market have better specifications than CFs designed for the consumer market.

> Also, the recommendation for giving ZFS entire discs
> rather than slices evidently isn't applicable to SSD's as they don't have a
> write cache. It seems this approach would give you a blazing fast slog, as
> well as a redundant boot mirror without having to waste an additional two
> SATA slots.

This is not an accurate statement. Enterprise-class SSDs (eg. STEC Zeus) have DRAM write buffers. The Flash Mini-DIMMs Sun uses also have DRAM write buffers. These offer very low write latency for slogs.
 -- richard
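A quick sketch of how one might check swap write activity for oneself, per the suggestion above, using stock Solaris tools (intervals are just examples, and the device to watch depends on where your swap lives):

    swap -l         # list swap devices and how many blocks are actually in use
    vmstat 5        # watch the pi/po (page-in/page-out) columns under "page"
    iostat -xn 5    # per-device I/O; look for writes hitting the swap slice or zvol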
On Sat, 6 Jun 2009, Richard Elling wrote:

> The presumption is that you are using UFS for the CF, not ZFS.
> UFS is not COW, so there is a potential endurance problem for
> blocks which are known to be rewritten many times. ZFS will not
> have this problem, so if you use ZFS root, you are better served by
> ignoring the previous "advice."

My understanding was that all modern CF cards incorporate wear leveling, and I was interpreting the recommendation as trying to prevent wearing out the entire card, not necessarily particular blocks.

> of writes to the swap device. For OpenSolaris (enterprise support
> contracts now available!) which uses ZFS for swap, don't worry, be

As of U6, even luddite S10 users can avail themselves of ZFS for boot/swap/dump:

root at ike ~ # uname -a
SunOS ike 5.10 Generic_138889-08 i86pc i386 i86pc

root at ike ~ # swap -l
swapfile                   dev    swaplo   blocks      free
/dev/zvol/dsk/ospool/swap  181,2       8  8388600   8388600

> In short, if you use ZFS for root, ignore the warnings.

How about the lack of redundancy? Is the failure rate for CF so low there's no risk in running a critical server without a mirrored root pool? And what about bit rot? Without redundancy ZFS can only detect but not correct read errors (unless, I suppose, configured with copies>1). How much more would it have cost to include two CF slots that it wasn't warranted?

> 5 GBytes seems pretty large for a slog, but yes, I think this is a good
> idea.

What is the best formula to calculate slog size? I found a recent thread:

http://jp.opensolaris.org/jive/thread.jspa?threadID=78758&tstart=1

in which a Sun engineer (presumably unofficially of course ;) ) mentioned 10-18GB as more than sufficient. On the other hand:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

says "A rule of thumb is that you should size the separate log to be able to handle 10 seconds of your expected synchronous write workload. It would be rare to need more than 100 MBytes in a separate log device, but the separate log must be at least 64 MBytes."

Big gap between 100MB and 10-18GB. The first thread also mentioned in passing that splitting up an SSD between slog and root pool might have undesirable performance issues, although I don't think that was discussed to resolution.

> CFs designed for the professional photography market have better
> specifications than CFs designed for the consumer market.

CF is pretty cheap; you can pick up 16GB-32GB for $80-$200 depending on brand/quality. Assuming they do incorporate wear leveling, and considering even a fairly busy server isn't going to use up *that* much space (I have a couple of E3000's still running which have 4GB disk mirrors for the OS), if you get a decent CF card I suppose it would quite possibly outlast the server. But I think I'd still rather have two 8-/. Show of hands, anybody with an x4540 that's booting off non-redundant CF?

> This is not an accurate statement. Enterprise-class SSDs (eg. STEC Zeus)
> have DRAM write buffers. The Flash Mini-DIMMs Sun uses also have DRAM
> write buffers. These offer very low write latency for slogs.

Yah, that misconception has already been pointed out to me offlist. I actually came upon it in correspondence with you; I had asked about using a slice of an SSD for a slog rather than the whole disk, and you mentioned that the advice for using the whole disk rather than a slice was only for traditional spinning hard drives and didn't apply to SSDs, I thought because of something to do with the write cache, but I guess I misunderstood. I didn't save that message; perhaps you could be kind enough to refresh my memory as to why slices of SSDs are ok while slices of hard disks are best avoided?

-- 
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | henson at csupomona.edu
California State Polytechnic University | Pomona CA 91768
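Working the wiki's rule of thumb through with some made-up numbers may explain part of the gap:

    10 s x 5 MB/s of sync writes       = ~50 MB    (a light NFS workload; fits "rarely >100 MBytes")
    10 s x 100 MB/s of sync writes     = ~1 GB     (a heavy, hypothetical workload)
    10 s x ~1.25 GB/s (10GbE line rate) = ~12.5 GB (worst case; roughly where 10-18GB could come from)

The last line is only a guess about how the larger figure was derived, but the arithmetic itself is just expected sync write rate times the number of seconds you want to be able to buffer.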
Paul B. Henson wrote:
> On Sat, 6 Jun 2009, Richard Elling wrote:
>
>> The presumption is that you are using UFS for the CF, not ZFS.
>> UFS is not COW, so there is a potential endurance problem for
>> blocks which are known to be rewritten many times. ZFS will not
>> have this problem, so if you use ZFS root, you are better served by
>> ignoring the previous "advice."
>
> My understanding was that all modern CF cards incorporate wear leveling,
> and I was interpreting the recommendation as trying to prevent wearing out
> the entire card, not necessarily particular blocks.

Wear leveling is an attempt to solve the problem of multiple writes to the same physical block.

>> of writes to the swap device. For OpenSolaris (enterprise support
>> contracts now available!) which uses ZFS for swap, don't worry, be
>
> As of U6, even luddite S10 users can avail themselves of ZFS for boot/swap/dump:
>
> root at ike ~ # uname -a
> SunOS ike 5.10 Generic_138889-08 i86pc i386 i86pc
>
> root at ike ~ # swap -l
> swapfile                   dev    swaplo   blocks      free
> /dev/zvol/dsk/ospool/swap  181,2       8  8388600   8388600

Yes, and as you can see, my attempts to get the verbiage changed have failed :-(

>> In short, if you use ZFS for root, ignore the warnings.
>
> How about the lack of redundancy? Is the failure rate for CF so low there's
> no risk in running a critical server without a mirrored root pool? And what
> about bit rot? Without redundancy ZFS can only detect but not correct read
> errors (unless, I suppose, configured with copies>1). How much more would
> it have cost to include two CF slots that it wasn't warranted?

The failure rate is much lower than disks, with the exception of the endurance problem. Flash memory is not susceptible to the bit rot that plagues magnetic media. Nor is flash memory susceptible to the radiation-induced bit flips that plague DRAMs. Or, to look at this another way, billions of consumer electronics devices use a single flash "boot disk" and there doesn't seem to be many people complaining they aren't mirrored. Indeed, even if you have a mirrored OS on flash, you don't have a mirrored OBP or BIOS (which is also on flash). So, the risk here is significantly lower than HDDs.

>> 5 GBytes seems pretty large for a slog, but yes, I think this is a good
>> idea.
>
> What is the best formula to calculate slog size? I found a recent thread:
>
> http://jp.opensolaris.org/jive/thread.jspa?threadID=78758&tstart=1
>
> in which a Sun engineer (presumably unofficially of course ;) ) mentioned
> 10-18GB as more than sufficient. On the other hand:
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
>
> says "A rule of thumb is that you should size the separate log to be able
> to handle 10 seconds of your expected synchronous write workload. It would
> be rare to need more than 100 MBytes in a separate log device, but the
> separate log must be at least 64 MBytes."

This was a ROT when the default txg sync time was 5 seconds... I'll update this soon because that is no longer the case.

> Big gap between 100MB and 10-18GB. The first thread also mentioned in
> passing that splitting up an SSD between slog and root pool might have
> undesirable performance issues, although I don't think that was discussed
> to resolution.

Yep, big gap. This is why I wrote zilstat, so that you can see what your workload might use before committing to a slog.

There may be a good zilstat RFE here: I can see when the txg commits, so zilstat should be able to collect per-txg rather than per-time-period. Consider it added to my todo list.

http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

>> CFs designed for the professional photography market have better
>> specifications than CFs designed for the consumer market.
>
> CF is pretty cheap, you can pick up 16GB-32GB from $80-$200 depending on
> brand/quality. Assuming they do incorporate wear leveling, and considering
> even a fairly busy server isn't going to use up *that* much space (I have a
> couple E3000's still running which have 4GB disk mirrors for the OS), if
> you get a decent CF card I suppose it would quite possibly outlast the
> server.

I think the dig against CF is that they tend to have a low write speed for small iops. They are optimized for writing large files, like photos.

> But I think I'd still rather have two 8-/. Show of hands, anybody with an
> x4540 that's booting off non-redundant CF?
>
>> This is not an accurate statement. Enterprise-class SSDs (eg. STEC Zeus)
>> have DRAM write buffers. The Flash Mini-DIMMs Sun uses also have DRAM
>> write buffers. These offer very low write latency for slogs.
>
> Yah, that misconception has already been pointed out to me offlist. I
> actually came upon it in correspondence with you, I had asked about using a
> slice of an SSD for a slog rather than the whole disk, and you mentioned
> that the advice for using the whole disk rather than a slice was only for
> traditional spinning hard drives and didn't apply to SSDs, I thought
> because of something to do with the write cache but I guess I
> misunderstood. I didn't save that message, perhaps you could be kind
> enough to refresh my memory as to why slices of SSDs are ok while slices
> of hard disks are best avoided?

In the enterprise class SSDs, the DRAM buffer is nonvolatile. In HDDs, the DRAM buffer is volatile. HDDs will flush their DRAM buffer if you give it the command to do so, which is what ZFS will do when it owns the whole disk. This "design feature" is the cause of much confusion over the years, though.
 -- richard
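As a footnote to the whole-disk vs. slice discussion: one way to see whether ZFS enabled a drive's volatile write cache (which it only does when it is given the whole disk) is format's expert mode; the menu items below are from memory and may vary slightly by release and disk type:

    format -e              # then select the disk in question
    format> cache
    cache> write_cache
    write_cache> display   # reports whether the write cache is currently enabled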
>> CFs designed for the professional photography market have better
>> specifications than CFs designed for the consumer market.
>
> CF is pretty cheap, you can pick up 16GB-32GB from $80-$200 depending on
> brand/quality. Assuming they do incorporate wear leveling, and considering
> even a fairly busy server isn't going to use up *that* much space (I have a
> couple E3000's still running which have 4GB disk mirrors for the OS), if
> you get a decent CF card I suppose it would quite possibly outlast the
> server.

> I think the dig against CF is that they tend to have a low write speed
> for small iops. They are optimized for writing large files, like photos.

Would a 32GB SanDisk Extreme CompactFlash card, 60MB/s (SDCFX-032G-P61), or a 16GB SanDisk Extreme CompactFlash card, 60MB/s (SDCFX-016G-A61), qualify as a decent card, or is there another brand I should look for? Are 32GB cards "supported" at this point? How about UDMA 400x?
-- 
This message posted from opensolaris.org