Scott Meilicke
2009-Oct-20 21:46 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
I have an Intel X25-E 32G in the mail (actually the Kingston version), and wanted to get a sanity check before I start.

System:
Dell 2950
16G RAM
16 1.5T SATA disks in a SAS chassis hanging off of an LSI 3801e, no extra drive slots, a single zpool.
snv_124, but with my zpool still running at the 2009.06 version (14).

I will likely get another chassis and 16 disks for another pool in the 3-18 month time frame.

My plan is to put the SSD into an open disk slot on the 2950, but I will have to configure it as a RAID 0, since the onboard PERC5 controller does not have a JBOD mode.

Options I am considering:
A. Use all 32G for the ZIL.
B. Use 8G for the ZIL, 24G for an L2ARC. Any issues with slicing up an SSD like this?
C. Use 8G for the ZIL, 16G for an L2ARC, and reserve 8G to be used as a ZIL for the future zpool.

Since my future zpool would just be used as a backup-to-disk target, I am leaning towards option C. Any gotchas I should be aware of?

Thanks,
Scott
--
This message posted from opensolaris.org
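(For illustration, a minimal sketch of what options A and B/C translate to at the zpool level. The pool name "tank" and the SSD device/slice names are placeholders, not actual devices from this system.)

   # Option A: hand the whole SSD to the existing pool as a separate intent log
   zpool add tank log c3t1d0

   # Options B/C: carve the SSD into slices first (format/fdisk), then add them
   zpool add tank log c3t1d0s0       # e.g. an 8G slice for the ZIL
   zpool add tank cache c3t1d0s1     # remaining space as L2ARC

Cache devices can be added and removed at any time, but at pool version 14 a separate log device cannot be removed once added (log device removal only arrived in a later pool version), so the layout is worth deciding up front.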
Bob Friesenhahn
2009-Oct-20 23:44 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Tue, 20 Oct 2009, Scott Meilicke wrote:
>
> A. Use all 32G for the ZIL
> B. Use 8G for the ZIL, 24G for an L2ARC. Any issues with slicing up an SSD like this?
> C. Use 8G for the ZIL, 16G for an L2ARC, and reserve 8G to be used as a ZIL for the future zpool.
>
> Since my future zpool would just be used as a backup to disk target,
> I am leaning towards option C. Any gotchas I should be aware of?

Option "A" seems better to me. The reason why it seems better is that any write to the device consumes write IOPS and the X25-E does not really have that many to go around. FLASH SSDs don't really handle writes all that well due to the need to erase larger blocks than are actually written. Contention for access will simply make matters worse. With its write cache disabled (which you should do since the X25-E's write cache is volatile), the X25-E has been found to offer a bit more than 1000 write IOPS.

With 16GB of RAM, you should not need an L2ARC for a backup-to-disk target (a write-mostly application). The ZFS ARC will be able to expand to 14GB or so, which is quite a lot of read caching already.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
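(To check how much of that RAM the ARC is actually using on a running box, one quick way, assuming the standard arcstats kstats are present in the build:)

   kstat -p zfs:0:arcstats:size     # current ARC size, in bytes
   kstat -p zfs:0:arcstats:c_max    # the ceiling the ARC will grow to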
Richard Elling
2009-Oct-21 00:26 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Oct 20, 2009, at 4:44 PM, Bob Friesenhahn wrote:
> On Tue, 20 Oct 2009, Scott Meilicke wrote:
>>
>> A. Use all 32G for the ZIL
>> B. Use 8G for the ZIL, 24G for an L2ARC. Any issues with slicing up
>> an SSD like this?
>> C. Use 8G for the ZIL, 16G for an L2ARC, and reserve 8G to be used
>> as a ZIL for the future zpool.
>>
>> Since my future zpool would just be used as a backup to disk
>> target, I am leaning towards option C. Any gotchas I should be
>> aware of?
>
> Option "A" seems better to me. The reason why it seems better is
> that any write to the device consumes write IOPS and the X25-E does
> not really have that many to go around. FLASH SSDs don't really
> handle writes all that well due to the need to erase larger blocks
> than are actually written. Contention for access will simply make
> matters worse. With its write cache disabled (which you should do
> since the X25-E's write cache is volatile), the X25-E has been found
> to offer a bit more than 1000 write IOPS. With 16GB of RAM, you
> should not need an L2ARC for a backup-to-disk target (a write-mostly
> application). The ZFS ARC will be able to expand to 14GB or so,
> which is quite a lot of read caching already.

The ZIL device will never require more space than RAM. In other words, if you only have 16 GB of RAM, you won't need more than that for the separate log.
 -- richard
Bob Friesenhahn
2009-Oct-21 01:46 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Tue, 20 Oct 2009, Richard Elling wrote:
>
> The ZIL device will never require more space than RAM.
> In other words, if you only have 16 GB of RAM, you won't need
> more than that for the separate log.

Does the wasted storage space annoy you? :-)

What happens if the machine is upgraded to 32GB of RAM later?

The write performance of the X25-E is likely to be the bottleneck for a write-mostly storage server if the storage server has excellent network connectivity.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Meilicke, Scott
2009-Oct-21 04:31 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Thank you Bob and Richard. I will go with A, as it also keeps things simple. One physical device per pool.

-Scott

On 10/20/09 6:46 PM, "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> wrote:

> On Tue, 20 Oct 2009, Richard Elling wrote:
>>
>> The ZIL device will never require more space than RAM.
>> In other words, if you only have 16 GB of RAM, you won't need
>> more than that for the separate log.
>
> Does the wasted storage space annoy you? :-)
>
> What happens if the machine is upgraded to 32GB of RAM later?
>
> The write performance of the X25-E is likely to be the bottleneck for a
> write-mostly storage server if the storage server has excellent
> network connectivity.
>
> Bob
Edward Ned Harvey
2009-Oct-21 04:59 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
> System:
> Dell 2950
> 16G RAM
> 16 1.5T SATA disks in a SAS chassis hanging off of an LSI 3801e, no
> extra drive slots, a single zpool.
> snv_124, but with my zpool still running at the 2009.06 version (14).
>
> My plan is to put the SSD into an open disk slot on the 2950, but will
> have to configure it as a RAID 0, since the onboard PERC5 controller
> does not have a JBOD mode.

You can JBOD with the perc. It might be technically a raid0 or raid1 with a single disk in it, but that would be functionally equivalent to JBOD.
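(If you do present the SSD as a single-disk RAID 0, a quick sanity check that Solaris actually sees the resulting virtual disk before handing it to ZFS; nothing PERC-specific here, just the generic device listing:)

   format < /dev/null    # list the disks the OS can see; the new virtual disk should appear
   iostat -En            # per-device identity and error counters, handy for spotting it by size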
Meilicke, Scott
2009-Oct-21 05:19 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Thanks Ed. It sounds like you have run in this mode? No issues with the perc?

--
Scott Meilicke

On Oct 20, 2009, at 9:59 PM, "Edward Ned Harvey" <solaris at nedharvey.com> wrote:

>> System:
>> Dell 2950
>> 16G RAM
>> 16 1.5T SATA disks in a SAS chassis hanging off of an LSI 3801e, no
>> extra drive slots, a single zpool.
>> snv_124, but with my zpool still running at the 2009.06 version (14).
>>
>> My plan is to put the SSD into an open disk slot on the 2950, but
>> will
>> have to configure it as a RAID 0, since the onboard PERC5 controller
>> does not have a JBOD mode.
>
> You can JBOD with the perc. It might be technically a raid0 or
> raid1 with a
> single disk in it, but that would be functionally equivalent to JBOD.
Frédéric VANNIERE
2009-Oct-21 05:24 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
The ZIL is a write-only log that is only read after a power failure. Several GB is large enough for most workloads.

You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache that can't be disabled nor flushed by ZFS.

Imagine your server has a power failure while writing data to the pool. In a normal situation, with the ZIL on a reliable device, ZFS will read the ZIL and come back to a stable state at reboot. You may have lost some data (30 seconds) but the zpool works. With the Intel X25-E as ZIL, some log data will have been lost in the power failure (32/64MB max), which leads to a corrupted log and so ... you lose your zpool and all your data!!

For the ZIL you need two reliable mirrored SSD devices with a supercapacitor that can flush the write cache to NAND when a power failure occurs.

A hard disk also has a write cache, but it can be disabled or flushed by the operating system.

For more information: http://www.c0t0d0s0.org/archives/5993-Somewhat-stable-Solid-State.html
--
This message posted from opensolaris.org
Marc Bevand
2009-Oct-21 06:57 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Bob Friesenhahn <bfriesen <at> simple.dallas.tx.us> writes:
> [...]
> X25-E's write cache is volatile), the X25-E has been found to offer a
> bit more than 1000 write IOPS.

I think this is incorrect. On paper the X25-E offers 3300 random write 4kB IOPS (and Intel is known to be very conservative about the IOPS perf numbers they publish). "Dumb" storage IOPS benchmark tools that don't issue parallel I/O ops to the drive tend to report numbers less than half the theoretical IOPS. This would explain why you see only 1000 IOPS.

I have direct evidence to prove this (with the other, MLC line of SSD drives, the X25-M): 35000 random read 4kB IOPS theoretical; 1 instance of a private benchmarking tool measures 6000, 10+ instances of this tool measure 37000 IOPS (slightly better than the theoretical max!)

-mrb
Tristan Ball
2009-Oct-21 09:49 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
What makes you say that the X25-E's cache can't be disabled or flushed? The net seems to be full of references to people who are disabling the cache, or flushing it frequently, and then complaining about the performance!

T

Frédéric VANNIERE wrote:
> The ZIL is a write-only log that is only read after a power failure. Several GB is large enough for most workloads.
>
> You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache that can't be disabled nor flushed by ZFS.
>
> Imagine your server has a power failure while writing data to the pool. In a normal situation, with the ZIL on a reliable device, ZFS will read the ZIL and come back to a stable state at reboot. You may have lost some data (30 seconds) but the zpool works. With the Intel X25-E as ZIL, some log data will have been lost in the power failure (32/64MB max), which leads to a corrupted log and so ... you lose your zpool and all your data!!
>
> For the ZIL you need two reliable mirrored SSD devices with a supercapacitor that can flush the write cache to NAND when a power failure occurs.
>
> A hard disk also has a write cache, but it can be disabled or flushed by the operating system.
>
> For more information: http://www.c0t0d0s0.org/archives/5993-Somewhat-stable-Solid-State.html
Edward Ned Harvey
2009-Oct-21 14:17 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
> Thanks Ed. It sounds like you have run in this mode? No issues with
> the perc?
>
>> You can JBOD with the perc. It might be technically a raid0 or
>> raid1 with a
>> single disk in it, but that would be functionally equivalent to JBOD.

The only time I did this was ... I have a Windows server, on a PE2950 with Perc5i, running on a disk mirror for the OS with a hotspare. Then I needed to add some more space, and the only disk I had available was a single 750G. So I added it with no problem, and I ordered another 750G to be the mirror of the first one. I used a single disk successfully until the 2nd disk arrived, and then I enabled mirroring from the 1st to the 2nd. Everything went well. No interruptions. The system was a little slow while resilvering.

The one big obvious difference between my setup and yours is the OS. I expect that the OS doesn't change the capabilities of the Perc card, so I think you should be fine.

The one comment I will make in regards to the OS, which many people might overlook, is ... There are two interfaces to configure your Perc card. One is the BIOS interface, and the other is the Dell OpenManage Server Administrator (managed node), AKA the Dell OMSA Managed Node. This provides an interface at https://machine:1311 which allows you to configure the card, monitor health, enable/disable hotspares, resilver a new disk, etc., while the OS is running (no need to shut down into the BIOS). OMSA is required in order to replace a failed disk without a reboot, or add disks, or do anything else you might want to do on the Perc card.

I know OMSA is available for Windows and Linux. How about Solaris? Out of curiosity, I logged into Dell support just now to look up my 2950. The supported OSes are Netware, Windows, RedHat, and Suse. Which means, on my system, if I were running Solaris, I could count on *not* being able to run OMSA, and consequently the only interface to configure the Perc would be the BIOS. If Solaris is able to install at all, I would have to acknowledge that I have to shut down anytime I need to change the Perc configuration, including replacing failed disks.
Scott Meilicke
2009-Oct-21 15:16 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
<sigh>

Thanks Frédéric, that is a very interesting read.

So my options as I see them now:

1. Keep the X25-E and disable the cache. Performance should still be improved, but not by a *whole* lot, right? I will google for an expectation, but if anyone knows off the top of their head, I would be appreciative.
2. Buy a ZEUS or similar SSD with a cap-backed cache. Pricing is a little hard to come by, based on my quick google, but I am seeing $2-3k for an 8G model. Is that right? Yowch.
3. Wait for the X25-E G2, which is rumored to have a cap-backed cache, and may or may not work well (but probably will).
4. Put the X25-E with disabled cache behind my PERC with the PERC cache enabled.

My budget is tight. I want better performance now. #4 sounds good. Thoughts?

Regarding mirrored SSDs for the ZIL, it was my understanding that if the SSD-backed ZIL failed, ZFS would fall back to using the regular pool for the ZIL, correct? Assuming this is correct, a mirror would be to preserve performance during a failure?

Thanks everyone, this has been really helpful.

-Scott
--
This message posted from opensolaris.org
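(For option 1, a rough sketch of how the write cache is usually toggled on Solaris. This assumes the SSD shows up as an ordinary sd/sata device and that the controller passes the commands through; behind a PERC virtual disk the menus may differ or be unavailable.)

   format -e
     # select the SSD from the disk list, then:
     format> cache
     cache> write_cache
     write_cache> display      # show the current setting
     write_cache> disable      # turn the volatile write cache off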
Scott Meilicke
2009-Oct-21 15:30 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Ed, your comment:

> If Solaris is able to install at all, I would have to acknowledge that I
> have to shut down anytime I need to change the Perc configuration, including
> replacing failed disks.

Replacing failed disks is easy when the PERC is doing the RAID. Just remove the failed drive and replace it with a good one, and the PERC will rebuild automatically. But are you talking about OpenSolaris-managed RAID? I am pretty sure, but have not tested, that in pseudo-JBOD mode (each disk a RAID 0 or 1), the PERC would still present a replaced disk to the OS without reconfiguring the PERC BIOS.

Scott
--
This message posted from opensolaris.org
Richard Elling
2009-Oct-21 17:03 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Oct 20, 2009, at 10:24 PM, Frédéric VANNIERE wrote:
> The ZIL is a write-only log that is only read after a power failure.
> Several GB is large enough for most workloads.
>
> You can't use the Intel X25-E because it has a 32 or 64 MB volatile
> cache that can't be disabled nor flushed by ZFS.

I am surprised by this assertion and cannot find any confirmation from Intel. Rather, the cache flush command is specifically mentioned as supported in Section 6.1.1 of the Intel X25-E SATA Solid State Drive Product Manual.
http://download.intel.com/design/flash/nand/extreme/extreme-sata-ssd-datasheet.pdf

I suspect that this is confusion relating to the various file systems, OSes, or virtualization platforms which may or may not, by default, ignore cache flushes. Since NTFS uses the cache flush commands, I would be very surprised if Intel would intentionally ignore it.

> Imagine your server has a power failure while writing data to the
> pool. In a normal situation, with the ZIL on a reliable device, ZFS will
> read the ZIL and come back to a stable state at reboot. You may have
> lost some data (30 seconds) but the zpool works. With the Intel
> X25-E as ZIL, some log data will have been lost in the power failure
> (32/64MB max), which leads to a corrupted log and so ... you lose
> your zpool and all your data!!

The ZIL works fine for devices which support the cache flush command.
 -- richard
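(One way to confirm that ZFS on a given host has not been told to skip those flushes; a sketch assuming the zfs_nocacheflush tunable exists under that name in your build:)

   echo zfs_nocacheflush/D | mdb -k    # 0 means ZFS is sending cache flush requests to the devices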
Bob Friesenhahn
2009-Oct-21 17:21 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Wed, 21 Oct 2009, Marc Bevand wrote:
> Bob Friesenhahn <bfriesen <at> simple.dallas.tx.us> writes:
>> [...]
>> X25-E's write cache is volatile), the X25-E has been found to offer a
>> bit more than 1000 write IOPS.
>
> I think this is incorrect. On paper the X25-E offers 3300 random write
> 4kB IOPS (and Intel is known to be very conservative about the IOPS perf
> numbers they publish). "Dumb" storage IOPS benchmark tools that don't issue
> parallel I/O ops to the drive tend to report numbers less than half the
> theoretical IOPS. This would explain why you see only 1000 IOPS.

The Intel specified random write IOPS are with the cache enabled and without cache flushing. They also carefully only use a limited span of the device, which fits most perfectly with how the device is built. There is no mention of burning in the device for a few days to make sure that it is in a useful state. In order for the test to be meaningful, the device needs to be loaded up for a while before taking any measurements.

Device performance should be specified as a minimum assured level of performance and not as meaningless "peak" ("up to") values. I repeat: peak values are meaningless.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
David Dyer-Bennet
2009-Oct-21 17:40 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Wed, October 21, 2009 12:21, Bob Friesenhahn wrote:
>
> Device performance should be specified as a minimum assured level of
> performance and not as meaningless "peak" ("up to") values. I repeat:
> peak values are meaningless.

Seems a little pessimistic to me. Certainly minimum assured values are the basic thing people need to know, but reasonably characterized peak values can be valuable, if the conditions yielding them match possible application usage patterns.

The obvious example in electrical wiring is that the startup surge of motors and the short-term over-current potential of circuit breakers actually match each other fairly well, so that most saws (for example) that can run comfortably on a given circuit can actually be *started* on that circuit. Peak performance can have practical applications!

Certainly a really carefully optimized "peak" will almost certainly NOT represent a useful possible performance level, and they should always be considered meaningless until you've really proven otherwise.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Bob Friesenhahn
2009-Oct-21 17:53 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Wed, 21 Oct 2009, David Dyer-Bennet wrote:
>>
>> Device performance should be specified as a minimum assured level of
>> performance and not as meaningless "peak" ("up to") values. I repeat:
>> peak values are meaningless.
>
> Seems a little pessimistic to me. Certainly minimum assured values are
> the basic thing people need to know, but reasonably characterized peak
> values can be valuable, if the conditions yielding them match possible
> application usage patterns.

Agreed. It is useful to know minimum, median, and peak values. If there is a peak, it is useful to know how long that peak may be sustained. Intel's specifications have not characterized the actual performance of the device at all.

The performance characteristics of rotating media are well understood since they have been observed for tens of years. From this we already know that the "peak" performance of a hard drive does not have much to do with its steady-state performance, since the peak performance is often defined by the hard drive cache size and the interface type and clock rate.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
David Dyer-Bennet
2009-Oct-21 18:48 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Wed, October 21, 2009 12:53, Bob Friesenhahn wrote:
> On Wed, 21 Oct 2009, David Dyer-Bennet wrote:
>>>
>>> Device performance should be specified as a minimum assured level of
>>> performance and not as meaningless "peak" ("up to") values. I repeat:
>>> peak values are meaningless.
>>
>> Seems a little pessimistic to me. Certainly minimum assured values are
>> the basic thing people need to know, but reasonably characterized peak
>> values can be valuable, if the conditions yielding them match possible
>> application usage patterns.
>
> Agreed. It is useful to know minimum, median, and peak values. If
> there is a peak, it is useful to know how long that peak may be
> sustained. Intel's specifications have not characterized the actual
> performance of the device at all.

And just a random number labeled as "peak" really IS meaningless, yes.

> The performance characteristics of rotating media are well understood
> since they have been observed for tens of years. From this we already
> know that the "peak" performance of a hard drive does not have much to
> do with its steady-state performance since the peak performance is
> often defined by the hard drive cache size and the interface type and
> clock rate.

It strikes me that disks have been developing rather too independently of, and sometimes in conflict with, the requirements for reliable interaction with the filesystems in various OSes. Things like power-dependent write caches: they boost peak writes but not sustained writes, which is probably benchmark-friendly, AND introduce the problem of writes committed to the drive not being safe in a power failure.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Paul B. Henson
2009-Oct-21 19:25 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Tue, 20 Oct 2009, Frédéric VANNIERE wrote:

> You can't use the Intel X25-E because it has a 32 or 64 MB volatile cache
> that can't be disabled nor flushed by ZFS.

Say what? My understanding is that the officially supported Sun SSD for the x4540 is an OEM'd Intel X25-E, so I don't see how it could not be a good slog device.

--
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  henson at csupomona.edu
California State Polytechnic University  |  Pomona CA 91768
Marc Bevand
2009-Oct-22 04:08 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Bob Friesenhahn <bfriesen <at> simple.dallas.tx.us> writes:
>
> The Intel specified random write IOPS are with the cache enabled and
> without cache flushing.

For random write I/O, caching improves I/O latency, not sustained I/O throughput (which is what random write IOPS usually refer to). So Intel can't cheat with caching. However they can cheat by benchmarking a brand new drive instead of an aged one.

> They also carefully only use a limited span
> of the device, which fits most perfectly with how the device is built.

AFAIK, for the X25-E series, they benchmark random write IOPS on a 100% span. You may be confusing it with the X25-M series, with which they actually clearly disclose two performance numbers: 350 random write IOPS on an 8GB span, and 3.3k on a 100% span. See
http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/tech/425265.htm

I agree with the rest of your email.

-mrb
Edward Ned Harvey
2009-Oct-22 13:14 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
> Replacing failed disks is easy when PERC is doing the RAID. Just remove
> the failed drive and replace with a good one, and the PERC will rebuild
> automatically.

Sorry, not correct. When you replace a failed drive, the perc card doesn't know for certain that the new drive you're adding is meant to be a replacement. For all it knows, you could coincidentally be adding new disks for a new VirtualDevice which already contains data, during the failure state of some other device. So it will not automatically resilver (which would be a permanently destructive process, applied to a disk which is not *certainly* meant for destruction).

You have to open the perc config interface, tell it this disk is a replacement for the old disk (probably you're just saying "This disk is the new global hotspare"), or else the new disk will sit there like a bump on a log. Doing nothing.
Edward Ned Harvey
2009-Oct-22 13:17 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
> The Intel specified random write IOPS are with the cache enabled and
> without cache flushing. They also carefully only use a limited span
> of the device, which fits most perfectly with how the device is built.

How do you know this? This sounds much more detailed than any average person could ever know....
Actually, I think this is a case of crossed wires. This issue was reported a while back on a news site for the X25-M G2. Somebody pointed out that these devices have 8GB of cache, which is exactly the dataset size they use for the iops figures. The X25-E datasheet however states that while write cache is enabled, the iops figures are over the entire drive. And looking at the X25-M G2 datasheet again, it states that the measurements are over 8GB of range, but these come with 32MB of cache, so I think that was also a false alarm. -- This message posted from opensolaris.org
Meilicke, Scott
2009-Oct-22 14:44 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
Interesting. We must have different setups with our PERCs. Mine have always auto-rebuilt.

--
Scott Meilicke

On Oct 22, 2009, at 6:14 AM, "Edward Ned Harvey" <solaris at nedharvey.com> wrote:

>> Replacing failed disks is easy when PERC is doing the RAID. Just
>> remove
>> the failed drive and replace with a good one, and the PERC will
>> rebuild
>> automatically.
>
> Sorry, not correct. When you replace a failed drive, the perc card doesn't
> know for certain that the new drive you're adding is meant to be a
> replacement. For all it knows, you could coincidentally be adding new disks
> for a new VirtualDevice which already contains data, during the failure
> state of some other device. So it will not automatically resilver (which
> would be a permanently destructive process, applied to a disk which is not
> *certainly* meant for destruction).
>
> You have to open the perc config interface, tell it this disk is a
> replacement for the old disk (probably you're just saying "This disk is the
> new global hotspare"), or else the new disk will sit there like a bump on a
> log. Doing nothing.
Bob Friesenhahn
2009-Oct-22 16:13 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Thu, 22 Oct 2009, Marc Bevand wrote:
> Bob Friesenhahn <bfriesen <at> simple.dallas.tx.us> writes:
>
> For random write I/O, caching improves I/O latency, not sustained I/O
> throughput (which is what random write IOPS usually refer to). So Intel can't
> cheat with caching. However they can cheat by benchmarking a brand new drive
> instead of an aged one.

With FLASH devices, a sufficiently large write cache can improve random write I/O. One can imagine that the "wear leveling" logic could be used to do tricky remapping so that several "random writes" actually lead to sequential writes to the same FLASH superblock, so only one superblock needs to be updated and the parts of the old superblocks which would have been overwritten are marked as unused. This of course requires rather advanced remapping logic at a finer-grained resolution than the superblock. When erased space becomes tight (or on a periodic basis), the data in several sparsely-used superblocks is migrated to a different superblock in a more compact way (along with the requisite logical block remapping) to reclaim space. It is worth developing such remapping logic since FLASH erasures and re-writes are so expensive.

>> They also carefully only use a limited span
>> of the device, which fits most perfectly with how the device is built.
>
> AFAIK, for the X25-E series, they benchmark random write IOPS on a 100% span.
> You may be confusing it with the X25-M series, with which they actually clearly
> disclose two performance numbers: 350 random write IOPS on an 8GB span, and 3.3k
> on a 100% span. See
> http://www.intel.com/cd/channel/reseller/asmo-na/eng/products/nand/tech/425265.htm

You are correct that I interpreted the benchmark scenarios from the X25-M series documentation. It seems reasonable for the same manufacturer to use the same benchmark methodology for similar products. Then again, they are still new at this.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Eric D. Mudama
2009-Oct-24 04:01 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Tue, Oct 20 at 22:24, Frédéric VANNIERE wrote:
> You can't use the Intel X25-E because it has a 32 or 64 MB volatile
> cache that can't be disabled nor flushed by ZFS.

I don't believe the above statement is correct.

According to anandtech, who asked Intel:

http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10

the DRAM doesn't hold user data. The article claims that data goes through an internal 256KB buffer.

Is Solaris incapable of issuing a SATA command FLUSH CACHE EXT?

--
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Bob Friesenhahn
2009-Oct-24 15:50 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Fri, 23 Oct 2009, Eric D. Mudama wrote:
>
> I don't believe the above statement is correct.
>
> According to anandtech, who asked Intel:
>
> http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403&p=10
>
> the DRAM doesn't hold user data. The article claims that data goes
> through an internal 256KB buffer.

These folks may well be in Intel's back pocket, but it seems that the data given is not clear or accurate. It does not matter if DRAM or SRAM is used for the disk's cache. What matters is if all user data gets flushed to non-volatile storage for each cache flush request. Since FLASH drives need to erase a larger block than might be written, existing data needs to be read, updated, and then written. This data needs to be worked on in a volatile buffer. Without extreme care, it is possible for the FLASH drive to corrupt other existing unrelated data if there is power loss. The FLASH drive could use a COW scheme (like ZFS) but it still needs to take care to persist the block mappings for each cache sync request or transactions would be lost.

Folks at another site found that the drive was losing the last few synchronous writes with the cache enabled. This could be a problem with the drive, or the OS if it is not issuing the cache flush request.

> Is Solaris incapable of issuing a SATA command FLUSH CACHE EXT?

It issues one for each update to the intent log.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
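(If you want to watch this happening on a live system, a rough DTrace sketch; this assumes the zil_commit() and zio_flush() kernel functions still exist under those names in your build:)

   # count intent log commits and the cache flushes they trigger, over 10 seconds
   dtrace -n '
     fbt::zil_commit:entry { @commits = count(); }
     fbt::zio_flush:entry  { @flushes = count(); }
     tick-10s {
       printa("zil commits:   %@d\n", @commits);
       printa("cache flushes: %@d\n", @flushes);
       exit(0);
     }'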
Bob Friesenhahn
2009-Oct-24 16:30 UTC
[zfs-discuss] Setting up an SSD ZIL - Need A Reality Check
On Sat, 24 Oct 2009, Bob Friesenhahn wrote:
>
>> Is Solaris incapable of issuing a SATA command FLUSH CACHE EXT?
>
> It issues one for each update to the intent log.

I should mention that FLASH SSDs without a capacitor/battery-backed cache flush (like the X25-E) are likely to get burned out pretty quickly if they respect each cache flush request. The reason is that each write needs to update a full FLASH metablock. This means that a small 4K synchronous update forces a write of a full FLASH metablock in the X25-E. I don't know the size of the FLASH metablock in the X25-E (it seems to be a closely-held secret), but perhaps it is 128K, 256K, or 512K.

The rumor that disabling the "cache" on the X25-E disables the wear leveling is probably incorrect. It is much more likely that disabling the "cache" causes each write to erase and write a full FLASH metablock (known as "write amplification"), therefore causing the device to wear out much more quickly than if it deferred writes.

http://www.tomshardware.com/reviews/Intel-x25-m-SSD,2012-5.html

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/