I'm likely to be building a ZFS server to act as NFS shared storage for a couple of VMware ESX servers. Does anybody have experience of using ZFS with VMware like this, and can anybody confirm the best zpool configuration?

The server will have 16x 500GB SATA drives, with dual Opteron CPUs and 8GB of RAM. Dual parity is a must, as are hot spares.

My first thoughts were to use raid-z2, with 2 hot spares and a zpool made up of two 7-drive raid-z2 volumes, giving me 5TB of storage.

However, I'm pretty sure VMware likes good random read performance, so I'm considering one hot spare and a zpool made up of 5x three-disk mirror volumes. It's only 2.5TB of storage, but that should be plenty, and I'd rather have good performance.

Can anybody confirm that random read performance is definitely better with mirrored volumes? Does ZFS use all the disks in the mirror sets independently when reading data? Am I right in thinking I could have around 7x better random read performance with the 15 mirrored drives, when compared to the two raid-z2 volumes?

thanks,

Ross

This message posted from opensolaris.org
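For reference, the two layouts being compared would be created roughly like this. This is only a sketch; the c1tXd0 device names are placeholders, not the server's real controller targets:

    # Option 1: two 7-drive raid-z2 vdevs plus two hot spares (~5TB usable)
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
        raidz2 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0 \
        spare c1t14d0 c1t15d0

    # Option 2: five 3-way mirror vdevs plus one hot spare (~2.5TB usable)
    zpool create tank \
        mirror c1t0d0 c1t1d0 c1t2d0 \
        mirror c1t3d0 c1t4d0 c1t5d0 \
        mirror c1t6d0 c1t7d0 c1t8d0 \
        mirror c1t9d0 c1t10d0 c1t11d0 \
        mirror c1t12d0 c1t13d0 c1t14d0 \
        spare c1t15d0

Small random reads are served per vdev, so option 1 behaves roughly like two spindles for that workload, while option 2 spreads it across five vdevs, with each mirror able to satisfy reads from any of its three sides.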
Wade.Stuart at fallon.com
2008-Jun-27 12:58 UTC
[zfs-discuss] ZFS configuration for VMware
zfs-discuss-bounces at opensolaris.org wrote on 06/27/2008 03:39:41 AM:

> I'm likely to be building a ZFS server to act as NFS shared storage
> for a couple of VMware ESX servers. Does anybody have experience of
> using ZFS with VMware like this, and can anybody confirm the best
> zpool configuration?
>
> The server will have 16x 500GB SATA drives, with dual Opteron CPUs
> and 8GB of RAM. Dual parity is a must, as are hot spares.
>
> My first thoughts were to use raid-z2, with 2 hot spares and a zpool
> made up of two 7 drive raid-z2 volumes, giving me 5TB of storage.

You will get two disks' worth of throughput with this setup. This is not good.

> However, I'm pretty sure VMware likes good random read performance,
> so I'm considering one hot spare and a zpool made up of 5x three
> disk mirror volumes. It's only 2.5TB of storage, but that should be
> plenty, and I'd rather have good performance.

This is what I would go with.

> Can anybody confirm that random read performance is definitely
> better with mirrored volumes. Does ZFS use all the disks in the
> mirror sets independently when reading data? Am I right in thinking
> I could have around 7x better random read performance with the 15
> mirrored drives, when compared to the two raid-z2 volumes?

Yes. Two caveats though. ZFS is a COW filesystem, currently with no
defrag. Placing a heavy write load (which VMware is) on this type of
storage (especially, but not only, if you are planning on using
snapshots), you will tend to see diminishing performance over time. Do
not allow the ZFS partition to become over 80% full -- performance hits
a wall hard with the kind of write profile you are going to expect with
VMware as ZFS looks for free blocks.

Test, test, test.

-Wade

> thanks,
>
> Ross
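As a small illustration of watching that 80% ceiling (the pool and dataset names here are only placeholders), you can keep an eye on the CAP column and optionally fence off headroom with a quota:

    # Watch overall pool usage -- aim to keep CAP below ~80%
    zpool list tank

    # Optionally cap the dataset holding the VM images so the pool
    # can never fill completely (2T is an example figure for a
    # roughly 2.5TB mirrored pool)
    zfs set quota=2T tank/vmware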
On Fri, Jun 27, 2008 at 07:58:42AM -0500, Wade.Stuart at fallon.com wrote:
>
> Yes. Two caveats though. ZFS is a COW filesystem, currently with no
> defrag. Placing a heavy write load (which VMware is) on this type of
> storage (especially, but not only, if you are planning on using
> snapshots), you will tend to see diminishing performance over time. Do
> not allow the ZFS partition to become over 80% full -- performance hits
> a wall hard with the kind of write profile you are going to expect with
> VMware as ZFS looks for free blocks.
>
> Test, test, test.

Also, it is *highly* recommended that you get a fast slog device like the
Gigabyte iRAM or at least some very fast SSDs. If you are going to be
using VirtualCenter you might want to consider testing iSCSI volumes against
NFS to see if that works better for your planned workload.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta tell
them exactly what you want or you'll end up with a cupboard full of pop
tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
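If it helps, a quick way to stand up both options for a head-to-head test might look something like this. The dataset and volume names are made up, and the shareiscsi property assumes the built-in iSCSI target of that era of (Open)Solaris:

    # NFS: share a dataset for the ESX hosts
    zfs create tank/vmware
    zfs set sharenfs=on tank/vmware

    # iSCSI: export a zvol as a LUN instead
    zfs create -V 500g tank/esx-lun0
    zfs set shareiscsi=on tank/esx-lun0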
Thanks both, very good pieces of advice there.

Wonko, I was about to question how much difference the iRAM will actually make with it being on a single SATA connection, but after googling, for £70 + RAM it's worth buying just as an experiment.

I'm really not interested in iSCSI; it might be slightly faster, but NFS's ease of use means we're definitely going down that route. There's enough press about larger VMware customers switching to NFS that I'm happy it'll work well enough.

At the end of the day I suspect we'll have a setup that's overkill considering the load on our servers. Performance really isn't a concern. To illustrate the point, our current VMware server is a 32GB Sun x2200 running 8 virtual servers and half a dozen virtual XP clients off two local SATA disks. That server's barely ticking over, so a pair of x2200s connecting to a 16-drive ZFS array via 10Gb/s Infiniband should work well enough.

We're upgrading more for security of the storage than for performance reasons. We started off with VMware and the Sun server on a 60-day trial; they proved so useful we bought them outright, and over the last few months their use has grown to the extent we really need the VMs stored properly now.

This message posted from opensolaris.org
Bleh, just found out the i-RAM is 5v PCI only. It won't work in PCI-X slots, which puts that out of the question for the motherboard I'm using. Vmetro have a 2GB PCI-E card out, but it's for OEMs only: http://www.vmetro.com/category4304.html, and I don't have any space in this server to mount an SSD.

Does anybody know of a PCI-X or PCI-E NVRAM or SSD card that can be used with ZFS?

This message posted from opensolaris.org
Brian Hechinger wrote:
> On Fri, Jun 27, 2008 at 07:58:42AM -0500, Wade.Stuart at fallon.com wrote:
>
>> Yes. Two caveats though. ZFS is a COW filesystem, currently with no
>> defrag. Placing a heavy write load (which VMware is) on this type of
>> storage (especially, but not only, if you are planning on using
>> snapshots), you will tend to see diminishing performance over time. Do
>> not allow the ZFS partition to become over 80% full -- performance hits
>> a wall hard with the kind of write profile you are going to expect with
>> VMware as ZFS looks for free blocks.
>>
>> Test, test, test.
>
> Also, it is *highly* recommended that you get a fast slog device like the
> Gigabyte iRAM or at least some very fast SSDs. If you are going to be
> using VirtualCenter you might want to consider testing iSCSI volumes against
> NFS to see if that works better for your planned workload.

You will want mirrored slogs.

Note that some companies (Crucial and STEC come to mind) sell SSDs which
fit in disk form factors. IIRC, the MacBook Air and EMC use STEC's SSDs.
 -- richard
On Fri, Jun 27, 2008 at 08:13:14AM -0700, Ross wrote:
> Bleh, just found out the i-RAM is 5v PCI only. It won't work in PCI-X
> slots, which puts that out of the question for the motherboard I'm
> using. Vmetro have a 2GB PCI-E card out, but it's for OEMs only:
> http://www.vmetro.com/category4304.html, and I don't have any space in
> this server to mount an SSD.

Maybe you can call Vmetro and get the names of some resellers whom you
could call to get pricing info?

--
albert chin (china at thewrittenword.com)
On Fri, Jun 27, 2008 at 11:50 AM, Albert Chin
<opensolaris-zfs-discuss at mlists.thewrittenword.com> wrote:

> On Fri, Jun 27, 2008 at 08:13:14AM -0700, Ross wrote:
> > Bleh, just found out the i-RAM is 5v PCI only. It won't work in PCI-X
> > slots, which puts that out of the question for the motherboard I'm
> > using. Vmetro have a 2GB PCI-E card out, but it's for OEMs only:
> > http://www.vmetro.com/category4304.html, and I don't have any space in
> > this server to mount an SSD.
>
> Maybe you can call Vmetro and get the names of some resellers whom you
> could call to get pricing info?
>
> --
> albert chin (china at thewrittenword.com)

I have a feeling one of those will cost as much as his servers did anyways.
On Fri, Jun 27, 2008 at 07:22:48AM -0700, Ross wrote:
> Thanks both, very good pieces of advice there.
>
> Wonko, I was about to question how much difference the iRAM will
> actually make with it being on a single SATA connection, but after
> googling, for £70 + RAM it's worth buying just as an experiment.

Yeah, it's an amazingly cheap item. I *really* need to break down and
buy one or two of them.

> I'm really not interested in iSCSI; it might be slightly faster, but
> NFS's ease of use means we're definitely going down that route.
> There's enough press about larger VMware customers switching to NFS
> that I'm happy it'll work well enough.

Ok, that's fine, I was just throwing the suggestion out there for you. ;)

That being said though, search this list/forum for the NFS performance
threads and you will see that you will most likely want to do _something_
with some sort of slog device. Just fair warning. ;)

The nice thing though is you can try without the slog, and if you find
that it would be nice to have, you can always add it later without issue
(and on the fly if you're set up for that with how you would connect it).

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta tell
them exactly what you want or you'll end up with a cupboard full of pop
tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
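For what it's worth, trying the pool without a slog and then bolting one on later really is a single command. A rough sketch, with a placeholder device name standing in for whatever NVRAM/SSD device ends up being used:

    # Add a separate intent log device to an existing pool, on the fly
    zpool add tank log c3t0d0

    # Watch how much of the synchronous write load it absorbs
    zpool iostat -v tank 5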
On Fri, Jun 27, 2008 at 08:32:23AM -0700, Richard Elling wrote:
>
> You will want mirrored slogs.

Yes, always an excellent recommendation.

> Note that some companies (Crucial and STEC come to mind) sell SSDs which
> fit in disk form factors. IIRC, the MacBook Air and EMC use STEC's SSDs.

The Mtron ones get really good reviews as well. Someone posted a link to
a review of them here, I forget who, and I'd have to dig up the links,
but the ones that Mtron sells were testing as the fastest ones on the
market. Granted that was a while ago, so the tide may have changed.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta tell
them exactly what you want or you'll end up with a cupboard full of pop
tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
Unfortunately, we need to be careful here with our terminology.

SSD used to refer strictly to standard DRAM backed with a battery (and maybe some sort of fancy enclosure with a hard drive to write all the DRAM data to after a power outage). It now encompasses the newer flash-based devices.

In my experience, the two devices have _vastly_ different usage and performance profiles. DRAM-based SSDs have virtually unlimited read/write performance (they'll out-perform virtually any standard I/O bus, since they're RAM), while flash devices are fast read / slower write.

In this case, I think we want the DRAM SSDs.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Fri, Jun 27, 2008 at 03:02:43PM -0700, Erik Trimble wrote:
> Unfortunately, we need to be careful here with our terminology.

You are completely and 100% correct, Erik. I've been throwing the
term SSD around, but in the context of what I'm thinking, by SSD I
mean this new-fangled flash-based SSD.

> In this case, I think we want the DRAM SSDs.

Yes, yes you do. I know I say this all the time, and I'm sure you're all
getting tired of me, but I *really* wish Gigabyte (or *ANYONE ELSE*, ahem,
Sun, *cough*) would produce something similar to the iRAM, but with 3.0Gbps
SATA/SAS and possibly more space, or at least newer, easier-to-find RAM.

That would make me *extremely* happy. ;)

I wish I had the money to bankroll something like that. But sadly, I don't.

-brian

--
"Coding in C is like sending a 3 year old to do groceries. You gotta tell
them exactly what you want or you'll end up with a cupboard full of pop
tarts and pancake mix." -- IRC User (http://www.bash.org/?841435)
> > Can anybody confirm that random read performance is definitely
> > better with mirrored volumes. Does ZFS use all the disks in the
> > mirror sets independently when reading data? Am I right in thinking
> > I could have around 7x better random read performance with the 15
> > mirrored drives, when compared to the two raid-z2 volumes?
>
> Yes. Two caveats though. ZFS is a COW filesystem, currently with no
> defrag. Placing a heavy write load (which VMware is) on this type of
> storage (especially, but not only, if you are planning on using
> snapshots), you will tend to see diminishing performance over time. Do
> not allow the ZFS partition to become over 80% full -- performance hits
> a wall hard with the kind of write profile you are going to expect with
> VMware as ZFS looks for free blocks.

Is there a move anytime soon to implement a defragmenter?

Matthew
Erik Trimble wrote:
> Brian Hechinger wrote:
>
>> On Fri, Jun 27, 2008 at 03:02:43PM -0700, Erik Trimble wrote:
>>
>>> Unfortunately, we need to be careful here with our terminology.
>>
>> You are completely and 100% correct, Erik. I've been throwing the
>> term SSD around, but in the context of what I'm thinking, by SSD I
>> mean this new-fangled flash-based SSD.
>>
>>> In this case, I think we want the DRAM SSDs.
>>
>> Yes, yes you do. I know I say this all the time, and I'm sure you're all
>> getting tired of me, but I *really* wish Gigabyte (or *ANYONE ELSE*, ahem,
>> Sun, *cough*) would produce something similar to the iRAM, but with 3.0Gbps
>> SATA/SAS and possibly more space, or at least newer, easier-to-find RAM.
>>
>> That would make me *extremely* happy. ;)
>>
>> I wish I had the money to bankroll something like that. But sadly, I don't.
>>
>> -brian

There's a small laundry list of things that I really wish I could find
some VC capital for (not much, either, maybe $1m at most), much of which
exists right now, but very feature-poor/horribly overpriced:

* PCI-card based RAM disk (a la iRAM), with 6-8 slots, NiCad battery, and
  SATA II interface
* 3.5" LP disk form-factor RAM disk, SCSI hotswap or SATA2 interface
* 5.25" CDROM-form-factor RAM disk, as above
* 3.5" LP disk form factor, SCSI hotswap/SATA2 and 4-8 Compact Flash
  card slots
* Huge RAM drive in a 1U small case (a la Cisco 2500-series routers),
  with SAS or FC attachment
* 2.5" (or 1.8") drive on a PCI-card

And, of course, the holy grail: a socket 940 Opteron with AMD-V
hardware extensions. :-)

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Thanks, that's something I hadn't realised. After googling, I've found this article comparing the i-RAM with a couple of SSDs, and the difference is quite something:
http://www.xbitlabs.com/articles/storage/display/ssd-iram_5.html

However, the SATA interface's limitations soon seem to even things out a bit:
http://www.xbitlabs.com/articles/storage/display/ssd-iram_7.html

So far STEC seem my best option (especially if EMC see fit to use them). I'm trying to find prices now, but it seems their SSDs have quite a range of performance available, with the Zeus IOPS range being the king of the heap at 45,000 IOPS:
http://www.stec-inc.com/product/zeusiops.php

The Vmetro card would probably put it to shame in terms of throughput, but while I'm going to try, I doubt I'll be able to get my hands on one, and SATA has its benefits in being a tried and tested technology.

Thanks for all the advice everyone. I'll keep this thread updated with prices and details once the manufacturers start getting back to me, in case anybody else is interested in this tech in the future.

This message posted from opensolaris.org
I believe there's a block rewrite function being worked on, which if memory serves will enable further technologies such as changing raid-z stripe size on the fly, defrag, etc. I doubt 'soon' is a word you could use to describe the timeframe for these arriving, however.

This message posted from opensolaris.org
Erik Trimble wrote:
> Brian Hechinger wrote:
>
>> On Fri, Jun 27, 2008 at 03:02:43PM -0700, Erik Trimble wrote:
>>
>>> Unfortunately, we need to be careful here with our terminology.
>>
>> You are completely and 100% correct, Erik. I've been throwing the
>> term SSD around, but in the context of what I'm thinking, by SSD I
>> mean this new-fangled flash-based SSD.
>>
>>> In this case, I think we want the DRAM SSDs.
>>
>> Yes, yes you do. I know I say this all the time, and I'm sure you're all
>> getting tired of me, but I *really* wish Gigabyte (or *ANYONE ELSE*, ahem,
>> Sun, *cough*) would produce something similar to the iRAM, but with 3.0Gbps
>> SATA/SAS and possibly more space, or at least newer, easier-to-find RAM.
>>
>> That would make me *extremely* happy. ;)
>>
>> I wish I had the money to bankroll something like that. But sadly, I don't.
>>
>> -brian
>
> There's a small laundry list of things that I really wish I could find
> some VC capital for (not much, either, maybe $1m at most), much of which
> exists right now, but very feature-poor/horribly overpriced:
>
> * PCI-card based RAM disk (a la iRAM), with 6-8 slots, NiCad battery, and
>   SATA II interface

Batteries are bad news. As Jonathan says, "The better SSDs combine
*Flash* with *DRAM* and a *supercapacitor* to buffer synchronous writes."
http://blogs.sun.com/jonathan/entry/not_a_flash_in_the

> * 3.5" LP disk form-factor RAM disk, SCSI hotswap or SATA2 interface

Parallel SCSI is dead. The hot-swappable SAS/SATA interfaces are good
and inexpensive.

> * 5.25" CDROM-form-factor RAM disk, as above

CD-ROMs are dead. With the size of slim DVDs today, you wouldn't
be able to put much space in them.

> * 3.5" LP disk form factor, SCSI hotswap/SATA2 and 4-8 Compact Flash
>   card slots
> * Huge RAM drive in a 1U small case (a la Cisco 2500-series routers),
>   with SAS or FC attachment
> * 2.5" (or 1.8") drive on a PCI-card
>
> And, of course, the holy grail: a socket 940 Opteron with AMD-V
> hardware extensions. :-)

This week, Verident announced a system using Spansion EcoRAMs
(DRAM + NOR Flash on a DIMM form factor) for main memory.
This is almost getting there, but seems to require some special OS
support, which is not surprising. The holy grail is fast, non-volatile
main memory -- and we can forget all about "disks" :-)
 -- richard
On Jun 28, 2008, at 10:17, Richard Elling wrote:

> This week, Verident announced a system using Spansion EcoRAMs
> (DRAM + NOR Flash on a DIMM form factor) for main memory.
> This is almost getting there, but seems to require some special OS
> support, which is not surprising. The holy grail is fast, non-volatile
> main memory -- and we can forget all about "disks" :-)

Per Jim Gray, disk is the new tape.
On Sat, Jun 28, 2008 at 1:42 AM, Erik Trimble <Erik.Trimble at sun.com> wrote:

> There's a small laundry list of things that I really wish I could find
> some VC capital for (not much, either, maybe $1m at most), much of which
> exists right now, but very feature-poor/horribly overpriced:
>
> * PCI-card based RAM disk (a la iRAM), with 6-8 slots, NiCad battery, and
>   SATA II interface
> * 3.5" LP disk form-factor RAM disk, SCSI hotswap or SATA2 interface
> * 5.25" CDROM-form-factor RAM disk, as above
> * 3.5" LP disk form factor, SCSI hotswap/SATA2 and 4-8 Compact Flash
>   card slots
> * Huge RAM drive in a 1U small case (a la Cisco 2500-series routers),
>   with SAS or FC attachment
> * 2.5" (or 1.8") drive on a PCI-card
>
> And, of course, the holy grail: a socket 940 Opteron with AMD-V
> hardware extensions. :-)

Even though I'm fairly certain I've already brought this up on this very
list... http://www.fusionio.com/
>>>>> "et" == Erik Trimble <Erik.Trimble at Sun.COM> writes:et> SSD used to refer strictly to standard DRAM backed with a et> battery (and, maybe some sort of a fancy enclosure with a hard et> drive to write all DRAM data to after a power outage). et> * 3.5" LP disk form factor, SCSI hotswap/SATA2 and 4-8 Compact et> Flash card slots I''m suspicious of things that are suppoed to protect you during extreme circumstances, which are themselves complicated and weird. I wonder if they are really helping, or if people use them like witches'' incantations, and believe the weird things work because along with the tens of thousands of dollars they spend on weird things, they also take care to never encounter the extreme circumstances. If they do encounter said circumstances, they think they were ``asking for it'''', not that their <weird complicated device> failed to do what it promised. so, this thing with battery sockets and DRAM sockets and probably also lights and buttons I, doesn''t inspire in me the same confidence as a hard disk which is sealed and never had any buttons lights or sockets. but whatever, maybe I''m turning into a grouchy old man. One thing I do like about the iram is that it attaches like a disk and can be moved from one machine to another along with the rest of the disks that make up the array. I don''t know how the storage vendors are doing this---do they use their batteries to power the disks themselves for a few seconds?---but it seems kind of mickey-mouse that, while normally you can move all your ``hot swap'''' disks from one enclosure to another, if the enclosure breaks, sometimes the enclosure itself is ``dirty''''. Maybe you have some way of powering it down habitually that leaves the NVRAM dirty, and you never realize you''re doing this because the NVRAM works well. One day, you plan some ``downtime'''' to ``migrate'''' the disks to a newer stepping of enclosure, which corrupts the array because you''ve separated the disks from the NVRAM. Instead of blaming the shitty NVRAM architecture, you decide that since you were ``asking for it'''' by moving so much stuff around, you should have asked for a half day maintenance window to do a full backup right before moving the disks, and the imbecile who stuffed this RAM into the guts of the beast and turned it into a bulletted marketing point without so much as a red LED to warn it wasn''t empty never gets the blame he deserves. I guess these days you would want battery-backed RAM, which gets copied onto FLASH after a power outage. I think the feature should go right on the motherboard. The battery should back up all DRAM in the system (or all DRAM attached to processor 0 in big systems, or something like that). but it should only supply ~60min of power. There''s also a microcontroller and a bunch of CF under battery power. During normal operation, the kernel registers chunks of physical memory with the microcontroller that need to be nonvolatile. During a power outage, and also during a weekly test, the battery supplies power to DRAM, CF cards, and the small microcontroller. The microcontroller copies the requested chunks of DRAM into CF. During the weekly test, the microcontroller deliberately does two copies, to make sure the battery has some extra capacity. And the kernel immediately ``scrubs'''' the CF to make sure it''s really working. During a normal shutdown, the kernel deregisters all chunks with the microcontroller so the microcontroller knows it can leave the CF card empty and power down. 
* something will have to simulate the load of the DRAM on the battery
  during the weekly test. You can't just switch the DRAM's power source
  to the battery during the test, because what if the test fails?

* next to each CF slot is an orange/green light. Whenever the
  microcontroller has power, especially while the machine is off, it
  makes the light orange if the CF card has data on it, green if it's
  empty, and dark if the battery test has failed. If the light's green,
  then the CF card is safe to consider part of the motherboard, not part
  of the array, and the array can be moved to another system with no
  NV-CF-RAM option. If the light is orange, then the CF card is part of
  the array and must be moved along with all the other disks.

* on boot, the kernel reads the log directly from the CF card. The
  kernel can read the log from CF cards in USB-to-CF adapters, too, so
  whatever format the microcontroller uses for dumps needs to be
  respected by ZFS as at least an acceptable read-only ZIL format for
  disk devices.

* if the CF card has data on it which the kernel has seen, but refused
  to assimilate (a log for a ZFS pool which isn't ONLINE), then the
  light turns red to indicate the NV-CF-RAM feature is disabled, and it
  might not be safe to throw away the card. So, during a normal
  dirty-shutdown bootup, the light is:

  + powered off after clean shutdown: green
  + running normally, with DRAM regions registered: orange
    (here the card doesn't actually contain any data, but it can't be
    removed)
  + cord yanked: (?) for a few seconds during the DRAM dump
  + while the power is off after cord yanking: orange
    (this is actual orange: the card has a ZIL on it)
  + after the kernel boots up and attaches the microcontroller driver:
    red (for a few seconds)
  + after all the ZFS pools are probed, and the dirty pool attached:
    orange (the card has no ZIL on it, but it can't be removed because
    DRAM regions are registered with the microcontroller again)

* in the red condition, when the kernel tries to register a region of
  memory with the microcontroller, the microcontroller refuses, causing
  ZFS to proceed safely with no log (not blindly with a silently
  volatile log).

* there are multiple CF slots, so the microcontroller can be told to
  make mirrored dumps.

* some day the other slots can be used for other purposes. For example,
  systems where CPR is working could auto-hibernate when their cords are
  pulled.
Richard Elling wrote:
> Erik Trimble wrote:
>
>> * 5.25" CDROM-form-factor RAM disk, as above
>
> CD-ROMs are dead. With the size of slim DVDs today, you wouldn't
> be able to put much space in them.

The point here is a 5.25" half-height device that will fit in a drive bay
that is still *very* common in most cases (maybe not rackmount cases
though) made today. CDs being dead, and slim drives being available,
doesn't mean that these spaces aren't out there in cases that could be
used better. Look at the multitude of XXin1 card readers, the Sound
Blaster Audio front panel, or the Abit overclocking thingy, that all
mount in one of these drive bays. I can imagine how many DIMMs you could
probably stick in the volume of space available in one of these; it's
not a bad idea.

> The holy grail is fast, non-volatile main memory -- and we can forget
> all about "disks" :-)

Don't forget density. RAM seems denser than disks at first glance, but
once you add circuit boards, support chips, and space for cooling it
really seems the other way around. I don't think I could squeeze 1TB of
RAM (what are DIMMs up to these days, 4GB? 8GB? that's 128 DIMMs) into a
3.5"x1" volume like you can get a drive into. Then there's the cost! ;)

-Kyle
Ok, replying with the details of what I've found so far.

First of all, SSD devices, despite high published IOPS figures, hide very poor *write* IOPS figures. I've been sent the manual for the Mtron Pro 7000 series SSDs, and while they have random read figures in the 12,000 range, the random write figures are just 130.
http://www.diamondpoint.co.uk/manuals/storage/mtron/MSP-SATA7025.pdf

The exception to that rule appears to be the STEC Zeus IOPS. That little beauty can handle 18,000 IOPS writing (52,000 reading). Unfortunately the 18GB model costs £9,280.
http://www.stec-inc.com/downloads/flash_datasheets/iopsdatasheet.pdf

The Gigabyte iRAM looks great: £100 for the basic unit, around £200 fully populated with 4GB. That may be a tad low capacity, but for just the ZIL it should be plenty. However, the PCI model is out due to its 5v requirement, leaving just the 5.25" form factor device. That will probably work, and it's reasonably cheap, but I'm not overly happy about plugging a Y splitter into the motherboard power socket.
http://www.gigabyte.com.tw/Products/Storage/Products_Overview.aspx?ProductID=2678

There's also the HyperOs HyperDrive 4. Again a SATA memory device, this fits in a 5.25" bay and supports up to 8x 2GB DDR chips, giving 16GB of storage capable of 35,000 IOPS. It's a tad pricey though, around £1,700 for a populated 16GB model. On the plus side, it has the option to save to a laptop drive or compact flash card in the event of a power cut.
http://www.hyperossystems.co.uk/

However, all of these devices rely on the SATA interface, and while that's tried and tested, you're limited to around 120MB/s throughput. Fine if you're running over gigabit, but that could be a limitation over Infiniband. With that in mind, I've been looking at PCI-e based solutions. These are very thin on the ground, but there are a couple around if you hunt:

VMetro (previously Micro Memory) look very good. They have a 2GB PCI-e device, quoting throughput figures of 521MB/s and IOPS of 6.7 million. Unfortunately Vmetro don't seem to want to sell to the general public, so I'm not holding my breath on being able to get hold of one of these.
http://www.vmetro.com/category4304.html

The best option for performance would appear to be the Fusion ioDrive, being launched right now. It's a PCI-e SSD device, supporting 600MB/s writes, and a quoted 100,000 IOPS. Pricing should be around £1,200 for the 80GB model, which is a little pricey when compared to the iRAM, but for 20x the capacity it's pretty reasonable.
http://www.fusionio.com/products.aspx

The downside is that it appears the ioDrive is being launched with Linux drivers only. Windows drivers will apparently follow in 3 months, and I have no idea how long it may be for Solaris.

Summary
=======

With the ioDrive not a viable option yet, the iRAM appears to be the best in terms of price/performance. However, that 5v problem I mentioned earlier turns into a major headache. It means the 5.25" form factor is the only one possible in a modern server. For me, that in turn means I can't use the Supermicro 836TQ chassis with its 16 hot-swap SATA bays. Instead, the best alternative I've found is a Chenbro RM 313. That has six 5.25" bays, which means I can fit two of Supermicro's 8x 2.5" SATA trays in four of them, and iRAMs in the remaining two. That gives me a pair of i-RAMs for the ZIL plus my original 16 hot-swap drive bays. Unfortunately, my drives are now 2.5" devices instead of 3.5", and that has yet another knock-on effect on the price.
Ultimately I wind up with a server that's a full 50% more expensive (£2,700 instead of £1,800), has 2/3 of the capacity, and is limited to 120MB/s write throughput instead of nearer 840MB/s. When you put it like that, the i-RAM doesn't sound like such a bargain any more. For now I think I'll go without any type of NVRAM cache.

But there is light at the end of the tunnel. We'll be keeping an eye on the Fusion ioDrive since that looks very promising, and it also appears there's a good chance of Sun getting in on the game: Jonathan Schwartz is writing about this very stuff right now:
http://blogs.sun.com/jonathan/entry/not_a_flash_in_the

This message posted from opensolaris.org
On Mon, 30 Jun 2008, Ross wrote:

> The Gigabyte iRAM looks great: £100 for the basic unit, around £200
> fully populated with 4GB.

I am not sure that I would want to entrust my data to a product which
contains no error checking/correction at all and loses its memory after
a day if the computer is turned off. ZFS checksums and redundant cards
would be your only means of protection, and even these will fail if the
power goes away for a weekend.

The Gigabyte iRAM seems like a good idea, but its manufacturer should be
providing follow-up products by now to correct its shortcomings. Since
it has not, it seems like a fringe product with no future.

SDRAM (with ECC!) in a hard-drive SAS/SATA form factor with an included
removable compact-flash card to provide a back-up on power fail seems
like the way vendors should be headed for a transaction cache device.
The drive can use a rechargeable battery or super-capacitor to provide
enough juice so that it can save its state to CF on power fail.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Erik Trimble <Erik.Trimble <at> Sun.COM> writes:
>
> * Huge RAM drive in a 1U small case (a la Cisco 2500-series routers),
>   with SAS or FC attachment.

Almost what you want: http://www.superssd.com/products/ramsan-400/
128 GB RAM-based device, 3U chassis, FC and Infiniband connectivity.

However, as a commenter pointed out [1], you would basically be buying
RAM at ~20x its street price... Plus the density sucks and they could
strip down this device much more (remove the backup drives, etc.)

[1] http://storagemojo.com/2008/03/07/flash-talking-and-a-wee-dram-with-texas-memory-systems/

-marc
Regarding the error checking, as others suggested, you're best buying two devices and mirroring them. ZFS has great error checking, why not use it :D
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

And regarding the memory loss after the battery runs down, that's no different to any hardware RAID controller with battery-backed cache, which is exactly how this should be seen. ZFS clears the ZIL on a clean shutdown; the only time you need to worry about battery life is if you have a sudden power failure, and in that situation I'd much rather have my data being written to the iRAM than to disk. With the greater speed, there's a far greater chance of the system having had time to finish its writes, and a far better chance that I can power it on again and have ZFS recover all my data.

I do agree the iRAM looks like a fringe product, but to me it's a fringe product that works very well for ZFS if you can fit it in your chassis.

Btw, your wishlist is pretty much a word-for-word description of the high-end model of the 'hyperdrive'. It supports up to eight 2GB ECC DDR chips, it's got a 6-hour backup battery (with optional external power too), and it supports copying the data to a laptop or compact flash disk on power fail. The only downside for me is the price: around £1,700 to get your hands on a 16GB one:
http://www.hyperossystems.co.uk/

Ross

This message posted from opensolaris.org
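Mirroring a pair of such devices as the intent log is itself a one-liner. A sketch only, with placeholder device names standing in for the two RAM-disk/SSD units (and, as far as I know, log devices can't currently be removed from a pool once added, so it's worth trying this on a scratch pool first):

    # Attach the two NVRAM devices as a mirrored intent log
    zpool add tank log mirror c4t0d0 c4t1d0

    # Confirm the "logs" section of the pool shows the mirror
    zpool status tank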