I've been looking to build my own cheap SAN to explore HA scenarios with VMware hosts, though not for a production environment. I'm new to OpenSolaris, but I am familiar with other clustered HA systems. The features of ZFS seem like they would fit right in with building an HA storage platform for VMware hosts on inexpensive hardware.

Here is what I am thinking. I want to have at least two clustered nodes (possibly virtual, running off the local storage of the VMware hosts) that act as the front end of the SAN. These will not have any real storage themselves, but will be initiators for back-end computers with the actual disks in them. I want to be able to add and remove/replace storage at will, so I figure the back ends will just be fairly dumb iSCSI targets that present each disk individually. That way the front ends are close to the hardware, where ZFS works best, but a RAID set is not limited to the capacity of a single enclosure.

I'd like to present a RAIDZ2 array as a block device to VMware; how would that work? Could that then be clustered so the iSCSI target is HA? Am I completely off base, or is there an easier way? My goal is to be able to kill any one box (or several) and still keep the storage available to VMware, while still getting a better total-to-usable storage ratio than a plain mirror (2:1). I also want to be able to add and remove storage dynamically. You know, champagne on a beer budget. :)
On Mon, Aug 31, 2009 at 3:42 PM, Jason <wheelz311 at hotmail.com> wrote:

> I've been looking to build my own cheap SAN to explore HA scenarios with
> VMware hosts, though not for a production environment. [...]
>
> I'd like to present a RAIDZ2 array as a block device to VMware; how would
> that work? Could that then be clustered so the iSCSI target is HA? [...]

Any particular reason you want to present block storage to VMware? It works as well, if not better, over NFS, and saves a LOT of headaches.

--Tim
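As a concrete sketch of the NFS approach Tim describes: export a ZFS filesystem over NFS and mount it on the ESX hosts as an NFS datastore. The pool name, dataset name, and subnet below are hypothetical placeholders.

    # dataset for VM storage on an existing pool (pool name "tank" assumed)
    zfs create tank/vmware

    # share it read/write, with root access for the ESX VMkernel subnet
    # (root= is needed so ESX can write files owned by root over NFS)
    zfs set sharenfs='rw=@192.168.10.0/24,root=@192.168.10.0/24' tank/vmware

    # confirm the export is active
    share

On the VMware side you would then add an NFS datastore pointing at server:/tank/vmware from each host's storage configuration.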
Well, I knew a guy who was involved in a project to do just that for a production environment. Basically they abandoned the approach because there was a huge performance hit using ZFS over NFS. I didn't get the specifics, but his group is usually pretty sharp; I'll have to check back with him. So mainly I want to avoid that, but also VMware tends to roll out storage features on NFS last, after Fibre Channel and iSCSI.

*Sorry if this is a duplicate... still learning the workings of this discussion forum.*
On Mon, Aug 31, 2009 at 4:26 PM, Jason <wheelz311 at hotmail.com> wrote:

> Well, I knew a guy who was involved in a project to do just that for a
> production environment. Basically they abandoned the approach because there
> was a huge performance hit using ZFS over NFS. [...]

That's not true at all. Dynamic grow and shrink has been available on NFS forever. You STILL can't shrink VMFS, and they've JUST added grow capabilities. Not to mention NFS datastores being thin provisioned by default.

As for performance, I have a tough time believing his performance issues were because of NFS and not some other underlying bug. I've got MASSIVE deployments of VMware on NFS over 10g that achieve stellar performance (admittedly, it isn't on ZFS).

--Tim
Specifically, I remember Storage VMotion being supported on NFS last, as well as jumbo frames. That's just the impression I get from past features; perhaps they are doing better with that now. I know the performance problem had specifically to do with ZFS and the way it handled something. I know of lots of implementations with just straight NFS, so I know that works. I'm not opposed to NFS, but I was hoping what he saw was specific to the combination of ZFS over NFS; he said he didn't know if it would happen over iSCSI. So I thought I'd try that first. I'll have to see if I can get the details from him tomorrow.
On Aug 31, 2009, at 17:29, Tim Cook wrote:

> I've got MASSIVE deployments of VMware on NFS over 10g that achieve
> stellar performance (admittedly, it isn't on ZFS).

Without a separate ZIL device, NFS on ZFS would probably be slower; hence why Sun's own appliances use SSDs.
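For context, a minimal sketch of what a separate ZIL (slog) device looks like in practice. The pool and device names are hypothetical; on real hardware you would pick the SSDs from the output of 'format'.

    # add an SSD as a dedicated log device (slog) so synchronous NFS/iSCSI
    # writes are acknowledged from flash instead of spinning disk
    zpool add tank log c4t0d0

    # optionally add a second SSD as an L2ARC read cache
    zpool add tank cache c4t1d0

    # confirm the new layout
    zpool status tank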
On Mon, 2009-08-31 at 18:26 -0400, David Magda wrote:

> Without a separate ZIL device, NFS on ZFS would probably be slower;
> hence why Sun's own appliances use SSDs.

Hmm. On a related note: I'm looking to be using Sun's xVM on our Nehalem (x4170) machines, and was assuming I'd be best off using iSCSI targets exported from my ZFS-based disk machine.

Under xVM (Xen-based, or possibly VirtualBox, too), would I be better off having an iSCSI raw partition mounted on the xVM server, or using NFS? (Assume I would have SSD accelerators on the ZFS disk machine.) I'm looking at performance issues, not things like being able to grow the image under xVM (I'm hosting QA machines in xVM).

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
So, aside from the NFS debate, would this two-tier approach work? I am a bit fuzzy on how I would get the RAIDZ2 redundancy but still present the volume to the VMware host as a raw device. Is that possible, or is my understanding wrong? Also, could it be defined as a clustered resource?
Richard Elling wrote on 2009-Sep-01 18:53 UTC, re: [zfs-discuss] ZFS iSCSI Clustered for VMware Host use:
On Sep 1, 2009, at 11:45 AM, Jason wrote:

> So, aside from the NFS debate, would this two-tier approach work? I am
> a bit fuzzy on how I would get the RAIDZ2 redundancy but still
> present the volume to the VMware host as a raw device. Is that
> possible, or is my understanding wrong? Also, could it be defined as
> a clustered resource?

The easiest and proven method is to use shared disks, two heads, ZFS, and Open HA Cluster to provide highly available NFS or iSCSI targets. This is the fundamental architecture for most HA implementations. An implementation which does not use Open HA Cluster is available in appliance form as the Sun Storage 7310 or 7410 Cluster System. But if you are building your own, Open HA Cluster may be a better choice than rolling your own cluster software.
 -- richard
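To make the RAIDZ2-as-block-device part concrete, here is a minimal sketch of carving a zvol out of a raidz2 pool and exporting it over iSCSI with COMSTAR on OpenSolaris. Pool, volume, and disk names are placeholders, and the failover piece (Open HA Cluster, or whatever moves the target between heads) is not shown.

    # raidz2 pool across six disks (device names are placeholders)
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

    # a 500 GB zvol to hand to VMware as a raw block device
    zfs create -V 500g tank/vmfs01

    # enable the COMSTAR framework and iSCSI target services
    svcadm enable stmf
    svcadm enable -r svc:/network/iscsi/target:default

    # register the zvol as a logical unit, make it visible, create a target
    sbdadm create-lu /dev/zvol/rdsk/tank/vmfs01
    stmfadm add-view <GUID printed by sbdadm>
    itadm create-target

VMware's software iSCSI initiator then discovers the target, and the LUN can be formatted as a VMFS datastore or handed to a VM as an RDM.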
True, though an enclosure for shared disks is expensive. This isn't for production, but for me to explore what I can do with x86/x64 hardware, the idea being that I can just throw up another x86/x64 box to add more storage. Has anyone tried anything similar?
On Tue, Sep 1, 2009 at 2:17 PM, Jason <wheelz311 at hotmail.com> wrote:

> True, though an enclosure for shared disks is expensive. This isn't for
> production, but for me to explore what I can do with x86/x64 hardware. [...]

I still don't understand why you need this two-layer architecture. Just add a server to the mix and add the new storage to VMware. If you're doing iSCSI, you'll hit the LUN size limitations long before you need a second box.

--Tim
Richard Elling wrote on 2009-Sep-01 19:29 UTC:
On Sep 1, 2009, at 12:17 PM, Jason wrote:

> True, though an enclosure for shared disks is expensive. This isn't
> for production, but for me to explore what I can do with x86/x64
> hardware, the idea being that I can just throw up another x86/x64
> box to add more storage. Has anyone tried anything similar?

You mean something like this?

  disk ---- server ---+
                      +-- server --- network --- client
  disk ---- server ---+

I'm not sure how that can be less expensive in the TCO sense.
 -- richard
I guess I should come at it from the other side:

If you have one iSCSI target box and it goes down, you're dead in the water.

If you have two iSCSI target boxes that replicate and one dies, you are OK, but you then have a 2:1 total storage to usable ratio (excluding expensive shared disks).

If you have two tiers, where n cheap back-end iSCSI targets hold the physical disks and present them to two clustered virtual iSCSI target servers (assuming this can be done with disks over iSCSI), which in turn present iSCSI targets to the VMware hosts, then any one server could go down and everything would keep running. It would create a virtual clustered pair that is basically doing RAID over the network (iSCSI). Since you already have the VMware hosts, the two virtual front ends are "free". None of the back-end servers would need redundant components, because any one can fail, so you should be able to build them with inexpensive parts.

This would also allow you to add/replace storage easily (I hope). Perhaps you'd have to RAIDZ the back-end disks together and then present them to the front end, which would RAIDZ all the back ends together. For example, if you had 5 back-end boxes with 8 drives each, you'd have a 10:7 raw-to-usable ratio (40 drives raw, 28 usable). I'm sure the RAID combinations could be played with to get the balance of redundancy and capacity that you need. I don't know what kind of performance hit you would take doing that over iSCSI, but I thought it might work as long as you have gigabit speeds. Or I could be completely off my rocker. :) Am I?
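A hedged sketch of how the front-end half of this could look on OpenSolaris, assuming each back-end box already exports its raw disks as individual iSCSI LUNs (for example with COMSTAR, one logical unit per disk, as in the earlier sketch). All addresses and device names here are placeholders.

    # on the front-end node: point the iSCSI initiator at each back-end box
    iscsiadm add discovery-address 192.168.20.11:3260
    iscsiadm add discovery-address 192.168.20.12:3260
    iscsiadm modify discovery --sendtargets enable

    # create device nodes for the discovered LUNs
    devfsadm -i iscsi

    # build a raidz2 pool directly on the remote LUNs; the cNtNdN names
    # below stand in for whatever 'format' shows for the iSCSI disks
    zpool create bigtank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

The pool would then be carved into zvols or filesystems and re-exported to the VMware hosts from whichever front-end node currently owns it; making that ownership fail over cleanly is the part that needs cluster software.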
Scott Meilicke wrote on 2009-Sep-01 20:49 UTC:
You are completely off your rocker. :) No, just kidding. Assuming the virtual front-end servers are running on different hosts, and you are doing some sort of RAID, you should be fine. Performance may be poor due to the inexpensive targets on the back end, but you probably know that. A while back I thought of doing similar stuff using local storage on my ESX hosts and abstracting it with an OpenSolaris VM and iSCSI/NFS.

Perhaps consider inexpensive but decent NAS/SAN devices from Synology. They are not expensive, offer NFS and iSCSI, and you can also replicate/backup between two of them using rsync. Yes, you would be 'wasting' storage space by having two, but like I said, they are inexpensive. Then you would not have the two-layer architecture.

I just tested a two-disk model, using ESXi 3.5u4 and a Windows VM. I used IOMeter with a real-world test pattern, and IOs were about what you would expect from mirrored 7200 RPM SATA drives: 138 IOPS, about 1.1 Mbps. The internal CPU was around 20% and RAM usage was 128MB out of the 512MB on board, so it was disk limited. The Dell 2950 that I have 2009.06 installed on (16GB of RAM and an LSI HBA with an external SAS enclosure), with a single mirror using two 7200 RPM drives, gave me about 200 IOPS on the same test, presumably because of the large amount of RAM available to the ARC.

-Scott
Richard Elling wrote on 2009-Sep-01 21:21 UTC:
On Sep 1, 2009, at 1:28 PM, Jason wrote:

> If you have one iSCSI target box and it goes down, you're dead in the
> water.

Yep.

> If you have two iSCSI target boxes that replicate and one dies, you
> are OK, but you then have a 2:1 total storage to usable ratio
> (excluding expensive shared disks).

Servers cost more than storage, especially when you consider power.

> If you have two tiers, where n cheap back-end iSCSI targets hold the
> physical disks and present them to two clustered virtual iSCSI target
> servers, which in turn present iSCSI targets to the VMware hosts, then
> any one server could go down and everything would keep running. [...]
> None of the back-end servers would need redundant components, because
> any one can fail, so you should be able to build them with inexpensive
> parts.

This will certainly work. But it is, IMHO, too complicated to be effective at producing highly available services. Too many parts means too many opportunities for failure (yes, even VMware fails).

The problem with your approach is that you seem to be considering only failures of the type "it's broke, so it is completely dead." Those aren't the kind of failures that dominate real life. When we design highly available systems for the datacenter, we spend a lot of time on rapid recovery. We know things will break, so we try to build systems and processes that can recover as quickly as possible. This leads to the observation that reliability trumps redundancy: though we build fast-recovery systems, it is better to not need to recover at all. Hence we developed dependability benchmarks which expose the cost/dependability trade-offs. More reliable parts tend to cost more, but the best approach is to have fewer reliable parts rather than more unreliable parts.

> This would also allow you to add/replace storage easily (I hope). [...]
> I don't know what kind of performance hit you would take doing that
> over iSCSI, but I thought it might work as long as you have gigabit
> speeds. Or I could be completely off my rocker. :) Am I?

Don't worry about bandwidth. It is the latency that will kill performance. Adding more stuff between your CPU and the media means increasing latency.
 -- richard